Document Parser Cloud API


Are you looking for an evaluation version of a product?

If so, you can download any of the versions below for testing. The product functions normally apart from an evaluation limitation. At the time of purchase, we send a license file by email that allows the product to work at full capacity. If you would also like an evaluation license to test without any restrictions for 30 days, please follow the directions provided here.

Are you having trouble downloading?

If you experience errors when you try to download a file, make sure your network policies (enforced by your company or ISP) allow downloading ZIP and/or MSI files.

Installation
The package is available on PyPI and can be installed via pip by executing the following command:
pip install groupdocs-parser-cloud

Requirements

Python 2.7 or 3.4+
pip package manager
GroupDocs Cloud credentials — Client ID and Client Secret from the Dashboard

Dependencies

The SDK automatically installs the following packages:

Package	Constraint
urllib3	>= 1.15
six	>= 1.10
certifi	—
python-dateutil	—
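To confirm that an existing environment satisfies the two version constraints above, a check along the following lines may help. It is a minimal sketch that assumes Python 3.8 or later (for importlib.metadata). The helper names version_tuple, meets_minimum, and check_dependencies are illustrative and are not part of the SDK.

```python
import re
from importlib import metadata

# Minimum versions taken from the dependency table above.
MINIMUMS = {"urllib3": (1, 15), "six": (1, 10)}


def version_tuple(text):
    """Turn '1.26.18' into (1, 26, 18), stopping at the first non-numeric part."""
    numbers = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def meets_minimum(installed, minimum):
    """Return True when the installed version string is at least `minimum`."""
    return version_tuple(installed) >= minimum


def check_dependencies():
    """Map each constrained package to True, False, or None (not installed)."""
    report = {}
    for name, minimum in MINIMUMS.items():
        try:
            report[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report
```

Calling check_dependencies() returns a dictionary with one entry per constrained package, which can be printed or asserted on in a setup script.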

Document Parsing & Data Extraction Python Cloud REST API

GroupDocs.Parser Cloud SDK for Python empowers developers to integrate advanced document parsing and data extraction into Python web apps, scripts, and automation workflows. Extract text, images, metadata, and structured data from over 70 file formats — including Word, Excel, PDF, presentations, emails, archives, and eBooks. Define custom extraction templates to pull text fields, numbers, and tables from invoices, forms, and business documents. Whether parsing a single file or processing container items from ZIP archives, PST/OST mail stores, or PDF portfolios, GroupDocs.Parser delivers accurate, scalable tools for cloud-based document intelligence.

Text Extraction

Extract plain text - Extract text content from documents in a simple form.
Extract formatted text - Extract text while preserving original formatting.
Extract text by page range - Extract text from specific pages only.
Extract text from containers - Extract text from documents inside ZIP archives, PST/OST files, and PDF portfolios.

Image Extraction

Extract all images - Extract every embedded image from a whole document.
Extract images by page range - Extract images from specific pages based on a page range.
Extract images from containers - Extract images from documents inside container files.

Template-Based Parsing

Parse by template - Parse documents using user-defined templates for structured data extraction.
Create or update templates - Define and store extraction templates in cloud storage.
Get and delete templates - Retrieve or remove templates stored in user storage.
Parse by template object - Pass a template definition directly in the API request.

Document Information

Get document information - Retrieve file extension, size in bytes, and page count.
Get container items information - List items within ZIP archives, PDF portfolios, and mail stores.
Get supported file formats - Retrieve the full list of supported parsing formats.

File Operations

Upload Files to Cloud - Upload files to cloud storage via the API.
Download Files from Cloud - Download files from cloud storage to local systems.
Copy Files - Copy files within the cloud storage to different locations.
Move Files - Move files between folders in cloud storage.
Delete Files - Delete specific files from cloud storage.

Folder Operations

Create Folder - Create new folders in the cloud storage.
Copy Folder - Duplicate folders within the cloud storage.
Move Folder - Move folders between directories in cloud storage.
Delete Folder - Remove entire folders from cloud storage.

Licensing and Authentication

Evaluation Mode - Try the API with a free trial account.
Secure Authentication - Use Client ID and Client Secret for secure API access.
MIT License - The Python SDK is licensed under the MIT License.

Supported Document Formats

GroupDocs.Parser Cloud supports 70+ file formats with text extraction, image extraction, and template-based parsing capabilities:

Word Processing: DOC, DOCX, DOCM, DOT, DOTX, DOTM, TXT, RTF, ODT, OTT
PDF: PDF
Markup: HTML, XHTML, MHTML, MD, XML
eBooks: CHM, EPUB, FB2
Spreadsheets: XLS, XLT, XLSX, XLSM, XLSB, XLTX, XLTM, ODS, OTS, CSV, XLA, XLAM, NUMBERS
Presentations: PPT, PPS, POT, PPTX, PPTM, POTX, POTM, PPSX, PPSM, ODP, OTP
Emails: PST, OST, EML, EMLX, MSG
Notes: ONE (Microsoft OneNote)
Archives: ZIP

Supported operations vary by format. For the complete format matrix, see the documentation.
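For a quick local pre-check before uploading a file, the format families listed above can be turned into a lookup table. This is only a sketch: the names FORMATS_BY_CATEGORY and category_for are illustrative, the table mirrors this page rather than the service, and the list returned by the API remains the authoritative source.

```python
import os

# Format families as listed on this page, keyed by category name.
FORMATS_BY_CATEGORY = {
    "Word Processing": ["DOC", "DOCX", "DOCM", "DOT", "DOTX", "DOTM", "TXT", "RTF", "ODT", "OTT"],
    "PDF": ["PDF"],
    "Markup": ["HTML", "XHTML", "MHTML", "MD", "XML"],
    "eBooks": ["CHM", "EPUB", "FB2"],
    "Spreadsheets": ["XLS", "XLT", "XLSX", "XLSM", "XLSB", "XLTX", "XLTM",
                     "ODS", "OTS", "CSV", "XLA", "XLAM", "NUMBERS"],
    "Presentations": ["PPT", "PPS", "POT", "PPTX", "PPTM", "POTX", "POTM",
                      "PPSX", "PPSM", "ODP", "OTP"],
    "Emails": ["PST", "OST", "EML", "EMLX", "MSG"],
    "Notes": ["ONE"],
    "Archives": ["ZIP"],
}

# Reverse index: extension -> category.
EXTENSION_TO_CATEGORY = {
    extension: category
    for category, extensions in FORMATS_BY_CATEGORY.items()
    for extension in extensions
}


def category_for(path):
    """Return the format family for a file path, or None if it is not listed."""
    extension = os.path.splitext(path)[1].lstrip(".").upper()
    return EXTENSION_TO_CATEGORY.get(extension)
```

A result of None means the extension does not appear in the list on this page; it does not say which operations a listed format supports.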

Quick Start

Get your API credentials

To use GroupDocs.Parser Cloud, sign up at GroupDocs.Cloud Dashboard and get your Client ID and Client Secret.

Initialize the API

Use the following code to start using the GroupDocs.Parser Cloud SDK for Python:

import groupdocs_parser_cloud

# Get your ClientId and ClientSecret at https://dashboard.groupdocs.cloud
client_id = "YourClientId"
client_secret = "YourClientSecret"

# Create API configuration
configuration = groupdocs_parser_cloud.Configuration(client_id, client_secret)
configuration.api_base_url = "https://api.groupdocs.cloud"

# Create instance of the Parse API
parse_api = groupdocs_parser_cloud.ParseApi.from_config(configuration)
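Hard-coded credentials are convenient for a first test, but they are easy to commit by accident. One possible alternative is to read them from environment variables, as sketched below. The variable names GROUPDOCS_CLIENT_ID and GROUPDOCS_CLIENT_SECRET and the helper load_credentials are assumptions made for this example; the SDK does not define or read them itself.

```python
import os

# Hypothetical variable names chosen for this sketch.
ID_VARIABLE = "GROUPDOCS_CLIENT_ID"
SECRET_VARIABLE = "GROUPDOCS_CLIENT_SECRET"


def load_credentials(env=None):
    """Return (client_id, client_secret) from a mapping, os.environ by default."""
    env = os.environ if env is None else env
    client_id = env.get(ID_VARIABLE)
    client_secret = env.get(SECRET_VARIABLE)

    # Report every missing variable at once rather than failing on the first.
    missing = [
        name
        for name, value in ((ID_VARIABLE, client_id), (SECRET_VARIABLE, client_secret))
        if not value
    ]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return client_id, client_secret
```

The returned pair can then be passed to groupdocs_parser_cloud.Configuration in place of the literal strings used above.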

Extract text from a document

Once initialized, use this basic example to extract text from a document in cloud storage:

import groupdocs_parser_cloud

client_id = "YourClientId"
client_secret = "YourClientSecret"

parse_api = groupdocs_parser_cloud.ParseApi.from_keys(client_id, client_secret)

options = groupdocs_parser_cloud.TextOptions()
options.file_info = groupdocs_parser_cloud.FileInfo()
options.file_info.file_path = "email/eml/embedded-image-and-attachment.eml"

request = groupdocs_parser_cloud.TextRequest(options)
result = parse_api.text(request)

print("Text: " + result.text)
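The same request can probably be limited to a page range, as mentioned under Text Extraction. The sketch below wraps the option setup in a helper. The attribute names start_page_number and count_pages_to_extract are assumptions based on the REST API's option names and should be checked against the SDK reference; build_text_options is a hypothetical helper, and the SDK module is passed in as an argument only so that the function can be tried without network access.

```python
def build_text_options(sdk, file_path, start_page=None, page_count=None):
    """Build TextOptions for `file_path`, optionally limited to a page range.

    `sdk` is the imported groupdocs_parser_cloud module (or a stand-in that
    exposes TextOptions and FileInfo).
    """
    options = sdk.TextOptions()
    options.file_info = sdk.FileInfo()
    options.file_info.file_path = file_path

    # Assumed attribute names; verify them against the SDK before relying on them.
    if start_page is not None:
        options.start_page_number = start_page
    if page_count is not None:
        options.count_pages_to_extract = page_count
    return options
```

With the real SDK, the result would be passed to groupdocs_parser_cloud.TextRequest exactly as in the example above, for instance build_text_options(groupdocs_parser_cloud, "email/eml/embedded-image-and-attachment.eml", start_page=1, page_count=2).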

With this quick start guide, you’re all set to begin parsing documents using GroupDocs.Parser Cloud in your Python applications. For more details, visit the documentation.

Get Supported File Formats

Retrieve the full list of supported file formats available through the Parser API.

import groupdocs_parser_cloud

info_api = groupdocs_parser_cloud.InfoApi.from_keys("YourClientId", "YourClientSecret")

result = info_api.get_supported_file_formats()

for fmt in result.formats:
    print(fmt.file_format)

Parse Document by Template

Parse a document using a user-defined template stored in cloud storage to extract structured fields and tables.

import groupdocs_parser_cloud

parse_api = groupdocs_parser_cloud.ParseApi.from_keys("YourClientId", "YourClientSecret")

options = groupdocs_parser_cloud.ParseOptions()
options.file_info = groupdocs_parser_cloud.FileInfo()
options.file_info.file_path = "words-processing/docx/companies.docx"
options.template_path = "templates/companies.json"

request = groupdocs_parser_cloud.ParseRequest(options)
result = parse_api.parse(request)

for data in result.fields_data:
    if data.page_area.page_text_area is not None:
        print("Field name: " + data.name + ". Text: " + data.page_area.page_text_area.text)

    if data.page_area.page_table_area is not None:
        print("Table name: " + data.name)
        for cell in data.page_area.page_table_area.page_table_area_cells:
            print("Row " + str(cell.row_index) + " column " + str(cell.column_index) + ": " + cell.page_area.page_text_area.text)
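Printing is enough for a demonstration, but downstream code usually wants plain Python structures. The following sketch collects text fields into a dictionary and table cells into rows, using only the attribute names that appear in the example above. The helpers fields_to_dict and table_to_rows are illustrative, and the sample objects are stand-ins built with SimpleNamespace so the code runs offline; they are not real API output.

```python
from types import SimpleNamespace


def fields_to_dict(fields_data):
    """Collect text fields into {name: text}; table fields are skipped."""
    values = {}
    for data in fields_data:
        text_area = data.page_area.page_text_area
        if text_area is not None:
            values[data.name] = text_area.text
    return values


def table_to_rows(table_area):
    """Arrange table cells into rows, ordered by row_index then column_index."""
    by_row = {}
    for cell in table_area.page_table_area_cells:
        text = cell.page_area.page_text_area.text
        by_row.setdefault(cell.row_index, {})[cell.column_index] = text
    return [
        [columns[index] for index in sorted(columns)]
        for _, columns in sorted(by_row.items())
    ]


def _text(value):
    # Stand-in for a page area that holds only text.
    return SimpleNamespace(page_text_area=SimpleNamespace(text=value), page_table_area=None)


# Made-up sample data shaped like the response in the example above.
sample_table = SimpleNamespace(page_table_area_cells=[
    SimpleNamespace(row_index=1, column_index=0, page_area=_text("Second row")),
    SimpleNamespace(row_index=0, column_index=1, page_area=_text("B")),
    SimpleNamespace(row_index=0, column_index=0, page_area=_text("A")),
])
sample_fields = [
    SimpleNamespace(name="Company", page_area=_text("Example Ltd")),
    SimpleNamespace(name="Addresses",
                    page_area=SimpleNamespace(page_text_area=None, page_table_area=sample_table)),
]

print(fields_to_dict(sample_fields))
print(table_to_rows(sample_table))
```

Sorting by index rather than allocating a fixed grid keeps the helper independent of whether the service numbers rows and columns from zero or one.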

Extract Images from a Document

Extract all embedded images from a document and retrieve their cloud storage paths and download URLs.

import groupdocs_parser_cloud

parse_api = groupdocs_parser_cloud.ParseApi.from_keys("YourClientId", "YourClientSecret")

options = groupdocs_parser_cloud.ImagesOptions()
options.file_info = groupdocs_parser_cloud.FileInfo()
options.file_info.file_path = "slides/three-slides.pptx"

request = groupdocs_parser_cloud.ImagesRequest(options)
result = parse_api.images(request)

for image in result.images:
    print("Image path: " + image.path + ". Download url: " + image.download_url)
    print("Format: " + image.file_format + ". Page index: " + str(image.page_index))
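When a document contains many images, grouping them by page can make the result easier to work with. The sketch below relies only on the path and page_index attributes shown in the example above. The helper images_by_page is illustrative, and the sample entries are made-up stand-ins rather than real API output.

```python
from collections import defaultdict
from types import SimpleNamespace


def images_by_page(images):
    """Group image storage paths by the page index they were found on."""
    pages = defaultdict(list)
    for image in images:
        pages[image.page_index].append(image.path)
    return dict(pages)


# Made-up stand-ins shaped like the items in result.images.
sample_images = [
    SimpleNamespace(path="images/image_0.jpeg", page_index=0),
    SimpleNamespace(path="images/image_1.png", page_index=0),
    SimpleNamespace(path="images/image_2.png", page_index=2),
]

for page, paths in sorted(images_by_page(sample_images).items()):
    print("Page " + str(page) + ": " + ", ".join(paths))
```

With the real SDK, result.images from the example above would be passed in place of sample_images.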

Get Document Information

Retrieve metadata about a document, such as its file extension, size in bytes, and page count.

import groupdocs_parser_cloud

info_api = groupdocs_parser_cloud.InfoApi.from_keys("YourClientId", "YourClientSecret")

options = groupdocs_parser_cloud.InfoOptions()
options.file_info = groupdocs_parser_cloud.FileInfo()
options.file_info.file_path = "words-processing/docx/password-protected.docx"
options.file_info.password = "password"

request = groupdocs_parser_cloud.GetInfoRequest(options)
result = info_api.get_info(request)

print("Page count: " + str(result.page_count))

Get Container Items Information

List items within container files such as ZIP archives or mail stores.

import groupdocs_parser_cloud

info_api = groupdocs_parser_cloud.InfoApi.from_keys("YourClientId", "YourClientSecret")

options = groupdocs_parser_cloud.ContainerOptions()
options.file_info = groupdocs_parser_cloud.FileInfo()
options.file_info.file_path = "containers/archive/zip.zip"

request = groupdocs_parser_cloud.ContainerRequest(options)
result = info_api.container(request)

for item in result.container_items:
    print("Name: " + item.name + ". FilePath: " + item.file_path)
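A container listing is often the first step toward processing only some of its items. The sketch below filters the listing by file extension using the name attribute from the example above. The helper items_with_extension is illustrative, and the sample items are made-up stand-ins rather than real API output.

```python
import os
from types import SimpleNamespace


def items_with_extension(container_items, extension):
    """Return the container items whose name ends with the given extension."""
    wanted = "." + extension.lower().lstrip(".")
    return [
        item
        for item in container_items
        if os.path.splitext(item.name)[1].lower() == wanted
    ]


# Made-up stand-ins shaped like the items in result.container_items.
sample_items = [
    SimpleNamespace(name="invoice.pdf", file_path="invoice.pdf"),
    SimpleNamespace(name="notes.TXT", file_path="notes.TXT"),
    SimpleNamespace(name="scan.pdf", file_path="scans/scan.pdf"),
]

for item in items_with_extension(sample_items, "pdf"):
    print("Name: " + item.name + ". FilePath: " + item.file_path)
```

The comparison is case-insensitive and accepts the extension with or without a leading dot.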

Sample Projects on GitHub

The GroupDocs.Parser Cloud Python Samples repository includes ready-to-run examples covering:

Category	Examples
Info Operations	Supported file formats, document information, container items information
Parse Operations — Extract Text	Extract text from whole document, formatted text, text by page range, text from container
Parse Operations — Extract Images	Extract images from whole document, images by page range, images from container
Parse Operations — Parse by Template	Parse by template in user storage, template defined as object, parse document inside container
Template Operations	Create or update template, get template, delete template

How to run the examples

Clone or download the samples repository
Edit RunExamples.py and set your app_sid and app_key (your Client ID and Client Secret)
Go to the Examples directory
Run pip install groupdocs-parser-cloud -U
Execute python RunExamples.py

For more details, visit Getting Started.