Universal Document Format Transformer

Pricing

from $5.00 / 1,000 results

Universal Document Format Transformer: a cloud-based Apify Actor that converts documents (DOCX, PPTX, HTML, TXT) into Markdown, JSON, CSV, HTML, TXT or PDF using Pandoc. It exposes a simple REST API for automations (n8n, Zapier, Make), returns clear error messages, and includes security controls. PDF is supported as an output format only.

Developer

fanio zilla

Maintained by Community

Apify Actor for cloud-based document format conversion using Pandoc. Convert documents from one format to another via a simple API without local installations.

Features

  • 🔄 Convert between multiple document formats
  • ☁️ Cloud-based - no local installation needed
  • 🔗 Works with URLs from S3, Google Drive, OneDrive, etc.
  • ⚡ Fast processing with 60-second timeout protection
  • 📊 Structured JSON output with download URLs
  • 🔄 Automatic retry for network failures
  • 💡 Clear error messages with actionable suggestions

Supported Formats

Input Formats (From)

  • DOCX - Microsoft Word documents
  • PPTX - Microsoft PowerPoint presentations
  • HTML - Web pages and HTML documents
  • TXT - Plain text files

Output Formats (To)

  • Markdown - Lightweight markup format
  • JSON - Structured data format
  • CSV - Comma-separated values
  • HTML - Web format
  • TXT - Plain text
  • PDF - Portable Document Format

Important Notes

  • ❌ PDF cannot be used as input format - This is a Pandoc limitation
  • ✅ PDF is supported as output format only
  • 📄 Format compatibility varies - see table below

Format Compatibility Matrix

Input \ Output | Markdown | JSON | CSV | HTML | TXT | PDF
DOCX | ✅ | ✅ | ✅ | ✅ | ✅ | ✅
PPTX | ✅ | ✅ | ✅ | ✅ | ✅ | ✅
HTML | ✅ | ⚠️ | ⚠️ | ✅ | ✅ | ✅
TXT | ✅ | ⚠️ | ⚠️ | ✅ | ✅ | ✅

Legend:

  • ✅ Fully Supported - Good conversion quality
  • ⚠️ Limited Support - May lose formatting/structure

Input

The actor accepts the following input parameters:

{
  "fileUrl": "https://example.com/document.docx",
  "fromFormat": "docx",
  "toFormat": "markdown"
}

Parameters

  • fileUrl (string, required): Public URL to the document (S3, Google Drive, OneDrive, etc.)

    • Must be HTTP or HTTPS protocol
    • File must be publicly accessible
    • Maximum file size: 50MB
  • fromFormat (string, required): Source document format

    • Options: docx, pptx, html, txt
    • ❌ PDF is NOT supported as input
  • toFormat (string, required): Target document format

    • Options: markdown, json, csv, html, txt, pdf
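
The parameter rules above can be checked on the client before a run is started. The sketch below is an assumption about how such a check might look, not the actor's own validation code. validateInput is a hypothetical helper, and the "Unsupported ..." messages are invented for illustration. The file size limit is not checked because it requires downloading the file.

```javascript
// Allowed values, taken from the Parameters list.
const INPUT_FORMATS = ['docx', 'pptx', 'html', 'txt'];
const OUTPUT_FORMATS = ['markdown', 'json', 'csv', 'html', 'txt', 'pdf'];

// Hypothetical helper: returns a list of problems, empty when the input looks valid.
function validateInput(input) {
  const { fileUrl, fromFormat, toFormat } = input ?? {};
  const errors = [];

  let protocol = null;
  try {
    protocol = new URL(fileUrl).protocol;
  } catch {
    // Not parseable as an absolute URL; reported below.
  }
  if (protocol !== 'http:' && protocol !== 'https:') {
    errors.push('Invalid URL format');
  }

  if (fromFormat === 'pdf') {
    errors.push('PDF cannot be used as input format');
  } else if (!INPUT_FORMATS.includes(fromFormat)) {
    errors.push(`Unsupported fromFormat: ${fromFormat}`); // illustrative message
  }

  if (!OUTPUT_FORMATS.includes(toFormat)) {
    errors.push(`Unsupported toFormat: ${toFormat}`); // illustrative message
  }
  return errors;
}
```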

Output

The actor returns the following result:

{
  "downloadUrl": "https://api.apify.com/v2/...",
  "inputFormat": "docx",
  "outputFormat": "markdown",
  "fileSize": 12345,
  "processingTime": 2.5,
  "status": "success"
}

Output Fields

  • downloadUrl: URL to download the converted file (valid for 7 days)
  • inputFormat: The format of the input file
  • outputFormat: The format of the output file
  • fileSize: Size of the converted file in bytes
  • processingTime: Time taken for conversion in seconds
  • status: Either "success" or "error"
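
A consumer of the result will usually branch on status before using downloadUrl. The following sketch shows one way to do that. summarizeResult is a hypothetical helper, and the shape of an error result beyond the status field is not documented here, so the code reads only status in that case.

```javascript
// Hypothetical helper: turns a result object into a one-line summary.
function summarizeResult(result) {
  if (result.status !== 'success') {
    // Only `status` is assumed to exist on a failed result.
    return `Conversion failed (status: ${result.status})`;
  }
  const sizeKb = (result.fileSize / 1024).toFixed(1); // fileSize is in bytes
  return `${result.inputFormat} -> ${result.outputFormat}: ${sizeKb} KB in ` +
    `${result.processingTime}s, ${result.downloadUrl}`;
}
```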

Usage Examples

Example 1: Convert DOCX to Markdown

{
  "fileUrl": "https://example.com/report.docx",
  "fromFormat": "docx",
  "toFormat": "markdown"
}

Example 2: Convert PPTX to PDF

{
  "fileUrl": "https://example.com/presentation.pptx",
  "fromFormat": "pptx",
  "toFormat": "pdf"
}

Example 3: Convert HTML to TXT

{
  "fileUrl": "https://example.com/page.html",
  "fromFormat": "html",
  "toFormat": "txt"
}

Example 4: Convert TXT to JSON

{
  "fileUrl": "https://example.com/data.txt",
  "fromFormat": "txt",
  "toFormat": "json"
}

API Usage

You can run this actor via the Apify API:

curl -X POST "https://api.apify.com/v2/acts/WgRQY2Ta2VKQE5NgO/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "fileUrl": "https://example.com/document.docx",
    "fromFormat": "docx",
    "toFormat": "markdown"
  }'
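
The same request can be sent from Node.js. This is a minimal sketch that mirrors the curl call above, assuming Node 18 or later for the built-in fetch. buildRunRequest and startRun are hypothetical helper names. The call only starts a run, so fetching the converted file is a separate step.

```javascript
// Actor ID taken from the curl example.
const ACTOR_ID = 'WgRQY2Ta2VKQE5NgO';

// Hypothetical helper: builds the URL and fetch options without sending anything.
function buildRunRequest(token, input) {
  const url = new URL(`https://api.apify.com/v2/acts/${ACTOR_ID}/runs`);
  url.searchParams.set('token', token);
  return {
    url: url.toString(),
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(input),
    },
  };
}

// Hypothetical helper: starts a run and returns the parsed JSON response.
async function startRun(token, input) {
  const { url, options } = buildRunRequest(token, input);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`Apify API responded with HTTP ${response.status}`);
  }
  return response.json();
}
```

Usage would look like startRun(process.env.APIFY_TOKEN, { fileUrl: 'https://example.com/document.docx', fromFormat: 'docx', toFormat: 'markdown' }), where APIFY_TOKEN is an assumed environment variable name.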

Error Handling

The actor reports each failure with an error message and a suggested fix:

Common Errors & Solutions

Invalid URL

Error: Invalid URL format
Solution: Check that your fileUrl starts with http:// or https://

Unsupported Input Format

Error: PDF cannot be used as input format
Solution: Use docx, pptx, html, or txt as input. PDF is output-only.

File Not Found (404)

Error: Download failed: File not found
Solution: Verify the URL is correct and the file exists

Access Denied (403)

Error: Download failed: Access denied
Solution: Use a public URL or one with proper access permissions

Timeout

Error: Pandoc conversion timed out
Solution: Try a smaller file or a simpler document format
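
When pre-checking a file URL yourself, the two download errors above map directly onto HTTP status codes. The sketch below encodes that mapping. explainDownloadFailure is a hypothetical helper, and the fallback message for other status codes is an assumption rather than something the actor is documented to return.

```javascript
// Messages and solutions copied from the "Common Errors & Solutions" list.
const DOWNLOAD_ERRORS = {
  404: {
    error: 'Download failed: File not found',
    solution: 'Verify the URL is correct and the file exists',
  },
  403: {
    error: 'Download failed: Access denied',
    solution: 'Use a public URL or one with proper access permissions',
  },
};

// Hypothetical helper: maps an HTTP status code to an error and a suggested fix.
function explainDownloadFailure(status) {
  if (Object.hasOwn(DOWNLOAD_ERRORS, status)) return DOWNLOAD_ERRORS[status];
  // Assumed fallback wording for statuses not covered above.
  return {
    error: `Download failed: HTTP ${status}`,
    solution: 'Check URL accessibility in a browser',
  };
}
```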

Development

Local Testing

# Install dependencies
npm install
# Run local tests (requires Pandoc for full testing)
node test-local.js
# Test with Docker (recommended for full testing)
docker build -t universal-document-format-transformer .
docker run universal-document-format-transformer

Environment Variables

  • MAX_FILE_SIZE: Maximum file size in bytes (default: 52428800 = 50MB)
  • DOWNLOAD_TIMEOUT: Download timeout in milliseconds (default: 30000 = 30s)
  • MAX_DOWNLOAD_RETRIES: Number of download retry attempts (default: 3)
  • PANDOC_TIMEOUT: Pandoc conversion timeout in milliseconds (default: 55000 = 55s)
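
As an illustration of how these variables and their defaults fit together, the sketch below reads them into a config object. It is an assumption about one reasonable way to do this and not the actor's actual source. loadConfig and readPositiveInt are hypothetical names.

```javascript
// Hypothetical helper: parses a positive integer, falling back to the default.
function readPositiveInt(env, name, fallback) {
  const value = Number.parseInt(env[name] ?? '', 10);
  return Number.isFinite(value) && value > 0 ? value : fallback;
}

// Hypothetical helper: defaults are the ones listed under Environment Variables.
function loadConfig(env = process.env) {
  return {
    maxFileSize: readPositiveInt(env, 'MAX_FILE_SIZE', 52428800), // bytes (50MB)
    downloadTimeoutMs: readPositiveInt(env, 'DOWNLOAD_TIMEOUT', 30000), // 30s
    maxDownloadRetries: readPositiveInt(env, 'MAX_DOWNLOAD_RETRIES', 3),
    pandocTimeoutMs: readPositiveInt(env, 'PANDOC_TIMEOUT', 55000), // 55s
  };
}
```

Invalid or missing values fall back to the documented default instead of failing, which keeps a misconfigured run predictable.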

Conversion Quality Notes

  • DOCX/PPTX to Markdown: Excellent conversion, preserves most formatting
  • PDF Output: High quality, but requires a Pandoc installation with a PDF engine available
  • HTML to Structured Formats: May lose CSS styling and complex layouts
  • TXT to JSON/CSV: Limited support, best for simple structured text
  • Large Files: May time out - consider splitting the document or using smaller files

Troubleshooting

Actor fails with timeout

  • Reduce the file size
  • Use a simpler input format (TXT is fastest)
  • Avoid complex conversions (e.g., PPTX to CSV)

Download fails repeatedly

  • Check URL accessibility in browser
  • Verify file is publicly accessible
  • Ensure URL uses HTTP/HTTPS protocol

Conversion produces empty output

  • Verify input file is not corrupted
  • Check fromFormat matches actual file type
  • Try different toFormat option

Support

For issues, questions, or feature requests:

  1. Check the troubleshooting section above
  2. Review the error messages for specific guidance
  3. Test with the provided examples first
  4. Contact support with your input parameters and error details