Fast Pdf Processor

Pricing

$4.99/month + usage

This API is a PDF processing service that lets users submit a PDF to: Extract Text: read all text from the PDF and return it as structured JSON, page by page. Merge Pages: create a new PDF containing only the pages the user selects.

Developer

Andric

Maintained by Community

PDF Processor - Apify Actor Deployment Guide

Overview

This PDF Processor runs as an Apify Actor and provides four main operations:

  1. Extract Text - Extract text content from all PDF pages
  2. Merge Pages - Create new PDFs with selected pages only
  3. HTML to PDF - Convert HTML content to PDF using Playwright
  4. URL to PDF - Convert web pages to PDF using Playwright

File Structure

pdf-processor-actor/
├── main.py # Apify Actor wrapper (main entry point)
├── requirements.txt # Dependencies for Apify deployment
├── requirements_apify.txt # Alternative requirements file
├── Dockerfile # Docker configuration for Apify
├── actor.json # Apify Actor configuration
├── INPUT_SCHEMA.json # Input schema definition
├── apify_input_schema.json # Legacy input schema
├── apify_output_schema.json # Output schema definition
├── sample_inputs.json # Example inputs for testing
├── test_local.py # Local testing script
├── n8n_workflow_example.json # n8n integration example
├── n8n_direct_api_workflow.json # n8n direct API workflow
├── QUICK_START.md # Quick start guide
├── apify.json # Apify configuration
├── actor/ # Actor configuration directory
│ ├── actor.json
│ └── dataset_schema.json
└── README.md # This file

Deployment Steps

1. Prepare Your Repository

# Create a new directory for your actor
mkdir pdf-processor-actor
cd pdf-processor-actor
# Copy all the provided files
cp /path/to/main.py .
cp /path/to/app.py .
cp /path/to/requirements_apify.txt .
cp /path/to/Dockerfile .
cp /path/to/actor.json .
cp /path/to/apify_input_schema.json .
cp /path/to/apify_output_schema.json .
cp /path/to/sample_inputs.json .

2. Deploy to Apify

Option A: Using Apify CLI

# Install Apify CLI
npm install -g apify-cli
# Login to your Apify account
apify login
# Initialize the actor
apify init
# Push to Apify platform
apify push

Option B: Using GitHub Integration

  1. Push your code to a GitHub repository
  2. Go to Apify Console
  3. Click "Actors" → "Create new"
  4. Choose "From GitHub repository"
  5. Connect your GitHub repo
  6. Apify will automatically build and deploy

3. Configure the Actor

In Apify Console:

  1. Navigate to your actor
  2. Go to "Settings" tab
  3. Set the following:
    • Build tag: latest
    • Memory: 512 MB (minimum, increase for complex webpages or large PDFs)
    • Timeout: 300 seconds (adjust based on PDF size and webpage complexity)
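
The same settings can also be overridden for a single run. The following is a minimal sketch, assuming the Apify API accepts `memory` (MB), `timeout` (seconds), and `build` as query parameters on the run endpoint; `build_run_url` is a hypothetical helper, not part of this actor.

```python
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"


def build_run_url(actor_id: str, memory_mb: int = 512,
                  timeout_secs: int = 300, build: str = "latest") -> str:
    """Return the run-start URL with per-run overrides as query parameters.

    Defaults mirror the values suggested in the settings list above.
    """
    query = urlencode({"memory": memory_mb, "timeout": timeout_secs, "build": build})
    return f"{API_BASE}/acts/{actor_id}/runs?{query}"


# Example: give a heavy URL-to-PDF run more memory than the default.
print(build_run_url("YOUR_USERNAME~pdf-processor", memory_mb=1024))
```

Raising memory for a single run this way leaves the default in the "Settings" tab unchanged.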

4. Test Your Actor

  1. Go to the "Input" tab
  2. Use one of the sample inputs:

Extract Text:

{
  "action": "extract-text",
  "pdfUrl": "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
}

Merge Pages (every entry in pageNumbers must exist in the source PDF):

{
  "action": "merge-pages",
  "pdfUrl": "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
  "pageNumbers": [0, 2, 4]
}

HTML to PDF:

{
  "action": "html-to-pdf",
  "html": "<html><body><h1>Hello World</h1><p>This is a test PDF.</p></body></html>"
}

URL to PDF:

{
  "action": "url-to-pdf",
  "pdfUrl": "https://example.com"
}

  3. Click "Run"
  4. Check the output in the "Dataset" tab
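
Before starting a run, it can help to check an input locally. The sketch below infers the required fields for each action from the sample inputs above; the actor's real schema (INPUT_SCHEMA.json) is authoritative, and `validate_input` is a hypothetical helper.

```python
# Required fields per action, inferred from the sample inputs (an assumption).
REQUIRED_FIELDS = {
    "extract-text": ("pdfUrl",),
    "merge-pages": ("pdfUrl", "pageNumbers"),
    "html-to-pdf": ("html",),
    "url-to-pdf": ("pdfUrl",),
}


def validate_input(payload: dict) -> None:
    """Raise ValueError if the action is unknown or a required field is missing."""
    action = payload.get("action")
    if action not in REQUIRED_FIELDS:
        raise ValueError(f"Unknown action: {action!r}")
    missing = [field for field in REQUIRED_FIELDS[action] if field not in payload]
    if missing:
        raise ValueError(f"{action} is missing: {', '.join(missing)}")


# Passes silently because both required fields are present.
validate_input({
    "action": "merge-pages",
    "pdfUrl": "https://example.com/document.pdf",
    "pageNumbers": [0, 2, 4],
})
```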

Usage Examples

Via Apify API

from apify_client import ApifyClient

# Authenticate and select the deployed actor
client = ApifyClient('YOUR_API_TOKEN')
actor = client.actor('YOUR_USERNAME/pdf-processor')

# Extract text
run = actor.call(run_input={
    "action": "extract-text",
    "pdfUrl": "https://example.com/document.pdf"
})

# HTML to PDF
run = actor.call(run_input={
    "action": "html-to-pdf",
    "html": "<html><body><h1>Invoice</h1><p>Amount: $100</p></body></html>"
})

# URL to PDF
run = actor.call(run_input={
    "action": "url-to-pdf",
    "pdfUrl": "https://example.com"
})

# Get results of the most recent run (each call above overwrites `run`)
dataset = client.dataset(run['defaultDatasetId'])
results = list(dataset.iterate_items())

Via REST API

# Extract text
curl -X POST https://api.apify.com/v2/acts/YOUR_USERNAME~pdf-processor/runs \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -d '{
    "action": "extract-text",
    "pdfUrl": "https://example.com/document.pdf"
  }'

# HTML to PDF
curl -X POST https://api.apify.com/v2/acts/YOUR_USERNAME~pdf-processor/runs \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -d '{
    "action": "html-to-pdf",
    "html": "<html><body><h1>Invoice</h1></body></html>"
  }'

# URL to PDF
curl -X POST https://api.apify.com/v2/acts/YOUR_USERNAME~pdf-processor/runs \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -d '{
    "action": "url-to-pdf",
    "pdfUrl": "https://example.com"
  }'
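
For callers who prefer not to install a client library, the curl requests above can be reproduced with the Python standard library. This is a sketch only: `build_run_request` is a hypothetical helper, and the network call is left commented out so the example runs without a token.

```python
import json
import urllib.request


def build_run_request(actor_id: str, token: str, run_input: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request that starts an actor run."""
    return urllib.request.Request(
        f"https://api.apify.com/v2/acts/{actor_id}/runs",
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_run_request(
    "YOUR_USERNAME~pdf-processor",
    "YOUR_API_TOKEN",
    {"action": "extract-text", "pdfUrl": "https://example.com/document.pdf"},
)

# To actually start the run, send the request (requires a valid token):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```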

Monitoring

  • Check logs in the "Runs" tab for debugging
  • Monitor performance in the "Analytics" tab
  • Set up webhooks for run completion notifications
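
Webhooks can also be attached to a single run instead of being configured in the Console. The sketch below assumes Apify's ad-hoc webhook format, a base64-encoded JSON array passed in the `webhooks` query parameter of the run request; verify this against the Apify webhook documentation before relying on it. `encode_webhooks` and the receiver URL are hypothetical.

```python
import base64
import json


def encode_webhooks(request_url: str,
                    event_types=("ACTOR.RUN.SUCCEEDED", "ACTOR.RUN.FAILED")) -> str:
    """Return a base64 string describing one webhook for the given event types."""
    webhooks = [{"eventTypes": list(event_types), "requestUrl": request_url}]
    return base64.b64encode(json.dumps(webhooks).encode("utf-8")).decode("ascii")


# Hypothetical receiver URL; append the result as ?webhooks=... to the run URL.
encoded = encode_webhooks("https://example.com/apify-hook")
print(encoded)
```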

Cost Estimation

  • Compute Units:
    • Text extraction: ~0.001 CU per page
    • Page merging: ~0.002 CU per page
    • HTML/URL to PDF: ~0.005-0.02 CU (depends on complexity and load time)
  • Storage: Minimal for text, ~1 MB per 100 pages for generated PDFs
  • Bandwidth: Depends on PDF/webpage size (input + output)
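
The compute-unit figures above can be turned into a rough budget. This sketch only multiplies the approximate rates from the list; actual usage depends on memory and run time, and `estimate_cu` is a hypothetical helper.

```python
from typing import Tuple

# Approximate rates copied from the list above, in compute units (CU).
CU_PER_PAGE = {"extract-text": 0.001, "merge-pages": 0.002}
CU_PER_CONVERSION = (0.005, 0.02)  # low/high for html-to-pdf and url-to-pdf


def estimate_cu(action: str, units: int) -> Tuple[float, float]:
    """Return a (low, high) CU estimate; units means pages or conversions."""
    if action in CU_PER_PAGE:
        cost = CU_PER_PAGE[action] * units
        return (cost, cost)
    if action in ("html-to-pdf", "url-to-pdf"):
        low, high = CU_PER_CONVERSION
        return (low * units, high * units)
    raise ValueError(f"Unknown action: {action!r}")


# Rough estimate for extracting text from a 200-page PDF.
low, high = estimate_cu("extract-text", 200)
print(f"{low:.3f} to {high:.3f} CU")
```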

Limitations

  • Maximum PDF size: 100 MB (configurable)
  • Maximum pages to process: 1000 (configurable)
  • Timeout: 5 minutes default (configurable)
  • HTML/URL to PDF: Requires Playwright/Chrome (included in Docker image)
  • Complex JavaScript sites may need additional wait time
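
A caller can check the size and page limits before submitting a job. The sketch below uses the default values listed above and assumes 1 MB means 1024 * 1024 bytes; `check_limits` is a hypothetical helper, and a deployment with different configured limits would need different constants.

```python
MAX_PDF_BYTES = 100 * 1024 * 1024  # 100 MB default; assumes binary megabytes
MAX_PAGES = 1000                   # default page limit


def check_limits(size_bytes: int, page_count: int) -> list:
    """Return a list of human-readable problems; empty means within limits."""
    problems = []
    if size_bytes > MAX_PDF_BYTES:
        problems.append(f"PDF is {size_bytes} bytes, above the {MAX_PDF_BYTES}-byte limit")
    if page_count > MAX_PAGES:
        problems.append(f"PDF has {page_count} pages, above the {MAX_PAGES}-page limit")
    return problems


# A 5 MB, 120-page document is within both defaults, so this prints [].
print(check_limits(5 * 1024 * 1024, 120))
```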

Support

For issues or questions:

  1. Check the actor logs for error details
  2. Verify PDF URL is publicly accessible
  3. Ensure page numbers are within valid range
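
The third check can be automated once the page count of the source PDF is known. This sketch assumes page numbers are zero-based, as the `[0, 2, 4]` sample input suggests; `find_invalid_pages` is a hypothetical helper.

```python
def find_invalid_pages(page_numbers, page_count: int) -> list:
    """Return the requested pages that fall outside 0..page_count-1."""
    return [
        number for number in page_numbers
        if not (isinstance(number, int) and 0 <= number < page_count)
    ]


# For a 3-page PDF, index 4 does not exist.
print(find_invalid_pages([0, 2, 4], page_count=3))
```

An empty result means every requested page exists in the document.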

License

MIT