Fast Pdf Processor
Pricing
$4.99/month + usage
This API is a PDF processing service. Users upload a PDF and can then:
- Extract Text: read all text from the PDF and return it as structured JSON data, one entry per page.
- Merge Pages: create a new PDF containing only the pages the user selects.
Developer: Andric (maintained by Community)
PDF Processor - Apify Actor Deployment Guide
Overview
The PDF Processor runs as an Apify Actor and provides four main operations:
- Extract Text - Extract text content from all PDF pages
- Merge Pages - Create new PDFs with selected pages only
- HTML to PDF - Convert HTML content to PDF using Playwright
- URL to PDF - Convert web pages to PDF using Playwright
File Structure
pdf-processor-actor/
├── main.py                       # Apify Actor wrapper (main entry point)
├── requirements.txt              # Dependencies for Apify deployment
├── requirements_apify.txt        # Alternative requirements file
├── Dockerfile                    # Docker configuration for Apify
├── actor.json                    # Apify Actor configuration
├── INPUT_SCHEMA.json             # Input schema definition
├── apify_input_schema.json       # Legacy input schema
├── apify_output_schema.json      # Output schema definition
├── sample_inputs.json            # Example inputs for testing
├── test_local.py                 # Local testing script
├── n8n_workflow_example.json     # n8n integration example
├── n8n_direct_api_workflow.json  # n8n direct API workflow
├── QUICK_START.md                # Quick start guide
├── apify.json                    # Apify configuration
├── actor/                        # Actor configuration directory
│   ├── actor.json
│   └── dataset_schema.json
└── README.md                     # This file
Deployment Steps
1. Prepare Your Repository
# Create a new directory for your actor
mkdir pdf-processor-actor
cd pdf-processor-actor

# Copy all the provided files
cp /path/to/main.py .
cp /path/to/app.py .
cp /path/to/requirements_apify.txt .
cp /path/to/Dockerfile .
cp /path/to/actor.json .
cp /path/to/apify_input_schema.json .
cp /path/to/apify_output_schema.json .
cp /path/to/sample_inputs.json .
2. Deploy to Apify
Option A: Using Apify CLI
# Install Apify CLI
npm install -g apify-cli

# Login to your Apify account
apify login

# Initialize the actor
apify init

# Push to Apify platform
apify push
Option B: Using GitHub Integration
- Push your code to a GitHub repository
- Go to Apify Console
- Click "Actors" → "Create new"
- Choose "From GitHub repository"
- Connect your GitHub repo
- Apify will automatically build and deploy
3. Configure the Actor
In Apify Console:
- Navigate to your actor
- Go to "Settings" tab
- Set the following:
- Build tag: latest
- Memory: 512 MB (minimum; increase for complex webpages or large PDFs)
- Timeout: 300 seconds (adjust based on PDF size and webpage complexity)
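The same settings can also be supplied per run instead of in the Console. The following is a minimal sketch, not part of the Actor itself: `build_run_options` is a hypothetical helper, and the keyword names `build`, `memory_mbytes`, and `timeout_secs` are assumed to match what the `apify-client` Python package accepts in `call()`.

```python
def build_run_options(memory_mbytes=512, timeout_secs=300, build="latest"):
    """Return keyword arguments mirroring the Console settings (hypothetical helper)."""
    if memory_mbytes < 512:
        # 512 MB is the minimum this guide recommends, not a platform limit.
        raise ValueError("this guide recommends at least 512 MB of memory")
    if timeout_secs <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    return {
        "build": build,
        "memory_mbytes": memory_mbytes,
        "timeout_secs": timeout_secs,
    }


# Example: give a run more memory for a large PDF.
options = build_run_options(memory_mbytes=1024)
print(options)

# Usage (requires apify-client and a real token):
# from apify_client import ApifyClient
# client = ApifyClient("YOUR_API_TOKEN")
# client.actor("YOUR_USERNAME/pdf-processor").call(run_input={...}, **options)
```

Keeping the options in one place makes it easier to raise memory or timeout for a single heavy run without changing the Actor's defaults.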
4. Test Your Actor
- Go to the "Input" tab
- Use one of the sample inputs:
Extract Text:
{
  "action": "extract-text",
  "pdfUrl": "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
}
Merge Pages:
{
  "action": "merge-pages",
  "pdfUrl": "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
  "pageNumbers": [0, 2, 4]
}
HTML to PDF:
{
  "action": "html-to-pdf",
  "html": "<html><body><h1>Hello World</h1><p>This is a test PDF.</p></body></html>"
}
URL to PDF:
{
  "action": "url-to-pdf",
  "pdfUrl": "https://example.com"
}
- Click "Run"
- Check the output in the "Dataset" tab
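Before clicking "Run", it can help to check an input against the fields each action uses. The sketch below is an illustration only: the `REQUIRED_FIELDS` table is inferred from the sample inputs in this guide rather than taken from the Actor's input schema, and `missing_fields` is a hypothetical helper.

```python
# Fields per action, inferred from the sample inputs (an assumption,
# not the Actor's actual INPUT_SCHEMA.json).
REQUIRED_FIELDS = {
    "extract-text": ["pdfUrl"],
    "merge-pages": ["pdfUrl", "pageNumbers"],
    "html-to-pdf": ["html"],
    "url-to-pdf": ["pdfUrl"],
}


def missing_fields(run_input):
    """Return the names of required fields that are absent from run_input."""
    action = run_input.get("action")
    if action not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action: {action!r}")
    return [name for name in REQUIRED_FIELDS[action] if name not in run_input]


sample = {"action": "merge-pages", "pdfUrl": "https://example.com/document.pdf"}
print(missing_fields(sample))
```

An empty list means the input has every field the chosen action appears to need.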
Usage Examples
Via Apify API
from apify_client import ApifyClient

client = ApifyClient('YOUR_API_TOKEN')
actor = client.actor('YOUR_USERNAME/pdf-processor')

# Extract text
run = actor.call(run_input={
    "action": "extract-text",
    "pdfUrl": "https://example.com/document.pdf"
})

# HTML to PDF
run = actor.call(run_input={
    "action": "html-to-pdf",
    "html": "<html><body><h1>Invoice</h1><p>Amount: $100</p></body></html>"
})

# URL to PDF
run = actor.call(run_input={
    "action": "url-to-pdf",
    "pdfUrl": "https://example.com"
})

# Get results of the most recent run (each call above overwrites `run`)
dataset = client.dataset(run['defaultDatasetId'])
results = list(dataset.iterate_items())
Via REST API
# Extract text
curl -X POST https://api.apify.com/v2/acts/YOUR_USERNAME~pdf-processor/runs \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -d '{"action": "extract-text", "pdfUrl": "https://example.com/document.pdf"}'

# HTML to PDF
curl -X POST https://api.apify.com/v2/acts/YOUR_USERNAME~pdf-processor/runs \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -d '{"action": "html-to-pdf", "html": "<html><body><h1>Invoice</h1></body></html>"}'

# URL to PDF
curl -X POST https://api.apify.com/v2/acts/YOUR_USERNAME~pdf-processor/runs \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -d '{"action": "url-to-pdf", "pdfUrl": "https://example.com"}'
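The same REST call can be assembled in Python with only the standard library. This is a sketch under stated assumptions: `build_run_request` is a hypothetical helper, and the optional `timeout` and `memory` query parameters are assumed to be accepted by the run endpoint as run options. The request is built but not sent, so no token is needed to try it.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id, token, run_input, **query):
    """Build (but do not send) a POST request that starts an Actor run."""
    url = f"{API_BASE}/acts/{actor_id}/runs"
    if query:
        # Optional run options go in the query string (assumed names).
        url += "?" + urllib.parse.urlencode(query)
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_run_request(
    "YOUR_USERNAME~pdf-processor",
    "YOUR_API_TOKEN",
    {"action": "extract-text", "pdfUrl": "https://example.com/document.pdf"},
    timeout=300,
    memory=512,
)
print(request.get_method(), request.full_url)

# To actually start the run (needs a valid token):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```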
Monitoring
- Check logs in the "Runs" tab for debugging
- Monitor performance in the "Analytics" tab
- Set up webhooks for run completion notifications
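For run completion notifications without configuring anything in the Console, Apify supports ad-hoc webhooks passed when a run is started. The sketch below shows one way to prepare that value, assuming the API expects a Base64-encoded JSON array of objects with `eventTypes` and `requestUrl` in a `webhooks` query parameter; `encode_adhoc_webhooks` and the receiver URL are hypothetical.

```python
import base64
import json


def encode_adhoc_webhooks(
    request_url,
    event_types=("ACTOR.RUN.SUCCEEDED", "ACTOR.RUN.FAILED"),
):
    """Return a Base64 string describing one ad-hoc webhook (hypothetical helper)."""
    webhooks = [{"eventTypes": list(event_types), "requestUrl": request_url}]
    raw = json.dumps(webhooks).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


# https://example.com/apify-hook stands in for your own receiver endpoint.
encoded = encode_adhoc_webhooks("https://example.com/apify-hook")
print(encoded)
```

The resulting string would be appended to the run URL as `?webhooks=<encoded>`; check the current Apify webhook documentation before relying on the exact parameter format.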
Cost Estimation
- Compute Units:
  - Text extraction: ~0.001 CU per page
  - Page merging: ~0.002 CU per page
  - HTML/URL to PDF: ~0.005-0.02 CU (depends on complexity and load time)
- Storage: Minimal for text, ~1 MB per 100 pages for generated PDFs
- Bandwidth: Depends on PDF/webpage size (input + output)
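The figures above can be turned into a rough per-run estimate. This is a back-of-the-envelope sketch using only the approximate numbers listed in this section; `estimate_compute_units` and `estimate_storage_mb` are hypothetical helpers, and actual platform billing may differ.

```python
# Approximate rates copied from the list above.
CU_PER_PAGE = {"extract-text": 0.001, "merge-pages": 0.002}
CU_PER_RENDER = (0.005, 0.02)  # low/high range for HTML/URL to PDF


def estimate_compute_units(action, pages=1):
    """Return a (low, high) compute-unit estimate for one run."""
    if action in CU_PER_PAGE:
        estimate = CU_PER_PAGE[action] * pages
        return (estimate, estimate)
    if action in ("html-to-pdf", "url-to-pdf"):
        return CU_PER_RENDER
    raise ValueError(f"unknown action: {action!r}")


def estimate_storage_mb(pages):
    """Generated PDFs: roughly 1 MB per 100 pages."""
    return pages / 100


low, high = estimate_compute_units("extract-text", pages=250)
print(f"extract-text, 250 pages: about {low:.3f} CU")
print(f"merged 250-page PDF: about {estimate_storage_mb(250):.1f} MB")
```

For example, extracting text from a 250-page PDF comes to about 0.001 × 250 = 0.25 CU under these rates.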
Limitations
- Maximum PDF size: 100 MB (configurable)
- Maximum pages to process: 1000 (configurable)
- Timeout: 5 minutes default (configurable)
- HTML/URL to PDF: Requires Playwright/Chrome (included in Docker image)
- Complex JavaScript sites may need additional wait time
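A caller can check a document against the default limits before starting a run. The sketch below is illustrative: `check_limits` is a hypothetical helper, the limits are the defaults listed above (both are configurable), and 100 MB is assumed to mean 100 × 1024 × 1024 bytes.

```python
# Defaults from the list above; both are configurable in the Actor.
MAX_PDF_BYTES = 100 * 1024 * 1024  # assumes 1 MB = 1024 * 1024 bytes
MAX_PAGES = 1000


def check_limits(size_bytes, page_count):
    """Return a list of human-readable problems; empty means within limits."""
    problems = []
    if size_bytes > MAX_PDF_BYTES:
        problems.append(f"PDF is {size_bytes} bytes, limit is {MAX_PDF_BYTES}")
    if page_count > MAX_PAGES:
        problems.append(f"PDF has {page_count} pages, limit is {MAX_PAGES}")
    return problems


print(check_limits(size_bytes=5 * 1024 * 1024, page_count=12))
print(check_limits(size_bytes=MAX_PDF_BYTES + 1, page_count=1001))
```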
Support
For issues or questions:
- Check the actor logs for error details
- Verify that the PDF URL is publicly accessible
- Ensure page numbers are within valid range
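The page-number check can be done locally before a merge-pages run. This is a minimal sketch that assumes page numbers are zero-based, as the sample input `[0, 2, 4]` suggests; `invalid_page_numbers` is a hypothetical helper, and the page count must come from your own knowledge of the PDF.

```python
def invalid_page_numbers(page_numbers, page_count):
    """Return requested page indices outside 0..page_count-1 (zero-based assumed)."""
    return [
        number
        for number in page_numbers
        if not isinstance(number, int) or not 0 <= number < page_count
    ]


# A 3-page PDF has valid indices 0, 1 and 2 under the zero-based assumption.
print(invalid_page_numbers([0, 2, 4], page_count=3))
```

An empty result means every requested page exists; anything else lists the indices to fix before running the Actor.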
License
MIT