PDF Scraper - Extract Text from PDF URLs
This Apify actor scrapes text content from PDF files via URLs using OCR technology. No file upload is required - simply provide a PDF URL and get structured text data with a page-by-page breakdown. Pricing is usage-based and cost-effective.
Features
- URL-based scraping - no file upload required (unique advantage)
- Cost-effective - cheaper than alternatives requiring file uploads
- OCR-based text scraping from PDF URLs
- Automatic retry logic for failed requests
- Page-by-page text breakdown with metadata
- Error handling with detailed error messages
- Processing time tracking for performance monitoring
Input
The scraper accepts the following input parameter:
pdfUrl (required): The URL of the PDF file to scrape - no file upload needed.
Example Input
{"pdfUrl": "https://example.com/document.pdf"}
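If you build the input programmatically, a quick local check on `pdfUrl` lets you fail fast before starting a paid run. A minimal sketch - the `isValidPdfUrl` helper is illustrative, not part of the actor:

```javascript
// Basic client-side validation for the pdfUrl input field.
// The actor validates on its side too; this just catches mistakes early.
function isValidPdfUrl(value) {
  let url;
  try {
    url = new URL(value);
  } catch {
    return false; // not a parseable URL at all
  }
  // The actor only fetches PDFs over HTTP/HTTPS (see Limitations).
  return url.protocol === 'http:' || url.protocol === 'https:';
}

console.log(isValidPdfUrl('https://example.com/document.pdf')); // true
console.log(isValidPdfUrl('ftp://example.com/document.pdf'));   // false
```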
Output
The scraper outputs a pages array, stored as a single dataset item:
Success Response
[
  {
    "pageNumber": 1,
    "text": "Extracted text content from page 1..."
  },
  {
    "pageNumber": 2,
    "text": "Extracted text content from page 2..."
  }
]
Error Response
If an error occurs, a single output item is returned:
{
  "pageNumber": 0,
  "text": "Error: Text extraction failed: Invalid PDF format"
}
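Because the error case uses the same shape as a page object, downstream code can branch on whether the dataset item is an array (success) or a single object with pageNumber 0 (error). A small hedged sketch - `interpretResult` is an illustrative helper, not part of the actor:

```javascript
// firstItem is the single dataset item produced by a run:
// an array of page objects on success, or one error object with pageNumber 0.
function interpretResult(firstItem) {
  if (Array.isArray(firstItem)) {
    return { ok: true, pages: firstItem };
  }
  return { ok: false, message: firstItem.text };
}

const success = interpretResult([
  { pageNumber: 1, text: 'Extracted text content from page 1...' },
]);
const failure = interpretResult({
  pageNumber: 0,
  text: 'Error: Text extraction failed: Invalid PDF format',
});
console.log(success.ok);      // true
console.log(failure.message); // "Error: Text extraction failed: Invalid PDF format"
```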
Usage
Via Apify API
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const input = {
    pdfUrl: 'https://example.com/document.pdf',
};

// Run the scraper and wait for it to finish
const run = await client.actor('joeextract/pdf-scraper').call(input);

// Fetch the results
const { items } = await client.dataset(run.defaultDatasetId).listItems();

// Process the pages array
const pages = items[0]; // First (and only) result
if (Array.isArray(pages)) {
    pages.forEach((page) => {
        console.log(`Page ${page.pageNumber}:`, page.text);
    });
} else {
    console.error('Error:', pages.text);
}
Via Apify Console
- Go to your scraper in the Apify Console
- Enter the input JSON with your PDF URL
- Click "Start"
- Wait for the run to complete
- View results in the dataset - the pages array is stored as a single dataset item
Why Choose This Scraper?
- No file uploads: Simply provide a PDF URL - no need to upload files
- Cost-effective: Cheaper than alternatives that require file uploads
- Simple integration: Just pass a URL, get structured text data
- Reliable: Built-in retry logic and error handling
- Clean output: Direct access to the pages array
Error Handling
The scraper handles various error scenarios:
- Invalid URLs: Returns validation error
- Network timeouts: Retries with exponential backoff
- OCR service errors: Provides detailed error messages
- Invalid PDF files: Returns extraction error with details
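The retry-with-exponential-backoff behavior described above can be sketched generically. The actor's actual attempt counts and delays are not documented, so the numbers below are assumptions:

```javascript
// Generic retry helper with exponential backoff, similar in spirit to the
// scraper's handling of network timeouts. Attempt count and base delay
// are illustrative defaults, not the actor's documented values.
async function withRetry(fn, { attempts = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt < attempts; attempt += 1) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Wait 500 ms, 1000 ms, 2000 ms, ... before the next attempt.
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```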
Performance
- Memory usage: Optimized for large PDFs
- Concurrent processing: Supports multiple simultaneous runs
- Retry mechanism: Automatic retry with exponential backoff
Limitations
- PDF files must be accessible via HTTP/HTTPS URLs
- Supported languages: English (default)
- Processing time depends on PDF complexity and size