PDF Scraper - Extract Text from PDF URLs

This Apify actor scrapes text content from PDF files via URLs using OCR technology. No file upload is required - simply provide a PDF URL and get structured text data with a page-by-page breakdown. Pricing is usage-based and cheaper than upload-based alternatives.

Features

  • URL-based scraping - no file upload required (unique advantage)
  • Cost-effective - cheaper than alternatives requiring file uploads
  • OCR-based text scraping from PDF URLs
  • Automatic retry logic for failed requests
  • Page-by-page text breakdown with metadata
  • Error handling with detailed error messages
  • Processing time tracking for performance monitoring

Input

The scraper accepts the following input parameter:

  • pdfUrl (required): The URL of the PDF file to scrape (no file upload needed!)

Example Input

{
  "pdfUrl": "https://example.com/document.pdf"
}

Output

The scraper returns the pages array as a single dataset item:

Success Response

[
  {
    "pageNumber": 1,
    "text": "Extracted text content from page 1..."
  },
  {
    "pageNumber": 2,
    "text": "Extracted text content from page 2..."
  }
]
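
If you need the document as one string, the pages can simply be concatenated. A minimal sketch, assuming pages holds the array above:

// Sketch: join the per-page text into a single string.
const fullText = pages.map((page) => page.text).join('\n\n');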

Error Response

If an error occurs, a single output item is returned:

{
  "pageNumber": 0,
  "text": "Error: Text extraction failed: Invalid PDF format"
}
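
Because errors use the pageNumber: 0 convention, consumers can branch on that field (the Usage example below also distinguishes the two shapes via Array.isArray). A minimal sketch, assuming result holds the first dataset item:

// Sketch: detect the error shape by its pageNumber: 0 convention.
if (!Array.isArray(result) && result.pageNumber === 0) {
  console.error(result.text);
}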

Usage

Via Apify API

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const input = {
  pdfUrl: 'https://example.com/document.pdf',
};

// Run the scraper and wait for it to finish
const run = await client.actor('joeextract/pdf-scraper').call(input);

// Fetch the results
const { items } = await client.dataset(run.defaultDatasetId).listItems();

// Process the pages array
const pages = items[0]; // First (and only) result
if (Array.isArray(pages)) {
  pages.forEach((page) => {
    console.log(`Page ${page.pageNumber}:`, page.text);
  });
} else {
  console.error('Error:', pages.text);
}
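
If you prefer plain HTTP over the client library, the same run should also work through Apify's standard run-sync-get-dataset-items endpoint (with ~ replacing / in the actor ID). A sketch using fetch:

// Sketch: call the actor over plain HTTP via Apify's run-sync endpoint.
const response = await fetch(
  'https://api.apify.com/v2/acts/joeextract~pdf-scraper/run-sync-get-dataset-items?token=YOUR_API_TOKEN',
  {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ pdfUrl: 'https://example.com/document.pdf' }),
  }
);
const items = await response.json();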

Via Apify Console

  1. Go to your scraper in the Apify Console
  2. Enter the input JSON with your PDF URL
  3. Click "Start"
  4. Wait for the run to complete
  5. View results in the dataset - the pages array will be a single item

Why Choose This Scraper?

  • No file uploads: Simply provide a PDF URL - no need to upload files
  • Cost-effective: Cheaper than alternatives that require file uploads
  • Simple integration: Just pass a URL, get structured text data
  • Reliable: Built-in retry logic and error handling
  • Clean output: Direct access to the pages array

Error Handling

The scraper handles various error scenarios:

  • Invalid URLs: Returns validation error
  • Network timeouts: Retries with exponential backoff (a client-side version of this pattern is sketched after this list)
  • OCR service errors: Provides detailed error messages
  • Invalid PDF files: Returns extraction error with details
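
The retry logic runs inside the actor, but the same backoff pattern is easy to reproduce on the client side. A minimal sketch; all names here are illustrative:

// Illustrative only: retry an async operation with exponential backoff.
async function withRetry(fn, { attempts = 3, baseDelayMs = 1000 } = {}) {
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === attempts - 1) throw err;
      // Delay doubles each attempt: 1s, 2s, 4s, ...
      const delayMs = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// e.g. await withRetry(() => fetch(pdfUrl));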

Performance

  • Memory usage: Optimized for large PDFs
  • Concurrent processing: Supports multiple simultaneous runs (see the example after this list)
  • Retry mechanism: Automatic retry with exponential backoff
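
For example, several PDFs can be processed in parallel by starting one run per URL. A sketch reusing the client from the Usage section:

// Sketch: run the actor for several PDFs in parallel.
const urls = [
  'https://example.com/a.pdf',
  'https://example.com/b.pdf',
];
const runs = await Promise.all(
  urls.map((pdfUrl) => client.actor('joeextract/pdf-scraper').call({ pdfUrl }))
);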

Limitations

  • PDF files must be accessible via HTTP/HTTPS URLs (a reachability pre-check is sketched after this list)
  • Supported languages: English (default)
  • Processing time depends on PDF complexity and size
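
Given the first limitation, a quick reachability pre-check before starting a run can save a wasted call. A minimal sketch with a placeholder URL:

// Sketch: verify the PDF URL responds before launching the actor.
const head = await fetch('https://example.com/document.pdf', { method: 'HEAD' });
if (!head.ok) {
  throw new Error(`PDF URL not reachable: HTTP ${head.status}`);
}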