PDF Scraper - Extract Text from PDF URLs

This Apify actor scrapes text content from PDF files via URLs using OCR technology. No file upload is required - simply provide a PDF URL and get structured text data with a page-by-page breakdown. Pricing is usage-based and cheaper than upload-based alternatives.

Features

  • URL-based scraping - no file upload required (unique advantage)
  • Cost-effective - cheaper than alternatives requiring file uploads
  • OCR-based text scraping from PDF URLs
  • Automatic retry logic for failed requests
  • Page-by-page text breakdown with metadata
  • Error handling with detailed error messages
  • Processing time tracking for performance monitoring

Input

The scraper accepts the following input parameter:

  • pdfUrl (required): The URL of the PDF file to scrape (no file upload needed!)

Example Input

{
  "pdfUrl": "https://example.com/document.pdf"
}

Output

The scraper returns the pages array as a single dataset item:

Success Response

[
  {
    "pageNumber": 1,
    "text": "Extracted text content from page 1..."
  },
  {
    "pageNumber": 2,
    "text": "Extracted text content from page 2..."
  }
]
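
If you need the document as one string, the pages can simply be concatenated. A minimal sketch, assuming pages holds the array above:

// Sketch: join the per-page text into a single string.
const fullText = pages.map((page) => page.text).join('\n\n');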

Error Response

If an error occurs, a single output item is returned:

{
  "pageNumber": 0,
  "text": "Error: Text extraction failed: Invalid PDF format"
}
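
Because errors use the pageNumber: 0 convention, consumers can branch on that field (the Usage example below also distinguishes the two shapes via Array.isArray). A minimal sketch, assuming result holds the first dataset item:

// Sketch: detect the error shape by its pageNumber: 0 convention.
if (!Array.isArray(result) && result.pageNumber === 0) {
  console.error(result.text);
}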

Usage

Via Apify API

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const input = {
  pdfUrl: 'https://example.com/document.pdf',
};

// Run the scraper and wait for it to finish
const run = await client.actor('joeextract/pdf-scraper').call(input);

// Fetch the results
const { items } = await client.dataset(run.defaultDatasetId).listItems();

// Process the pages array
const pages = items[0]; // First (and only) result
if (Array.isArray(pages)) {
  pages.forEach((page) => {
    console.log(`Page ${page.pageNumber}:`, page.text);
  });
} else {
  console.error('Error:', pages.text);
}
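
If you prefer plain HTTP over the client library, the same run should also work through Apify's standard run-sync-get-dataset-items endpoint (with ~ replacing / in the actor ID). A sketch using fetch:

// Sketch: call the actor over plain HTTP via Apify's run-sync endpoint.
const response = await fetch(
  'https://api.apify.com/v2/acts/joeextract~pdf-scraper/run-sync-get-dataset-items?token=YOUR_API_TOKEN',
  {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ pdfUrl: 'https://example.com/document.pdf' }),
  }
);
const items = await response.json();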

Via Apify Console

  1. Go to your scraper in the Apify Console
  2. Enter the input JSON with your PDF URL
  3. Click "Start"
  4. Wait for the run to complete
  5. View results in the dataset - the pages array will be a single item

Why Choose This Scraper?

  • No file uploads: Simply provide a PDF URL - no need to upload files
  • Cost-effective: Cheaper than alternatives that require file uploads
  • Simple integration: Just pass a URL, get structured text data
  • Reliable: Built-in retry logic and error handling
  • Clean output: Direct access to the pages array

Error Handling

The scraper handles various error scenarios:

  • Invalid URLs: Returns validation error
  • Network timeouts: Retries with exponential backoff (a client-side version of this pattern is sketched after this list)
  • OCR service errors: Provides detailed error messages
  • Invalid PDF files: Returns extraction error with details
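
The retry logic runs inside the actor, but the same backoff pattern is easy to reproduce on the client side. A minimal sketch; all names here are illustrative:

// Illustrative only: retry an async operation with exponential backoff.
async function withRetry(fn, { attempts = 3, baseDelayMs = 1000 } = {}) {
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === attempts - 1) throw err;
      // Delay doubles each attempt: 1s, 2s, 4s, ...
      const delayMs = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// e.g. await withRetry(() => fetch(pdfUrl));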

Performance

  • Memory usage: Optimized for large PDFs
  • Concurrent processing: Supports multiple simultaneous runs (see the example after this list)
  • Retry mechanism: Automatic retry with exponential backoff
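
For example, several PDFs can be processed in parallel by starting one run per URL. A sketch reusing the client from the Usage section:

// Sketch: run the actor for several PDFs in parallel.
const urls = [
  'https://example.com/a.pdf',
  'https://example.com/b.pdf',
];
const runs = await Promise.all(
  urls.map((pdfUrl) => client.actor('joeextract/pdf-scraper').call({ pdfUrl }))
);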

Limitations

  • PDF files must be accessible via HTTP/HTTPS URLs (a reachability pre-check is sketched after this list)
  • Supported languages: English (default)
  • Processing time depends on PDF complexity and size
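
Given the first limitation, a quick reachability pre-check before starting a run can save a wasted call. A minimal sketch with a placeholder URL:

// Sketch: verify the PDF URL responds before launching the actor.
const head = await fetch('https://example.com/document.pdf', { method: 'HEAD' });
if (!head.ok) {
  throw new Error(`PDF URL not reachable: HTTP ${head.status}`);
}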