Pricing

$14.00/month + usage

Pdf Scraper

A high-performance Apify Actor that inspects, classifies, and extracts structured data from PDF files. It detects whether a PDF is text-based or scanned and converts it into clean, formatted Markdown.

Rating

0.0

(0)

Developer

WebScrap

Actor stats

Last modified: 25 days ago

📄 PDF Inspector Actor

A high-performance Apify Actor that inspects, classifies, and extracts structured data from PDF files. It detects whether a PDF is text-based or scanned and converts it into clean, formatted Markdown.

🚀 Features

⚡️ Blazing Fast: Native Rust implementation ensures minimal latency and low memory usage.
🧠 Smart Detection: Automatically classifies PDFs as TextBased, Scanned, ImageBased, or Mixed.
📝 Clean Markdown: Extracts text and formatting (headers, lists, code blocks, bold/italic) into LLM-ready Markdown.
⚙️ Highly Configurable: Fine-tune detection sensitivity, font sizes, and formatting rules.
🔒 Privacy First: All processing happens securely within the Actor container.

📥 Input

The Actor accepts a simple JSON input. You can configure the URL and various processing options.

Example Input

{
    "url": "https://pdfobject.com/pdf/sample.pdf",
    "detect_headers": true,
    "detect_lists": true,
    "fix_hyphenation": true
}

Configuration Options

Field	Type	Default	Description
`url`	`String`	Required	Direct URL to the PDF file.
`detect_headers`	`Boolean`	`true`	Detect headers based on font size hierarchy.
`detect_lists`	`Boolean`	`true`	Detect bullet points and numbered lists.
`detect_code`	`Boolean`	`true`	Detect code blocks using monospace fonts.
`fix_hyphenation`	`Boolean`	`true`	Attempt to rejoin words broken across lines.
`base_font_size`	`Number`	`Auto`	Override base font size (useful if headers aren't detected).
`remove_page_numbers`	`Boolean`	`true`	Cleanup standalone page numbers.
`format_urls`	`Boolean`	`true`	Convert URLs into Markdown links.
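
The options above can be assembled into an input payload in code. The following is a minimal sketch, not part of the Actor itself: `build_input` is a hypothetical helper, the defaults are copied from the table, and it assumes that leaving out `base_font_size` is what selects the `Auto` behavior.

```python
import json
from urllib.parse import urlparse

# Defaults copied from the Configuration Options table.
# base_font_size is omitted unless set (assumption: omission means "Auto").
DEFAULTS = {
    "detect_headers": True,
    "detect_lists": True,
    "detect_code": True,
    "fix_hyphenation": True,
    "remove_page_numbers": True,
    "format_urls": True,
}


def build_input(url, base_font_size=None, **flags):
    """Return a dict shaped like the Actor input (hypothetical helper)."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"url must be a direct http(s) link, got {url!r}")

    # Reject option names that do not appear in the table.
    unknown = set(flags) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    for name, value in flags.items():
        if not isinstance(value, bool):
            raise TypeError(f"{name} must be a boolean")

    payload = {"url": url, **DEFAULTS, **flags}
    if base_font_size is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(base_font_size, bool) or not isinstance(base_font_size, (int, float)):
            raise TypeError("base_font_size must be a number")
        payload["base_font_size"] = base_font_size
    return payload


if __name__ == "__main__":
    example = build_input("https://pdfobject.com/pdf/sample.pdf", detect_code=False)
    print(json.dumps(example, indent=4))
```

The resulting dictionary can be pasted into the Actor's input editor as JSON or passed as the run input through an Apify API client.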

📤 Output

The Actor saves the result to both the run's default key-value store and its default dataset.

Example Output JSON

{
    "url": "https://pdfobject.com/pdf/sample.pdf",
    "inspection_result": {
        "pdf_type": "TextBased",
        "text": null,
        "markdown": "# Sample PDF\n\nThis is a header...\n\n- List item 1\n- List item 2",
        "page_count": 1,
        "processing_time_ms": 12
    }
}

Output Fields

pdf_type: One of TextBased, Scanned, ImageBased, or Mixed.
markdown: The extracted content formatted as Markdown.
page_count: Total number of pages in the document.
processing_time_ms: Time taken to process the file in milliseconds.
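
As an illustration of consuming these fields, here is a small sketch that reads one output record and produces a summary. `summarize` and the derived `ms_per_page` value are hypothetical names introduced for this example. The sample record is copied from the example output above, and the sketch assumes that `markdown` may be null or empty for some documents.

```python
# Sample record copied from the Example Output JSON.
SAMPLE = {
    "url": "https://pdfobject.com/pdf/sample.pdf",
    "inspection_result": {
        "pdf_type": "TextBased",
        "text": None,
        "markdown": "# Sample PDF\n\nThis is a header...\n\n- List item 1\n- List item 2",
        "page_count": 1,
        "processing_time_ms": 12,
    },
}

# The four classifications listed under Output Fields.
PDF_TYPES = {"TextBased", "Scanned", "ImageBased", "Mixed"}


def summarize(record):
    """Flatten one output record into a small summary dict (hypothetical helper)."""
    result = record["inspection_result"]
    pdf_type = result["pdf_type"]
    if pdf_type not in PDF_TYPES:
        raise ValueError(f"unexpected pdf_type: {pdf_type!r}")

    # Assumption: markdown can be null or empty, so normalize it to a string.
    markdown = result.get("markdown") or ""
    pages = result["page_count"]
    return {
        "url": record["url"],
        "pdf_type": pdf_type,
        "pages": pages,
        # Guard against a zero page count to avoid division by zero.
        "ms_per_page": result["processing_time_ms"] / max(pages, 1),
        "has_markdown": bool(markdown.strip()),
        "markdown": markdown,
    }


if __name__ == "__main__":
    summary = summarize(SAMPLE)
    print(summary["pdf_type"], summary["pages"], summary["has_markdown"])
```

Checking `has_markdown` before passing the text downstream is one way to route documents without extractable text to a separate step.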

PDF Scraper

onidivo/pdf-scraper

Scrape and extract text from PDF links.

Onidivo Technologies

494

Pdf API

vivid_astronaut/pdf

Fabio Suizu

Fast Pdf Processor

contemporary_fruit/pdf-processor-actor

A PDF processing service that lets users upload a PDF and either extract text (reads all text from the PDF and returns it as structured JSON data per page) or merge pages (creates a new PDF containing only the pages the user selects).

Andric

Extract text from PDF

akash9078/pdf-text-extractor

Efficiently extract text content from PDF files, ideal for data processing, content analysis, and automation workflows. Supports various PDF structures and outputs clean, readable text.

Akash Kumar Naik

Pdf Text Extractor Pro

dainty_screw/pdf-text-extractor-pro

PDF Text Extractor lets you quickly extract text from PDF files with high accuracy. Supports text chunking for AI, chatbots, and large language models (LLMs), making PDF-to-text conversion fast, clean, and ready for NLP or machine learning.

codemaster devops

5.0

PDF Text Extractor

jirimoravcik/pdf-text-extractor

PDF Text Extractor allows you to extract text from PDF files. It also supports chunking of the text to prepare the data for usage with large language models.

Jiří Moravčík

984

5.0

HTML to PDF Converter

jancurn/url-to-pdf

Loads a web page in headless Chrome using Puppeteer and prints it to PDF. The input is a JSON object and output is a PDF file.

Jan Čurn

408

Convert Image to PDF and PDF to Image

akash9078/image-pdf-converter

Convert images (JPG, PNG, BMP, and more) into high-quality PDFs, or extract images from PDF files in seconds. Image–PDF Converter Pro delivers fast, reliable, and professional results for all your document and image conversion needs.

Akash Kumar Naik

PDF to Markdown Converter

web.harvester/pdf-to-markdown-converter

Convert PDFs to clean Markdown with optional OCR for scanned documents. Uses PDF.js for text extraction and Tesseract.js for optical character recognition.