Pdf Scraper

Pricing

$14.00/month + usage

A high-performance Apify Actor that inspects, classifies, and extracts structured data from PDF files. It detects whether a PDF is text-based or scanned and converts it into clean, formatted Markdown.

Rating: 0.0 (0)

Developer: WebScrap (maintained by Community)

Actor stats

  • Bookmarked: 1
  • Total users: 1
  • Monthly active users: 0
  • Last modified: 5 days ago

📄 PDF Inspector Actor

🚀 Features

  • ⚡️ Blazing Fast: Native Rust implementation ensures minimal latency and low memory usage.
  • 🧠 Smart Detection: Automatically classifies PDFs as TextBased, Scanned, ImageBased, or Mixed.
  • 📝 Clean Markdown: Extracts text and formatting (headers, lists, code blocks, bold/italic) into LLM-ready Markdown.
  • ⚙️ Highly Configurable: Fine-tune detection sensitivity, font sizes, and formatting rules.
  • 🔒 Privacy First: All processing happens securely within the Actor container.

📥 Input

The Actor accepts a simple JSON input. You can configure the URL and various processing options.

Example Input

{
  "url": "https://pdfobject.com/pdf/sample.pdf",
  "detect_headers": true,
  "detect_lists": true,
  "fix_hyphenation": true
}

Configuration Options

| Field | Type | Default | Description |
|---|---|---|---|
| url | String | Required | Direct URL to the PDF file. |
| detect_headers | Boolean | true | Detect headers based on font size hierarchy. |
| detect_lists | Boolean | true | Detect bullet points and numbered lists. |
| detect_code | Boolean | true | Detect code blocks using monospace fonts. |
| fix_hyphenation | Boolean | true | Attempt to rejoin words broken across lines. |
| base_font_size | Number | Auto | Override base font size (useful if headers aren't detected). |
| remove_page_numbers | Boolean | true | Clean up standalone page numbers. |
| format_urls | Boolean | true | Convert URLs into Markdown links. |
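
As a rough illustration of how these options combine, the sketch below builds an input object in Python, starting from the defaults listed above and applying overrides. The helper name `build_input` and its validation rules (including the http(s) check on `url`) are assumptions made for this example; they are not part of the Actor.

```python
from typing import Any, Optional

# Defaults copied from the Configuration Options table.
BOOLEAN_DEFAULTS = {
    "detect_headers": True,
    "detect_lists": True,
    "detect_code": True,
    "fix_hyphenation": True,
    "remove_page_numbers": True,
    "format_urls": True,
}


def build_input(url: str, base_font_size: Optional[float] = None, **flags: bool) -> dict[str, Any]:
    """Build an Actor input dict (hypothetical helper, not provided by the Actor)."""
    # Assumption: a "direct URL" means an http(s) link.
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must be a direct http(s) link to a PDF")
    unknown = set(flags) - set(BOOLEAN_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    for name, value in flags.items():
        if not isinstance(value, bool):
            raise TypeError(f"{name} must be a boolean")
    run_input: dict[str, Any] = {"url": url, **BOOLEAN_DEFAULTS, **flags}
    # base_font_size is auto-detected by default, so send it only when overridden.
    if base_font_size is not None:
        run_input["base_font_size"] = base_font_size
    return run_input
```

For example, `build_input("https://pdfobject.com/pdf/sample.pdf", detect_code=False)` keeps every default except code-block detection. Sending the defaults explicitly is a design choice for readability; omitting them should behave the same if the Actor applies the documented defaults.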

📤 Output

The Actor saves the result to the run's default key-value store and default dataset.
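
One possible way to read the dataset side of this from Python is sketched below. It assumes the `client` argument is an `ApifyClient` instance from the `apify-client` package; the function name `run_and_collect` is hypothetical, and the actor ID must be taken from the Actor's store page, since this document does not state it.

```python
from typing import Any


def run_and_collect(client: Any, actor_id: str, run_input: dict[str, Any]) -> list[dict[str, Any]]:
    """Start the Actor, wait for it to finish, and return its dataset items.

    `client` is assumed to be an apify_client.ApifyClient instance.
    """
    # call() starts the run and blocks until it finishes.
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("the Actor run did not return any run details")
    # Each finished run exposes the ID of its default dataset.
    dataset_id = run["defaultDatasetId"]
    return list(client.dataset(dataset_id).iterate_items())


# Usage sketch (needs `pip install apify-client` and a real API token):
#   from apify_client import ApifyClient
#   client = ApifyClient("<YOUR_API_TOKEN>")
#   items = run_and_collect(client, "<ACTOR_ID>", {"url": "https://pdfobject.com/pdf/sample.pdf"})
```

Passing the client in as a parameter keeps the function easy to exercise with a stub object, without network access.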

Example Output JSON

{
  "url": "https://pdfobject.com/pdf/sample.pdf",
  "inspection_result": {
    "pdf_type": "TextBased",
    "text": null,
    "markdown": "# Sample PDF\n\nThis is a header...\n\n- List item 1\n- List item 2",
    "page_count": 1,
    "processing_time_ms": 12
  }
}

Output Fields

  • pdf_type: The detected document class. One of TextBased, Scanned, ImageBased, or Mixed.
  • markdown: The extracted content formatted as Markdown.
  • page_count: Total number of pages in the document.
  • processing_time_ms: Time taken to process the file in milliseconds.
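
A minimal sketch of consuming one output record in Python follows, assuming each dataset item has the shape of the example output above. The `InspectionResult` class and `parse_item` function are illustrative names introduced here, not something the Actor ships.

```python
from dataclasses import dataclass
from typing import Any, Optional

# The four classifications listed under Output Fields.
PDF_TYPES = {"TextBased", "Scanned", "ImageBased", "Mixed"}


@dataclass
class InspectionResult:
    url: str
    pdf_type: str
    markdown: Optional[str]
    page_count: int
    processing_time_ms: int


def parse_item(item: dict[str, Any]) -> InspectionResult:
    """Convert one output record into a typed object (hypothetical helper)."""
    result = item["inspection_result"]
    if result["pdf_type"] not in PDF_TYPES:
        raise ValueError(f"unexpected pdf_type: {result['pdf_type']!r}")
    return InspectionResult(
        url=item["url"],
        pdf_type=result["pdf_type"],
        # Treated as optional here because the example shows that fields such as "text" can be null.
        markdown=result.get("markdown"),
        page_count=result["page_count"],
        processing_time_ms=result["processing_time_ms"],
    )
```

Checking `pdf_type` against the documented values makes a change in the output schema fail loudly instead of passing through unnoticed.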