Pdf Scraper
Pricing
$14.00/month + usage
Developer

WebScrap
Last modified
5 days ago
📄 PDF Inspector Actor
A high-performance Apify Actor that inspects, classifies, and extracts structured data from PDF files. It detects whether a PDF is text-based or scanned and converts it into clean, formatted Markdown.
🚀 Features
- ⚡️ Blazing Fast: Native Rust implementation ensures minimal latency and low memory usage.
- 🧠 Smart Detection: Automatically classifies PDFs as `TextBased`, `Scanned`, `ImageBased`, or `Mixed`.
- 📝 Clean Markdown: Extracts text and formatting (headers, lists, code blocks, bold/italic) into LLM-ready Markdown.
- ⚙️ Highly Configurable: Fine-tune detection sensitivity, font sizes, and formatting rules.
- 🔒 Privacy First: All processing happens securely within the Actor container.
📥 Input
The Actor accepts a simple JSON input. You can configure the URL and various processing options.
Example Input
{
  "url": "https://pdfobject.com/pdf/sample.pdf",
  "detect_headers": true,
  "detect_lists": true,
  "fix_hyphenation": true
}
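One possible way to send this input to the Actor from Python is through the `apify-client` package. The sketch below is an assumption, not part of the Actor's documentation: `run_pdf_actor` is a hypothetical helper, `ACTOR_ID` is a placeholder to replace with the ID shown on the Actor's store page, and it assumes the dataset items have the shape shown under Example Output JSON.

```python
from typing import Any

# Hypothetical placeholder: replace with the Actor's real ID from its store page.
ACTOR_ID = "<username>/<actor-name>"


def run_pdf_actor(client: Any, run_input: dict, actor_id: str = ACTOR_ID) -> list:
    """Start the Actor, wait for the run to finish, and return its dataset items.

    `client` is expected to be an `apify_client.ApifyClient` instance.
    """
    # ActorClient.call() starts a run and blocks until it finishes.
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return a result")
    # Each run writes to its own default dataset; read every item from it.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# Usage (requires `pip install apify-client` and an Apify API token):
#   from apify_client import ApifyClient
#   items = run_pdf_actor(
#       ApifyClient("<YOUR_API_TOKEN>"),
#       {"url": "https://pdfobject.com/pdf/sample.pdf", "detect_headers": True},
#   )
```

Passing the client in as an argument keeps the helper free of credentials and makes it easy to exercise with a stub.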
Configuration Options
| Field | Type | Default | Description |
|---|---|---|---|
| `url` | String | Required | Direct URL to the PDF file. |
| `detect_headers` | Boolean | `true` | Detect headers based on font size hierarchy. |
| `detect_lists` | Boolean | `true` | Detect bullet points and numbered lists. |
| `detect_code` | Boolean | `true` | Detect code blocks using monospace fonts. |
| `fix_hyphenation` | Boolean | `true` | Attempt to rejoin words broken across lines. |
| `base_font_size` | Number | Auto | Override the base font size (useful if headers aren't detected). |
| `remove_page_numbers` | Boolean | `true` | Remove standalone page numbers. |
| `format_urls` | Boolean | `true` | Convert URLs into Markdown links. |
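The options in the table can be assembled and checked on the client side before a run is started. The following is a minimal sketch under stated assumptions: `build_input` and `DEFAULTS` are hypothetical names, the default values are copied from the table, and "Auto" for `base_font_size` is modelled by leaving the key out. The Actor itself may validate its input differently.

```python
from typing import Any

# Defaults copied from the Configuration Options table.
# base_font_size defaults to "Auto", so it is omitted unless overridden.
DEFAULTS: dict[str, bool] = {
    "detect_headers": True,
    "detect_lists": True,
    "detect_code": True,
    "fix_hyphenation": True,
    "remove_page_numbers": True,
    "format_urls": True,
}


def build_input(url: str, **overrides: Any) -> dict[str, Any]:
    """Hypothetical helper: merge overrides onto the documented defaults."""
    if not url:
        raise ValueError("url is required")
    run_input: dict[str, Any] = {"url": url, **DEFAULTS}
    for key, value in overrides.items():
        if key == "base_font_size":
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise TypeError("base_font_size must be a number")
        elif key in DEFAULTS:
            if not isinstance(value, bool):
                raise TypeError(f"{key} must be a boolean")
        else:
            raise KeyError(f"unknown option: {key}")
        run_input[key] = value
    return run_input
```

For example, `build_input("https://pdfobject.com/pdf/sample.pdf", detect_code=False, base_font_size=11)` keeps every other option at its documented default.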
📤 Output
The Actor saves the result to both the default key-value store and the default dataset.
Example Output JSON
{
  "url": "https://pdfobject.com/pdf/sample.pdf",
  "inspection_result": {
    "pdf_type": "TextBased",
    "text": null,
    "markdown": "# Sample PDF\n\nThis is a header...\n\n- List item 1\n- List item 2",
    "page_count": 1,
    "processing_time_ms": 12
  }
}
Output Fields
- `pdf_type`: One of `TextBased`, `Scanned`, `ImageBased`, `Mixed`.
- `markdown`: The extracted content formatted as Markdown.
- `page_count`: Total number of pages in the document.
- `processing_time_ms`: Time taken to process the file in milliseconds.
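A record in this shape can be read into a typed structure on the consumer side. The sketch below is illustrative only: `Inspection` and `parse_record` are hypothetical names, the sample is the Example Output JSON above, and treating `markdown` as possibly `null` is an assumption rather than documented behaviour.

```python
import json
from dataclasses import dataclass
from typing import Optional

# The four classifications listed under Output Fields.
PDF_TYPES = {"TextBased", "Scanned", "ImageBased", "Mixed"}


@dataclass
class Inspection:
    url: str
    pdf_type: str
    markdown: Optional[str]  # assumed nullable, like "text" in the example
    page_count: int
    processing_time_ms: int


def parse_record(raw: str) -> Inspection:
    """Parse one output record and reject unknown pdf_type values."""
    record = json.loads(raw)
    result = record["inspection_result"]
    if result["pdf_type"] not in PDF_TYPES:
        raise ValueError(f"unexpected pdf_type: {result['pdf_type']}")
    return Inspection(
        url=record["url"],
        pdf_type=result["pdf_type"],
        markdown=result.get("markdown"),
        page_count=result["page_count"],
        processing_time_ms=result["processing_time_ms"],
    )


# Raw string so the JSON "\n" escapes reach json.loads unchanged.
SAMPLE = r'''{"url": "https://pdfobject.com/pdf/sample.pdf",
"inspection_result": {"pdf_type": "TextBased", "text": null,
"markdown": "# Sample PDF\n\nThis is a header...\n\n- List item 1\n- List item 2",
"page_count": 1, "processing_time_ms": 12}}'''

inspection = parse_record(SAMPLE)
print(inspection.pdf_type, inspection.page_count)
```

Checking `pdf_type` against the documented set makes a consumer fail early if the record does not match the format described here.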