PDF Text Extractor
Pricing
from $2.99 / 1,000 results
PDF Text Extractor pulls clean text from PDF files quickly and accurately. It suits parsing, indexing, and document search, and saves hours of manual copy-paste. Try it now!
Developer
SimpleAPI
Maintained by Community
PDF Text Extractor & Chunker
Extract clean, ordered text from any PDF on the web, either page by page or split into LLM-ready chunks with controllable size and overlap. Point it at one URL or thousands; results stream into your dataset section by section, live.
Perfect for building RAG pipelines, question-answering systems, document search, and any workflow that needs PDF content as plain text.
Why Choose This Actor?
- Live results: every page or chunk is saved the moment it's ready, so a long run never leaves you staring at an empty output table.
- LLM-friendly chunking: character-based chunking with configurable overlap, so context that straddles a chunk boundary can be carried into the next chunk.
- Bulk input: drop in a whole list of PDF URLs at once.
- Smart anti-rate-limit ladder: starts with a direct connection and automatically falls back to datacenter, then residential proxies if a host blocks you.
- Real-time logs: watch exactly what's happening, page by page.
Key Features
- Extract text from PDFs provided as URLs.
- Toggle between page mode (one record per page) and chunk mode.
- Configure `chunkSize` and `chunkOverlap` to fit your LLM context window.
- Resilient downloads with proxy fallback and retries.
- Output ready for JSON / CSV / XLSX export.
Input
| Field | Type | Description |
|---|---|---|
| `urls` | array | Direct URLs of the PDF files (bulk supported). |
| `performChunking` | boolean | `true`: split into chunks. `false`: one record per page. |
| `chunkSize` | integer | Max characters per chunk (chunk mode). Default 1000. |
| `chunkOverlap` | integer | Characters shared between adjacent chunks. Default 0. |
| `proxyConfiguration` | object | Apify proxy used to power the automatic fallbacks. |
Example input
{"urls": ["https://arxiv.org/pdf/2307.12856"],"performChunking": true,"chunkSize": 1000,"chunkOverlap": 0,"proxyConfiguration": { "useApifyProxy": true }}
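To make `chunkSize` and `chunkOverlap` concrete, here is a minimal Python sketch of character-based chunking with overlap. It is an illustration only: the function name `chunk_text` and the exact boundary handling are assumptions, and the Actor's own implementation may differ.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 0) -> list[str]:
    """Split text into character windows that share `chunk_overlap` characters.

    Hypothetical helper for illustration; not the Actor's actual code.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "abcdefghijklmnopqrstuvwxyz"
    for i, chunk in enumerate(chunk_text(sample, chunk_size=10, chunk_overlap=3)):
        print(i, chunk)
```

With `chunk_size=10` and `chunk_overlap=3`, each chunk starts with the last three characters of the previous one. An overlap of 0 (the default in the table above) gives back-to-back chunks with no shared text.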
Output
Each record is one text section:
{"url": "https://arxiv.org/pdf/2307.12856","index": 0,"text": "A Real-World WebAgent with Planning, Long Context Understanding…"}
| Field | Description |
|---|---|
| `url` | Source PDF URL. |
| `index` | Position of the section (chunk number, or page number in page mode). |
| `text` | Extracted text for that section. |
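Because a bulk run returns records for many PDFs in one dataset, a common first step is to group them by `url` and order them by `index`. The sketch below shows one way to do that with exported records; the helper name `group_sections` and the sample URLs are made up for illustration.

```python
from collections import defaultdict


def group_sections(records):
    """Group dataset records by source URL and order each group by `index`."""
    by_url = defaultdict(list)
    for record in records:
        by_url[record["url"]].append(record)
    return {
        url: [r["text"] for r in sorted(items, key=lambda r: r["index"])]
        for url, items in by_url.items()
    }


# Sample records shaped like the output table above (values are invented).
records = [
    {"url": "https://example.com/a.pdf", "index": 1, "text": "second section"},
    {"url": "https://example.com/a.pdf", "index": 0, "text": "first section"},
    {"url": "https://example.com/b.pdf", "index": 0, "text": "only section"},
]
sections = group_sections(records)
print(sections["https://example.com/a.pdf"])
```

In page mode, joining the ordered texts gives back the document text page by page. In chunk mode with a non-zero `chunkOverlap`, adjacent chunks repeat some characters, so a plain join would duplicate the overlapping text.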
How the connection ladder works
- Direct: no proxy; the request goes straight to the PDF host.
- Datacenter proxy: engaged automatically if the host blocks or rate-limits the direct request.
- Residential proxy: the final fallback, retried up to 3 times. Once residential is engaged, the run sticks with it for every remaining PDF.
Every switch is logged clearly so you always know which path delivered your data.
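The ladder described above can be sketched roughly as follows. This is not the Actor's code: the `fetch(url, tier)` callable, the `Blocked` exception, and the class name are hypothetical, "retried up to 3 times" is read here as three residential attempts in total, and it is assumed that each PDF starts again from the direct tier until residential has been engaged.

```python
from typing import Callable

LADDER = ["direct", "datacenter", "residential"]
RESIDENTIAL_ATTEMPTS = 3  # assumption: "retried up to 3 times" means 3 attempts


class Blocked(Exception):
    """Raised by the hypothetical fetcher when a host blocks or rate-limits."""


class LadderDownloader:
    def __init__(self, fetch: Callable[[str, str], bytes]):
        self.fetch = fetch               # hypothetical: fetch(url, tier) -> PDF bytes
        self.sticky_residential = False  # True once residential has been engaged

    def download(self, url: str) -> bytes:
        tiers = ["residential"] if self.sticky_residential else LADDER
        for tier in tiers:
            attempts = 1
            if tier == "residential":
                attempts = RESIDENTIAL_ATTEMPTS
                self.sticky_residential = True  # stay here for all remaining PDFs
            for attempt in range(1, attempts + 1):
                try:
                    data = self.fetch(url, tier)
                    print(f"{url}: fetched via {tier} (attempt {attempt})")
                    return data
                except Blocked:
                    print(f"{url}: {tier} blocked (attempt {attempt})")
        raise Blocked(f"all tiers failed for {url}")


# Demo with a fake host that only answers residential requests.
calls = []


def fake_fetch(url: str, tier: str) -> bytes:
    calls.append(tier)
    if tier != "residential":
        raise Blocked(tier)
    return b"%PDF-1.7 fake"


downloader = LadderDownloader(fake_fetch)
downloader.download("https://example.com/a.pdf")
downloader.download("https://example.com/b.pdf")
print(calls)
```

In the demo, the first PDF walks the whole ladder and the second goes straight to residential, which matches the sticky behaviour described above.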
How to Use (Apify Console)
- Log in to Apify Console and go to Actors.
- Open PDF Text Extractor & Chunker.
- Paste your PDF URLs, set chunking options, pick a proxy.
- Click Start and watch the sections roll in live.
- Open the Output tab and export to JSON / CSV / XLSX.
- Open the Output tab and export to JSON / CSV / XLSX.
Use via API
curl -X POST "https://api.apify.com/v2/acts/<ACTOR_ID>/run-sync-get-dataset-items?token=$APIFY_TOKEN" -H "Content-Type: application/json" -d '{"urls":["https://arxiv.org/pdf/2307.12856"],"performChunking":true,"chunkSize":1000,"chunkOverlap":0}'
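The same call can be made from Python using only the standard library. This is a sketch under a few assumptions: the helper names `build_request` and `run_actor` and the `APIFY_ACTOR_ID` environment variable are invented for this example, and the response is assumed to be a JSON array of dataset items. The Actor ID is left for you to supply, as in the curl command.

```python
import json
import os
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2/acts"


def build_request(actor_id: str, token: str, actor_input: dict) -> urllib.request.Request:
    """Build the POST request for the run-sync-get-dataset-items endpoint."""
    query = urllib.parse.urlencode({"token": token})
    url = f"{API_BASE}/{actor_id}/run-sync-get-dataset-items?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_actor(actor_id: str, token: str, actor_input: dict, timeout: float = 300):
    """Run the Actor synchronously and return the parsed dataset items."""
    request = build_request(actor_id, token, actor_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # APIFY_ACTOR_ID is a made-up variable name; set both before running.
    token = os.environ.get("APIFY_TOKEN")
    actor_id = os.environ.get("APIFY_ACTOR_ID")
    if token and actor_id:
        items = run_actor(actor_id, token, {
            "urls": ["https://arxiv.org/pdf/2307.12856"],
            "performChunking": True,
            "chunkSize": 1000,
            "chunkOverlap": 0,
        })
        print(f"received {len(items)} sections")
```

A synchronous run keeps the HTTP connection open until the Actor finishes, so the generous `timeout` matters for long PDFs or large URL lists.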
Best Use Cases
- Build RAG / knowledge bases from PDF libraries.
- Feed document text into LLMs (chunk mode).
- Full-text search across PDF collections.
- Convert reports, papers, and manuals to plain text.
FAQ
Does it work on scanned/image-only PDFs? It extracts the text layer of a PDF. Image-only scans without an embedded text layer will return little or no text (OCR is not performed).
Can I pass many URLs? Yes. `urls` accepts a bulk list, processed one after another with results saved live.
What if a host rate-limits me? The Actor falls back automatically, first to datacenter and then to residential proxies with retries. Once residential is engaged, it stays on residential for the rest of the run.
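On the scanned-PDF question: since image-only pages come back with little or no text, a quick check over the output can flag sections that probably need OCR elsewhere. The sketch below is one possible heuristic, most useful in page mode. The helper name `find_sparse_sections` and the 20-character threshold are arbitrary assumptions, not Actor behaviour.

```python
def find_sparse_sections(records, min_chars=20):
    """Return (url, index) pairs whose extracted text is shorter than `min_chars`."""
    return [
        (r["url"], r["index"])
        for r in records
        if len((r.get("text") or "").strip()) < min_chars
    ]


# Invented sample records shaped like the Actor's output.
pages = [
    {"url": "https://example.com/scan.pdf", "index": 0, "text": ""},
    {"url": "https://example.com/scan.pdf", "index": 1, "text": "  \n "},
    {"url": "https://example.com/report.pdf", "index": 0,
     "text": "Quarterly results improved across all regions."},
]
print(find_sparse_sections(pages))
```

A short page is not always a scan (think of title pages or blank separators), so treat the result as a list of candidates to review.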
Support & Feedback
Found a bug or have a feature request? Open an issue on the Actor's Issues tab in the Apify Console.
Use responsibly. Only extract content from PDFs you are authorized to access. You are responsible for compliance with applicable laws and the source site's terms.