📄 PDF Text Extractor

Pricing

from $3.99 / 1,000 results

📄 PDF Text Extractor (pdf-text-extractor) extracts clean text from PDF files for faster search, data analysis, and content reuse. ⚡ Saves time & boosts productivity for research, automation, and document workflows.

Developer: Scrapio (maintained by Community)

📄 PDF Text Extractor & Chunker

Extract clean, ordered text from PDFs on the web, either page by page or split into LLM-ready chunks with controllable size and overlap. Point it at one URL or thousands; results stream into your dataset section by section while the run is still in progress.

Perfect for building RAG pipelines, question-answering systems, document search, and any workflow that needs PDF content as plain text. 🚀

🌟 Why Choose This Actor?

  • ⚡ Live results: every page or chunk is saved the moment it's ready, so a long run never leaves you staring at an empty output table.
  • 🧩 LLM-friendly chunking: character-based chunking with optional overlap, so text near a chunk boundary can be repeated at the start of the next chunk instead of losing its context.
  • 📦 Bulk input: drop in a whole list of PDF URLs at once.
  • 🛡️ Smart anti-rate-limit ladder: starts with a direct connection and automatically falls back to datacenter, then residential proxies if a host blocks you.
  • 🎉 Engaging real-time logs: watch exactly what's happening, page by page.

✨ Key Features

  • Extract text from PDFs provided as URLs.
  • Toggle between page mode (one record per page) and chunk mode.
  • Configure chunkSize and chunkOverlap to size chunks for your LLM's context window.
  • Resilient downloads with proxy fallback and retries.
  • Output ready for JSON / CSV / XLSX export.

📥 Input

| Field | Type | Description |
| --- | --- | --- |
| urls | array | 🔗 Direct URLs of the PDF files (bulk supported). |
| performChunking | boolean | ✂️ true → split into chunks. false → one record per page. |
| chunkSize | integer | 📏 Max characters per chunk (chunk mode). Default 1000. |
| chunkOverlap | integer | 🔁 Characters shared between adjacent chunks. Default 0. |
| proxyConfiguration | object | 🛡️ Apify proxy used to power the automatic fallbacks. |

Example input

{
  "urls": ["https://arxiv.org/pdf/2307.12856"],
  "performChunking": true,
  "chunkSize": 1000,
  "chunkOverlap": 0,
  "proxyConfiguration": { "useApifyProxy": true }
}
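
To make the interaction between chunkSize and chunkOverlap concrete, here is a minimal Python sketch of character-based chunking with overlap. The function `chunk_text` and its exact boundary behavior are assumptions for illustration; the Actor's real splitting logic is not documented here and may differ.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 0) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Each window after the first starts chunk_overlap characters before the
    end of the previous one, so that many characters are shared.
    Hypothetical helper, not the Actor's own code.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

With chunkOverlap at its default of 0 the windows simply sit side by side; a non-zero overlap trades some duplicated text for extra context around each boundary.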

📤 Output

Each record is one text section:

{
  "url": "https://arxiv.org/pdf/2307.12856",
  "index": 0,
  "text": "A Real-World WebAgent with Planning, Long Context Understanding…"
}

| Field | Description |
| --- | --- |
| url | 🔗 Source PDF URL. |
| index | 🔢 Position of the section (chunk number, or page number in page mode). |
| text | 📝 Extracted text for that section. |

πŸ›‘οΈ How the connection ladder works

  1. 🌐 Direct: no proxy; the request goes straight to the PDF host.
  2. 🛰️ Datacenter proxy: engaged automatically if the host blocks or rate-limits the direct request.
  3. 🏠 Residential proxy: the final fallback, retried up to 3 times. Once residential is engaged, the run sticks with it for every remaining PDF.

Every switch is logged clearly so you always know which path delivered your data.
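
The ladder above can be sketched roughly as follows. This illustrates the described behavior only and is not the Actor's source: the `fetch(url, tier)` callable, the `Blocked` exception, and the reading of "retried up to 3 times" as three attempts in total are all assumptions.

```python
import logging
from typing import Callable

log = logging.getLogger("connection_ladder")

TIERS = ("direct", "datacenter", "residential")
RESIDENTIAL_ATTEMPTS = 3  # assumption: "retried up to 3 times" = 3 attempts


class Blocked(Exception):
    """Hypothetical error raised when the host blocks or rate-limits a request."""


class ConnectionLadder:
    def __init__(self, fetch: Callable[[str, str], bytes]):
        self.fetch = fetch   # hypothetical: fetch(url, tier) -> PDF bytes
        self.sticky = False  # becomes True once residential has been engaged

    def download(self, url: str) -> tuple[str, bytes]:
        """Try each tier in order and return (tier_used, data)."""
        tiers = ("residential",) if self.sticky else TIERS
        for tier in tiers:
            attempts = 1
            if tier == "residential":
                self.sticky = True  # stay on residential for remaining PDFs
                attempts = RESIDENTIAL_ATTEMPTS
            for attempt in range(1, attempts + 1):
                try:
                    data = self.fetch(url, tier)
                except Blocked:
                    log.warning("%s blocked via %s (attempt %d/%d)",
                                url, tier, attempt, attempts)
                    continue
                log.info("%s delivered via %s", url, tier)
                return tier, data
        raise Blocked(f"all connection tiers failed for {url}")
```

Keeping the sticky flag on the ladder object rather than per URL is what makes later PDFs skip the direct and datacenter steps once a run has had to escalate.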

🚀 How to Use (Apify Console)

  1. Log in at Apify Console → Actors.
  2. Open PDF Text Extractor & Chunker.
  3. Paste your PDF URLs, set chunking options, pick a proxy.
  4. Click Start and watch the sections roll in live. 📡
  5. Open the Output tab and export to JSON / CSV / XLSX.

🤖 Use via API

curl -X POST "https://api.apify.com/v2/acts/<ACTOR_ID>/run-sync-get-dataset-items?token=$APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"urls":["https://arxiv.org/pdf/2307.12856"],"performChunking":true,"chunkSize":1000,"chunkOverlap":0}'
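
The same call can be made from Python with only the standard library. This sketch mirrors the curl command above; the helper names `build_request` and `run_actor` and the 300-second timeout are assumptions, and `<ACTOR_ID>` remains a placeholder you must fill in yourself.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2/acts"


def build_request(actor_id: str, token: str, actor_input: dict) -> urllib.request.Request:
    """Build the POST request for the run-sync-get-dataset-items endpoint."""
    query = urllib.parse.urlencode({"token": token})
    url = f"{API_BASE}/{actor_id}/run-sync-get-dataset-items?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_actor(actor_id: str, token: str, actor_input: dict, timeout: float = 300) -> list:
    """Run the Actor synchronously and return the dataset items as a list."""
    request = build_request(actor_id, token, actor_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example, left commented out because it needs a real Actor ID and token:
# import os
# items = run_actor("<ACTOR_ID>", os.environ["APIFY_TOKEN"], {
#     "urls": ["https://arxiv.org/pdf/2307.12856"],
#     "performChunking": True,
#     "chunkSize": 1000,
#     "chunkOverlap": 0,
# })
```

A synchronous run holds the HTTP connection open until the Actor finishes, so a generous timeout is worth setting for long PDF lists.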

💡 Best Use Cases

  • 📚 Build RAG / knowledge bases from PDF libraries.
  • 🤖 Feed document text into LLMs (chunk mode).
  • 🔍 Full-text search across PDF collections.
  • 🧾 Convert reports, papers, and manuals to plain text.

❓ FAQ

Does it work on scanned/image-only PDFs? Only if they contain an embedded text layer. The Actor extracts the text layer of a PDF and does not perform OCR, so image-only scans return little or no text.
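
After a run, you may want to flag PDFs that probably need OCR elsewhere. The sketch below totals the extracted characters per URL and reports the ones under a threshold. The helper `urls_with_little_text` and the 50-character default are assumptions, as is the possibility that a scanned PDF yields no records at all (which is why the submitted URLs can be passed in).

```python
from collections import Counter


def urls_with_little_text(records, expected_urls=(), min_chars=50):
    """Return URLs whose total extracted text is shorter than min_chars.

    expected_urls lets you include PDFs that produced no records at all.
    Hypothetical helper; the threshold is an arbitrary starting point.
    """
    totals = Counter({url: 0 for url in expected_urls})
    for record in records:
        totals[record["url"]] += len(record.get("text", "").strip())
    return sorted(url for url, count in totals.items() if count < min_chars)
```

For example, passing the dataset items as `records` and your original urls list as `expected_urls` returns the subset worth sending through an OCR step.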

Can I pass many URLs? Yes. urls accepts a bulk list, processed one after another with results saved live.

What if a host rate-limits me? The Actor automatically falls back from a direct connection to datacenter and then residential proxies, retrying as it goes. Once residential is engaged, it stays on residential for the remaining PDFs.

🛟 Support & Feedback

Found a bug or have a feature request? Open an issue on the Actor's Issues tab in the Apify Console.

⚖️ Use responsibly. Only extract content from PDFs you are authorized to access. You are responsible for compliance with applicable laws and the source site's terms.