PDF Scraper

1 day trial then $20.00/month - No credit card required now

onidivo/pdf-scraper

Scrape PDF files from links and extract their text.

Features

  • Scrape multiple files
  • Save the file and extracted text to the key-value store
  • Want more? Let us know here

Cost of usage

When running the Actor with 2048 MB of memory and datacenter proxies, average consumption is $4-8 per 1,000 medium-sized files.
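
As a rough illustration of the figure above, the following sketch scales the quoted $4-8 per 1,000 files linearly to an arbitrary batch size. The function name `estimate_usage_cost` is hypothetical, and linear scaling is an assumption: actual consumption depends on file size, memory, and proxy type.

```python
def estimate_usage_cost(num_files, low_per_1000=4.0, high_per_1000=8.0):
    """Return a (low, high) USD estimate for processing num_files PDFs.

    Assumes cost scales linearly from the quoted range for 1,000
    medium-sized files at 2048 MB with datacenter proxies.
    """
    if num_files < 0:
        raise ValueError("num_files must be non-negative")
    scale = num_files / 1000
    return (scale * low_per_1000, scale * high_per_1000)


low, high = estimate_usage_cost(2500)
print(f"${low:.2f} to ${high:.2f}")  # prints "$10.00 to $20.00"
```

This covers platform usage only and does not include the monthly rental price of the Actor.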

Bugs, issues, features, and feedback

You can report issues on the Actor's "Issues" tab, where you can also discuss them and leave feedback.

Input

You can provide input either through the editor on the Apify platform or as a JSON object.

The only mandatory field is the list of PDF URLs (pdfUrls).

An example of minimal input:

{
    "pdfUrl": [
        {
            "url": "http://www.pdf995.com/samples/pdf.pdf"
        }
    ],
    "proxyConfiguration": {
        "useApifyProxy": true
    }
}

We recommend using proxies if the sites hosting your PDFs block or detect automated requests.
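
When the URL list comes from another program, the input object can be assembled in code. The sketch below is one possible way to do that in Python. The helper name `build_run_input` is hypothetical, and the default key `pdfUrl` is copied from the minimal example above. Since the text names the field `pdfUrls`, check the Actor's input schema and pass `urls_key` accordingly.

```python
import json


def build_run_input(urls, use_apify_proxy=True, urls_key="pdfUrl"):
    """Build an input object shaped like the minimal example.

    urls_key is an assumption taken from the example JSON; adjust it
    if the Actor's input schema uses a different field name.
    """
    seen = set()
    sources = []
    for url in urls:
        url = url.strip()
        # Skip blank entries and duplicates so each file is scraped once.
        if not url or url in seen:
            continue
        seen.add(url)
        sources.append({"url": url})
    if not sources:
        raise ValueError("at least one PDF URL is required")
    return {
        urls_key: sources,
        "proxyConfiguration": {"useApifyProxy": use_apify_proxy},
    }


run_input = build_run_input(["http://www.pdf995.com/samples/pdf.pdf"])
print(json.dumps(run_input, indent=4))
```

The printed JSON can be pasted into the input editor on the Apify platform. If you use the official `apify-client` Python package, a dictionary like this would typically be passed as `run_input` when calling the Actor.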

Output

The extracted text is saved to the dataset, and it looks like this:

[
    {
        "pdfUrl": "http://www.pdf995.com/samples/pdf.pdf",
        "extractedText": "\n\n\n\n\n\n\n\n\nThe pdf995 suite of products - Pdf995, PdfEdit995, and Signature995 - is a complete solution for your document publishing needs. It provides ease of use, flexibility in format, and industry-standard security- and all at no cost to you.\nPdf995 makes it easy and affordable to create professional-quality documents in the popular PDF file format. Its easy-to-use interface helps you to create PDF files by simply selecting the \"print\" command from any application, creating documents which can be viewed on any computer with a PDF viewer. Pdf995 supports network file saving, fast user switching on XP, Citrix/Terminal Server, custom page sizes and large format printing. Pdf995 is a printer...",
        "extractedTextFileUrl": ""
    }
]
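
As the sample shows, `extractedText` can begin with a run of newlines and `extractedTextFileUrl` can be an empty string. The following sketch shows one way to tidy such items after downloading the dataset as JSON. The function name `clean_items` and the output keys `text`, `textFileUrl`, and `isEmpty` are hypothetical. Only `pdfUrl`, `extractedText`, and `extractedTextFileUrl` come from the sample output.

```python
def clean_items(items):
    """Normalize dataset items shaped like the sample output."""
    results = []
    for item in items:
        # Trim the leading/trailing whitespace left by extraction.
        text = (item.get("extractedText") or "").strip()
        results.append({
            "pdfUrl": item.get("pdfUrl"),
            "text": text,
            # Treat an empty URL string as "no file available".
            "textFileUrl": item.get("extractedTextFileUrl") or None,
            # Flag items with no text, e.g. scanned PDFs without a text layer.
            "isEmpty": not text,
        })
    return results


sample = [{
    "pdfUrl": "http://www.pdf995.com/samples/pdf.pdf",
    "extractedText": "\n\n\nThe pdf995 suite of products",
    "extractedTextFileUrl": "",
}]
for row in clean_items(sample):
    print(row["pdfUrl"], len(row["text"]), row["textFileUrl"])
```
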
Developer
Maintained by Community
Actor metrics
  • 18 monthly users
  • 0 stars
  • 93.2% runs succeeded
  • 5 hours response time
  • Created in Apr 2023
  • Modified about 2 months ago