Pricing

$30.00/month + usage

PDF Extractor 2.0

Developed by

cat

💫 Extract PDF Document Contents including Metadata, Images, Pages, Tables, Attachments, etc.

Runs succeeded

96%

Last modified

5 months ago

Automation

Developer tools

Welcome to PDF Extractor

🍂 About PDF Format

Portable Document Format (PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems.[2][3] Based on the PostScript language, each PDF file encapsulates a complete description of a fixed-layout flat document, including the text, fonts, vector graphics, raster images, and other information needed to display it. PDF has its roots in "The Camelot Project", initiated by Adobe co-founder John Warnock in 1991.[4] PDF was standardized as ISO 32000 in 2008.[5] The latest edition, ISO 32000-2:2020, was published in December 2020.

🍂 About This Actor

💫 Extract contents from PDF documents

Features:

⭐ Extract PDF pages as text or images (SVG, PNG, JPEG).
⭐ Extract PDF metadata.
⭐ Extract the PDF table of contents.
⭐ Extract PDF tables.
⭐ Extract encrypted (password-protected) PDFs.
⭐ Extract embedded images.
⭐ Extract attachments.
⭐ Extract from multiple PDF URLs in a single run.

🍂 Tutorial

Input Parameters

Name	Type	Description
`url`	Array `[String]`	List of PDF document URLs
`content`	String	Output format for pages (`text`, `svg`, `png`, `jpg`)
`images`	Boolean `(true/false)`	Extract embedded images
`attachments`	Boolean `(true/false)`	Extract embedded files (attachments)
`tables`	Boolean `(true/false)`	Extract tables

Note: All extracted resources other than text are saved to the default key-value store.
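As a rough illustration of how the parameters in the table fit together, the sketch below assembles an input object in Python. Only the field names and the allowed `content` values come from the table; the helper `build_input`, its default values, and the rule that at least one URL is required are assumptions made for this example, not documented behaviour of the Actor.

```python
import json

# Allowed values for `content`, taken from the parameter table.
ALLOWED_CONTENT = {"text", "svg", "png", "jpg"}


def build_input(urls, content="text", images=False, attachments=False, tables=False):
    """Assemble an input object using the parameter names from the table.

    The defaults here are assumptions for illustration only.
    """
    if isinstance(urls, str):
        urls = [urls]  # `url` is documented as an array, so wrap a single string
    urls = list(urls)
    if not urls:
        # Assumption: an empty URL list is treated as an error in this sketch.
        raise ValueError("at least one PDF URL is required")
    if content not in ALLOWED_CONTENT:
        raise ValueError(f"content must be one of {sorted(ALLOWED_CONTENT)}")
    return {
        "url": urls,
        "content": content,
        "images": bool(images),
        "attachments": bool(attachments),
        "tables": bool(tables),
    }


if __name__ == "__main__":
    run_input = build_input(
        ["https://www.w3.org/WAI/WCAG21/working-examples/pdf-table/table.pdf"],
        content="text",
        tables=True,
    )
    # The resulting JSON could be pasted into the Actor's input editor.
    print(json.dumps(run_input, indent=2))
```

With `content` set to `svg`, `png`, or `jpg`, the page output would presumably land in the key-value store rather than the dataset, in line with the note above.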

Dataset Output Format:

[
  # URL-1: Metadata
  { "metadata": { "headers": { ... }, "url": "...", "mime": "..." } },
  # URL-1: Page Contents
  { "index": 0, "content": "...page-0 contents...", "images": [...], "tables": [...] },
  { "index": 1, "content": "...page-1 contents...", "images": [...], "tables": [...] },
  ...
  # URL-2: Metadata
  { "metadata": { "headers": { ... }, "url": "...", "mime": "..." } },
  # URL-2: Page Contents
  { "index": 0, "content": "...page-0 contents...", "images": [...], "tables": [...] },
  { "index": 1, "content": "...page-1 contents...", "images": [...], "tables": [...] },
  ...
]
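Because the dataset is a flat list in which each metadata item is followed by that document's page items, a consumer usually needs to regroup the items per document. The following is a minimal sketch of one way to do that in Python. It assumes the items arrive in the order shown above and have already been loaded as dictionaries; the function name `group_by_document` and the `pages` key are hypothetical and not part of the Actor's output.

```python
def group_by_document(items):
    """Group a flat list of dataset items into one record per PDF.

    Assumes each document starts with an item holding a "metadata" key,
    followed by its page items, each holding an "index" key.
    """
    documents = []
    current = None
    for item in items:
        if "metadata" in item:
            # A metadata item marks the start of a new document.
            current = {"metadata": item["metadata"], "pages": []}
            documents.append(current)
        elif "index" in item:
            if current is None:
                raise ValueError("page item appeared before any metadata item")
            current["pages"].append(item)
        # Items of any other shape are ignored in this sketch.
    return documents


if __name__ == "__main__":
    # Made-up sample items that mimic the documented layout.
    sample = [
        {"metadata": {"headers": {}, "url": "https://example.com/a.pdf", "mime": "application/pdf"}},
        {"index": 0, "content": "first page of a", "images": [], "tables": []},
        {"index": 1, "content": "second page of a", "images": [], "tables": []},
        {"metadata": {"headers": {}, "url": "https://example.com/b.pdf", "mime": "application/pdf"}},
        {"index": 0, "content": "first page of b", "images": [], "tables": []},
    ]
    for doc in group_by_document(sample):
        print(doc["metadata"]["url"], len(doc["pages"]))
```

Grouping on the `metadata` marker rather than on page counts keeps the sketch independent of how many pages each PDF has.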

🍂 Output Samples

PDF Sample #1

URL: https://www.w3.org/WAI/WCAG21/working-examples/pdf-table/table.pdf

{
}

PDF Sample #2

URL: https://apify.com/img/web-scraping/beginners-guide-to-web-scraping.pdf

{
}

✏️ Support

⚡️ Feel free to reach out to the developer for any issues or suggestions for improvement.

Pricing

Pricing model

Rental

To use this Actor, you have to pay a monthly rental fee to the developer. The rent is subtracted from your prepaid usage every month after the free trial period. You also pay for the Apify platform usage.

Free trial

7 days

Price

$30.00

PDF Scraper

onidivo/pdf-scraper

Scrape and extract text from PDF links.

Onidivo Technologies

311

PDF Text Extractor

jirimoravcik/pdf-text-extractor

PDF Text Extractor allows you to extract text from PDF files. It also supports chunking of the text to prepare the data for usage with large language models.

Jiří Moravčík

557

PDF Text Extractor

sami_apify/PDF-Text-Extractor

This actor downloads PDFs from provided URLs, extracts text content from them, and saves the extracted data into an Apify dataset. It’s ideal for scraping and processing PDFs available online.

sami

A4 PDF Generator from HTML

dainty_screw/a4-pdf-generator-from-html

Convert any HTML string into a neatly formatted A4-sized PDF. Perfect for quick documentation and reports

codemaster devops

Markdown Converter

jindrich.bar/markdown-converter

A simple Actor for converting pdf / doc / docx files to Markdown.

Jindřich Bär

HTML to PDF Converter

jancurn/url-to-pdf

Loads a web page in headless Chrome using Puppeteer and prints it to PDF. The input is a JSON object and output is a PDF file.

Jan Čurn

442

HTML to PDF converter

apify/html-to-pdf-converter

Convert HTML string to A4 PDF.

Apify

Website To PDF Converter

louisdeconinck/website-to-pdf-converter

Convert websites to high-quality PDF documents with customizable options. This powerful actor allows you to transform website pages with both static HTML and dynamic content into professional-grade PDFs, offering a wide range of customization features such as page format, orientation, margins, …

Louis Deconinck

HTML string to PDF

mhamas/html-string-to-pdf

Convert HTML string to A4 PDF.

Matej Hamas

Docling

vancura/docling

Docling Document Parser & Converter – Convert documents into structured data without complexity. This Actor leverages the powerful Docling library to parse and transform various document formats into clean, structured outputs ready for analysis or integration.