Pricing

Pay per usage

PDF Text Extractor

Developed by

Jiří Moravčík

PDF Text Extractor allows you to extract text from PDF files. It also supports chunking of the text to prepare the data for usage with large language models.
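To illustrate what chunking for a language model can look like, here is a minimal sketch of fixed-size chunking with overlap. The page does not describe the Actor's chunking strategy, so the function name `chunk_text`, its parameters, and the character-based splitting are all assumptions made for illustration.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into character windows of chunk_size that overlap by `overlap`.

    Hypothetical helper: not the Actor's actual implementation.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    # Each new window starts `step` characters after the previous one,
    # so consecutive chunks share `overlap` characters of context.
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


if __name__ == "__main__":
    for chunk in chunk_text("abcdefghij", chunk_size=4, overlap=1):
        print(chunk)
```

The overlap keeps a little shared context between neighbouring chunks, which can help when a sentence straddles a boundary. Splitting on characters is the simplest choice; splitting on tokens or sentences is a common alternative.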

5.0 (1)

Total users

637

Monthly users

Runs succeeded

>99%

Issues response

21 days

Last modified

16 days ago

Data Format Error

Closed

Regscanner opened this issue

when crawling I receive the following error:

2024-02-14T06:07:31.285Z ACTOR: Pulling Docker image of build 5lFVfc3pf7JN70PcE from repository.
2024-02-14T06:07:35.343Z ACTOR: Creating Docker container.
2024-02-14T06:07:35.641Z ACTOR: Starting Docker container.
2024-02-14T06:07:37.360Z INFO Initializing actor...
2024-02-14T06:07:37.363Z INFO System info ({"apify_sdk_version": "1.1.5", "apify_client_version": "1.4.1", "python_version": "3.11.7", "os": "linux"})
2024-02-14T06:07:37.628Z --- Logging error ---
2024-02-14T06:07:37.629Z Traceback (most recent call last):
2024-02-14T06:07:37.631Z File "/usr/src/app/src/main.py", line 15, in main
2024-02-14T06:07:37.633Z pdf_document = pdfium.PdfDocument(io.BytesIO(pdf.content))
2024-02-14T06:07:37.634Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-02-14T06:07:37.636Z File "/usr/local/lib/python3.11/site-packages/pypdfium2/_helpers/document.py", line 77, in __init__
2024-02-14T06:07:37.638Z self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
2024-02-14T06:07:37.639Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-02-14T06:07:37.641Z File "/usr/local/lib/python3.11/site-packages/pypdfium2/_helpers/document.py", line 744, in _open_pdf
2024-02-14T06:07:37.642Z raise PdfiumError(f"Failed to load document (PDFium: {pdfium_i.ErrorToStr.get(err_code)}).")
2024-02-14T06:07:37.644Z pypdfium2._helpers.misc.Pdfi... [trimmed]

Jiří Moravčík (jirimoravcik)

Hi, sadly it seems that PDFium has problems parsing the PDF file you provided. Can you try some other files to see if it is caused by that specific file?
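To narrow down whether a specific file is the culprit, it can help to check what the URL actually returned before handing the bytes to PDFium. The sketch below is not the Actor's code: `looks_like_pdf`, `describe_payload`, and `try_open` are hypothetical helpers, and the 1024-byte search window is an assumption based on how lenient PDF readers locate the `%PDF-` header. Only `pdfium.PdfDocument` and `PdfiumError` come from the traceback above.

```python
import io

PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes, search_window: int = 1024) -> bool:
    """Return True if the PDF header marker appears near the start of the data."""
    return PDF_MAGIC in data[:search_window]


def describe_payload(data: bytes) -> str:
    """Give a rough, human-readable guess at what was downloaded."""
    if not data:
        return "empty response body"
    if looks_like_pdf(data):
        return "PDF header found"
    head = data[:256].lstrip().lower()
    if head.startswith((b"<!doctype html", b"<html")):
        return "HTML page, not a PDF"
    return "unknown format, not a PDF"


def try_open(data: bytes):
    """Open the document with pypdfium2, returning None instead of raising.

    Assumes pypdfium2 is installed; it is imported lazily so the checks
    above work without it.
    """
    import pypdfium2 as pdfium

    if not looks_like_pdf(data):
        return None
    try:
        return pdfium.PdfDocument(io.BytesIO(data))
    except pdfium.PdfiumError:
        # The header is present but PDFium still cannot parse the file,
        # for example because it is truncated or corrupted.
        return None
```

If `describe_payload` reports an HTML page or an empty body, the problem is likely with the download (a login wall, redirect, or error page) rather than with PDFium. If the header is present and loading still fails, the file itself is the more likely cause.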

PDF Scraper

onidivo/pdf-scraper

Scrape and extract text from PDF links.

Onidivo Technologies

350

PDF Extractor 2.0

jupri/pdf-extractor-2-0

💫 Extract PDF Document Contents including Metadata, Images, Pages, Tables, Attachments, etc.

cat

PDF Text Extractor

sami_apify/PDF-Text-Extractor

This actor downloads PDFs from provided URLs, extracts text content from them, and saves the extracted data into an Apify dataset. It’s ideal for scraping and processing PDFs available online.

sami

HTML to PDF Converter

jancurn/url-to-pdf

Loads a web page in headless Chrome using Puppeteer and prints it to PDF. The input is a JSON object and output is a PDF file.

Jan Čurn

453

HTML to PDF converter

apify/html-to-pdf-converter

Convert HTML string to A4 PDF.

Apify

4.3

HTML string to PDF

mhamas/html-string-to-pdf

Convert HTML string to A4 PDF.

Matej Hamas

Markdown Converter

jindrich.bar/markdown-converter

A simple Actor for converting pdf / doc / docx files to Markdown.

Jindřich Bär

Website To PDF Converter

louisdeconinck/website-to-pdf-converter

Convert websites to high-quality PDF documents with customizable options. This powerful actor allows you to transform website pages with both static HTML and dynamic content into professional-grade PDFs, offering a wide range of customization features such as page format, orientation, margins, …

Louis Deconinck

5.0

A4 PDF Generator from HTML

dainty_screw/a4-pdf-generator-from-html

Convert any HTML string into a neatly formatted A4-sized PDF. Perfect for quick documentation and reports.

codemaster devops

Docling

vancura/docling

Docling document parser & converter – Convert documents into structured data without complexity. This Actor leverages the powerful Docling library to parse and transform various document formats into clean, structured outputs ready for analysis or integration.