Pricing

Pay per usage

PDF Text Extractor

Developed by

Jiří Moravčík

PDF Text Extractor allows you to extract text from PDF files. It also supports chunking of the text to prepare the data for usage with large language models.

5.0 (1)

Issues response

1.4 days

Last modified

2 months ago

Integrations

Automation

Data extraction fails at the start of the run: Failed to load document (PDFium: Data format error)

Closed

insomniac_boy opened this issue

2024-01-11T09:09:31.163Z ERROR Actor failed with an exception
2024-01-11T09:09:31.165Z Traceback (most recent call last):
2024-01-11T09:09:31.166Z   File "/usr/src/app/src/main.py", line 14, in main
2024-01-11T09:09:31.167Z     pdf_document = pdfium.PdfDocument(io.BytesIO(pdf.content))
2024-01-11T09:09:31.167Z                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-11T09:09:31.167Z   File "/usr/local/lib/python3.11/site-packages/pypdfium2/_helpers/document.py", line 77, in __init__
2024-01-11T09:09:31.168Z     self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
2024-01-11T09:09:31.168Z                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-11T09:09:31.169Z   File "/usr/local/lib/python3.11/site-packages/pypdfium2/_helpers/document.py", line 744, in _open_pdf
2024-01-11T09:09:31.169Z     raise PdfiumError(f"Failed to load document (PDFium: {pdfium_i.ErrorToStr.get(err_code)}).")
2024-01-11T09:09:31.169Z pypdfium2._helpers.misc.PdfiumError: Failed to load document (PDFium: Data format error).

Jiří Moravčík (jirimoravcik)

Hello, I've updated the Actor. It should now tell you which file is causing the problem and continue processing the files that are not problematic.
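The skip-and-report behaviour described in this reply can be sketched roughly as follows. This is not the Actor's actual code: `process_files` and `demo_load` are hypothetical names, and `demo_load` is a stand-in for the real loader, which according to the traceback above is `pdfium.PdfDocument(io.BytesIO(pdf.content))` raising `PdfiumError`.

```python
import logging
from typing import Callable, Dict, List, Tuple

logger = logging.getLogger("pdf_extractor_sketch")


def process_files(
    files: Dict[str, bytes],
    load: Callable[[bytes], str],
) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Run `load` on every file and collect failures instead of aborting the run."""
    extracted: Dict[str, str] = {}
    failed: List[Tuple[str, str]] = []
    for name, data in files.items():
        try:
            extracted[name] = load(data)
        except Exception as exc:  # assumption: the Actor would catch PdfiumError here
            logger.warning("Skipping %s: %s", name, exc)
            failed.append((name, str(exc)))
    return extracted, failed


def demo_load(data: bytes) -> str:
    """Hypothetical stand-in loader: rejects anything without a PDF header."""
    if not data.startswith(b"%PDF-"):
        raise ValueError("Failed to load document (Data format error)")
    return f"{len(data)} bytes loaded"


extracted, failed = process_files(
    {"good.pdf": b"%PDF-1.7 example", "bad.pdf": b"<html>Not found</html>"},
    demo_load,
)
```

With the two sample inputs, `extracted` holds only `good.pdf`, and `failed` records `bad.pdf` together with its error message, so one bad file no longer stops the whole run.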

insomniac_boy

It helped; however, it now skips a large share of the files. From the logs, I came to the conclusion that this error may be caused by an encoding-related bug: these PDFs contain Cyrillic, Greek and Roman letters.

Jiří Moravčík (jirimoravcik)

Yeah, it's crashing in the underlying library, so there's not much I can do here. Sorry about that.
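One way to narrow this down before blaming text encoding, offered as an assumption rather than a confirmed diagnosis: PDFium reports a format error when the input is not in PDF format or is corrupted, so it may be worth checking whether the downloaded bytes look like a PDF at all. The helper below is a hypothetical sketch (the name `classify_payload` and the 1024-byte search window are illustrative choices, not part of the Actor or of pypdfium2).

```python
def classify_payload(data: bytes) -> str:
    """Roughly classify downloaded bytes before handing them to a PDF library."""
    if not data:
        return "empty"
    head = data[:1024]
    # The "%PDF-" marker normally sits at the very start of the file; searching
    # the first 1024 bytes allows for some leading junk.
    if b"%PDF-" in head:
        return "pdf"
    # An HTML body usually means the URL returned an error or login page.
    if head.lstrip().lower().startswith((b"<!doctype html", b"<html")):
        return "html"
    return "unknown"
```

Logging this label next to each skipped URL would show whether the skipped files are real PDFs that the library rejects, or responses that were never PDFs in the first place.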

PDF Scraper

onidivo/pdf-scraper

Scrape and extract text from PDF links.

Onidivo Technologies

368

PDF Extractor 2.0

jupri/pdf-extractor-2-0

💫 Extract PDF Document Contents including Metadata, Images, Pages, Tables, Attachments, etc.

cat

PDF Text Extractor

sami_apify/PDF-Text-Extractor

This actor downloads PDFs from provided URLs, extracts text content from them, and saves the extracted data into an Apify dataset. It’s ideal for scraping and processing PDFs available online.

sami

HTML to PDF converter

apify/html-to-pdf-converter

Convert HTML string to A4 PDF.

Apify

4.3

HTML to PDF Converter

jancurn/url-to-pdf

Loads a web page in headless Chrome using Puppeteer and prints it to PDF. The input is a JSON object and output is a PDF file.

Jan Čurn

469

Website To PDF Converter

louisdeconinck/website-to-pdf-converter

Convert websites to high-quality PDF documents with customizable options. This powerful actor allows you to transform website pages with both static HTML and dynamic content into professional-grade PDFs, offering a wide range of customization features such as page format, orientation, margins, …

Louis Deconinck

5.0

Markdown Converter

jindrich.bar/markdown-converter

A simple Actor for converting pdf / doc / docx files to Markdown.

Jindřich Bär

HTML string to PDF

mhamas/html-string-to-pdf

Convert HTML string to A4 PDF.

Matej Hamas

Google Slides Replacer

kamil.stus/google-slides-replacer

Automate the creation of Google Slides presentations from a template, with support for dynamic text replacement.

Kamil Štus

HTML to PDF Converter Pro 🔄

powerful_bachelor/html-to-pdf-converter-pro

🔄 Convert web pages to high-quality PDFs with special canvas element handling! Perfect for 📄 documentation, 🖨️ printing, and 🔒 archiving. Features include batch processing and flexible page settings. Transform your web content into professional PDFs! 🚀