PDF Text Extractor

Pricing

Pay per usage

jirimoravcik/pdf-text-extractor

Developed by

Jiří Moravčík

Maintained by Community

PDF Text Extractor allows you to extract text from PDF files. It also supports chunking of the text to prepare the data for usage with large language models.

5.0 (1)

28

Monthly users: 53
Runs succeeded: >99%
Response time: 12 hours
Last modified: 6 months ago

Data extraction fails at the very start with: Failed to load document (PDFium: Data format error)

Closed
insomniac_boy opened this issue
a year ago

2024-01-11T09:09:31.163Z ERROR Actor failed with an exception
2024-01-11T09:09:31.165Z Traceback (most recent call last):
2024-01-11T09:09:31.166Z   File "/usr/src/app/src/main.py", line 14, in main
2024-01-11T09:09:31.167Z     pdf_document = pdfium.PdfDocument(io.BytesIO(pdf.content))
2024-01-11T09:09:31.167Z                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-11T09:09:31.167Z   File "/usr/local/lib/python3.11/site-packages/pypdfium2/_helpers/document.py", line 77, in __init__
2024-01-11T09:09:31.168Z     self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
2024-01-11T09:09:31.168Z                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-11T09:09:31.169Z   File "/usr/local/lib/python3.11/site-packages/pypdfium2/_helpers/document.py", line 744, in _open_pdf
2024-01-11T09:09:31.169Z     raise PdfiumError(f"Failed to load document (PDFium: {pdfium_i.ErrorToStr.get(err_code)}).")
2024-01-11T09:09:31.169Z pypdfium2._helpers.misc.PdfiumError: Failed to load document (PDFium: Data format error).

jirimoravcik

Hello, I've updated the Actor. It should now tell you which file is causing the problem and will keep processing the files that are not problematic.
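
The behaviour described in this reply (report the failing file, keep going with the rest) can be sketched roughly as follows. This is not the Actor's actual code, which is not shown in this thread. The names `load_documents` and `open_with_pdfium` are hypothetical, and the only library call used is the `pdfium.PdfDocument(io.BytesIO(...))` call visible in the traceback above.

```python
import io
import logging

logger = logging.getLogger(__name__)


def open_with_pdfium(data: bytes):
    """Open PDF bytes the same way the traceback above does.

    pypdfium2 is imported lazily so the sketch can be loaded without it.
    """
    import pypdfium2 as pdfium

    return pdfium.PdfDocument(io.BytesIO(data))


def load_documents(files, open_pdf=open_with_pdfium):
    """Try to open every (name, bytes) pair; never abort on a bad file.

    Returns two dicts: successfully opened documents keyed by name,
    and error messages keyed by the name of each file that failed.
    """
    loaded = {}
    failed = {}
    for name, data in files:
        try:
            loaded[name] = open_pdf(data)
        except Exception as exc:
            # The traceback shows pypdfium2 raising PdfiumError here;
            # catching Exception keeps this sketch library-agnostic.
            logger.warning("Skipping %s: %s", name, exc)
            failed[name] = str(exc)
    return loaded, failed
```

Passing the opener as a parameter is a design choice of this sketch only: it lets the skip-and-continue logic be exercised with a stand-in opener, without needing real PDF files.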

insomniac_boy

a year ago

It helped; however, it now skips a large share of the files. From the logs, I concluded that this error may be caused by an encoding-related bug: these PDFs contain Cyrillic, Greek and Roman letters.

jirimoravcik

Yeah, it's crashing in the underlying library, so there's not much I can do here. Sorry about that.
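
PDFium's "Data format error" corresponds to its FPDF_ERR_FORMAT code, which the PDFium headers describe as a file that is not in PDF format or is corrupted. One cheap check before suspecting text encoding is whether the downloaded bytes carry a PDF header at all. The sketch below is a diagnostic under stated assumptions: the helper name `looks_like_pdf` is hypothetical, the 1024-byte search window mirrors the leniency of common PDF readers rather than anything stated in this thread, and nothing here confirms that a missing header is what causes the skipped files.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if a PDF header marker appears near the start of the data.

    A PDF file begins with a "%PDF-" header. Some readers tolerate a few
    junk bytes before it, so this sketch searches the first 1024 bytes
    (an assumed window) instead of requiring the marker at offset 0.
    """
    return b"%PDF-" in data[:1024]
```

If this returns False for a skipped file, the download is probably not a PDF at all (for example an HTML error page). If it returns True, the file may still be damaged further in, which this check cannot detect.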

Pricing

Pricing model

Pay per usage

This Actor is paid per platform usage. The Actor is free to use, and you only pay for the Apify platform usage.