PDF Text Extractor

jirimoravcik/pdf-text-extractor

PDF Text Extractor extracts text from PDF files. It can also split the text into chunks to prepare it for use with large language models.

Extraction run failed at the start: Failed to load document (PDFium: Data format error)

Closed

insomniac_boy opened this issue
10 months ago

2024-01-11T09:09:31.163Z ERROR Actor failed with an exception
2024-01-11T09:09:31.165Z Traceback (most recent call last):
2024-01-11T09:09:31.166Z   File "/usr/src/app/src/main.py", line 14, in main
2024-01-11T09:09:31.167Z     pdf_document = pdfium.PdfDocument(io.BytesIO(pdf.content))
2024-01-11T09:09:31.167Z                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-11T09:09:31.167Z   File "/usr/local/lib/python3.11/site-packages/pypdfium2/_helpers/document.py", line 77, in __init__
2024-01-11T09:09:31.168Z     self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
2024-01-11T09:09:31.168Z                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-11T09:09:31.169Z   File "/usr/local/lib/python3.11/site-packages/pypdfium2/_helpers/document.py", line 744, in _open_pdf
2024-01-11T09:09:31.169Z     raise PdfiumError(f"Failed to load document (PDFium: {pdfium_i.ErrorToStr.get(err_code)}).")
2024-01-11T09:09:31.169Z pypdfium2._helpers.misc.PdfiumError: Failed to load document (PDFium: Data format error).

jirimoravcik

Hello, I've updated the Actor, it should now tell you where the problem is and will continue working with files that are not problematic.
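
The Actor's updated source is not shown in this thread, so the following is only a sketch of the behaviour described above: report which file failed and keep processing the rest. The names `extract_all` and `load_with_pypdfium2` are hypothetical; the pypdfium2 calls mirror the `pdfium.PdfDocument(io.BytesIO(...))` call visible in the traceback.

```python
import io
import logging
from typing import Callable, Dict, Iterable, List, Tuple

log = logging.getLogger("pdf_text_extractor_sketch")  # hypothetical logger name


def load_with_pypdfium2(data: bytes) -> List[str]:
    """Return the text of each page; pypdfium2 raises PdfiumError if loading fails."""
    import pypdfium2 as pdfium  # lazy import so the skip logic works without the library

    pdf = pdfium.PdfDocument(io.BytesIO(data))
    try:
        pages = []
        for index in range(len(pdf)):
            page = pdf[index]
            textpage = page.get_textpage()
            pages.append(textpage.get_text_range())
            textpage.close()
            page.close()
        return pages
    finally:
        pdf.close()


def extract_all(
    files: Iterable[Tuple[str, bytes]],
    load: Callable[[bytes], List[str]] = load_with_pypdfium2,
) -> Tuple[Dict[str, List[str]], Dict[str, str]]:
    """Extract text from every file, collecting failures instead of aborting the run."""
    extracted: Dict[str, List[str]] = {}
    failed: Dict[str, str] = {}
    for name, data in files:
        try:
            extracted[name] = load(data)
        except Exception as exc:  # with pypdfium2 this is typically PdfiumError
            log.warning("Skipping %s: %s", name, exc)
            failed[name] = str(exc)
    return extracted, failed
```

Returning the failures alongside the results makes it easy to list exactly which files were skipped and why, rather than digging through the run log.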

insomniac_boy

10 months ago

That helped, but now it skips a large share of the files. From the logs, I concluded that this error may be caused by an encoding-related bug: these PDFs contain Cyrillic, Greek, and Roman letters.

jirimoravcik

Yeah, it's crashing in the underlying library, so there's not much I can do here. Sorry about that.
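
Since the failure happens while the library loads the document, one way to narrow it down is to inspect the raw bytes of the skipped files before they reach PDFium. PDFium's headers document this error code (`FPDF_ERR_FORMAT`) as "File not in PDF format or corrupted", so a check for the `%PDF-` header can separate non-PDF payloads (for example, an HTML error page returned by the download) from files that look like PDFs but still fail. This is a diagnostic sketch, not something the thread confirms as the cause; `looks_like_pdf`, `describe_payload`, and the 1024-byte search window are assumptions made for illustration.

```python
def looks_like_pdf(data: bytes, search_window: int = 1024) -> bool:
    """Return True if the PDF header marker appears near the start of the data.

    The 1024-byte window is an assumption: many readers tolerate some
    leading junk before the "%PDF-" marker.
    """
    return b"%PDF-" in data[:search_window]


def describe_payload(data: bytes) -> str:
    """Give a rough label for a downloaded payload that failed to load."""
    if not data:
        return "empty"
    if looks_like_pdf(data):
        return "pdf-header-present"
    head = data[:256].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "html"
    return "unknown"
```

If the skipped files come back as `pdf-header-present`, the problem is more likely inside the files themselves (corruption or an unusual structure) than in the download step.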

Developer
Maintained by Community
Actor metrics
  • 38 monthly users
  • 17 stars
  • 100.0% runs succeeded
  • 2.4 days response time
  • Created in Oct 2023
  • Modified about 2 months ago