PDF Text Extractor
PDF Text Extractor allows you to extract text from PDF files. It also supports chunking of the text to prepare the data for usage with large language models.
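The page does not say how the Actor chunks text, so the following is only a minimal sketch of one common approach: fixed-size character windows with a small overlap. The function name `chunk_text` and the parameters `chunk_size` and `overlap` are hypothetical and are not taken from the Actor's input schema.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share `overlap` characters so that a sentence cut
    at a boundary still appears whole in one of the two chunks.
    This is an illustrative strategy, not the Actor's documented behavior.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Character-based windows are the simplest option; a token-based or sentence-aware splitter may suit a particular language model better.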
2024-01-11T09:09:31.163Z ERROR Actor failed with an exception
2024-01-11T09:09:31.165Z Traceback (most recent call last):
2024-01-11T09:09:31.166Z   File "/usr/src/app/src/main.py", line 14, in main
2024-01-11T09:09:31.167Z     pdf_document = pdfium.PdfDocument(io.BytesIO(pdf.content))
2024-01-11T09:09:31.167Z                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-11T09:09:31.167Z   File "/usr/local/lib/python3.11/site-packages/pypdfium2/_helpers/document.py", line 77, in __init__
2024-01-11T09:09:31.168Z     self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
2024-01-11T09:09:31.168Z                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-11T09:09:31.169Z   File "/usr/local/lib/python3.11/site-packages/pypdfium2/_helpers/document.py", line 744, in _open_pdf
2024-01-11T09:09:31.169Z     raise PdfiumError(f"Failed to load document (PDFium: {pdfium_i.ErrorToStr.get(err_code)}).")
2024-01-11T09:09:31.169Z pypdfium2._helpers.misc.PdfiumError: Failed to load document (PDFium: Data format error).
Hello, I've updated the Actor. It should now tell you where the problem is and continue working with the files that are not problematic.
It helped; however, it now skips a big chunk of the files. From the logs, I came to the conclusion that this error may be caused by an encoding-related bug: these PDFs contain Cyrillic, Greek, and Roman letters.
Yeah, it's crashing in the underlying library, so there's not much I can do here. Sorry about that.
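The behavior described in the thread (report which file fails and keep going with the rest) can be sketched roughly as below. This is an assumption about the general shape of such a fix, not the Actor's actual source: the names `extract_with_pdfium` and `extract_all` are hypothetical, and the extractor assumes pypdfium2 version 4, where `PdfTextPage.get_text_range()` is available.

```python
import io
import logging
from typing import Callable, Iterable

logger = logging.getLogger("pdf_text_extractor_sketch")


def extract_with_pdfium(data: bytes) -> str:
    """Extract text from one PDF with pypdfium2 (third-party, assumed v4)."""
    import pypdfium2 as pdfium  # imported lazily so the loop below has no hard dependency

    pdf = pdfium.PdfDocument(io.BytesIO(data))
    try:
        parts = []
        for index in range(len(pdf)):
            textpage = pdf[index].get_textpage()
            parts.append(textpage.get_text_range())
        return "\n".join(parts)
    finally:
        pdf.close()


def extract_all(
    documents: Iterable[tuple[str, bytes]],
    extract: Callable[[bytes], str] = extract_with_pdfium,
) -> tuple[dict[str, str], dict[str, str]]:
    """Run `extract` on every (name, bytes) pair, skipping files that fail.

    Returns (texts, failures): extracted text per file name, and the error
    message per file name for every file that could not be loaded.
    """
    texts: dict[str, str] = {}
    failures: dict[str, str] = {}
    for name, data in documents:
        try:
            texts[name] = extract(data)
        except Exception as exc:  # broad on purpose: one bad file must not stop the run
            logger.warning("Skipping %s: %s", name, exc)
            failures[name] = str(exc)
    return texts, failures
```

Returning the failures alongside the texts makes it easy to list the skipped files afterwards and inspect them separately, for example to check whether they are really valid PDFs.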
- 38 monthly users
- 17 stars
- 100.0% runs succeeded
- 2.4 days response time
- Created in Oct 2023
- Modified about 2 months ago