
PDF Text Extractor
PDF Text Extractor allows you to extract text from PDF files. It also supports chunking of the text to prepare the data for usage with large language models.
Extraction fails at the very start of the run: Failed to load document (PDFium: Data format error)
2024-01-11T09:09:31.163Z ERROR Actor failed with an exception
2024-01-11T09:09:31.165Z Traceback (most recent call last):
2024-01-11T09:09:31.166Z   File "/usr/src/app/src/main.py", line 14, in main
2024-01-11T09:09:31.167Z     pdf_document = pdfium.PdfDocument(io.BytesIO(pdf.content))
2024-01-11T09:09:31.167Z                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-11T09:09:31.167Z   File "/usr/local/lib/python3.11/site-packages/pypdfium2/_helpers/document.py", line 77, in __init__
2024-01-11T09:09:31.168Z     self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
2024-01-11T09:09:31.168Z                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-11T09:09:31.169Z   File "/usr/local/lib/python3.11/site-packages/pypdfium2/_helpers/document.py", line 744, in _open_pdf
2024-01-11T09:09:31.169Z     raise PdfiumError(f"Failed to load document (PDFium: {pdfium_i.ErrorToStr.get(err_code)}).")
2024-01-11T09:09:31.169Z pypdfium2._helpers.misc.PdfiumError: Failed to load document (PDFium: Data format error).

Hello, I've updated the Actor. It should now tell you which file causes the problem and continue working with the files that are not problematic.
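For readers curious what "report the problem and keep going" can look like, here is a minimal sketch of that pattern. It is not the Actor's actual code: `extract_all` and its `load` parameter are hypothetical names, and `load` stands in for whatever opens a document (in the traceback above that would be `pdfium.PdfDocument(io.BytesIO(pdf.content))`).

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def extract_all(
    sources: Iterable[Tuple[str, bytes]],
    load: Callable[[bytes], Any],
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Try to load every document; collect failures instead of aborting.

    `sources` yields (name, raw_bytes) pairs. `load` is a hypothetical
    stand-in for the real PDF loader and may raise on bad input.
    """
    results: Dict[str, Any] = {}
    failures: Dict[str, str] = {}
    for name, data in sources:
        try:
            results[name] = load(data)
        except Exception as exc:  # the real loader raises PdfiumError here
            # Record which file failed and why, then move on to the next one.
            failures[name] = f"{type(exc).__name__}: {exc}"
    return results, failures
```

Returning the failures alongside the results makes it easy to log exactly which files were skipped, rather than discovering the gap afterwards.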
insomniac_boy
It helped; however, it now skips a large share of the files. From the logs, I came to the conclusion that this error may be caused by an encoding-related bug: these PDFs contain Cyrillic, Greek, and Roman letters.

Yeah, it's crashing in the underlying library, so there's not much I can do here. Sorry about that.
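One thing that may help narrow down the skipped files: PDFium reports "Data format error" when the input is not in PDF format or is corrupted, so it can be worth checking whether the downloaded bytes are a PDF at all (for example, not an HTML error page). The sketch below is only a rough pre-check and neither confirms nor rules out the encoding theory. `looks_like_pdf` is a hypothetical helper, and the 1024-byte search window is an assumption based on how lenient PDF readers commonly treat leading bytes.

```python
def looks_like_pdf(data: bytes, window: int = 1024) -> bool:
    """Return True if the PDF header marker appears near the start of `data`.

    A well-formed PDF begins with a header such as b"%PDF-1.7". Some files
    carry a few junk bytes first, so we search a small leading window
    (assumed size) instead of requiring the marker at offset 0.
    """
    return b"%PDF-" in data[:window]


if __name__ == "__main__":
    samples = {
        "real.pdf": b"%PDF-1.4\n% rest of file",
        "error_page.pdf": b"<!DOCTYPE html><html><body>404</body></html>",
    }
    for name, data in samples.items():
        print(name, "->", "PDF header found" if looks_like_pdf(data) else "no PDF header")
```

Files that fail this check were never valid PDF input; files that pass but still fail to load are the ones where corruption inside the document is more likely.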
Pricing
Pricing model: Pay per usage
This Actor is paid per platform usage. The Actor is free to use, and you only pay for the Apify platform usage.