PDF Text Extractor
jirimoravcik/pdf-text-extractor
PDF Text Extractor allows you to extract text from PDF files. It also supports chunking of the text to prepare the data for usage with large language models.

Data Format Error

Closed

Regscanner opened this issue
5 months ago

When crawling, I receive the following error:

2024-02-14T06:07:31.285Z ACTOR: Pulling Docker image of build 5lFVfc3pf7JN70PcE from repository.
2024-02-14T06:07:35.343Z ACTOR: Creating Docker container.
2024-02-14T06:07:35.641Z ACTOR: Starting Docker container.
2024-02-14T06:07:37.360Z INFO Initializing actor...
2024-02-14T06:07:37.363Z INFO System info ({"apify_sdk_version": "1.1.5", "apify_client_version": "1.4.1", "python_version": "3.11.7", "os": "linux"})
2024-02-14T06:07:37.628Z --- Logging error ---
2024-02-14T06:07:37.629Z Traceback (most recent call last):
2024-02-14T06:07:37.631Z   File "/usr/src/app/src/main.py", line 15, in main
2024-02-14T06:07:37.633Z     pdf_document = pdfium.PdfDocument(io.BytesIO(pdf.content))
2024-02-14T06:07:37.634Z                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-02-14T06:07:37.636Z   File "/usr/local/lib/python3.11/site-packages/pypdfium2/_helpers/document.py", line 77, in __init__
2024-02-14T06:07:37.638Z     self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
2024-02-14T06:07:37.639Z                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-02-14T06:07:37.641Z   File "/usr/local/lib/python3.11/site-packages/pypdfium2/_helpers/document.py", line 744, in _open_pdf
2024-02-14T06:07:37.642Z     raise PdfiumError(f"Failed to load document (PDFium: {pdfium_i.ErrorToStr.get(err_code)}).")
2024-02-14T06:07:37.644Z pypdfium2._helpers.misc.PdfiumError: Failed to load document (PDFium: Data format error).
2024-02-14T06:07:37.646Z
2024-02-14T06:07:37.647Z During handling of the above exception, another exception occurred:
2024-02-14T06:07:37.649Z
2024-02-14T06:07:37.650Z Traceback (most recent call last):
2024-02-14T06:07:37.652Z   File "/usr/local/lib/python3.11/logging/__init__.py", line 1110, in emit
2024-02-14T06:07:37.654Z     msg = self.format(record)
2024-02-14T06:07:37.656Z           ^^^^^^^^^^^^^^^^^^^
2024-02-14T06:07:37.657Z   File "/usr/local/lib/python3.11/logging/__init__.py", line 953, in format
2024-02-14T06:07:37.659Z     return fmt.format(record)
2024-02-14T06:07:37.661Z            ^^^^^^^^^^^^^^^^^^
2024-02-14T06:07:37.663Z   File "/usr/local/lib/python3.11/logging/__init__.py", line 687, in format
2024-02-14T06:07:37.665Z     record.message = record.getMessage()
2024-02-14T06:07:37.672Z                      ^^^^^^^^^^^^^^^^^^^
2024-02-14T06:07:37.674Z   File "/usr/local/lib/python3.11/logging/__init__.py", line 377, in getMessage
2024-02-14T06:07:37.675Z     msg = msg % self.args
2024-02-14T06:07:37.678Z           ~~~~^~~~~~~~~~~
2024-02-14T06:07:37.680Z TypeError: not all arguments converted during string formatting
2024-02-14T06:07:37.682Z Call stack:
2024-02-14T06:07:37.683Z   File "

jirimoravcik

Hi, sadly it seems that PDFium has problems parsing the PDF file you provided. Can you try some other files to see if it is caused by that specific file?
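
One way to narrow this down before trying other files is to look at the bytes that were actually downloaded. The sketch below is not the Actor's code; it is a standalone, standard-library-only check built on two assumptions: that a "Data format error" is often caused by the URL returning something other than a PDF (for example an HTML error page), and that a genuine PDF carries the "%PDF-" marker within roughly its first 1024 bytes. The names looks_like_pdf, describe_download and the example URL are hypothetical. The final logging call also uses one %s placeholder per extra argument, since the second traceback in the log ("not all arguments converted during string formatting") is the error Python's logging module raises when a message receives arguments it has no placeholders for.

```python
import logging

logger = logging.getLogger("pdf_check")

# Marker that opens the header of a PDF file.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes, search_window: int = 1024) -> bool:
    """Return True if the PDF header marker appears near the start of the data.

    The 1024-byte window is an assumption: some readers tolerate a few
    leading bytes before the header, so we search instead of using startswith.
    """
    return PDF_MAGIC in data[:search_window]


def describe_download(url: str, data: bytes) -> str:
    """Build a short diagnostic for a downloaded body that is about to be parsed."""
    if looks_like_pdf(data):
        return f"{url}: PDF header found ({len(data)} bytes)"
    # latin-1 maps every byte to a character, so this decode cannot fail.
    preview = data[:40].decode("latin-1")
    return f"{url}: no PDF header, body starts with {preview!r}"


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Hypothetical response body: an HTML page served where a PDF was expected.
    html_body = b"<!DOCTYPE html><html><body>Access denied</body></html>"
    # One %s placeholder per extra argument keeps logging from raising
    # "not all arguments converted during string formatting".
    logger.error(
        "Skipping download: %s",
        describe_download("https://example.com/file.pdf", html_body),
    )
```

If the check reports no PDF header, the problem lies with what the server returned rather than with PDFium; if the header is present and parsing still fails, the file itself is likely damaged or uses a structure PDFium cannot read.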
