Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
When downloading PDF files from windows.net, I get a TypeError. The headers look OK.
2024-11-21T09:08:59.122Z WARN HttpCrawler: Reclaiming failed request back to the list or queue. TypeError: Cannot read properties of undefined (reading 'content-type')
2024-11-21T09:08:59.124Z     at file:///home/myuser/dist/file-download.js:140:51
2024-11-21T09:08:59.126Z     at new Promise (<anonymous>)
2024-11-21T09:08:59.128Z     at HttpCrawler.requestHandler (file:///home/myuser/dist/file-download.js:123:41)
2024-11-21T09:08:59.130Z     at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
2024-11-21T09:08:59.132Z     at async wrap (/home/myuser/node_modules/@apify/timeout/cjs/index.cjs:54:21) {"id":"eX5pESR6RMJtmcI","url":"https://detskydiabetes.blob.core.windows.net/cms/ContentItems/252_00252/eq6D45/program-diakongres.pdf","retryCount":1}
$ curl -I https://detskydiabetes.blob.core.windows.net/cms/ContentItems/252_00252/eq6D45/program-diakongres.pdf
HTTP/1.1 200 OK
Content-Length: 11260737
Content-Type: application/pdf
Content-MD5: xPF6w8as7VkDeLwElY7lVQ==
Last-Modified: Thu, 03 Nov 2022 13:11:07 GMT
ETag: 0x8DABD9CDEB95106
Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
x-ms-request-id: 1647d38d-b01e-0054-01f8-3b64eb000000
x-ms-version: 2009-09-19
x-ms-lease-status: unlocked
x-ms-blob-type: BlockBlob
Access-Control-Expose-Headers: x-ms-request-id,Server,x-ms-version,Content-Type,Last-Modified,ETag,Content-MD5,x-ms-lease-status,x-ms-blob-type,Content-Length,Date,Transfer-Encoding
Access-Control-Allow-Origin: *
Date: Thu, 21 Nov 2024 09:32:25 GMT
Hi, I'm sorry for your troubles. Seems like an issue on our side. Our team is already working on that and I'll let you know as soon as it's resolved.
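The error message indicates that something read 'content-type' from an object that was undefined, most likely a missing headers object on a response. The Actor's real file-download.js is not shown in this thread, so the following is only a minimal sketch of a defensive header lookup. The helper name getContentType and the shape of the response object are assumptions made for illustration.

```javascript
// Hypothetical helper, NOT the Actor's actual file-download.js code.
// Assumes `response` may be undefined or may lack a `headers` object.
function getContentType(response) {
  // Optional chaining avoids "Cannot read properties of undefined".
  const headers = response?.headers;
  if (!headers) return undefined;

  // Node.js lowercases incoming header names, so look up the lowercase key.
  const raw = headers['content-type'];
  if (typeof raw !== 'string') return undefined;

  // Drop parameters such as "; charset=utf-8" and normalize the case.
  return raw.split(';')[0].trim().toLowerCase();
}

console.log(getContentType({ headers: { 'content-type': 'application/pdf' } })); // application/pdf
console.log(getContentType({}));        // undefined
console.log(getContentType(undefined)); // undefined
```

With a guard like this, a response without headers yields undefined instead of throwing, and the caller can decide whether to retry or skip the file.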
Any idea when this will be resolved? I am running into the same issue.
Hi,
I was trying to find a run with the same issue, but I noticed a different problem.
In your case, it seems there’s an issue with the startURLs.
Could you please create a new issue for this? In the meantime, I’ll continue investigating to figure out what’s happening.
@sci Regarding your issue:
I’m not sure why you’re using crawlerType = jsdom (please note that it’s experimental and should be used at your own risk).
I’ve updated the crawlerType to "crawlerType": "cheerio", and it’s running fine. Alternatively, you can use the default "playwright:adaptive", which also works well.
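As a rough illustration of the change described above, the sketch below builds an Actor input object with an explicit crawlerType. The helper buildInput and the KNOWN_CRAWLER_TYPES list are hypothetical. The list contains only the values named in this thread, and the startUrls field shape is an assumption about the Actor's input schema.

```javascript
// Crawler types mentioned in this thread; the Actor may accept others.
const KNOWN_CRAWLER_TYPES = ['playwright:adaptive', 'cheerio', 'jsdom'];

// Hypothetical helper that assembles an input object for the Actor.
// The `startUrls` shape is an assumption about the input schema.
function buildInput(startUrl, crawlerType = 'playwright:adaptive') {
  if (!KNOWN_CRAWLER_TYPES.includes(crawlerType)) {
    throw new Error(`Unknown crawlerType: ${crawlerType}`);
  }
  if (crawlerType === 'jsdom') {
    // Per the reply above, jsdom is experimental.
    console.warn('jsdom is experimental; consider cheerio or playwright:adaptive.');
  }
  return { startUrls: [{ url: startUrl }], crawlerType };
}

console.log(JSON.stringify(buildInput('https://example.com', 'cheerio'), null, 2));
```

Omitting the second argument falls back to "playwright:adaptive", the default mentioned in the reply.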