Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

TypeError: Cannot read properties of undefined (reading 'content-type')

Open

sevcik opened this issue

When downloading PDF files from windows.net, I get a TypeError. The response headers look OK.

sevcik

2024-11-21T09:08:59.122Z WARN  HttpCrawler: Reclaiming failed request back to the list or queue. TypeError: Cannot read properties of undefined (reading 'content-type')
2024-11-21T09:08:59.124Z     at file:///home/myuser/dist/file-download.js:140:51
2024-11-21T09:08:59.126Z     at new Promise (<anonymous>)
2024-11-21T09:08:59.128Z     at HttpCrawler.requestHandler (file:///home/myuser/dist/file-download.js:123:41)
2024-11-21T09:08:59.130Z     at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
2024-11-21T09:08:59.132Z     at async wrap (/home/myuser/node_modules/@apify/timeout/cjs/index.cjs:54:21) {"id":"eX5pESR6RMJtmcI","url":"https://detskydiabetes.blob.core.windows.net/cms/ContentItems/252_00252/eq6D45/program-diakongres.pdf","retryCount":1}

sevcik

$ curl -I https://detskydiabetes.blob.core.windows.net/cms/ContentItems/252_00252/eq6D45/program-diakongres.pdf
HTTP/1.1 200 OK
Content-Length: 11260737
Content-Type: application/pdf
Content-MD5: xPF6w8as7VkDeLwElY7lVQ==
Last-Modified: Thu, 03 Nov 2022 13:11:07 GMT
ETag: 0x8DABD9CDEB95106
Server: Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
x-ms-request-id: 1647d38d-b01e-0054-01f8-3b64eb000000
x-ms-version: 2009-09-19
x-ms-lease-status: unlocked
x-ms-blob-type: BlockBlob
Access-Control-Expose-Headers: x-ms-request-id,Server,x-ms-version,Content-Type,Last-Modified,ETag,Content-MD5,x-ms-lease-status,x-ms-blob-type,Content-Length,Date,Transfer-Encoding
Access-Control-Allow-Origin: *
Date: Thu, 21 Nov 2024 09:32:25 GMT
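
The error message indicates that 'content-type' was read from an object that was undefined, even though the server does send a valid Content-Type header. The Actor's source is not shown in this thread, so the following is only a minimal sketch of a defensive lookup. It assumes a response object shaped like Node's http.IncomingMessage (lowercase header keys); the helper name getContentType is hypothetical and not taken from file-download.js.

```javascript
// Hypothetical helper; NOT the Actor's actual code.
// Returns the lowercased media type without parameters, or undefined
// when the response or its headers object is missing.
function getContentType(response) {
    // Optional chaining avoids "Cannot read properties of undefined".
    const raw = response?.headers?.['content-type'];
    if (typeof raw !== 'string') return undefined;
    // Drop parameters such as "; charset=utf-8".
    return raw.split(';')[0].trim().toLowerCase();
}

// A response with headers, as in the curl output above.
console.log(getContentType({ headers: { 'content-type': 'application/pdf' } }));
// A missing response no longer throws.
console.log(getContentType(undefined));
```

With a guard like this, the caller can decide whether to retry or skip the file when the media type is unknown, instead of failing with a TypeError.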

Dušan Vystrčil (dusan.vystrcil)

Hi, I'm sorry for your troubles. Seems like an issue on our side. Our team is already working on that and I'll let you know as soon as it's resolved.

sci

Any idea when this will be resolved? I am running into the same issue.

Jiří Spilka (jiri.spilka)

Hi,

I was trying to find a run with the same issues, but I noticed a different problem. In your case, it seems there’s an issue with the startURLs.

Could you please create a new issue for this? In the meantime, I’ll continue investigating to figure out what’s happening.

Jiří Spilka (jiri.spilka)

@sci Regarding your issue: I’m not sure why you’re using crawlerType = jsdom (please note that it’s experimental and should be used at your own risk).

I’ve changed the setting to "crawlerType": "cheerio", and it’s running fine. Alternatively, you can use the default "playwright:adaptive", which also works well.
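
For reference, the suggested change could be expressed as Actor input along these lines. This is a sketch under assumptions: the crawlerType values are the three named in this thread, while the startUrls field name and its [{ url }] shape are assumed from the Actor's usual input format and should be checked against its input schema. The buildInput helper is hypothetical.

```javascript
// Crawler types mentioned in this thread; the Actor may support others.
const CRAWLER_TYPES = ['playwright:adaptive', 'cheerio', 'jsdom'];

// Hypothetical helper that builds an input object for the Actor.
// Assumption: the input uses `startUrls` as an array of { url } objects.
function buildInput(urls, crawlerType = 'playwright:adaptive') {
    if (!CRAWLER_TYPES.includes(crawlerType)) {
        throw new Error(`Unknown crawlerType: ${crawlerType}`);
    }
    return {
        startUrls: urls.map((url) => ({ url })),
        crawlerType,
    };
}

const input = buildInput(
    ['https://detskydiabetes.blob.core.windows.net/cms/ContentItems/252_00252/eq6D45/program-diakongres.pdf'],
    'cheerio',
);
console.log(JSON.stringify(input, null, 2));
```

The resulting object can be pasted into the Actor's JSON input editor or passed to an API client when starting a run.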
