Website Content Crawler

apify/website-content-crawler

Developed and maintained by Apify

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

Rating: 4.6 (38)

Pricing: Pay per usage

Monthly users: 1.1k

Runs: 6k

Runs succeeded: >99%

Response time: 2.3 days

Last modified: 7 days ago

Crawling fails when URLs respond with application/pdf

Closed · jfmatt opened this issue a month ago

I have a number of use cases where links go to PDF documents, which should use the raw HTTP downloader, but where the URLs do not end in .pdf. In the linked example run, the URLs end in "pdf" but have no apparent file extension. When this happens, the crawl fails: it looks like the crawler attempts a browser-based crawl and rejects these pages.

Ideally, the decision of which crawler to use would be based on the MIME type of the response after fetching. Failing that, is there any way to configure different heuristics for how these URLs are handled?
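The heuristic being proposed could be sketched roughly as follows. This is illustrative only, not the Actor's internal routing logic; `choose_crawler` is a hypothetical helper:

```python
# Sketch: route on the response's Content-Type header instead of the
# URL's file extension. Not the Actor's actual implementation.

def choose_crawler(content_type: str) -> str:
    """Pick a download strategy from a Content-Type header value."""
    # Strip parameters such as "; charset=utf-8".
    mime = content_type.split(";")[0].strip().lower()
    if mime in ("application/pdf", "application/octet-stream"):
        return "raw-http"   # plain file download, no browser needed
    if mime.startswith("text/html"):
        return "browser"    # render with a headless browser
    return "raw-http"       # default: fetch the bytes as-is

# A URL ending in "pdf" but lacking a ".pdf" extension would still be
# routed correctly, because only the response header is consulted.
print(choose_crawler("application/pdf"))           # raw-http
print(choose_crawler("text/html; charset=utf-8"))  # browser
```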

jakub.kopecky

Hi,

This behavior is expected: when the crawler encounters a file download, it throws an exception. If you want to keep the file, set "saveFiles": true (in the UI: "Output settings" -> "Save files") and the downloaded files will be accessible in the key-value store. Please see this run: https://console.apify.com/view/runs/QLJaWGZRHSb20u0X3
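In practice, enabling that option means adding `"saveFiles": true` to the Actor's input. A minimal sketch of such an input, with a placeholder start URL (the Actor accepts many more options than shown here):

```python
import json

# Input for apify/website-content-crawler with file saving enabled.
# Equivalent to ticking "Output settings" -> "Save files" in the UI.
# The start URL below is a placeholder, not from the reported run.
run_input = {
    "startUrls": [{"url": "https://example.com/some-report-pdf"}],
    "saveFiles": True,  # downloaded files land in the key-value store
}

print(json.dumps(run_input, indent=2))
```

This input can be pasted into the Actor's JSON input editor in the Apify Console, or passed to a run started via the Apify API or client libraries.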

Let me know if this does not work for you or if you have any questions.

Jakub

Pricing

Pricing model

Pay per usage

This Actor is paid per platform usage: the Actor itself is free to use, and you only pay for the Apify platform resources its runs consume.