Website Content Crawler

Developed and maintained by Apify

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
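
For example, a minimal way to run this Actor from Python is the apify-client package. A sketch, assuming a valid API token; the start URL is a placeholder, and "url"/"text" are the Actor's standard dataset fields:

from apify_client import ApifyClient

# A minimal sketch: run the Actor and read the extracted text.
client = ApifyClient("<YOUR_APIFY_TOKEN>")  # placeholder token

# Start the Actor and wait for the run to finish.
run = client.actor("apify/website-content-crawler").call(
    run_input={"startUrls": [{"url": "https://example.com"}]},
)

# Each dataset item holds the extracted content of one crawled page.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"], item.get("text", "")[:200])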

Rating: 4.0 (41)
Pricing: Pay per usage
Total users: 62K
Monthly users: 8.2K
Runs succeeded: >99%
Issues response: 8.2 days
Last modified: 15 hours ago

Crawler does not extract content, with no useful logs to debug

Closed

nimble_caretaker opened this issue 8 months ago

We are attempting to crawl our client's website. They have whitelisted US proxies. Unfortunately, most of the time, no content is extracted. With the same settings, it worked once, but now it fails entirely. The logs are not detailed enough to understand or resolve the issue the crawler encountered.

We need to tell our client whether the problem can be solved on our side or whether it is something on their end (e.g., CDN blocking).

Settings used:
  • Crawler type: playwright:firefox
  • Proxy: US proxies (3 IP addresses that our client has whitelisted)
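
A run input matching this description might look like the following sketch (field names are taken from this thread plus Apify's standard proxyConfiguration object; the start URL and proxy URLs are placeholders, not the client's real values):

run_input = {
    "startUrls": [{"url": "https://example.com"}],  # placeholder start URL
    "crawlerType": "playwright:firefox",
    "proxyConfiguration": {
        "useApifyProxy": False,  # custom proxies instead of Apify Proxy
        "proxyUrls": [  # placeholders for the three whitelisted IPs
            "http://user:pass@203.0.113.1:8000",
            "http://user:pass@203.0.113.2:8000",
            "http://user:pass@203.0.113.3:8000",
        ],
    },
}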

jiri.spilka

Hi, thank you for using Website Content Crawler.

I checked the logs of your run and adjusted a couple of settings, and it's now working. Here are the key changes:

In the logs, I noticed "Navigation timed out after 20 seconds". This was caused by the Request timeout setting; increasing it to 60 seconds (the default) resolves the issue.

Other settings to change:

"maxCrawlDepth": 0 -> 20,
"maxCrawlPages": 0 -> 9999999
  • Max crawl depth: Setting this to 0 means the Actor will only scrape the startUrls without further crawling. You’ll need to set it to a higher value; the default is 20.
  • Max crawl pages: This controls the total number of pages to scrape. Remove this setting to use the default value.

With these updated settings:

"requestTimeoutSecs": 60,
"maxCrawlDepth": 20,
"maxCrawlPages": 9999999

I was able to successfully crawl the website. Here’s a log snippet:

2024-11-12T12:02:44.268Z INFO Crawling will be started using 1 start URLs and 0 sitemap URLs
2024-11-12T12:02:44.907Z INFO PlaywrightCrawler: Starting the crawler.
2024-11-12T12:03:22.368Z INFO Enqueued 66 new links on https://****il/.
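
Applied programmatically, the fix amounts to merging these values into the run input before calling the Actor again; a sketch, reusing the hypothetical run_input from the sketches above:

# Merge the corrected values into the run input sketched earlier.
run_input.update({
    "requestTimeoutSecs": 60,  # default; avoids the 20-second navigation timeout
    "maxCrawlDepth": 20,       # default; 0 scrapes only the start URLs
    "maxCrawlPages": 9999999,  # effectively no page limit
})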

Please let me know if it works for you too.

nimble_caretaker

8 months ago

Works well. Thank you for your time.