Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
We have an automation set up to handle this process, which was working as expected until today. However, we encountered the following issue:
The automation is configured to crawl a specified number of websites, never exceeding 50. Despite this, in the second batch, the actor is attempting to process thousands of pages that are not defined within our system.
Could you please investigate this issue and let us know what might be causing it? Thank you!
Hi, thank you for using Website Content Crawler.
I’ve checked your run ID: peIfqyorawtDpbWIj, and I don’t see a limit of 50 results for crawling. You can control the number of pages being crawled using the "maxCrawlPages" parameter. In your run, maxCrawlPages is set to 9999999. To limit the number of pages, please set maxCrawlPages=50.
However, based on your run, it seems you need to scrape a list of around 50 URLs (submitted as startURLs). If that’s the case, you can set maxCrawlDepth=0, and the Actor will scrape only the specified URLs without crawling further.
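For illustration, here is a minimal sketch of that configuration using the Apify Python client; the token and URLs are placeholders, and everything beyond the startURLs, maxCrawlDepth, and maxCrawlPages fields quoted above should be adapted to your own run.

```python
# Minimal sketch: run Website Content Crawler with an explicit page limit.
# Requires the Apify Python client (pip install apify-client); the token and
# URLs below are placeholders.
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")

run_input = {
    # "startUrls" is the JSON spelling of the startURLs discussed above.
    "startUrls": [
        {"url": "https://example.com/page-1"},
        {"url": "https://example.com/page-2"},
    ],
    "maxCrawlDepth": 0,   # scrape only the listed URLs, no further crawling
    "maxCrawlPages": 50,  # hard cap on the number of pages processed
}

run = client.actor("apify/website-content-crawler").call(run_input=run_input)

# Each dataset item contains the cleaned text of one page.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"])
```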
Please let me know if this helps, or provide me with a run ID where you’re encountering an issue, and I’ll take a look. Jiri
Hi, Jiri!
This time I have explicitly added the following parameters:
"maxCrawlDepth": 0, "maxCrawlPages": 100,
However, the results were the same: https://console.apify.com/actors/runs/jzXoTJBdEqsQqDQv7
Please let me know what we should do to resolve this issue.
Thank you.
In your startURLs, there’s also a sitemap URL: *sitemap*.xml, and by default, the crawler will enqueue all the links from the sitemap. While this might seem counterintuitive at first, some users submit sitemaps as startURLs to enqueue and scrape all the URLs listed in the sitemap.
- If possible, I recommend removing the sitemap from your startURLs. Please see my example run where exactly 27 URLs were scraped (sitemap removed): Example Run.
If you need to download PDF files, enable the "saveFiles": true setting, and the files will be saved to the Apify Key-Value Store. I also noticed that the crawler was blocked for URLs ending in *.ru. This is likely due to bot protection. To scrape these URLs, you’ll need to use the correct proxy.
- If you’re unable to remove sitemap.xml, you can set maxCrawlPages and maxResults to 100: "The maximum number of resulting web pages to store. The crawler will automatically finish after reaching this number." See my example run: Run.
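If it helps, here’s a rough sketch of that fallback input via the Apify Python client; only maxCrawlPages, maxResults, and saveFiles come from the discussion above, and the token and sitemap URL are placeholders.

```python
# Sketch: keep the sitemap in startURLs but cap the output, and save PDFs.
# The token and sitemap URL are placeholders.
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")

run_input = {
    "startUrls": [{"url": "https://example.com/sitemap.xml"}],  # sitemap kept on purpose
    "maxCrawlPages": 100,  # crawl at most 100 pages
    "maxResults": 100,     # store at most 100 resulting pages
    "saveFiles": True,     # download linked files (e.g. PDFs) to the Key-Value Store
}

run = client.actor("apify/website-content-crawler").call(run_input=run_input)
print("Run finished, dataset:", run["defaultDatasetId"])
```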
Additionally, your startURLs seem quite diverse. It might be convenient to use different proxies for different countries to improve scraping efficiency.
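As a purely illustrative sketch, a country-specific proxy setup could look roughly like this. The shape of the proxyConfiguration object (useApifyProxy / apifyProxyCountry) follows the common Apify proxy input format and is an assumption to verify against the Actor’s input schema; the .ru URL is a placeholder.

```python
# Hypothetical sketch of a country-specific proxy setup. The
# "proxyConfiguration" field shape is assumed from the common Apify proxy
# input format; check the Actor's input schema before relying on it.
run_input_ru = {
    "startUrls": [{"url": "https://example.ru/"}],  # placeholder .ru URL
    "maxCrawlDepth": 0,
    "proxyConfiguration": {
        "useApifyProxy": True,
        "apifyProxyCountry": "RU",  # route requests through proxies in one country
    },
}
# This dict plugs into the same .call(run_input=...) pattern shown earlier.
```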
I understand that setting up crawling correctly can sometimes be tricky, as websites vary and each may require a slightly different approach.
I hope this helps. Jiri
Thank you for your detailed explanation—it’s much appreciated!
We’re using automation via Make.com to collect initial URLs based on specific search queries with the Google Search Results Scraper. The scenario retrieves the first 80 to 100 results and sends them in two batches to the Website Content Crawler to avoid potential timeout issues. All scraped data is consolidated into a single .txt file for later use by AI.
Here’s our last run: https://console.apify.com/actors/nFJndFXA5zjCTuudP/runs/jqo4gBbnMuUHL07fh
- Is there a way to limit the results to prevent scraping sitemaps?
- Can we skip PDF results? I initially thought PDFs were scraped, but if they’re not, is there a way to exclude them right in the Google Search Results Scraper?
- Proxies based on URLs: now that you know our use case, is it possible to set proxies automatically based on the URLs being scraped?
Thank you so much for your support! Looking forward to your guidance.
Hi! I got an automatic reply asking me to rate the conversation. I hope this case is not closed and that I will receive a reply to my last email.
Hi, I apologize for the delayed response.
No, no, this issue isn’t closed! I’d like to work on resolving it with you. Thank you for providing detailed information.
Regarding batch processing, you don’t need to create file batches. Website Content Crawler has a configurable timeout and can run for a long time, unless there’s a timeout limitation on Make.com (I’m not very familiar with that platform).
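To make that concrete, here’s a rough single-pass sketch using the Apify Python client instead of two Make.com batches. The queries/resultsPerPage inputs and the organicResults item field are assumptions about the Google Search Results Scraper to verify against its documentation; the token and search terms are placeholders.

```python
# Rough sketch: collect search result URLs and crawl them in a single
# Website Content Crawler run instead of two batches. The "queries" /
# "resultsPerPage" inputs and the "organicResults" item field are assumptions
# about apify/google-search-scraper; verify them against its documentation.
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")

# 1. Run the search scraper for a placeholder query.
search_run = client.actor("apify/google-search-scraper").call(
    run_input={"queries": "your search terms here", "resultsPerPage": 100}
)

# 2. Gather the organic result URLs from its dataset.
start_urls = []
for item in client.dataset(search_run["defaultDatasetId"]).iterate_items():
    for result in item.get("organicResults", []):
        start_urls.append({"url": result["url"]})

# 3. Feed every URL to Website Content Crawler in one run.
crawl_run = client.actor("apify/website-content-crawler").call(
    run_input={
        "startUrls": start_urls,
        "maxCrawlDepth": 0,  # scrape only the collected URLs
        "maxCrawlPages": max(len(start_urls), 1),
    }
)
print("Crawl dataset:", crawl_run["defaultDatasetId"])
```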
- Regarding questions 1 and 2: I noticed your Google search query is quite advanced. You can exclude PDFs and sitemaps directly in the Google search query using -filetype:pdf and -filetype:xml. I tested this on Google, and it works as expected (see the query sketch after this list).
- Regarding proxy settings: it seems I initially misread the logs. Upon revisiting, the logs indicate NS_ERROR_PROXY_CONNECTION_REFUSED. However, when I checked, the website itself was not reachable. This appears to be an issue with the website, not the crawler. I apologize for the misleading information. Please disregard this and run the crawler as usual. The crawler uses datacenter proxies by default, which should work fine.
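For example, a tiny sketch of building such a query; the base search terms are a placeholder, and only the -filetype exclusions are the part discussed above.

```python
# Sketch: exclude PDFs and XML sitemaps at the search stage by appending
# Google's -filetype operators. The base query is a placeholder.
base_query = "your search terms here"
query = f"{base_query} -filetype:pdf -filetype:xml"

# The resulting string would go into the search scraper's query input, e.g.:
# run_input = {"queries": query, "resultsPerPage": 100}
print(query)
```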
I hope this helps! Jiri
Hi Jiri!
Thank you so much! Your support has been greatly appreciated!
Have a great holiday!
Happy to help. Have great holidays too!