Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
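As an illustration of the LangChain integration mentioned above, here is a minimal sketch that loads the results of a finished crawler run into LangChain documents. It assumes the langchain-community and apify-client packages are installed, an Apify API token is set in the APIFY_API_TOKEN environment variable, and the dataset ID is a placeholder:

```python
from langchain_community.document_loaders import ApifyDatasetLoader
from langchain_core.documents import Document

# Map each Website Content Crawler dataset item to a LangChain Document.
loader = ApifyDatasetLoader(
    dataset_id="<DATASET_ID>",  # default dataset of a finished crawler run
    dataset_mapping_function=lambda item: Document(
        page_content=item.get("text") or "",
        metadata={"source": item.get("url")},
    ),
)

docs = loader.load()  # ready to be chunked, embedded, and stored in a vector DB
```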


The Actor is trying to process too many requests

Closed

capola opened this issue a month ago

We have an automation set up to handle this process, which was working as expected until today. However, we encountered the following issue:

The automation is configured to crawl a specified number of websites, never exceeding 50. Despite this, in the second batch, the actor is attempting to process thousands of pages that are not defined within our system.

Could you please investigate this issue and let us know what might be causing it? Thank you!

jiri.spilka

Hi, thank you for using Website Content Crawler.

I’ve checked your run ID: peIfqyorawtDpbWIj, and I don’t see a limit of 50 results for crawling.
You can control the number of pages being crawled using the "maxCrawlPages" parameter. In your run, maxCrawlPages is set to 9999999.

To limit the number of pages, please set maxCrawlPages=50.

However, based on your run, it seems you need to scrape a list of around 50 URLs (submitted as startURLs).
If that’s the case, you can set maxCrawlDepth=0, and the Actor will scrape only the specified URLs without crawling further.
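For illustration, here is a minimal sketch of such a run started from the Apify API client for Python. The token and URLs are placeholders; the two parameters are the ones discussed above:

```python
# Minimal sketch: scrape only the submitted start URLs, capped at 50 pages.
# Assumes the apify-client Python package and an Apify API token.
from apify_client import ApifyClient

client = ApifyClient("<YOUR_API_TOKEN>")

run_input = {
    "startUrls": [
        {"url": "https://example.com/page-1"},  # placeholders
        {"url": "https://example.com/page-2"},
    ],
    "maxCrawlDepth": 0,   # scrape only the start URLs, do not follow links
    "maxCrawlPages": 50,  # hard cap on the number of pages processed
}

run = client.actor("apify/website-content-crawler").call(run_input=run_input)

# Read the scraped pages from the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"])
```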

Please let me know if this helps or provide me with a run ID where you’re encountering an issue, and I’ll take a look. Jiri

capola

a month ago

Hi, Jiri!

This time I have explicitly added the following parameters:

"maxCrawlDepth": 0, "maxCrawlPages": 100,

However, the results were the same: https://console.apify.com/actors/runs/jzXoTJBdEqsQqDQv7

Please let me know what we should do to resolve this issue.

jiri.spilka

Thank you.

In your startURLs, there's also a sitemap URL (*sitemap*.xml), and by default, the crawler will enqueue all the links from the sitemap.

While this might seem counterintuitive at first, some users submit sitemaps as startURLs to enqueue and scrape all the URLs listed in the sitemap.

  • If possible, I recommend removing the sitemap from your startURLs.

Please see my example run where exactly 27 URLs were scraped (sitemap removed): Example Run.

If you need to download PDF files, enable the "saveFiles": true setting, and the files will be saved to the Apify Key-Value Store.

I also noticed that the crawler was blocked for URLs ending in *.ru. This is likely due to bot protection. To scrape these URLs, you’ll need to use the correct proxy.

  • If you’re unable to remove sitemap.xml, you can set maxCrawlPages and maxResults to 100 (see the input sketch below):

"The maximum number of resulting web pages to store. The crawler will automatically finish after reaching this number."

See my example run: Run.
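To illustrate, here is a hedged sketch of what that capped input could look like when the sitemap has to stay in startUrls. The sitemap URL is a placeholder, and saveFiles is included only because PDFs came up earlier:

```python
# Sketch of a capped run input when the sitemap stays in startUrls.
run_input = {
    "startUrls": [{"url": "https://example.com/sitemap.xml"}],  # placeholder
    "maxCrawlPages": 100,  # stop crawling after 100 pages
    "maxResults": 100,     # stop storing results after 100 pages
    "saveFiles": True,     # optional: store downloaded files (e.g. PDFs) in the Key-Value Store
}
```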

Additionally, your startURLs seem quite diverse. It might be convenient to use different proxies for different countries to improve scraping efficiency.

I understand that setting up crawling correctly can sometimes be tricky, as websites vary and each may require a slightly different approach.
I hope this helps. Jiri

capola

25 days ago

Thank you for your detailed explanation—it’s much appreciated!

We’re using automation via Make.com to collect initial URLs based on specific search queries with the Google Search Results Scraper. The scenario retrieves the first 80 to 100 results and sends them in two batches to the Website Content Crawler to avoid potential timeout issues. All scraped data is consolidated into a single .txt file for later use by AI.

Here’s our last run: https://console.apify.com/actors/nFJndFXA5zjCTuudP/runs/jqo4gBbnMuUHL07fh

  1. Is there a way to limit the results to prevent scraping sitemaps?

  2. Can we skip PDF results?

I initially thought PDFs were scraped, but if they’re not, is there a way to exclude them right in the Google Search Results Scraper?

  3. Proxies Based on URLs

Now that you know our use case, is it possible to set proxies automatically based on the URLs being scraped?

Thank you so much for your support! Looking forward to your guidance.

capola

24 days ago

Hi! I got an automatic reply asking me to rate the conversation. I hope this case is not closed and that I will receive a reply to my last message.

jiri.spilka

Hi, I apologize for the delayed response.

No, no, this issue isn’t closed! I’d like to work on resolving it with you. Thank you for providing detailed information.

Regarding batch processing, you don’t need to split the URLs into batches. Website Content Crawler has a configurable timeout and can run for a long time, unless there’s a timeout limitation on Make.com (I’m not very familiar with that platform).

  1. and 2. I noticed your Google search query is quite advanced. You can exclude PDFs and sitemaps directly in the Google search query using -filetype:pdf and -filetype:xml. I tested this on Google, and it works as expected (see the pipeline sketch after this list).

  3. Regarding proxy settings, it seems I initially misread the logs. Upon revisiting, the logs indicate NS_ERROR_PROXY_CONNECTION_REFUSED. However, when I checked, the website itself was not reachable. This appears to be an issue with the website, not the crawler. I apologize for the misleading information. Please disregard this and run the crawler as usual. The crawler uses datacenter proxies by default, which should work fine.
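Here is a hedged sketch of the whole pipeline in Python (Google Search Results Scraper feeding Website Content Crawler, results consolidated into one text file). It assumes the apify-client package; the query, the dataset field names ("organicResults", "url", "text"), and the output handling are assumptions based on the two Actors' typical output and may need adjusting to your setup:

```python
from apify_client import ApifyClient

client = ApifyClient("<YOUR_API_TOKEN>")

# 1) Search, excluding PDFs and sitemap XMLs directly in the query.
search_run = client.actor("apify/google-search-scraper").call(run_input={
    "queries": "your search terms -filetype:pdf -filetype:xml",  # placeholder query
    "resultsPerPage": 100,
})

# 2) Collect the organic result URLs from the search run's dataset.
start_urls = []
for page in client.dataset(search_run["defaultDatasetId"]).iterate_items():
    for result in page.get("organicResults", []):
        start_urls.append({"url": result["url"]})

# 3) Scrape only those URLs (no further crawling) in a single run.
crawl_run = client.actor("apify/website-content-crawler").call(run_input={
    "startUrls": start_urls,
    "maxCrawlDepth": 0,
    "maxCrawlPages": len(start_urls),
})

# 4) Consolidate the scraped text into one file for the AI step.
with open("scraped_content.txt", "w", encoding="utf-8") as f:
    for item in client.dataset(crawl_run["defaultDatasetId"]).iterate_items():
        f.write(f"# {item.get('url')}\n{item.get('text', '')}\n\n")
```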

I hope this helps! Jiri

capola

23 days ago

Hi Jiri!

Thank you so much! Your support has been greatly appreciated!

Have a great holiday!

jiri.spilka

Happy to help. Have great holidays too!

Maintained by Apify

Actor Metrics

  • 4k monthly users

  • 840 stars

  • >99% runs succeeded

  • 1 day response time

  • Created in Mar 2023

  • Modified 21 hours ago