Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
We have an automation set up to handle this process, which was working as expected until today. However, we encountered the following issue:
The automation is configured to crawl a specified number of websites, never exceeding 50. Despite this, in the second batch, the actor is attempting to process thousands of pages that are not defined within our system.
Could you please investigate this issue and let us know what might be causing it? Thank you!
Hi, thank you for using Website Content Crawler.
I’ve checked your run ID: peIfqyorawtDpbWIj, and I don’t see a limit of 50 results for crawling. You can control the number of pages being crawled using the "maxCrawlPages" parameter. In your run, maxCrawlPages is set to 9999999. To limit the number of pages, please set maxCrawlPages=50.
However, based on your run, it seems you need to scrape a list of around 50 URLs (submitted as startURLs). If that’s the case, you can set maxCrawlDepth=0, and the Actor will scrape only the specified URLs without crawling further.
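For illustration, here is a minimal sketch of that configuration using the Apify Python client; the token and URLs are placeholders, and everything beyond the startURLs, maxCrawlDepth, and maxCrawlPages fields quoted above should be adapted to your own run.

```python
# Minimal sketch: run Website Content Crawler with an explicit page limit.
# Requires the Apify Python client (pip install apify-client); the token and
# URLs below are placeholders.
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")

run_input = {
    # "startUrls" is the JSON spelling of the startURLs discussed above.
    "startUrls": [
        {"url": "https://example.com/page-1"},
        {"url": "https://example.com/page-2"},
    ],
    "maxCrawlDepth": 0,   # scrape only the listed URLs, no further crawling
    "maxCrawlPages": 50,  # hard cap on the number of pages processed
}

run = client.actor("apify/website-content-crawler").call(run_input=run_input)

# Each dataset item contains the cleaned text of one page.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"])
```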
Please let me know if this helps, or provide me with a run ID where you’re encountering an issue, and I’ll take a look. Jiri
Hi, Jiri!
This time I have explicitly added the following parameters:
"maxCrawlDepth": 0, "maxCrawlPages": 100,
However, the results were the same: https://console.apify.com/actors/runs/jzXoTJBdEqsQqDQv7
Please let me know what we should do to resolve this issue.
Thank you.
In your startURLs, there’s also a sitemap URL: *sitemap*.xml, and by default, the crawler will enqueue all the links from the sitemap. While this might seem counterintuitive at first, some users submit sitemaps as startURLs to enqueue and scrape all the URLs listed in the sitemap.
- If possible, I recommend removing the sitemap from your startURLs. Please see my example run where exactly 27 URLs were scraped (sitemap removed): Example Run.
If you need to download PDF files, enable the "saveFiles": true setting, and the files will be saved to the Apify Key-Value Store. I also noticed that the crawler was blocked for URLs ending in *.ru. This is likely due to bot protection. To scrape these URLs, you’ll need to use the correct proxy.
- If you’re unable to remove sitemap.xml, you can set maxCrawlPages and maxResults to 100: "The maximum number of resulting web pages to store. The crawler will automatically finish after reaching this number." See my example run: Run.
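If it helps, here’s a rough sketch of that fallback input via the Apify Python client; only maxCrawlPages, maxResults, and saveFiles come from the discussion above, and the token and sitemap URL are placeholders.

```python
# Sketch: keep the sitemap in startURLs but cap the output, and save PDFs.
# The token and sitemap URL are placeholders.
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")

run_input = {
    "startUrls": [{"url": "https://example.com/sitemap.xml"}],  # sitemap kept on purpose
    "maxCrawlPages": 100,  # crawl at most 100 pages
    "maxResults": 100,     # store at most 100 resulting pages
    "saveFiles": True,     # download linked files (e.g. PDFs) to the Key-Value Store
}

run = client.actor("apify/website-content-crawler").call(run_input=run_input)
print("Run finished, dataset:", run["defaultDatasetId"])
```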
Additionally, your startURLs seem quite diverse. It might be convenient to use different proxies for different countries to improve scraping efficiency.
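As a purely illustrative sketch, a country-specific proxy setup could look roughly like this. The shape of the proxyConfiguration object (useApifyProxy / apifyProxyCountry) follows the common Apify proxy input format and is an assumption to verify against the Actor’s input schema; the .ru URL is a placeholder.

```python
# Hypothetical sketch of a country-specific proxy setup. The
# "proxyConfiguration" field shape is assumed from the common Apify proxy
# input format; check the Actor's input schema before relying on it.
run_input_ru = {
    "startUrls": [{"url": "https://example.ru/"}],  # placeholder .ru URL
    "maxCrawlDepth": 0,
    "proxyConfiguration": {
        "useApifyProxy": True,
        "apifyProxyCountry": "RU",  # route requests through proxies in one country
    },
}
# This dict plugs into the same .call(run_input=...) pattern shown earlier.
```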
I understand that setting up crawling correctly can sometimes be tricky, as websites vary and each may require a slightly different approach.
I hope this helps. Jiri
Thank you for your detailed explanation—it’s much appreciated!
We’re using automation via Make.com to collect initial URLs based on specific search queries with the Google Search Results Scraper. The scenario retrieves the first 80 to 100 results and sends them in two batches to the Website Content Crawler to avoid potential timeout issues. All scraped data is consolidated into a single .txt file for later use by AI.
Here’s our last run: https://console.apify.com/actors/nFJndFXA5zjCTuudP/runs/jqo4gBbnMuUHL07fh
- Is there a way to limit the results to prevent scraping sitemaps?
- Can we skip PDF results? I initially thought PDFs were scraped, but if they’re not, is there a way to exclude them right in the Google Search Results Scraper?
- Proxies based on URLs: now that you know our use case, is it possible to set proxies automatically based on the URLs being scraped?
Thank you so much for your support! Looking forward to your guidance.
Hi! I got an automatic reply asking me to rate the conversation. I hope this case is not closed and that I will receive a reply to my last email.
Hi, I apologize for the delayed response.
No, no, this issue isn’t closed! I’d like to work on resolving it with you. Thank you for providing detailed information.
Regarding batch processing, you don’t need to create file batches. Website Content Crawler has a configurable timeout and can run for a long time, unless there’s a timeout limitation on Make.com (I’m not very familiar with that platform).
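To make that concrete, here’s a rough single-pass sketch using the Apify Python client instead of two Make.com batches. The queries/resultsPerPage inputs and the organicResults item field are assumptions about the Google Search Results Scraper to verify against its documentation; the token and search terms are placeholders.

```python
# Rough sketch: collect search result URLs and crawl them in a single
# Website Content Crawler run instead of two batches. The "queries" /
# "resultsPerPage" inputs and the "organicResults" item field are assumptions
# about apify/google-search-scraper; verify them against its documentation.
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")

# 1. Run the search scraper for a placeholder query.
search_run = client.actor("apify/google-search-scraper").call(
    run_input={"queries": "your search terms here", "resultsPerPage": 100}
)

# 2. Gather the organic result URLs from its dataset.
start_urls = []
for item in client.dataset(search_run["defaultDatasetId"]).iterate_items():
    for result in item.get("organicResults", []):
        start_urls.append({"url": result["url"]})

# 3. Feed every URL to Website Content Crawler in one run.
crawl_run = client.actor("apify/website-content-crawler").call(
    run_input={
        "startUrls": start_urls,
        "maxCrawlDepth": 0,  # scrape only the collected URLs
        "maxCrawlPages": max(len(start_urls), 1),
    }
)
print("Crawl dataset:", crawl_run["defaultDatasetId"])
```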
- Regarding questions 1 and 2: I noticed your Google search query is quite advanced. You can exclude PDFs and sitemaps directly in the Google search query using -filetype:pdf and -filetype:xml. I tested this on Google, and it works as expected (see the query sketch after this list).
- Regarding proxy settings: it seems I initially misread the logs. Upon revisiting, the logs indicate NS_ERROR_PROXY_CONNECTION_REFUSED. However, when I checked, the website itself was not reachable. This appears to be an issue with the website, not the crawler. I apologize for the misleading information. Please disregard this and run the crawler as usual. The crawler uses datacenter proxies by default, which should work fine.
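For example, a tiny sketch of building such a query; the base search terms are a placeholder, and only the -filetype exclusions are the part discussed above.

```python
# Sketch: exclude PDFs and XML sitemaps at the search stage by appending
# Google's -filetype operators. The base query is a placeholder.
base_query = "your search terms here"
query = f"{base_query} -filetype:pdf -filetype:xml"

# The resulting string would go into the search scraper's query input, e.g.:
# run_input = {"queries": query, "resultsPerPage": 100}
print(query)
```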
I hope this helps! Jiri
Hi Jiri!
Thank you so much! Your support has been greatly appreciated!
Have a great holiday!
Happy to help. Have great holidays too!