Website Content Crawler
No credit card required
Website Content Crawler
No credit card required
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
Do you want to learn more about this Actor?
Get a demoWhy there are only 47 requests, when I'm trying to scan the entire website through their sitemap. There should be hundreds.
Any ideas for improvement to scan the entire website's product pages ?
Hi, thank you for using the Website Content Crawler.
I’ve been trying to find a solution for that specific website but haven’t been successful yet. I’ll reach out internally to explore a possible solution.
Hi,
I apologize for the delayed response.
After consulting internally, here’s a working solution provided by @jindrich.bar (credit goes to him).
He used the Cheerio Scraper instead of the Website Content Crawler and was able to scrape the content successfully.
You can check out his example run, which was extremely fast. However, he aborted the run to avoid wasting resources.
Please review the results, and if you find them helpful, copy the input JSON into your Cheerio Scraper. It should work for your case.
Additionally, make sure to use proxies, as the crawler might get blocked otherwise.
We hope this helps! Feel free to reach out if you have any other questions.
Actor Metrics
3.9k monthly users
-
718 stars
>99% runs succeeded
2.2 days response time
Created in Mar 2023
Modified 21 hours ago