Website Content Crawler
No credit card required
Website Content Crawler
No credit card required
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
Do you want to learn more about this Actor?
Get a demobills and loads but doesnt produce
Hi, thank you for using the Website Content Crawler.
When I checked your run log, I noticed that for each startURL
, the crawler fetched sitemaps. This process took some time because there were around 100 startURLs.
Now, depending on your use case:
-
If you want to crawl every domain and retrieve text content from all pages within each domain, keep the sitemap usage enabled. However, please note that this can take a while, as some domains (e.g., saf***ays.com) are large and contain hundreds of pages. For such cases, I recommend crawling each domain separately to maintain better control.
-
If you only need to scrape the main page without crawling, disable the
Consider URLs from sitemaps
option and setmaxCrawlDepth = 0
. These settings will limit the crawler to scraping only the pages specified in the startURLs.
I hope this helps! I’ll close this issue for now, but feel free to ask further questions or raise another issue.
Best regards, Jiri
Actor Metrics
3.9k monthly users
-
718 stars
>99% runs succeeded
2.2 days response time
Created in Mar 2023
Modified 15 hours ago