Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

Doesn't work

Closed

topaz_frog opened this issue
16 days ago

It bills and loads but doesn't produce any results.

jiri.spilka

Hi, thank you for using the Website Content Crawler.

When I checked your run log, I noticed that the crawler fetched sitemaps for each startURL. With around 100 startURLs, this step alone took a considerable amount of time.

Now, depending on your use case:

  • If you want to crawl every domain and retrieve text content from all pages within each domain, keep the sitemap usage enabled. However, please note that this can take a while, as some domains (e.g., saf***ays.com) are large and contain hundreds of pages. For such cases, I recommend crawling each domain separately to maintain better control.

  • If you only need to scrape the main page without crawling, disable the "Consider URLs from sitemaps" option and set maxCrawlDepth = 0. These settings will limit the crawler to scraping only the pages specified in the startURLs.
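
For reference, the two options above could be sketched as Actor inputs roughly like this. This is only an illustration, not taken from your run: the helper names are made up, and I am assuming the input keys are `startUrls`, `maxCrawlDepth`, and `useSitemaps` (the key behind the "Consider URLs from sitemaps" option), so please verify them against the Actor's input schema before relying on them.

```python
import json
from typing import Any


def build_single_page_input(start_urls: list[str]) -> dict[str, Any]:
    """Input that scrapes only the listed pages, with no crawling.

    Hypothetical helper; the key names are assumptions (see the lead-in).
    """
    return {
        "startUrls": [{"url": url} for url in start_urls],
        "maxCrawlDepth": 0,    # do not follow links from the start pages
        "useSitemaps": False,  # assumed key for "Consider URLs from sitemaps"
    }


def build_per_domain_inputs(start_urls: list[str]) -> list[dict[str, Any]]:
    """One input per domain, sitemaps enabled, so each domain runs separately.

    Hypothetical helper; each dict would be submitted as its own run.
    """
    return [
        {
            "startUrls": [{"url": url}],
            "useSitemaps": True,  # assumed key, keeps sitemap discovery on
        }
        for url in start_urls
    ]


if __name__ == "__main__":
    # Placeholder URLs, not the ones from the original run.
    urls = ["https://example.com", "https://example.org"]
    print(json.dumps(build_single_page_input(urls), indent=2))
    print(json.dumps(build_per_domain_inputs(urls), indent=2))
```

With the `apify-client` Python package, a dict like these would typically be passed as `run_input` when calling the Actor, for example `ApifyClient(token).actor("apify/website-content-crawler").call(run_input=...)`. Splitting large domains into separate runs makes it easier to see which one is slow and to stop it without losing the others.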

I hope this helps! I’ll close this issue for now, but feel free to ask further questions or raise another issue.

Best regards, Jiri

Developer
Maintained by Apify

Actor Metrics

  • 3.9k monthly users

  • 718 stars

  • >99% runs succeeded

  • 2.2 days response time

  • Created in Mar 2023

  • Modified 15 hours ago