Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

Few requests

Open

nimble_caretaker opened this issue
16 days ago

Why are there only 47 requests when I'm trying to scan the entire website through its sitemap? There should be hundreds.

Any ideas on how to scan all of the website's product pages?
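
For anyone seeing the same symptom, one way to check how many requests to expect is to count the `<loc>` entries in the sitemap and compare that with the number of requests the crawler made. The sketch below is a hedged illustration only: the sample XML and the `summarize_sitemap` helper are hypothetical and are not taken from the website in question. It also reports whether the file is a sitemap index (a list of further sitemaps) rather than a plain list of pages, since the two are counted differently.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical sample data; a real sitemap would be downloaded from the site.
SAMPLE_URLSET = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/products/1</loc></url>
  <url><loc>https://example.com/products/2</loc></url>
  <url><loc>https://example.com/products/3</loc></url>
</urlset>
""".strip()

SAMPLE_INDEX = """
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-products.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>
</sitemapindex>
""".strip()


def summarize_sitemap(xml_text: str) -> dict:
    """Return the root element kind and the <loc> URLs found in a sitemap."""
    root = ET.fromstring(xml_text)
    # The tag looks like "{namespace}urlset"; keep only the local name.
    kind = root.tag.rsplit("}", 1)[-1]
    locs = [
        el.text.strip()
        for el in root.iter(f"{{{SITEMAP_NS}}}loc")
        if el.text
    ]
    return {"kind": kind, "count": len(locs), "locs": locs}


if __name__ == "__main__":
    for name, xml_text in (("urlset", SAMPLE_URLSET), ("index", SAMPLE_INDEX)):
        info = summarize_sitemap(xml_text)
        print(name, info["kind"], info["count"])
```

If the root element is `sitemapindex`, each listed `<loc>` points to another sitemap file, so the number of pages is the sum of the entries in those child files, not the number of entries in the index itself.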

jiri.spilka

Hi, thank you for using the Website Content Crawler.

I’ve been trying to find a solution for that specific website but haven’t been successful yet. I’ll reach out internally to explore a possible solution.

jiri.spilka

Hi,
I apologize for the delayed response.
After consulting internally, here’s a working solution provided by @jindrich.bar (credit goes to him).

He used the Cheerio Scraper instead of the Website Content Crawler and was able to scrape the content successfully.

You can check out his example run, which was extremely fast. However, he aborted the run to avoid wasting resources.

Please review the results, and if you find them helpful, copy the input JSON into your Cheerio Scraper. It should work for your case.

Additionally, make sure to use proxies, as the crawler might get blocked otherwise.
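
As a rough illustration of what a Cheerio Scraper input with proxies enabled might look like, here is a minimal sketch. It is not the input from the example run mentioned above, which should take precedence. The start URL, the request limit, and the page function body are placeholder assumptions, and the field names (`startUrls`, `pageFunction`, `proxyConfiguration`, `maxRequestsPerCrawl`) reflect the Cheerio Scraper input schema as I understand it, so please verify them against the Actor's input tab.

```python
import json

# Assumed page function: returns the URL and <title> of each page.
# This is an illustrative placeholder, not the function from the example run.
PAGE_FUNCTION = """
async function pageFunction(context) {
    const { $, request } = context;
    return {
        url: request.url,
        title: $('title').text(),
    };
}
""".strip()


def build_cheerio_input(start_url: str, max_requests: int = 1000) -> dict:
    """Build an input object for Cheerio Scraper with Apify Proxy enabled."""
    return {
        "startUrls": [{"url": start_url}],
        "pageFunction": PAGE_FUNCTION,
        # Route requests through Apify Proxy to reduce the chance of blocking.
        "proxyConfiguration": {"useApifyProxy": True},
        # Safety cap so a misconfigured run does not crawl indefinitely.
        "maxRequestsPerCrawl": max_requests,
    }


if __name__ == "__main__":
    # Hypothetical URL; replace it with the site you want to scrape.
    config = build_cheerio_input("https://example.com/")
    print(json.dumps(config, indent=2))
```

The printed JSON can be pasted into the Actor's JSON input editor and adjusted from there.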

We hope this helps! Feel free to reach out if you have any other questions.

Developer
Maintained by Apify
