Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

Crawling logic

Open

nauticallygreat opened this issue a day ago

The doc for this crawler says:

For example, if you enter the start URL https://example.com/blog/, the actor will crawl pages like https://example.com/blog/article-1 or https://example.com/blog/section/article-2, but will skip pages like https://example.com/docs/something-else.
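The scoping rule quoted above can be read as a same-origin, path-prefix check. The following is a minimal sketch of that reading using only the Python standard library; `is_in_scope` is a hypothetical helper written for illustration and is not the Actor's actual implementation, which may differ in details.

```python
from urllib.parse import urlsplit


def is_in_scope(start_url: str, candidate_url: str) -> bool:
    """Return True if candidate_url sits under the start URL's path.

    Hypothetical illustration of the documented rule, not the Actor's code.
    """
    start = urlsplit(start_url)
    candidate = urlsplit(candidate_url)

    # Assumption: a different scheme or host is always out of scope.
    if (start.scheme, start.netloc) != (candidate.scheme, candidate.netloc):
        return False

    # Use the "directory" part of the start path as the required prefix.
    if start.path.endswith("/"):
        base_path = start.path
    else:
        base_path = start.path.rsplit("/", 1)[0] + "/"

    return candidate.path.startswith(base_path)


start = "https://example.com/blog/"
for url in (
    "https://example.com/blog/article-1",
    "https://example.com/blog/section/article-2",
    "https://example.com/docs/something-else",
):
    print(url, is_in_scope(start, url))
```

Under this reading, the first two URLs are in scope and the third is not, which matches the example in the doc.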

I don't really understand the reasoning behind this logic. How am I supposed to crawl every example.com page that is linked from https://example.com/blog/, regardless of its path? Using an includeUrlGlob doesn't work either, because I only want pages that are linked from https://example.com/blog/, and the doc says that using an include glob "will disable the default Start URLs based scoping". So I don't see how I'm supposed to use this crawler for my use case.
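To make the requested behavior concrete, here is a rough sketch of "all same-site pages linked from one page", again using only the Python standard library. `LinkCollector` and `same_site_links` are hypothetical names, and the HTML is an inline sample rather than a real fetch. This only illustrates the scoping being asked for; whether the Actor can be configured to do this directly is not something this sketch answers.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags (illustrative helper)."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page they were found on.
            self.links.append(urljoin(self.base_url, href))


def same_site_links(page_url: str, html: str) -> list[str]:
    """Return unique links on the page that share the page's host."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    host = urlsplit(page_url).netloc
    unique: list[str] = []
    for link in collector.links:
        if urlsplit(link).netloc == host and link not in unique:
            unique.append(link)
    return unique


sample_html = (
    '<a href="article-1">Article</a>'
    '<a href="/docs/something-else">Docs</a>'
    '<a href="https://other.org/page">External</a>'
)
print(same_site_links("https://example.com/blog/", sample_html))
```

Here the path of the linked page does not matter, only that it is on the same host and was found on the start page. The resulting list could, in principle, be supplied to a crawler as explicit start URLs.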
