Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
The doc for this crawler says:
For example, if you enter the start URL https://example.com/blog/, the actor will crawl pages like https://example.com/blog/article-1 or https://example.com/blog/section/article-2, but will skip pages like https://example.com/docs/something-else.
I don't really understand why this is the logic. How am I supposed to crawl all example.com pages that are linked from https://example.com/blog/? Using an includeUrlGlob doesn't work either, because I only want pages that are linked from https://example.com/blog/, and the doc says that using an include glob "will disable the default Start URLs based scoping". So I don't see how I'm supposed to use this crawler for my use case.
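
To make the difference concrete, here is a minimal Python sketch of the two scoping rules being compared. It is only an approximation of the behaviour described in the quoted doc, not the Actor's actual implementation; the function names `in_default_scope` and `same_site` are hypothetical, and the path-prefix comparison is an assumption based on the three example URLs above.

```python
from urllib.parse import urlsplit


def in_default_scope(start_url: str, candidate_url: str) -> bool:
    """Approximate the documented default: same host, path under the start URL's path.

    Assumption: the rule is a plain path-prefix match, inferred from the doc's examples.
    """
    start = urlsplit(start_url)
    candidate = urlsplit(candidate_url)
    if start.netloc != candidate.netloc:
        return False
    return candidate.path.startswith(start.path)


def same_site(start_url: str, candidate_url: str) -> bool:
    """The rule the question asks for: any page on the same host, regardless of path."""
    return urlsplit(start_url).netloc == urlsplit(candidate_url).netloc


start = "https://example.com/blog/"
for url in (
    "https://example.com/blog/article-1",
    "https://example.com/blog/section/article-2",
    "https://example.com/docs/something-else",
):
    # Show how each example URL from the doc fares under both rules.
    print(url, in_default_scope(start, url), same_site(start, url))
```

Under this sketch, the `/docs/something-else` page fails the default rule but passes the same-site rule, which is the gap the question is about. Note that a same-site filter alone still does not restrict the crawl to pages linked directly from the start page; that would also need some limit on how far links are followed.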
- 3.8k monthly users
- 635 stars
- 100.0% runs succeeded
- 2.7 days response time
- Created in Mar 2023
- Modified 7 days ago