Website Content Crawler
No credit card required
Website Content Crawler
No credit card required
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
Do you want to learn more about this Actor?
Get a demoA lot of CMS Webpages have a marker that defines the actual content of the page (excluding navigation and stuff like that). For instance as a
orA feature where you can explicitly include certain HTML Query paths instead of excluding would be nice in the HTML processing settings.
Hello @matthias.amberg and thank you for your interest in this Actor!
I see why you might want something like this, but at the same time, it feels our current extraction setup is robust enough (so there should be no need for this). In other words - our HTML extractors should do this step automatically.
Either way, I'm very curious about this - we'll be happy to hear about your use case for this! Do you have an example of a page where this feature would help you? Thanks!
Actor Metrics
3.9k monthly users
-
711 stars
>99% runs succeeded
2.2 days response time
Created in Mar 2023
Modified 7 hours ago