Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
Why does this job take so long?
Opened 3 days ago by grossjo, last comment 3 days ago by Dušan Vystrčil (dusan.vystrcil)
Opened 3 days ago by lei.niu.bigco, last comment 3 days ago by Dušan Vystrčil (dusan.vystrcil)
Can't get globs to work
Opened 4 days ago by matt333, last comment 4 days ago by matt333
Scraping only a few elements on the page and saving them in separate fields
Opened 7 days ago by igocza, last comment 6 days ago by igocza
not the whole content is scraped
Opened 7 days ago by igocza, last comment 6 days ago by Jiří Spilka (jiri.spilka)
can't crawl the whole website
Opened 9 days ago by visable, last comment 5 days ago by Jiří Spilka (jiri.spilka)
No urls to crawl, runs forever
Opened 11 days ago by betterbrain-dev, last comment 5 days ago by Jiří Spilka (jiri.spilka)
The subpages are not crawled
Opened 11 days ago by digitrans, last comment 11 days ago by Jiří Spilka (jiri.spilka)
File extension prevent crawler from accessing pages
Opened 12 days ago by russet_backpack, last comment 12 days ago by Jiří Spilka (jiri.spilka)
Actor timeout
Opened 12 days ago by MavenAGI, last comment 5 days ago by Jiří Spilka (jiri.spilka)
Diacritics error
Opened 15 days ago by kata.hrus, last comment 15 days ago by Jiří Spilka (jiri.spilka)
Testing the actor but get this result
Opened 15 days ago by davidsen.anders, last comment 14 days ago by Jiří Spilka (jiri.spilka)
doesn't work
Opened 16 days ago by topaz_frog, last comment 16 days ago by Jiří Spilka (jiri.spilka)
This run seems to be having a big problem
Opened 20 days ago by pure_chipmunk, last comment 19 days ago by Jiří Spilka (jiri.spilka)
Can I limit the max number of pages to crawl for each start URL?
Opened 21 days ago by pai911, last comment 19 days ago by Jiří Spilka (jiri.spilka)
Actor is not parsing the rest of the pages
Opened 21 days ago by Custombizio, last comment 14 days ago by Jiří Spilka (jiri.spilka)
Keep the urls in text or markdown?
Opened 22 days ago by cgoul, last comment 20 days ago by cgoul
How do I setup pagination with a URL
Opened 22 days ago by sacdrexelmba, last comment 18 days ago by optimal_valuation
Crawling only first page
Opened 23 days ago by Atish, last comment 23 days ago by Dušan Vystrčil (dusan.vystrcil)
Combining startUrl and includeUrlGlob
Opened a month ago by nauticallygreat, last comment 25 days ago by Jiří Spilka (jiri.spilka)
Actor Metrics
3.9k monthly users
718 stars
>99% runs succeeded
2.2 days response time
Created in Mar 2023
Modified 13 hours ago