

Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
Rating: 4.5 (39)
Pricing: Pay per usage
Total users: 50.6k
Monthly users: 7.2k
Runs succeeded: >99%
Issue response: 5.7 days
Last modified: 15 hours ago
How to scrape publicly available data
Closed
This URL (https://www.ncbi.nlm.nih.gov/books/NBK430685/) links to about 9,330 URLs, and each of those URLs contains the content I want to scrape. When I try to scrape directly using the main URL (https://www.ncbi.nlm.nih.gov/books/NBK430685/) as the start URL, I get no results. What's the solution?
Hello and thank you for your interest in this Actor!
By default, Website Content Crawler only crawls pages that are descendants (children) of the start URL. If you set https://a.com/b as the start URL, the crawler will only follow links like https://a.com/b/123 or https://a.com/b/456/789, but not https://a.com/xxx/123.
You can change this behavior with the Crawler settings > Include URLs (globs) input option. Check my example run (and feel free to copy the input to your account if the results match your expectations).
Cheers!
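
To make the glob option concrete, here is a minimal sketch of running the Actor through the apify-client package. The input field names (startUrls, includeUrlGlobs) and the glob pattern reflect my reading of the Actor's input schema, so verify them against the schema before relying on this.

```typescript
// Minimal sketch: run Website Content Crawler with an include-URL glob.
// Assumes an ESM context (top-level await) and APIFY_TOKEN in the environment.
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Field names below are assumptions based on the Actor's input schema.
const run = await client.actor('apify/website-content-crawler').call({
    // Start from the book's landing page...
    startUrls: [{ url: 'https://www.ncbi.nlm.nih.gov/books/NBK430685/' }],
    // ...and also follow any page under /books/, not only descendants
    // of the start URL (the ~9,330 chapter pages are siblings, not children).
    includeUrlGlobs: [{ glob: 'https://www.ncbi.nlm.nih.gov/books/**' }],
});

// The crawled text ends up in the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Scraped ${items.length} pages`);
```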
Keep in mind that this way, the crawler might try to scrape too many pages (depending on the structure of the website). Make sure to use the Max crawling depth / Max pages / Max results limiting options to prevent wasting resources.
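
As an illustration of those limits, the sketch below extends the previous input with caps on depth, pages, and results. Again, the field names (maxCrawlDepth, maxCrawlPages, maxResults) are my reading of the Actor's input schema, and the numbers are placeholders; tune both against the schema and the site you are crawling.

```typescript
// Same sketch as above, now with limits so a broad glob cannot run away.
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

const run = await client.actor('apify/website-content-crawler').call({
    startUrls: [{ url: 'https://www.ncbi.nlm.nih.gov/books/NBK430685/' }],
    includeUrlGlobs: [{ glob: 'https://www.ncbi.nlm.nih.gov/books/**' }],
    maxCrawlDepth: 2,      // follow links at most two hops from the start URL
    maxCrawlPages: 10000,  // hard cap on pages visited (~9,330 expected here)
    maxResults: 10000,     // hard cap on dataset items stored
});

console.log(`Run ${run.id} finished with status ${run.status}`);
```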