Website Content Crawler

Maintained by Apify

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
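As a quick illustration of how the Actor can feed such a pipeline, the sketch below runs it through the Apify Python client (apify-client) and collects the extracted text. The input field names (startUrls, maxCrawlPages) and the url/text item fields reflect the Actor's input schema as I understand it, and the token and start URL are placeholders, so treat this as a starting point rather than a definitive recipe.

```python
# Minimal sketch: run Website Content Crawler via the Apify Python client and
# collect the cleaned text for an LLM / RAG pipeline. Field names are assumed
# from the Actor's input schema; verify them in the Apify console before use.
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")  # personal API token placeholder

run = client.actor("apify/website-content-crawler").call(
    run_input={
        "startUrls": [{"url": "https://docs.apify.com/"}],  # illustrative start URL
        "maxCrawlPages": 50,                                 # keep the test run small
    }
)

# Each dataset item holds the page URL plus the extracted text content.
documents = [
    {"url": item.get("url"), "text": item.get("text")}
    for item in client.dataset(run["defaultDatasetId"]).iterate_items()
]
print(f"Crawled {len(documents)} pages")
```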

Rating: 4.5 (39)
Pricing: Pay per usage
Total users: 50.6k
Monthly users: 7.2k
Runs succeeded: >99%
Issue response: 5.7 days
Last modified: 15 hours ago

How to scrape publicly available data

Closed

evergreen_ideal opened this issue a year ago:

This URL, https://www.ncbi.nlm.nih.gov/books/NBK430685/, contains about 9,330 URLs, and each of those URLs contains the content I want to scrape. When I try to scrape directly using the main URL, I receive no results. What's the solution?

jindrich.bar replied:

Hello and thank you for your interest in this Actor!

By default, Website Content Crawler only crawls pages that are descendants (children) of the start URL. If you set https://a.com/b as the start URL, the crawler will only follow links such as https://a.com/b/123 or https://a.com/b/456/789, but not https://a.com/xxx/123.

You can change this behavior with the Crawler settings > Include URLs (globs) input option. Check my example run (and feel free to copy the input to your account if the results match your expectations).

Keep in mind that this way the crawler might try to scrape too many pages (depending on the structure of the website). Make sure to use the Max crawling depth / Max pages / Max results limiting options to prevent wasting resources. Cheers!
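To make this concrete, here is a hedged sketch of a run input that combines an Include URLs glob with the limiting options mentioned in the reply. The field names (includeUrlGlobs, maxCrawlDepth, maxCrawlPages, maxResults) and the glob object format are assumptions based on the Actor's input schema, and the glob pattern and limit values are only illustrative; double-check them against the schema in the Apify console.

```python
# Sketch of a run input for the NCBI example above: the include glob lets the
# crawler follow links outside the start URL's subtree, while the limits keep
# the run from scraping more pages than intended. Field names are assumed from
# the Actor's input schema; verify before running.
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")

run = client.actor("apify/website-content-crawler").call(
    run_input={
        "startUrls": [{"url": "https://www.ncbi.nlm.nih.gov/books/NBK430685/"}],
        # Allow pages under /books/ even if they are not descendants of the start URL.
        "includeUrlGlobs": [{"glob": "https://www.ncbi.nlm.nih.gov/books/**"}],
        # Limiting options discussed in the reply, with illustrative values.
        "maxCrawlDepth": 3,
        "maxCrawlPages": 10000,
        "maxResults": 10000,
    }
)
print("Results stored in dataset:", run["defaultDatasetId"])
```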