Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

Can't get globs to work

Closed

matt333 opened this issue
4 days ago

Trying to scrape https://myip.ms/browse/sites/1/rank/100000/rankii/500000/own/376714.

I set the start URL to https://myip.ms and the glob to https://myip.ms/browse/sites/*/rank/100000/rankii/500000/own/376714

The * stands for the page number. But the crawl finishes without visiting a single URL that matches the glob.

2024-12-09T16:36:12.591Z INFO PlaywrightCrawler: Starting the crawler.
2024-12-09T16:36:30.797Z INFO No links found on https://myip.ms/.
2024-12-09T16:36:30.952Z INFO PlaywrightCrawler: All requests from the queue have been processed, the crawler will shut down.

matt333

4 days ago

Follow-up: I think it might be due to reCAPTCHA being triggered. All the sub-pages return a screenshot and the content of an "Are you human?" verification page.

dusan.vystrcil

Hi, thank you for using Website Content Crawler.

The problem is that the Actor doesn’t find any links on the visited pages that match the defined glob https://myip.ms/browse/sites/**/rank/100000/rankii/500000/own/376714. The start URL https://myip.ms does not match this pattern, and because includeUrlGlobs is set, the crawler ignores every link that does not match the glob. As a result, it discovers no new pages to crawl, makes no additional requests, and scrapes nothing. If you want the Actor to follow other links, you’ll need to adjust or remove includeUrlGlobs, or use a start URL that already matches the pattern.
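
To illustrate the matching behaviour described above, here is a minimal Python sketch. It is not the Actor's actual implementation. It assumes a simplified glob dialect in which `**` matches any characters including `/`, and `*` matches any characters except `/`. The helper names `glob_to_regex` and `matches` are hypothetical, and the `/about` URL is made up for the example.

```python
import re


def glob_to_regex(glob: str) -> re.Pattern:
    """Translate a simplified URL glob into a compiled regular expression.

    Assumed semantics (an approximation, not the crawler's real matcher):
    `**` matches any characters including `/`, `*` matches any characters
    except `/`, and every other character is matched literally.
    """
    parts = []
    i = 0
    while i < len(glob):
        if glob.startswith("**", i):
            parts.append(".*")
            i += 2
        elif glob[i] == "*":
            parts.append("[^/]*")
            i += 1
        else:
            parts.append(re.escape(glob[i]))
            i += 1
    return re.compile("".join(parts))


def matches(glob: str, url: str) -> bool:
    """Return True only if the whole URL matches the glob."""
    return glob_to_regex(glob).fullmatch(url) is not None


GLOB = "https://myip.ms/browse/sites/*/rank/100000/rankii/500000/own/376714"
PAGE_1 = "https://myip.ms/browse/sites/1/rank/100000/rankii/500000/own/376714"

print(matches(GLOB, "https://myip.ms"))        # the start URL from the issue
print(matches(GLOB, PAGE_1))                   # the page the user wants to scrape
print(matches(GLOB, "https://myip.ms/about"))  # hypothetical unrelated link
```

Under these assumptions only the paginated URL matches, so a crawl that starts at the home page and finds no matching links there has nothing further to enqueue.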

I’ll close this issue for now, but feel free to reply here or open a new issue if you have further questions.

jiri.spilka

Just to add to this:

To scrape the website, I think it would be better to use Apify’s Web Scraper. However, you’ll need to implement the parsing logic and handle pagination. For guidance, check out the relevant course in the Academy.
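
As one possible way to handle the pagination part, the sketch below builds the paginated URLs directly from the pattern in the issue, so each page can be supplied as a start URL instead of being discovered through links. The page range is an assumption (the real number of pages is not stated in this thread), and `page_urls` is a hypothetical helper. This does not address the "Are you human?" verification mentioned earlier in the thread.

```python
# URL pattern from the issue, with the page number as a placeholder.
TEMPLATE = "https://myip.ms/browse/sites/{page}/rank/100000/rankii/500000/own/376714"


def page_urls(first_page: int, last_page: int) -> list[str]:
    """Return one URL per page number, first_page through last_page inclusive."""
    return [TEMPLATE.format(page=p) for p in range(first_page, last_page + 1)]


# Assumed range for illustration only; adjust to the real page count.
urls = page_urls(1, 3)
for url in urls:
    print(url)
```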

matt333

4 days ago

Thanks! Yeah, no matter what I try, the Website Content Crawler can't find the links on the pages. I'll try the Web Scraper instead.

Developer
Maintained by Apify

Actor Metrics

  • 3.9k monthly users

  • 718 stars

  • >99% runs succeeded

  • 2.2 days response time

  • Created in Mar 2023

  • Modified 15 hours ago