
Website Content Crawler
Pricing
Pay per usage

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
Can't get globs to work
Closed
Trying to scrape https://myip.ms/browse/sites/1/rank/100000/rankii/500000/own/376714.
Start URL: https://myip.ms
Glob: https://myip.ms/browse/sites/*/rank/100000/rankii/500000/own/376714
The * is for pagination. But the crawl finishes without even touching the glob.
2024-12-09T16:36:12.591Z INFO PlaywrightCrawler: Starting the crawler.
2024-12-09T16:36:30.797Z INFO No links found on https://myip.ms/.
2024-12-09T16:36:30.952Z INFO PlaywrightCrawler: All requests from the queue have been processed, the crawler will shut down.
matt333
Follow-up: I think it might be due to reCAPTCHA being triggered. All the sub-pages are returning a screenshot and the content of an "Are you human?" verification page.

Dušan Vystrčil (dusan.vystrcil)
Hi, thank you for using Website Content Crawler.
The problem is that the Actor doesn’t find any links on the visited pages that match the defined glob https://myip.ms/browse/sites/**/rank/100000/rankii/500000/own/376714. The starting URL https://myip.ms does not conform to this pattern, and since you have includeUrlGlobs set, the crawler ignores any links not fitting that glob. As a result, it discovers no new pages to crawl, makes no additional requests, and scrapes nothing. If you want the Actor to go through other links, you’ll need to adjust or remove the includeUrlGlobs, or use a URL that matches the pattern right from the start.
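To illustrate the filtering described above, here is a minimal Python sketch. It is not the Actor's actual matcher: the glob_to_regex and filter_links helpers are hypothetical, the glob semantics are simplified (assuming "**" matches anything and "*" matches anything except "/"), and the /about link is made up for the example.

```python
import re

def glob_to_regex(glob: str) -> re.Pattern:
    """Translate a simplified URL glob into a compiled regex.

    Assumed semantics (a simplification, not the Actor's real matcher):
    '**' matches any characters, '*' matches any characters except '/'.
    """
    pattern = ""
    # Split on wildcards while keeping them; '**' is tried before '*'.
    for part in re.split(r"(\*\*|\*)", glob):
        if part == "**":
            pattern += ".*"
        elif part == "*":
            pattern += "[^/]*"
        else:
            pattern += re.escape(part)
    return re.compile(pattern)

def filter_links(links, include_globs):
    """Keep only the links that fully match at least one include glob."""
    regexes = [glob_to_regex(g) for g in include_globs]
    return [url for url in links if any(r.fullmatch(url) for r in regexes)]

INCLUDE_GLOB = "https://myip.ms/browse/sites/*/rank/100000/rankii/500000/own/376714"

# Hypothetical set of links a crawler might consider.
candidates = [
    "https://myip.ms",        # the start URL from the issue
    "https://myip.ms/about",  # made-up link, for illustration only
    "https://myip.ms/browse/sites/2/rank/100000/rankii/500000/own/376714",
]

# Only the paginated listing URL survives the include filter.
matching = filter_links(candidates, [INCLUDE_GLOB])
print(matching)
```

Under these assumptions, neither the start URL nor the made-up /about link passes the filter, so a crawl that only ever sees such links has nothing to enqueue. Starting directly from a URL that already matches the glob, such as the page-1 listing URL from the original post, avoids that first hurdle, although it would not help if the pages themselves return a verification page.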
I’ll close this issue for now, but feel free to reply here or open a new issue if you have further questions.

Just to add to this:
To scrape the website, I think it would be better to use Apify’s Web Scraper. However, you’ll need to implement the parsing logic and handle pagination. For guidance, check out the relevant course in the Academy.
matt333
Thanks! Yeah, no matter what I try, the Website Content Crawler can't find the links on the pages. I'll try the Web Scraper instead.