Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
Trying to scrape https://myip.ms/browse/sites/1/rank/100000/rankii/500000/own/376714.
Set start URL as: https://myip.ms
Glob: https://myip.ms/browse/sites/*/rank/100000/rankii/500000/own/376714
The * is for pagination. But the crawl finishes without even touching the glob.
2024-12-09T16:36:12.591Z INFO PlaywrightCrawler: Starting the crawler.
2024-12-09T16:36:30.797Z INFO No links found on https://myip.ms/.
2024-12-09T16:36:30.952Z INFO PlaywrightCrawler: All requests from the queue have been processed, the crawler will shut down.
Follow-up: I think it might be due to reCAPTCHA being triggered. All the sub-pages are returning a screenshot and the content of an "Are you human?" verification page.
Hi, thank you for using Website Content Crawler.
The problem is that the Actor doesn’t find any links on the visited pages that match the defined glob https://myip.ms/browse/sites/**/rank/100000/rankii/500000/own/376714. The starting URL https://myip.ms does not conform to this pattern, and since you have includeUrlGlobs set, the crawler ignores any links not fitting that glob. As a result, it discovers no new pages to crawl, makes no additional requests, and scrapes nothing. If you want the Actor to go through other links, you’ll need to adjust or remove includeUrlGlobs, or use a URL that matches the pattern right from the start.
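To check a start URL against a glob before launching a run, a rough local test can help. The sketch below is only an illustration under simplified assumptions: it treats `**` as "any characters, including `/`" and `*` as "any characters except `/`", which approximates common glob behavior but is not the Actor's actual matcher. The helper names `glob_to_regex` and `matches` are hypothetical.

```python
import re


def glob_to_regex(glob: str) -> re.Pattern:
    """Translate a simplified URL glob into a compiled regex.

    Assumed semantics (a simplification, not the Actor's real matcher):
    `**` matches any characters including `/`, `*` matches any run of
    characters except `/`, and everything else is matched literally.
    """
    parts = []
    i = 0
    while i < len(glob):
        if glob.startswith("**", i):
            parts.append(".*")
            i += 2
        elif glob[i] == "*":
            parts.append("[^/]*")
            i += 1
        else:
            parts.append(re.escape(glob[i]))
            i += 1
    return re.compile("".join(parts))


def matches(glob: str, url: str) -> bool:
    """Return True if the whole URL matches the glob."""
    return glob_to_regex(glob).fullmatch(url) is not None


# Values taken from this thread.
GLOB = "https://myip.ms/browse/sites/**/rank/100000/rankii/500000/own/376714"
START_URL = "https://myip.ms"
PAGE_1 = "https://myip.ms/browse/sites/1/rank/100000/rankii/500000/own/376714"

print(matches(GLOB, START_URL))  # False: the homepage does not fit the glob
print(matches(GLOB, PAGE_1))     # True: the first paginated URL does
```

Under these assumptions, starting from the first paginated URL rather than the homepage gives the crawler a start URL that already fits the pattern.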
I’ll close this issue for now, but feel free to reply here or open a new issue if you have further questions.
Just to add to this:
To scrape the website, I think it would be better to use Apify’s Web Scraper. However, you’ll need to implement the parsing logic and handle pagination. For guidance, check out the relevant course in the Academy.
Thanks! Yeah, no matter what I try, the Website Content Crawler can't find the links on the pages. I'll try the Web Scraper instead.
Actor Metrics
3.9k monthly users
718 stars
>99% runs succeeded
2.2 days response time
Created in Mar 2023
Modified 15 hours ago