Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
I was trying to collect the text, but no matter how many times I tried or how many versions I used, every run either returned a null result or had my access blocked. I am attaching one of the run results below.
Hi, thank you for your interest in this Actor.
I tried scraping the URL from your run, but unfortunately, I’m also getting blocked. Given the website you’re targeting, it’s not surprising.
I’ve tested various proxies (datacenter and residential, from different countries, and with different configurations), but I’m blocked every time.
Some sites simply have strong anti-scraping protections that we’re unable to bypass. I apologize for this.
Thanks for addressing the issue.
Is there any other way I can collect the data from the site?
As of now, we can’t bypass the blocking on this particular website. In the long run, we plan to focus more on handling blocking/unblocking for certain sites, but this will take some time.
If you have dev skills, you could try scraping the data yourself, e.g. using the Crawlee library, preferably from your local machine. However, if you’re not located in the country that particular domain belongs to, this may be challenging. I’m sorry I can’t be of more help.
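To give a rough idea of what a local attempt could look like, here is a minimal sketch. It deliberately uses only the Python standard library rather than Crawlee, so it is not the Crawlee API and not how this Actor works internally. The names `TextExtractor`, `html_to_text`, and `fetch_text` are made up for illustration, and the browser-like `User-Agent` header is an assumption that will not get past real anti-scraping protection.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def html_to_text(html: str) -> str:
    """Return the text nodes of an HTML document, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def fetch_text(url: str, timeout: float = 30.0) -> str:
    """Download one page and return its text (hypothetical helper).

    urlopen raises urllib.error.HTTPError for responses such as 403 or 429,
    which is what a blocked request usually looks like from the client side.
    """
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})  # assumed header
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return html_to_text(response.read().decode(charset, errors="replace"))


# Example call (placeholder URL, not the site from this issue):
# print(fetch_text("https://example.com"))
```

If a plain request like this is blocked as well, a browser-based crawler run from a connection in the target country is the next thing to try, though there is no guarantee it will succeed either.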
I’ll close this issue for now. Please let me know if you have any other questions.
- 3.8k monthly users
- 636 stars
- 100.0% runs succeeded
- 2.7 days response time
- Created in Mar 2023
- Modified 7 days ago