Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
We are trying to crawl https://www.hudexchange.info/faqs/ but are getting blocked. From local testing, setting a User-Agent header appears to solve the problem. This website uses AWS CloudFront, so I suspect they have something like this WAF rule in front: https://docs.aws.amazon.com/waf/latest/developerguide/aws-managed-rule-groups-baseline.html#aws-managed-rule-groups-baseline-crs
How can I set a custom user agent?
Thanks for the help
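For readers who want to repeat the local check described in the question, here is a minimal sketch using only the Python standard library. The `BROWSER_UA` string and the helper names `build_request` and `fetch_status` are illustrative assumptions, not something taken from the thread, and the status codes the site actually returns may differ from what the poster saw.

```python
import urllib.error
import urllib.request

# Hypothetical browser-like User-Agent string, chosen only for illustration.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_request(url, user_agent=None):
    """Build a GET request, optionally with an explicit User-Agent header."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    return urllib.request.Request(url, headers=headers)


def fetch_status(url, user_agent=None):
    """Return the HTTP status code for the URL (performs a network call)."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=30) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Blocked requests typically surface here, e.g. as a 403.
        return err.code
```

Calling `fetch_status(url)` and then `fetch_status(url, BROWSER_UA)` for the same URL lets you compare the two responses. Without an explicit header, urllib sends its own default `Python-urllib/x.y` User-Agent.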
Hi, thank you for using the Website Content Crawler.
I’ve checked your runs, and not all of them were blocked. When using residential US proxies, the request wasn’t blocked, but the page didn’t load correctly. This issue occurs because the Actor attempts to expand clickable elements, which can sometimes trigger unintended navigation if a button-like element is clicked.
In this case, the Actor clicks the Grantees button in the upper-right corner. You can disable this behavior by setting the input under HTML Processing > Expand clickable elements to a non-existent CSS selector, like so: dont.click.please
Please see my run here
"proxyConfiguration": {
  "useApifyProxy": true,
  "apifyProxyGroups": [
    "RESIDENTIAL"
  ],
  "apifyProxyCountry": "US"
},
and
"clickElementsCssSelector": "dont.click.please",
Kudos to @jindrich.bar.
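As a rough sketch, the two settings above can be combined into a single Actor input. The `proxyConfiguration` and `clickElementsCssSelector` values are copied from this reply. The `startUrls` field and the helper name `build_crawler_input` are assumptions added here for illustration, so check them against the Actor's input schema before relying on them.

```python
import json


def build_crawler_input(start_url):
    """Assemble an input dict that uses the settings suggested in this thread."""
    return {
        # Assumed field name for the list of pages to crawl.
        "startUrls": [{"url": start_url}],
        # From the reply: residential US proxies avoided the block.
        "proxyConfiguration": {
            "useApifyProxy": True,
            "apifyProxyGroups": ["RESIDENTIAL"],
            "apifyProxyCountry": "US",
        },
        # From the reply: a selector that matches nothing, so no element gets clicked.
        "clickElementsCssSelector": "dont.click.please",
    }


run_input = build_crawler_input("https://www.hudexchange.info/faqs/")
print(json.dumps(run_input, indent=2))
```

If you start runs from code instead of the Console, a dict like this could be passed as `run_input` to the Apify Python client, for example `ApifyClient(token).actor("apify/website-content-crawler").call(run_input=run_input)`. Treat the Actor ID and the token handling there as assumptions to verify for your account.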
As for your original question, it isn’t possible to set the user-agent, and we have no plans to add this option in the near future. However, the above solution should work without it.
Please let me know if this resolves the issue or if you have any further questions.
That worked! Thank you so much for the help
I’m glad to hear that! Happy to help! I’ll go ahead and close this issue now. Let us know if you have any other questions.
- 3.8k monthly users
- 636 stars
- 100.0% runs succeeded
- 2.7 days response time
- Created in Mar 2023
- Modified 7 days ago