Website Content Crawler

apify/website-content-crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

Custom user agent

Closed

civic-roundtable opened this issue
14 days ago

We are trying to crawl https://www.hudexchange.info/faqs/ but getting blocked. From local testing, setting a user-agent header appears to solve the problem. This website uses AWS CloudFront, so I suspect they have something like this WAF rule in front: https://docs.aws.amazon.com/waf/latest/developerguide/aws-managed-rule-groups-baseline.html#aws-managed-rule-groups-baseline-crs

How can I set a custom user agent?

Thanks for the help

jiri.spilka

Hi, thank you for using the Website Content Crawler.

I’ve checked your runs, and not all of them were blocked. When using residential US proxies, the request wasn’t blocked, but the page didn’t load correctly. This issue occurs because the Actor attempts to expand clickable elements, which can sometimes trigger unintended navigation if a button-like element is clicked.

In this case, the Actor clicks the Grantees button in the upper-right corner, which navigates away from the page. To disable this behavior, set the "Expand clickable elements" input (under HTML Processing) to a CSS selector that matches nothing on the page, for example: dont.click.please.

Please see my run here

"proxyConfiguration": {
  "useApifyProxy": true,
  "apifyProxyGroups": [
    "RESIDENTIAL"
  ],
  "apifyProxyCountry": "US"
},

and

"clickElementsCssSelector": "dont.click.please",

Kudos to @jindrich.bar.
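
For anyone starting the Actor from code rather than the Console, the two settings above can be combined into a single run input. The following is a minimal sketch, not an official example: the helper names build_run_input and run_crawler are hypothetical, the startUrls field is assumed to follow the Actor's usual input shape, and the run_crawler function assumes the third-party apify-client package is installed and that you supply your own API token.

```python
import json

# Actor identifier as shown on the Actor page.
ACTOR_ID = "apify/website-content-crawler"


def build_run_input(start_url: str) -> dict:
    """Build a run input with the two settings suggested in this thread.

    Hypothetical helper; "startUrls" is an assumption about the input shape.
    """
    return {
        "startUrls": [{"url": start_url}],
        # Residential US proxies, as in the maintainer's run.
        "proxyConfiguration": {
            "useApifyProxy": True,
            "apifyProxyGroups": ["RESIDENTIAL"],
            "apifyProxyCountry": "US",
        },
        # A selector that matches nothing, so no elements get clicked.
        "clickElementsCssSelector": "dont.click.please",
    }


def run_crawler(token: str, start_url: str) -> dict:
    """Start the Actor and wait for it to finish (requires apify-client)."""
    # Imported lazily so the input builder works without the package.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    return client.actor(ACTOR_ID).call(run_input=build_run_input(start_url))


if __name__ == "__main__":
    # Print the input only; calling run_crawler needs a real API token.
    print(json.dumps(build_run_input("https://www.hudexchange.info/faqs/"), indent=2))
```

Printing the input first makes it easy to compare against the JSON shown in the Console before spending any proxy credits on a real run.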

As for your original question, it isn’t possible to set the user-agent, and we have no plans to add this option in the near future. However, the above solution should work without it.

Please let me know if this resolves the issue or if you have any further questions.

alden

12 days ago

That worked! Thank you so much for the help

jiri.spilka

I’m glad to hear that! Happy to help! I’ll go ahead and close this issue now. Let us know if you have any other questions.
