Website Content Crawler

Pricing

Pay per usage

apify/website-content-crawler

Developed by

Apify

Maintained by Apify

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

4.6 (38)

1.1k

Monthly users

6k

Runs succeeded

>99%

Response time

2.3 days

Last modified

7 days ago

Block Detection and Proxy IP Session Rotation

Closed
oren_clearya opened this issue
a month ago

Hey, we've noticed that when we have multiple start URLs, the first one gets scraped successfully, but the others get blocked.

We use:

  • crawlerType of playwright:adaptive (default).
  • proxyConfiguration of {"useApifyProxy":true} (default).
  • maxSessionRotations of 10 (default).
  • maxConcurrency of 1.

Could it be that maxSessionRotations is not working as expected? Is there a way to force the Actor to rotate IPs before scraping each URL?
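For reference, the settings listed in this post can be written out as a single Actor input object. The following is a minimal sketch: the field names for the four settings come from the post, while `startUrls` (as a list of `{"url": ...}` objects), the helper name `build_run_input`, and the example URLs are assumptions added for illustration.

```python
import json


def build_run_input(start_urls):
    """Build an input object mirroring the settings listed in the issue.

    `start_urls` is a list of URL strings; the `startUrls` shape used here
    is an assumption, not something stated in the thread.
    """
    return {
        "startUrls": [{"url": url} for url in start_urls],
        "crawlerType": "playwright:adaptive",           # default, per the post
        "proxyConfiguration": {"useApifyProxy": True},  # default, per the post
        "maxSessionRotations": 10,                      # default, per the post
        "maxConcurrency": 1,
    }


if __name__ == "__main__":
    # Placeholder URLs, not taken from the original report.
    run_input = build_run_input(["https://example.com/a", "https://example.com/b"])
    print(json.dumps(run_input, indent=2))
```

Printing the object as JSON makes it easy to paste into the Actor's input editor and compare against the defaults.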

jakub.kopecky

Hi, thank you for using the Website Content Crawler.

Website Content Crawler (WCC) is a general-purpose tool that deals with CAPTCHAs passively: it tries to avoid triggering them and does not click through or solve challenges. Some sites, such as Walmart, still need special treatment. For those, check out our Walmart scrapers in the Apify Store.

Let me know if you need help!

Jakub Kopecky

oren_clearya

a month ago

Thanks Jakub, I managed to scrape Walmart just fine with the apify/website-content-crawler Actor. My point is that when I have multiple start URLs, only the first one succeeds and the rest get blocked. So it seems as if the Actor doesn't rotate the proxy IPs according to the maxSessionRotations property.

Is that the expected behavior?

jakub.kopecky

Hey,

Glad to hear you successfully scraped Walmart. Yes, that’s expected - the crawler doesn’t recognize Walmart’s CAPTCHA response as a block, so it doesn’t rotate the session.

Closing this issue, feel free to reopen if needed.

Jakub
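The behavior described in this reply can be illustrated with a toy model. This is a sketch only, not the Actor's actual implementation: it assumes that a session is rotated only when a response is detected as blocked, and that detection looks at HTTP status codes alone. The names `fetch_with_rotation`, `fake_captcha_site`, and `fake_forbidden_site`, and the set of status codes, are all hypothetical.

```python
from dataclasses import dataclass

# Assumption: only these status codes count as a "block" in this toy model.
BLOCKED_STATUS_CODES = {401, 403, 429}


@dataclass
class Response:
    status: int
    body: str


def is_detected_as_blocked(response):
    """Return True only when the status code is on the blocked list."""
    return response.status in BLOCKED_STATUS_CODES


def fetch_with_rotation(fetch, url, max_session_rotations=10):
    """Fetch `url`, switching to a new session only after a detected block.

    Returns a (response, rotations_used) tuple.
    """
    session_id = 0
    rotations = 0
    while True:
        response = fetch(url, session_id)
        if not is_detected_as_blocked(response) or rotations >= max_session_rotations:
            return response, rotations
        session_id += 1
        rotations += 1


def fake_captcha_site(url, session_id):
    """Hypothetical site that answers session 0 with a CAPTCHA page and status 200."""
    if session_id == 0:
        return Response(200, "captcha challenge")
    return Response(200, "page content")


def fake_forbidden_site(url, session_id):
    """Hypothetical site that answers session 0 with an explicit 403."""
    if session_id == 0:
        return Response(403, "forbidden")
    return Response(200, "page content")


if __name__ == "__main__":
    # The CAPTCHA page is returned as-is, because it is not detected as a block.
    print(fetch_with_rotation(fake_captcha_site, "https://example.com/second"))
    # The 403 is detected, so one rotation happens and the retry succeeds.
    print(fetch_with_rotation(fake_forbidden_site, "https://example.com/second"))
```

In this model a CAPTCHA page served with a success status never triggers a rotation, however high `max_session_rotations` is set, which matches the explanation given above.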


This Actor is paid per platform usage. The Actor is free to use, and you only pay for the Apify platform usage.