Website Content Crawler

Pricing

Pay per usage

Developed by

Apify

Maintained by Apify

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

4.0 (40)

1392

Total users

53K

Monthly users

7.9K

Runs succeeded

>99%

Issues response

6.8 days

Last modified

4 days ago

Getting 403 from public page

Open

formidable_quagmire opened this issue
7 days ago

Until recently, running this against our source worked without any issue. Now the crawler is being blocked, even with a residential proxy.

jindrich.bar

Hello, and thank you for your interest in this Actor!

You're getting blocked by Cloudflare Bot Management. Their current settings appear to be quite strict, which is why WCC can no longer get through, even with residential proxies. Unfortunately, not much can be done about this directly with your current setup.
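As a rough illustration of how such a block can be told apart from an ordinary 403, the sketch below inspects a response's status code and headers. It is not part of WCC; the function name `looks_like_cloudflare_block` is hypothetical, and the check is only a heuristic that assumes Cloudflare identifies itself through the `Server: cloudflare` header or flags a challenge with `cf-mitigated: challenge`.

```python
from typing import Mapping


def looks_like_cloudflare_block(status: int, headers: Mapping[str, str]) -> bool:
    """Heuristic (hypothetical helper): does this response look like a Cloudflare block?"""
    # HTTP header names are case-insensitive, so normalize them before lookup.
    normalized = {name.lower(): value.lower() for name, value in headers.items()}

    # Cloudflare-proxied sites normally answer with `Server: cloudflare`.
    served_by_cloudflare = normalized.get("server", "") == "cloudflare"

    # Challenge responses carry a `cf-mitigated: challenge` header.
    challenged = normalized.get("cf-mitigated", "") == "challenge"

    # Treat an explicit challenge, or a 403/503 served by Cloudflare, as a block.
    return challenged or (status in (403, 503) and served_by_cloudflare)
```

If a check like this returns `True` for the status and headers of the failing request, switching proxies alone is unlikely to help, which matches the behavior described above.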

I’d recommend trying our new Camoufox Scraper, which is specifically designed to handle these types of challenges. You can check out my example run here - it seems to have bypassed the bot filter and scraped the full content of the target page. Let me know if you need help setting it up.

I'll close this issue now, but feel free to ask additional questions if you have any. Cheers!

formidable_quagmire

7 days ago

Yes, we will definitely need help setting up a couple of things in the new scraper. We use a lot of WCC features, such as the CSS remover, file download, and Markdown output. Will these still work? If so, I can't seem to find documentation for them; it would be great if you could share it.

formidable_quagmire

6 days ago

Could you build "Camoufox Scraper" into WCC? Or at least provide similar input and output parity with WCC?