
Website Content Crawler
Pricing
Pay per usage

Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
4.6 (38)
Monthly users: 1.1k
Runs succeeded: >99%
Response time: 2.3 days
Last modified: 7 days ago
Access to failed urls
It'd be good to have access to failed URLs. We are scraping a few thousand URLs, and the dataset only exposes the successfully scraped ones. Once the failed URLs are exposed, we could use an integration to get them into a table/sheet and retry them exclusively.

Hi, thank you for using Website Content Crawler.
There is actually an Actor for this specific task: Rebirth Failed Requests, which resurrects the failed requests from a previous run.
Please try this Actor and let me know if it works for you.
Jakub
vnandan
I'd like this data to be available in the dataset so it can be accessed directly and through integrations.

Hi,
After you run the Rebirth Failed Requests Actor, the requests that were resurrected are available from the run's request queue (see Storage -> Request queue). You can access them via an API; please see https://docs.apify.com/platform/storage/request-queue.
Jakub
Callen
Hi Jakub,
Thank you for your email.
Please contact Vlad, who is copied on this email, for further information.
Thank you.
Cal
vnandan
That doesn't solve my problem: sometimes we have just 1-2 failed URLs, and sometimes a few dozen. We want to compile the failed URLs together and run them in a single go, to make tracking easier and to automate the process. This is only possible if we have access to the failed URLs from the original run.

Hi,
Using the Rebirth Failed Requests Actor is currently the only way to list or retry failed requests.
When you run the Rebirth Failed Requests Actor, you supply the run ID of the original Website Content Crawler run (the one that contains failed requests). After the Rebirth Failed Requests Actor finishes, the original run's request queue will contain the failed URLs. You can retrieve them via the API, as I mentioned, and use them as input for a new Website Content Crawler run. Alternatively, you can resurrect the original run, and the crawler will process only the failed requests left in the request queue.
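Once you have the failed URLs, they can be turned into input for a new run. A minimal sketch of building that payload; `startUrls` as a list of `{"url": ...}` objects is the crawler's standard input field, and all other options are left at their defaults here:

```python
import json


def make_crawler_input(failed_urls: list[str]) -> dict:
    """Build an input payload for a new Website Content Crawler run.

    Only `startUrls` is set; crawler type, page limits, and other
    options would be added alongside it as needed.
    """
    return {"startUrls": [{"url": u} for u in failed_urls]}


payload = make_crawler_input([
    "https://example.com/failed-1",
    "https://example.com/failed-2",
])
print(json.dumps(payload, indent=2))
```

The resulting JSON can be passed to the run-Actor API endpoint or pasted into the Actor's input editor in the Apify Console.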
Jakub
Pricing
Pricing model
Pay per usage
This Actor is paid per platform usage. The Actor is free to use, and you only pay for the Apify platform usage.