Website Content Crawler

Pricing

Pay per usage


apify/website-content-crawler

Developed and maintained by Apify

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

Rating: 4.6 (38)

Pricing: Pay per usage

1.1k

Monthly users: 6k

Runs succeeded: >99%

Response time: 2.3 days

Last modified: 7 days ago


Access to failed URLs

Closed
nandanglobal opened this issue
a month ago

It'd be good to have access to failed URLs. We are scraping a few thousand URLs, but the dataset only exposes the successfully scraped ones. Once the failed URLs were exposed, we could use an integration to collect them in a table/sheet and retry them exclusively.

jakub.kopecky

Hi, thank you for using Website Content Crawler.

There is actually an Actor for this specific task of resurrecting failed requests: Rebirth Failed Requests.

Please try this Actor and let me know if it works for you.

Jakub


vnandan

a month ago

I'd like this data to be available in the dataset so it can be accessed directly and through integrations.

jakub.kopecky

Hi,

After you run the Rebirth Failed Requests Actor, the requests that were resurrected are available from the run's request queue (see Storage -> Request queue). You can access them via an API; please see https://docs.apify.com/platform/storage/request-queue.

Jakub
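To make the API step above concrete, here is a minimal Python sketch of pulling the run's request-queue records and keeping only the URLs that ended with errors. The endpoint path and the `errorMessages` field are assumptions based on the Apify request-queue documentation linked above; verify both against a real run before relying on them.

```python
def failed_urls(requests):
    """Return URLs of request records whose ``errorMessages`` list is non-empty.

    Assumes each record carries a ``url`` and, for failed requests, a
    non-empty ``errorMessages`` list (field names are assumptions; check
    them against your own request-queue data).
    """
    return [r["url"] for r in requests if r.get("errorMessages")]


# Live call (untested sketch; QUEUE_ID and TOKEN are placeholders):
#   import json, urllib.request
#   url = f"https://api.apify.com/v2/request-queues/{QUEUE_ID}/requests?token={TOKEN}"
#   items = json.load(urllib.request.urlopen(url))["data"]["items"]
#   print(failed_urls(items))

# Offline demonstration with mock queue records:
sample = [
    {"url": "https://example.com/a", "errorMessages": []},
    {"url": "https://example.com/b", "errorMessages": ["Timed out after 3 retries"]},
]
print(failed_urls(sample))  # ['https://example.com/b']
```

The filtering is kept separate from the HTTP call so the same helper works whether you fetch the queue via plain `urllib`, `requests`, or the official Apify client.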


Callen

a month ago

Hi Jakub,

Thank you for your email.

Please contact Vlad, who is copied on this email, for further information.

Thank you.

Cal


vnandan

a month ago

That doesn’t solve my problem. Sometimes we have just 1-2 failed URLs, and sometimes we have a few dozen. We want to compile the failed URLs together and retry them in a single run, which makes tracking easier and lets us automate the process. This is only possible if we have access to the failed URLs from the original run.

jakub.kopecky

Hi,

Using the Rebirth Failed Requests Actor is currently the only way to list or retry failed requests.

When you run the Rebirth Failed Requests Actor, you supply the run ID of the original Website Content Crawler run (the one that contains failed requests). After the Rebirth Failed Requests Actor finishes, the original run's request queue will contain the list of failed URLs. You can retrieve them via the API, as I mentioned, and use them as input for a new Website Content Crawler run. Alternatively, you can resurrect the original run, and the crawler will crawl only the failed requests remaining in the request queue.

Jakub
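The retry step described above can be sketched as follows: turn the recovered failed URLs into the input for a fresh Website Content Crawler run. The `startUrls` list of `{"url": ...}` objects matches the Actor's documented input shape; the `crawlerType` setting shown is a hypothetical carried-over option, not something from this thread.

```python
def build_retry_input(failed, original_input=None):
    """Build input for a follow-up Website Content Crawler run that
    crawls only the previously failed URLs.

    ``startUrls`` as a list of ``{"url": ...}`` objects follows the
    Actor's input schema; any other settings are copied from the
    original run's input so the retry behaves the same way.
    """
    retry_input = dict(original_input or {})
    retry_input["startUrls"] = [{"url": u} for u in failed]
    return retry_input


# Example: retry two failed URLs while reusing the original crawl settings.
print(build_retry_input(
    ["https://example.com/b", "https://example.com/c"],
    {"crawlerType": "playwright:firefox"},  # hypothetical original setting
))
```

You would then pass the resulting dict as the input of a new run, e.g. via the "Run Actor" API endpoint or the Apify client.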

Pricing

Pricing model: Pay per usage

This Actor is priced per platform usage: the Actor itself is free to use, and you pay only for the Apify platform resources it consumes.