Website Content Crawler

Pricing

Pay per usage


apify/website-content-crawler

Developed and maintained by Apify

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

Rating: 4.6 (38)

Pricing: Pay per usage

1.1k

Monthly users: 6k

Runs succeeded: >99%

Response time: 2.3 days

Last modified: 7 days ago


Access to failed URLs

Closed
nandanglobal opened this issue
a month ago

It'd be good to have access to failed URLs. We are scraping a few thousand URLs, but the dataset only exposes the successfully scraped ones. Once the failed URLs were exposed, we could use an integration to collect them in a table/sheet and retry them exclusively.

jakub.kopecky

Hi, thank you for using Website Content Crawler.

There is actually an Actor for this specific task of resurrecting failed requests: Rebirth Failed Requests.

Please try this Actor and let me know if it works for you.

Jakub


vnandan

a month ago

I'd like this data to be available in the dataset so it can be accessed directly and through integrations.

jakub.kopecky

Hi,

After you run the Rebirth Failed Requests Actor, the requests that were resurrected are available from the run's request queue (see Storage -> Request queue). You can access them via an API; please see https://docs.apify.com/platform/storage/request-queue.

Jakub
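To make the API step above concrete, here is a minimal Python sketch of pulling the run's request-queue records and keeping only the URLs that ended with errors. The endpoint path and the `errorMessages` field are assumptions based on the Apify request-queue documentation linked above; verify both against a real run before relying on them.

```python
def failed_urls(requests):
    """Return URLs of request records whose ``errorMessages`` list is non-empty.

    Assumes each record carries a ``url`` and, for failed requests, a
    non-empty ``errorMessages`` list (field names are assumptions; check
    them against your own request-queue data).
    """
    return [r["url"] for r in requests if r.get("errorMessages")]


# Live call (untested sketch; QUEUE_ID and TOKEN are placeholders):
#   import json, urllib.request
#   url = f"https://api.apify.com/v2/request-queues/{QUEUE_ID}/requests?token={TOKEN}"
#   items = json.load(urllib.request.urlopen(url))["data"]["items"]
#   print(failed_urls(items))

# Offline demonstration with mock queue records:
sample = [
    {"url": "https://example.com/a", "errorMessages": []},
    {"url": "https://example.com/b", "errorMessages": ["Timed out after 3 retries"]},
]
print(failed_urls(sample))  # ['https://example.com/b']
```

The filtering is kept separate from the HTTP call so the same helper works whether you fetch the queue via plain `urllib`, `requests`, or the official Apify client.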


Callen

a month ago

Hi Jakub,

Thank you for your email.

Please contact Vlad, who is copied on this email, for further information.

Thank you.

Cal


vnandan

a month ago

That doesn’t solve my problem. Sometimes we have just 1-2 failed URLs, and sometimes we have a few dozen. We want to compile the failed URLs together and retry them in a single run, which makes tracking easier and lets us automate the process. This is only possible if we have access to the failed URLs from the original run.

jakub.kopecky

Hi,

Using the Rebirth Failed Requests Actor is currently the only way to list or retry failed requests.

When you run the Rebirth Failed Requests Actor, you supply the run ID of the original Website Content Crawler run (the one that contains failed requests). After the Rebirth Failed Requests Actor finishes, the original run's request queue will contain the list of failed URLs. You can retrieve them via the API, as I mentioned, and use them as input for a new Website Content Crawler run. Alternatively, you can resurrect the original run, and the crawler will crawl only the failed requests remaining in the request queue.

Jakub
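The retry step described above can be sketched as follows: turn the recovered failed URLs into the input for a fresh Website Content Crawler run. The `startUrls` list of `{"url": ...}` objects matches the Actor's documented input shape; the `crawlerType` setting shown is a hypothetical carried-over option, not something from this thread.

```python
def build_retry_input(failed, original_input=None):
    """Build input for a follow-up Website Content Crawler run that
    crawls only the previously failed URLs.

    ``startUrls`` as a list of ``{"url": ...}`` objects follows the
    Actor's input schema; any other settings are copied from the
    original run's input so the retry behaves the same way.
    """
    retry_input = dict(original_input or {})
    retry_input["startUrls"] = [{"url": u} for u in failed]
    return retry_input


# Example: retry two failed URLs while reusing the original crawl settings.
print(build_retry_input(
    ["https://example.com/b", "https://example.com/c"],
    {"crawlerType": "playwright:firefox"},  # hypothetical original setting
))
```

You would then pass the resulting dict as the input of a new run, e.g. via the "Run Actor" API endpoint or the Apify client.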

Pricing

Pricing model: Pay per usage

This Actor is priced per platform usage: the Actor itself is free to use, and you pay only for the Apify platform resources it consumes.