Website Content Crawler

Developed and maintained by Apify

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

Rating: 4.0 (41)

Pricing: Pay per usage

Total users: 62K

Monthly users: 8.2K

Runs succeeded: >99%

Issues response: 7.9 days

Last modified: 2 hours ago


Predict the number of pages before running the actor

Closed

alex_simas opened this issue a year ago

Hello,

I'm testing your crawler, and its content extraction capability is really fantastic. However, would it be possible to know the number of pages on a website before running the actor? I searched the documentation but couldn't find anything.

jindrich.bar

Hello Alex, and thank you for your interest in this Actor!

Unfortunately, Website Content Crawler cannot do this. Like any other web crawler, WCC discovers pages only by following the links on pages it has already visited, one new page at a time, and it stops only once every link leads to an already discovered page.

The only idea I have right now is sitemap discovery: by parsing the website's sitemap, you should get a list of all its URLs, and therefore a count of all its pages. However, Website Content Crawler doesn't do this, as we haven't seen the need for it yet.
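If you want to experiment with the sitemap route yourself, a minimal sketch along these lines could work. This is separate from the Actor; the sitemap URL is an assumption, and many sites either don't publish a sitemap or split it across nested sitemap index files:

```python
# Minimal sketch: count the pages a website advertises in its sitemap.
# The sitemap URL is an assumption - adjust it per site, and note that the
# sitemap may be incomplete or missing entirely.
import urllib.request
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def count_sitemap_urls(sitemap_url: str) -> int:
    with urllib.request.urlopen(sitemap_url) as response:
        root = ET.fromstring(response.read())
    # A <sitemapindex> points to nested sitemaps; a <urlset> lists pages directly.
    if root.tag.endswith("sitemapindex"):
        return sum(
            count_sitemap_urls(loc.text)
            for loc in root.findall("sm:sitemap/sm:loc", NS)
        )
    return len(root.findall("sm:url/sm:loc", NS))

print(count_sitemap_urls("https://example.com/sitemap.xml"))
```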

Could you share your ideas about this feature - most importantly, what's your use case for this? Thanks again!


alex_simas

a year ago

Hello Jindrich Bär, thanks for the quick response.

I imagined it would be like this. Your question about the use case is in fact the most important one, and the next time I use this space I will make sure to explain up front why I am asking.

Use case: estimating the cost and duration of scraping with a given set of infrastructure resources.

Maybe this should be my question instead: is there any way to predict the cost and execution time of scraping with 1 GB of CPU and 4 GB of memory?

I know there are more variables that can affect this forecast, but since I am going to offer the service to end customers, I wanted some predictability so that I can price it more accurately.

jindrich.bar

Alright, now it makes perfect sense!

The cost and scraping time depend on the following three variables:

  • The website size (more pages = longer and more expensive crawl)
    • this is quite hard to estimate, see above
  • The amount of RAM available (more RAM = more room for concurrent processes, so faster processing, but you pay more per hour)
  • The crawler type (Firefox and Chrome take some time to load each page; Cheerio loads pages almost instantly, but doesn't execute client-side JS)
    • you want to go for Cheerio whenever possible, but it isn't always an option (e.g. on pages that load content dynamically)

On our testing account, the average price per result is $0.011 (or 12 seconds) and the median is $0.006 (or 3 seconds). Most of the runs behind these statistics use Firefox with 8 GB of RAM and come from reasonably long crawls (the Actor takes a few seconds to spin up, so short runs are less efficient).
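As a back-of-the-envelope illustration, you can combine a page-count estimate with those per-result averages. The page count below is a made-up example (it could come from a sitemap count like the one sketched above), and the per-result figures already reflect the concurrency of typical runs, so treat the result as a ballpark only:

```python
# Rough order-of-magnitude estimate built from the averages quoted above.
estimated_pages = 5_000          # hypothetical site size (e.g. from a sitemap count)
price_per_result_usd = 0.011     # average price per result quoted above
seconds_per_result = 12          # average time per result quoted above

estimated_cost_usd = estimated_pages * price_per_result_usd
estimated_hours = estimated_pages * seconds_per_result / 3600

print(f"~${estimated_cost_usd:,.0f} and roughly {estimated_hours:.1f} hours")
```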

You can always check the "Monitoring" tab (https://console.apify.com/actors/aYG0l9s7dbB7j3gbS/monitoring), which shows statistics about your runs. In the Stats per run section, you can find the Duration/Cost per result chart, which shows the average price for crawling one website (based on your last 200 runs).
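If you prefer to pull similar numbers programmatically, a rough sketch with the Apify API client for Python could look like the one below. The field names (usageTotalUsd, defaultDatasetId, itemCount) are my reading of the run and dataset objects, so double-check them against the current API documentation:

```python
# Sketch: estimate your own average cost per result from recent runs of the
# Actor. Field names such as "usageTotalUsd" and "itemCount" are assumptions
# and should be verified against the current Apify API documentation.
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")
runs = client.actor("apify/website-content-crawler").runs().list(limit=200, desc=True).items

total_cost_usd = 0.0
total_results = 0
for run in runs:
    dataset = client.dataset(run["defaultDatasetId"]).get() or {}
    total_cost_usd += run.get("usageTotalUsd") or 0.0
    total_results += dataset.get("itemCount") or 0

if total_results:
    print(f"Average cost per result: ${total_cost_usd / total_results:.4f}")
```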

Feel free to ask any additional questions. Thanks!


alex_simas

a year ago

Thank you very much for your answer, Jindřich Bär!

In fact, it was quite enlightening for me. I have only been working with this technology for a short time, and I admit I have some technical gaps to close before I can make the best use of the crawler.

The service's documentation is very rich, and I'm going to dive deeper into this incredible tool. I've set out to create several website scenarios that I intend to scrape so I can explore my use case empirically, vary the settings, and then evaluate the statistics for each scenario.