Website Content Crawler
Deep crawl websites and extract clean text, Markdown, or HTML for LLMs, RAG, and AI apps. Removes navigation, ads, cookie banners. Supports headless browser & HTTP. Sitemap discovery, URL scoping, file downloads. Feed ChatGPT, LangChain, LlamaIndex, Pinecone. The cheapest content crawler on Apify.
Pricing: from $0.70 / 1,000 pages scraped
Website Content Crawler is an Apify Actor that performs deep crawls of websites and extracts clean text content from web pages. It is designed for feeding large language models (LLMs), RAG pipelines, vector databases, and AI applications with high-quality web data.
Key Features
- Multiple crawler engines - Adaptive mode tries headless Firefox first and automatically falls back to HTTP if the site blocks browsers; you can also pick a specific engine manually.
- Clean content extraction - Automatically removes navigation, headers, footers, cookie banners, ads, modals, and other irrelevant page elements.
- Flexible output formats - Save content as Markdown, plain text, or HTML.
- Smart URL scoping - Stays within the start URL path. Supports include/exclude glob patterns for fine-grained control.
- Sitemap discovery - Automatically finds and parses sitemaps to discover more pages.
- Canonical URL deduplication - Skips duplicate pages identified by the same canonical URL.
- Dynamic content support - Wait for JavaScript rendering, scroll to trigger lazy loading, expand accordions and tabs.
- Cookie banner dismissal - Automatically detects and dismisses cookie consent popups.
- File downloads - Optionally download linked PDF, DOC, DOCX, XLS, XLSX, and CSV files.
- Rich metadata extraction - Extracts title, description, author, keywords, language, and canonical URL from every page.
Use Cases
Feed LLMs and AI Applications
Crawl documentation sites, knowledge bases, help centers, or blogs and feed the extracted content directly into your LLM, ChatGPT, or custom AI assistant.
Retrieval Augmented Generation (RAG)
Build a knowledge base from any website. Use the crawled content with vector databases like Pinecone, Qdrant, or Weaviate to power RAG-based question answering.
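As a minimal sketch of this pattern, the snippet below chunks crawled pages and indexes them for retrieval. It assumes `docs` is a list of LangChain Documents built from the crawler's dataset (see the Integration Examples section below) and uses a local FAISS index as a stand-in for a hosted store like Pinecone, Qdrant, or Weaviate:

```python
# RAG indexing sketch. Assumes `docs` holds LangChain Documents built
# from this Actor's dataset, and an OpenAI API key is configured.
# FAISS stands in here for a hosted vector database.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Split crawled pages into chunks sized for embedding and retrieval.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# Embed the chunks and build a searchable index.
index = FAISS.from_documents(chunks, OpenAIEmbeddings())

# Retrieve context for a question.
hits = index.similarity_search("How do I get started?", k=4)
print([h.metadata["source"] for h in hits])
```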
Custom GPTs and AI Assistants
Export crawled data as JSON and upload it as knowledge files to your custom OpenAI GPTs or AI assistants.
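For example, a short script with the apify-client Python package can dump a finished run's dataset to a JSON file you can upload as a knowledge file (the token, dataset ID, and file name below are placeholders):

```python
# Export a finished run's dataset as a JSON knowledge file.
import json
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Fetch all items from the run's default dataset.
items = client.dataset("YOUR_DATASET_ID").list_items().items

# Write them to a file you can upload to a custom GPT.
with open("knowledge.json", "w", encoding="utf-8") as f:
    json.dump(items, f, ensure_ascii=False, indent=2)
```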
Content Processing at Scale
Scrape content for summarization, translation, proofreading, or style transformation using LLMs.
LangChain and LlamaIndex Integration
Use the Apify integration with LangChain or LlamaIndex to feed crawled content directly into your AI pipeline.
How It Works
The crawler operates in three stages:
- Crawling - Discovers and downloads web pages starting from your URLs, following links within scope.
- HTML Processing - Cleans the DOM by removing navigation, ads, cookie warnings, and other noise.
- Output - Converts the cleaned HTML to your chosen format (Markdown, text, or HTML) with metadata.
Input Configuration
The only required input is Start URLs. All other settings have sensible defaults.
| Setting | Description | Default |
|---|---|---|
| Start URLs | URLs to begin crawling from | (required) |
| Crawler type | Engine: Adaptive, Firefox browser, or Cheerio HTTP | Adaptive |
| Max pages | Maximum number of pages to crawl | 100 |
| Max crawling depth | How deep to follow links from start URLs | 20 |
| Output format | Markdown, plain text, or HTML | Markdown |
| Exclude URLs (globs) | Glob patterns for URLs to skip | (none) |
| Include URLs (globs) | Only crawl URLs matching these globs | (none) |
| Remove elements (CSS) | Additional CSS selectors to remove | (none, defaults always applied) |
| Extract elements (CSS) | Only keep content from these elements | (none) |
| Remove cookie warnings | Auto-dismiss cookie consent banners | Yes |
| Wait for dynamic content | Time to wait for JS rendering (ms) | 1000 |
| Scroll height | Scroll to trigger lazy loading (px) | 0 |
| Expand clickables | Click accordions/tabs to expand | No |
| Save files | Download linked PDF/DOC/XLS files | No |
| Use sitemaps | Discover URLs from sitemaps | Yes |
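As an illustration, a minimal input that overrides a few of these defaults might look like this (the startUrls, maxCrawlPages, and outputFormat field names are the ones used in the integration examples below; all other settings keep their defaults):

```json
{
  "startUrls": [{ "url": "https://docs.example.com/" }],
  "maxCrawlPages": 200,
  "outputFormat": "markdown"
}
```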
Output Format
Each crawled page produces a JSON object:
{"url": "https://example.com/docs/getting-started","crawl": {"loadedUrl": "https://example.com/docs/getting-started","loadedTime": "2024-01-15T10:30:00.000Z","depth": 1},"metadata": {"canonicalUrl": "https://example.com/docs/getting-started","title": "Getting Started | Example Docs","description": "Learn how to get started with Example.","author": "Example Team","keywords": "docs, getting started","languageCode": "en"},"text": null,"markdown": "# Getting Started\n\nWelcome to Example...","html": null}
The content field (text, markdown, or html) is populated based on your chosen output format. The other two fields will be null.
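A convenient pattern when consuming results is to fall back across the three content fields, for example (assuming `item` is one dataset record as shown above):

```python
# Use whichever content field the run populated; the other two are null.
content = item.get("markdown") or item.get("text") or item.get("html") or ""
```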
Pricing
Only $0.001 per page ($1.00 per 1,000 pages) via pay-per-event billing.
| | This Actor | Official Apify Crawler | Firecrawl-based Actors |
|---|---|---|---|
| Price per page | $0.001 | $0.005 - $0.05 | $0.004 |
| 1,000 pages | $1.00 | $5.00 - $50.00 | $4.00 |
| 10,000 pages | $10.00 | $50.00 - $500.00 | $40.00 |
- 4x cheaper than Firecrawl-based alternatives
- 5x to 50x cheaper than the official Apify crawler, depending on its crawler type
- You only pay for pages successfully crawled and saved to the dataset
Apify's free plan includes $5/month in credits, enough to crawl ~5,000 pages for free.
Integration Examples
Python (LangChain)
```python
from langchain_community.utilities import ApifyWrapper
from langchain_core.documents import Document

apify = ApifyWrapper()

# Run the Actor and wrap its dataset as a LangChain document loader.
loader = apify.call_actor(
    actor_id="worshipful_knife/website-content-crawler",
    run_input={
        "startUrls": [{"url": "https://docs.example.com/"}],
        "maxCrawlPages": 50,
    },
    dataset_mapping_function=lambda item: Document(
        page_content=item["markdown"] or item["text"] or "",
        metadata={"source": item["url"]},
    ),
)

# Load the crawled pages as LangChain documents.
docs = loader.load()
```
Node.js (Apify Client)
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_TOKEN' });

const run = await client.actor('worshipful_knife/website-content-crawler').call({
    startUrls: [{ url: 'https://docs.example.com/' }],
    maxCrawlPages: 50,
    outputFormat: 'markdown',
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```
Troubleshooting
- Missing content? Try switching to the headless browser crawler type, which renders JavaScript.
- Too much noise in output? Use the "Remove elements (CSS)" or "Extract elements (CSS)" settings to fine-tune what is kept.
- Crawler too slow? Increase "Max concurrency" or switch to the Cheerio crawler type for static sites.
- Getting blocked? Use the headless browser crawler type with residential proxies.
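As a concrete example of the noise fix: if every crawled page carries a promotional sidebar, adding a selector like `aside.sidebar, .newsletter-signup` to "Remove elements (CSS)" strips those nodes before conversion, while setting "Extract elements (CSS)" to something like `article.main-content` keeps only the article body. These selectors are illustrative; inspect your target site to find the right ones.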
Support
If you have any questions or feedback, please open an issue on the Actor's GitHub page or contact us through Apify support.