Website Content Crawler
Pricing
from $1.00 / 1,000 results
Rating: 5.0 (10) · Developer: Crawler Bros
Crawl any website and extract clean text, markdown, or HTML content. Built for feeding data into LLMs, building RAG pipelines, creating knowledge bases, and powering AI-driven search.
What does Website Content Crawler do?
This actor takes one or more URLs and crawls entire websites by following links. It extracts clean, readable content from every page — stripping navigation, scripts, footers, and other non-content elements. The output is optimized for use with large language models (LLMs) and retrieval-augmented generation (RAG) systems.
Features
- Deep website crawling — Follows links across pages up to a configurable depth
- Multiple output formats — Get content as Markdown, plain text, or cleaned HTML
- Smart content extraction — Automatically removes navigation, scripts, footers, cookie banners, and other boilerplate
- URL filtering — Include or exclude pages using glob patterns (e.g., https://example.com/blog/**)
- Configurable limits — Control max pages, crawl depth, and concurrency
- JavaScript rendering — Uses a headless browser (Chromium or Firefox) to handle dynamic websites
- Fast HTTP mode — Optional raw HTTP mode for static sites that don't need JavaScript
- No login required — Works with publicly accessible pages
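To illustrate how maxCrawlDepth, maxCrawlPages, and same-domain link following interact, here is a minimal sketch of a breadth-first crawler over an in-memory link graph. This is an illustration only, not the actor's implementation: the toy LINKS mapping stands in for real pages, which the actual actor fetches over the network.

```python
from collections import deque
from urllib.parse import urlparse

# Toy link graph standing in for real pages (assumption: no network here).
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://other.com/x"],
    "https://example.com/a": ["https://example.com/b"],
    "https://example.com/b": ["https://example.com/c"],
    "https://example.com/c": [],
    "https://other.com/x": [],
}

def crawl(start_url, max_depth=10, max_pages=100):
    """Breadth-first crawl: same domain only, bounded depth and page count."""
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    results = []
    while queue and len(results) < max_pages:
        url, depth = queue.popleft()
        results.append({"url": url, "depth": depth})
        if depth >= max_depth:
            continue  # don't follow links past the depth limit
        for link in LINKS.get(url, []):
            # same-domain check mirrors the crawler's cross-domain rule
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return results

pages = crawl("https://example.com/", max_depth=2, max_pages=100)
# Visits /, /a, /b; /c lies beyond depth 2 and other.com is another domain.
```

With max_depth=0 only the start URLs themselves are fetched, matching the input table above.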
Input
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| startUrls | Array of URLs | Yes | — | One or more URLs to begin crawling from |
| crawlerType | String | No | playwright:chromium | Crawler engine: playwright:chromium, playwright:firefox, or http |
| maxCrawlDepth | Integer | No | 10 | Maximum link-following depth (0 = start URLs only) |
| maxCrawlPages | Integer | No | 100 | Total page limit |
| maxConcurrency | Integer | No | 5 | Number of pages loaded in parallel |
| includeUrlGlobs | Array of strings | No | [] | Only crawl URLs matching these glob patterns |
| excludeUrlGlobs | Array of strings | No | [] | Skip URLs matching these glob patterns |
| outputFormat | String | No | markdown | Output format: markdown, text, or html |
| proxyConfiguration | Object | No | — | Optional proxy settings |
Example Input
```json
{
  "startUrls": [{ "url": "https://docs.apify.com" }],
  "maxCrawlPages": 50,
  "maxCrawlDepth": 3,
  "outputFormat": "markdown"
}
```
URL Filtering Example
```json
{
  "startUrls": [{ "url": "https://example.com" }],
  "includeUrlGlobs": ["https://example.com/blog/**"],
  "excludeUrlGlobs": ["**login**", "**signup**"]
}
```
Output
Each crawled page produces a result with the following fields.
| Field | Type | Description |
|---|---|---|
| url | String | Original URL that was requested |
| loadedUrl | String | Final URL after any redirects |
| title | String | Page title |
| description | String | Meta description of the page |
| languageCode | String | Language code from the HTML lang attribute |
| text | String | Clean plain text extracted from the page |
| markdown | String | Page content converted to Markdown |
| html | String | Cleaned HTML content (when output format is HTML) |
| depth | Integer | Crawl depth (0 = start URL) |
| httpStatusCode | Integer | HTTP response status code |
| loadedTime | String | ISO 8601 timestamp when the page was loaded |
| referrerUrl | String | URL of the page that linked to this one |
Example Output
```json
{
  "url": "https://docs.apify.com/academy/web-scraping-for-beginners",
  "loadedUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
  "title": "Web scraping for beginners | Apify Documentation",
  "description": "Learn how to build web scrapers from scratch.",
  "languageCode": "en",
  "text": "Web scraping for beginners\n\nThis course teaches you the basics of web scraping...",
  "markdown": "# Web scraping for beginners\n\nThis course teaches you the basics of web scraping...",
  "html": "",
  "depth": 1,
  "httpStatusCode": 200,
  "loadedTime": "2025-01-15T10:30:00.000Z",
  "referrerUrl": "https://docs.apify.com"
}
```
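The text and markdown fields are what you would typically feed into a RAG pipeline. As a sketch (not part of the actor), a hypothetical chunk_results helper could split each record into overlapping chunks while keeping the source URL as metadata for retrieval:

```python
def chunk_results(results, chunk_size=500, overlap=100):
    """Split each page's text into overlapping chunks, tagged with its URL."""
    chunks = []
    for item in results:
        text = item.get("text", "")
        step = chunk_size - overlap  # slide forward by chunk_size minus overlap
        for start in range(0, max(len(text), 1), step):
            piece = text[start:start + chunk_size]
            if piece.strip():
                chunks.append({"url": item["url"], "text": piece})
    return chunks

# A 1,200-character page with 500-char chunks and 100-char overlap
results = [{"url": "https://docs.apify.com/page", "text": "x" * 1200}]
chunks = chunk_results(results, chunk_size=500, overlap=100)
# Chunks start at offsets 0, 400, and 800, so this page yields 3 chunks.
```

The overlap keeps sentences that straddle a chunk boundary retrievable from both sides; tune chunk_size to your embedding model's context window.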
Use Cases
- LLM Training Data — Crawl documentation sites, blogs, or knowledge bases to build training datasets
- RAG Pipelines — Extract and index website content for retrieval-augmented generation
- Knowledge Base Building — Convert entire websites into structured, searchable content
- Content Migration — Export website content as Markdown for migration to a new platform
- Competitive Analysis — Extract and compare content across competitor websites
- SEO Auditing — Crawl your site to analyze content, titles, and meta descriptions
- Documentation Archival — Create offline copies of documentation in clean text format
FAQ
What output format should I use for LLMs?
Markdown is the recommended format for LLM use cases. It preserves document structure (headings, lists, links) while remaining clean and readable. Use text if you want the simplest possible output with no markup at all.
How many pages can I crawl?
You can crawl up to 100,000 pages per run. The default limit is 100 pages. Adjust the maxCrawlPages setting based on your needs and available compute.
What's the difference between Chromium and HTTP mode?
Chromium (default) uses a headless browser that renders JavaScript, making it work with modern dynamic websites. HTTP mode fetches raw HTML without running JavaScript — it's much faster but only works with static (server-rendered) pages.
Does the crawler follow links to other domains?
No. The crawler only follows links within the same domain as the start URL. This prevents accidentally crawling the entire internet.
How does URL filtering work?
Use includeUrlGlobs to restrict crawling to specific sections of a site (e.g., https://example.com/docs/**). Use excludeUrlGlobs to skip certain pages (e.g., **login**, **.pdf). Glob patterns use standard wildcard matching where * matches anything within a path segment and ** matches across segments.
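The difference between * and ** can be made concrete with a small glob-to-regex translator. This is an illustrative approximation of the matching semantics described above, not the actor's actual matcher:

```python
import re

def glob_to_regex(pattern):
    """Translate a URL glob: ** matches across path segments, * within one."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")      # ** crosses / boundaries
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")   # * stops at the next /
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out) + r"\Z")

def matches(url, pattern):
    return glob_to_regex(pattern).match(url) is not None

matches("https://example.com/blog/2024/post", "https://example.com/blog/**")  # True
matches("https://example.com/blog/2024/post", "https://example.com/blog/*")   # False: * stops at /
matches("https://example.com/user/login", "**login**")                        # True
```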
Does it handle JavaScript-rendered content?
Yes. The default Chromium mode renders JavaScript before extracting content. This means Single Page Applications (SPAs), React sites, and other dynamic pages are fully supported.
How clean is the extracted content?
The crawler automatically removes navigation menus, headers, footers, scripts, styles, cookie banners, and other non-content elements. The resulting text or markdown contains only the main page content.
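One common approach to this kind of cleanup is tag-based filtering: drop everything inside known boilerplate elements and keep the rest. The sketch below uses Python's standard html.parser to show the idea; it is a simplified stand-in, not the actor's extraction code:

```python
from html.parser import HTMLParser

SKIP_TAGS = {"nav", "header", "footer", "script", "style", "aside"}

class ContentExtractor(HTMLParser):
    """Collect text, ignoring everything nested inside boilerplate tags."""
    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a SKIP_TAGS element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

html = """<html><body>
<nav><a href="/">Home</a></nav>
<main><h1>Article title</h1><p>Main content.</p></main>
<footer>Copyright 2025</footer>
<script>track();</script>
</body></html>"""

parser = ContentExtractor()
parser.feed(html)
text = " ".join(parser.parts)
# → "Article title Main content."
```

Real-world extraction also uses heuristics (link density, text length, ARIA roles) beyond plain tag names, which is why cookie banners in generic div elements can still be removed.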
Can I crawl password-protected pages?
No. This crawler works with publicly accessible pages only. It does not support login or authentication.
What happens if a page fails to load?
Failed pages are logged and skipped. The crawler continues with the remaining URLs. Check the run log for details on any failures.
Does the crawler respect robots.txt?
The headless browser mode does not explicitly check robots.txt. If you need to respect robots.txt restrictions, review the site's rules before crawling.
What output formats are available for export?
Results can be exported as JSON, CSV, Excel (XLSX), HTML, RSS, or XML directly from the Apify platform.