
Website to Markdown Crawler — AI/RAG Data Pipeline

Developer: Ricardo Akiyoshi (Maintained by Community) · Pricing: Pay per usage
Crawl any website and convert every page to clean, structured Markdown. Built specifically for AI/RAG pipelines, LLM training data preparation, vector database ingestion, and knowledge base building.

Perfect for: LangChain, LlamaIndex, OpenAI embeddings, Pinecone, Weaviate, Chroma, Qdrant, and any RAG stack.

Why This Crawler?

Most web scrapers give you raw HTML or poorly formatted text. This crawler produces publication-quality Markdown that LLMs can understand directly — with proper headings, lists, code blocks, tables, and links preserved.

What Makes It Different

  • Smart content extraction — Automatically finds the main content and strips navigation, ads, cookie banners, popups, sidebars, and other boilerplate
  • High-fidelity Markdown — Proper heading hierarchy, nested lists, code blocks with language detection, Markdown tables, blockquotes, and inline formatting
  • RAG-ready chunking — Split content into overlapping chunks at paragraph/sentence boundaries (not mid-word) for optimal embedding quality
  • Rich metadata — Title, description, author, published date, Open Graph tags, JSON-LD, word count, and estimated reading time
  • Sitemap support — Discover all pages via sitemap.xml for complete site coverage
  • URL filtering — Include/exclude pages with regex patterns

Use Cases

  1. Documentation Crawling — Convert your docs site to Markdown for RAG-powered Q&A bots
  2. Research Compilation — Crawl multiple sources and compile structured research data
  3. AI Training Data — Build clean text corpora for fine-tuning language models
  4. Knowledge Base Building — Ingest website content into vector databases for semantic search
  5. Content Migration — Convert HTML websites to Markdown for static site generators
  6. Competitive Analysis — Extract and structure competitor content for analysis

Input Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| startUrls | string[] | required | URLs to start crawling from |
| maxPages | integer | 50 | Maximum pages to crawl (1-10,000) |
| maxDepth | integer | 3 | Maximum link depth (0-10) |
| includeBody | boolean | true | Include page body content |
| includeMetadata | boolean | true | Include page metadata |
| removeNavigation | boolean | true | Strip nav, header, footer, sidebar |
| removeAds | boolean | true | Strip ads, popups, cookie banners |
| chunkSize | integer | 0 | Split into chunks of N characters (0 = off) |
| chunkOverlap | integer | 200 | Character overlap between chunks |
| outputFormat | enum | "markdown" | Output: "markdown", "text", or "html" |
| sitemapUrl | string | — | Sitemap URL for URL discovery |
| urlPattern | string | — | Regex: only crawl matching URLs |
| excludePattern | string | — | Regex: skip matching URLs |
| maxRequestsPerMinute | integer | 30 | Rate limit |
| proxyConfiguration | object | — | Apify proxy settings |
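
A minimal input combining several of these parameters might look like this (values are illustrative):

```json
{
  "startUrls": ["https://docs.example.com"],
  "maxPages": 100,
  "maxDepth": 3,
  "chunkSize": 1000,
  "chunkOverlap": 200,
  "outputFormat": "markdown",
  "excludePattern": "\\.(pdf|zip|png)$"
}
```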

Output Format

Each crawled page produces one dataset item:

```json
{
  "url": "https://docs.example.com/getting-started",
  "title": "Getting Started — Example Docs",
  "markdown": "# Getting Started\n\nWelcome to Example...",
  "text": "Getting Started Welcome to Example...",
  "metadata": {
    "title": "Getting Started — Example Docs",
    "description": "Learn how to get started with Example",
    "author": "Example Team",
    "publishedDate": "2025-01-15T00:00:00Z",
    "modifiedDate": "2026-02-01T00:00:00Z",
    "canonicalUrl": "https://docs.example.com/getting-started",
    "language": "en",
    "ogImage": "https://docs.example.com/og-image.png",
    "ogType": "article",
    "ogSiteName": "Example Docs",
    "jsonLd": { "@type": "Article", "...": "..." },
    "wordCount": 1234,
    "readingTimeMinutes": 5,
    "keywords": "getting started, tutorial, example",
    "robots": "index, follow"
  },
  "wordCount": 1234,
  "chunks": [
    {
      "text": "# Getting Started\n\nWelcome to Example...",
      "chunkIndex": 0,
      "totalChunks": 3
    },
    {
      "text": "...continued content with overlap...",
      "chunkIndex": 1,
      "totalChunks": 3
    }
  ],
  "links": [
    "https://docs.example.com/installation",
    "https://docs.example.com/api-reference"
  ],
  "depth": 1,
  "scrapedAt": "2026-03-01T12:00:00.000Z"
}
```

Chunking for RAG / Embeddings

When chunkSize is set, content is split into overlapping chunks for direct ingestion into vector databases. The chunker splits on natural boundaries:

  1. Paragraph breaks (double newline) — preferred
  2. Line breaks (single newline) — fallback
  3. Sentence boundaries (. ! ?) — next fallback
  4. Word boundaries (spaces) — last resort
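
The fallback order above can be sketched roughly as follows (a simplified illustration of the technique, not the actor's actual implementation; the half-chunk-size cutoff is an invented heuristic to avoid degenerate tiny chunks):

```python
def chunk_text(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most chunk_size characters, cutting at
    paragraph, line, sentence, then word boundaries (in that order)."""
    separators = ["\n\n", "\n", ". ", "! ", "? ", " "]
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Search backwards for the most natural boundary, but refuse
            # cuts that would leave the chunk shorter than half chunk_size
            for sep in separators:
                cut = text.rfind(sep, start, end)
                if cut > start + chunk_size // 2:
                    end = cut + len(sep)
                    break
        chunks.append(text[start:end])
        if end >= len(text):
            break
        # Step back by `overlap` characters, but always move forward
        start = max(end - overlap, start + 1)
    return chunks
```

With overlap set to 0, the chunks concatenate back to the original text, which makes round-trip checks easy.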

Recommended chunk sizes by embedding model:

| Model | Recommended chunkSize | chunkOverlap |
|---|---|---|
| OpenAI text-embedding-3-small | 1000–1500 | 200 |
| OpenAI text-embedding-3-large | 1500–2000 | 200 |
| Cohere embed-v3 | 1000–1500 | 150 |
| BGE / E5 models | 500–1000 | 100 |
| Sentence Transformers | 500–800 | 100 |

Integration Examples

LangChain (Python)

```python
from apify_client import ApifyClient
from langchain_community.document_loaders import ApifyDatasetLoader
from langchain_core.documents import Document

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("your-username/website-to-markdown").call(run_input={
    "startUrls": ["https://docs.example.com"],
    "maxPages": 100,
    "chunkSize": 1000,
    "chunkOverlap": 200,
})

# Load directly from the run's dataset
dataset = client.dataset(run["defaultDatasetId"])
items = dataset.list_items().items

# Each item has "markdown", "chunks", and "metadata" ready for your pipeline
for item in items:
    for chunk in item.get("chunks", []):
        # chunk["text"] is ready for embedding
        pass

# Or map the dataset straight into LangChain Documents
loader = ApifyDatasetLoader(
    dataset_id=run["defaultDatasetId"],
    dataset_mapping_function=lambda item: Document(
        page_content=item["markdown"], metadata={"source": item["url"]}
    ),
)
documents = loader.load()
```

LlamaIndex (Python)

```python
from apify_client import ApifyClient
from llama_index.core import Document

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("your-username/website-to-markdown").call(run_input={
    "startUrls": ["https://docs.example.com"],
    "maxPages": 50,
    "outputFormat": "markdown",
})

dataset = client.dataset(run["defaultDatasetId"])
items = dataset.list_items().items

documents = [
    Document(
        text=item["markdown"],
        metadata={
            "url": item["url"],
            "title": item["title"],
            **item.get("metadata", {}),
        },
    )
    for item in items
]
```

Direct API Call

```shell
curl -X POST "https://api.apify.com/v2/acts/your-username~website-to-markdown/runs" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -d '{
    "startUrls": ["https://docs.example.com"],
    "maxPages": 100,
    "chunkSize": 1500,
    "outputFormat": "markdown"
  }'
```

Pricing

This actor uses Pay Per Event pricing:

| Event | Price |
|---|---|
| Page crawled | $0.003 |

Example costs:

  • 50-page docs site: $0.15
  • 200-page blog: $0.60
  • 1,000-page wiki: $3.00
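
At $0.003 per page, these estimates are easy to reproduce (a quick sanity check, not an official billing formula):

```python
PRICE_PER_PAGE = 0.003  # USD, per the Pay Per Event table above

def crawl_cost(pages: int) -> float:
    """Estimated cost in USD for crawling `pages` pages."""
    return round(pages * PRICE_PER_PAGE, 2)

print(crawl_cost(50))    # 50-page docs site → 0.15
print(crawl_cost(200))   # 200-page blog → 0.6
print(crawl_cost(1000))  # 1,000-page wiki → 3.0
```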

Cost comparison with alternatives:

  • Apify Website Content Crawler: ~$0.005/page
  • Diffbot: $0.01-0.05/page
  • Custom scraping infrastructure: $50-200/month fixed
  • This actor: $0.003/page — the most affordable option

Content Extraction Quality

The crawler uses a multi-layer approach for reliable content extraction:

  1. Boilerplate removal — 60+ CSS selectors for navigation, ads, cookie banners, popups, social widgets, newsletter signups, comments, and related posts
  2. Main content detection — Tries semantic selectors (article, main, [role=main]) first, then falls back to text density scoring that considers paragraph count, heading count, link density, and content-related class names
  3. Semantic Markdown conversion — Recursive DOM traversal that preserves document structure: headings, lists (nested), code blocks (with language detection from 40+ languages), tables, blockquotes, links, images, figures with captions, definition lists, and inline formatting
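
The density scoring in step 2 can be illustrated with a toy heuristic (a simplified sketch of the general technique; the weights and regexes here are invented, not the actor's actual scorer):

```python
import re

def content_score(html_fragment: str) -> float:
    """Score an HTML fragment: more paragraphs and headings raise the
    score, while a high share of link text lowers it."""
    paragraphs = len(re.findall(r"<p\b", html_fragment))
    headings = len(re.findall(r"<h[1-6]\b", html_fragment))
    # Text inside <a> tags vs. all text, as a crude link-density proxy
    link_text = sum(len(m) for m in re.findall(r"<a\b[^>]*>(.*?)</a>", html_fragment, re.S))
    all_text = len(re.sub(r"<[^>]+>", "", html_fragment))
    link_density = link_text / all_text if all_text else 1.0
    return (paragraphs * 3 + headings * 2) * (1 - link_density)

article = "<h1>Guide</h1><p>Long explanatory paragraph.</p><p>Another one.</p>"
nav = "<a href='/a'>Home</a><a href='/b'>Docs</a><a href='/c'>Blog</a>"
```

A prose-heavy candidate like `article` outscores a link-only block like `nav`, which is exactly why navigation bars lose to the main content.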

Tips

  • Start small — Test with maxPages: 5 to verify output quality before running large crawls
  • Use sitemap — For complete site coverage, provide sitemapUrl to discover all pages
  • Filter URLs — Use urlPattern to focus on specific sections (e.g., /docs/ or /blog/)
  • Exclude patterns — Skip binary files with excludePattern: \\.(pdf|zip|png|jpg|gif)$
  • Adjust rate limit — Lower maxRequestsPerMinute for smaller sites to be polite
  • Enable proxy — Use Apify proxy for sites with anti-bot protection
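
As a quick check of the binary-file pattern above (assuming the actor applies excludePattern with a standard regex search):

```python
import re

EXCLUDE = re.compile(r"\.(pdf|zip|png|jpg|gif)$")

urls = [
    "https://docs.example.com/guide",            # crawled
    "https://docs.example.com/manual.pdf",       # skipped
    "https://docs.example.com/assets/logo.png",  # skipped
]
kept = [u for u in urls if not EXCLUDE.search(u)]
print(kept)  # only the /guide URL survives
```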