
Website to Markdown Crawler — AI/RAG Data Pipeline

Developer: Ricardo Akiyoshi (Maintained by Community) · Pricing: Pay per usage
Crawl any website and convert every page to clean, structured Markdown. Built specifically for AI/RAG pipelines, LLM training data preparation, vector database ingestion, and knowledge base building.

Perfect for: LangChain, LlamaIndex, OpenAI embeddings, Pinecone, Weaviate, Chroma, Qdrant, and any RAG stack.

Why This Crawler?

Most web scrapers give you raw HTML or poorly formatted text. This crawler produces publication-quality Markdown that LLMs can understand directly — with proper headings, lists, code blocks, tables, and links preserved.

What Makes It Different

  • Smart content extraction — Automatically finds the main content and strips navigation, ads, cookie banners, popups, sidebars, and other boilerplate
  • High-fidelity Markdown — Proper heading hierarchy, nested lists, code blocks with language detection, Markdown tables, blockquotes, and inline formatting
  • RAG-ready chunking — Split content into overlapping chunks at paragraph/sentence boundaries (not mid-word) for optimal embedding quality
  • Rich metadata — Title, description, author, published date, Open Graph tags, JSON-LD, word count, and estimated reading time
  • Sitemap support — Discover all pages via sitemap.xml for complete site coverage
  • URL filtering — Include/exclude pages with regex patterns

Use Cases

  1. Documentation Crawling — Convert your docs site to Markdown for RAG-powered Q&A bots
  2. Research Compilation — Crawl multiple sources and compile structured research data
  3. AI Training Data — Build clean text corpora for fine-tuning language models
  4. Knowledge Base Building — Ingest website content into vector databases for semantic search
  5. Content Migration — Convert HTML websites to Markdown for static site generators
  6. Competitive Analysis — Extract and structure competitor content for analysis

Input Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| startUrls | string[] | required | URLs to start crawling from |
| maxPages | integer | 50 | Maximum pages to crawl (1-10,000) |
| maxDepth | integer | 3 | Maximum link depth (0-10) |
| includeBody | boolean | true | Include page body content |
| includeMetadata | boolean | true | Include page metadata |
| removeNavigation | boolean | true | Strip nav, header, footer, sidebar |
| removeAds | boolean | true | Strip ads, popups, cookie banners |
| chunkSize | integer | 0 | Split into chunks of N characters (0 = off) |
| chunkOverlap | integer | 200 | Character overlap between chunks |
| outputFormat | enum | "markdown" | Output: "markdown", "text", or "html" |
| sitemapUrl | string | — | Sitemap URL for URL discovery |
| urlPattern | string | — | Regex: only crawl matching URLs |
| excludePattern | string | — | Regex: skip matching URLs |
| maxRequestsPerMinute | integer | 30 | Rate limit |
| proxyConfiguration | object | — | Apify proxy settings |
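
A minimal input combining several of these parameters might look like this (values are illustrative):

```json
{
  "startUrls": ["https://docs.example.com"],
  "maxPages": 100,
  "maxDepth": 3,
  "chunkSize": 1000,
  "chunkOverlap": 200,
  "outputFormat": "markdown",
  "excludePattern": "\\.(pdf|zip|png)$"
}
```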

Output Format

Each crawled page produces one dataset item:

```json
{
  "url": "https://docs.example.com/getting-started",
  "title": "Getting Started — Example Docs",
  "markdown": "# Getting Started\n\nWelcome to Example...",
  "text": "Getting Started Welcome to Example...",
  "metadata": {
    "title": "Getting Started — Example Docs",
    "description": "Learn how to get started with Example",
    "author": "Example Team",
    "publishedDate": "2025-01-15T00:00:00Z",
    "modifiedDate": "2026-02-01T00:00:00Z",
    "canonicalUrl": "https://docs.example.com/getting-started",
    "language": "en",
    "ogImage": "https://docs.example.com/og-image.png",
    "ogType": "article",
    "ogSiteName": "Example Docs",
    "jsonLd": { "@type": "Article", "...": "..." },
    "wordCount": 1234,
    "readingTimeMinutes": 5,
    "keywords": "getting started, tutorial, example",
    "robots": "index, follow"
  },
  "wordCount": 1234,
  "chunks": [
    {
      "text": "# Getting Started\n\nWelcome to Example...",
      "chunkIndex": 0,
      "totalChunks": 3
    },
    {
      "text": "...continued content with overlap...",
      "chunkIndex": 1,
      "totalChunks": 3
    }
  ],
  "links": [
    "https://docs.example.com/installation",
    "https://docs.example.com/api-reference"
  ],
  "depth": 1,
  "scrapedAt": "2026-03-01T12:00:00.000Z"
}
```

Chunking for RAG / Embeddings

When chunkSize is set, content is split into overlapping chunks for direct ingestion into vector databases. The chunker splits on natural boundaries:

  1. Paragraph breaks (double newline) — preferred
  2. Line breaks (single newline) — fallback
  3. Sentence boundaries (. ! ?) — next fallback
  4. Word boundaries (spaces) — last resort
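
The fallback order above can be sketched roughly as follows (a simplified illustration of the technique, not the actor's actual implementation; the half-chunk-size cutoff is an invented heuristic to avoid degenerate tiny chunks):

```python
def chunk_text(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most chunk_size characters, cutting at
    paragraph, line, sentence, then word boundaries (in that order)."""
    separators = ["\n\n", "\n", ". ", "! ", "? ", " "]
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Search backwards for the most natural boundary, but refuse
            # cuts that would leave the chunk shorter than half chunk_size
            for sep in separators:
                cut = text.rfind(sep, start, end)
                if cut > start + chunk_size // 2:
                    end = cut + len(sep)
                    break
        chunks.append(text[start:end])
        if end >= len(text):
            break
        # Step back by `overlap` characters, but always move forward
        start = max(end - overlap, start + 1)
    return chunks
```

With overlap set to 0, the chunks concatenate back to the original text, which makes round-trip checks easy.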

Recommended chunk sizes by embedding model:

| Model | Recommended chunkSize | chunkOverlap |
|---|---|---|
| OpenAI text-embedding-3-small | 1000–1500 | 200 |
| OpenAI text-embedding-3-large | 1500–2000 | 200 |
| Cohere embed-v3 | 1000–1500 | 150 |
| BGE / E5 models | 500–1000 | 100 |
| Sentence Transformers | 500–800 | 100 |

Integration Examples

LangChain (Python)

```python
from apify_client import ApifyClient
from langchain_community.document_loaders import ApifyDatasetLoader
from langchain_core.documents import Document

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("your-username/website-to-markdown").call(run_input={
    "startUrls": ["https://docs.example.com"],
    "maxPages": 100,
    "chunkSize": 1000,
    "chunkOverlap": 200,
})

# Load directly from the run's dataset
dataset = client.dataset(run["defaultDatasetId"])
items = dataset.list_items().items

# Each item has "markdown", "chunks", and "metadata" ready for your pipeline
for item in items:
    for chunk in item.get("chunks", []):
        # chunk["text"] is ready for embedding
        pass

# Or map the dataset straight into LangChain Documents
loader = ApifyDatasetLoader(
    dataset_id=run["defaultDatasetId"],
    dataset_mapping_function=lambda item: Document(
        page_content=item["markdown"], metadata={"source": item["url"]}
    ),
)
documents = loader.load()
```

LlamaIndex (Python)

```python
from apify_client import ApifyClient
from llama_index.core import Document

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("your-username/website-to-markdown").call(run_input={
    "startUrls": ["https://docs.example.com"],
    "maxPages": 50,
    "outputFormat": "markdown",
})

dataset = client.dataset(run["defaultDatasetId"])
items = dataset.list_items().items

documents = [
    Document(
        text=item["markdown"],
        metadata={
            "url": item["url"],
            "title": item["title"],
            **item.get("metadata", {}),
        },
    )
    for item in items
]
```

Direct API Call

```shell
curl -X POST "https://api.apify.com/v2/acts/your-username~website-to-markdown/runs" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -d '{
    "startUrls": ["https://docs.example.com"],
    "maxPages": 100,
    "chunkSize": 1500,
    "outputFormat": "markdown"
  }'
```

Pricing

This actor uses Pay Per Event pricing:

| Event | Price |
|---|---|
| Page crawled | $0.003 |

Example costs:

  • 50-page docs site: $0.15
  • 200-page blog: $0.60
  • 1,000-page wiki: $3.00
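
At $0.003 per page, these estimates are easy to reproduce (a quick sanity check, not an official billing formula):

```python
PRICE_PER_PAGE = 0.003  # USD, per the Pay Per Event table above

def crawl_cost(pages: int) -> float:
    """Estimated cost in USD for crawling `pages` pages."""
    return round(pages * PRICE_PER_PAGE, 2)

print(crawl_cost(50))    # 50-page docs site → 0.15
print(crawl_cost(200))   # 200-page blog → 0.6
print(crawl_cost(1000))  # 1,000-page wiki → 3.0
```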

Cost comparison with alternatives:

  • Apify Website Content Crawler: ~$0.005/page
  • Diffbot: $0.01-0.05/page
  • Custom scraping infrastructure: $50-200/month fixed
  • This actor: $0.003/page — the most affordable option

Content Extraction Quality

The crawler uses a multi-layer approach for reliable content extraction:

  1. Boilerplate removal — 60+ CSS selectors for navigation, ads, cookie banners, popups, social widgets, newsletter signups, comments, and related posts
  2. Main content detection — Tries semantic selectors (article, main, [role=main]) first, then falls back to text density scoring that considers paragraph count, heading count, link density, and content-related class names
  3. Semantic Markdown conversion — Recursive DOM traversal that preserves document structure: headings, lists (nested), code blocks (with language detection from 40+ languages), tables, blockquotes, links, images, figures with captions, definition lists, and inline formatting
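
The density scoring in step 2 can be illustrated with a toy heuristic (a simplified sketch of the general technique; the weights and regexes here are invented, not the actor's actual scorer):

```python
import re

def content_score(html_fragment: str) -> float:
    """Score an HTML fragment: more paragraphs and headings raise the
    score, while a high share of link text lowers it."""
    paragraphs = len(re.findall(r"<p\b", html_fragment))
    headings = len(re.findall(r"<h[1-6]\b", html_fragment))
    # Text inside <a> tags vs. all text, as a crude link-density proxy
    link_text = sum(len(m) for m in re.findall(r"<a\b[^>]*>(.*?)</a>", html_fragment, re.S))
    all_text = len(re.sub(r"<[^>]+>", "", html_fragment))
    link_density = link_text / all_text if all_text else 1.0
    return (paragraphs * 3 + headings * 2) * (1 - link_density)

article = "<h1>Guide</h1><p>Long explanatory paragraph.</p><p>Another one.</p>"
nav = "<a href='/a'>Home</a><a href='/b'>Docs</a><a href='/c'>Blog</a>"
```

A prose-heavy candidate like `article` outscores a link-only block like `nav`, which is exactly why navigation bars lose to the main content.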

Tips

  • Start small — Test with maxPages: 5 to verify output quality before running large crawls
  • Use sitemap — For complete site coverage, provide sitemapUrl to discover all pages
  • Filter URLs — Use urlPattern to focus on specific sections (e.g., /docs/ or /blog/)
  • Exclude patterns — Skip binary files with excludePattern: \\.(pdf|zip|png|jpg|gif)$
  • Adjust rate limit — Lower maxRequestsPerMinute for smaller sites to be polite
  • Enable proxy — Use Apify proxy for sites with anti-bot protection
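
As a quick check of the binary-file pattern above (assuming the actor applies excludePattern with a standard regex search):

```python
import re

EXCLUDE = re.compile(r"\.(pdf|zip|png|jpg|gif)$")

urls = [
    "https://docs.example.com/guide",            # crawled
    "https://docs.example.com/manual.pdf",       # skipped
    "https://docs.example.com/assets/logo.png",  # skipped
]
kept = [u for u in urls if not EXCLUDE.search(u)]
print(kept)  # only the /guide URL survives
```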