# Website to Markdown Crawler — AI/RAG Data Pipeline
Developer: Ricardo Akiyoshi
Crawl any website and convert every page to clean, structured Markdown. Built specifically for AI/RAG pipelines, LLM training data preparation, vector database ingestion, and knowledge base building.
Perfect for: LangChain, LlamaIndex, OpenAI embeddings, Pinecone, Weaviate, Chroma, Qdrant, and any RAG stack.
## Why This Crawler?
Most web scrapers give you raw HTML or poorly formatted text. This crawler produces publication-quality Markdown that LLMs can understand directly — with proper headings, lists, code blocks, tables, and links preserved.
### What Makes It Different
- Smart content extraction — Automatically finds the main content and strips navigation, ads, cookie banners, popups, sidebars, and other boilerplate
- High-fidelity Markdown — Proper heading hierarchy, nested lists, code blocks with language detection, Markdown tables, blockquotes, and inline formatting
- RAG-ready chunking — Split content into overlapping chunks at paragraph/sentence boundaries (not mid-word) for optimal embedding quality
- Rich metadata — Title, description, author, published date, Open Graph tags, JSON-LD, word count, and estimated reading time
- Sitemap support — Discover all pages via sitemap.xml for complete site coverage
- URL filtering — Include/exclude pages with regex patterns
## Use Cases
- Documentation Crawling — Convert your docs site to Markdown for RAG-powered Q&A bots
- Research Compilation — Crawl multiple sources and compile structured research data
- AI Training Data — Build clean text corpora for fine-tuning language models
- Knowledge Base Building — Ingest website content into vector databases for semantic search
- Content Migration — Convert HTML websites to Markdown for static site generators
- Competitive Analysis — Extract and structure competitor content for analysis
## Input Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| startUrls | string[] | required | URLs to start crawling from |
| maxPages | integer | 50 | Maximum pages to crawl (1-10,000) |
| maxDepth | integer | 3 | Maximum link depth (0-10) |
| includeBody | boolean | true | Include page body content |
| includeMetadata | boolean | true | Include page metadata |
| removeNavigation | boolean | true | Strip nav, header, footer, sidebar |
| removeAds | boolean | true | Strip ads, popups, cookie banners |
| chunkSize | integer | 0 | Split into chunks of N characters (0 = off) |
| chunkOverlap | integer | 200 | Character overlap between chunks |
| outputFormat | enum | "markdown" | Output: "markdown", "text", or "html" |
| sitemapUrl | string | — | Sitemap URL for URL discovery |
| urlPattern | string | — | Regex: only crawl matching URLs |
| excludePattern | string | — | Regex: skip matching URLs |
| maxRequestsPerMinute | integer | 30 | Rate limit |
| proxyConfiguration | object | — | Apify proxy settings |
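A sample input combining several of these parameters might look like the following (the URLs and regex values are illustrative placeholders, not defaults):

```json
{
  "startUrls": ["https://docs.example.com"],
  "maxPages": 100,
  "maxDepth": 3,
  "chunkSize": 1000,
  "chunkOverlap": 200,
  "outputFormat": "markdown",
  "urlPattern": "/docs/",
  "excludePattern": "\\.(pdf|zip|png|jpg|gif)$"
}
```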
## Output Format

Each crawled page produces one dataset item:

```json
{
  "url": "https://docs.example.com/getting-started",
  "title": "Getting Started — Example Docs",
  "markdown": "# Getting Started\n\nWelcome to Example...",
  "text": "Getting Started Welcome to Example...",
  "metadata": {
    "title": "Getting Started — Example Docs",
    "description": "Learn how to get started with Example",
    "author": "Example Team",
    "publishedDate": "2025-01-15T00:00:00Z",
    "modifiedDate": "2026-02-01T00:00:00Z",
    "canonicalUrl": "https://docs.example.com/getting-started",
    "language": "en",
    "ogImage": "https://docs.example.com/og-image.png",
    "ogType": "article",
    "ogSiteName": "Example Docs",
    "jsonLd": { "@type": "Article", "...": "..." },
    "wordCount": 1234,
    "readingTimeMinutes": 5,
    "keywords": "getting started, tutorial, example",
    "robots": "index, follow"
  },
  "wordCount": 1234,
  "chunks": [
    {
      "text": "# Getting Started\n\nWelcome to Example...",
      "chunkIndex": 0,
      "totalChunks": 3
    },
    {
      "text": "...continued content with overlap...",
      "chunkIndex": 1,
      "totalChunks": 3
    }
  ],
  "links": [
    "https://docs.example.com/installation",
    "https://docs.example.com/api-reference"
  ],
  "depth": 1,
  "scrapedAt": "2026-03-01T12:00:00.000Z"
}
```
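A minimal sketch of consuming one dataset item of this shape (the dict below is hand-built from the sample above, not fetched from the API):

```python
# One dataset item, abbreviated to the fields most RAG pipelines use.
item = {
    "url": "https://docs.example.com/getting-started",
    "title": "Getting Started — Example Docs",
    "markdown": "# Getting Started\n\nWelcome to Example...",
    "metadata": {"language": "en", "wordCount": 1234},
    "chunks": [
        {"text": "# Getting Started\n\nWelcome...", "chunkIndex": 0, "totalChunks": 2},
        {"text": "...continued content...", "chunkIndex": 1, "totalChunks": 2},
    ],
}

# Chunks arrive pre-split and indexed, ready for an embedding loop.
texts = [c["text"] for c in sorted(item["chunks"], key=lambda c: c["chunkIndex"])]
```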
## Chunking for RAG / Embeddings

When `chunkSize` is set, content is split into overlapping chunks for direct ingestion into vector databases. The chunker splits on natural boundaries:
- Paragraph breaks (double newline) — preferred
- Line breaks (single newline) — fallback
- Sentence boundaries (. ! ?) — next fallback
- Word boundaries (spaces) — last resort
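The fallback order above can be sketched as follows. This is an illustrative reimplementation, not the actor's actual code: it looks for the most-preferred boundary in the second half of each window so chunks stay close to `chunk_size`, then backs off to weaker boundaries.

```python
import re

def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into overlapping chunks, preferring natural boundaries."""
    if chunk_size <= 0 or len(text) <= chunk_size:
        return [text]
    # Boundary patterns in order of preference: paragraph break,
    # line break, sentence end, then any whitespace.
    boundaries = [r"\n\n", r"\n", r"(?<=[.!?])\s", r"\s"]
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            window = text[start:end]
            for pattern in boundaries:
                matches = list(re.finditer(pattern, window))
                # Only accept a split in the second half of the window,
                # so chunks don't become tiny.
                good = [m for m in matches if m.start() > chunk_size // 2]
                if good:
                    end = start + good[-1].start()
                    break
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break
        # Step back by `overlap` characters so adjacent chunks share context.
        start = max(end - overlap, start + 1)
    return chunks
```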
Recommended chunk sizes by embedding model:
| Model | Recommended chunkSize | chunkOverlap |
|---|---|---|
| OpenAI text-embedding-3-small | 1000–1500 | 200 |
| OpenAI text-embedding-3-large | 1500–2000 | 200 |
| Cohere embed-v3 | 1000–1500 | 150 |
| BGE / E5 models | 500–1000 | 100 |
| Sentence Transformers | 500–800 | 100 |
## Integration Examples

### LangChain (Python)

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("your-username/website-to-markdown").call(run_input={
    "startUrls": ["https://docs.example.com"],
    "maxPages": 100,
    "chunkSize": 1000,
    "chunkOverlap": 200,
})

# Load directly from the dataset
dataset = client.dataset(run["defaultDatasetId"])
items = dataset.list_items().items

# Each item has "markdown", "chunks", and "metadata" ready for your pipeline
for item in items:
    for chunk in item.get("chunks", []):
        # chunk["text"] is ready for embedding
        pass
```
### LlamaIndex (Python)

```python
from apify_client import ApifyClient
from llama_index.core import Document

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("your-username/website-to-markdown").call(run_input={
    "startUrls": ["https://docs.example.com"],
    "maxPages": 50,
    "outputFormat": "markdown",
})

dataset = client.dataset(run["defaultDatasetId"])
items = dataset.list_items().items

documents = [
    Document(
        text=item["markdown"],
        metadata={
            "url": item["url"],
            "title": item["title"],
            **item.get("metadata", {}),
        },
    )
    for item in items
]
```
### Direct API Call

```shell
curl -X POST "https://api.apify.com/v2/acts/your-username~website-to-markdown/runs" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -d '{
    "startUrls": ["https://docs.example.com"],
    "maxPages": 100,
    "chunkSize": 1500,
    "outputFormat": "markdown"
  }'
```
## Pricing
This actor uses Pay Per Event pricing:
| Event | Price |
|---|---|
| Page crawled | $0.003 |
Example costs:
- 50-page docs site: $0.15
- 200-page blog: $0.60
- 1,000-page wiki: $3.00
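The example costs above follow directly from the flat per-page price; a quick sanity check:

```python
# Flat Pay Per Event price from the table above.
PRICE_PER_PAGE = 0.003

def estimate_cost(pages: int) -> float:
    """Estimated crawl cost in USD, rounded to cents."""
    return round(pages * PRICE_PER_PAGE, 2)
```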
Cost comparison with alternatives:
- Apify Website Content Crawler: ~$0.005/page
- Diffbot: $0.01-0.05/page
- Custom scraping infrastructure: $50-200/month fixed
- This actor: $0.003/page — the most affordable option
## Content Extraction Quality
The crawler uses a multi-layer approach for reliable content extraction:
- Boilerplate removal — 60+ CSS selectors for navigation, ads, cookie banners, popups, social widgets, newsletter signups, comments, and related posts
- Main content detection — Tries semantic selectors (article, main, [role=main]) first, then falls back to text density scoring that considers paragraph count, heading count, link density, and content-related class names
- Semantic Markdown conversion — Recursive DOM traversal that preserves document structure: headings, lists (nested), code blocks (with language detection from 40+ languages), tables, blockquotes, links, images, figures with captions, definition lists, and inline formatting
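The text-density scoring idea can be sketched like this. The weights and features here are purely illustrative, not the actor's published heuristics: blocks gain score for paragraphs and headings and lose score for high link density, so an article body outranks a link-heavy nav block.

```python
def score_block(text_len: int, link_text_len: int,
                paragraphs: int, headings: int) -> float:
    """Toy content score: more paragraphs/headings good, heavy linking bad."""
    if text_len == 0:
        return 0.0
    link_density = link_text_len / text_len
    # Navigation and footers are mostly linked text, so link density
    # is a strong negative signal. Weights are illustrative.
    return paragraphs * 3 + headings * 2 - link_density * 10

# A long article body outscores a nav block that is almost all links.
article = score_block(text_len=4000, link_text_len=200, paragraphs=12, headings=3)
nav = score_block(text_len=300, link_text_len=280, paragraphs=0, headings=0)
```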
## Tips

- Start small — Test with `maxPages: 5` to verify output quality before running large crawls
- Use sitemap — For complete site coverage, provide `sitemapUrl` to discover all pages
- Filter URLs — Use `urlPattern` to focus on specific sections (e.g., `/docs/` or `/blog/`)
- Exclude patterns — Skip binary files with `excludePattern`: `\\.(pdf|zip|png|jpg|gif)$`
- Adjust rate limit — Lower `maxRequestsPerMinute` for smaller sites to be polite
- Enable proxy — Use Apify proxy for sites with anti-bot protection
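The `excludePattern` tip above uses a JSON-escaped regex (`\\.` is a literal dot). Applied client-side with Python's `re` module, the same pattern filters URLs like so:

```python
import re

# Same pattern as the excludePattern tip; in a Python raw string
# the JSON escape \\. becomes \.
EXCLUDE = re.compile(r"\.(pdf|zip|png|jpg|gif)$")

urls = [
    "https://docs.example.com/guide",
    "https://docs.example.com/manual.pdf",
    "https://docs.example.com/logo.png",
]
kept = [u for u in urls if not EXCLUDE.search(u)]
```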
## Related Actors
- Web Scraper — General-purpose web scraper
- Google Search Scraper — Find URLs to crawl via Google
- SEO Analyzer — Analyze website SEO metrics
- Contact Email Finder — Extract emails from crawled sites