Docs Markdown RAG-Ready Crawler
An Apify Actor that crawls documentation websites and converts them into clean markdown with RAG-ready chunks for embeddings. Includes internal link graphs and content hashes for change detection.
Features
- Markdown Conversion - Converts HTML content to clean, well-formatted markdown
- RAG-Ready Chunks - Automatically splits content into chunks optimized for embedding models
- Dual Crawler Support - Playwright for JavaScript SPAs, Cheerio for static HTML (faster)
- Link Graph - Extracts internal link relationships for building knowledge graphs
- Content Hashing - SHA-256 hashes for detecting content changes
- Smart Content Extraction - Automatically identifies main content and removes navigation/noise
- URL Normalization - Handles query params, trailing slashes, and tracking parameters
Output Datasets
The crawler generates multiple dataset types (identified by _datasetType):
Pages (_datasetType: 'pages')
Full page data including:
- `url`, `normalizedUrl`, `canonicalUrl`
- `title`, `h1`, `language`
- `text` - Plain text content
- `markdown` - Converted markdown
- `excerpt` - First 300 characters
- `depth` - Crawl depth from start URL
- `referrers` - URLs that linked to this page
- `outgoingInternalLinks`, `outgoingExternalLinks`
- `contentHash` - SHA-256 hash of markdown content
- `fetchedAt` - ISO timestamp
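To work with these records downstream, filter the run's dataset on `_datasetType`. A minimal sketch using the Apify JavaScript client (the API token and dataset ID are placeholders):

```typescript
import { ApifyClient } from 'apify-client';

// Placeholders: supply your own API token and the default dataset ID of a finished run.
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });
const { items } = await client.dataset('YOUR_DATASET_ID').listItems();

// Keep only the full-page records; chunks, edges, and issues can be filtered the same way.
const pages = items.filter((item) => item._datasetType === 'pages');
console.log(`Crawled ${pages.length} pages`);
```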
Chunks (_datasetType: 'chunks')
RAG-ready content chunks:
- `chunkId` - Stable unique identifier
- `url`, `normalizedUrl`
- `chunkIndex` - Position in document
- `headingPath` - Array of parent headings (e.g., `["Getting Started", "Installation"]`)
- `markdown`, `text` - Chunk content
- `charStart`, `charEnd` - Character positions in original document
- `chunkHash` - Hash of chunk content
- `pageContentHash` - Hash of parent page
- `tokenEstimate` - Approximate token count
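A common next step is to turn each chunk into an embedding input, prefixing the text with its `headingPath` so the model sees the surrounding document structure. A small sketch with an illustrative chunk (field names follow the schema above):

```typescript
// Field names follow the chunk schema above; the example values are illustrative.
interface Chunk {
  chunkId: string;
  headingPath: string[];
  markdown: string;
  tokenEstimate: number;
}

// Prefix the chunk text with its heading path so the embedding keeps document context.
function toEmbeddingInput(chunk: Chunk): { id: string; text: string } {
  const context = chunk.headingPath.join(' > ');
  return {
    id: chunk.chunkId,
    text: context ? `${context}\n\n${chunk.markdown}` : chunk.markdown,
  };
}

const example: Chunk = {
  chunkId: 'chunk-001',
  headingPath: ['Getting Started', 'Installation'],
  markdown: 'Run `npm install` to set up the project.',
  tokenEstimate: 14,
};

console.log(toEmbeddingInput(example).text);
// Getting Started > Installation
//
// Run `npm install` to set up the project.
```

The `tokenEstimate` field can be used to skip chunks that would exceed an embedding model's context window.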
Edges (_datasetType: 'edges')
Internal link graph:
- `from` - Source URL (normalized)
- `to` - Target URL (normalized)
- `type` - Link type (`a[href]`)
- `anchorText` - Link text
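These edge records fold naturally into an adjacency map for backlink analysis or a lightweight knowledge graph. A sketch with illustrative URLs:

```typescript
// Shape follows the edge schema above.
interface Edge {
  from: string;
  to: string;
  anchorText: string;
}

// Build an outgoing-link adjacency map keyed by the normalized source URL.
function buildLinkGraph(edges: Edge[]): Map<string, Edge[]> {
  const graph = new Map<string, Edge[]>();
  for (const edge of edges) {
    const outgoing = graph.get(edge.from) ?? [];
    outgoing.push(edge);
    graph.set(edge.from, outgoing);
  }
  return graph;
}

const graph = buildLinkGraph([
  { from: 'https://docs.example.com/', to: 'https://docs.example.com/install', anchorText: 'Install' },
  { from: 'https://docs.example.com/', to: 'https://docs.example.com/api', anchorText: 'API reference' },
]);

console.log(graph.get('https://docs.example.com/')?.length); // 2
```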
Issues (_datasetType: 'issues')
Crawl errors and warnings:
- `type` - Error type
- `url` - Affected URL
- `message` - Error message
- `severity` - Error severity level
Input Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| domain | string | required | Domain to crawl (e.g., https://docs.example.com) |
| startUrls | array | [] | Override start URLs (optional) |
| maxPages | integer | 200 | Maximum pages to crawl (1-10,000) |
| maxDepth | integer | 4 | Maximum crawl depth (1-10) |
| makeRagReady | boolean | true | Generate RAG-ready chunks |
| mode | string | "docs" | Extraction mode: docs, article, generic |
| output | string | "all" | Output: all, pagesOnly, chunksOnly, edgesOnly |
| crawlerType | string | "playwright" | Engine: playwright (for SPAs) or cheerio (for static) |
| includeSubdomains | boolean | false | Also crawl subdomains |
| respectRobotsTxt | boolean | true | Follow robots.txt rules |
| removeSelectors | array | ["nav", "aside", ...] | CSS selectors to remove |
| allowPatterns | array | [] | Regex patterns for URLs to include |
| denyPatterns | array | [".*utm_.*", ...] | Regex patterns for URLs to exclude |
| stripQueryParams | boolean | true | Remove query parameters from URLs |
| chunkTargetChars | integer | 2500 | Target chunk size (500-10,000) |
| chunkMaxChars | integer | 4500 | Maximum chunk size (1,000-20,000) |
| minChunkChars | integer | 400 | Minimum chunk size (100-2,000) |
| proxyConfiguration | object | - | Apify proxy settings |
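A sketch of how allowPatterns and denyPatterns might be applied to a discovered URL; the precedence shown (a deny match always wins, and an empty allow list permits everything) is an assumption rather than documented behaviour:

```typescript
// Sketch of allow/deny regex filtering for a discovered URL.
// Assumptions (not stated in the table above): an empty allow list permits
// everything, and a deny match always wins over an allow match.
function shouldCrawl(url: string, allowPatterns: string[], denyPatterns: string[]): boolean {
  if (denyPatterns.some((pattern) => new RegExp(pattern).test(url))) return false;
  if (allowPatterns.length === 0) return true;
  return allowPatterns.some((pattern) => new RegExp(pattern).test(url));
}

// With the default denyPatterns, tracking URLs are skipped.
console.log(shouldCrawl('https://docs.example.com/page?utm_source=x', [], ['.*utm_.*'])); // false
console.log(shouldCrawl('https://docs.example.com/guide/intro', ['.*/guide/.*'], ['.*utm_.*'])); // true
```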
Example Input
{"domain": "https://docs.convex.dev","maxPages": 500,"maxDepth": 5,"makeRagReady": true,"mode": "docs","output": "all","crawlerType": "playwright","chunkTargetChars": 2500,"chunkMaxChars": 4500}
Crawler Types
Playwright (default)
- Best for: JavaScript SPAs, React/Vue/Next.js documentation sites
- Waits for `networkidle` to ensure all content is loaded
- Slower but handles dynamic content
- Timeout: 120 seconds per page
Cheerio
- Best for: Static HTML sites, traditional documentation
- Much faster (no browser required)
- Lower resource usage
- Timeout: 30 seconds per page
Content Extraction
The crawler uses smart selectors to find main content:
Docs mode tries (in order):
- `main`, `article`, `[role="main"]`
- `.content`, `.markdown`, `.prose`
- `.theme-doc-markdown`, `.md-content`, `.docs-content`
- Falls back to `body`
Automatically removes noise elements:
- `nav`, `aside`, `header`, `footer`
- `.toc`, `.sidebar`, `.navigation`, `.menu`
- Any custom selectors you specify
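A rough sketch of this priority-ordered extraction, using Cheerio for selection and Turndown for the markdown conversion (both appear under Dependencies); the Actor's actual implementation may differ in detail:

```typescript
import * as cheerio from 'cheerio';
import TurndownService from 'turndown';

// Selector lists mirror the docs-mode and noise lists above.
const CONTENT_SELECTORS = [
  'main', 'article', '[role="main"]',
  '.content', '.markdown', '.prose',
  '.theme-doc-markdown', '.md-content', '.docs-content',
];
const NOISE_SELECTORS = ['nav', 'aside', 'header', 'footer', '.toc', '.sidebar', '.navigation', '.menu'];

// Illustrative extraction: strip noise, take the first matching content selector, convert to markdown.
function extractMarkdown(html: string, removeSelectors: string[] = []): string {
  const $ = cheerio.load(html);

  // Drop navigation and other noise before looking for the main content.
  for (const selector of [...NOISE_SELECTORS, ...removeSelectors]) {
    $(selector).remove();
  }

  // Try the docs-mode selectors in priority order, falling back to <body>.
  let contentHtml = $('body').html() ?? '';
  for (const selector of CONTENT_SELECTORS) {
    const match = $(selector).first();
    if (match.length > 0) {
      contentHtml = match.html() ?? '';
      break;
    }
  }

  return new TurndownService().turndown(contentHtml);
}
```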
Chunking Strategy
Content is split into chunks based on:
- Heading boundaries - New chunks at `#`, `##`, `###`, `####` headings
- Target size - Aims for ~2,500 characters per chunk
- Max size - Hard limit at 4,500 characters
- Min size - Avoids tiny chunks under 400 characters
- Paragraph preservation - Splits at paragraph boundaries when possible
- Sentence preservation - Falls back to sentence/word boundaries for very long paragraphs
Each chunk includes its headingPath for context, making it ideal for RAG systems.
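A simplified sketch of the heading-boundary part of this strategy; the real chunker additionally enforces the target/max/min sizes and the paragraph and sentence fallbacks:

```typescript
interface DraftChunk {
  headingPath: string[];
  markdown: string;
}

// Split markdown into draft chunks at #, ##, ###, #### headings,
// tracking the heading path for each chunk. Size limits are omitted here.
function splitByHeadings(markdown: string): DraftChunk[] {
  const chunks: DraftChunk[] = [];
  const headingPath: string[] = [];
  let buffer: string[] = [];

  const flush = () => {
    const text = buffer.join('\n').trim();
    if (text.length > 0) chunks.push({ headingPath: [...headingPath], markdown: text });
    buffer = [];
  };

  for (const line of markdown.split('\n')) {
    const match = /^(#{1,4})\s+(.*)$/.exec(line);
    if (match) {
      flush();
      // Truncate the path to this heading's parent level, then append the heading.
      headingPath.splice(match[1].length - 1);
      headingPath.push(match[2].trim());
    }
    buffer.push(line);
  }
  flush();
  return chunks;
}

const drafts = splitByHeadings('# Getting Started\nIntro text.\n\n## Installation\nRun npm install.');
console.log(drafts.map((d) => d.headingPath));
// [ [ 'Getting Started' ], [ 'Getting Started', 'Installation' ] ]
```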
Local Development
```bash
# Install dependencies
npm install

# Run locally
apify run

# Run with input
apify run --input='{"domain": "https://docs.example.com"}'

# Deploy to Apify
apify push
```
Technical Notes
- 9MB Limit: Apify dataset items have a ~9MB limit. Pages exceeding this are automatically truncated (with a `truncated: true` flag).
- URL Normalization: URLs are normalized (HTTPS, no trailing slashes, tracking params stripped) for deduplication.
- Content Hashes: Use the `contentHash` and `chunkHash` fields to detect content changes between crawls.
- Stable Chunk IDs: `chunkId` is deterministic, based on URL, position, and content; same content = same ID.
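A sketch of hash-based change detection between two crawls, assuming the pages records from each run have already been loaded (field names come from the Pages schema above):

```typescript
interface PageRecord {
  normalizedUrl: string;
  contentHash: string;
}

// Compare two crawls by contentHash: changed, added, and removed pages.
function diffCrawls(previous: PageRecord[], current: PageRecord[]) {
  const prevHashes = new Map(
    previous.map((p): [string, string] => [p.normalizedUrl, p.contentHash]),
  );

  const changed: string[] = [];
  const added: string[] = [];

  for (const page of current) {
    const oldHash = prevHashes.get(page.normalizedUrl);
    if (oldHash === undefined) added.push(page.normalizedUrl);
    else if (oldHash !== page.contentHash) changed.push(page.normalizedUrl);
    prevHashes.delete(page.normalizedUrl);
  }

  // Whatever is left in prevHashes existed before but not in the new crawl.
  const removed = [...prevHashes.keys()];
  return { changed, added, removed };
}
```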
Dependencies
- Apify SDK - Actor framework
- Crawlee - Web scraping library
- Playwright - Browser automation
- Turndown - HTML to Markdown conversion
License
ISC