Docs Markdown RAG-Ready Crawler

An Apify Actor that crawls documentation websites and converts them into clean markdown with RAG-ready chunks for embeddings. Includes internal link graphs and content hashes for change detection.

Features

  • Markdown Conversion - Converts HTML content to clean, well-formatted markdown
  • RAG-Ready Chunks - Automatically splits content into chunks optimized for embedding models
  • Dual Crawler Support - Playwright for JavaScript SPAs, Cheerio for static HTML (faster)
  • Link Graph - Extracts internal link relationships for building knowledge graphs
  • Content Hashing - SHA-256 hashes for detecting content changes
  • Smart Content Extraction - Automatically identifies main content and removes navigation/noise
  • URL Normalization - Handles query params, trailing slashes, and tracking parameters

Output Datasets

The crawler generates multiple dataset types (identified by _datasetType):

Pages (_datasetType: 'pages')

Full page data including:

  • url, normalizedUrl, canonicalUrl
  • title, h1, language
  • text - Plain text content
  • markdown - Converted markdown
  • excerpt - First 300 characters
  • depth - Crawl depth from start URL
  • referrers - URLs that linked to this page
  • outgoingInternalLinks, outgoingExternalLinks
  • contentHash - SHA-256 hash of markdown content
  • fetchedAt - ISO timestamp
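
For illustration, an abridged pages item might look like the following (field values are invented for the example; real items contain the full field set above):

{
  "_datasetType": "pages",
  "url": "https://docs.example.com/getting-started/",
  "normalizedUrl": "https://docs.example.com/getting-started",
  "canonicalUrl": "https://docs.example.com/getting-started",
  "title": "Getting Started",
  "h1": "Getting Started",
  "language": "en",
  "markdown": "# Getting Started\n\nInstall the package...",
  "excerpt": "Install the package...",
  "depth": 1,
  "referrers": ["https://docs.example.com"],
  "outgoingInternalLinks": ["https://docs.example.com/installation"],
  "outgoingExternalLinks": [],
  "contentHash": "3f8a1c0d…",
  "fetchedAt": "2025-01-01T12:00:00.000Z"
}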

Chunks (_datasetType: 'chunks')

RAG-ready content chunks:

  • chunkId - Stable unique identifier
  • url, normalizedUrl
  • chunkIndex - Position in document
  • headingPath - Array of parent headings (e.g., ["Getting Started", "Installation"])
  • markdown, text - Chunk content
  • charStart, charEnd - Character positions in original document
  • chunkHash - Hash of chunk content
  • pageContentHash - Hash of parent page
  • tokenEstimate - Approximate token count
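
A chunks item, again with purely illustrative values:

{
  "_datasetType": "chunks",
  "chunkId": "a41b2c9e…",
  "url": "https://docs.example.com/getting-started/",
  "normalizedUrl": "https://docs.example.com/getting-started",
  "chunkIndex": 2,
  "headingPath": ["Getting Started", "Installation"],
  "markdown": "## Installation\n\nRun the install command...",
  "text": "Installation Run the install command...",
  "charStart": 1850,
  "charEnd": 4310,
  "chunkHash": "9c0d4e7b…",
  "pageContentHash": "3f8a1c0d…",
  "tokenEstimate": 610
}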

Edges (_datasetType: 'edges')

Internal link graph:

  • from - Source URL (normalized)
  • to - Target URL (normalized)
  • type - Link type (a[href])
  • anchorText - Link text
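
An illustrative edges item:

{
  "_datasetType": "edges",
  "from": "https://docs.example.com/getting-started",
  "to": "https://docs.example.com/installation",
  "type": "a[href]",
  "anchorText": "Installation guide"
}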

Issues (_datasetType: 'issues')

Crawl errors and warnings:

  • type - Error type
  • url - Affected URL
  • message - Error message
  • severity - Error severity level
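
An illustrative issues item (the type and severity values shown here are assumptions, not an exhaustive list):

{
  "_datasetType": "issues",
  "type": "requestFailed",
  "url": "https://docs.example.com/broken-page",
  "message": "Request timed out",
  "severity": "warning"
}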

Input Configuration

  • domain (string, required) - Domain to crawl (e.g., https://docs.example.com)
  • startUrls (array, default []) - Override start URLs (optional)
  • maxPages (integer, default 200) - Maximum pages to crawl (1-10,000)
  • maxDepth (integer, default 4) - Maximum crawl depth (1-10)
  • makeRagReady (boolean, default true) - Generate RAG-ready chunks
  • mode (string, default "docs") - Extraction mode: docs, article, generic
  • output (string, default "all") - Output: all, pagesOnly, chunksOnly, edgesOnly
  • crawlerType (string, default "playwright") - Engine: playwright (for SPAs) or cheerio (for static HTML)
  • includeSubdomains (boolean, default false) - Also crawl subdomains
  • respectRobotsTxt (boolean, default true) - Follow robots.txt rules
  • removeSelectors (array, default ["nav", "aside", ...]) - CSS selectors to remove
  • allowPatterns (array, default []) - Regex patterns for URLs to include
  • denyPatterns (array, default [".*utm_.*", ...]) - Regex patterns for URLs to exclude
  • stripQueryParams (boolean, default true) - Remove query parameters from URLs
  • chunkTargetChars (integer, default 2500) - Target chunk size (500-10,000)
  • chunkMaxChars (integer, default 4500) - Maximum chunk size (1,000-20,000)
  • minChunkChars (integer, default 400) - Minimum chunk size (100-2,000)
  • proxyConfiguration (object) - Apify proxy settings

Example Input

{
  "domain": "https://docs.convex.dev",
  "maxPages": 500,
  "maxDepth": 5,
  "makeRagReady": true,
  "mode": "docs",
  "output": "all",
  "crawlerType": "playwright",
  "chunkTargetChars": 2500,
  "chunkMaxChars": 4500
}
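
To run the Actor programmatically, a minimal sketch using the apify-client package might look like the following (the Actor ID placeholder and token environment variable are assumptions; substitute your own values):

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start the Actor run and wait for it to finish.
const run = await client.actor('<ACTOR_ID>').call({
  domain: 'https://docs.convex.dev',
  maxPages: 500,
  makeRagReady: true,
  output: 'all',
});

// Read the results from the run's default dataset and keep only the RAG chunks.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
const chunks = items.filter((item) => item._datasetType === 'chunks');
console.log(`Fetched ${chunks.length} RAG-ready chunks`);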

Crawler Types

Playwright (default)

  • Best for: JavaScript SPAs, React/Vue/Next.js documentation sites
  • Waits for networkidle to ensure all content is loaded
  • Slower but handles dynamic content
  • Timeout: 120 seconds per page

Cheerio

  • Best for: Static HTML sites, traditional documentation
  • Much faster (no browser required)
  • Lower resource usage
  • Timeout: 30 seconds per page
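
For a mostly static documentation site, a leaner input using the Cheerio engine might look like this (the domain is only an example):

{
  "domain": "https://docs.example.com",
  "maxPages": 200,
  "crawlerType": "cheerio",
  "makeRagReady": true,
  "output": "all"
}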

Content Extraction

The crawler uses smart selectors to find main content:

Docs mode tries (in order):

  1. main, article, [role="main"]
  2. .content, .markdown, .prose
  3. .theme-doc-markdown, .md-content, .docs-content
  4. Falls back to body

Automatically removes noise elements:

  • nav, aside, header, footer
  • .toc, .sidebar, .navigation, .menu
  • Any custom selectors you specify
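
The Actor's internal extraction code isn't reproduced here, but a rough sketch of this selector-priority approach using Cheerio (with abridged selector lists) could look like:

import * as cheerio from 'cheerio';

// Candidate containers for docs mode, tried in priority order (abridged list).
const CONTENT_SELECTORS = ['main', 'article', '[role="main"]', '.content', '.markdown', '.prose'];
// Noise removed before extraction (abridged list).
const NOISE_SELECTORS = ['nav', 'aside', 'header', 'footer', '.toc', '.sidebar'];

function extractMainContent(html: string): string {
  const $ = cheerio.load(html);
  // Strip navigation and other noise before picking the main container.
  NOISE_SELECTORS.forEach((sel) => $(sel).remove());
  // Return the first candidate that actually contains text.
  for (const sel of CONTENT_SELECTORS) {
    const el = $(sel).first();
    if (el.length && el.text().trim().length > 0) {
      return el.html() ?? '';
    }
  }
  // Fall back to the whole body.
  return $('body').html() ?? '';
}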

Chunking Strategy

Content is split into chunks based on:

  1. Heading boundaries - New chunks at #, ##, ###, #### headings
  2. Target size - Aims for ~2,500 characters per chunk
  3. Max size - Hard limit at 4,500 characters
  4. Min size - Avoids tiny chunks under 400 characters
  5. Paragraph preservation - Splits at paragraph boundaries when possible
  6. Sentence preservation - Falls back to sentence/word boundaries for very long paragraphs

Each chunk includes its headingPath for context, making it ideal for RAG systems.
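
The Actor's exact chunker isn't published in this README; a rough sketch of heading-aware chunking along these lines (constants mirror the defaults above) might be:

// Rough sketch only; not the Actor's actual implementation.
const TARGET_CHARS = 2500;
const MAX_CHARS = 4500;
const MIN_CHARS = 400;

interface Chunk { headingPath: string[]; markdown: string; }

function chunkMarkdown(markdown: string): Chunk[] {
  // 1. Split into sections at # to #### headings, tracking the heading path for context.
  const sections: Chunk[] = [];
  let path: string[] = [];
  let buffer: string[] = [];
  const flush = () => {
    const text = buffer.join('\n').trim();
    if (text) sections.push({ headingPath: [...path], markdown: text });
    buffer = [];
  };
  for (const line of markdown.split('\n')) {
    const m = /^(#{1,4})\s+(.*)$/.exec(line);
    if (m) {
      flush();
      path = [...path.slice(0, m[1].length - 1), m[2].trim()];
    }
    buffer.push(line);
  }
  flush();

  // 2. Merge small neighboring sections toward the target size, never exceeding the max.
  const chunks: Chunk[] = [];
  for (const section of sections) {
    const last = chunks[chunks.length - 1];
    const merged = last ? last.markdown.length + section.markdown.length : Infinity;
    if (last && merged <= MAX_CHARS && (last.markdown.length < MIN_CHARS || merged <= TARGET_CHARS)) {
      last.markdown += '\n\n' + section.markdown;
    } else {
      chunks.push({ ...section });
    }
  }
  return chunks;
}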

Local Development

# Install dependencies
npm install
# Run locally
apify run
# Run with input
apify run --input='{"domain": "https://docs.example.com"}'
# Deploy to Apify
apify push

Technical Notes

  • 9MB Limit: Apify dataset items have a ~9MB limit. Pages exceeding this are automatically truncated and flagged with truncated: true.
  • URL Normalization: URLs are normalized (HTTPS, no trailing slashes, tracking params stripped) for deduplication.
  • Content Hashes: Use contentHash and chunkHash fields to detect content changes between crawls.
  • Stable Chunk IDs: chunkId is deterministic, derived from URL, position, and content, so identical content always produces the same ID.
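
For example, pages whose content changed between two crawls can be found by comparing contentHash values keyed by normalizedUrl. A minimal sketch, assuming the pages items from both crawls are already loaded into arrays:

interface PageItem { normalizedUrl: string; contentHash: string; }

// Returns normalized URLs that are new or whose markdown changed since the previous crawl.
function changedPages(previous: PageItem[], current: PageItem[]): string[] {
  const prevHashes = new Map(previous.map((p) => [p.normalizedUrl, p.contentHash] as const));
  return current
    .filter((p) => prevHashes.get(p.normalizedUrl) !== p.contentHash)
    .map((p) => p.normalizedUrl);
}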

Dependencies

License

ISC