Documentation Crawler
Pricing: Pay per usage
Developer: Donny Nguyen
Crawl documentation sites and extract structured content as clean Markdown. This actor handles sidebar navigation, code blocks, version selectors, and nested page hierarchies commonly found in technical documentation.
Features
- Structured Markdown extraction from documentation pages with proper heading hierarchy
- Code block preservation with language detection and syntax highlighting markers
- Sidebar navigation following to discover all documentation pages automatically
- Breadcrumb extraction for understanding page hierarchy and context
- Word count and metadata for content analysis and quality assessment
- Configurable page limit (`maxPages`) to control how many pages to process
Use Cases
- Build AI knowledge bases from technical documentation
- Create offline documentation archives
- Monitor documentation changes over time
- Feed documentation into RAG (Retrieval-Augmented Generation) pipelines
- Analyze documentation coverage and quality across products
Input Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| `startUrls` | array | `["https://docs.apify.com"]` | Documentation site URLs to crawl |
| `maxPages` | integer | `200` | Maximum number of pages to crawl |
| `followSidebar` | boolean | `true` | Follow links found in sidebar navigation |
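
As a rough illustration, the input object can be assembled and sanity-checked before a run. The sketch below uses only the parameter names and defaults from the table; the helper `build_input` and its validation rules are hypothetical and not part of the actor, and it assumes `startUrls` takes plain URL strings as the default value suggests.

```python
import json
from typing import Any, Dict, List, Optional


def build_input(
    start_urls: Optional[List[str]] = None,
    max_pages: int = 200,
    follow_sidebar: bool = True,
) -> Dict[str, Any]:
    """Build an input object with the documented parameter names and defaults.

    Hypothetical helper: the checks below are assumptions, not actor behavior.
    """
    if start_urls is None:
        start_urls = ["https://docs.apify.com"]  # default from the table
    if not start_urls:
        raise ValueError("startUrls must contain at least one URL")
    if max_pages < 1:
        raise ValueError("maxPages must be a positive integer")
    return {
        "startUrls": list(start_urls),
        "maxPages": max_pages,
        "followSidebar": follow_sidebar,
    }


if __name__ == "__main__":
    # Print a JSON input that caps the crawl at 50 pages.
    print(json.dumps(build_input(max_pages=50), indent=2))
```

The resulting JSON can be pasted into the actor's input editor or passed to whichever client is used to start the run.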
Output Format
Each page produces a dataset item with the following fields:
- `url` - The page URL
- `title` - The page title
- `content` - Full page content as Markdown
- `headings` - Array of headings with level and text
- `codeBlocks` - Array of code blocks with language and content
- `breadcrumb` - Navigation breadcrumb path
- `wordCount` - Number of words in the content
- `scrapedAt` - ISO timestamp of when the page was scraped
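
For downstream code, it can help to describe the item shape with types. The sketch below is one possible reading of the field list. The top-level names come from the list above, while the nested key names (`level`, `text`, `language`, `content`), the string type for `breadcrumb`, and the helper `missing_fields` are assumptions that should be checked against real output.

```python
from typing import Any, Dict, List, TypedDict


class Heading(TypedDict):
    level: int  # assumed key name, e.g. 2 for a second-level heading
    text: str   # assumed key name


class CodeBlock(TypedDict):
    language: str  # assumed key name
    content: str   # assumed key name


class PageItem(TypedDict):
    url: str
    title: str
    content: str              # full page content as Markdown
    headings: List[Heading]
    codeBlocks: List[CodeBlock]
    breadcrumb: str           # assumption: could also be an array of path parts
    wordCount: int
    scrapedAt: str            # ISO timestamp


REQUIRED_FIELDS = set(PageItem.__annotations__)


def missing_fields(item: Dict[str, Any]) -> List[str]:
    """Return the documented top-level fields that are absent from an item."""
    return sorted(REQUIRED_FIELDS - set(item))
```

Running `missing_fields` over a few dataset items is a quick way to confirm that the output matches these assumptions before building on it.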
Integration with AI Pipelines
The structured Markdown output is ideal for feeding into AI systems. Each page is self-contained with metadata, making it easy to chunk and embed for vector databases. The headings array enables semantic sectioning, while code blocks are preserved with language tags for proper formatting.
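
One way to chunk a page for embedding is to split the `content` Markdown at heading lines. The following is a minimal sketch, not the actor's own logic. It assumes ATX-style headings (`#` to `######`), skips `#` lines inside fenced code blocks, and uses a hypothetical function name `chunk_by_headings`.

```python
import re
from typing import Any, Dict, List

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
# Fence markers are built programmatically to keep this example easy to embed.
FENCE_MARKERS = ("`" * 3, "~" * 3)


def chunk_by_headings(markdown: str) -> List[Dict[str, Any]]:
    """Split Markdown into chunks, one per heading, keeping code fences intact."""
    chunks: List[Dict[str, Any]] = []
    level, heading, body = 0, "", []  # level 0 holds text before the first heading
    in_fence = False

    def flush() -> None:
        text = "\n".join(body).strip()
        if heading or text:
            chunks.append({"level": level, "heading": heading, "text": text})

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE_MARKERS):
            in_fence = not in_fence
        # Lines inside a fence are never treated as headings.
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            flush()
            level, heading, body = len(match.group(1)), match.group(2), []
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk can then be embedded together with the page's `url` and `title` as metadata. Splitting on the `headings` array instead would work equally well if its entries line up with the Markdown.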
Supported Documentation Platforms
This actor works with most documentation frameworks including Docusaurus, GitBook, ReadTheDocs, MkDocs, VuePress, and custom documentation sites with standard HTML structure.
Limitations
- JavaScript-rendered documentation may require the Puppeteer variant
- Rate limiting is respected automatically via Crawlee's built-in mechanisms
- Very large documentation sites (10,000+ pages) should use pagination via `maxPages`