Documentation Crawler

Pricing

Pay per usage

Developer

Donny Nguyen

Maintained by Community

Crawl documentation sites and extract structured content as clean Markdown. This actor handles sidebar navigation, code blocks, version selectors, and nested page hierarchies commonly found in technical documentation.

Features

  • Structured Markdown extraction from documentation pages with proper heading hierarchy
  • Code block preservation with language detection and syntax highlighting markers
  • Automatic page discovery by following links in the sidebar navigation
  • Breadcrumb extraction for understanding page hierarchy and context
  • Word count and metadata for content analysis and quality assessment
  • Configurable page limit (maxPages) to control how many pages are processed

Use Cases

  • Build AI knowledge bases from technical documentation
  • Create offline documentation archives
  • Monitor documentation changes over time
  • Feed documentation into RAG (Retrieval-Augmented Generation) pipelines
  • Analyze documentation coverage and quality across products
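For the change-monitoring use case, one possible approach is to fingerprint each page's content and compare two runs. The sketch below is only an illustration: it assumes the dataset items of each run have already been loaded as lists of dictionaries, and it relies only on the url and content fields described under Output Format. The helper names content_fingerprints and diff_runs are hypothetical and not part of the actor.

```python
import hashlib
from typing import Dict, Iterable, List


def content_fingerprints(items: Iterable[dict]) -> Dict[str, str]:
    """Map each page URL to a SHA-256 hash of its Markdown content."""
    return {
        item["url"]: hashlib.sha256(item["content"].encode("utf-8")).hexdigest()
        for item in items
    }


def diff_runs(previous: Iterable[dict], current: Iterable[dict]) -> Dict[str, List[str]]:
    """Compare two crawl results (hypothetical helper, not part of the actor)."""
    old = content_fingerprints(previous)
    new = content_fingerprints(current)
    return {
        # URLs that only exist in the newer run
        "added": sorted(new.keys() - old.keys()),
        # URLs that disappeared since the older run
        "removed": sorted(old.keys() - new.keys()),
        # URLs present in both runs whose content hash differs
        "changed": sorted(u for u in old.keys() & new.keys() if old[u] != new[u]),
    }
```

Hashing the whole content field treats any edit as a change, including whitespace. A stricter or looser comparison would need its own normalization step.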

Input Configuration

Parameter      Type     Default                     Description
startUrls      array    ["https://docs.apify.com"]  Documentation site URLs to crawl
maxPages       integer  200                         Maximum number of pages to crawl
followSidebar  boolean  true                        Follow links found in sidebar navigation
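As a rough illustration, an input object for a run could be assembled in Python as shown below. Only the three parameter names and their defaults come from the table above. The build_input helper and its validation checks are assumptions made for this sketch and do not describe how the actor itself validates input.

```python
import json
from typing import List, Optional


def build_input(
    start_urls: Optional[List[str]] = None,
    max_pages: int = 200,
    follow_sidebar: bool = True,
) -> dict:
    """Assemble an input object using the documented names and defaults.

    build_input is a hypothetical helper, not part of the actor.
    """
    if start_urls is None:
        start_urls = ["https://docs.apify.com"]  # documented default
    if not start_urls:
        raise ValueError("startUrls must contain at least one URL")
    if max_pages < 1:
        raise ValueError("maxPages must be a positive integer")
    return {
        "startUrls": list(start_urls),
        "maxPages": int(max_pages),
        "followSidebar": bool(follow_sidebar),
    }


if __name__ == "__main__":
    # Print the JSON that would be supplied as the run input
    print(json.dumps(build_input(max_pages=50), indent=2))
```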

Output Format

Each page produces a dataset item with the following fields:

  • url - The page URL
  • title - The page title
  • content - Full page content as Markdown
  • headings - Array of headings with level and text
  • codeBlocks - Array of code blocks with language and content
  • breadcrumb - Navigation breadcrumb path
  • wordCount - Number of words in the content
  • scrapedAt - ISO timestamp of when the page was scraped
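The item shape can be written down as Python types to make downstream code easier to check. This is a sketch under stated assumptions: the top-level field names come from the list above, while the nested key names (level, text, language, content) and the breadcrumb being a list of strings are inferred from the field descriptions and may differ in real output. The missing_fields helper is hypothetical.

```python
from typing import List, TypedDict


class Heading(TypedDict):
    level: int  # assumed key name, e.g. 2 for a second-level heading
    text: str   # assumed key name


class CodeBlock(TypedDict):
    language: str  # assumed key name
    content: str   # assumed key name


class PageItem(TypedDict):
    url: str
    title: str
    content: str                 # full page content as Markdown
    headings: List[Heading]
    codeBlocks: List[CodeBlock]
    breadcrumb: List[str]        # assumed to be a list of path segments
    wordCount: int
    scrapedAt: str               # ISO timestamp


REQUIRED_FIELDS = set(PageItem.__annotations__)


def missing_fields(item: dict) -> List[str]:
    """Return the documented top-level fields absent from an item (hypothetical helper)."""
    return sorted(REQUIRED_FIELDS - set(item))
```

Checking for missing fields before indexing or embedding an item is a cheap way to notice when a page was only partially extracted.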

Integration with AI Pipelines

The structured Markdown output is designed for use in AI pipelines. Each page item is self-contained and carries its own metadata, so it can be chunked and embedded for a vector database without extra lookups. The headings array supports splitting content by section, and code blocks keep their language tags so they can be formatted correctly downstream.
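Section-based chunking can be sketched as follows. The example splits a page's content field at heading lines so that each chunk can be embedded separately. It assumes the Markdown uses "#"-style headings and triple-backtick code fences, which the description above does not guarantee. The split_by_headings function is a hypothetical helper, not part of the actor.

```python
import re
from typing import Dict, List

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # code fence marker, assumed to be triple backticks


def split_by_headings(markdown: str) -> List[Dict[str, object]]:
    """Split Markdown into sections, one per heading (hypothetical helper)."""
    sections = []
    current = {"level": 0, "heading": "", "lines": []}
    in_fence = False

    def flush(section):
        # Keep a section only if it has a heading or some non-blank text
        if section["heading"] or any(l.strip() for l in section["lines"]):
            sections.append(section)

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # entering or leaving a code block
        # Lines inside a code block are never treated as headings
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            flush(current)
            current = {"level": len(match.group(1)), "heading": match.group(2), "lines": []}
        else:
            current["lines"].append(line)
    flush(current)

    return [
        {"level": s["level"], "heading": s["heading"], "text": "\n".join(s["lines"]).strip()}
        for s in sections
    ]
```

Tracking the fence state keeps comment lines such as "# install deps" inside code blocks from being mistaken for headings. Text that appears before the first heading is returned as a section with level 0 and an empty heading.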

Supported Documentation Platforms

This actor works with most documentation frameworks including Docusaurus, GitBook, ReadTheDocs, MkDocs, VuePress, and custom documentation sites with standard HTML structure.

Limitations

  • JavaScript-rendered documentation may require the Puppeteer variant
  • Rate limiting is respected automatically via Crawlee's built-in mechanisms
  • Very large documentation sites (10,000+ pages) should use maxPages to cap the number of pages crawled in a single run