Pricing

Pay per usage

Documentation Crawler

Developer

Donny Nguyen

Last modified

a day ago

Features

Structured Markdown extraction from documentation pages with proper heading hierarchy
Code block preservation with language detection and syntax highlighting markers
Sidebar navigation following to discover all documentation pages automatically
Breadcrumb extraction for understanding page hierarchy and context
Word count and metadata for content analysis and quality assessment
Configurable page limit (`maxPages`) to control how many pages are processed

Use Cases

Build AI knowledge bases from technical documentation
Create offline documentation archives
Monitor documentation changes over time
Feed documentation into RAG (Retrieval-Augmented Generation) pipelines
Analyze documentation coverage and quality across products
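
As a rough illustration of the change-monitoring use case, two crawls of the same site can be compared by hashing each page's Markdown. This is only a sketch: it assumes each dataset item has the `url` and `content` fields described under Output Format, and the helper names `content_hashes` and `diff_crawls` are hypothetical, not part of the Actor.

```python
import hashlib
from typing import Dict, Iterable, List


def content_hashes(items: Iterable[Dict]) -> Dict[str, str]:
    """Map each page URL to a SHA-256 digest of its Markdown content."""
    return {
        item["url"]: hashlib.sha256(item["content"].encode("utf-8")).hexdigest()
        for item in items
    }


def diff_crawls(old: Iterable[Dict], new: Iterable[Dict]) -> Dict[str, List[str]]:
    """Report which URLs were added, removed, or changed between two crawls."""
    before, after = content_hashes(old), content_hashes(new)
    shared = before.keys() & after.keys()
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "changed": sorted(url for url in shared if before[url] != after[url]),
    }
```

Hashing the whole page flags any edit, including whitespace changes; normalizing the text before hashing would make the comparison less noisy.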

Input Configuration

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `startUrls` | array | `["https://docs.apify.com"]` | Documentation site URLs to crawl |
| `maxPages` | integer | `200` | Maximum number of pages to crawl |
| `followSidebar` | boolean | `true` | Follow links found in sidebar navigation |
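
For illustration, the input can be assembled as a plain dictionary before it is handed to the Actor. The sketch below uses only the three parameters and defaults listed above. The helper name `build_input` and its validation rules are assumptions made for this example; the Actor itself may validate input differently.

```python
from typing import Any, Dict, List, Optional


def build_input(
    start_urls: Optional[List[str]] = None,
    max_pages: int = 200,
    follow_sidebar: bool = True,
) -> Dict[str, Any]:
    """Build an input object with the documented defaults (hypothetical helper)."""
    urls = start_urls if start_urls is not None else ["https://docs.apify.com"]
    # The checks below are assumptions, not documented Actor behavior.
    if not urls:
        raise ValueError("startUrls must contain at least one URL")
    for url in urls:
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"not an absolute HTTP(S) URL: {url}")
    if max_pages < 1:
        raise ValueError("maxPages must be a positive integer")
    return {
        "startUrls": urls,
        "maxPages": max_pages,
        "followSidebar": follow_sidebar,
    }
```

Calling `build_input(["https://example.com/docs"], max_pages=50)` yields a dictionary that can be serialized with `json.dumps` and used as the run input.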

Output Format

Each page produces a dataset item with the following fields:

url - The page URL
title - The page title
content - Full page content as Markdown
headings - Array of headings with level and text
codeBlocks - Array of code blocks with language and content
breadcrumb - Navigation breadcrumb path
wordCount - Number of words in the content
scrapedAt - ISO timestamp of when the page was scraped
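
The fields above could be modeled with typed dictionaries when consuming the dataset in Python. This is a sketch under assumptions: the top-level field names come from the list above, but the nested key names (`level`, `text`, `language`, `content`) and the choice of a list of strings for `breadcrumb` are guesses based on the descriptions, not a confirmed schema.

```python
from typing import Dict, List, TypedDict


class Heading(TypedDict):
    level: int  # assumed key name, e.g. 2 for a second-level heading
    text: str   # assumed key name


class CodeBlock(TypedDict):
    language: str  # assumed key name
    content: str   # assumed key name


class PageItem(TypedDict):
    url: str
    title: str
    content: str
    headings: List[Heading]
    codeBlocks: List[CodeBlock]
    breadcrumb: List[str]  # assumed shape: one entry per breadcrumb level
    wordCount: int
    scrapedAt: str


# Field names in declaration order, taken from the TypedDict itself.
REQUIRED_FIELDS = tuple(PageItem.__annotations__)


def missing_fields(item: Dict) -> List[str]:
    """Return the documented field names that are absent from a dataset item."""
    return [name for name in REQUIRED_FIELDS if name not in item]
```

A check like `missing_fields(item)` is a cheap way to skip or log incomplete items before they reach downstream processing.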

Integration with AI Pipelines

The structured Markdown output is well suited to AI pipelines. Each dataset item is self-contained and carries its own metadata (URL, title, breadcrumb, word count), so pages can be chunked and embedded for a vector database without extra lookups. The headings array supports splitting content along section boundaries, and code blocks keep their language tags so they can be formatted correctly downstream.
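
One possible way to chunk a page for embedding is to split its Markdown at heading lines and attach the page URL to every chunk. The sketch below is an assumption about how a consumer might do this, not something the Actor provides: `chunk_by_heading` is a hypothetical helper, it parses the `content` field directly rather than using the `headings` array, and it only recognizes `#`-style headings.

```python
import re
from typing import Dict, List

FENCE = "`" * 3  # Markdown code-fence marker
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(item: Dict) -> List[Dict]:
    """Split one page's Markdown into per-heading chunks that carry the page URL."""
    chunks: List[Dict] = []
    heading = item.get("title", "")
    lines: List[str] = []
    in_code = False

    def flush(current_heading: str, current_lines: List[str]) -> None:
        text = "\n".join(current_lines).strip()
        if text:  # skip headings that have no body text
            chunks.append({"url": item["url"], "heading": current_heading, "text": text})

    for line in item["content"].splitlines():
        if line.lstrip().startswith(FENCE):
            in_code = not in_code  # a '#' inside a code block is not a heading
        match = None if in_code else HEADING.match(line)
        if match:
            flush(heading, lines)
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush(heading, lines)
    return chunks
```

Each returned chunk can then be embedded on its own, with `url` and `heading` stored as metadata alongside the vector.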

Supported Documentation Platforms

This actor works with most documentation frameworks including Docusaurus, GitBook, ReadTheDocs, MkDocs, VuePress, and custom documentation sites with standard HTML structure.

Limitations

JavaScript-rendered documentation may require the Puppeteer variant
Rate limiting is respected automatically via Crawlee's built-in mechanisms
Very large documentation sites (10,000+ pages) should be crawled in batches, with each run capped by `maxPages`
