Pricing

Pay per usage

LLM-Ready Documentation Scraper

Developers and AI agents need to read documentation (e.g. Stripe Docs, Next.js Docs), but standard scrapers return noisy HTML that includes navigation bars, headers and footers, ads, and cookie banners. This Actor returns content-only Markdown, suitable for vectorization and semantic search.

Rating

0.0

(0)

Developer

Sean

Actor stats

Last modified: 3 months ago

LLM-Ready Documentation Scraper

Crawl any documentation website and get clean, formatted Markdown perfect for LLMs and RAG (Retrieval-Augmented Generation) applications.

🎯 Problem

Developers and AI agents need to read documentation (Stripe Docs, Next.js Docs, etc.), but standard scrapers return messy HTML with navbars, footers, and ads. This Actor solves that by delivering pure, clean Markdown.

✨ Features

Clean Markdown Output: Strips navigation, sidebars, footers, scripts, and ads
Smart Content Detection: Automatically finds the main content area
Token Counting: Each page includes a token count for LLM context planning
Merge Mode: Combine all pages into a single full_documentation.md file
Configurable Depth: Control how deep to crawl
URL Filtering: Include/exclude patterns using globs
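
As a rough illustration of the URL filtering feature, the sketch below applies include and exclude globs to candidate URLs. The `globToRegExp` and `shouldCrawl` helpers are hypothetical and support only `*` (within one path segment) and `**` (across segments). The Actor's actual glob matching, and whether exclusions take priority over inclusions, may differ.

```typescript
// Hypothetical helper: convert a simple glob into an anchored RegExp.
// Supports only `**` (any characters) and `*` (any characters except `/`).
function globToRegExp(glob: string): RegExp {
  const pattern = glob
    .split("**")
    .map((part) =>
      part
        .split("*")
        // Escape regex metacharacters in the literal pieces of the glob.
        .map((literal) => literal.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
        .join("[^/]*"),
    )
    .join(".*");
  return new RegExp(`^${pattern}$`);
}

// Hypothetical filter. Assumptions: exclusions win over inclusions,
// and an empty include list allows every URL.
function shouldCrawl(url: string, includeGlobs: string[], excludeGlobs: string[]): boolean {
  const matchesAny = (globs: string[]): boolean =>
    globs.some((glob) => globToRegExp(glob).test(url));
  if (matchesAny(excludeGlobs)) return false;
  return includeGlobs.length === 0 || matchesAny(includeGlobs);
}
```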

📥 Input

| Field | Type | Description |
| --- | --- | --- |
| `startUrl` | String | The root URL of the documentation site |
| `maxDepth` | Number | Maximum link depth to crawl (default: 10) |
| `maxPages` | Number | Maximum pages to scrape (default: 100) |
| `includeGlobs` | Array | URL patterns to include |
| `excludeGlobs` | Array | URL patterns to exclude |
| `excludeElements` | String | CSS selectors to remove |
| `contentSelector` | String | CSS selector for main content |
| `mergeOutput` | Boolean | Combine all pages into one file |
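
For callers who build the input in code, the fields above could be typed roughly as follows. This is a sketch: the `ScraperInput` name and the `applyDocumentedDefaults` helper are hypothetical, and only the two defaults stated in the table (`maxDepth` 10, `maxPages` 100) are filled in. Which fields are actually required is an assumption.

```typescript
// Hypothetical type mirroring the input fields listed above.
// Assumption: only startUrl is required; the source does not say.
interface ScraperInput {
  startUrl: string;
  maxDepth?: number;
  maxPages?: number;
  includeGlobs?: string[];
  excludeGlobs?: string[];
  excludeElements?: string; // comma-separated CSS selectors, as in the usage examples
  contentSelector?: string;
  mergeOutput?: boolean;
}

// Fill in only the defaults that the input table documents.
function applyDocumentedDefaults(input: ScraperInput): ScraperInput {
  return {
    ...input,
    maxDepth: input.maxDepth ?? 10,
    maxPages: input.maxPages ?? 100,
  };
}
```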

📤 Output

Each page is stored in the dataset with:

{
  "url": "https://docs.example.com/api/auth",
  "title": "Authentication",
  "markdown": "# Authentication\n\nThis guide covers...",
  "tokenCount": 1523,
  "scrapedAt": "2024-01-15T10:30:00.000Z"
}
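
One possible use of `tokenCount` is to pick pages that fit a model's context window. The sketch below assumes records shaped like the example above. The `PageRecord` type and `selectWithinBudget` helper are hypothetical, and the greedy in-order strategy is just one simple choice.

```typescript
// Hypothetical type matching the dataset record shown above.
interface PageRecord {
  url: string;
  title: string;
  markdown: string;
  tokenCount: number;
  scrapedAt: string;
}

// Greedily keep pages, in their given order, while the running token
// total stays within the budget. Pages that would overflow are skipped.
function selectWithinBudget(pages: PageRecord[], maxTokens: number): PageRecord[] {
  const selected: PageRecord[] = [];
  let used = 0;
  for (const page of pages) {
    if (used + page.tokenCount > maxTokens) continue;
    selected.push(page);
    used += page.tokenCount;
  }
  return selected;
}
```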

When `mergeOutput` is enabled, a combined `full_documentation.md` is saved to the Key-Value Store.
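
The exact layout of the merged file is not described here. As an illustration only, per-page Markdown could be combined as in the sketch below, where the `MergeablePage` type, the `mergePages` helper, the source comment, and the `---` separator are all assumptions rather than the Actor's actual format.

```typescript
// Hypothetical minimal shape: only the fields the merge needs.
interface MergeablePage {
  url: string;
  markdown: string;
}

// Join pages into one Markdown document. Each page is preceded by an
// HTML comment naming its source URL, and pages are separated by a rule.
function mergePages(pages: MergeablePage[]): string {
  const sections = pages.map(
    (page) => `<!-- Source: ${page.url} -->\n\n${page.markdown.trim()}`,
  );
  return sections.join("\n\n---\n\n") + "\n";
}
```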

🚀 Usage Examples

Crawl Stripe Docs

{
  "startUrl": "https://stripe.com/docs/api",
  "maxPages": 50,
  "mergeOutput": true
}
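
One way to submit an input like this from code is the Apify API's synchronous run endpoint, which returns the dataset items in the response. The sketch below only builds the request. The `buildRunRequest` helper is hypothetical, and the Actor ID and API token are placeholders you would supply, since this page does not state the Actor's ID.

```typescript
interface RunRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Build a POST request for the run-sync-get-dataset-items endpoint.
// actorId uses the "username~actor-name" form; both it and the token
// are caller-supplied placeholders, not values taken from this page.
function buildRunRequest(
  actorId: string,
  token: string,
  input: Record<string, unknown>,
): RunRequest {
  return {
    url: `https://api.apify.com/v2/acts/${encodeURIComponent(actorId)}/run-sync-get-dataset-items`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${token}`,
      },
      body: JSON.stringify(input),
    },
  };
}

// Usage (not executed here; needs a real Actor ID and token):
// const { url, init } = buildRunRequest(actorId, token, { startUrl, maxPages: 50 });
// const pages = await (await fetch(url, init)).json();
```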

Crawl with Custom Content Selector

{
  "startUrl": "https://nextjs.org/docs",
  "contentSelector": ".docs-content",
  "excludeElements": "nav, footer, .sidebar, .carbon-ads",
  "maxDepth": 3
}
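
To make the `maxDepth` and `maxPages` limits concrete, here is a sketch of a breadth-first traversal over an in-memory link graph. The `planCrawl` helper and the link map are hypothetical, and the sketch assumes the start URL counts as depth 0, which the source does not specify.

```typescript
// Return the order in which pages would be visited, given a map from
// each URL to the URLs it links to. Stops descending past maxDepth and
// stops entirely once maxPages URLs have been collected.
function planCrawl(
  links: Map<string, string[]>,
  startUrl: string,
  maxDepth: number,
  maxPages: number,
): string[] {
  const seen = new Set<string>([startUrl]);
  const order: string[] = [];
  let frontier: string[] = [startUrl];

  for (let depth = 0; depth <= maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      if (order.length >= maxPages) return order;
      order.push(url);
      // Queue unseen links for the next depth level.
      for (const target of links.get(url) ?? []) {
        if (!seen.has(target)) {
          seen.add(target);
          next.push(target);
        }
      }
    }
    frontier = next;
  }
  return order;
}
```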

🔧 Technical Details

Built with TypeScript and the Apify SDK
Uses CheerioCrawler for fast HTML parsing
Uses the Turndown library for HTML-to-Markdown conversion
Uses gpt-tokenizer for accurate token counting

📝 License

ISC
