Tech Docs to LLM-Ready Markdown
Scrapes technical documentation sites (Docusaurus, GitBook, MkDocs, ReadTheDocs) and converts them to clean, structured Markdown for RAG pipelines, LLM training, and AI assistants. Automatically detects documentation framework and removes navigation elements.
Developer: Dmitry Goncharov
Tech Docs to LLM-Ready Markdown Scraper
Convert any technical documentation site to clean, structured Markdown, ready for RAG pipelines, LLM training, and AI assistants.
Why This Actor?
While generic web scrapers dump raw HTML, this Actor is specifically designed for technical documentation:
| Feature | Generic Scrapers | This Actor |
|---|---|---|
| Code block preservation | ❌ Lost or broken | ✅ With language tags |
| Framework-aware extraction | ❌ One-size-fits-all | ✅ Docusaurus, GitBook, MkDocs |
| Navigation removal | ❌ Mixed with content | ✅ Clean content only |
| RAG-ready output | ❌ Needs post-processing | ✅ `doc_id`, `section_path`, chunking |
RAG-First Output
Every result includes fields optimized for vector databases and LLM loaders:
```json
{
  "doc_id": "acdb145c14f4310b",
  "url": "https://crawlee.dev/docs/introduction",
  "title": "Introduction | Crawlee",
  "section_path": "Guides > Quick Start > Introduction",
  "content": "# Introduction\n\nCrawlee covers your crawling...",
  "framework": "docusaurus",
  "chunk_index": 0,
  "total_chunks": 1,
  "metadata": {
    "crawledAt": "2025-12-12T03:34:46.151Z",
    "depth": 0,
    "wordCount": 358,
    "charCount": 2475
  }
}
```
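Because long pages are split into chunks (`chunk_index` / `total_chunks`), a consumer that wants whole documents back can regroup records by `doc_id`. A minimal sketch, assuming only the output fields shown above (the sample records are illustrative, not real Actor output):

```python
from collections import defaultdict

def reassemble(records):
    """Group chunk records by doc_id and join their content in chunk_index order."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["doc_id"]].append(rec)
    docs = {}
    for doc_id, chunks in groups.items():
        chunks.sort(key=lambda r: r["chunk_index"])
        docs[doc_id] = "\n\n".join(c["content"] for c in chunks)
    return docs

# Illustrative records (out of order on purpose):
records = [
    {"doc_id": "a1", "chunk_index": 1, "content": "world"},
    {"doc_id": "a1", "chunk_index": 0, "content": "hello"},
]
print(reassemble(records))  # {'a1': 'hello\n\nworld'}
```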
Supported Documentation Frameworks
- Docusaurus (React, Playwright, Crawlee docs)
- GitBook (Many SaaS products)
- MkDocs (Material for MkDocs)
- ReadTheDocs (Python projects with Sphinx)
- VuePress (Vue.js docs)
- Nextra (Next.js docs)
- Generic (Fallback for unknown frameworks)
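Framework detection of this kind typically keys off framework-specific DOM fingerprints. A hypothetical sketch of the idea; the marker strings below are common class names and identifiers, not this Actor's actual detection logic:

```python
# Hypothetical fingerprints; NOT the Actor's real implementation.
FINGERPRINTS = {
    "docusaurus": ["docusaurus", "theme-doc-markdown"],
    "gitbook": ["gitbook"],
    "mkdocs": ["mkdocs", "md-content"],
    "readthedocs": ["readthedocs", "rst-content"],
    "vuepress": ["vuepress", "theme-default-content"],
    "nextra": ["nextra"],
}

def detect_framework(html: str) -> str:
    """Return the first framework whose marker appears in the page HTML."""
    lowered = html.lower()
    for framework, markers in FINGERPRINTS.items():
        if any(marker in lowered for marker in markers):
            return framework
    return "generic"  # fallback for unknown frameworks

print(detect_framework('<div class="theme-doc-markdown">...</div>'))  # docusaurus
print(detect_framework("<html><body>plain page</body></html>"))       # generic
```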
Input Example
```json
{
  "startUrls": [{"url": "https://crawlee.dev/docs/introduction"}],
  "maxPages": 100,
  "maxDepth": 10,
  "enableChunking": true,
  "chunkSize": 2000,
  "outputFormat": "markdown"
}
```
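The same input can also be submitted from Python with the official `apify-client` package. A sketch, assuming the input fields shown above; the token is a placeholder and the client calls are left commented since they hit the live API:

```python
# Requires: pip install apify-client
# from apify_client import ApifyClient

def build_run_input(start_url: str, max_pages: int = 100) -> dict:
    """Assemble the Actor input shown in the example above."""
    return {
        "startUrls": [{"url": start_url}],
        "maxPages": max_pages,
        "maxDepth": 10,
        "enableChunking": True,
        "chunkSize": 2000,
        "outputFormat": "markdown",
    }

run_input = build_run_input("https://crawlee.dev/docs/introduction")

# client = ApifyClient("YOUR_APIFY_TOKEN")
# run = client.actor("hedelka/tech-docs-scraper").call(run_input=run_input)
# for item in client.dataset(run["defaultDatasetId"]).iterate_items():
#     print(item["title"])
```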
LangChain Integration (Python)
```python
from langchain.document_loaders import ApifyDatasetLoader
from langchain.docstore.document import Document

loader = ApifyDatasetLoader(
    dataset_id="YOUR_DATASET_ID",
    dataset_mapping_function=lambda item: Document(
        page_content=item["content"],
        metadata={
            "source": item["url"],
            "title": item["title"],
            "doc_id": item["doc_id"],
            "section": item["section_path"],
        },
    ),
)
docs = loader.load()

# Ready for vectorstore!
from langchain.vectorstores import Chroma
vectorstore = Chroma.from_documents(docs, embeddings)
```
LlamaIndex Integration
```python
from llama_index.core import VectorStoreIndex
from llama_index.readers.apify import ApifyActor

reader = ApifyActor("YOUR_APIFY_API_TOKEN")
documents = reader.load_data(
    actor_id="hedelka/tech-docs-scraper",
    run_input={"startUrls": [{"url": "https://docs.example.com"}], "maxPages": 50},
)

# Build index directly
index = VectorStoreIndex.from_documents(documents)
```
API Call
```bash
curl -X POST "https://api.apify.com/v2/acts/hedelka~tech-docs-scraper/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"startUrls": [{"url": "https://docs.example.com"}], "maxPages": 50}'
```
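The call above starts a run; the scraped records then land in the run's default dataset, which Apify exposes at `GET /v2/datasets/{datasetId}/items`. A minimal standard-library sketch of fetching them (the IDs are placeholders, and the network call is left commented):

```python
import json
import urllib.request

def dataset_items_url(dataset_id: str, token: str) -> str:
    """Build the Apify dataset-items endpoint URL for a finished run."""
    return (
        f"https://api.apify.com/v2/datasets/{dataset_id}/items"
        f"?token={token}&format=json"
    )

url = dataset_items_url("YOUR_DATASET_ID", "YOUR_TOKEN")
print(url)

# with urllib.request.urlopen(url) as resp:
#     items = json.load(resp)
#     print(len(items), "documents fetched")
```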
Use Cases
- RAG Pipelines: Feed documentation to LangChain/LlamaIndex for "Chat with Docs"
- LLM Fine-tuning: Create high-quality datasets from official docs
- Knowledge Bases: Build searchable documentation archives
- AI Assistants: Power coding assistants with up-to-date API references
Pricing
Pay per Result: $0.50 per 1,000 pages
| Pages | Cost |
|---|---|
| 100 | $0.05 |
| 1,000 | $0.50 |
| 10,000 | $5.00 |
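At $0.50 per 1,000 pages the cost is linear in page count, so budgeting a crawl is a one-line calculation:

```python
def cost_usd(pages: int, rate_per_1000: float = 0.50) -> float:
    """Pay-per-result cost at the listed rate of $0.50 per 1,000 pages."""
    return pages / 1000 * rate_per_1000

for pages in (100, 1_000, 10_000):
    print(f"{pages:>6} pages -> ${cost_usd(pages):.2f}")
```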
Author
Built with ❤️ by HEDELKA for the LLM/RAG community.
Questions? Issues? Open a GitHub issue or contact on Apify.