Tech Docs to LLM-Ready Markdown
Scrapes technical documentation sites (Docusaurus, GitBook, MkDocs, ReadTheDocs) and converts them to clean, structured Markdown for RAG pipelines, LLM training, and AI assistants. Automatically detects documentation framework and removes navigation elements.

Pricing: Pay per usage
Rating: 0.0 (0 reviews)
Developer: Dmitry Goncharov (Maintained by Community)
Actor stats: 0 bookmarks · 2 total users · 1 monthly active user · last modified 8 hours ago

Tech Docs to LLM-Ready Markdown Scraper

🚀 Convert any technical documentation site to clean, structured Markdown — ready for RAG pipelines, LLM training, and AI assistants.

Why This Actor?

While generic web scrapers dump raw HTML, this Actor is specifically designed for technical documentation:

| Feature | Generic Scrapers | This Actor |
| --- | --- | --- |
| Code block preservation | ❌ Lost or broken | ✅ Preserved with language tags |
| Framework-aware extraction | ❌ One-size-fits-all | ✅ Docusaurus, GitBook, MkDocs |
| Navigation removal | ❌ Mixed with content | ✅ Clean content only |
| RAG-ready output | ❌ Needs post-processing | ✅ doc_id, section_path, chunking |

🎯 RAG-First Output

Every result includes fields optimized for vector databases and LLM loaders:

```json
{
  "doc_id": "acdb145c14f4310b",
  "url": "https://crawlee.dev/docs/introduction",
  "title": "Introduction | Crawlee",
  "section_path": "Guides > Quick Start > Introduction",
  "content": "# Introduction\n\nCrawlee covers your crawling...",
  "framework": "docusaurus",
  "chunk_index": 0,
  "total_chunks": 1,
  "metadata": {
    "crawledAt": "2025-12-12T03:34:46.151Z",
    "depth": 0,
    "wordCount": 358,
    "charCount": 2475
  }
}
```
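When chunking is enabled, long pages are split into multiple items that share a `doc_id` and are ordered by `chunk_index`. A minimal sketch of reassembling full documents from the dataset (field names come from the output above; the sample items are made up):

```python
from itertools import groupby

def reassemble(items):
    """Group chunked dataset items by doc_id and join their content in order."""
    items = sorted(items, key=lambda i: (i["doc_id"], i["chunk_index"]))
    docs = {}
    for doc_id, chunks in groupby(items, key=lambda i: i["doc_id"]):
        docs[doc_id] = "\n\n".join(c["content"] for c in chunks)
    return docs

items = [
    {"doc_id": "a1", "chunk_index": 1, "content": "part two"},
    {"doc_id": "a1", "chunk_index": 0, "content": "part one"},
]
print(reassemble(items)["a1"])
```

Skipping this step is fine for RAG retrieval, where individual chunks are exactly what you want to embed.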

Supported Documentation Frameworks

  • Docusaurus (React, Playwright, Crawlee docs)
  • GitBook (Many SaaS products)
  • MkDocs (Material for MkDocs)
  • ReadTheDocs (Python projects with Sphinx)
  • VuePress (Vue.js docs)
  • Nextra (Next.js docs)
  • Generic (Fallback for unknown frameworks)
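The Actor's actual detection code isn't published; as a purely illustrative sketch, framework detection for these generators can be done by inspecting the page's `<meta name="generator">` tag (which all of the frameworks above emit) and falling back to scanning the markup:

```python
import re

# Hypothetical signatures -- the Actor's real detection logic may differ.
SIGNATURES = {
    "docusaurus": ["docusaurus"],
    "gitbook": ["gitbook"],
    "mkdocs": ["mkdocs"],
    "readthedocs": ["readthedocs", "sphinx"],
    "vuepress": ["vuepress"],
    "nextra": ["nextra"],
}

def detect_framework(html: str) -> str:
    """Return a framework name, or "generic" when nothing matches."""
    match = re.search(r'<meta name="generator" content="([^"]*)"', html, re.I)
    haystack = match.group(1).lower() if match else html.lower()
    for framework, keys in SIGNATURES.items():
        if any(key in haystack for key in keys):
            return framework
    return "generic"

print(detect_framework('<meta name="generator" content="Docusaurus v2.4.1">'))  # docusaurus
```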

Input Example

```json
{
  "startUrls": [{"url": "https://crawlee.dev/docs/introduction"}],
  "maxPages": 100,
  "maxDepth": 10,
  "enableChunking": true,
  "chunkSize": 2000,
  "outputFormat": "markdown"
}
```
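The same input can be passed from Python via the official `apify-client` package (`pip install apify-client`). A sketch, assuming the run input above; `YOUR_APIFY_TOKEN` is a placeholder for a real API token from the Apify console:

```python
RUN_INPUT = {
    "startUrls": [{"url": "https://crawlee.dev/docs/introduction"}],
    "maxPages": 100,
    "enableChunking": True,
    "chunkSize": 2000,
}

def scrape_docs(token: str):
    """Run the Actor, wait for it to finish, and yield dataset items.

    Requires `pip install apify-client` and a real Apify API token.
    """
    from apify_client import ApifyClient  # third-party client

    client = ApifyClient(token)
    run = client.actor("hedelka/tech-docs-scraper").call(run_input=RUN_INPUT)
    yield from client.dataset(run["defaultDatasetId"]).iterate_items()

# for item in scrape_docs("YOUR_APIFY_TOKEN"):
#     print(item["title"], len(item["content"]))
```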

🔗 LangChain Integration (Python)

```python
from langchain.document_loaders import ApifyDatasetLoader
from langchain.docstore.document import Document

loader = ApifyDatasetLoader(
    dataset_id="YOUR_DATASET_ID",
    dataset_mapping_function=lambda item: Document(
        page_content=item["content"],
        metadata={
            "source": item["url"],
            "title": item["title"],
            "doc_id": item["doc_id"],
            "section": item["section_path"],
        },
    ),
)
docs = loader.load()

# Ready for a vector store!
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()  # or any other embedding model
vectorstore = Chroma.from_documents(docs, embeddings)
```

🦙 LlamaIndex Integration

```python
from llama_index.core import Document, VectorStoreIndex
from llama_index.readers.apify import ApifyActor

reader = ApifyActor("YOUR_APIFY_TOKEN")  # the reader takes your Apify API token
documents = reader.load_data(
    actor_id="hedelka/tech-docs-scraper",
    run_input={"startUrls": [{"url": "https://docs.example.com"}], "maxPages": 50},
    dataset_mapping_function=lambda item: Document(
        text=item["content"], metadata={"source": item["url"]}
    ),
)

# Build an index directly
index = VectorStoreIndex.from_documents(documents)
```

📑 API Call

```shell
curl -X POST "https://api.apify.com/v2/acts/hedelka~tech-docs-scraper/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"startUrls": [{"url": "https://docs.example.com"}], "maxPages": 50}'
```

Use Cases

  1. RAG Pipelines: Feed documentation to LangChain/LlamaIndex for "Chat with Docs"
  2. LLM Fine-tuning: Create high-quality datasets from official docs
  3. Knowledge Bases: Build searchable documentation archives
  4. AI Assistants: Power coding assistants with up-to-date API references

Pricing

Pay per Result: $0.50 per 1,000 pages

| Pages | Cost |
| --- | --- |
| 100 | $0.05 |
| 1,000 | $0.50 |
| 10,000 | $5.00 |

Author

Built with ❤️ by HEDELKA for the LLM/RAG community.

Questions? Issues? Open a GitHub issue or contact on Apify.