Docs to Markdown + AI Embeddings → Vector DB Crawler

Pricing

from $5.00 / 1,000 documents processed

Turn any documentation site into clean Markdown, intelligently chunked content with embeddings (Azure/OpenAI), and directly upsert into MongoDB Atlas, Pinecone, Weaviate, Qdrant, or Milvus — ready for RAG, AI assistants, and semantic search in minutes.

Developer

Badruddeen Naseem

Maintained by Community

🧩 AI Docs → Markdown + Embeddings → Vector DB Crawler

🌐 Web Crawler | 📄 Markdown Output | 🧠 Embeddings | 💾 Vector Database Support

Crawl documentation sites, convert pages to Markdown, intelligently chunk content for RAG, generate embeddings with Azure/OpenAI, and optionally upsert directly into your vector database — all in one Actor.


🚀 How It Works

  1. Crawl: Playwright-based crawler for modern JS-heavy documentation sites.
  2. Extract: Clean content using Mozilla Readability → Turndown → Markdown.
  3. Chunk: Intelligent, paragraph-aware splitting for RAG pipelines.
  4. Embed: Azure OpenAI or OpenAI embeddings generated per chunk.
  5. Vector DB: Optionally upsert chunks & embeddings into MongoDB Atlas, Pinecone, Weaviate, Qdrant, or Milvus.
  6. Track: Deduplicate URLs & resume safely across runs using Key-Value Store.
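The Chunk step can be sketched roughly as follows. This is a minimal illustration and not the Actor's actual implementation: the function name `chunk_markdown`, the `overlap` parameter, and the greedy packing strategy are assumptions. Only the ideas of a maximum chunk size, overlap, and a per-page chunk cap come from this README.

```python
def chunk_markdown(markdown: str, chunk_size: int = 1000, overlap: int = 100,
                   max_chunks: int = 50) -> list[str]:
    """Greedily pack paragraphs into chunks of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Paragraphs are separated by blank lines in Markdown.
    paragraphs = [p.strip() for p in markdown.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for paragraph in paragraphs:
        # A paragraph longer than chunk_size is hard-split into pieces.
        pieces = [paragraph[i:i + chunk_size]
                  for i in range(0, len(paragraph), chunk_size)]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= chunk_size:
                current = candidate
                continue
            # The piece does not fit: close the current chunk.
            chunks.append(current)
            # Carry the tail of the closed chunk forward as overlap,
            # but only if the result still respects chunk_size.
            tail = current[-overlap:] if overlap > 0 else ""
            with_overlap = f"{tail}\n\n{piece}" if tail else piece
            current = with_overlap if len(with_overlap) <= chunk_size else piece
    if current:
        chunks.append(current)
    # Cap the number of chunks per page.
    return chunks[:max_chunks]
```

Splitting on blank lines keeps most paragraphs and headings intact, and the final slice mirrors the effect of a per-page cap on very large index pages.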

🧠 Key Features

  • 🔍 High-Quality Extraction: Removes navigation, ads, footers; handles modern docs.
  • 📝 Markdown Output: Preserves headings, code blocks, lists, and images.
  • 🧩 Intelligent Chunking: Configurable size/overlap; paragraph-aware; max chunks per page.
  • 🧠 Embeddings Generation: Azure OpenAI or OpenAI; batch processing; graceful failure handling.
  • 🗄️ Direct Vector DB Upsert: MongoDB Atlas, Pinecone, Weaviate, Qdrant, Milvus.
  • 🌐 Smart Crawling: Include/exclude globs, pagination, infinite scroll, robots.txt compliance.
  • 🧹 Deduplication: URL hashing + KV store; resume-safe.
  • 📊 Live precision@10: An approximate online retrieval-quality signal based on recently inserted vectors; intended for debugging and tuning, not benchmarking.
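For reference, precision@k is conventionally the fraction of the top k retrieved items that are relevant. The sketch below shows that standard definition only. How the Actor decides which vectors count as relevant is not documented here, so the `relevant` set is an assumed input and `precision_at_k` is a hypothetical helper name.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of the top-k retrieved ids that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    # Dividing by k (not by len(top_k)) penalizes short result lists.
    return hits / k
```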

⚡ Streaming vs Batch

| Mode | Description |
| --- | --- |
| streamChunks = true | Each chunk is pushed individually → faster visibility in dataset, lower memory usage. |
| streamChunks = false | All chunks of a page pushed together → cleaner per-page records, optional embeddings storage. |
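The two modes differ mainly in how many dataset items a page produces. A rough sketch of that difference follows. The helper name `to_dataset_items` and the layout of the batch record (a `chunks` list on a single page item) are assumptions for illustration; the streamed record reuses field names from the example output in this README.

```python
from typing import Iterator


def to_dataset_items(url: str, title: str, chunks: list[str],
                     stream_chunks: bool) -> Iterator[dict]:
    """Yield dataset items for one page in stream or batch mode."""
    if stream_chunks:
        # Stream mode: one item per chunk, emitted as soon as it exists.
        for index, text in enumerate(chunks):
            yield {"url": url, "title": title, "chunkIndex": index, "chunk": text}
    else:
        # Batch mode: a single item per page holding all of its chunks
        # (assumed layout, not confirmed by the Actor's documentation).
        yield {
            "url": url,
            "title": title,
            "chunks": [{"chunkIndex": i, "chunk": t} for i, t in enumerate(chunks)],
        }
```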

🗄️ Supported Vector Databases

| Database | Auto Index | Batch Upsert | Notes |
| --- | --- | --- | --- |
| MongoDB Atlas | ✅ Yes | ✅ Yes | Connection string + collection |
| Pinecone | | ✅ Yes | Supports namespace |
| Weaviate | | ✅ Yes | Cloud or self-hosted |
| Qdrant | ✅ Yes | ✅ Yes | On-premise or cloud |
| Milvus / Zilliz | ✅ Yes | ✅ Yes | Token or username:password auth |

🧪 Example Dataset Output

Chunked page with embeddings (streamChunks=true)

{
  "url": "https://docs.python.org/3/library/functions.html",
  "title": "Built-in Functions",
  "docId": "uuid-v5-based-id",
  "chunkIndex": 0,
  "chunk": "# abs(x)\nReturn the absolute value of a number...",
  "embedding": [0.0123, -0.0456, ..., 0.7890]
}
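The `docId` placeholder suggests a name-based (version 5) UUID. A minimal sketch of deriving such an id deterministically from the page URL is shown below. Which namespace and which input string the Actor actually uses is not documented, so `uuid.NAMESPACE_URL` and the hypothetical helper `doc_id_for` are assumptions.

```python
import uuid


def doc_id_for(url: str) -> str:
    """Derive a stable UUIDv5 from a URL (assumed namespace: NAMESPACE_URL)."""
    # uuid5 hashes the namespace and name, so the same URL
    # always yields the same id across runs.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, url))
```

Deterministic ids of this kind make repeated upserts of the same page idempotent, since a re-crawl overwrites the existing record instead of creating a new one.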
---
## 🧠 Comparison with Similar Apify Actors
| Capability | **Docs → Markdown + AI Embeddings → Vector DB (This Actor)** | DocsToRAG | RAG Spider | Universal RAG Web Scraper | Website Content Crawler Pro |
| ---------------------------------------------- | ------------------------------------------------------------ | ---------- | ------------ | ------------------------- | --------------------------- |
| Browser-based crawling (JS-heavy sites) | ✅ Playwright | ⚠️ Limited | ✅ Playwright | ✅ | ✅ |
| Clean content extraction (Mozilla Readability) | ✅ | ✅ | ✅ | ✅ | ✅ |
| Markdown output | ✅ | ✅ | ✅ | ✅ | ✅ |
| Intelligent, configurable chunking | ✅ | ⚠️ Limited | ⚠️ Basic | ⚠️ Basic | ❌ |
| Azure OpenAI embeddings | ✅ | ❌ | ❌ | ❌ | ❌ |
| OpenAI embeddings | ✅ | ✅ | ❌ | ❌ | ❌ |
| Direct vector DB upsert | ✅ | ⚠️ Partial | ❌ | ❌ | ❌ |
| Pinecone support | ✅ | ⚠️ | ❌ | ❌ | ❌ |
| Weaviate support | ✅ | ❌ | ❌ | ❌ | ❌ |
| Qdrant support | ✅ | ❌ | ❌ | ❌ | ❌ |
| Milvus / Zilliz Cloud support | ✅ | ❌ | ❌ | ❌ | ❌ |
| MongoDB Atlas vector search | ✅ | ❌ | ❌ | ❌ | ❌ |
| Resume support & deduplication | ✅ URL hashing + KV store | ❌ | ❌ | ❌ | ❌ |
| Robots.txt support | ✅ | ❌ | ❌ | ❌ | ✅ |
| Pagination & infinite scroll | ✅ | ⚠️ | ⚠️ | ⚠️ | ✅ |
| User-defined dataset naming | ✅ Safe & sanitized | ❌ | ❌ | ❌ | ❌ |
| End-to-end RAG ingestion | ✅ Yes (single actor) | ⚠️ Partial | ❌ | ❌ | ❌ |
---
## Crawl Summary
```json
{
  "crawlSummary": {
    "finishedAt": "2026-01-01T23:05:44Z",
    "totalPagesProcessed": 24,
    "totalChunksGenerated": 1200,
    "totalChunksPushed": 1200,
    "embeddingsFailed": 0
  }
}
```
  • 🌐 Smart Crawling

    • Inclusion/exclusion glob patterns
    • Pagination detection (next links)
    • Infinite scroll support
    • robots.txt compliance
  • 🧹 Deduplication

    • BLAKE3 URL hashing + Key-Value Store
    • Resume-safe across runs
  • 📊 Live Logging & Summary

    • Optional live page processing logs
    • Final crawl statistics
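The deduplication idea (hash each URL, remember the hash in a persistent store) can be sketched as below. Several things here are stand-ins: BLAKE3 is not in Python's standard library, so `hashlib.blake2b` is used instead; a plain dict plays the role of the Key-Value Store; and the normalization rules in the hypothetical `url_key` helper are assumptions.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def url_key(url: str) -> str:
    """Normalize a URL and return a short hex digest to use as a store key."""
    parts = urlsplit(url)
    # Assumed normalization: lowercase scheme and host, drop the fragment,
    # and ignore a trailing slash on the path.
    path = parts.path.rstrip("/") or "/"
    normalized = urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )
    # blake2b stands in for BLAKE3 in this sketch.
    return hashlib.blake2b(normalized.encode("utf-8"), digest_size=16).hexdigest()


def should_process(url: str, seen: dict[str, bool]) -> bool:
    """Return True the first time a URL is seen and record it in the store."""
    key = url_key(url)
    if key in seen:
        return False
    seen[key] = True
    return True
```

If the `seen` mapping is persisted between runs, a resumed crawl skips pages that were already processed.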

🚀 What This Actor Produces

Each processed documentation page is stored as one dataset result and may contain multiple chunks.

Each dataset result includes chunk records like:

{
  "url": "https://docs.python.org/3/library/functions.html",
  "title": "Built-in Functions",
  "docId": "uuid-v5-based-id",
  "chunkIndex": 2,
  "chunk": "# abs(x)\n\nReturn the absolute value of a number. The argument may be an integer...",
  "embedding": [0.0123, -0.0456, ..., 0.7890]
}
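Because each chunk record carries its text and its embedding, a small dataset can be searched in memory without a vector database. The sketch below ranks records by cosine similarity. It assumes the records have been loaded as a list of dicts with `chunk` and `embedding` fields, and that the query vector comes from the same embedding model; `cosine_similarity` and `top_k_chunks` are hypothetical helper names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_embedding: list[float], records: list[dict],
                 k: int = 3) -> list[dict]:
    """Return the k records whose embeddings are most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, record["embedding"]), record)
        for record in records
        if record.get("embedding")  # skip records without an embedding
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in scored[:k]]
```

For anything beyond a few thousand chunks, a vector database with a proper index is the better fit.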

⚙️ Input Configuration

| Required Field | Description |
| --- | --- |
| startUrls | One or more URLs to begin crawling from |

Common Options

| Option | Default | Description |
| --- | --- | --- |
| linkGlobs | Extensive list | Patterns to include (docs, guides, reference...) |
| excludeGlobs | Blogs, changelogs | Patterns to exclude (takes priority) |
| nextPageSelectors | .next, rel=next | CSS selectors for pagination |
| chunkSize | 1000 | Max characters per chunk |
| maxChunksPerPage | 50 | Prevent overload from huge pages |
| handleScroll | true | Load dynamic content |
| respectRobotsTxt | true | Obey site rules |
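The "exclude takes priority" rule can be illustrated with a small filter. This is a sketch only: it uses Python's `fnmatch`, whose `*` also matches `/`, so it approximates rather than reproduces the glob semantics of the crawler. The helper name `should_enqueue` and the example patterns are illustrative and are not the Actor's defaults.

```python
from fnmatch import fnmatchcase


def should_enqueue(url: str, link_globs: list[str], exclude_globs: list[str]) -> bool:
    """Apply exclude patterns first, then require at least one include match."""
    # Exclusion wins even when an include pattern also matches.
    if any(fnmatchcase(url, pattern) for pattern in exclude_globs):
        return False
    return any(fnmatchcase(url, pattern) for pattern in link_globs)
```

For example, with `link_globs=["*/docs/*"]` and `exclude_globs=["*changelog*"]`, a URL such as `https://example.com/docs/changelog` is skipped although it sits under `/docs/`.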

Embeddings Options

| Option | Default | Notes |
| --- | --- | --- |
| generateEmbeddings | true | Enable embedding generation |
| embeddingProvider | azure | azure or openai |
| azureOpenAiApiKey | | Required for Azure |
| azureOpenAiEndpoint | | e.g., https://your-resource.openai.azure.com/ |
| azureDeploymentName | | Deployment name (not model name) |
| openAiApiKey | | Required for OpenAI |
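Before starting a run, it can help to check that the credentials for the chosen provider are present. The sketch below does this for an input dict. The option names and defaults come from the table above, while treating all three Azure fields as mandatory and the helper name `missing_embedding_fields` are assumptions.

```python
# Fields assumed to be required for each provider.
REQUIRED_BY_PROVIDER = {
    "azure": ["azureOpenAiApiKey", "azureOpenAiEndpoint", "azureDeploymentName"],
    "openai": ["openAiApiKey"],
}


def missing_embedding_fields(actor_input: dict) -> list[str]:
    """Return the embedding options that are required but missing or empty."""
    # Nothing is required when embedding generation is switched off.
    if not actor_input.get("generateEmbeddings", True):
        return []
    provider = actor_input.get("embeddingProvider", "azure")
    if provider not in REQUIRED_BY_PROVIDER:
        raise ValueError(f"Unknown embeddingProvider: {provider!r}")
    return [name for name in REQUIRED_BY_PROVIDER[provider]
            if not actor_input.get(name)]
```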

🧪 Example Input: Python Docs → MongoDB Atlas

{
  "startUrls": [{ "url": "https://docs.python.org/3/" }],
  "generateEmbeddings": true,
  "embeddingProvider": "azure",
  "azureOpenAiApiKey": "YOUR_AZURE_KEY",
  "azureOpenAiEndpoint": "https://your-resource.openai.azure.com/",
  "azureDeploymentName": "embedding-deployment",
  "pushToVectorDb": true,
  "vectorDbProvider": "mongodb",
  "vectorDbEnvironment": "mongodb+srv://user:pass@cluster.mongodb.net",
  "vectorDbIndexName": "python_docs",
  "vectorDbNamespace": "chunks"
}

🎯 Best Practices

  • Use the official docs root as the start URL
  • Tune linkGlobs / excludeGlobs to avoid blogs or API references
  • For large crawls, enable streaming (streamChunks = true) to reduce memory usage
  • Prefer Azure OpenAI for enterprise compliance
  • Keep chunkSize ≤ 1500 characters for best embedding performance

⚠️ Notes

  • JSDOM may log harmless CSS parsing warnings on modern sites — safe to ignore
  • Very large index pages are capped by maxChunksPerPage
  • Embedding failures are logged but do not stop the crawl

🛠️ Technologies Used

  • Apify SDK
  • Crawlee + Playwright
  • Mozilla Readability
  • Turndown
  • Azure OpenAI / OpenAI API
  • MongoDB, Pinecone, Weaviate, Qdrant, Milvus clients

🚀 Quick Start

  1. Set your startUrls
  2. Choose embedding provider and credentials
  3. Select vector DB (optional)
  4. Run the Actor
  5. Use the dataset or your vector DB for RAG!

💰 Pricing

This Actor is priced per processed documentation page.

  • 1 result = 1 documentation page
  • Pricing is based on items pushed to the default dataset
  • You are not charged per chunk or per embedding

You bring your own:

  • Embedding provider (Azure OpenAI / OpenAI)
  • Vector database (MongoDB, Pinecone, Weaviate, Qdrant, Milvus)

No token usage, embedding costs, or vector storage costs are included in the Actor price.
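As a rough worked example of the per-page pricing, the Actor fee scales linearly with the number of pages. The sketch assumes the listed starting rate of $5.00 per 1,000 pages. The rate that actually applies may differ, and `estimate_actor_cost` is a hypothetical helper, not part of the Actor.

```python
def estimate_actor_cost(pages: int, price_per_1000_usd: float = 5.00) -> float:
    """Estimate the Actor fee in USD for a number of processed pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    # Excludes embedding tokens and vector storage, which are billed
    # separately by your own providers.
    return round(pages * price_per_1000_usd / 1000, 2)
```

At that rate, a 24-page crawl costs about $0.12 and a 10,000-page crawl about $50.00, regardless of how many chunks or embeddings each page produces.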


Built with ❤️ for the AI + documentation community.
Happy crawling!