Docs to Markdown + AI Embeddings → Vector DB Crawler
Pricing
from $5.00 / 1,000 documents processed
Turn any documentation site into clean Markdown, chunk it intelligently, generate embeddings (Azure OpenAI or OpenAI), and upsert the results directly into MongoDB Atlas, Pinecone, Weaviate, Qdrant, or Milvus, ready for RAG, AI assistants, and semantic search in minutes.
Developer

Badruddeen Naseem
🧩 AI Docs → Markdown + Embeddings → Vector DB Crawler
🌐 Web Crawler | 📄 Markdown Output | 🧠 Embeddings | 💾 Vector Database Support
Crawl documentation sites, convert pages to Markdown, intelligently chunk content for RAG, generate embeddings with Azure/OpenAI, and optionally upsert directly into your vector database — all in one Actor.
🚀 How It Works
- Crawl: Playwright-based crawler for modern JS-heavy documentation sites.
- Extract: Clean content using Mozilla Readability → Turndown → Markdown.
- Chunk: Intelligent, paragraph-aware splitting for RAG pipelines.
- Embed: Azure OpenAI or OpenAI embeddings generated per chunk.
- Vector DB: Optionally upsert chunks & embeddings into MongoDB Atlas, Pinecone, Weaviate, Qdrant, or Milvus.
- Track: Deduplicate URLs & resume safely across runs using Key-Value Store.
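To make the Chunk step concrete, here is a minimal Python sketch of paragraph-aware splitting with overlap. It is not the Actor's implementation: the function name `chunk_markdown`, the default `overlap` of 100 characters, and the hard-split rule for oversized paragraphs are all assumptions for illustration. `chunk_size` and `max_chunks` mirror the `chunkSize` and `maxChunksPerPage` options.

```python
def chunk_markdown(text: str, chunk_size: int = 1000, overlap: int = 100,
                   max_chunks: int = 50) -> list:
    """Greedy paragraph-aware splitter (illustrative only)."""
    # Room left for new text after an overlap tail plus the "\n\n" separator.
    budget = chunk_size - overlap - 2
    if budget <= 0:
        raise ValueError("chunk_size must be larger than overlap + 2")

    # Split on blank lines; hard-split any paragraph longer than the budget.
    pieces = []
    for para in text.split("\n\n"):
        para = para.strip()
        pieces.extend(para[i:i + budget] for i in range(0, len(para), budget))

    chunks = []
    current = ""
    for piece in pieces:
        candidate = f"{current}\n\n{piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        # The chunk is full: emit it and seed the next one with its tail.
        chunks.append(current)
        tail = current[-overlap:] if overlap else ""
        current = f"{tail}\n\n{piece}" if tail else piece
    if current:
        chunks.append(current)

    # Cap the number of chunks, like maxChunksPerPage.
    return chunks[:max_chunks]
```

Packing whole paragraphs first keeps most chunks aligned with the author's own structure, and the overlap tail gives each chunk a little context from the previous one.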
🧠 Key Features
- 🔍 High-Quality Extraction: Removes navigation, ads, footers; handles modern docs.
- 📝 Markdown Output: Preserves headings, code blocks, lists, and images.
- 🧩 Intelligent Chunking: Configurable size/overlap; paragraph-aware; max chunks per page.
- 🧠 Embeddings Generation: Azure OpenAI or OpenAI; batch processing; graceful failure handling.
- 🗄️ Direct Vector DB Upsert: MongoDB Atlas, Pinecone, Weaviate, Qdrant, Milvus.
- 🌐 Smart Crawling: Include/exclude globs, pagination, infinite scroll, robots.txt compliance.
- 🧹 Deduplication: URL hashing + KV store; resume-safe.
- 📊 Live precision@10: Approximate online retrieval-quality signal based on recently inserted vectors; intended for debugging and tuning, not benchmarking.
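For readers unfamiliar with the metric, precision@k is the fraction of the top k retrieved items that are relevant. The sketch below shows the standard calculation only. How the Actor decides which recently inserted vectors count as relevant is not documented here, so the `relevant_ids` set and the chunk IDs in the usage lines are hypothetical.

```python
def precision_at_k(retrieved_ids: list, relevant_ids: set, k: int = 10) -> float:
    """Fraction of the top-k retrieved IDs that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved_ids[:k]
    hits = sum(1 for item in top_k if item in relevant_ids)
    return hits / k


# Hypothetical query result: 10 chunk IDs returned, 3 of them relevant.
retrieved = [f"c{i}" for i in range(1, 11)]
relevant = {"c1", "c3", "c5"}
score = precision_at_k(retrieved, relevant, k=10)  # 0.3
```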
⚡ Streaming vs Batch
| Mode | Description |
|---|---|
| `streamChunks = true` | Each chunk is pushed individually → faster visibility in dataset, lower memory usage. |
| `streamChunks = false` | All chunks of a page pushed together → cleaner per-page records, optional embeddings storage. |
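The difference between the two modes is easiest to see in the shape of the records. The Python sketch below is an assumption-laden illustration: `to_dataset_records` is a hypothetical helper, and the nested `chunks` array used for the per-page record is a guess at a reasonable layout, not the Actor's confirmed schema.

```python
def to_dataset_records(page: dict, chunks: list, stream_chunks: bool) -> list:
    """Return the records that would be pushed for one page (illustrative)."""
    base = {"url": page["url"], "title": page["title"], "docId": page["docId"]}
    if stream_chunks:
        # Streaming: one small record per chunk, pushed as soon as it is ready.
        return [{**base, "chunkIndex": i, "chunk": c} for i, c in enumerate(chunks)]
    # Batch: a single record per page that carries all of its chunks.
    return [{**base,
             "chunks": [{"chunkIndex": i, "chunk": c} for i, c in enumerate(chunks)]}]


page = {"url": "https://docs.python.org/3/library/functions.html",
        "title": "Built-in Functions", "docId": "uuid-v5-based-id"}
streamed = to_dataset_records(page, ["first chunk", "second chunk"], stream_chunks=True)
batched = to_dataset_records(page, ["first chunk", "second chunk"], stream_chunks=False)
```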
🗄️ Supported Vector Databases
| Database | Auto Index | Batch Upsert | Notes |
|---|---|---|---|
| MongoDB Atlas | ✅ Yes | ✅ Yes | Connection string + collection |
| Pinecone | ❌ | ✅ Yes | Supports namespace |
| Weaviate | ❌ | ✅ Yes | Cloud or self-hosted |
| Qdrant | ✅ Yes | ✅ Yes | On-premise or cloud |
| Milvus / Zilliz | ✅ Yes | ✅ Yes | Token or username:password auth |
🧪 Example Dataset Output
Chunked page with embeddings (streamChunks=true)
```json
{
  "url": "https://docs.python.org/3/library/functions.html",
  "title": "Built-in Functions",
  "docId": "uuid-v5-based-id",
  "chunkIndex": 0,
  "chunk": "# abs(x)\nReturn the absolute value of a number...",
  "embedding": [0.0123, -0.0456, ..., 0.7890]
}
```

🧠 Comparison with Similar Apify Actors
| Capability | **Docs → Markdown + AI Embeddings → Vector DB (This Actor)** | DocsToRAG | RAG Spider | Universal RAG Web Scraper | Website Content Crawler Pro |
|---|---|---|---|---|---|
| Browser-based crawling (JS-heavy sites) | ✅ Playwright | ⚠️ Limited | ✅ Playwright | ✅ | ✅ |
| Clean content extraction (Mozilla Readability) | ✅ | ✅ | ✅ | ✅ | ✅ |
| Markdown output | ✅ | ✅ | ✅ | ✅ | ✅ |
| Intelligent, configurable chunking | ✅ | ⚠️ Limited | ⚠️ Basic | ⚠️ Basic | ❌ |
| Azure OpenAI embeddings | ✅ | ❌ | ❌ | ❌ | ❌ |
| OpenAI embeddings | ✅ | ✅ | ❌ | ❌ | ❌ |
| Direct vector DB upsert | ✅ | ⚠️ Partial | ❌ | ❌ | ❌ |
| Pinecone support | ✅ | ⚠️ | ❌ | ❌ | ❌ |
| Weaviate support | ✅ | ❌ | ❌ | ❌ | ❌ |
| Qdrant support | ✅ | ❌ | ❌ | ❌ | ❌ |
| Milvus / Zilliz Cloud support | ✅ | ❌ | ❌ | ❌ | ❌ |
| MongoDB Atlas vector search | ✅ | ❌ | ❌ | ❌ | ❌ |
| Resume support & deduplication | ✅ URL hashing + KV store | ❌ | ❌ | ❌ | ❌ |
| Robots.txt support | ✅ | ❌ | ❌ | ❌ | ✅ |
| Pagination & infinite scroll | ✅ | ⚠️ | ⚠️ | ⚠️ | ✅ |
| User-defined dataset naming | ✅ Safe & sanitized | ❌ | ❌ | ❌ | ❌ |
| End-to-end RAG ingestion | ✅ Yes (single actor) | ⚠️ Partial | ❌ | ❌ | ❌ |

Crawl Summary
```json
{
  "crawlSummary": {
    "finishedAt": "2026-01-01T23:05:44Z",
    "totalPagesProcessed": 24,
    "totalChunksGenerated": 1200,
    "totalChunksPushed": 1200,
    "embeddingsFailed": 0
  }
}
```
- 🌐 Smart Crawling
  - Inclusion/exclusion glob patterns
  - Pagination detection (`next` links)
  - Infinite scroll support
  - `robots.txt` compliance
- 🧹 Deduplication
  - BLAKE3 URL hashing + Key-Value Store
  - Resume-safe across runs
- 📊 Live Logging & Summary
  - Optional live page processing logs
  - Final crawl statistics
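The deduplication idea (hash each URL, remember the hash in a persistent store, skip anything already seen) can be sketched in a few lines of Python. Several things here are assumptions: the standard library has no BLAKE3, so `hashlib.blake2b` stands in for it; a plain `dict` stands in for the Apify Key-Value Store; and stripping the URL fragment before hashing is a hypothetical normalization rule, not something this README specifies.

```python
import hashlib
from typing import Optional
from urllib.parse import urldefrag


def url_key(url: str) -> str:
    """Stable key for a URL. BLAKE2b is a stdlib stand-in for BLAKE3."""
    normalized = urldefrag(url.strip()).url  # assumption: ignore #fragments
    return hashlib.blake2b(normalized.encode("utf-8"), digest_size=16).hexdigest()


class SeenUrls:
    """Tracks processed URLs; `store` stands in for a Key-Value Store."""

    def __init__(self, store: Optional[dict] = None):
        self.store = store if store is not None else {}

    def should_process(self, url: str) -> bool:
        key = url_key(url)
        if key in self.store:
            return False  # already handled in this run or an earlier one
        self.store[key] = True
        return True


# Reusing the same store in a later run is what makes the crawl resume-safe.
persisted = {}
first_run = SeenUrls(persisted)
first_run.should_process("https://docs.python.org/3/library/functions.html")
second_run = SeenUrls(persisted)
```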
🚀 What This Actor Produces
Each processed documentation page is stored as one dataset result and may contain multiple chunks.
Each dataset result includes chunk records like:
{"url": "https://docs.python.org/3/library/functions.html","title": "Built-in Functions","docId": "uuid-v5-based-id","chunkIndex": 2,"chunk": "# abs(x)\n\nReturn the absolute value of a number. The argument may be an integer...\n\n```python...","embedding": [0.0123, -0.0456, ..., 0.7890]}
The run also reports a final crawl summary:
{"crawlSummary": {"finishedAt": "2026-01-01T23:05:44Z","totalPagesProcessed": 24,"totalChunksGenerated": 1200,"totalChunksPushed": 1200,"embeddingsFailed": 0}}
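The `docId` in the chunk record is described only as UUID v5 based. As a rough illustration of why that is useful, the Python sketch below derives a deterministic ID from the page URL, so re-crawling the same page yields the same ID. The choice of `uuid.NAMESPACE_URL` and of the URL as the name input are assumptions, and `chunk_id_for` is a hypothetical helper showing one way to build a per-chunk key for idempotent upserts.

```python
import uuid


def doc_id_for(url: str) -> str:
    """Deterministic UUID v5 for a page URL (namespace choice is an assumption)."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, url))


def chunk_id_for(url: str, chunk_index: int) -> str:
    """Hypothetical per-chunk key: the page's docId plus the chunk index."""
    return f"{doc_id_for(url)}:{chunk_index}"


page_url = "https://docs.python.org/3/library/functions.html"
doc_id = doc_id_for(page_url)          # same value on every run
chunk_id = chunk_id_for(page_url, 2)   # stable key for chunkIndex 2
```

Because the ID depends only on its inputs, writing the same chunk twice overwrites the earlier vector instead of creating a duplicate.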
⚙️ Input Configuration
Required
| Field | Description |
|---|---|
| startUrls | One or more URLs to begin crawling from |
Common Options
| Option | Default | Description |
|---|---|---|
| linkGlobs | Extensive list | Patterns to include (docs, guides, reference...) |
| excludeGlobs | Blogs, changelogs | Patterns to exclude (takes priority) |
| nextPageSelectors | .next, rel=next | CSS selectors for pagination |
| chunkSize | 1000 | Max characters per chunk |
| maxChunksPerPage | 50 | Prevent overload from huge pages |
| handleScroll | true | Load dynamic content |
| respectRobotsTxt | true | Obey site rules |
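The rule that `excludeGlobs` takes priority over `linkGlobs` can be pictured with the Python sketch below. It uses the standard library's `fnmatch`, whose wildcard semantics are simpler than the glob syntax the crawler itself uses (here `*` also matches `/`), so treat it as an approximation. The function name `should_enqueue` and the example patterns are hypothetical.

```python
from fnmatch import fnmatchcase


def should_enqueue(url: str, link_globs: list, exclude_globs: list) -> bool:
    """Exclusions win; otherwise the URL must match at least one include glob."""
    if any(fnmatchcase(url, pattern) for pattern in exclude_globs):
        return False
    return any(fnmatchcase(url, pattern) for pattern in link_globs)


# Hypothetical patterns for a docs site.
link_globs = ["https://example.com/docs/*"]
exclude_globs = ["*/changelog/*"]

should_enqueue("https://example.com/docs/guide/intro", link_globs, exclude_globs)   # True
should_enqueue("https://example.com/docs/changelog/v2", link_globs, exclude_globs)  # False
```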
Embeddings Options
| Option | Default | Notes |
|---|---|---|
| generateEmbeddings | true | Enable embedding generation |
| embeddingProvider | azure | azure or openai |
| azureOpenAiApiKey | — | Required for Azure |
| azureOpenAiEndpoint | — | e.g., https://your-resource.openai.azure.com/ |
| azureDeploymentName | — | Deployment name (not model name) |
| openAiApiKey | — | Required for OpenAI |
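Before starting a run, it can help to check that the credentials match the chosen provider. The Python sketch below is a hypothetical pre-flight check, not part of the Actor. It assumes that all three Azure fields are needed when `embeddingProvider` is `azure` (the table above marks only the API key as required explicitly) and that the defaults are the ones listed in the table.

```python
REQUIRED_BY_PROVIDER = {
    # Assumption: endpoint and deployment name are needed alongside the key.
    "azure": ("azureOpenAiApiKey", "azureOpenAiEndpoint", "azureDeploymentName"),
    "openai": ("openAiApiKey",),
}


def missing_embedding_fields(actor_input: dict) -> list:
    """Return the embedding options that are still empty for the chosen provider."""
    if not actor_input.get("generateEmbeddings", True):
        return []  # embeddings disabled, so no credentials are needed
    provider = actor_input.get("embeddingProvider", "azure")
    if provider not in REQUIRED_BY_PROVIDER:
        raise ValueError(f"unknown embeddingProvider: {provider!r}")
    return [name for name in REQUIRED_BY_PROVIDER[provider]
            if not actor_input.get(name)]


missing = missing_embedding_fields({
    "embeddingProvider": "azure",
    "azureOpenAiApiKey": "YOUR_AZURE_KEY",
})
# missing == ["azureOpenAiEndpoint", "azureDeploymentName"]
```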
🧪 Example Input: Python Docs → MongoDB Atlas
{"startUrls": [{ "url": "https://docs.python.org/3/" }],"generateEmbeddings": true,"embeddingProvider": "azure","azureOpenAiApiKey": "YOUR_AZURE_KEY","azureOpenAiEndpoint": "https://your-resource.openai.azure.com/","azureDeploymentName": "embedding-deployment","pushToVectorDb": true,"vectorDbProvider": "mongodb","vectorDbEnvironment": "mongodb+srv://user:pass@cluster.mongodb.net","vectorDbIndexName": "python_docs","vectorDbNamespace": "chunks"}
🎯 Best Practices
- Use the official docs root as the start URL
- Tune `linkGlobs`/`excludeGlobs` to avoid blogs or API references
- For large crawls: enable streaming to reduce memory usage
- Prefer Azure OpenAI for enterprise compliance
- Keep `chunkSize` ≤ 1500 for best embedding performance
⚠️ Notes
- JSDOM may log harmless CSS parsing warnings on modern sites — safe to ignore
- Very large index pages are capped by `maxChunksPerPage`
- Embedding failures are logged but do not stop the crawl
🛠️ Technologies Used
- Apify SDK
- Crawlee + Playwright
- Mozilla Readability
- Turndown
- Azure OpenAI / OpenAI API
- MongoDB, Pinecone, Weaviate, Qdrant, Milvus clients
🚀 Quick Start
- Set your `startUrls`
- Choose embedding provider and credentials
- Select vector DB (optional)
- Run the Actor
- Use the dataset or your vector DB for RAG!
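If you skip the vector DB and work from the dataset directly, retrieval for a small corpus can be done in memory. The Python sketch below ranks chunk records by cosine similarity to a query embedding. It assumes records shaped like the streamed example output (each with `chunk` and `embedding`), uses toy 2-dimensional vectors, and leaves out how the query embedding is produced, which must use the same embedding model as the crawl. `top_k` is a hypothetical helper.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_embedding: list, records: list, k: int = 3) -> list:
    """Return the k records whose embeddings are most similar to the query."""
    scored = [(cosine_similarity(query_embedding, r["embedding"]), r)
              for r in records if r.get("embedding")]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in scored[:k]]


# Toy records; real embeddings have hundreds or thousands of dimensions.
records = [
    {"chunk": "abs(x) returns the absolute value", "embedding": [1.0, 0.0]},
    {"chunk": "open() opens a file", "embedding": [0.0, 1.0]},
    {"chunk": "round() rounds a number", "embedding": [0.9, 0.1]},
]
best = top_k([1.0, 0.0], records, k=2)
```

A linear scan like this is fine for a few thousand chunks; beyond that, one of the supported vector databases is the better fit.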
💰 Pricing
This Actor is priced per processed documentation page.
- 1 result = 1 documentation page
- Pricing is based on items pushed to the default dataset
- You are not charged per chunk or per embedding
You bring your own:
- Embedding provider (Azure OpenAI / OpenAI)
- Vector database (MongoDB, Pinecone, Weaviate, Qdrant, Milvus)
No token usage, embedding costs, or vector storage costs are included in the Actor price.
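As a quick worked example of the per-page pricing, the Python sketch below estimates the Actor fee for a crawl. It assumes the listed starting rate of $5.00 per 1,000 pages applies to your plan (the listing says "from", so your actual rate may differ) and it deliberately excludes embedding and vector storage costs, which are billed by your own providers. `estimate_actor_cost` is a hypothetical helper.

```python
def estimate_actor_cost(pages: int, usd_per_1000_pages: float = 5.00) -> float:
    """Actor fee only: pages are billed, chunks and embeddings are not."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return round(pages / 1000 * usd_per_1000_pages, 4)


# The 24-page crawl from the summary above, at the assumed starting rate.
small_crawl = estimate_actor_cost(24)      # 0.12 USD
large_crawl = estimate_actor_cost(10_000)  # 50.0 USD
```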
Built with ❤️ for the AI + documentation community.
Happy crawling!