Docs to Markdown + AI Embeddings → Vector DB Crawler
Pricing
from $5.00 / 1,000 documents processed
Turn any documentation site into clean Markdown, intelligently chunked content with embeddings (Azure/OpenAI), and directly upsert into MongoDB Atlas, Pinecone, Weaviate, Qdrant, or Milvus — ready for RAG, AI assistants, and semantic search in minutes.
Developer: Badruddeen Naseem
🧩 AI Docs → Markdown + Embeddings → Vector DB Crawler
Web Crawler | Clean Markdown Output | Smart Chunking | Embeddings | Vector Database Ingestion
Crawl documentation websites, convert pages into high-quality Markdown, intelligently chunk content for RAG pipelines, generate embeddings (OpenAI or Azure OpenAI), and optionally upsert everything directly into your vector database — all in one Apify Actor.
🔌 Integrations
| Provider | Apify Actor | Azure OpenAI | OpenAI | MongoDB Atlas | Pinecone | Weaviate | Qdrant | Milvus / Zilliz |
|---|---|---|---|---|---|---|---|---|
| Supported | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Who is this Actor for?
This Actor is ideal for:
- AI engineers building RAG or semantic search systems
- Teams turning docs into AI assistants or chatbots
- SaaS companies indexing product documentation
- Developers migrating docs into vector databases
🚀 Quick Start (5 minutes)
Minimal configuration to crawl docs and store embeddings in MongoDB:
```json
{
  "startUrls": ["https://docs.python.org/3/"],
  "vectorDbProvider": "mongodb",
  "mongoUri": "mongodb+srv://...",
  "collectionName": "python_docs",
  "embeddingProvider": "openai",
  "openaiApiKey": "YOUR_KEY"
}
```
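If you launch runs programmatically, the same minimal input can be assembled and sanity-checked in a few lines before being posted to the Apify API (token handling and the actor call itself are left to your setup; this sketch only builds the input):

```python
import json

# Minimal run input mirroring the Quick Start example above.
# Replace the placeholder values with real credentials at run time;
# keep secrets out of source control.
run_input = {
    "startUrls": ["https://docs.python.org/3/"],
    "vectorDbProvider": "mongodb",
    "mongoUri": "mongodb+srv://...",
    "collectionName": "python_docs",
    "embeddingProvider": "openai",
    "openaiApiKey": "YOUR_KEY",
}

# Basic sanity check before submitting the run.
required = {"startUrls", "vectorDbProvider", "embeddingProvider"}
missing = required - run_input.keys()
assert not missing, f"missing fields: {missing}"
print(json.dumps(run_input, indent=2))
```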
🧩 Docs & Website Crawler → Semantic RAG Flow
Crawler → RAG Pipeline
```
[Start URLs]
      |
      v
+--------------------------+
| Crawl Pages              |  <- Playwright (JS + HTML rendering)
+--------------------------+
      |
      v
+--------------------------+
| Extract Text             |  <- Readability -> Turndown -> Clean Markdown
+--------------------------+
      |
      v
+--------------------------+
| Chunk Text               |  <- paragraph-aware • configurable size & overlap
+--------------------------+
      |
      v
+--------------------------+
| Generate Embeddings      |  <- optional • Azure OpenAI / OpenAI
+--------------------------+
      |
      v
+--------------------------+
| Vector DB Ingestion      |  <- Pinecone • Weaviate • Qdrant • Milvus • MongoDB Atlas
+--------------------------+
      |
      v
+--------------------------+
| Semantic RAG Pipeline    |  <- retrieve + context + generate
+--------------------------+
      |
      v
[Final Answer / Insights]
```
🕵️ Research Crawl Walkthrough
The first phase in your workflow is Research Crawl — a human-in-the-loop content curation step before generating embeddings.
It ensures high-quality RAG input, avoids noisy embeddings, and reduces costs.
1️⃣ Start Crawling (Research Mode)
- Run the Actor with embeddings disabled
- Crawler uses Playwright to render JS-heavy pages
- Main content is extracted using:
- Mozilla Readability
- Turndown → Markdown
- Pages are chunked (paragraph-aware)
- Duplicate URLs are automatically removed
- Pagination & infinite scroll are handled
- Only the content is stored; no embeddings are generated yet
✅ Safe exploratory crawl — fast, cost-efficient, and focused on content discovery.
2️⃣ What Happens After Crawling
The Actor outputs:
a) Dataset
- Stores all chunks per page
- Each record contains:
`url`, `title`, `chunk`, `chunkIndex`, `docId`, `chunkId`
- Location on Apify: Dataset tab → your chosen dataset
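An illustrative dataset record (all field values hypothetical) might look like:

```json
{
  "url": "https://docs.python.org/3/tutorial/index.html",
  "title": "The Python Tutorial",
  "chunk": "Python is an easy to learn, powerful programming language...",
  "chunkIndex": 0,
  "docId": "a1b2c3",
  "chunkId": "a1b2c3-0"
}
```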
b) Key-Value Store (HTML Research File)
- HTML research interface: `{datasetName}-full.html` (example: `demo-full.html`)
- Markdown archive: `{datasetName}-full.md` (example: `demo-full.md`)
- Location on Apify: Key-Value Stores tab → your chosen store
3️⃣ How to Work on the Research HTML
- Download HTML from Key-Value Store
- Open in a browser
- Use search box to filter by keyword (title, URL, content)
- Expand page previews to check relevance
- Select URLs with checkboxes or “Select all matching URLs”
- Export curated URLs as JSON for RAG ingestion
Export only the pages you want to embed — this keeps embeddings high-quality and cost-efficient.
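The exported JSON is a plain list of curated URLs; a short script (file names and URLs illustrative) can wrap it into the `{"url": ...}` object form used by the Actor's input examples:

```python
import json

# Curated URLs exported from the research HTML (inlined here for illustration;
# in practice, load them from the exported file, e.g. json.load(open("curated-urls.json"))).
curated_urls = [
    "https://docs.python.org/3/tutorial/index.html",
    "https://docs.python.org/3/library/json.html",
]

# Wrap each URL in the {"url": ...} object shape and enable the ingestion flags.
run_input = {
    "startUrls": [{"url": u} for u in curated_urls],
    "generateEmbeddings": True,
    "pushToVectorDb": True,
}
print(json.dumps(run_input, indent=2))
```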
4️⃣ Next Steps After Research
- Feed the JSON export into the Actor as `startUrls`
- Enable embeddings (Azure OpenAI / OpenAI)
- Enable vector DB ingestion (MongoDB, Pinecone, Weaviate, Qdrant, Milvus)
- The Actor now generates embeddings only for curated URLs
- RAG system retrieves accurate, relevant content
5️⃣ Summary of Research Crawl
| Step | Purpose | Output |
|---|---|---|
| Crawl (research mode) | Explore site content | Dataset + Markdown + HTML |
| Review (Research HTML) | Search, preview, select | Curated URL list |
| Export | Feed curated URLs | startUrls.json |
| RAG Ingestion | Generate embeddings + vector DB | Semantic search-ready vectors |
🚀 How It Works
- Crawl – Playwright-based crawler for modern JS-heavy sites
- Extract – Clean content via Mozilla Readability → Turndown → Markdown
- Chunk – Paragraph-aware chunking optimized for RAG
- Embed – Generate embeddings per chunk (Azure OpenAI or OpenAI)
- Store – Optionally upsert chunks + embeddings into a vector database
- Track – URL deduplication and resume-safe crawling via Key-Value Store
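The Chunk step above can be sketched as a simplified, paragraph-aware splitter with configurable size and overlap (the Actor's actual implementation may differ):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split Markdown into chunks, preferring paragraph boundaries."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # If adding this paragraph would exceed the limit, flush the current chunk.
        if current and len(current) + len(para) + 2 > chunk_size:
            chunks.append(current)
            # Carry over the tail of the previous chunk as overlap for context.
            current = current[-overlap:] if overlap else ""
        current = f"{current}\n\n{para}".strip() if current else para
    if current:
        chunks.append(current)
    return chunks

# Demo with synthetic paragraphs.
sample = "\n\n".join(f"Paragraph {i}. " + "word " * 40 for i in range(10))
pieces = chunk_text(sample, chunk_size=500, overlap=50)
print(len(pieces), max(len(p) for p in pieces))
```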
🧠 Key Features
- 🔍 High-quality content extraction (no navigation bars, footers, or ads)
- 📝 Clean Markdown output (headings, lists, code blocks preserved)
- 🧩 Intelligent chunking (configurable size, overlap, limits)
- 🧠 Embedding generation with Azure OpenAI or OpenAI
- 🗄️ Direct vector DB ingestion (MongoDB, Pinecone, Weaviate, Qdrant, Milvus)
- 🌐 Smart crawling (pagination, infinite scroll, robots.txt)
- 🧹 Resume-safe deduplication
- 📊 Optional live debug metrics for retrieval quality
⚡ Streaming vs Batch Mode
| Mode | Description | Memory Usage | Speed | Record Style |
|---|---|---|---|---|
| `streamChunks: true` | Push individual chunks as they are created | Low | Faster | Many small records |
| `streamChunks: false` | Push all chunks of a page together | Higher | Slower | Cleaner per-page records |
🗄️ Supported Vector Databases
| Database | Auto Create Collection | Batch Upsert | Notes |
|---|---|---|---|
| MongoDB Atlas | Yes | Yes | Atlas Vector Search |
| Pinecone | Yes | Yes | Namespace support |
| Weaviate | Yes | Yes | Cloud & self-hosted |
| Qdrant | Yes | Yes | Cloud & self-hosted |
| Milvus / Zilliz | Yes | Yes | Cloud & self-hosted |
⚙️ Input Configuration
Required
`startUrls` – One or more URLs to begin crawling from.
Common Crawl Options
| Option | Default | Description |
|---|---|---|
| `linkGlobs` | Extensive (broad) | URL patterns to include in crawling |
| `excludeGlobs` | Blogs, changelogs | URL patterns to exclude |
| `nextPageSelectors` | `.next`, `rel=next` | CSS selectors for detecting pagination |
| `chunkSize` | 1000 | Maximum characters per chunk |
| `maxChunksPerPage` | 50 | Safety limit for very large pages |
| `handleScroll` | true | Enables handling of infinite scroll |
| `respectRobotsTxt` | true | Respects the website's robots.txt rules |
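Putting several of these options together, a crawl-tuning input might look like this (the glob patterns and selectors are illustrative, not defaults):

```json
{
  "startUrls": [{ "url": "https://docs.python.org/3/" }],
  "linkGlobs": ["https://docs.python.org/3/**"],
  "excludeGlobs": ["**/whatsnew/**", "**/changelog*"],
  "nextPageSelectors": [".next", "a[rel=next]"],
  "chunkSize": 1000,
  "maxChunksPerPage": 50,
  "handleScroll": true,
  "respectRobotsTxt": true
}
```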
🧠 Embedding Provider Setup (Actor Input)
Azure OpenAI
- API Key – Your Azure OpenAI API key
- Azure Endpoint – e.g. `https://your-resource.openai.azure.com/`
- Deployment Name – Azure deployment name (not the model name)
OpenAI
- API Key – Your OpenAI API key
🗄️ Vector Database Accounts (Required)
- Milvus / Zilliz Cloud – https://cloud.zilliz.com/login
- MongoDB Atlas – https://www.mongodb.com/products/platform/atlas-database
- Pinecone – https://www.pinecone.io
- Qdrant Cloud – https://cloud.qdrant.io/
- Weaviate – https://weaviate.io/
🔐 Vector Database Authentication (Actor Input)
Milvus / Zilliz
- Vector DB Provider: Milvus
- API Key: `username:password`
- Host / Connection String: Public endpoint
- Collection Name: Database name (auto-created if missing)
MongoDB Atlas
- Vector DB Provider: MongoDB
- API Key: Not required
- Host / Connection String: `mongodb+srv://<username>:<password>@<cluster-url>/<database>?retryWrites=true&w=majority`
- URL-encode special characters in the password
- Ensure network access allows your IP (`0.0.0.0/0` to allow all)
- Index / Collection Name: Database name
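URL-encoding the password can be done with the Python standard library, for example (credentials and cluster URL hypothetical):

```python
from urllib.parse import quote_plus

username = "atlas_user"
password = "p@ss/word!"  # contains characters that must be percent-encoded

# Percent-encode the credentials before embedding them in the connection string.
encoded = quote_plus(password)
uri = (
    f"mongodb+srv://{quote_plus(username)}:{encoded}"
    "@cluster0.example.mongodb.net/docs?retryWrites=true&w=majority"
)
print(encoded)  # p%40ss%2Fword%21
```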
Pinecone
- Vector DB Provider: Pinecone
- API Key: Pinecone API key
- Index Name: Index name (auto-created)
Qdrant
- Vector DB Provider: Qdrant
- API Key: Qdrant API key
- Host: Cluster endpoint
- Collection Name: Auto-created if missing
Weaviate
- Vector DB Provider: Weaviate
- API Key: Weaviate API key
- Host: Cluster endpoint
- Collection Name: Must start with a capital letter (auto-created)
🧪 Example Input (Python Docs → MongoDB Atlas)
```json
{
  "startUrls": [{ "url": "https://docs.python.org/3/" }],
  "generateEmbeddings": true,
  "embeddingProvider": "azure",
  "azureOpenAiApiKey": "YOUR_AZURE_KEY",
  "azureOpenAiEndpoint": "https://your-resource.openai.azure.com/",
  "azureDeploymentName": "embedding-deployment",
  "pushToVectorDb": true,
  "vectorDbProvider": "mongodb",
  "vectorDbEnvironment": "mongodb+srv://user:pass@cluster.mongodb.net",
  "vectorDbIndexName": "python_docs"
}
```
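Once ingestion finishes, the stored vectors can be queried with Atlas Vector Search. A sketch of the aggregation stage (the index name, vector field path, and query vector are assumptions; check the actual field names in your collection):

```json
{
  "$vectorSearch": {
    "index": "vector_index",
    "path": "embedding",
    "queryVector": [0.01, -0.02, 0.03],
    "numCandidates": 100,
    "limit": 5
  }
}
```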
💰 Pricing
Charged per processed documentation page.
- 1 dataset item = 1 documentation page
- No extra cost per chunk or embedding

You provide:
- An embedding provider account (Azure OpenAI / OpenAI)
- A vector database account
Built with ❤️ for the AI + documentation community.
Happy crawling 🚀