🧩 AI Docs → Markdown + Embeddings → Vector DB Crawler

Web Crawler | Clean Markdown Output | Smart Chunking | Embeddings | Vector Database Ingestion

Crawl documentation websites, convert pages into high-quality Markdown, intelligently chunk content for RAG pipelines, generate embeddings (OpenAI or Azure OpenAI), and optionally upsert everything directly into your vector database — all in one Apify Actor.


🔌 Integrations

Supported providers: Apify Actor, Azure OpenAI, OpenAI, MongoDB Atlas, Pinecone, Weaviate, Qdrant, and Milvus / Zilliz.

Who is this Actor for?

This Actor is ideal for:

  • AI engineers building RAG or semantic search systems
  • Teams turning docs into AI assistants or chatbots
  • SaaS companies indexing product documentation
  • Developers migrating docs into vector databases

🚀 Quick Start (5 minutes)

Minimal configuration to crawl docs and store embeddings in MongoDB:

```json
{
  "startUrls": ["https://docs.python.org/3/"],
  "vectorDbProvider": "mongodb",
  "mongoUri": "mongodb+srv://...",
  "collectionName": "python_docs",
  "embeddingProvider": "openai",
  "openaiApiKey": "YOUR_KEY"
}
```
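
If you prefer to start runs programmatically, the same input can be passed through the official apify-client package. A minimal sketch; the Actor ID below is a placeholder, so substitute the real one from the Apify Store:

```python
# Minimal sketch: start the Actor via the Apify API (pip install apify-client).
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run_input = {
    "startUrls": ["https://docs.python.org/3/"],
    "vectorDbProvider": "mongodb",
    "mongoUri": "mongodb+srv://...",
    "collectionName": "python_docs",
    "embeddingProvider": "openai",
    "openaiApiKey": "YOUR_KEY",
}

# "username/docs-to-markdown-crawler" is a placeholder Actor ID.
run = client.actor("username/docs-to-markdown-crawler").call(run_input=run_input)
print("Default dataset ID:", run["defaultDatasetId"])
```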

🧩 Docs & Website Crawler → Semantic RAG Flow

Crawler → RAG Pipeline

```text
[Start URLs]
      |
      v
+--------------------------+
|       Crawl Pages        |  <- Playwright (JS + HTML rendering)
+--------------------------+
      |
      v
+--------------------------+
|       Extract Text       |  <- Readability -> Turndown -> Clean Markdown
+--------------------------+
      |
      v
+--------------------------+
|        Chunk Text        |  <- paragraph-aware • configurable size & overlap
+--------------------------+
      |
      v
+--------------------------+
|   Generate Embeddings    |  <- optional • Azure OpenAI / OpenAI
+--------------------------+
      |
      v
+--------------------------+
|   Vector DB Ingestion    |  <- Pinecone • Weaviate • Qdrant • Milvus • MongoDB Atlas
+--------------------------+
      |
      v
+--------------------------+
|  Semantic RAG Pipeline   |  <- retrieve + context + generate
+--------------------------+
      |
      v
[Final Answer / Insights]
```

🕵️ Research Crawl Walkthrough

The first phase in your workflow is Research Crawl — a human-in-the-loop content curation step before generating embeddings.

It ensures high-quality RAG input, avoids noisy embeddings, and reduces costs.


1️⃣ Start Crawling (Research Mode)

  • Run the Actor with embeddings disabled
  • Crawler uses Playwright to render JS-heavy pages
  • Main content is extracted using:
    • Mozilla Readability
    • Turndown → Markdown
  • Pages are chunked (paragraph-aware)
  • Duplicate URLs are automatically removed
  • Pagination & infinite scroll are handled
  • Only the content is stored; no embeddings are generated yet

✅ Safe exploratory crawl — fast, cost-efficient, and focused on content discovery.
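
Concretely, research mode is the normal input with embedding generation and vector DB ingestion switched off. A sketch; field names mirror the example input later in this README:

```python
# Research-mode input: crawl, extract, and chunk only.
research_input = {
    "startUrls": [{"url": "https://docs.python.org/3/"}],
    "generateEmbeddings": False,  # no embeddings during research
    "pushToVectorDb": False,      # no vector DB writes during research
    "chunkSize": 1000,            # default chunk size in characters
    "maxChunksPerPage": 50,       # safety limit for very large pages
}
```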


2️⃣ What Happens After Crawling

The Actor outputs:

a) Dataset

  • Stores all chunks per page
  • Each record contains:
    • url, title, chunk, chunkIndex, docId, chunkId
  • Location on Apify: Dataset tab → your chosen dataset

b) Key-Value Store (HTML Research File)

  • HTML research interface: {datasetName}-full.html
    Example: demo-full.html
  • Markdown archive: {datasetName}-full.md
    Example: demo-full.md
  • Location on Apify: Key-Value Stores tab → your chosen store
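
If you would rather script the review, the same records can be pulled with apify-client instead of downloading the HTML file. A short sketch; the dataset ID comes from your run:

```python
# Iterate the crawled chunks in the run's dataset (pip install apify-client).
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Each record carries url, title, chunk, chunkIndex, docId, and chunkId.
for item in client.dataset("YOUR_DATASET_ID").iterate_items():
    print(item["url"], item["chunkIndex"], item["chunk"][:80])
```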

3️⃣ How to Work on the Research HTML

  1. Download HTML from Key-Value Store
  2. Open in a browser
  3. Use search box to filter by keyword (title, URL, content)
  4. Expand page previews to check relevance
  5. Select URLs with checkboxes or “Select all matching URLs”
  6. Export curated URLs as JSON for RAG ingestion

Export only the pages you want to embed; this keeps the embeddings high-quality and cost-efficient.

Demo (crawled quotes site): screenshots show a search for 'rowling' in the research HTML and the exported URL list.

4️⃣ Next Steps After Research

  • Feed JSON export into Actor as startUrls
  • Enable embeddings (Azure OpenAI / OpenAI)
  • Enable vector DB ingestion (MongoDB, Pinecone, Weaviate, Qdrant, Milvus)
  • The Actor now generates embeddings only for curated URLs
  • RAG system retrieves accurate, relevant content
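
The exported file is a plain list of URLs, so turning it into the Actor's startUrls input takes only a few lines. A sketch, assuming the export is saved as curated-urls.json and contains a JSON array of URL strings:

```python
# Turn the curated URL export into the Actor's startUrls input.
import json

with open("curated-urls.json") as f:
    urls = json.load(f)  # assumed shape: ["https://...", "https://..."]

ingestion_input = {
    "startUrls": [{"url": u} for u in urls],
    "generateEmbeddings": True,  # now generate embeddings
    "pushToVectorDb": True,      # and push straight to your vector DB
}
```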

5️⃣ Summary of Research Crawl

| Step | Purpose | Output |
| --- | --- | --- |
| Crawl (research mode) | Explore site content | Dataset + Markdown + HTML |
| Review (research HTML) | Search, preview, select | Curated URL list |
| Export | Feed curated URLs | startUrls.json |
| RAG ingestion | Generate embeddings + vector DB | Semantic search-ready vectors |

🚀 How It Works

  • Crawl – Playwright-based crawler for modern JS-heavy sites
  • Extract – Clean content via Mozilla Readability → Turndown → Markdown
  • Chunk – Paragraph-aware chunking optimized for RAG
  • Embed – Generate embeddings per chunk (Azure OpenAI or OpenAI)
  • Store – Optionally upsert chunks + embeddings into a vector database
  • Track – URL deduplication and resume-safe crawling via Key-Value Store
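
For intuition, the Extract and Chunk stages can be approximated outside the Actor. A rough Python sketch using readability-lxml and markdownify as stand-ins for Mozilla Readability and Turndown (the Actor runs the JavaScript originals, so treat this as an approximation):

```python
# Rough stand-in for Extract -> Chunk (pip install readability-lxml markdownify).
from readability import Document
from markdownify import markdownify


def html_to_markdown(html: str) -> str:
    # Keep only the main article content, then convert it to Markdown.
    main_html = Document(html).summary()
    return markdownify(main_html, heading_style="ATX")


def chunk_markdown(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    # Paragraph-aware chunking: pack whole paragraphs up to chunk_size,
    # carrying a tail of the previous chunk forward as overlap.
    # Note: this sketch keeps oversized paragraphs whole instead of splitting them.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > chunk_size:
            chunks.append(current)
            current = current[-overlap:]  # overlap with the previous chunk
        current = f"{current}\n\n{para}".strip()
    if current:
        chunks.append(current)
    return chunks
```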

🧠 Key Features

  • 🔍 High-quality content extraction (no navs, footers, ads)
  • 📝 Clean Markdown output (headings, lists, code blocks preserved)
  • 🧩 Intelligent chunking (configurable size, overlap, limits)
  • 🧠 Embedding generation with Azure OpenAI or OpenAI
  • 🗄️ Direct vector DB ingestion (MongoDB, Pinecone, Weaviate, Qdrant, Milvus)
  • 🌐 Smart crawling (pagination, infinite scroll, robots.txt)
  • 🧹 Resume-safe deduplication
  • 📊 Optional live debug metrics for retrieval quality

⚡ Streaming vs Batch Mode

| Mode | Description | Memory usage | Speed | Record style |
| --- | --- | --- | --- | --- |
| streamChunks: true | Push individual chunks as they are created | Low | Faster | Many small records |
| streamChunks: false | Push all chunks of a page together | Higher | Slower | Cleaner per-page |

🗄️ Supported Vector Databases

| Database | Auto-create collection | Batch upsert | Notes |
| --- | --- | --- | --- |
| MongoDB Atlas | Yes | Yes | Atlas Vector Search |
| Pinecone | Yes | Yes | Namespace support |
| Weaviate | Yes | Yes | Cloud & self-hosted |
| Qdrant | Yes | Yes | Cloud & self-hosted |
| Milvus / Zilliz | Yes | Yes | Cloud & self-hosted |

⚙️ Input Configuration

Required

  • startUrls – One or more URLs to begin crawling from.


Common Crawl Options

| Option | Default | Description |
| --- | --- | --- |
| linkGlobs | Extensive (broad) | URL patterns to include in crawling |
| excludeGlobs | Blogs, changelogs | URL patterns to exclude |
| nextPageSelectors | .next, rel=next | CSS selectors for detecting pagination |
| chunkSize | 1000 | Maximum characters per chunk |
| maxChunksPerPage | 50 | Safety limit for very large pages |
| handleScroll | true | Enables handling of infinite scroll |
| respectRobotsTxt | true | Respects the website's robots.txt rules |
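
Put together, a tuned crawl configuration might look like the sketch below; the glob patterns are illustrative, so adjust them to the site you are crawling:

```python
# Illustrative crawl options built from the table above.
crawl_options = {
    "linkGlobs": ["https://docs.example.com/**"],       # stay inside the docs
    "excludeGlobs": ["**/blog/**", "**/changelog/**"],  # skip noisy sections
    "nextPageSelectors": [".next", "a[rel=next]"],      # pagination detection
    "chunkSize": 1000,
    "maxChunksPerPage": 50,
    "handleScroll": True,
    "respectRobotsTxt": True,
}
```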

🧠 Embedding Provider Setup (Actor Input)

Azure OpenAI

  • API Key – Your Azure OpenAI API key
  • Azure Endpoint – Example: https://your-resource.openai.azure.com/
  • Deployment Name – Azure deployment name (not model name)

OpenAI

  • API Key – Your OpenAI API key
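
For orientation, embedding generation boils down to one API call per batch of chunks. A sketch with the official openai Python package; the model name here is an assumption, since the Actor uses whichever model or deployment you configure:

```python
# Sketch of batched embedding generation (pip install openai).
from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY")

chunks = ["First Markdown chunk...", "Second Markdown chunk..."]
# "text-embedding-3-small" is an assumed model, not the Actor's fixed choice.
response = client.embeddings.create(model="text-embedding-3-small", input=chunks)
vectors = [item.embedding for item in response.data]  # one vector per chunk
```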

🗄️ Vector Database Accounts (Required)

You must bring your own vector database account; the Actor authenticates with the credentials you supply in its input, as described below.

🔐 Vector Database Authentication (Actor Input)

Milvus / Zilliz

  • Vector DB Provider: Milvus
  • API Key: username:password
  • Host / Connection String: Public endpoint
  • Collection Name: Database name (auto-created if missing)

MongoDB Atlas

  • Vector DB Provider: MongoDB
  • API Key: Not required
  • Host / Connection String:
    mongodb+srv://<username>:<password>@<cluster-url>/<database>?retryWrites=true&w=majority
    • URL-encode special characters in password
    • Ensure network access allows your IP (0.0.0.0/0 to allow all)
  • Index / Collection Name: Database name

Pinecone

  • Vector DB Provider: Pinecone
  • API Key: Pinecone API key
  • Index Name: Index name (auto-created)

Qdrant

  • Vector DB Provider: Qdrant
  • API Key: Qdrant API key
  • Host: Cluster endpoint
  • Collection Name: Auto-created if missing

Weaviate

  • Vector DB Provider: Weaviate
  • API Key: Weaviate API key
  • Host: Cluster endpoint
  • Collection Name: Must start with a capital letter (auto-created)
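
Once a run finishes, it is worth sanity-checking the upserted vectors with a similarity query. A sketch for MongoDB Atlas using pymongo; the index name, vector field, vector dimension, and database/collection names are all assumptions, so match them to what the Actor created in your cluster:

```python
# Sanity-check Atlas Vector Search after ingestion (pip install pymongo).
from pymongo import MongoClient

client = MongoClient("mongodb+srv://user:pass@cluster.mongodb.net")
collection = client["python_docs"]["python_docs"]  # assumed db/collection names

query_vector = [0.0] * 1536  # placeholder: embed your query text first

results = collection.aggregate([
    {
        "$vectorSearch": {
            "index": "vector_index",      # assumed index name
            "path": "embedding",          # assumed vector field name
            "queryVector": query_vector,
            "numCandidates": 100,
            "limit": 5,
        }
    }
])
for doc in results:
    print(doc["url"], doc["chunk"][:80])
```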

Vector DB upsert examples: the Actor page includes screenshots for Pinecone, Qdrant, MongoDB, Milvus, and Weaviate.

🧪 Example Input (Python Docs → MongoDB Atlas)

```json
{
  "startUrls": [{ "url": "https://docs.python.org/3/" }],
  "generateEmbeddings": true,
  "embeddingProvider": "azure",
  "azureOpenAiApiKey": "YOUR_AZURE_KEY",
  "azureOpenAiEndpoint": "https://your-resource.openai.azure.com/",
  "azureDeploymentName": "embedding-deployment",
  "pushToVectorDb": true,
  "vectorDbProvider": "mongodb",
  "vectorDbEnvironment": "mongodb+srv://user:pass@cluster.mongodb.net",
  "vectorDbIndexName": "python_docs"
}
```

💰 Pricing

Charged per processed documentation page, from $5.00 per 1,000 documents processed.

  • 1 dataset item = 1 documentation page
  • No extra cost per chunk or embedding

You provide:

  • An embedding provider account (Azure OpenAI / OpenAI)
  • A vector database account

Built with ❤️ for the AI + documentation community.

Happy crawling 🚀