Turn any documentation site into clean Markdown, split it into intelligently chunked content with embeddings (Azure OpenAI or OpenAI), and upsert the result directly into MongoDB Atlas, Pinecone, Weaviate, Qdrant, or Milvus, ready for RAG, AI assistants, and semantic search in minutes.
Default value of this property is ["https://docs.crawl4ai.com/"]
Link Inclusion Patterns
linkGlobs array Optional
Glob patterns that restrict which links are followed (e.g., https://**/docs/**).
Default value of this property is ["https://**/docs/**","https://**/documentation/**","https://**/guides/**","https://**/guide/**","https://**/manual/**","https://**/reference/**","https://**/api-reference/**","https://**/developer/**","https://**/learn/**","https://**/tutorial/**","https://**/how-to/**","https://**/kb/**","https://**/*.md"]
Link Exclusion Patterns
excludeGlobs array Optional
Glob patterns for links to skip. Exclusion takes priority over inclusion.
Default value of this property is ["https://**/blog/**","https://**/news/**","https://**/changelog**","https://**/releases/**","https://**/v1/**","https://**/v2/**","https://**/archive/**"]
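The way inclusion and exclusion patterns interact can be sketched as follows. This is a minimal illustration, not the crawler's actual matcher: it approximates glob matching with Python's `fnmatch` (where `*` also matches `/`), uses only a subset of the defaults listed above, and the function name `should_follow` is hypothetical.

```python
from fnmatch import fnmatchcase

# A subset of the default patterns listed above.
LINK_GLOBS = ["https://**/docs/**", "https://**/guides/**", "https://**/*.md"]
EXCLUDE_GLOBS = ["https://**/blog/**", "https://**/v1/**", "https://**/archive/**"]


def should_follow(url, link_globs=LINK_GLOBS, exclude_globs=EXCLUDE_GLOBS):
    """Return True if the URL matches an inclusion glob and no exclusion glob."""
    # Exclusion is checked first because it takes priority over inclusion.
    if any(fnmatchcase(url, pattern) for pattern in exclude_globs):
        return False
    return any(fnmatchcase(url, pattern) for pattern in link_globs)
```

Under these assumptions, a URL such as `https://example.com/blog/docs/intro` matches an inclusion pattern but is still skipped, because it also matches `https://**/blog/**`.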
Respect robots.txt
respectRobotsTxt boolean Optional
Obey the site's robots.txt rules.
Default value of this property is true
Custom User Agent
userAgent string Optional
User-Agent string sent during crawling.
Default value of this property is "Mozilla/5.0 (compatible; DocsCrawler/1.0)"
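How a robots.txt check with a custom User-Agent might work can be sketched with Python's standard `urllib.robotparser`. This is only an illustration under assumptions: the crawler's real implementation is not documented here, the helper name `is_allowed` is hypothetical, and the robots.txt content is made up.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "Mozilla/5.0 (compatible; DocsCrawler/1.0)"  # default userAgent above


def is_allowed(url, robots_lines, respect_robots_txt=True, user_agent=USER_AGENT):
    """Decide whether a URL may be fetched given the lines of a robots.txt file."""
    if not respect_robots_txt:
        return True  # the check is skipped entirely when the option is off
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content for illustration.
ROBOTS = ["User-agent: *", "Disallow: /private/"]
```

With this made-up robots.txt, pages under `/private/` are rejected while `respect_robots_txt` is on, and everything is allowed once it is turned off.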
Handle Infinite Scroll
handleScroll boolean Optional
Scroll pages to load dynamic/lazy content.
Default value of this property is true
Max Scrolls
maxScrolls integer Optional
Maximum number of scroll actions per page.
Default value of this property is 10
Minimum Content Length (characters)
minContentLength integer Optional
Skip pages whose readable text is shorter than this many characters.
Default value of this property is 500
Chunk Size (characters)
chunkSize integer Optional
Maximum characters per chunk.
Default value of this property is 1000
Chunk Overlap (characters)
chunkOverlap integer Optional
Number of characters shared between consecutive chunks, so that context carries across chunk boundaries.
Default value of this property is 200
Max Chunks Per Page
maxChunksPerPage integer Optional
Maximum number of chunks extracted from a single page.
Default value of this property is 50
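The chunking parameters above can be illustrated with a plain sliding character window. This is a simplified sketch, not the crawler's own chunking logic (which is described as "intelligent" and may split on headings or sentences). The function names `chunk_text` and `chunk_page` are hypothetical, and the defaults mirror the values documented above.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200, max_chunks=50):
    """Split text into overlapping character windows."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this many characters later
    chunks = []
    start = 0
    while start < len(text) and len(chunks) < max_chunks:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


def chunk_page(text, min_content_length=500, **kwargs):
    """Return no chunks for pages shorter than min_content_length."""
    if len(text) < min_content_length:
        return []
    return chunk_text(text, **kwargs)
```

With the defaults, a 2,500-character page yields three chunks starting at offsets 0, 800, and 1,600, and each chunk repeats the last 200 characters of the previous one.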
Stream Chunks to Dataset
streamChunks boolean Optional
Push each chunk to the dataset as soon as it is produced (useful for very large crawls).
Default value of this property is false
Dataset Name (optional)
datasetName string Optional
Custom name for the output dataset. Leave blank to use the default.
Default value of this property is ""
Generate Embeddings
generateEmbeddings boolean Optional
Create 1536-dimensional embeddings using text-embedding-3-small. Required for pushing to a vector database.
Default value of this property is true
Embedding Provider
embeddingProvider Enum Optional
Choose Azure OpenAI or direct OpenAI API.
Value options:
"azure": string
"openai": string
Default value of this property is "azure"
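The difference between the two providers can be sketched by building the HTTP request each one expects, without sending it. This follows the public REST shapes of the OpenAI and Azure OpenAI embeddings endpoints and is not taken from the crawler's code. The function name, the Azure resource and deployment names, and the API version are all assumptions chosen for illustration.

```python
import json


def build_embedding_request(texts, provider, api_key,
                            azure_resource="my-resource",       # hypothetical
                            azure_deployment="my-embeddings",   # hypothetical
                            azure_api_version="2024-02-01"):    # assumed version
    """Return (url, headers, body) for an embeddings call; nothing is sent."""
    if provider == "openai":
        url = "https://api.openai.com/v1/embeddings"
        headers = {"Authorization": f"Bearer {api_key}"}
        payload = {"model": "text-embedding-3-small", "input": texts}
    elif provider == "azure":
        # On Azure the model is selected by the deployment name in the URL.
        url = (f"https://{azure_resource}.openai.azure.com/openai/deployments/"
               f"{azure_deployment}/embeddings?api-version={azure_api_version}")
        headers = {"api-key": api_key}
        payload = {"input": texts}
    else:
        raise ValueError(f"unknown embedding provider: {provider!r}")
    headers["Content-Type"] = "application/json"
    return url, headers, json.dumps(payload)
```

The main practical difference is authentication and addressing: OpenAI uses a bearer token and names the model in the request body, while Azure uses an `api-key` header and a deployment-specific URL.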
Azure OpenAI API Key
azureOpenAiApiKey string Optional
Your Azure OpenAI API key (hidden in run details for security). Required when the embedding provider is Azure.
Storage name:
• Pinecone: index name (auto-created)
• Weaviate/Qdrant/Milvus: collection name (auto-created)
• MongoDB Atlas: database name (default: 'vectors')
Default value of this property is "my-docs"
Namespace / Collection Name
vectorDbNamespace string Optional
Grouping:
• Pinecone: namespace (optional)
• MongoDB Atlas: collection name (default: 'chunks')
• Leave blank for others