Transforms websites into RAG-ready datasets. It crawls pages, splits the content into semantic chunks of 500-1000 tokens, and generates hypothetical questions for each chunk. The 'native' provider needs no API key. Output: pre-indexed JSON intended for AI retrieval, with a claimed 3x better retrieval accuracy than raw text.
List of URLs to start crawling from. The crawler will follow links within the same domain.
Max Pages to Crawl
maxCrawlPages (integer, optional)
Maximum number of pages to crawl. Set to 0 for unlimited.
Default value of this property is 50
Max Crawl Depth
maxCrawlDepth (integer, optional)
Maximum depth of links to follow from start URLs.
Default value of this property is 3
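How the two limits above interact can be sketched as a bounded breadth-first traversal. This is a minimal illustration, not the crawler's actual implementation: the function name `bounded_crawl` and the in-memory `links` map (standing in for real page fetching) are hypothetical, and counting the start URLs as depth 0 is an assumption.

```python
from collections import deque


def bounded_crawl(start_urls, links, max_pages=50, max_depth=3):
    """Breadth-first traversal limited by page count and link depth.

    `links` is a hypothetical map of URL -> list of outgoing URLs that
    replaces real HTTP fetching. Start URLs are treated as depth 0
    (an assumption). max_pages=0 means unlimited, as documented.
    """
    starts = list(dict.fromkeys(start_urls))  # de-duplicate, keep order
    seen = set(starts)
    queue = deque((url, 0) for url in starts)
    visited = []
    while queue:
        if max_pages and len(visited) >= max_pages:
            break  # page budget exhausted
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links past the depth limit
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited
```

With breadth-first ordering, a small page budget is spent on pages closest to the start URLs first, which is usually what a documentation crawl wants.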
Chunk Size (tokens)
chunkSize (integer, optional)
Target size for each content chunk in tokens. Recommended: 500-1000 for optimal RAG performance.
Default value of this property is 750
Chunk Overlap (tokens)
chunkOverlap (integer, optional)
Number of overlapping tokens between consecutive chunks to maintain context.
Default value of this property is 100
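The relationship between chunk size and overlap can be illustrated with a fixed sliding window over a token list. This is only a sketch under stated assumptions: the tool describes its chunks as semantic segments, so its real splitter probably respects content boundaries, and its tokenizer is not documented. The function name `chunk_tokens` is hypothetical, and any list of tokens (for example, whitespace-split words) can be passed in.

```python
def chunk_tokens(tokens, chunk_size=750, chunk_overlap=100):
    """Split a token list into windows of chunk_size tokens, where each
    window repeats the last chunk_overlap tokens of the previous one.

    A plain sliding window, used here as a stand-in for the tool's
    undocumented semantic chunking.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks
```

With the defaults (750 and 100), each window advances by 650 tokens, so a 2000-token page yields three chunks starting at tokens 0, 650, and 1300.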
Questions per Chunk
questionsPerChunk (integer, optional)
Number of hypothetical questions to generate for each chunk.
Default value of this property is 3
LLM Provider
llmProvider (enum, optional)
LLM provider for generating hypothetical questions. Use 'native' for free rule-based generation (no API key needed), or 'openai'/'anthropic' for higher quality AI-generated questions.
Specific model to use (ignored for native provider). For OpenAI: gpt-4o-mini (cheap), gpt-4o (better). For Anthropic: claude-3-haiku-20240307 (cheap), claude-3-5-sonnet-20241022 (better).
Default value of this property is "gpt-4o-mini"
OpenAI API Key
openaiApiKey (string, optional)
Your OpenAI API key. Required if using OpenAI as LLM provider.
Anthropic API Key
anthropicApiKey (string, optional)
Your Anthropic API key. Required if using Anthropic as LLM provider.
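Because both key fields are marked optional but become required for their provider, it can help to check an input object before starting a run. The helper below is a hypothetical pre-flight check, not part of the tool. It uses only the field names documented above, and treating a missing `llmProvider` as 'native' is an assumption, since the provider's default is not stated here.

```python
# Maps each paid provider to the input field that must hold its API key.
REQUIRED_KEY_FIELD = {
    "openai": "openaiApiKey",
    "anthropic": "anthropicApiKey",
}


def check_llm_settings(run_input):
    """Return a list of problems with the LLM-related input fields.

    An empty list means the provider/key combination looks consistent.
    """
    # Assumption: a missing provider is treated as 'native'.
    provider = run_input.get("llmProvider", "native")
    if provider == "native":
        return []  # rule-based generation, no API key needed
    key_field = REQUIRED_KEY_FIELD.get(provider)
    if key_field is None:
        return [f"unknown llmProvider: {provider!r}"]
    if not run_input.get(key_field):
        return [f"{key_field} is required when llmProvider is {provider!r}"]
    return []
```

For example, `check_llm_settings({"llmProvider": "openai"})` reports the missing `openaiApiKey`, while an input with `"llmProvider": "native"` passes without any key.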
Include Page Metadata
includeMetadata (boolean, optional)
Include page title, description, and other metadata in output.
Default value of this property is true
Exclude CSS Selectors
excludeSelectors (array, optional)
CSS selectors for elements to exclude from content extraction (e.g., navigation, footer).
Default value of this property is ["nav","header","footer",".sidebar",".navigation",".menu",".advertisement",".ads","#cookie-banner"]
URL Patterns to Include
urlPatterns (array, optional)
Glob patterns for URLs to include. Leave empty to crawl all URLs on the domain.
Default value of this property is []
URL Patterns to Exclude
excludeUrlPatterns (array, optional)
Glob patterns for URLs to exclude from crawling.
Default value of this property is ["**/*.pdf","**/*.zip","**/*.png","**/*.jpg","**/*.gif","**/login*","**/signup*","**/auth*"]
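The combined effect of the include and exclude patterns can be approximated with Python's standard `fnmatch` module. This is a rough sketch, not the tool's matcher: the glob dialect it uses is not documented here, `fnmatch` lets `*` match across `/` (so `**` behaves like `*`), and letting exclusions win over inclusions is an assumption. The function name `should_crawl` is hypothetical.

```python
from fnmatch import fnmatchcase

# Documented default for excludeUrlPatterns.
DEFAULT_EXCLUDES = [
    "**/*.pdf", "**/*.zip", "**/*.png", "**/*.jpg", "**/*.gif",
    "**/login*", "**/signup*", "**/auth*",
]


def should_crawl(url, url_patterns=(), exclude_url_patterns=DEFAULT_EXCLUDES):
    """Decide whether a URL passes the include/exclude glob filters.

    Assumption: an exclude match always wins. An empty include list
    means every non-excluded URL is allowed, as documented.
    """
    if any(fnmatchcase(url, pattern) for pattern in exclude_url_patterns):
        return False
    if not url_patterns:
        return True
    return any(fnmatchcase(url, pattern) for pattern in url_patterns)
```

Note that a broad pattern such as `**/auth*` also matches paths like `/authors` under this approximation, so it is worth testing patterns against a few real URLs before a large crawl.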