AI Data Pipeline — Crawl, Chunk & Export to Vector DB
Crawl any website, extract clean text, split into chunks with quality scoring, and export to JSON, Pinecone, or Qdrant. Built for RAG pipelines and AI training data. Includes language detection, content type classification, and token counting.
Pipeline Stages
URL --> Crawl --> Clean HTML --> Chunk Text --> Score Quality --> Export
- Crawl — Spider pages within the same domain using Playwright (JS-rendered content supported)
- Clean — Strip boilerplate (nav, footer, scripts, ads), convert code blocks to markdown fences, extract page title
- Chunk — Split into semantic chunks by headings/paragraphs with configurable overlap. Code blocks are protected (never split mid-block)
- Score — Rate each chunk 0-1 based on word count, structure, lexical diversity, code ratio, boilerplate detection
- Classify — Detect language (7 languages) and content type (documentation, article, blog, product, FAQ)
- Export — Push to Apify dataset (JSON) and/or Pinecone/Qdrant vector databases
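The chunking stage above can be sketched in a few lines. This is a simplified illustration under assumed behavior, not the actor's actual code (the real implementation also respects heading boundaries and protects code fences); tokens are approximated by whitespace-separated words:

```python
# Simplified sketch of overlap-aware chunking. Illustration only: the actor's
# real chunker also splits on headings and never breaks code blocks.

def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into ~chunk_size-token chunks, carrying chunk_overlap
    tokens into each following chunk. Tokens approximated as words."""
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break
        start = end - chunk_overlap  # overlap preserves context across cuts
    return chunks

doc = " ".join(f"w{i}" for i in range(1200))
parts = chunk_text(doc, chunk_size=500, chunk_overlap=50)
```

The overlap means the last 50 words of one chunk reappear at the start of the next, so retrieval queries that land near a boundary still see surrounding context.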
Input
| Field | Type | Description | Default |
|---|---|---|---|
| startUrls | Array | URLs to start crawling (use the URL editor) | Required |
| maxPages | Number | Maximum pages to crawl (1-10000) | 50 |
| chunkSize | Number | Target tokens per chunk (100-4000) | 500 |
| chunkOverlap | Number | Overlap tokens between chunks (0-500) | 50 |
| minQualityScore | Number | Minimum quality score to include a chunk (0-1) | 0.3 |
| exportTo | String | Export target: json, pinecone, qdrant | "json" |
| pineconeApiKey | String | Pinecone API key (secret) | None |
| pineconeIndexName | String | Pinecone index name | None |
| pineconeHost | String | Pinecone index host URL | None |
| qdrantUrl | String | Qdrant instance URL | None |
| qdrantApiKey | String | Qdrant API key (secret) | None |
| qdrantCollectionName | String | Qdrant collection name | None |
Example Input — JSON Dataset
{"startUrls": [{ "url": "https://docs.example.com" }],"maxPages": 100,"chunkSize": 500,"chunkOverlap": 50,"minQualityScore": 0.3}
Example Input — Export to Pinecone
{"startUrls": [{ "url": "https://docs.example.com" }],"maxPages": 100,"chunkSize": 500,"exportTo": "pinecone","pineconeApiKey": "your-api-key","pineconeHost": "https://your-index.svc.pinecone.io","pineconeIndexName": "docs"}
Output Per Chunk
{"url": "https://docs.apify.com/academy/getting-started","title": "Getting started | Academy | Apify Documentation","sourceTitle": "Getting started","chunk": "Getting started | Academy | Apify Documentation...","chunkIndex": 0,"totalChunks": 1,"qualityScore": 0.95,"tokenCount": 258,"language": "en","contentType": "documentation","summary": "Getting started | Academy | Apify Documentation...","metadata": {"headings": ["Getting started", "Getting to know the platform", "Next up"],"linkCount": 86,"imageCount": 3,"wordCount": 185},"lastProcessed": "2026-03-06T14:40:18.420Z"}
Pipeline Summary
At the end of each run, the actor logs a summary:
```
=== Pipeline Summary ===
Total pages attempted: 2
Pages with content: 2
Total chunks created: 4
Chunks filtered (minScore): 0
Avg quality score: 0.87 (min: 0.80, max: 0.95)
Avg token count: 357
Languages detected: en
Content types: documentation
Export format: json
```
Use Cases
- RAG Chatbots — Prepare knowledge bases for retrieval-augmented generation
- Documentation Search — Index docs sites for semantic search
- Knowledge Management — Convert websites into structured, searchable chunks
- Content Analysis — Score and filter web content by quality
- AI Fine-tuning — Prepare clean training data from websites
Quality Scoring
Each chunk is scored 0-1 based on:
- Word count — Penalizes very short or very long chunks
- Sentence structure — Proper sentences score higher
- Heading presence — Structured content scores higher
- Lexical diversity — Varied vocabulary scores higher
- Code ratio — Moderate code is rewarded, excessive code is penalized
- Boilerplate detection — Cookie notices, "all rights reserved", raw HTML tags lower the score
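The actor's exact weighting is internal, but a heuristic in this spirit might look like the following. All weights and thresholds here are illustrative assumptions, not the actor's actual values:

```python
# Hypothetical quality heuristic combining the signals listed above.
# Weights and thresholds are illustrative, not the actor's real ones.

def quality_score(text):
    words = text.split()
    n = len(words)
    score = 0.0
    # Word count: reward mid-length chunks, lightly credit short ones
    if 50 <= n <= 800:
        score += 0.4
    elif n > 10:
        score += 0.2
    # Lexical diversity: unique words relative to total
    if n:
        score += 0.3 * min(1.0, len(set(w.lower() for w in words)) / n * 2)
    # Sentence structure: proper sentence-ending punctuation
    if text.count(".") + text.count("?") + text.count("!") >= 2:
        score += 0.2
    # Boilerplate penalty: legal footers, cookie notices, raw HTML
    boilerplate = ("all rights reserved", "cookie", "<div", "</")
    if any(cue in text.lower() for cue in boilerplate):
        score -= 0.3
    return max(0.0, min(1.0, score))
```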
Language Detection
Supports 7 languages: English, French, German, Spanish, Dutch, Portuguese, Italian. Detection uses keyword matching with 20 marker words per language.
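Marker-word detection is simple to reason about. A toy version with a handful of markers per language (the actor uses around 20 each; these short lists are illustrative):

```python
# Toy marker-word language detector. The actor uses ~20 markers per
# language; the abbreviated sets below are illustrative only.

MARKERS = {
    "en": {"the", "and", "with", "from", "this"},
    "fr": {"le", "la", "les", "avec", "pour"},
    "de": {"der", "die", "und", "mit", "nicht"},
    "es": {"el", "los", "con", "para", "que"},
}

def detect_language(text):
    words = set(text.lower().split())
    best, hits = "en", 0  # fall back to English on no hits or a tie
    for lang, markers in MARKERS.items():
        n = len(words & markers)
        if n > hits:
            best, hits = lang, n
    return best
```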
Content Type Classification
Automatically classifies each chunk as: documentation, article, blog, product, faq, or other.
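A cue-phrase classifier is one plausible shape for this stage. The actor's real rules are internal; the cue lists below are assumptions for illustration:

```python
# Illustrative cue-phrase content-type classifier. The actor's actual
# rules are internal; these cues are assumptions.

CUES = {
    "faq": ("frequently asked", "faq"),
    "documentation": ("api reference", "getting started", "installation"),
    "blog": ("posted on", "read more", "comments"),
    "product": ("add to cart", "buy now", "in stock"),
}

def classify(text):
    lowered = text.lower()
    for content_type, cues in CUES.items():
        if any(cue in lowered for cue in cues):
            return content_type
    return "other"
```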
Notes
- Uses same-domain crawling strategy (won't follow external links)
- Playwright-based for JavaScript-rendered content support
- Code blocks are protected during chunking — never split mid-block
- Vector DB export sends metadata alongside text (no embeddings — use your own model)
- Chunks respect heading boundaries to maintain semantic coherence
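Because the export carries text and metadata but no embeddings, you embed on your side before upserting. A sketch of turning exported chunks into Pinecone-style vector records, assuming an `embed` function you supply (the upsert call itself is shown as a comment; field names follow the output format above):

```python
# Build Pinecone-style vector records from exported chunks.
# `embed` is your own embedding function (an assumption here); the
# final upsert would use the official Pinecone client.

def to_vectors(items, embed):
    vectors = []
    for item in items:
        vectors.append({
            "id": f'{item["url"]}#{item["chunkIndex"]}',
            "values": embed(item["chunk"]),
            "metadata": {
                "url": item["url"],
                "title": item["title"],
                "contentType": item["contentType"],
                "qualityScore": item["qualityScore"],
            },
        })
    return vectors

# Stand-in embedding for demonstration; replace with a real model.
fake_embed = lambda text: [float(len(text)), 0.0]
items = [{"url": "https://docs.example.com", "chunkIndex": 0,
          "chunk": "hello", "title": "Docs",
          "contentType": "documentation", "qualityScore": 0.9}]
vectors = to_vectors(items, fake_embed)
# Then, with the Pinecone client:
#   from pinecone import Pinecone
#   index = Pinecone(api_key="...").Index(host="https://your-index.svc.pinecone.io")
#   index.upsert(vectors=vectors)
```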
API Integration
JavaScript
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('ozapp/ai-data-pipeline').call({
    startUrls: [{ url: 'https://docs.example.com' }],
    maxPages: 100,
    chunkSize: 500,
    minQualityScore: 0.3,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`${items.length} chunks ready for your vector DB`);
```
Python
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("ozapp/ai-data-pipeline").call(run_input={
    "startUrls": [{"url": "https://docs.example.com"}],
    "maxPages": 100,
    "chunkSize": 500,
    "minQualityScore": 0.3,
})

items = client.dataset(run["defaultDatasetId"]).list_items().items
print(f"{len(items)} chunks ready for your vector DB")
```
cURL
curl "https://api.apify.com/v2/acts/ozapp~ai-data-pipeline/runs" \-X POST \-H "Content-Type: application/json" \-H "Authorization: Bearer YOUR_API_TOKEN" \-d '{"startUrls":[{"url":"https://docs.example.com"}],"maxPages":100,"chunkSize":500}'
Pricing
$4.99 per 1,000 chunks — includes crawling, cleaning, scoring, and classification.