AI Data Pipeline — Crawl, Chunk & Export to Vector DB

Crawl any website, extract clean text, split into chunks with quality scoring, and export to JSON, Pinecone, or Qdrant. Built for RAG pipelines and AI training data. Includes language detection, content type classification, and token counting.

Pricing: Pay per usage

Rating: 0.0 (0 reviews)

Developer: Ozapp (Maintained by Community)

Actor stats: 0 bookmarked · 2 total users · 1 monthly active user · last modified 4 days ago

AI Data Pipeline — Website to Vector DB (No-Code)

Crawl any website, clean and chunk content for RAG/LLM applications, score quality, detect language, classify content type, and optionally export directly to Pinecone or Qdrant. Zero coding required.

Pipeline Stages

URL --> Crawl --> Clean HTML --> Chunk Text --> Score Quality --> Export
  1. Crawl — Spider pages within the same domain using Playwright (JS-rendered content supported)
  2. Clean — Strip boilerplate (nav, footer, scripts, ads), convert code blocks to markdown fences, extract page title
  3. Chunk — Split into semantic chunks by headings/paragraphs with configurable overlap. Code blocks are protected (never split mid-block)
  4. Score — Rate each chunk 0-1 based on word count, structure, lexical diversity, code ratio, boilerplate detection
  5. Classify — Detect language (7 languages) and content type (documentation, article, blog, product, FAQ)
  6. Export — Push to Apify dataset (JSON) and/or Pinecone/Qdrant vector databases
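As a rough sketch of the chunking stage (step 3), a word-based splitter with overlap might look like the following. This is illustrative only: the actor splits on headings/paragraphs, protects code blocks, and counts real tokens, whereas this sketch treats each whitespace-separated word as one token.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into word-based chunks with a fixed overlap.

    Illustrative approximation of the actor's chunking stage:
    here a "token" is simply a word, and heading/code-block
    boundaries are not considered.
    """
    words = text.split()
    if not words:
        return []
    chunks = []
    step = max(chunk_size - overlap, 1)  # advance by size minus overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk reached; avoid a trailing overlap-only chunk
    return chunks
```

With `chunkSize: 500` and `chunkOverlap: 50`, the last 50 words of each chunk reappear at the start of the next, so retrieval hits near a chunk boundary still carry surrounding context.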

Input

Field                | Type   | Description                                    | Default
startUrls            | Array  | URLs to start crawling (use the URL editor)    | Required
maxPages             | Number | Maximum pages to crawl (1-10000)               | 50
chunkSize            | Number | Target tokens per chunk (100-4000)             | 500
chunkOverlap         | Number | Overlap tokens between chunks (0-500)          | 50
minQualityScore      | Number | Minimum quality score to include a chunk (0-1) | 0.3
exportTo             | String | Export target: json, pinecone, qdrant          | "json"
pineconeApiKey       | String | Pinecone API key (secret)                      | None
pineconeIndexName    | String | Pinecone index name                            | None
pineconeHost         | String | Pinecone index host URL                        | None
qdrantUrl            | String | Qdrant instance URL                            | None
qdrantApiKey         | String | Qdrant API key (secret)                        | None
qdrantCollectionName | String | Qdrant collection name                         | None

Example Input — JSON Dataset

{
  "startUrls": [{ "url": "https://docs.example.com" }],
  "maxPages": 100,
  "chunkSize": 500,
  "chunkOverlap": 50,
  "minQualityScore": 0.3
}

Example Input — Export to Pinecone

{
  "startUrls": [{ "url": "https://docs.example.com" }],
  "maxPages": 100,
  "chunkSize": 500,
  "exportTo": "pinecone",
  "pineconeApiKey": "your-api-key",
  "pineconeHost": "https://your-index.svc.pinecone.io",
  "pineconeIndexName": "docs"
}

Output Per Chunk

{
  "url": "https://docs.apify.com/academy/getting-started",
  "title": "Getting started | Academy | Apify Documentation",
  "sourceTitle": "Getting started",
  "chunk": "Getting started | Academy | Apify Documentation...",
  "chunkIndex": 0,
  "totalChunks": 1,
  "qualityScore": 0.95,
  "tokenCount": 258,
  "language": "en",
  "contentType": "documentation",
  "summary": "Getting started | Academy | Apify Documentation...",
  "metadata": {
    "headings": ["Getting started", "Getting to know the platform", "Next up"],
    "linkCount": 86,
    "imageCount": 3,
    "wordCount": 185
  },
  "lastProcessed": "2026-03-06T14:40:18.420Z"
}

Pipeline Summary

At the end of each run, the actor logs a summary:

=== Pipeline Summary ===
Total pages attempted: 2
Pages with content: 2
Total chunks created: 4
Chunks filtered (minScore): 0
Avg quality score: 0.87 (min: 0.80, max: 0.95)
Avg token count: 357
Languages detected: en
Content types: documentation
Export format: json

Use Cases

  • RAG Chatbots — Prepare knowledge bases for retrieval-augmented generation
  • Documentation Search — Index docs sites for semantic search
  • Knowledge Management — Convert websites into structured, searchable chunks
  • Content Analysis — Score and filter web content by quality
  • AI Fine-tuning — Prepare clean training data from websites

Quality Scoring

Each chunk is scored 0-1 based on:

  • Word count — Penalizes very short or very long chunks
  • Sentence structure — Proper sentences score higher
  • Heading presence — Structured content scores higher
  • Lexical diversity — Varied vocabulary scores higher
  • Code ratio — Moderate code is rewarded, excessive code is penalized
  • Boilerplate detection — Cookie notices, "all rights reserved", raw HTML tags lower the score
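A toy version of such a heuristic (not the actor's actual formula) might combine a few of these signals:

```python
import re

# Illustrative boilerplate phrases; the actor's list is more extensive.
BOILERPLATE = ("cookie", "all rights reserved", "subscribe to our newsletter")

def score_chunk(text):
    """Toy 0-1 quality heuristic: word count, lexical diversity,
    and boilerplate/raw-HTML detection. Illustrative only."""
    words = text.lower().split()
    if not words:
        return 0.0
    score = 1.0
    # Penalize very short or very long chunks.
    if len(words) < 30 or len(words) > 1500:
        score -= 0.3
    # Penalize highly repetitive vocabulary.
    if len(set(words)) / len(words) < 0.3:
        score -= 0.2
    # Penalize boilerplate phrases and leftover HTML tags.
    lowered = text.lower()
    if any(p in lowered for p in BOILERPLATE) or re.search(r"</?\w+>", text):
        score -= 0.4
    return max(0.0, min(1.0, score))
```

A footer like "All rights reserved. Accept cookies." scores well below the default `minQualityScore` of 0.3 in a real scorer; here it lands around 0.3 due to both the short-length and boilerplate penalties.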

Language Detection

Supports 7 languages: English, French, German, Spanish, Dutch, Portuguese, Italian. Detection uses keyword matching with 20 marker words per language.
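A marker-word detector along these lines can be sketched as follows; the marker sets below are short illustrative samples, not the actor's 20-word lists:

```python
# Illustrative marker words; the actor uses ~20 per language.
MARKERS = {
    "en": {"the", "and", "with", "for", "this"},
    "fr": {"le", "la", "les", "et", "dans"},
    "de": {"der", "die", "das", "und", "nicht"},
    "es": {"el", "los", "las", "con", "para"},
}

def detect_language(text):
    """Return the language whose marker words occur most often,
    or "unknown" if no marker matches at all."""
    words = text.lower().split()
    counts = {lang: sum(w in markers for w in words)
              for lang, markers in MARKERS.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "unknown"
```

Counting occurrences (rather than distinct matches) keeps the detector robust on short chunks, where a single frequent function word may dominate.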

Content Type Classification

Automatically classifies each chunk as: documentation, article, blog, product, faq, or other.
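A rough sketch of this kind of rule-based classifier (the rules below are illustrative guesses, not the actor's logic):

```python
def classify_content_type(url, text):
    """Classify a chunk by URL path hints, falling back to text cues.
    Illustrative rules only."""
    path = url.lower()
    if "/docs" in path or "/documentation" in path:
        return "documentation"
    if "/blog" in path:
        return "blog"
    if "/article" in path or "/news" in path:
        return "article"
    # Many question marks suggest an FAQ-style page.
    if "/faq" in path or text.lower().count("?") >= 3:
        return "faq"
    if "/product" in path:
        return "product"
    return "other"
```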

Notes

  • Uses a same-domain crawling strategy (won't follow external links)
  • Playwright-based, so JavaScript-rendered pages are supported
  • Code blocks are protected during chunking — never split mid-block
  • Vector DB export sends metadata alongside text (no embeddings — use your own model)
  • Chunks respect heading boundaries to maintain semantic coherence
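Since the export carries text and metadata but no vectors, embedding happens downstream. A minimal sketch of turning the actor's output records into a Pinecone-style upsert payload — the `embed` stub here is a hash-based placeholder; swap in a real embedding model or API:

```python
import hashlib

def embed(text, dim=8):
    """Placeholder embedding for illustration: deterministic values
    derived from a hash. Replace with a real embedding model."""
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:dim]]

def to_upsert_payload(records):
    """Convert actor output records into Pinecone-style upsert vectors:
    a stable id, a vector, and the chunk text plus key metadata."""
    return [
        {
            "id": f'{r["url"]}#{r["chunkIndex"]}',
            "values": embed(r["chunk"]),
            "metadata": {
                "text": r["chunk"],
                "url": r["url"],
                "contentType": r["contentType"],
            },
        }
        for r in records
    ]
```

Deriving the id from `url` plus `chunkIndex` makes re-runs idempotent: re-crawling the same page overwrites the same vectors instead of duplicating them.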

API Integration

JavaScript

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('ozapp/ai-data-pipeline').call({
  startUrls: [{ url: 'https://docs.example.com' }],
  maxPages: 100,
  chunkSize: 500,
  minQualityScore: 0.3,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`${items.length} chunks ready for your vector DB`);

Python

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("ozapp/ai-data-pipeline").call(run_input={
    "startUrls": [{"url": "https://docs.example.com"}],
    "maxPages": 100,
    "chunkSize": 500,
    "minQualityScore": 0.3,
})

items = client.dataset(run["defaultDatasetId"]).list_items().items
print(f"{len(items)} chunks ready for your vector DB")

cURL

curl "https://api.apify.com/v2/acts/ozapp~ai-data-pipeline/runs" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -d '{"startUrls":[{"url":"https://docs.example.com"}],"maxPages":100,"chunkSize":500}'

Pricing

$4.99 per 1,000 chunks — includes crawling, cleaning, scoring, and classification.