AI Data Pipeline — Crawl, Chunk & Export to Vector DB

Crawl any website, extract clean text, split into chunks with quality scoring, and export to JSON, Pinecone, or Qdrant. Built for RAG pipelines and AI training data. Includes language detection, content type classification, and token counting.

Pricing: Pay per usage

Rating: 0.0 (0 reviews)

Developer: Ozapp (Maintained by Community)

Actor stats: 0 bookmarked · 2 total users · 1 monthly active user · last modified 4 days ago

AI Data Pipeline — Website to Vector DB (No-Code)

Crawl any website, clean and chunk content for RAG/LLM applications, score quality, detect language, classify content type, and optionally export directly to Pinecone or Qdrant. Zero coding required.

Pipeline Stages

URL --> Crawl --> Clean HTML --> Chunk Text --> Score Quality --> Export
  1. Crawl — Spider pages within the same domain using Playwright (JS-rendered content supported)
  2. Clean — Strip boilerplate (nav, footer, scripts, ads), convert code blocks to markdown fences, extract page title
  3. Chunk — Split into semantic chunks by headings/paragraphs with configurable overlap. Code blocks are protected (never split mid-block)
  4. Score — Rate each chunk 0-1 based on word count, structure, lexical diversity, code ratio, boilerplate detection
  5. Classify — Detect language (7 languages) and content type (documentation, article, blog, product, FAQ)
  6. Export — Push to Apify dataset (JSON) and/or Pinecone/Qdrant vector databases
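As a rough sketch of the chunking stage (step 3), a word-based splitter with overlap might look like the following. This is illustrative only: the actor splits on headings/paragraphs, protects code blocks, and counts real tokens, whereas this sketch treats each whitespace-separated word as one token.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into word-based chunks with a fixed overlap.

    Illustrative approximation of the actor's chunking stage:
    here a "token" is simply a word, and heading/code-block
    boundaries are not considered.
    """
    words = text.split()
    if not words:
        return []
    chunks = []
    step = max(chunk_size - overlap, 1)  # advance by size minus overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk reached; avoid a trailing overlap-only chunk
    return chunks
```

With `chunkSize: 500` and `chunkOverlap: 50`, the last 50 words of each chunk reappear at the start of the next, so retrieval hits near a chunk boundary still carry surrounding context.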

Input

Field                | Type   | Description                                    | Default
startUrls            | Array  | URLs to start crawling (use the URL editor)    | Required
maxPages             | Number | Maximum pages to crawl (1-10000)               | 50
chunkSize            | Number | Target tokens per chunk (100-4000)             | 500
chunkOverlap         | Number | Overlap tokens between chunks (0-500)          | 50
minQualityScore      | Number | Minimum quality score to include a chunk (0-1) | 0.3
exportTo             | String | Export target: json, pinecone, qdrant          | "json"
pineconeApiKey       | String | Pinecone API key (secret)                      | None
pineconeIndexName    | String | Pinecone index name                            | None
pineconeHost         | String | Pinecone index host URL                        | None
qdrantUrl            | String | Qdrant instance URL                            | None
qdrantApiKey         | String | Qdrant API key (secret)                        | None
qdrantCollectionName | String | Qdrant collection name                         | None

Example Input — JSON Dataset

{
  "startUrls": [{ "url": "https://docs.example.com" }],
  "maxPages": 100,
  "chunkSize": 500,
  "chunkOverlap": 50,
  "minQualityScore": 0.3
}

Example Input — Export to Pinecone

{
  "startUrls": [{ "url": "https://docs.example.com" }],
  "maxPages": 100,
  "chunkSize": 500,
  "exportTo": "pinecone",
  "pineconeApiKey": "your-api-key",
  "pineconeHost": "https://your-index.svc.pinecone.io",
  "pineconeIndexName": "docs"
}

Output Per Chunk

{
  "url": "https://docs.apify.com/academy/getting-started",
  "title": "Getting started | Academy | Apify Documentation",
  "sourceTitle": "Getting started",
  "chunk": "Getting started | Academy | Apify Documentation...",
  "chunkIndex": 0,
  "totalChunks": 1,
  "qualityScore": 0.95,
  "tokenCount": 258,
  "language": "en",
  "contentType": "documentation",
  "summary": "Getting started | Academy | Apify Documentation...",
  "metadata": {
    "headings": ["Getting started", "Getting to know the platform", "Next up"],
    "linkCount": 86,
    "imageCount": 3,
    "wordCount": 185
  },
  "lastProcessed": "2026-03-06T14:40:18.420Z"
}

Pipeline Summary

At the end of each run, the actor logs a summary:

=== Pipeline Summary ===
Total pages attempted: 2
Pages with content: 2
Total chunks created: 4
Chunks filtered (minScore): 0
Avg quality score: 0.87 (min: 0.80, max: 0.95)
Avg token count: 357
Languages detected: en
Content types: documentation
Export format: json

Use Cases

  • RAG Chatbots — Prepare knowledge bases for retrieval-augmented generation
  • Documentation Search — Index docs sites for semantic search
  • Knowledge Management — Convert websites into structured, searchable chunks
  • Content Analysis — Score and filter web content by quality
  • AI Fine-tuning — Prepare clean training data from websites

Quality Scoring

Each chunk is scored 0-1 based on:

  • Word count — Penalizes very short or very long chunks
  • Sentence structure — Proper sentences score higher
  • Heading presence — Structured content scores higher
  • Lexical diversity — Varied vocabulary scores higher
  • Code ratio — Moderate code is rewarded, excessive code is penalized
  • Boilerplate detection — Cookie notices, "all rights reserved", raw HTML tags lower the score
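A toy version of such a heuristic (not the actor's actual formula) might combine a few of these signals:

```python
import re

# Illustrative boilerplate phrases; the actor's list is more extensive.
BOILERPLATE = ("cookie", "all rights reserved", "subscribe to our newsletter")

def score_chunk(text):
    """Toy 0-1 quality heuristic: word count, lexical diversity,
    and boilerplate/raw-HTML detection. Illustrative only."""
    words = text.lower().split()
    if not words:
        return 0.0
    score = 1.0
    # Penalize very short or very long chunks.
    if len(words) < 30 or len(words) > 1500:
        score -= 0.3
    # Penalize highly repetitive vocabulary.
    if len(set(words)) / len(words) < 0.3:
        score -= 0.2
    # Penalize boilerplate phrases and leftover HTML tags.
    lowered = text.lower()
    if any(p in lowered for p in BOILERPLATE) or re.search(r"</?\w+>", text):
        score -= 0.4
    return max(0.0, min(1.0, score))
```

A footer like "All rights reserved. Accept cookies." scores well below the default `minQualityScore` of 0.3 in a real scorer; here it lands around 0.3 due to both the short-length and boilerplate penalties.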

Language Detection

Supports 7 languages: English, French, German, Spanish, Dutch, Portuguese, Italian. Detection uses keyword matching with 20 marker words per language.
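A marker-word detector along these lines can be sketched as follows; the marker sets below are short illustrative samples, not the actor's 20-word lists:

```python
# Illustrative marker words; the actor uses ~20 per language.
MARKERS = {
    "en": {"the", "and", "with", "for", "this"},
    "fr": {"le", "la", "les", "et", "dans"},
    "de": {"der", "die", "das", "und", "nicht"},
    "es": {"el", "los", "las", "con", "para"},
}

def detect_language(text):
    """Return the language whose marker words occur most often,
    or "unknown" if no marker matches at all."""
    words = text.lower().split()
    counts = {lang: sum(w in markers for w in words)
              for lang, markers in MARKERS.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "unknown"
```

Counting occurrences (rather than distinct matches) keeps the detector robust on short chunks, where a single frequent function word may dominate.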

Content Type Classification

Automatically classifies each chunk as: documentation, article, blog, product, faq, or other.
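A rough sketch of this kind of rule-based classifier (the rules below are illustrative guesses, not the actor's logic):

```python
def classify_content_type(url, text):
    """Classify a chunk by URL path hints, falling back to text cues.
    Illustrative rules only."""
    path = url.lower()
    if "/docs" in path or "/documentation" in path:
        return "documentation"
    if "/blog" in path:
        return "blog"
    if "/article" in path or "/news" in path:
        return "article"
    # Many question marks suggest an FAQ-style page.
    if "/faq" in path or text.lower().count("?") >= 3:
        return "faq"
    if "/product" in path:
        return "product"
    return "other"
```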

Notes

  • Uses a same-domain crawling strategy (won't follow external links)
  • Playwright-based, so JavaScript-rendered pages are supported
  • Code blocks are protected during chunking — never split mid-block
  • Vector DB export sends metadata alongside text (no embeddings — use your own model)
  • Chunks respect heading boundaries to maintain semantic coherence
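Since the export carries text and metadata but no vectors, embedding happens downstream. A minimal sketch of turning the actor's output records into a Pinecone-style upsert payload — the `embed` stub here is a hash-based placeholder; swap in a real embedding model or API:

```python
import hashlib

def embed(text, dim=8):
    """Placeholder embedding for illustration: deterministic values
    derived from a hash. Replace with a real embedding model."""
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:dim]]

def to_upsert_payload(records):
    """Convert actor output records into Pinecone-style upsert vectors:
    a stable id, a vector, and the chunk text plus key metadata."""
    return [
        {
            "id": f'{r["url"]}#{r["chunkIndex"]}',
            "values": embed(r["chunk"]),
            "metadata": {
                "text": r["chunk"],
                "url": r["url"],
                "contentType": r["contentType"],
            },
        }
        for r in records
    ]
```

Deriving the id from `url` plus `chunkIndex` makes re-runs idempotent: re-crawling the same page overwrites the same vectors instead of duplicating them.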

API Integration

JavaScript

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('ozapp/ai-data-pipeline').call({
  startUrls: [{ url: 'https://docs.example.com' }],
  maxPages: 100,
  chunkSize: 500,
  minQualityScore: 0.3,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`${items.length} chunks ready for your vector DB`);

Python

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("ozapp/ai-data-pipeline").call(run_input={
    "startUrls": [{"url": "https://docs.example.com"}],
    "maxPages": 100,
    "chunkSize": 500,
    "minQualityScore": 0.3,
})

items = client.dataset(run["defaultDatasetId"]).list_items().items
print(f"{len(items)} chunks ready for your vector DB")

cURL

curl "https://api.apify.com/v2/acts/ozapp~ai-data-pipeline/runs" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -d '{"startUrls":[{"url":"https://docs.example.com"}],"maxPages":100,"chunkSize":500}'

Pricing

$4.99 per 1,000 chunks — includes crawling, cleaning, scoring, and classification.