RAG Pipeline Data Collector - AI-Ready Web Content Extraction

Extract clean, structured web content optimized for RAG systems, LLMs, and AI agents. Built with Crawl4AI for lightning-fast parallel processing and intelligent content filtering.

🎯 What is RAG Pipeline Data Collector?

The RAG Pipeline Data Collector is a specialized web scraping Actor designed specifically for AI and machine learning workflows. It transforms raw web pages into clean, structured Markdown or HTML content that's ready to feed into RAG (Retrieval-Augmented Generation) systems, vector databases, LLM training pipelines, and AI agents.

Unlike traditional web scrapers, this Actor focuses on extracting meaningful content while removing navigation menus, ads, footers, and other noise that would pollute your AI training data or RAG knowledge base.

Perfect for:

  • 🤖 Building RAG systems and AI chatbots
  • 📚 Creating knowledge bases for LLMs
  • 🔍 Training data collection for machine learning
  • 💬 Content ingestion for vector databases (Pinecone, Weaviate, Chroma)
  • 🔗 n8n, Zapier, and Make.com automation workflows
  • 📊 Large-scale content analysis and research

🚀 Key Features

Dual Operating Modes

Single Page Mode - Fast API-style extraction

  • Extract individual pages in 15-30 seconds
  • Perfect for real-time integrations
  • Ideal for n8n/Zapier/Make workflows
  • On-demand content processing

Multi-Page Mode - Bulk extraction with parallel processing

  • Process 50+ pages simultaneously
  • 5-10x faster than sequential scraping
  • Three intelligent crawl strategies
  • Complete knowledge base extraction

Three Crawl Strategies

  1. Sitemap Strategy 📋

    • Automatically parse sitemap.xml
    • Fastest parallel processing
    • Complete site coverage
    • Best for: Documentation sites, blogs, news sites
  2. Deep Crawl Strategy 🕸️

    • Follow internal links recursively
    • Control depth (1-5 levels)
    • Discover hidden content
    • Best for: Sites without sitemaps, complex navigation
  3. Archive Discovery 📰

    • Intelligent pattern detection (/blog, /posts, /archive)
    • Targeted content discovery
    • Blog-focused extraction
    • Best for: Content-heavy sites, news archives
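Not sure which strategy to pick? A simple heuristic is to probe for a sitemap first and fall back to deep crawl, mirroring the Actor's own fallback described under Troubleshooting below. A minimal sketch in Python; the probe helper is illustrative, not part of the Actor:

# Illustrative helper: pick a crawl_strategy value by probing for a
# sitemap. The run_input field names follow the input examples below.
import urllib.request

def choose_strategy(start_url: str) -> str:
    sitemap_url = start_url.rstrip("/") + "/sitemap.xml"
    try:
        req = urllib.request.Request(sitemap_url, method="HEAD")
        with urllib.request.urlopen(req, timeout=10) as resp:
            if resp.status == 200:
                return "sitemap"
    except Exception:
        pass  # sitemap not reachable
    return "deep"

run_input = {
    "scrape_mode": "multi",
    "start_url": "https://docs.example.com",
    "crawl_strategy": choose_strategy("https://docs.example.com"),
    "max_pages": 100
}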

Clean, AI-Ready Output

✅ Markdown Output - Perfectly formatted for LLMs
✅ Noise Removal - Intelligent filtering of ads, navigation, footers
✅ Metadata Extraction - Title, description, author, language
✅ Image URLs - All images with full URLs
✅ Link Extraction - Internal and external links separated
✅ Statistics - Word count, character count, image count

💡 Use Cases

RAG Systems & Vector Databases

Feed clean, structured content directly into your RAG pipeline:

# LangChain Integration
from apify_client import ApifyClient
from langchain.document_loaders import ApifyDatasetLoader  # langchain_community.document_loaders in newer LangChain versions

client = ApifyClient("your-token")

run = client.actor("YOUR_ACTOR_ID").call(run_input={
    "scrape_mode": "multi",
    "start_url": "https://docs.example.com",
    "crawl_strategy": "sitemap",
    "max_pages": 100,
    "output_format": "markdown",
    "remove_noise": True
})

loader = ApifyDatasetLoader(
    dataset_id=run["defaultDatasetId"],
    dataset_mapping_function=lambda item: item["content"]
)
docs = loader.load()
# Now feed docs to your vector database
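
From there, docs can go straight into a vector store. A hedged follow-on sketch, assuming LangChain's Chroma wrapper and OpenAI embeddings (neither is part of this Actor, and import paths may be langchain_community.* in newer LangChain versions):

# Hedged sketch: index the loaded docs in a local Chroma store.
# Assumes the chromadb and openai packages are installed.
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

vectorstore = Chroma.from_documents(
    documents=docs,                   # loaded above
    embedding=OpenAIEmbeddings(),     # needs OPENAI_API_KEY set
    persist_directory="./rag_store",  # illustrative local path
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})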

n8n Automation Workflows

  1. Add Apify node to your workflow
  2. Select RAG Pipeline Data Collector
  3. Configure single or multi-page mode
  4. Connect to Pinecone, Weaviate, or Supabase nodes
  5. Automate your RAG data pipeline

Content Analysis & Research

Extract and analyze large volumes of content:

  • Competitor research and monitoring
  • Market intelligence gathering
  • Academic research data collection
  • Content aggregation for newsletters

AI Training Data Collection

Build high-quality training datasets:

  • Clean, structured text for fine-tuning
  • Consistent format across sources
  • Metadata for context preservation
  • Scalable bulk extraction

📥 Input Configuration

Single Page Mode

{
    "scrape_mode": "single",
    "url": "https://example.com/article",
    "output_format": "markdown",
    "remove_noise": true,
    "include_images": true,
    "include_links": true,
    "include_metadata": true
}

Multi-Page Mode (Sitemap)

{
    "scrape_mode": "multi",
    "start_url": "https://docs.example.com",
    "crawl_strategy": "sitemap",
    "max_pages": 100,
    "output_format": "markdown",
    "remove_noise": true
}

Multi-Page Mode (Deep Crawl)

{
    "scrape_mode": "multi",
    "start_url": "https://blog.example.com",
    "crawl_strategy": "deep",
    "max_depth": 2,
    "max_pages": 50,
    "output_format": "markdown"
}

Multi-Page Mode (Archive Discovery)

{
    "scrape_mode": "multi",
    "start_url": "https://news.example.com/archive",
    "crawl_strategy": "archive",
    "max_pages": 200,
    "output_format": "markdown"
}

📤 Output Format

Each scraped page returns a structured JSON object:

{
    "url": "https://example.com/article",
    "content": "# Article Title\n\nClean markdown content...",
    "format": "markdown",
    "statistics": {
        "word_count": 1500,
        "character_count": 8500,
        "image_count": 5,
        "internal_links": 12,
        "external_links": 3
    },
    "images": [
        "https://example.com/image1.jpg",
        "https://example.com/image2.jpg"
    ],
    "links": {
        "internal": ["https://example.com/page1", "https://example.com/page2"],
        "external": ["https://external.com"]
    },
    "metadata": {
        "title": "Article Title",
        "description": "Article description",
        "author": "Author Name",
        "language": "en"
    },
    "scrape_mode": "single",
    "scraped_at": "2024-12-11T10:30:00Z"
}
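
Before indexing, most RAG pipelines split content into overlapping chunks and attach the page metadata to each chunk. A minimal sketch in plain Python; the chunk size and overlap are illustrative values, not Actor parameters:

# Minimal sketch: split one output item into overlapping chunks
# ready for a vector database. Sizes are illustrative.
def chunk_item(item, chunk_size=1000, overlap=200):
    text = item["content"]
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append({
            "text": text[start:end],
            "source_url": item["url"],
            "title": item["metadata"]["title"],
        })
        start = end - overlap
    return chunks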

🔧 How It Works

The Actor uses Crawl4AI, a cutting-edge web scraping framework optimized for AI applications:

  1. Intelligent Rendering - Handles JavaScript-heavy sites with Playwright
  2. Parallel Processing - Scrapes multiple pages simultaneously (5-10x faster than sequential)
  3. Noise Filtering - Removes ads, navigation, footers using fit_markdown algorithm
  4. LLM-Optimized Output - Clean Markdown perfect for AI consumption
  5. Smart Crawling - Three strategies to handle any site structure
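
For context, parallel extraction with Crawl4AI itself looks roughly like the sketch below, assuming Crawl4AI's AsyncWebCrawler API; the Actor's actual internals may differ:

# Simplified sketch of parallel extraction with Crawl4AI; the
# Actor's real pipeline adds noise filtering and crawl strategies.
import asyncio
from crawl4ai import AsyncWebCrawler

async def scrape_all(urls):
    async with AsyncWebCrawler() as crawler:
        # Fire all page fetches concurrently instead of one-by-one
        results = await asyncio.gather(*(crawler.arun(url=u) for u in urls))
    return {r.url: r.markdown for r in results}

pages = asyncio.run(scrape_all([
    "https://example.com/a",
    "https://example.com/b",
]))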

Performance Expectations

  • Single Page Mode: 15-30 seconds per page
  • Multi-Page (Sitemap): 1-2 minutes for 50 pages
  • Multi-Page (Deep Crawl): 2-5 minutes for 50 pages (varies by depth)
  • Multi-Page (Archive): 1-3 minutes for 50 pages

💰 Pricing & Compute Units

This Actor is optimized for cost-effective operation:

  • Single Page Mode: ~0.05-0.1 CU per page
  • Multi-Page Mode: ~2-5 CU per 50 pages (parallel processing advantage)
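
As a rough worked example from the rates above: a 1,000-page crawl in multi-page mode lands around 40-100 CU (20 batches of 50 at 2-5 CU each), versus roughly 50-100 CU page-by-page in single mode, before per-run overhead. Actual usage varies with page size and memory settings.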

Recommended Memory: 4096 MB for optimal performance

📊 Example Runs

Coming soon! Check back for public run examples.

πŸ› οΈ Advanced Features

Output Format Options

  • Markdown: Clean, LLM-friendly format (recommended for RAG)
  • HTML: Cleaned HTML with noise removed
  • Raw HTML: Original HTML without processing

Content Filtering

  • Noise Removal: Automatically removes navigation, ads, footers
  • Image Filtering: Include/exclude images
  • Link Filtering: Include/exclude links
  • Metadata Control: Include/exclude page metadata

Crawl Configuration

  • Max Pages: Control total pages (1-500)
  • Max Depth: Control crawl depth (1-5 levels)
  • Same Domain Only: Restrict to starting domain
  • Pattern Matching: Custom URL filtering (coming soon)
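
Combined, a deep-crawl input using these options might look like the example below. The same_domain_only key is inferred from the option name above, so check the Actor's input schema for the exact field:

{
    "scrape_mode": "multi",
    "start_url": "https://example.com",
    "crawl_strategy": "deep",
    "max_depth": 2,
    "max_pages": 100,
    "same_domain_only": true,
    "output_format": "markdown",
    "remove_noise": true
}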

🔗 Integration Examples

Make.com (Integromat)

  1. Add Apify module
  2. Select Run Actor
  3. Choose RAG Pipeline Data Collector
  4. Configure input parameters
  5. Map output to your RAG pipeline modules

Zapier

  1. Add Apify action
  2. Select Run Actor
  3. Choose RAG Pipeline Data Collector
  4. Configure trigger and input
  5. Connect to vector database action

Python SDK

from apify_client import ApifyClient

client = ApifyClient("your-token")

# Single page extraction
run = client.actor("YOUR_ACTOR_ID").call(run_input={
    "scrape_mode": "single",
    "url": "https://example.com/article",
    "output_format": "markdown"
})

# Get results
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["content"])

JavaScript/Node.js

const { ApifyClient } = require('apify-client');

const client = new ApifyClient({
    token: 'your-token',
});

const run = await client.actor('YOUR_ACTOR_ID').call({
    scrape_mode: 'multi',
    start_url: 'https://docs.example.com',
    crawl_strategy: 'sitemap',
    max_pages: 50,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.log(item.url, item.statistics.word_count);
});

βš™οΈ Configuration Tips

For Best RAG Results

  • ✅ Enable remove_noise for cleaner content
  • ✅ Use markdown output format
  • ✅ Include metadata for context
  • ✅ Set appropriate max_pages based on your needs

For Faster Scraping

  • ⚡ Use sitemap strategy when available
  • ⚡ Limit max_depth to 1-2 for deep crawl
  • ⚡ Process in batches of 50-100 pages
  • ⚡ Use 4096 MB memory allocation

For Cost Optimization

  • 💰 Use single mode for small jobs
  • 💰 Batch requests in multi-page mode
  • 💰 Set reasonable max_pages limits
  • 💰 Monitor compute unit usage

πŸ› Troubleshooting

Sitemap Not Found

If the sitemap strategy fails, the Actor automatically falls back to the deep crawl strategy.

JavaScript-Heavy Sites

Some sites may require additional wait time. The Actor handles this automatically with Playwright.

Rate Limiting

The Actor respects robots.txt and includes configurable delays between requests.

Missing Content

If content is missing, try:

  • Disable noise removal temporarily
  • Use raw_html format to inspect
  • Increase timeout settings
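
For example, a stripped-down debugging input that returns the page unprocessed, so you can check whether the content was fetched at all (field names follow the input examples above):

{
    "scrape_mode": "single",
    "url": "https://example.com/article",
    "output_format": "raw_html",
    "remove_noise": false
}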

📚 Documentation & Support

  • GitHub Issues: Report bugs or request features
  • Apify Discord: Join our community for support
  • Documentation: Full API documentation

🏷️ Tags

web-scraping rag llm ai machine-learning vector-database langchain chatbot knowledge-base content-extraction markdown automation n8n zapier make

📄 License

This Actor is provided as-is for use on the Apify platform. Web scraping should be done responsibly and in accordance with website terms of service.

🤝 Ethical Scraping

This Actor:

  • ✅ Respects robots.txt
  • ✅ Only extracts publicly available content
  • ✅ Does not extract personal data
  • ✅ Includes configurable rate limiting
  • ✅ Identifies itself properly in requests

Always ensure you have the right to scrape content from target websites and respect their terms of service.


Built with ❤️ using Crawl4AI

Need custom features or enterprise support? Contact us through the Apify platform!