RAG Pipeline Data Collector - AI-Ready Web Content Extraction
Extract clean, structured web content optimized for RAG systems, LLMs, and AI agents. Built with Crawl4AI for lightning-fast parallel processing and intelligent content filtering.
🎯 What is the RAG Pipeline Data Collector?
The RAG Pipeline Data Collector is a specialized web scraping Actor designed specifically for AI and machine learning workflows. It transforms raw web pages into clean, structured Markdown or HTML content that's ready to feed into RAG (Retrieval-Augmented Generation) systems, vector databases, LLM training pipelines, and AI agents.
Unlike traditional web scrapers, this Actor focuses on extracting meaningful content while removing navigation menus, ads, footers, and other noise that would pollute your AI training data or RAG knowledge base.
Perfect for:
- 🤖 Building RAG systems and AI chatbots
- 📚 Creating knowledge bases for LLMs
- 📊 Training data collection for machine learning
- 🔬 Content ingestion for vector databases (Pinecone, Weaviate, Chroma)
- 🔄 n8n, Zapier, and Make.com automation workflows
- 📈 Large-scale content analysis and research
🚀 Key Features
Dual Operating Modes
Single Page Mode - Fast API-style extraction
- Extract individual pages in 15-30 seconds
- Perfect for real-time integrations
- Ideal for n8n/Zapier/Make workflows
- On-demand content processing
Multi-Page Mode - Bulk extraction with parallel processing
- Process 50+ pages simultaneously
- 5-10x faster than sequential scraping
- Three intelligent crawl strategies
- Complete knowledge base extraction
Three Crawl Strategies
1. Sitemap Strategy 📋
   - Automatically parse sitemap.xml
   - Fastest parallel processing
   - Complete site coverage
   - Best for: Documentation sites, blogs, news sites
2. Deep Crawl Strategy 🕸️
   - Follow internal links recursively
   - Control depth (1-5 levels)
   - Discover hidden content
   - Best for: Sites without sitemaps, complex navigation
3. Archive Discovery 📰
   - Intelligent pattern detection (/blog, /posts, /archive)
   - Targeted content discovery
   - Blog-focused extraction
   - Best for: Content-heavy sites, news archives
Clean, AI-Ready Output
- ✅ Markdown Output - Perfectly formatted for LLMs
- ✅ Noise Removal - Intelligent filtering of ads, navigation, footers
- ✅ Metadata Extraction - Title, description, author, language
- ✅ Image URLs - All images with full URLs
- ✅ Link Extraction - Internal and external links separated
- ✅ Statistics - Word count, character count, image count
💡 Use Cases
RAG Systems & Vector Databases
Feed clean, structured content directly into your RAG pipeline:
```python
# LangChain integration
from apify_client import ApifyClient
from langchain.document_loaders import ApifyDatasetLoader
from langchain.schema import Document

client = ApifyClient("your-token")

run = client.actor("YOUR_ACTOR_ID").call(run_input={
    "scrape_mode": "multi",
    "start_url": "https://docs.example.com",
    "crawl_strategy": "sitemap",
    "max_pages": 100,
    "output_format": "markdown",
    "remove_noise": True
})

# The mapping function must return a LangChain Document, not a raw string.
loader = ApifyDatasetLoader(
    dataset_id=run["defaultDatasetId"],
    dataset_mapping_function=lambda item: Document(
        page_content=item["content"],
        metadata={"source": item["url"]},
    ),
)
docs = loader.load()
# Now feed docs to your vector database
```
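From here the documents can go straight into a vector store. A minimal sketch using LangChain's Chroma integration (the chunk size and OpenAI embeddings are illustrative assumptions, not part of this Actor):

```python
# Minimal sketch: chunk the loaded docs and index them in Chroma.
# Assumes OPENAI_API_KEY is set; any embedding model works here.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)  # `docs` from the loader above

vectorstore = Chroma.from_documents(chunks, OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
```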
n8n Automation Workflows
- Add Apify node to your workflow
- Select RAG Pipeline Data Collector
- Configure single or multi-page mode
- Connect to Pinecone, Weaviate, or Supabase nodes
- Automate your RAG data pipeline
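Prefer n8n's generic HTTP Request node instead? The Actor can also be run synchronously through Apify's standard API v2 run-sync endpoint; the request body is just the Actor input shown in this README (`<ACTOR_ID>` and `<YOUR_TOKEN>` are placeholders):

```http
POST https://api.apify.com/v2/acts/<ACTOR_ID>/run-sync-get-dataset-items?token=<YOUR_TOKEN>
Content-Type: application/json

{
  "scrape_mode": "single",
  "url": "https://example.com/article",
  "output_format": "markdown",
  "remove_noise": true
}
```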
Content Analysis & Research
Extract and analyze large volumes of content:
- Competitor research and monitoring
- Market intelligence gathering
- Academic research data collection
- Content aggregation for newsletters
AI Training Data Collection
Build high-quality training datasets:
- Clean, structured text for fine-tuning
- Consistent format across sources
- Metadata for context preservation
- Scalable bulk extraction
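A hedged sketch of turning one run's dataset into a JSONL training file (the text/source/title framing is an assumption; adapt it to your fine-tuning format):

```python
import json
from apify_client import ApifyClient

client = ApifyClient("your-token")
run = client.actor("YOUR_ACTOR_ID").call(run_input={
    "scrape_mode": "multi",
    "start_url": "https://docs.example.com",
    "crawl_strategy": "sitemap",
    "max_pages": 100,
    "output_format": "markdown",
    "remove_noise": True
})

# One JSON object per line: page text plus metadata for context preservation.
with open("training_data.jsonl", "w", encoding="utf-8") as f:
    for item in client.dataset(run["defaultDatasetId"]).iterate_items():
        f.write(json.dumps({
            "text": item["content"],
            "source": item["url"],
            "title": item.get("metadata", {}).get("title", ""),
        }, ensure_ascii=False) + "\n")
```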
📥 Input Configuration
Single Page Mode
{"scrape_mode": "single","url": "https://example.com/article","output_format": "markdown","remove_noise": true,"include_images": true,"include_links": true,"include_metadata": true}
Multi-Page Mode (Sitemap)
{"scrape_mode": "multi","start_url": "https://docs.example.com","crawl_strategy": "sitemap","max_pages": 100,"output_format": "markdown","remove_noise": true}
Multi-Page Mode (Deep Crawl)
{"scrape_mode": "multi","start_url": "https://blog.example.com","crawl_strategy": "deep","max_depth": 2,"max_pages": 50,"output_format": "markdown"}
Multi-Page Mode (Archive Discovery)
{"scrape_mode": "multi","start_url": "https://news.example.com/archive","crawl_strategy": "archive","max_pages": 200,"output_format": "markdown"}
📤 Output Format
Each scraped page returns a structured JSON object:
{"url": "https://example.com/article","content": "# Article Title\n\nClean markdown content...","format": "markdown","statistics": {"word_count": 1500,"character_count": 8500,"image_count": 5,"internal_links": 12,"external_links": 3},"images": ["https://example.com/image1.jpg","https://example.com/image2.jpg"],"links": {"internal": ["https://example.com/page1", "https://example.com/page2"],"external": ["https://external.com"]},"metadata": {"title": "Article Title","description": "Article description","author": "Author Name","language": "en"},"scrape_mode": "single","scraped_at": "2024-12-11T10:30:00Z"}
🔧 How It Works
The Actor uses Crawl4AI, a cutting-edge web scraping framework optimized for AI applications:
- Intelligent Rendering - Handles JavaScript-heavy sites with Playwright
- Parallel Processing - Scrapes multiple pages simultaneously (5-10x faster than sequential)
- Noise Filtering - Removes ads, navigation, and footers using the `fit_markdown` algorithm
- LLM-Optimized Output - Clean Markdown perfect for AI consumption
- Smart Crawling - Three strategies to handle any site structure
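For orientation, this is roughly what a basic Crawl4AI call looks like (an illustrative sketch of the underlying framework, not the Actor's actual code; Crawl4AI's API varies by version):

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Playwright-backed rendering, then content filtering down to Markdown.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/article")
        print(result.markdown)  # clean, LLM-ready Markdown

asyncio.run(main())
```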
Performance Expectations
- Single Page Mode: 15-30 seconds per page
- Multi-Page (Sitemap): 1-2 minutes for 50 pages
- Multi-Page (Deep Crawl): 2-5 minutes for 50 pages (varies by depth)
- Multi-Page (Archive): 1-3 minutes for 50 pages
💰 Pricing & Compute Units
This Actor is priced at $5.00 per 1,000 results and is optimized for cost-effective operation:
- Single Page Mode: ~0.05-0.1 CU per page
- Multi-Page Mode: ~2-5 CU per 50 pages (parallel processing advantage)
Recommended Memory: 4096 MB for optimal performance
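As a rough worked example: a 100-page multi-page run consumes about 4-10 CU at the rates above and, at $5.00 per 1,000 results, adds about $0.50 in result fees. The recommended memory can be requested when calling the Actor; a sketch using the Python client's memory_mbytes option:

```python
from apify_client import ApifyClient

client = ApifyClient("your-token")

# Request the recommended 4096 MB for this run.
run = client.actor("YOUR_ACTOR_ID").call(
    run_input={
        "scrape_mode": "multi",
        "start_url": "https://docs.example.com",
        "crawl_strategy": "sitemap",
        "max_pages": 100,
    },
    memory_mbytes=4096,
)
```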
📊 Example Runs
Coming soon! Check back for public run examples.
🛠️ Advanced Features
Output Format Options
- Markdown: Clean, LLM-friendly format (recommended for RAG)
- HTML: Cleaned HTML with noise removed
- Raw HTML: Original HTML without processing
Content Filtering
- Noise Removal: Automatically removes navigation, ads, footers
- Image Filtering: Include/exclude images
- Link Filtering: Include/exclude links
- Metadata Control: Include/exclude page metadata
Crawl Configuration
- Max Pages: Control total pages (1-500)
- Max Depth: Control crawl depth (1-5 levels)
- Same Domain Only: Restrict to starting domain
- Pattern Matching: Custom URL filtering (coming soon)
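Putting several of these options together in one input (a hedged sketch; `same_domain_only` is an assumed key name, so confirm field names against the Actor's input schema):

```json
{
  "scrape_mode": "multi",
  "start_url": "https://blog.example.com",
  "crawl_strategy": "deep",
  "max_depth": 2,
  "max_pages": 100,
  "output_format": "markdown",
  "remove_noise": true,
  "include_images": false,
  "include_links": true,
  "include_metadata": true,
  "same_domain_only": true
}
```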
🔌 Integration Examples
Make.com (Integromat)
- Add Apify module
- Select Run Actor
- Choose RAG Pipeline Data Collector
- Configure input parameters
- Map output to your RAG pipeline modules
Zapier
- Add Apify action
- Select Run Actor
- Choose RAG Pipeline Data Collector
- Configure trigger and input
- Connect to vector database action
Python SDK
```python
from apify_client import ApifyClient

client = ApifyClient("your-token")

# Single page extraction
run = client.actor("YOUR_ACTOR_ID").call(run_input={
    "scrape_mode": "single",
    "url": "https://example.com/article",
    "output_format": "markdown"
})

# Get results
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["content"])
```
JavaScript/Node.js
```javascript
const { ApifyClient } = require('apify-client');

const client = new ApifyClient({
    token: 'your-token',
});

const run = await client.actor('YOUR_ACTOR_ID').call({
    scrape_mode: 'multi',
    start_url: 'https://docs.example.com',
    crawl_strategy: 'sitemap',
    max_pages: 50
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.log(item.url, item.statistics.word_count);
});
```
⚙️ Configuration Tips
For Best RAG Results
- ✅ Enable `remove_noise` for cleaner content
- ✅ Use `markdown` output format
- ✅ Include metadata for context
- ✅ Set appropriate `max_pages` based on your needs
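Because the Actor emits Markdown, you can split on headings so every chunk keeps its section context. A sketch with LangChain's MarkdownHeaderTextSplitter (an illustration, not part of the Actor):

```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

# Split scraped Markdown on headings so each chunk carries its section path.
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[
    ("#", "h1"),
    ("##", "h2"),
])

# `item` is one dataset record from the Actor's output.
chunks = splitter.split_text(item["content"])
for chunk in chunks:
    print(chunk.metadata, chunk.page_content[:80])
```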
For Faster Scraping
- ⚡ Use `sitemap` strategy when available
- ⚡ Limit `max_depth` to 1-2 for deep crawl
- ⚡ Process in batches of 50-100 pages
- ⚡ Use 4096 MB memory allocation
For Cost Optimization
- 💰 Use single mode for small jobs
- 💰 Batch requests in multi-page mode
- 💰 Set reasonable `max_pages` limits
- 💰 Monitor compute unit usage
🔍 Troubleshooting
Sitemap Not Found
If sitemap strategy fails, the Actor automatically falls back to deep crawl.
JavaScript-Heavy Sites
Some sites may require additional wait time. The Actor handles this automatically with Playwright.
Rate Limiting
The Actor respects robots.txt and includes configurable delays between requests.
Missing Content
If content is missing, try:
- Disabling noise removal temporarily
- Using the `raw_html` format to inspect the unprocessed page
- Increasing timeout settings
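A quick way to see what noise removal dropped is to scrape the same page twice and compare the statistics (a sketch built on the input fields above):

```python
from apify_client import ApifyClient

client = ApifyClient("your-token")

def scrape(remove_noise: bool) -> dict:
    run = client.actor("YOUR_ACTOR_ID").call(run_input={
        "scrape_mode": "single",
        "url": "https://example.com/article",
        "output_format": "markdown",
        "remove_noise": remove_noise,
    })
    return next(client.dataset(run["defaultDatasetId"]).iterate_items())

filtered, unfiltered = scrape(True), scrape(False)
print("filtered words:", filtered["statistics"]["word_count"])
print("unfiltered words:", unfiltered["statistics"]["word_count"])
```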
📖 Documentation & Support
- GitHub Issues: Report bugs or request features
- Apify Discord: Join our community for support
- Documentation: Full API documentation
🏷️ Tags
web-scraping rag llm ai machine-learning vector-database langchain chatbot knowledge-base content-extraction markdown automation n8n zapier make
📄 License
This Actor is provided as-is for use on the Apify platform. Web scraping should be done responsibly and in accordance with website terms of service.
🤝 Ethical Scraping
This Actor:
- ✅ Respects `robots.txt`
- ✅ Only extracts publicly available content
- ✅ Does not extract personal data
- ✅ Includes configurable rate limiting
- ✅ Identifies itself properly in requests
Always ensure you have the right to scrape content from target websites and respect their terms of service.
Built with ❤️ using Crawl4AI
Need custom features or enterprise support? Contact us through the Apify platform!