Universal RAG Web Scraper
Pricing: from $0.01 / 1,000 results
Turn any website into clean, LLM-ready Markdown. Automatically strips ads, navigation, and noise using Mozilla Readability. Perfect for feeding data to ChatGPT, Claude, or Vector Databases (RAG).
Rating: 0.0 (0)
Developer: Prince Raj
Actor stats
Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 2 days ago
🕷️ Universal RAG Web Scraper
Turn any website into LLM-ready Markdown.
This Actor is designed specifically for AI engineers and developers building RAG (Retrieval-Augmented Generation) pipelines. It scrapes websites and converts them into clean, structured Markdown, stripping away noise like navigation bars, ads, and footers.
🚀 Why use this Actor?
- LLM-Optimized Output: Uses Mozilla's Readability engine + Turndown to produce noise-free Markdown.
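- Metadata Rich: Extracts title, byline, published time, and site name, which you can store alongside vector embeddings for filtering and source attribution.
- Smart Chunking Ready: The Markdown output keeps its heading structure, so it can be split along section boundaries for semantic chunking.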
- Deep & Focused: Supports glob patterns to scrape complete documentation sections (e.g., `**/docs/**`).
🛠️ Usage
Inputs
| Field | Type | Description | Default |
|---|---|---|---|
| startUrls | Array | List of URLs to start crawling. | (Required) |
| maxDepth | Integer | How deep to crawl. 0 = single page, 1 = links on page. | 1 |
| includeGlobs | Array | Only visit URLs matching these patterns (e.g., `**/blog/**`). | [] |
| excludeGlobs | Array | Skip URLs matching these patterns. | [] |
| cleanMarkdown | Boolean | Remove images and links for text-only output. | false |
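
As a rough illustration of how these fields fit together, the sketch below assembles an input dict in Python using the field names and defaults from the table. The helper name `build_run_input` and its validation checks are hypothetical and not part of the Actor; the `{"url": ...}` shape for start URLs follows the integration example in this README.

```python
from typing import Any, Optional


def build_run_input(
    start_urls: list[str],
    max_depth: int = 1,
    include_globs: Optional[list[str]] = None,
    exclude_globs: Optional[list[str]] = None,
    clean_markdown: bool = False,
) -> dict[str, Any]:
    """Assemble an Actor input dict (hypothetical helper, defaults taken from the table)."""
    if not start_urls:
        raise ValueError("startUrls is required and must not be empty")
    if max_depth < 0:
        raise ValueError("maxDepth must be 0 or greater")
    return {
        # Each start URL is wrapped as {"url": ...}, as in the integration example.
        "startUrls": [{"url": url} for url in start_urls],
        "maxDepth": max_depth,
        "includeGlobs": list(include_globs or []),
        "excludeGlobs": list(exclude_globs or []),
        "cleanMarkdown": clean_markdown,
    }
```

For a single documentation section, a call such as `build_run_input(["https://example.com/docs"], max_depth=2, include_globs=["**/docs/**"])` would produce a dict that can be passed as `run_input`.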
Output Example
```json
{
  "url": "https://example.com/blog/article",
  "title": "How to Build a Scraper",
  "description": "A comprehensive guide to scraping.",
  "markdown": "# How to Build a Scraper\n\nScraping is fun...",
  "metadata": {
    "siteName": "Example Blog",
    "publishedTime": "2023-10-01"
  },
  "crawledAt": "2023-10-25T10:00:00.000Z"
}
```
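
One possible way to chunk an output item like the one above is to split its `markdown` field at headings and attach the page metadata to every chunk. This is a minimal sketch, not the Actor's own chunking logic: the function names `split_by_headings` and `chunk_item` are hypothetical, and it only recognizes `#`-style headings outside fenced code blocks.

```python
import re
from typing import Any

# Matches ATX headings such as "# Title" or "### Section".
HEADING = re.compile(r"#{1,6}\s+\S")


def split_by_headings(markdown: str) -> list[str]:
    """Split Markdown into sections, starting a new one at each heading."""
    sections: list[str] = []
    current: list[str] = []
    in_fence = False
    for line in markdown.splitlines():
        # Track fenced code blocks so "# comment" lines inside them are not split on.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        is_heading = not in_fence and HEADING.match(line) is not None
        if is_heading and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [section for section in sections if section]


def chunk_item(item: dict[str, Any]) -> list[dict[str, Any]]:
    """Turn one dataset item into chunks that each carry the page metadata."""
    base = {"url": item.get("url"), "title": item.get("title")}
    base.update(item.get("metadata") or {})
    return [
        {"text": section, "metadata": {**base, "chunk": index}}
        for index, section in enumerate(split_by_headings(item.get("markdown") or ""))
    ]
```

Heading-based splitting keeps each chunk aligned with a section of the page; long sections may still need a further size-based split before embedding.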
📦 Integration
Easily integrate with LangChain, Haystack, or your custom Python/JS scripts using the Apify API.
```python
from apify_client import ApifyClient

# Authenticate with your Apify API token.
client = ApifyClient("YOUR_API_TOKEN")

# Start the Actor and wait for the run to finish.
run = client.actor("antigravity/rag-web-scraper").call(
    run_input={
        "startUrls": [{"url": "https://react.dev/reference/react"}],
        "maxDepth": 2,
        "includeGlobs": ["**/reference/**"],
    }
)

# Print the Markdown of every scraped page.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["markdown"])
```
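
Before loading results into a vector store, it may help to normalize dataset items into simple text-plus-metadata records. The sketch below is one way to do that with plain dicts; the `to_documents` helper and the `page_content`/`metadata` record shape are assumptions chosen for illustration, and skipping empty pages and repeated URLs is a design choice rather than documented Actor behavior.

```python
from typing import Any, Iterable, Iterator


def to_documents(items: Iterable[dict[str, Any]]) -> Iterator[dict[str, Any]]:
    """Yield one text-plus-metadata record per scraped page (hypothetical helper)."""
    seen_urls: set[Any] = set()
    for item in items:
        text = (item.get("markdown") or "").strip()
        url = item.get("url")
        # Skip pages with no extracted text and URLs that were already emitted.
        if not text or url in seen_urls:
            continue
        seen_urls.add(url)
        metadata = {
            "source": url,
            "title": item.get("title"),
            "crawledAt": item.get("crawledAt"),
        }
        # Merge the nested metadata object (siteName, publishedTime, ...) if present.
        metadata.update(item.get("metadata") or {})
        yield {"page_content": text, "metadata": metadata}
```

With the client code above, `list(to_documents(client.dataset(run["defaultDatasetId"]).iterate_items()))` would collect the records for a finished run.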
🏆 Goals
Built to address the "Context Retrieval" problem in modern AI applications: giving an LLM clean, relevant source text instead of raw, noisy HTML.