Universal RAG Web Scraper

Pricing

from $0.01 / 1,000 results

Turn any website into clean, LLM-ready Markdown. The Actor automatically strips ads, navigation, and other noise using Mozilla Readability, producing output suited to ChatGPT, Claude, or vector databases in RAG pipelines.

Rating: 0.0 (0)

Developer: Prince Raj (Maintained by Community)

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 2 days ago

🕷️ Universal RAG Web Scraper

Turn any website into LLM-ready Markdown.

This Actor is designed specifically for AI engineers and developers building RAG (Retrieval-Augmented Generation) pipelines. It scrapes websites and converts them into clean, structured Markdown, stripping away noise like navigation bars, ads, and footers.

🚀 Why use this Actor?

  • LLM-Optimized Output: Uses Mozilla's Readability engine and Turndown to produce noise-free Markdown.
  • Metadata Rich: Extracts title, byline, published time, and site name, so each embedded chunk can carry its source context.
  • Smart Chunking Ready: The Markdown output keeps its heading structure, which suits semantic chunking strategies that split along section boundaries.
  • Deep & Focused: Supports glob patterns to scrape complete documentation sections (e.g., **/docs/**).

🛠️ Usage

Inputs

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| startUrls | Array | List of URLs to start crawling. | (Required) |
| maxDepth | Integer | How deep to crawl. 0 = single page, 1 = links on page. | 1 |
| includeGlobs | Array | Only visit URLs matching these patterns (e.g., **/blog/**). | [] |
| excludeGlobs | Array | Skip URLs matching these patterns. | [] |
| cleanMarkdown | Boolean | Remove images and links for text-only output. | false |
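
To make the interaction between maxDepth, includeGlobs, and excludeGlobs more concrete, here is a minimal Python sketch of a depth-limited crawl plan over an in-memory link graph. It is an illustration, not the Actor's implementation: the names `url_allowed`, `plan_crawl`, and `link_graph` are hypothetical, Python's `fnmatch` stands in for whatever glob engine the Actor uses (the exact matching rules may differ), and the sketch assumes that start URLs are always visited and that an empty includeGlobs list allows every link.

```python
from collections import deque
from fnmatch import fnmatchcase


def url_allowed(url, include_globs, exclude_globs):
    """Return True if a discovered URL passes the include/exclude filters."""
    if any(fnmatchcase(url, pattern) for pattern in exclude_globs):
        return False
    # Assumption: an empty include list means "allow everything".
    return not include_globs or any(
        fnmatchcase(url, pattern) for pattern in include_globs
    )


def plan_crawl(link_graph, start_urls, max_depth=1,
               include_globs=(), exclude_globs=()):
    """Breadth-first walk of link_graph, limited to max_depth link hops.

    link_graph is a hypothetical dict mapping a URL to the URLs it links to.
    Returns the URLs in the order they would be visited.
    """
    seen = set(start_urls)
    queue = deque((url, 0) for url in start_urls)
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # depth limit reached: do not follow this page's links
        for link in link_graph.get(url, []):
            if link not in seen and url_allowed(link, include_globs, exclude_globs):
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Tiny hypothetical site used only for illustration.
SAMPLE_GRAPH = {
    "https://example.com/docs/": [
        "https://example.com/docs/intro",
        "https://example.com/pricing",
    ],
    "https://example.com/docs/intro": ["https://example.com/docs/api/client"],
}

if __name__ == "__main__":
    for page in plan_crawl(SAMPLE_GRAPH, ["https://example.com/docs/"],
                           max_depth=1, include_globs=["**/docs/**"]):
        print(page)
```

Under these assumptions, maxDepth 0 yields only the start page, while maxDepth 1 adds the pages linked from it that pass the glob filters.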

Output Example

{
  "url": "https://example.com/blog/article",
  "title": "How to Build a Scraper",
  "description": "A comprehensive guide to scraping.",
  "markdown": "# How to Build a Scraper\n\nScraping is fun...",
  "metadata": {
    "siteName": "Example Blog",
    "publishedTime": "2023-10-01"
  },
  "crawledAt": "2023-10-25T10:00:00.000Z"
}

📦 Integration

Easily integrate with LangChain, Haystack, or your custom Python/JS scripts using the Apify API.

from apify_client import ApifyClient

# Authenticate with your Apify API token.
client = ApifyClient("YOUR_API_TOKEN")

# Start the Actor and wait for the run to finish.
run = client.actor("antigravity/rag-web-scraper").call(run_input={
    "startUrls": [{"url": "https://react.dev/reference/react"}],
    "maxDepth": 2,
    "includeGlobs": ["**/reference/**"],
})

# Print the Markdown of every scraped page.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["markdown"])

🏆 Goals

Built to solve the "Context Retrieval" problem for modern AI applications.