Under maintenance

Pricing

from $0.01 / 1,000 results

Universal RAG Web Scraper

Turn any website into clean, LLM-ready Markdown. Automatically strips ads, navigation, and noise using Mozilla Readability. Perfect for feeding data to ChatGPT, Claude, or Vector Databases (RAG).

Developer

Prince Raj

Last modified

3 months ago

🕷️ Universal RAG Web Scraper

Turn any website into LLM-ready Markdown.

This Actor is designed specifically for AI engineers and developers building RAG (Retrieval-Augmented Generation) pipelines. It scrapes websites and converts them into clean, structured Markdown, stripping away noise like navigation bars, ads, and footers.

🚀 Why use this Actor?

LLM-Optimized Output: Uses Mozilla's Readability engine + Turndown to produce noise-free Markdown.
Metadata Rich: Extracts title, byline, published time, and site name for better vector embedding.
Smart Chunking Ready: The output format is perfect for semantic chunking strategies.
Deep & Focused: Supports glob patterns to scrape complete documentation sections (e.g., `**/docs/**`).
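To preview which URLs a set of glob patterns might keep before starting a run, a rough local check can help. The sketch below is an approximation only: it uses Python's `fnmatch`, not the Actor's own matcher, and the helper name `should_visit` and the rule that exclude patterns win over include patterns are assumptions made for illustration.

```python
from fnmatch import fnmatchcase


def should_visit(url, include_globs=(), exclude_globs=()):
    """Return True if url passes the include/exclude filters.

    Assumptions (not confirmed by the Actor's docs): an empty include
    list allows everything, and exclude patterns take priority.
    """
    if any(fnmatchcase(url, pattern) for pattern in exclude_globs):
        return False
    if not include_globs:
        return True
    return any(fnmatchcase(url, pattern) for pattern in include_globs)


urls = [
    "https://example.com/docs/intro",
    "https://example.com/docs/api/changelog",
    "https://example.com/pricing",
]

# Keep documentation pages, but skip anything that looks like a changelog.
kept = [u for u in urls if should_visit(u, ["**/docs/**"], ["**/changelog*"])]
print(kept)
```

Because `fnmatch` lets `*` match `/` as well, real glob engines may treat some patterns differently, so treat the result as a sanity check rather than a guarantee.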

🛠️ Usage

Inputs

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| `startUrls` | Array | List of URLs to start crawling. | (Required) |
| `maxDepth` | Integer | Maximum link depth to crawl. `0` = start URLs only, `1` = also follow links found on those pages. | `1` |
| `includeGlobs` | Array | Only visit URLs matching these patterns (e.g., `/blog/`). | `[]` |
| `excludeGlobs` | Array | Skip URLs matching these patterns. | `[]` |
| `cleanMarkdown` | Boolean | Remove images and links for text-only output. | `false` |
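As a convenience, the input fields above can be assembled by a small helper that applies the documented defaults and catches obvious mistakes before a run is started. This is a minimal sketch: `build_run_input` is a hypothetical helper, not part of the Actor or the Apify client, and the `{"url": ...}` shape for `startUrls` is taken from the integration example on this page.

```python
def build_run_input(start_urls, max_depth=1, include_globs=None,
                    exclude_globs=None, clean_markdown=False):
    """Build an input dict using the field names and defaults listed above."""
    if not start_urls:
        raise ValueError("startUrls is required and must not be empty")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_depth, bool) or not isinstance(max_depth, int) or max_depth < 0:
        raise ValueError("maxDepth must be a non-negative integer")
    return {
        "startUrls": [{"url": url} for url in start_urls],
        "maxDepth": max_depth,
        "includeGlobs": list(include_globs or []),
        "excludeGlobs": list(exclude_globs or []),
        "cleanMarkdown": bool(clean_markdown),
    }


run_input = build_run_input(
    ["https://react.dev/reference/react"],
    max_depth=2,
    include_globs=["**/reference/**"],
)
print(run_input)
```

The resulting dict can be passed as `run_input` when calling the Actor through the Apify client.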

Output Example

{
  "url": "https://example.com/blog/article",
  "title": "How to Build a Scraper",
  "description": "A comprehensive guide to scraping.",
  "markdown": "# How to Build a Scraper\n\nScraping is fun...",
  "metadata": {
    "siteName": "Example Blog",
    "publishedTime": "2023-10-01"
  },
  "crawledAt": "2023-10-25T10:00:00.000Z"
}
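One way to prepare a record like this for embedding is to split the `markdown` field at headings and carry the page metadata along with each piece. The following is a sketch of that idea under stated assumptions: `chunk_item` is a hypothetical helper, the chunk fields (`heading`, `text`) are illustrative names, and only ATX-style headings (`#` to `######`) outside fenced code are treated as split points.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def chunk_item(item):
    """Split item["markdown"] at headings and return a list of chunk dicts."""
    chunks = []
    heading, lines = item.get("title", ""), []
    in_fence = False

    def flush():
        # Emit the text collected under the current heading, if any.
        text = "\n".join(lines).strip()
        if text:
            chunks.append({
                "url": item["url"],
                "heading": heading,
                "text": text,
                "siteName": item.get("metadata", {}).get("siteName"),
            })

    for line in item.get("markdown", "").splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # Lines starting with "#" inside code fences are not headings.
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks


item = {
    "url": "https://example.com/blog/article",
    "title": "How to Build a Scraper",
    "markdown": "# How to Build a Scraper\n\nScraping is fun...\n\n## Setup\n\nInstall the client.",
    "metadata": {"siteName": "Example Blog"},
}
chunks = chunk_item(item)
for chunk in chunks:
    print(chunk["heading"], "->", chunk["text"])
```

Heading-based splitting keeps each chunk tied to a section title, which tends to make retrieved passages easier to attribute. Long sections may still need a size-based split on top of this.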

📦 Integration

Easily integrate with LangChain, Haystack, or your custom Python/JS scripts using the Apify API.

from apify_client import ApifyClient

# Authenticate with your Apify API token.
client = ApifyClient("YOUR_API_TOKEN")

# Start the Actor and wait for the run to finish.
run = client.actor("antigravity/rag-web-scraper").call(run_input={
    "startUrls": [{"url": "https://react.dev/reference/react"}],
    "maxDepth": 2,
    "includeGlobs": ["**/reference/**"]
})

# Read the scraped pages from the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["markdown"])
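For custom scripts, it can be handy to persist the dataset items to a local file before indexing them. The sketch below writes one JSON object per line, skipping pages with empty Markdown and repeated URLs. `write_jsonl` and the file name `pages.jsonl` are hypothetical, and the item fields are assumed to match the output example above.

```python
import json


def write_jsonl(items, path):
    """Write scraped items to a JSON Lines file and return the number written."""
    seen = set()
    written = 0
    with open(path, "w", encoding="utf-8") as handle:
        for item in items:
            url = item.get("url")
            markdown = (item.get("markdown") or "").strip()
            # Skip records without a URL or content, and repeated URLs.
            if not url or not markdown or url in seen:
                continue
            seen.add(url)
            record = {"url": url, "title": item.get("title"), "markdown": markdown}
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return written
```

With the client from the example above, this could be called as `write_jsonl(client.dataset(run["defaultDatasetId"]).iterate_items(), "pages.jsonl")`.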

🏆 Goals

Built to solve the "Context Retrieval" problem for modern AI applications.

Web Scraper RAG Ready

traorealexy/Web-Sraper-RAG-Ready

Turn any website into clean, token-efficient Markdown ready for RAG and LLM pipelines. Removes boilerplate, handles JavaScript rendering, and outputs structured JSON for LangChain, LlamaIndex, and vector databases.

Alexy Traore

Universal Web to Markdown (Bulk & AI-Ready)

lentic_october/web-to-markdown-converter

Bulk convert any website URLs to clean Markdown for AI & LLMs. Universal scraper that removes ads, scripts, and clutter. Optimized for RAG, ChatGPT, Claude, and LangChain. Fast, async, and API-ready.

kalthireddy Abhishek

Web-to-Markdown Generator for AI & RAG Pipelines

profitstack/web-to-markdown-generator-for-ai-rag-pipelines

Convert any website into clean, LLM-ready Markdown with heading-based chunking for RAG and AI agents.

Manas Mantri

Docs Markdown Rag Ready Crawler

devwithbobby/docs-markdown-rag-ready-crawler

Turn any documentation site or website into clean, structured markdown—ready for RAG, embeddings, and AI agents.

Dev with Bobby

Zendesk to RAG Markdown Scraper

inclusive_insect/Zendesk-to-RAG-Markdown-Pipeline

Crawl any Zendesk Help Center and extract pristine, semantic Markdown optimized for LLMs, RAG pipelines, and Vector Databases. Automatically strips HTML junk, navigation bars, and footers to provide high-accuracy AI training data.

Gonds Studio

Universal Knowledge Base Scraper (RAG Ready)

actums/universal-rag-scraper

Turn any Help Center into LLM-ready Markdown. Supports Zendesk, Intercom, Docusaurus, and generic sites. Perfect for RAG and AI Agents.

Actums

Website To Markdown

smart_api/website-to-markdown

Convert any webpage into clean, LLM-ready Markdown in seconds — perfect for AI training data, RAG pipelines, and content archiving.

SmartApi

5.0

Website to Markdown Crawler - AI/RAG Data Pipeline

sovereigntaylor/website-to-markdown

Crawl any website and convert every page to clean, structured Markdown. Perfect for RAG pipelines, LLM training data, vector database ingestion, knowledge base building, and AI-powered search. Extracts main content, strips boilerplate, handles metadata, and chunks output for embeddings. Works with L

Ricardo Akiyoshi

Website to Clean Markdown (AI & RAG Ready)

ahmed_jasarevic/website-to-clean-markdown-ai-rag-ready

Convert any website into clean, noise-free Markdown. Perfect for training LLMs, building Custom GPTs, and RAG pipelines. Save 80% on OpenAI tokens by stripping HTML junk.

Ahmed Jasarevic

AI RAG Feeder V2

mickeywmoore/ai-rag-feeder-v2

Turn any website into AI-ready Markdown. Scrapes entire domains, removes ads/clutter, and formats text specifically for RAG pipelines and LLM training data.