RAG-Markdown Extractor
Pricing: from $17.70 / 1,000 results
The ultimate web-to-Markdown tool for AI builders. Extracts clean content from any site, auto-dismisses cookie banners, and renders SPAs with Playwright. Optimized for LangChain, LlamaIndex, and RAG pipelines. Reduce token costs with 99% noise-free Markdown.
Developer: JI JUN
🧹 Extracts the main content from any web page and outputs it as clean, structured Markdown — optimized for Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and vector databases.
✨ Why This Actor?
| Problem | Solution |
|---|---|
| Web pages are full of ads, navbars, cookie banners, and boilerplate | 50+ noise selectors + text-based heuristics strip everything automatically |
| SPAs don't render with simple HTTP requests | Playwright headless browser waits for dynamic content to fully render |
| Cookie consent dialogs leak into scraped text | Auto-dismiss consent popups before extraction |
| You need structured metadata alongside content | Every output includes title, source URL, date, category, keywords, author |
🚀 Features
- Deep Noise Removal — 50+ CSS selectors + text-matching heuristics remove ads, navbars, footers, sidebars, cookie/GDPR banners, modals, and more.
- Cookie Consent Auto-Dismiss — Automatically clicks "Accept"/"Allow All" buttons so they don't pollute the output.
- Smart Markdown Formatting — Preserves headings, lists, code blocks, and links using Turndown.
- SPA Support — Uses Playwright to fully render JavaScript-heavy Single Page Applications before extraction.
- Proxy Support — Bypass anti-bot protections with Apify Proxy.
- Metadata Enrichment — Outputs word count, character count, description, and structured metadata header.
- Empty Image Cleanup — Strips decorative images with no alt text to reduce noise.
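As an illustration of the empty image cleanup step, here is a minimal Python sketch. It is not the Actor's own code: the regular expression and the `strip_empty_alt_images` helper are assumptions made for this example, and the sketch only handles inline `![](...)` images.

```python
import re

# Matches Markdown images whose alt text is empty or whitespace only,
# e.g. ![](spacer.gif). Hypothetical pattern, not taken from the Actor.
EMPTY_ALT_IMAGE = re.compile(r"!\[\s*\]\([^)]*\)")


def strip_empty_alt_images(markdown: str) -> str:
    """Remove images without alt text and tidy the blank lines left behind."""
    cleaned = EMPTY_ALT_IMAGE.sub("", markdown)
    # Remove trailing spaces or tabs left at the end of a line.
    cleaned = re.sub(r"[ \t]+\n", "\n", cleaned)
    # Collapse runs of three or more newlines into one blank line.
    cleaned = re.sub(r"\n{3,}", "\n\n", cleaned)
    return cleaned


doc = "Intro\n\n![](spacer.gif)\n\n![Chart of results](chart.png)\n\nEnd"
cleaned_doc = strip_empty_alt_images(doc)
```

Images that carry alt text, such as the chart above, are left untouched because the alt text may be meaningful to an LLM.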
📦 Output Format
Each item in the output dataset contains:
| Field | Type | Description |
|---|---|---|
| `url` | string | Source URL of the page |
| `title` | string | Page title |
| `description` | string | Meta description of the page |
| `wordCount` | number | Number of words in the extracted content |
| `charCount` | number | Number of characters in the extracted content |
| `markdown` | string | The cleaned Markdown with metadata header |
Example Output

    # Building RAG Pipelines

    > **Source:** https://example.com/rag-pipelines
    > **Extracted:** 2026-03-01
    > **Category:** AI Engineering
    > **Author:** Jane Doe

    ---

    Retrieval-Augmented Generation (RAG) is a technique that...
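Downstream code often needs the metadata header as structured fields. The following Python sketch parses it, assuming each field sits on its own `> **Key:** value` line and the header ends at the first `---` rule. The `parse_metadata_header` helper is hypothetical and not part of the Actor.

```python
import re

# One header field per blockquote line, e.g. "> **Author:** Jane Doe".
# This layout is an assumption based on the example output.
HEADER_LINE = re.compile(r"^>\s*\*\*(?P<key>[^*:]+):\*\*\s*(?P<value>.+?)\s*$")

SAMPLE = """# Building RAG Pipelines

> **Source:** https://example.com/rag-pipelines
> **Extracted:** 2026-03-01
> **Category:** AI Engineering
> **Author:** Jane Doe

---

Retrieval-Augmented Generation (RAG) is a technique that...
"""


def parse_metadata_header(markdown: str) -> dict[str, str]:
    """Collect header fields into a dict with lowercase keys."""
    meta: dict[str, str] = {}
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped == "---":
            # The horizontal rule marks the end of the header.
            break
        match = HEADER_LINE.match(stripped)
        if match:
            meta[match.group("key").strip().lower()] = match.group("value")
    return meta


metadata = parse_metadata_header(SAMPLE)
```

The resulting dict can be attached to each chunk as vector-store metadata, so retrieved passages keep their source URL and author.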
⚙️ Input Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `startUrls` | Array | required | List of URLs to extract Markdown from |
| `maxConcurrency` | Integer | 5 | Max pages processed in parallel |
| `waitForSPA` | Integer (ms) | 2000 | Extra wait time for SPA rendering |
| `proxyConfiguration` | Object | Apify Proxy | Proxy settings to bypass blocks |
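The parameters above could be assembled into a run input as in the Python sketch below. The exact shapes are assumptions: `startUrls` is written as a list of `{"url": ...}` objects and `proxyConfiguration` as `{"useApifyProxy": true}`, following common Apify conventions, and the Actor's input schema may differ. `build_run_input` is a hypothetical helper.

```python
import json


def build_run_input(urls, max_concurrency=5, wait_for_spa_ms=2000, use_apify_proxy=True):
    """Build an input object using the defaults listed in the parameter table."""
    if not urls:
        raise ValueError("startUrls is required and must contain at least one URL")
    if max_concurrency < 1:
        raise ValueError("maxConcurrency must be at least 1")
    if wait_for_spa_ms < 0:
        raise ValueError("waitForSPA must not be negative")
    return {
        # Assumed shape: one {"url": ...} object per page.
        "startUrls": [{"url": url} for url in urls],
        "maxConcurrency": max_concurrency,
        "waitForSPA": wait_for_spa_ms,
        # Assumed shape: the usual Apify proxy configuration object.
        "proxyConfiguration": {"useApifyProxy": use_apify_proxy},
    }


run_input = build_run_input(["https://example.com/docs"])
run_input_json = json.dumps(run_input, indent=2)
```

The JSON string can be pasted into the Actor's input editor or saved as the input file for a local run.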
🎯 Use Cases
- RAG Pipeline Data Ingestion — Feed clean Markdown directly into LangChain, LlamaIndex, or custom RAG systems.
- Knowledge Base Building — Bulk-extract documentation, articles, or blog posts into a structured format.
- AI Training Data — Collect clean text from the web for fine-tuning language models.
- Content Monitoring — Track changes in competitor content or news articles over time.
- Research & Analysis — Extract and analyze articles at scale without manual copy-pasting.
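For the RAG ingestion use case, the extracted Markdown is usually split into chunks before embedding. The sketch below shows one simple approach in plain Python, splitting at headings and skipping `#` lines inside fenced code. The `chunk_by_heading` helper and the chunk fields are assumptions for illustration, and LangChain and LlamaIndex ship their own splitters that may suit better.

````python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str, source_url: str) -> list[dict]:
    """Split Markdown into one chunk per heading, keeping the source URL."""
    chunks: list[dict] = []
    heading = ""
    buffer: list[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(buffer).strip()
        if text:
            chunks.append({"heading": heading, "text": text, "source": source_url})
        buffer.clear()

    for line in markdown.splitlines():
        # Track fenced code so "# comment" lines are not treated as headings.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            heading = match.group(2).strip()
        else:
            buffer.append(line)
    flush()
    return chunks


page = "# Title\n\nIntro text.\n\n## Setup\n\n```bash\n# not a heading\npip install x\n```\n\n## Usage\n\nRun it."
chunks = chunk_by_heading(page, "https://example.com/docs")
````

Splitting at headings keeps each chunk topically coherent. Long sections may still need a further size-based split before embedding.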
💻 Usage
Run on the Apify Platform via the UI, or locally:
$ apify run -p