RAG-Markdown Extractor

Pricing

from $17.70 / 1,000 results

The ultimate web-to-markdown tool for AI builders. Extracts clean content from any site, auto-dismisses cookie banners, and handles SPAs with Playwright. Optimized for LangChain, LlamaIndex, and RAG pipelines. Save token costs with 99% noise-free markdown.

Rating: 0.0 (0)

Developer: JI JUN (Maintained by Community)

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 2 days ago

🧹 Extracts the main content from any web page and outputs it as clean, structured Markdown — optimized for Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and vector databases.

✨ Why This Actor?

| Problem | Solution |
| --- | --- |
| Web pages are full of ads, navbars, cookie banners, and boilerplate | 50+ noise selectors + text-based heuristics strip everything automatically |
| SPAs don't render with simple HTTP requests | Playwright headless browser waits for dynamic content to fully render |
| Cookie consent dialogs leak into scraped text | Auto-dismiss consent popups before extraction |
| You need structured metadata alongside content | Every output includes title, source URL, date, category, keywords, author |

🚀 Features

  • Deep Noise Removal — 50+ CSS selectors + text-matching heuristics remove ads, navbars, footers, sidebars, cookie/GDPR banners, modals, and more.
  • Cookie Consent Auto-Dismiss — Automatically clicks "Accept"/"Allow All" buttons so they don't pollute the output.
  • Smart Markdown Formatting — Preserves headings, lists, code blocks, and links using Turndown.
  • SPA Support — Uses Playwright to fully render JavaScript-heavy Single Page Applications before extraction.
  • Proxy Support — Bypass anti-bot protections with Apify Proxy.
  • Metadata Enrichment — Outputs word count, character count, description, and a structured metadata header.
  • Empty Image Cleanup — Strips decorative images with no alt text to reduce noise.
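
To make the "Empty Image Cleanup" idea concrete, here is a minimal Python sketch of stripping images with no alt text from already-converted Markdown. It is an illustration only, not the Actor's own code: the regular expression and the function name `strip_empty_alt_images` are assumptions introduced here.

```python
import re

# Matches Markdown images whose alt text is empty or whitespace only,
# e.g. ![](spacer.gif) or ![ ](pixel.png). Hypothetical pattern for illustration.
EMPTY_ALT_IMAGE = re.compile(r"!\[\s*\]\([^)]*\)")


def strip_empty_alt_images(markdown: str) -> str:
    """Remove images with no alt text, then tidy the blank lines left behind."""
    cleaned = EMPTY_ALT_IMAGE.sub("", markdown)
    # Collapse runs of three or more newlines into a single blank line.
    return re.sub(r"\n{3,}", "\n\n", cleaned)


sample = "Intro\n\n![](spacer.gif)\n\n![Architecture diagram](arch.png)\n"
print(strip_empty_alt_images(sample))
```

Images that carry alt text are left untouched, since the alt text usually holds information worth keeping for retrieval.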

📦 Output Format

Each item in the output dataset contains:

| Field | Type | Description |
| --- | --- | --- |
| url | string | Source URL of the page |
| title | string | Page title |
| description | string | Meta description of the page |
| wordCount | number | Number of words in the extracted content |
| charCount | number | Number of characters in the extracted content |
| markdown | string | The cleaned Markdown with metadata header |

Example Output

# Building RAG Pipelines
> **Source:** https://example.com/rag-pipelines
> **Extracted:** 2026-03-01
> **Category:** AI Engineering
> **Author:** Jane Doe
---
Retrieval-Augmented Generation (RAG) is a technique that...
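
If you want the header fields as structured data downstream, the layout shown in the example can be split back into metadata and body. The sketch below assumes the header always follows that exact shape (a `# ` title line, `> **Key:** value` lines, then `---`); the function name `split_metadata_header` is hypothetical, and real output may differ.

```python
import re
from typing import Dict, Tuple

# Matches header lines shaped like: > **Source:** https://example.com
HEADER_LINE = re.compile(r"^>\s*\*\*(?P<key>[^:*]+):\*\*\s*(?P<value>.*)$")


def split_metadata_header(markdown: str) -> Tuple[Dict[str, str], str]:
    """Return (metadata, body); metadata is empty if no header block is found."""
    lines = markdown.splitlines()
    metadata: Dict[str, str] = {}
    for index, line in enumerate(lines):
        stripped = line.strip()
        if not stripped:
            continue  # tolerate blank lines inside the header
        if stripped == "---":
            # Everything after the separator is the document body.
            return metadata, "\n".join(lines[index + 1:]).strip()
        if index == 0 and stripped.startswith("# "):
            metadata["title"] = stripped[2:].strip()
            continue
        match = HEADER_LINE.match(stripped)
        if match is None:
            break  # not a header line: treat the whole input as body
        metadata[match.group("key").strip().lower()] = match.group("value").strip()
    return {}, markdown


example = """# Building RAG Pipelines
> **Source:** https://example.com/rag-pipelines
> **Extracted:** 2026-03-01
> **Category:** AI Engineering
> **Author:** Jane Doe
---
Retrieval-Augmented Generation (RAG) is a technique that..."""

meta, body = split_metadata_header(example)
print(meta["source"], meta["author"])
```

Keeping the metadata separate from the body lets you store it as filterable fields in a vector database instead of embedding it with the text.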

⚙️ Input Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| startUrls | Array | required | List of URLs to extract Markdown from |
| maxConcurrency | Integer | 5 | Max pages processed in parallel |
| waitForSPA | Integer (ms) | 2000 | Extra wait time for SPA rendering |
| proxyConfiguration | Object | Apify Proxy | Proxy settings to bypass blocks |
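
As a rough illustration of how these parameters fit together, the Python sketch below assembles an input object and prints it as JSON. The parameter names and defaults come from the table; the `{"url": ...}` shape of each `startUrls` entry and the `useApifyProxy` key are assumptions based on common Apify conventions, so check the Actor's input schema before relying on them. The helper name `build_run_input` is hypothetical.

```python
import json
from typing import Any, Dict, List


def build_run_input(
    urls: List[str],
    max_concurrency: int = 5,
    wait_for_spa_ms: int = 2000,
    use_apify_proxy: bool = True,
) -> Dict[str, Any]:
    """Assemble an input object using the parameter names from the table."""
    if not urls:
        raise ValueError("startUrls is required and must not be empty")
    return {
        # Assumption: each entry is an object with a "url" key.
        "startUrls": [{"url": url} for url in urls],
        "maxConcurrency": max_concurrency,
        "waitForSPA": wait_for_spa_ms,
        # Assumption: proxy settings use the common "useApifyProxy" flag.
        "proxyConfiguration": {"useApifyProxy": use_apify_proxy},
    }


run_input = build_run_input(["https://example.com/rag-pipelines"])
print(json.dumps(run_input, indent=2))
```

For pages that render slowly, raising `wait_for_spa_ms` trades run time for more complete content.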

🎯 Use Cases

  • RAG Pipeline Data Ingestion — Feed clean Markdown directly into LangChain, LlamaIndex, or custom RAG systems.
  • Knowledge Base Building — Bulk-extract documentation, articles, or blog posts into a structured format.
  • AI Training Data — Collect clean text from the web for fine-tuning language models.
  • Content Monitoring — Track changes in competitor content or news articles over time.
  • Research & Analysis — Extract and analyze articles at scale without manual copy-pasting.
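
For the RAG ingestion use case, a common next step is splitting the extracted Markdown into heading-scoped chunks before embedding. The sketch below is one simple way to do that with the standard library only; it is not part of the Actor, the function name `chunk_by_heading` is hypothetical, and it does not special-case `#` lines inside fenced code blocks.

```python
import re
from typing import Dict, List

# ATX headings: one to six "#" characters followed by whitespace and the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def _flush(chunks: List[Dict[str, str]], heading: str, lines: List[str], url: str) -> None:
    """Append the buffered section as a chunk if it contains any text."""
    text = "\n".join(lines).strip()
    if text:
        chunks.append({"heading": heading, "text": text, "url": url})


def chunk_by_heading(markdown: str, source_url: str) -> List[Dict[str, str]]:
    """Split Markdown into one chunk per heading, tagging each with its source URL."""
    chunks: List[Dict[str, str]] = []
    heading = ""
    buffer: List[str] = []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            _flush(chunks, heading, buffer, source_url)
            heading = match.group(2).strip()
            buffer = []
        else:
            buffer.append(line)
    _flush(chunks, heading, buffer, source_url)
    return chunks


doc = "# Intro\nRAG basics.\n\n## Setup\nInstall it.\n"
for chunk in chunk_by_heading(doc, "https://example.com/rag-pipelines"):
    print(chunk["heading"], "->", chunk["text"])
```

Each chunk carries its heading and source URL, so retrieved passages can be cited back to the original page.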

💻 Usage

Run on the Apify Platform via the UI, or locally:

$ apify run -p