Pricing

from $17.70 / 1,000 results

RAG-Markdown Extractor

The ultimate web-to-Markdown tool for AI builders. Extracts clean content from any site, auto-dismisses cookie banners, and renders SPAs with Playwright. Optimized for LangChain, LlamaIndex, and RAG pipelines. Cut token costs with 99% noise-free Markdown.

Rating

0.0

(0)

Developer

JI JUN

Last modified: 4 months ago

✨ Why This Actor?

| Problem | Solution |
| --- | --- |
| Web pages are full of ads, navbars, cookie banners, and boilerplate | 50+ noise selectors + text-based heuristics strip everything automatically |
| SPAs don't render with simple HTTP requests | Playwright headless browser waits for dynamic content to fully render |
| Cookie consent dialogs leak into scraped text | Auto-dismiss consent popups before extraction |
| You need structured metadata alongside content | Every output includes title, source URL, date, category, keywords, author |

🚀 Features

Deep Noise Removal — 50+ CSS selectors + text-matching heuristics remove ads, navbars, footers, sidebars, cookie/GDPR banners, modals, and more.
Cookie Consent Auto-Dismiss — Automatically clicks "Accept"/"Allow All" buttons so they don't pollute the output.
Smart Markdown Formatting — Preserves headings, lists, code blocks, and links using Turndown.
SPA Support — Uses Playwright to fully render JavaScript-heavy Single Page Applications before extraction.
Proxy Support — Bypass anti-bot protections with Apify Proxy.
Metadata Enrichment — Outputs word count, character count, description, and structured metadata header.
Empty Image Cleanup — Strips decorative images with no alt text to reduce noise.
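
The noise-removal idea above can be illustrated with a small sketch. This is not the actor's implementation: the tag set and class/id hints below are hypothetical stand-ins for its 50+ selectors, and the parser assumes reasonably well-formed HTML (every non-void element has a closing tag).

```python
from html.parser import HTMLParser

# Hypothetical noise rules for illustration; the actor's real selector list
# is not reproduced here.
NOISE_TAGS = {"nav", "footer", "aside", "script", "style"}
NOISE_HINTS = ("cookie", "consent", "banner", "sidebar", "advert")
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class NoiseStripper(HTMLParser):
    """Collects text that sits outside elements flagged as noise."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a noise element
        self.parts = []

    def _is_noise(self, tag, attrs):
        if tag in NOISE_TAGS:
            return True
        label = " ".join(v or "" for k, v in attrs if k in ("class", "id")).lower()
        return any(hint in label for hint in NOISE_HINTS)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements have no end tag, so they are not counted
        if self.skip_depth or self._is_noise(tag, attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())


def strip_noise(html):
    """Return the visible text of `html` with noise elements dropped."""
    parser = NoiseStripper()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

A production pipeline would more likely apply real CSS selectors to the rendered DOM; the depth counter here only approximates that with the standard library.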

📦 Output Format

Each item in the output dataset contains:

| Field | Type | Description |
| --- | --- | --- |
| `url` | string | Source URL of the page |
| `title` | string | Page title |
| `description` | string | Meta description of the page |
| `wordCount` | number | Number of words in the extracted content |
| `charCount` | number | Number of characters in the extracted content |
| `markdown` | string | The cleaned Markdown with metadata header |

Example Output

# Building RAG Pipelines

> **Source:** https://example.com/rag-pipelines
> **Extracted:** 2026-03-01
> **Category:** AI Engineering
> **Author:** Jane Doe

---

Retrieval-Augmented Generation (RAG) is a technique that...
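
If downstream code needs the header values as structured data, the `markdown` field can be split back into metadata and body. The sketch below assumes the header always follows the shape of the example above (a `# ` title, `> **Key:** value` lines, then a `---` rule); `parse_metadata_header` is a hypothetical helper, not part of the actor.

```python
import re

# Matches header lines such as "> **Source:** https://example.com/..."
HEADER_LINE = re.compile(r"^>\s*\*\*(?P<key>[^:*]+):\*\*\s*(?P<value>.+?)\s*$")


def parse_metadata_header(markdown):
    """Return (metadata dict, body text) for Markdown shaped like the example."""
    meta = {}
    lines = markdown.splitlines()
    body_start = 0  # stays 0 if no "---" rule is found, so the body is everything
    for i, line in enumerate(lines):
        match = HEADER_LINE.match(line)
        if match:
            meta[match.group("key").strip().lower()] = match.group("value")
        elif line.strip() == "---":
            body_start = i + 1
            break
        elif line.startswith("# ") and "title" not in meta:
            meta["title"] = line[2:].strip()
    body = "\n".join(lines[body_start:]).strip()
    return meta, body
```

Keys are lowercased so that `meta["source"]` and `meta["author"]` can be copied straight into a vector store's metadata.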

⚙️ Input Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `startUrls` | Array | required | List of URLs to extract Markdown from |
| `maxConcurrency` | Integer | `5` | Max pages processed in parallel |
| `waitForSPA` | Integer (ms) | `2000` | Extra wait time for SPA rendering |
| `proxyConfiguration` | Object | Apify Proxy | Proxy settings to bypass blocks |
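
As a rough illustration of how these parameters fit together, the sketch below builds a run input and starts the actor with the `apify-client` Python package. Several details are assumptions: the `{"url": ...}` shape for `startUrls` entries and the `{"useApifyProxy": True}` proxy object follow common Apify conventions but should be checked against this actor's input schema, and `build_run_input`, `run_actor`, and the `actor_id` argument are hypothetical names.

```python
def build_run_input(urls, max_concurrency=5, wait_for_spa_ms=2000, use_apify_proxy=True):
    """Build an input object using the defaults from the parameter table."""
    if not urls:
        raise ValueError("startUrls is required")
    return {
        # Assumption: entries are {"url": ...} objects; the actor may accept plain strings.
        "startUrls": [{"url": url} for url in urls],
        "maxConcurrency": max_concurrency,
        "waitForSPA": wait_for_spa_ms,
        # Assumption: the usual Apify proxy configuration shape.
        "proxyConfiguration": {"useApifyProxy": use_apify_proxy},
    }


def run_actor(actor_id, token, run_input):
    """Start the actor, wait for it to finish, and return its dataset items."""
    from apify_client import ApifyClient  # third-party: pip install apify-client

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return a result")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Calling `run_actor("<actor-id>", "<api-token>", build_run_input(["https://example.com/rag-pipelines"]))` would return items with the fields listed under Output Format; both placeholders must be replaced with real values.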

🎯 Use Cases

RAG Pipeline Data Ingestion — Feed clean Markdown directly into LangChain, LlamaIndex, or custom RAG systems.
Knowledge Base Building — Bulk-extract documentation, articles, or blog posts into a structured format.
AI Training Data — Collect clean text from the web for fine-tuning language models.
Content Monitoring — Track changes in competitor content or news articles over time.
Research & Analysis — Extract and analyze articles at scale without manual copy-pasting.
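
For the RAG ingestion use case, one possible preprocessing step is to split each dataset item at Markdown headings and carry the source URL and title along as metadata. This is only a sketch: `chunk_item` and `max_chars` are hypothetical, and the heading split does not account for `#` lines inside fenced code blocks.

```python
import re


def chunk_item(item, max_chars=1000):
    """Split one dataset item's Markdown at headings and attach source metadata."""
    # Zero-width split before every heading line (requires Python 3.7+).
    sections = re.split(r"(?m)^(?=#{1,6} )", item["markdown"])
    chunks = []
    for section in sections:
        text = section.strip()
        if not text:
            continue
        # Fall back to fixed-size slices when a single section is too long.
        for start in range(0, len(text), max_chars):
            chunks.append({
                "text": text[start:start + max_chars],
                "url": item["url"],
                "title": item["title"],
            })
    return chunks
```

The resulting dictionaries can be mapped onto whatever document type the chosen framework expects, with `url` and `title` kept as metadata for citations.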

💻 Usage

Run on the Apify Platform via the UI, or locally:

$ apify run -p
