Pricing

from $5.00 / 1,000 url extracteds

Web Page → Markdown Converter (Trafilatura, LLM-ready)

Convert any URL to clean Markdown plus structured metadata (title, author, date, lang, image, tags). Uses trafilatura — the same library Common Crawl uses. LLM-ready output. Batch up to 500 URLs. $0.005 per URL.

Pricing

from $5.00 / 1,000 url extracteds

Rating

0.0

(0)

Developer

Hojun Lee

Actor stats

Bookmarked

Total users

Monthly active users

9 days ago

Last modified

Web Page → Markdown Converter

Convert any URL to clean Markdown plus structured metadata (title, author, date, lang, image, tags). Uses trafilatura — the same library Common Crawl uses. LLM-ready output. Batch up to 500 URLs. $0.005 per URL.

⚡ Run in 30 seconds

Click Start with default settings — fetches the example URL and returns clean Markdown text plus title, author, publication date, language, and image metadata, all ready to pipe into an LLM. No API key needed.

Input Parameters

Parameter	Type	Default	Description
`urls`	array	`[]`	List of URLs to convert. Each is billed separately.
`url`	string	``	Single URL (used when 'urls' is empty).
`includeComments`	boolean	`false`	Include reader comments / discussion sections.
`includeTables`	boolean	`true`	Render HTML tables as Markdown tables.
`deduplicate`	boolean	`true`	Drop repeated boilerplate (nav, footer, ads).
`userAgent`	string	``	Custom UA string. Default looks like a desktop browser.

Why this exists

Most LLM pipelines need clean article-body text — but raw HTML is 60-90% boilerplate (nav, footer, ads, JS, related stories). Existing solutions:

Browserless / Puppeteer: complex setup, $30+/mo
Mercury Parser: deprecated
Diffbot: $299/mo minimum
Readability.js: requires running Node

This actor wraps trafilatura — the gold-standard Python library used by Common Crawl and most LLM training pipelines — into a one-call API. Pass a URL list, get clean Markdown + metadata back.

What you get per row

Field	Example	Notes
`url`	`https://...`	input URL
`ok`	`true`	did extraction succeed
`title`	`Bitcoin — Wikipedia`	from `<title>` or og
`author`	`Wikipedia contributors`
`description`	`Bitcoin is a cryptocurrency...`
`date_published`	`2025-12-01`
`language`	`en`	auto-detected
`sitename`	`Wikipedia`
`tags`	`["cryptocurrency", "blockchain"]`
`categories`	`["Technology"]`
`image`	`https://...`	hero image
`markdown`	`# Bitcoin\n\nBitcoin is...`	clean body
`char_count`	`48230`
`word_count`	`7842`

Quick start

Single URL

{
  "url": "https://en.wikipedia.org/wiki/Bitcoin"
}

Batch of URLs

{
  "urls": [
    "https://techcrunch.com/article-1",
    "https://www.theverge.com/article-2",
    "https://www.wired.com/article-3"
  ],
  "includeTables": true,
  "deduplicate": true
}

Custom User-Agent (some sites require it)

{
  "url": "https://...",
  "userAgent": "Mozilla/5.0 (compatible; YourBot/1.0; +https://yourdomain.com/bot)"
}

Pricing

Pay-Per-Event: $0.005 per URL processed.

Run	URLs	Cost
Single article	1	$0.005
Batch of 100	100	$0.50
Daily crawl of 1K URLs	1000	$5.00

Vs Diffbot ($299/mo), Mercury ($199/mo for similar tier), this is 40-60x cheaper for typical volumes.

Common pipeline patterns

Feed to Claude / GPT for summarization

# 1. Extract clean text
curl -X POST "https://api.apify.com/v2/acts/gochujang~web-to-markdown/runs?token=$T" \
  -d '{"url":"..."}'

# 2. Pipe markdown to Claude
curl -X POST https://api.anthropic.com/v1/messages \
  -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Summarize: $MARKDOWN\"}]}"

RSS-style aggregator

Sitemap URL Discovery to get all article URLs
Filter by lastmod (recent only)
This actor to convert each to Markdown
Store in your DB / Notion / Obsidian

Personal read-it-later

Schedule this actor with your "saved articles" Google Sheet → get clean markdown into Obsidian / Logseq daily.

Use cases

LLM input prep — Clean text for RAG / fine-tuning / summarization
Content curation — Newsletter / digest aggregation
SEO research — Compare clean content across competitors
Archiving — Read-it-later in Markdown format
Translation pipelines — Strip boilerplate before sending to MT

Data source / engine

Engine: trafilatura — actively maintained, used by Common Crawl
Fallback: Returns ok: false with error message if a page can't be extracted (paywall, JS-heavy SPA without SSR, etc.)

Limitations

JS-only sites: Pages that render entirely in client-side JS may return empty markdown. For those, use a browser-rendering actor (Playwright/Puppeteer-based).
Paywalls: This actor doesn't bypass paywalls.
Comments / discussion sections: Off by default; enable with includeComments: true.

HTML Metadata Extractor — Just metadata (OG, Twitter, JSON-LD) without article body
Sitemap URL Discovery — Find all URLs to feed into this actor
PDF Text Extractor — PDF version
JSON Schema Generator

Feedback

A short review helps content/AI engineers find it: Leave a review on Apify Store

Website to Markdown Converter

lofomachines/website-to-markdown-converter

Best faster and cheaper way to convert any web page into clean, structured, LLM-ready Markdown.

Lofomachines

URL to Clean Markdown — for AI Agents

great_saint/url-to-markdown

Convert any web page into clean, LLM-ready Markdown. Charged only on successful conversion. MCP & x402 ready.

Öge

AI Web Extractor: URL → Clean Markdown + JSON for LLM/RAG

boxbox10/ai-web-extractor

Turn any URL into clean, LLM-ready Markdown + structured JSON (title, headings, main content, links, metadata, token count). Perfect for RAG pipelines, AI agents, and LLM context.

Marvin Eguilos

LLM-Ready Web Extractor: URL to Clean Markdown & JSON

f0rty7even/llm-web-extractor

Turn any web page or site into clean, LLM-ready Markdown and structured JSON for RAG, agents, and fine-tuning. Strips nav/ads/boilerplate; returns main content + metadata.

Michael Yousrie

URL to Markdown MCP

reverberant_equality/mcp-url-to-markdown

Convert any web page to clean markdown for AI agents. Uses Firefox Reader Mode engine for content extraction. Perfect for RAG pipelines, research, and LLM content ingestion.

Jordan C

Site to Markdown — any site to clean, LLM-ready markdown

topsail/site-to-markdown

Scrape any website to clean, LLM-ready markdown — a compliant Firecrawl alternative for RAG ingestion, robots.txt always on.

Connor Teskey

Website to Markdown – Clean LLM & RAG Content Extractor

dataquarry/website-to-markdown

Convert any public web page to clean, LLM-ready Markdown with metadata — by URL, a list of URLs, or a whole-site crawl. Strips nav/ads/boilerplate, keeps headings/lists/tables/code. Respects robots.txt. No API key.

Daniel Brenner

Web to Markdown — AI-Ready Text from Any URL

wsgcjj/web-to-markdown

Convert any web page URL to clean Markdown format. Perfect for LLM training data, RAG pipelines, and AI content processing. Extracts main content, strips ads/nav/footers.

陈俊杰

Website Content to Markdown for LLM Training

easyapi/website-content-to-markdown-for-llm-training

🚀 Transform web content into clean, LLM-ready Markdown! 📘 Scrape multiple pages, extract main content, and convert to Markdown format. Perfect for AI researchers, data scientists, and LLM developers. Fast, efficient, and customizable. Supercharge your AI training data today! 🌐📝🧠

EasyApi

329

5.0

Website to Markdown

cool_ya/website-to-markdown

Convert any web page into clean, LLM-ready Markdown. Strips nav, ads and boilerplate and returns the main article text plus title, description and word count. Perfect for RAG and AI pipelines.