Web Page → Markdown Converter (Trafilatura, LLM-ready)
Pricing
Pay per usage
Web Page → Markdown Converter (Trafilatura, LLM-ready)
Convert any URL to clean Markdown plus structured metadata (title, author, date, lang, image, tags). Uses trafilatura — the same library Common Crawl uses. LLM-ready output. Batch up to 500 URLs. $0.005 per URL.
Pricing
Pay per usage
Rating
0.0
(0)
Developer
Hojun Lee
Maintained by CommunityActor stats
0
Bookmarked
2
Total users
1
Monthly active users
2 days ago
Last modified
Categories
Share
Web Page → Markdown Converter
Convert any URL to clean Markdown plus structured metadata (title, author, date, lang, image, tags). Uses trafilatura — the same library Common Crawl uses. LLM-ready output. Batch up to 500 URLs. $0.005 per URL.
Why this exists
Most LLM pipelines need clean article-body text — but raw HTML is 60-90% boilerplate (nav, footer, ads, JS, related stories). Existing solutions:
- Browserless / Puppeteer: complex setup, $30+/mo
- Mercury Parser: deprecated
- Diffbot: $299/mo minimum
- Readability.js: requires running Node
This actor wraps trafilatura — the gold-standard Python library used by Common Crawl and most LLM training pipelines — into a one-call API. Pass a URL list, get clean Markdown + metadata back.
What you get per row
| Field | Example | Notes |
|---|---|---|
url | https://... | input URL |
ok | true | did extraction succeed |
title | Bitcoin — Wikipedia | from <title> or og |
author | Wikipedia contributors | |
description | Bitcoin is a cryptocurrency... | |
date_published | 2025-12-01 | |
language | en | auto-detected |
sitename | Wikipedia | |
tags | ["cryptocurrency", "blockchain"] | |
categories | ["Technology"] | |
image | https://... | hero image |
markdown | # Bitcoin\n\nBitcoin is... | clean body |
char_count | 48230 | |
word_count | 7842 |
Quick start
Single URL
{"url": "https://en.wikipedia.org/wiki/Bitcoin"}
Batch of URLs
{"urls": ["https://techcrunch.com/article-1","https://www.theverge.com/article-2","https://www.wired.com/article-3"],"includeTables": true,"deduplicate": true}
Custom User-Agent (some sites require it)
{"url": "https://...","userAgent": "Mozilla/5.0 (compatible; YourBot/1.0; +https://yourdomain.com/bot)"}
Pricing
Pay-Per-Event: $0.005 per URL processed.
| Run | URLs | Cost |
|---|---|---|
| Single article | 1 | $0.005 |
| Batch of 100 | 100 | $0.50 |
| Daily crawl of 1K URLs | 1000 | $5.00 |
Vs Diffbot ($299/mo), Mercury ($199/mo for similar tier), this is 40-60x cheaper for typical volumes.
Common pipeline patterns
Feed to Claude / GPT for summarization
# 1. Extract clean textcurl -X POST "https://api.apify.com/v2/acts/gochujang~web-to-markdown/runs?token=$T" \-d '{"url":"..."}'# 2. Pipe markdown to Claudecurl -X POST https://api.anthropic.com/v1/messages \-d "{\"messages\":[{\"role\":\"user\",\"content\":\"Summarize: $MARKDOWN\"}]}"
RSS-style aggregator
- Sitemap URL Discovery to get all article URLs
- Filter by lastmod (recent only)
- This actor to convert each to Markdown
- Store in your DB / Notion / Obsidian
Personal read-it-later
Schedule this actor with your "saved articles" Google Sheet → get clean markdown into Obsidian / Logseq daily.
Use cases
- LLM input prep — Clean text for RAG / fine-tuning / summarization
- Content curation — Newsletter / digest aggregation
- SEO research — Compare clean content across competitors
- Archiving — Read-it-later in Markdown format
- Translation pipelines — Strip boilerplate before sending to MT
Data source / engine
- Engine: trafilatura — actively maintained, used by Common Crawl
- Fallback: Returns
ok: falsewith error message if a page can't be extracted (paywall, JS-heavy SPA without SSR, etc.)
Limitations
- JS-only sites: Pages that render entirely in client-side JS may return empty markdown. For those, use a browser-rendering actor (Playwright/Puppeteer-based).
- Paywalls: This actor doesn't bypass paywalls.
- Comments / discussion sections: Off by default; enable with
includeComments: true.
Related actors (same author)
- HTML Metadata Extractor — Just metadata (OG, Twitter, JSON-LD) without article body
- Sitemap URL Discovery — Find all URLs to feed into this actor
- PDF Text Extractor — PDF version
- JSON Schema Generator
Feedback
A short review helps content/AI engineers find it: Leave a review on Apify Store