Web Page → Markdown Converter (Trafilatura, LLM-ready) avatar

Web Page → Markdown Converter (Trafilatura, LLM-ready)

Pricing

Pay per usage

Go to Apify Store
Web Page → Markdown Converter (Trafilatura, LLM-ready)

Web Page → Markdown Converter (Trafilatura, LLM-ready)

Convert any URL to clean Markdown plus structured metadata (title, author, date, lang, image, tags). Uses trafilatura — the same library Common Crawl uses. LLM-ready output. Batch up to 500 URLs. $0.005 per URL.

Pricing

Pay per usage

Rating

0.0

(0)

Developer

Hojun Lee

Hojun Lee

Maintained by Community

Actor stats

0

Bookmarked

2

Total users

1

Monthly active users

2 days ago

Last modified

Share

Web Page → Markdown Converter

Convert any URL to clean Markdown plus structured metadata (title, author, date, lang, image, tags). Uses trafilatura — the same library Common Crawl uses. LLM-ready output. Batch up to 500 URLs. $0.005 per URL.


Why this exists

Most LLM pipelines need clean article-body text — but raw HTML is 60-90% boilerplate (nav, footer, ads, JS, related stories). Existing solutions:

  • Browserless / Puppeteer: complex setup, $30+/mo
  • Mercury Parser: deprecated
  • Diffbot: $299/mo minimum
  • Readability.js: requires running Node

This actor wraps trafilatura — the gold-standard Python library used by Common Crawl and most LLM training pipelines — into a one-call API. Pass a URL list, get clean Markdown + metadata back.


What you get per row

FieldExampleNotes
urlhttps://...input URL
oktruedid extraction succeed
titleBitcoin — Wikipediafrom <title> or og
authorWikipedia contributors
descriptionBitcoin is a cryptocurrency...
date_published2025-12-01
languageenauto-detected
sitenameWikipedia
tags["cryptocurrency", "blockchain"]
categories["Technology"]
imagehttps://...hero image
markdown# Bitcoin\n\nBitcoin is...clean body
char_count48230
word_count7842

Quick start

Single URL

{
"url": "https://en.wikipedia.org/wiki/Bitcoin"
}

Batch of URLs

{
"urls": [
"https://techcrunch.com/article-1",
"https://www.theverge.com/article-2",
"https://www.wired.com/article-3"
],
"includeTables": true,
"deduplicate": true
}

Custom User-Agent (some sites require it)

{
"url": "https://...",
"userAgent": "Mozilla/5.0 (compatible; YourBot/1.0; +https://yourdomain.com/bot)"
}

Pricing

Pay-Per-Event: $0.005 per URL processed.

RunURLsCost
Single article1$0.005
Batch of 100100$0.50
Daily crawl of 1K URLs1000$5.00

Vs Diffbot ($299/mo), Mercury ($199/mo for similar tier), this is 40-60x cheaper for typical volumes.


Common pipeline patterns

Feed to Claude / GPT for summarization

# 1. Extract clean text
curl -X POST "https://api.apify.com/v2/acts/gochujang~web-to-markdown/runs?token=$T" \
-d '{"url":"..."}'
# 2. Pipe markdown to Claude
curl -X POST https://api.anthropic.com/v1/messages \
-d "{\"messages\":[{\"role\":\"user\",\"content\":\"Summarize: $MARKDOWN\"}]}"

RSS-style aggregator

  1. Sitemap URL Discovery to get all article URLs
  2. Filter by lastmod (recent only)
  3. This actor to convert each to Markdown
  4. Store in your DB / Notion / Obsidian

Personal read-it-later

Schedule this actor with your "saved articles" Google Sheet → get clean markdown into Obsidian / Logseq daily.


Use cases

  1. LLM input prep — Clean text for RAG / fine-tuning / summarization
  2. Content curation — Newsletter / digest aggregation
  3. SEO research — Compare clean content across competitors
  4. Archiving — Read-it-later in Markdown format
  5. Translation pipelines — Strip boilerplate before sending to MT

Data source / engine

  • Engine: trafilatura — actively maintained, used by Common Crawl
  • Fallback: Returns ok: false with error message if a page can't be extracted (paywall, JS-heavy SPA without SSR, etc.)

Limitations

  • JS-only sites: Pages that render entirely in client-side JS may return empty markdown. For those, use a browser-rendering actor (Playwright/Puppeteer-based).
  • Paywalls: This actor doesn't bypass paywalls.
  • Comments / discussion sections: Off by default; enable with includeComments: true.


Feedback

A short review helps content/AI engineers find it: Leave a review on Apify Store