# Webpage to Markdown Converter

Convert any webpage into clean, structured Markdown optimized for LLMs, RAG pipelines, AI knowledge bases, and content research. Uses @mozilla/readability to extract main content and strips ads, navigation, footers, and other noise, leaving only the substance.
## What does it do?

Webpage to Markdown Converter takes a list of URLs, fetches each page, extracts the main readable content using Mozilla's battle-tested Readability engine, and converts it to clean Markdown using Turndown. The result is structured JSON with the Markdown content, page title, word count, publication metadata, and per-URL error information for failed requests.
Key capabilities:

- Smart content extraction with @mozilla/readability (the same engine Firefox uses for Reader Mode)
- Clean Markdown output via Turndown: no JavaScript, no browser needed
- Fast HTTP-only processing: 256 MB of memory, no proxy needed
- Rich metadata: author, description, published date, site name, language
- Graceful error handling: bad URLs never crash the run; errors are captured per URL
- Configurable: toggle images and links, set content length limits
## Who is it for?

- **AI/LLM developers** building RAG pipelines, vector databases, or knowledge bases who need clean text from URLs without building their own scraper infrastructure.
- **Content researchers** collecting and analyzing web content for training data, competitor analysis, or documentation aggregation.
- **Data engineers** building automated content-processing pipelines that need to ingest web pages as structured data.
- **No-code users** on Make, Zapier, or n8n who want to convert webpages to text as part of automation workflows.
## Why use it?

| Feature | This Actor | Competitors |
|---|---|---|
| Price per page | $0.002 | $0.005–$0.05 |
| Content extraction | Mozilla Readability (smart) | Basic HTML strip |
| Memory needed | 256 MB | 256–2048 MB |
| Metadata fields | 5 fields (author, description, siteName, date, lang) | None |
| Error details | Per-URL status code + message | Crash or skip silently |
| Word count | Yes | No |

The top competitor charges $0.05/page, 25x more for less output.
## Output data

For each URL, you receive a structured JSON object:

| Field | Type | Description |
|---|---|---|
| url | string | Input URL |
| title | string | Page title from HTML/Readability |
| markdown | string | Clean Markdown content |
| wordCount | integer | Word count of the Markdown |
| extractedAt | string | ISO 8601 timestamp |
| metadata.author | string\|null | Author from meta tags |
| metadata.description | string\|null | Meta description |
| metadata.siteName | string\|null | Site name (og:site_name) |
| metadata.publishedDate | string\|null | Publication date |
| metadata.language | string\|null | Content language code |
| statusCode | integer\|null | HTTP response code |
| success | boolean | Whether conversion succeeded |
| error | string\|null | Error message (null on success) |
## How much does it cost to convert webpages to Markdown?

$0.002 per successfully converted page. Failed URLs (404s, timeouts) are not charged.

Examples:

- 100 pages → ~$0.20
- 1,000 pages → ~$2.00
- 10,000 pages (a monthly RAG pipeline) → ~$20.00

This is 25x cheaper than the most popular competitor, with richer output and smarter content extraction.
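The cost math is just successful pages times the per-page rate; a minimal sketch, using the $0.002 rate stated above (the helper name is ours):

```python
# Quick cost estimate: only successfully converted pages are billed.
PRICE_PER_PAGE_USD = 0.002  # per-page rate from the pricing above

def estimate_cost(successful_pages: int) -> float:
    return successful_pages * PRICE_PER_PAGE_USD

print(estimate_cost(1_000))   # 2.0
print(estimate_cost(10_000))  # 20.0
```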
## How to use

### Step 1: Provide URLs

Add URLs in the URLs to convert field. You can add as many as you need; the actor processes them sequentially.

### Step 2: Configure options (optional)

- Include images: keep or strip image links in the Markdown output
- Include links: keep or strip hyperlinks (useful for plain-text LLM input)
- Max content length: limit Markdown characters per page (useful for LLM token budgets)

### Step 3: Run and retrieve results

Start the actor. Results appear in the dataset in real time. Download them as JSON, CSV, or JSONL.
## Input

```json
{
  "urls": [
    "https://en.wikipedia.org/wiki/Markdown",
    "https://docs.python.org/3/tutorial/",
    "https://news.ycombinator.com"
  ],
  "includeImages": true,
  "includeLinks": true,
  "maxContentLength": 0,
  "requestTimeout": 30,
  "maxRetries": 2
}
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| urls | string[] | required | List of URLs to convert |
| includeImages | boolean | true | Include image links in Markdown |
| includeLinks | boolean | true | Include hyperlinks in Markdown |
| maxContentLength | integer | 0 (unlimited) | Max characters per page (0 = unlimited) |
| requestTimeout | integer | 30 | HTTP timeout in seconds |
| maxRetries | integer | 2 | Retry attempts on network errors |
## Output

```json
{
  "url": "https://en.wikipedia.org/wiki/Markdown",
  "title": "Markdown",
  "markdown": "From Wikipedia, the free encyclopedia\n\n## Overview\n\nMarkdown is a lightweight markup language...",
  "wordCount": 2859,
  "extractedAt": "2026-04-06T12:00:00.000Z",
  "metadata": {
    "author": "Contributors to Wikimedia projects",
    "description": null,
    "siteName": "Wikimedia Foundation, Inc.",
    "publishedDate": "2005-08-09T19:56:00Z",
    "language": "en"
  },
  "statusCode": 200,
  "success": true,
  "error": null
}
```

Failed URL output:

```json
{
  "url": "https://example.com/page-not-found",
  "title": null,
  "markdown": null,
  "wordCount": 0,
  "extractedAt": "2026-04-06T12:00:01.000Z",
  "metadata": {
    "author": null,
    "description": null,
    "siteName": null,
    "publishedDate": null,
    "language": null
  },
  "statusCode": 404,
  "success": false,
  "error": "HTTP 404: Not Found"
}
```
## Tips

For LLM/RAG pipelines:

- Set `includeImages: false` and `includeLinks: false` for cleaner text input
- Use `maxContentLength` to match your LLM's context window (e.g., `50000` chars ≈ ~12k tokens)
- The `wordCount` field helps you estimate token usage before sending to an LLM (see the sketch after this list)
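A rough token estimate from `wordCount`, assuming the common ~1.3 tokens-per-word heuristic for English text (the ratio is our assumption, not something the actor reports):

```python
# Rough token estimate from the actor's wordCount field. The 1.3 tokens-per-word
# ratio is a common heuristic for English text, not an exact tokenizer count;
# verify with your model's tokenizer if the budget is tight.
TOKENS_PER_WORD = 1.3

def estimated_tokens(word_count: int) -> int:
    return int(word_count * TOKENS_PER_WORD)

print(estimated_tokens(2859))  # ~3716 tokens for the Wikipedia example above
```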
For content research:

- Keep both images and links enabled (default) for full-fidelity Markdown
- The `metadata.publishedDate` field is useful for freshness filtering (see the sketch after this list)
- Failed URLs are always included in results (with `success: false`), so you know exactly what didn't work
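A sketch of freshness filtering on `metadata.publishedDate`. Dates may be null or come in varying ISO 8601 forms, so the parse is defensive; the 90-day cutoff is an arbitrary example:

```python
from datetime import datetime, timedelta, timezone

CUTOFF = datetime.now(timezone.utc) - timedelta(days=90)  # arbitrary example window

def is_fresh(item: dict) -> bool:
    published = item["metadata"]["publishedDate"]
    if not published:
        return False  # many pages expose no date in their meta tags
    try:
        # fromisoformat accepts UTC offsets; normalize a trailing 'Z' first.
        dt = datetime.fromisoformat(published.replace("Z", "+00:00"))
    except ValueError:
        return False  # unparseable date strings are treated as stale
    return dt >= CUTOFF

# fresh = [item for item in items if item["success"] and is_fresh(item)]
```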
For Wikipedia / documentation:

- These convert especially well; Readability excels at article-format content
- Table content is preserved as Markdown tables
Performance:

- The actor processes URLs sequentially; for large batches (1,000+ URLs), consider running multiple instances in parallel via the API (see the fan-out sketch below)
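A sketch of that fan-out with `apify_client`: split the URL list into chunks and start one run per chunk. The chunk size of 200 is an arbitrary example, and `.start()` returns immediately rather than blocking like `.call()`:

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")
CHUNK_SIZE = 200  # arbitrary example; tune to your batch size

def chunks(seq: list, size: int):
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

urls = ["https://example.com/a", "https://example.com/b"]  # your full URL list
runs = []
for batch in chunks(urls, CHUNK_SIZE):
    # .start() launches the run without waiting; .call() would block per run.
    actor = client.actor("automation-lab/webpage-to-markdown-converter")
    runs.append(actor.start(run_input={"urls": batch}))
```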
## Integrations

### Zapier / Make / n8n

Use the Apify integration to trigger this actor from any workflow:

- Add an Apify step with actor `automation-lab/webpage-to-markdown-converter`
- Pass your URL list as input
- Read results from the dataset output
### LangChain

```python
from apify_client import ApifyClient
from langchain_core.documents import Document

client = ApifyClient("YOUR_APIFY_TOKEN")
run = client.actor("automation-lab/webpage-to-markdown-converter").call(
    run_input={"urls": ["https://example.com"], "includeLinks": False}
)

docs = []
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    if item["success"]:
        # Wrap each converted page as a LangChain Document.
        docs.append(Document(page_content=item["markdown"], metadata={"source": item["url"]}))
```
### LlamaIndex

```python
from llama_index.core import Document, VectorStoreIndex

# dataset_items is the list of records fetched from the run's dataset.
# Keep converted pages and wrap them as LlamaIndex documents.
pages = [item for item in dataset_items if item["success"]]
documents = [
    Document(text=p["markdown"], metadata={"url": p["url"], "title": p["title"]})
    for p in pages
]
index = VectorStoreIndex.from_documents(documents)
```
### Pinecone / Weaviate

The `markdown` field feeds directly into any embedding pipeline. The `wordCount` field helps you batch-split documents that exceed embedding-model token limits.
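A sketch of that size-based splitting before embedding, using paragraph boundaries and a character budget (the 4,000-character budget, roughly ~1k tokens, is our assumption; any text splitter works here):

```python
# Split a page's Markdown into chunks under a character budget before embedding.
# The 4,000-char budget (~1k tokens) is an example; size it to your embedding model.
MAX_CHARS = 4_000

def split_markdown(markdown: str, max_chars: int = MAX_CHARS) -> list[str]:
    chunks: list[str] = []
    current = ""
    for para in markdown.split("\n\n"):  # paragraph boundaries keep chunks coherent
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    # Note: a single paragraph longer than max_chars passes through intact.
    return chunks
```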
## API Usage

### Node.js

```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

const run = await client.actor('automation-lab/webpage-to-markdown-converter').call({
    urls: ['https://en.wikipedia.org/wiki/Web_scraping', 'https://docs.apify.com'],
    includeImages: false,
    maxContentLength: 50000,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
for (const item of items) {
    if (item.success) {
        console.log(`${item.url}: ${item.wordCount} words`);
        console.log(item.markdown.substring(0, 500));
    }
}
```

### Python

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("automation-lab/webpage-to-markdown-converter").call(run_input={
    "urls": ["https://en.wikipedia.org/wiki/Web_scraping", "https://docs.apify.com"],
    "includeImages": False,
    "maxContentLength": 50000,
})

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    if item["success"]:
        print(f"{item['url']}: {item['wordCount']} words")
```

### cURL

```bash
curl -X POST "https://api.apify.com/v2/acts/automation-lab~webpage-to-markdown-converter/runs?token=YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://en.wikipedia.org/wiki/Web_scraping"], "includeImages": false}'
```
## MCP (Claude Code & Desktop)

Use this actor directly from Claude Code or Claude Desktop via the Apify MCP server.

Claude Code (run in your terminal; `apify` is the server name you choose):

```bash
claude mcp add --transport http apify "https://mcp.apify.com?tools=automation-lab/webpage-to-markdown-converter"
```

Claude Desktop (add to your claude_desktop_config.json):

```json
{
  "mcpServers": {
    "apify": {
      "command": "npx",
      "args": ["-y", "@apify/actors-mcp-server"],
      "env": {
        "APIFY_TOKEN": "YOUR_APIFY_TOKEN",
        "ACTORS": "automation-lab/webpage-to-markdown-converter"
      }
    }
  }
}
```
Example Claude prompts:
- "Convert https://docs.python.org/3/tutorial/ to Markdown for my knowledge base"
- "Fetch these 5 URLs and convert them to clean text for RAG ingestion"
- "Extract the article content from this news page without images or links"
## Legality

This actor fetches publicly accessible webpages using standard HTTP requests (no browser automation, no captcha bypassing). It is the user's responsibility to comply with the target website's Terms of Service and robots.txt. Extracted content is for the user's own use; ensure compliance with copyright laws when storing or redistributing it.
## FAQ

**Q: Does it work on JavaScript-rendered pages?**
A: No. This actor uses HTTP-only requests for speed and cost efficiency. For JavaScript-rendered pages (React/Vue SPAs), you need a browser-based crawler that renders JavaScript before extracting content.

**Q: Why does my page return partial content?**
A: Some pages serve different content to bots. Try increasing `requestTimeout`. If the page relies heavily on JavaScript for content rendering, it will not work with this actor.

**Q: The Markdown has a lot of links/navigation. How do I fix it?**
A: Set `includeLinks: false` to strip all hyperlinks. The Readability engine should already remove most navigation; if you're still seeing noise, the page may have an unusual structure.

**Q: Can I convert PDFs or other file types?**
A: No. This actor only processes HTML pages; PDF conversion requires a different tool.

**Q: How many URLs can I process per run?**
A: There is no hard limit; the actor processes URLs sequentially with a 600-second timeout. For very large batches (1,000+ URLs), consider splitting across multiple runs.

**Q: Is my data private?**
A: Yes. Results are stored in your private Apify dataset; no extracted content is shared with or retained by the actor developer.
## Related actors

- Color Contrast Checker: validate WCAG 2.1 AA/AAA color contrast for accessibility
- JSON Schema Generator: generate JSON schemas from sample data