
📄 Webpage to Markdown Converter

Convert any webpage into clean, structured Markdown optimized for LLMs, RAG pipelines, AI knowledge bases, and content research. Uses @mozilla/readability to extract main content and strips ads, navigation, footers, and other noise — giving you only the substance.

πŸ” What does it do?

Webpage to Markdown Converter takes a list of URLs, fetches each page, extracts the main readable content using Mozilla's battle-tested Readability engine, and converts it to clean Markdown using Turndown. The result is structured JSON with the Markdown content, page title, word count, publication metadata, and error information for failed URLs.
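
The per-URL record-assembly step can be sketched in plain Python. This is illustrative only — `build_result` is a hypothetical name, not the actor's actual internal function, and the fetch/extraction stages are omitted; field names match the Output table further down:

```python
from datetime import datetime, timezone

def build_result(url, markdown=None, title=None, status_code=None, error=None, metadata=None):
    """Assemble one per-URL dataset record (field names match the Output table)."""
    empty_meta = {"author": None, "description": None, "siteName": None,
                  "publishedDate": None, "language": None}
    return {
        "url": url,
        "title": title,
        "markdown": markdown,
        # Word count is derived from the converted Markdown text
        "wordCount": len(markdown.split()) if markdown else 0,
        "extractedAt": datetime.now(timezone.utc)
            .isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        "metadata": metadata or empty_meta,
        "statusCode": status_code,
        # A record counts as successful only if content was extracted without error
        "success": error is None and markdown is not None,
        "error": error,
    }
```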

Key capabilities:

  • 🧠 Smart content extraction with @mozilla/readability (same engine Firefox uses for Reader Mode)
  • 📝 Clean Markdown output via Turndown — no JavaScript, no browser needed
  • 🚀 Fast HTTP-only processing — 256 MB memory, no proxy needed
  • 🗃️ Rich metadata: author, description, published date, site name, language
  • ⚠️ Graceful error handling — bad URLs never crash the run; errors are captured per URL
  • ⚙️ Configurable: toggle images and links, set content length limits

👀 Who is it for?

AI/LLM Developers building RAG pipelines, vector databases, or knowledge bases who need clean text from URLs without building their own scraper infrastructure.

Content Researchers collecting and analyzing web content for training data, competitor analysis, or documentation aggregation.

Data Engineers building automated content processing pipelines that need to ingest web pages as structured data.

No-code users on Make, Zapier, or n8n who want to convert webpages to text as part of automation workflows.

💡 Why use it?

| Feature | This Actor | Competitors |
|---|---|---|
| Price per page | $0.002 | $0.005–$0.05 |
| Content extraction | Mozilla Readability (smart) | Basic HTML strip |
| Memory needed | 256 MB | 256–2048 MB |
| Metadata fields | 5 fields (author, description, siteName, date, lang) | None |
| Error details | Per-URL status code + message | Crash or skip silently |
| Word count | ✅ Yes | ❌ No |

The top competitor charges $0.05/page — 25x more for less output.

📊 Output data

For each URL, you receive a structured JSON object:

| Field | Type | Description |
|---|---|---|
| url | string | Input URL |
| title | string | Page title from HTML/Readability |
| markdown | string | Clean Markdown content |
| wordCount | integer | Word count of the Markdown |
| extractedAt | string | ISO 8601 timestamp |
| metadata.author | string \| null | Author from meta tags |
| metadata.description | string \| null | Meta description |
| metadata.siteName | string \| null | Site name (og:site_name) |
| metadata.publishedDate | string \| null | Publication date |
| metadata.language | string \| null | Content language code |
| statusCode | integer \| null | HTTP response code |
| success | boolean | Whether conversion succeeded |
| error | string \| null | Error message (null on success) |

💰 How much does it cost to convert webpages to Markdown?

$0.002 per successfully converted page. Failed URLs (404s, timeouts) are not charged.

Examples:

  • 100 pages → ~$0.20
  • 1,000 pages → ~$2.00
  • 10,000 pages (monthly RAG pipeline) → ~$20.00
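
The arithmetic above can be captured in a tiny budgeting helper (`estimate_cost` is a hypothetical name; the $0.002/page rate is taken from this page):

```python
PRICE_PER_PAGE_USD = 0.002  # charged only for successfully converted pages

def estimate_cost(n_pages, success_rate=1.0):
    """Estimate run cost in USD; failed URLs (404s, timeouts) are not charged."""
    return n_pages * success_rate * PRICE_PER_PAGE_USD
```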

This is 25x cheaper than the most popular competitor, with richer output and smarter content extraction.

🚀 How to use

Step 1: Provide URLs

Add URLs in the URLs to convert field. You can add as many as you need — the actor processes them sequentially.

Step 2: Configure options (optional)

  • Include images: Keep or strip image links in the Markdown output
  • Include links: Keep or strip hyperlinks (useful for plain-text LLM input)
  • Max content length: Limit Markdown chars per page (useful for LLM token budgets)

Step 3: Run and retrieve results

Start the actor. Results appear in the dataset in real-time. Download as JSON, CSV, or JSONL.

📥 Input

```json
{
  "urls": [
    "https://en.wikipedia.org/wiki/Markdown",
    "https://docs.python.org/3/tutorial/",
    "https://news.ycombinator.com"
  ],
  "includeImages": true,
  "includeLinks": true,
  "maxContentLength": 0,
  "requestTimeout": 30,
  "maxRetries": 2
}
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| urls | string[] | required | List of URLs to convert |
| includeImages | boolean | true | Include image links in Markdown |
| includeLinks | boolean | true | Include hyperlinks in Markdown |
| maxContentLength | integer | 0 (unlimited) | Max chars per page (0 = unlimited) |
| requestTimeout | integer | 30 | HTTP timeout in seconds |
| maxRetries | integer | 2 | Retry attempts on network errors |

📤 Output

```json
{
  "url": "https://en.wikipedia.org/wiki/Markdown",
  "title": "Markdown",
  "markdown": "From Wikipedia, the free encyclopedia\n\n## Overview\n\nMarkdown is a lightweight markup language...",
  "wordCount": 2859,
  "extractedAt": "2026-04-06T12:00:00.000Z",
  "metadata": {
    "author": "Contributors to Wikimedia projects",
    "description": null,
    "siteName": "Wikimedia Foundation, Inc.",
    "publishedDate": "2005-08-09T19:56:00Z",
    "language": "en"
  },
  "statusCode": 200,
  "success": true,
  "error": null
}
```

Failed URL output:

```json
{
  "url": "https://example.com/page-not-found",
  "title": null,
  "markdown": null,
  "wordCount": 0,
  "extractedAt": "2026-04-06T12:00:01.000Z",
  "metadata": { "author": null, "description": null, "siteName": null, "publishedDate": null, "language": null },
  "statusCode": 404,
  "success": false,
  "error": "HTTP 404: Not Found"
}
```

💡 Tips

For LLM/RAG pipelines:

  • Set includeImages: false and includeLinks: false for cleaner text input
  • Use maxContentLength to match your LLM's context window (e.g., 50,000 chars ≈ 12k tokens)
  • The wordCount field helps you estimate token usage before sending to an LLM
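
Both estimates can be sketched as simple helpers. The ratios (~4 chars/token, ~1.3 tokens/word) are common rules of thumb for English text, not exact for any particular tokenizer:

```python
def estimate_tokens_from_chars(n_chars, chars_per_token=4):
    """Rough token estimate from character count (~4 chars/token for English)."""
    return n_chars // chars_per_token

def estimate_tokens_from_words(word_count, tokens_per_word=1.3):
    """Rough token estimate from the actor's wordCount field."""
    return round(word_count * tokens_per_word)
```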

For content research:

  • Keep both images and links enabled (default) for full-fidelity Markdown
  • The metadata.publishedDate field is useful for freshness filtering
  • Failed URLs are always included in results (with success: false), so you know exactly what didn't work
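
A minimal freshness filter over dataset items might look like this, assuming publishedDate is an ISO 8601 string as in the sample output (`is_fresh` is an illustrative helper, not part of the actor):

```python
from datetime import datetime, timezone, timedelta

def is_fresh(item, max_age_days=365):
    """Return True if the item's metadata.publishedDate is within max_age_days of now."""
    date_str = (item.get("metadata") or {}).get("publishedDate")
    if not date_str:
        return False  # no date available -> treat as not fresh
    try:
        published = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
    except ValueError:
        return False  # unparseable date
    if published.tzinfo is None:
        published = published.replace(tzinfo=timezone.utc)
    return datetime.now(timezone.utc) - published <= timedelta(days=max_age_days)
```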

For Wikipedia / documentation:

  • These convert especially well — Readability excels at article-format content
  • Table content is preserved as Markdown tables

Performance:

  • The actor processes URLs sequentially; for large batches (1000+), consider running multiple instances in parallel via the API
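
A sketch of that batch-splitting approach (`chunk_urls` is a hypothetical helper; the commented launch loop assumes the apify-client library's start() method, which begins a run without waiting for it to finish):

```python
def chunk_urls(urls, batch_size=250):
    """Split a URL list into batches suitable for separate parallel actor runs."""
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]

# Illustrative parallel launch (requires apify-client and a configured client):
# for batch in chunk_urls(all_urls):
#     client.actor("automation-lab/webpage-to-markdown-converter").start(
#         run_input={"urls": batch})
```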

🔗 Integrations

Zapier / Make / n8n

Use the Apify integration to trigger this actor from any workflow:

  1. Add an Apify step with actor automation-lab/webpage-to-markdown-converter
  2. Pass your URL list as input
  3. Read results from the dataset output

LangChain

```python
from apify_client import ApifyClient
from langchain_core.documents import Document

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("automation-lab/webpage-to-markdown-converter").call(run_input={
    "urls": ["https://example.com"],
    "includeLinks": False,
})

docs = []
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    if item["success"]:
        # Wrap each converted page as a LangChain Document
        docs.append(Document(page_content=item["markdown"], metadata={"source": item["url"]}))
```

LlamaIndex

```python
from llama_index.core import Document, VectorStoreIndex

# dataset_items: results fetched from the actor's dataset (see API Usage below)
pages = [item for item in dataset_items if item["success"]]
documents = [Document(text=p["markdown"], metadata={"url": p["url"], "title": p["title"]}) for p in pages]
index = VectorStoreIndex.from_documents(documents)
```

Pinecone / Weaviate

The markdown field feeds directly into any embedding pipeline. The wordCount helps you batch-split documents that exceed embedding model token limits.
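
A minimal word-window splitter along those lines (a sketch — embedding models count tokens rather than words, so size the windows conservatively; `split_markdown` is an illustrative name):

```python
def split_markdown(markdown, max_words=300, overlap=50):
    """Split Markdown into overlapping word-window chunks for embedding."""
    assert max_words > overlap >= 0, "window must be larger than the overlap"
    words = markdown.split()
    step = max_words - overlap  # consecutive chunks share `overlap` words
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), step)]
```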

🤖 API Usage

Node.js

```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

const run = await client.actor('automation-lab/webpage-to-markdown-converter').call({
  urls: ['https://en.wikipedia.org/wiki/Web_scraping', 'https://docs.apify.com'],
  includeImages: false,
  maxContentLength: 50000,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
for (const item of items) {
  if (item.success) {
    console.log(`${item.url}: ${item.wordCount} words`);
    console.log(item.markdown.substring(0, 500));
  }
}
```

Python

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("automation-lab/webpage-to-markdown-converter").call(run_input={
    "urls": ["https://en.wikipedia.org/wiki/Web_scraping", "https://docs.apify.com"],
    "includeImages": False,
    "maxContentLength": 50000,
})

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    if item["success"]:
        print(f"{item['url']}: {item['wordCount']} words")
```

cURL

```bash
curl -X POST "https://api.apify.com/v2/acts/automation-lab~webpage-to-markdown-converter/runs?token=YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://en.wikipedia.org/wiki/Web_scraping"],
    "includeImages": false
  }'
```

πŸ§‘β€πŸ’» MCP (Claude Code & Desktop)

Use this actor directly from Claude Code or Claude Desktop via the Apify MCP server:

Claude Code — run in your terminal:

```bash
claude mcp add --transport http apify "https://mcp.apify.com?tools=automation-lab/webpage-to-markdown-converter"
```

Claude Desktop — add to your claude_desktop_config.json:

```json
{
  "mcpServers": {
    "apify": {
      "command": "npx",
      "args": ["-y", "@apify/actors-mcp-server"],
      "env": {
        "APIFY_TOKEN": "YOUR_APIFY_TOKEN",
        "ACTORS": "automation-lab/webpage-to-markdown-converter"
      }
    }
  }
}
```

Example Claude prompts:

  • "Convert https://docs.python.org/3/tutorial/ to Markdown for my knowledge base"
  • "Fetch these 5 URLs and convert them to clean text for RAG ingestion"
  • "Extract the article content from this news page without images or links"

βš–οΈ Legality

This actor fetches publicly accessible webpages using standard HTTP requests (no browser automation, no captcha bypassing). It is the user's responsibility to comply with the target website's Terms of Service and robots.txt. Extracted content is for the user's own use — ensure compliance with copyright law when storing or redistributing it.

❓ FAQ

Q: Does it work on JavaScript-rendered pages? A: No — this actor uses HTTP-only requests for speed and cost efficiency. For JavaScript-rendered pages (React/Vue SPAs), you need a browser-based crawler that renders JavaScript before extracting content.

Q: Why does my page return partial content? A: Some pages serve different content to bots. Try increasing requestTimeout. If the page relies heavily on JavaScript for content rendering, it may not work with this actor.

Q: The Markdown has a lot of links/navigation — how do I fix it? A: The Readability engine removes most navigation automatically; if hyperlinks still clutter the output, set includeLinks: false to strip them all. If noise remains after that, the page may have an unusual structure.

Q: Can I convert PDFs or other file types? A: No — this actor only processes HTML pages. PDF conversion requires a different tool.

Q: How many URLs can I process per run? A: No hard limit — the actor processes URLs sequentially with a 600-second timeout. For very large batches (1000+ URLs), consider splitting across multiple runs.

Q: Is my data private? A: Yes — results are stored in your private Apify dataset. No extracted content is shared or retained by the actor developer.