# Webpage to Markdown Converter

Convert any webpage into clean, structured Markdown optimized for LLMs, RAG pipelines, AI knowledge bases, and content research. Uses @mozilla/readability to extract main content and strips ads, navigation, footers, and other noise, leaving only the substance.
## What does it do?

Webpage to Markdown Converter takes a list of URLs, fetches each page, extracts the main readable content using Mozilla's battle-tested Readability engine, and converts it to clean Markdown using Turndown. The result is structured JSON with the Markdown content, page title, word count, publication metadata, and per-URL error information for failed requests.
Key capabilities:

- Smart content extraction with @mozilla/readability (the same engine Firefox uses for Reader Mode)
- Clean Markdown output via Turndown: no JavaScript, no browser needed
- Fast HTTP-only processing: 256 MB of memory, no proxy needed
- Rich metadata: author, description, published date, site name, language
- Graceful error handling: bad URLs never crash the run; errors are captured per URL
- Configurable: toggle images and links, set content length limits
## Who is it for?

- **AI/LLM developers** building RAG pipelines, vector databases, or knowledge bases who need clean text from URLs without building their own scraper infrastructure.
- **Content researchers** collecting and analyzing web content for training data, competitor analysis, or documentation aggregation.
- **Data engineers** building automated content-processing pipelines that need to ingest web pages as structured data.
- **No-code users** on Make, Zapier, or n8n who want to convert webpages to text as part of automation workflows.
## Why use it?

| Feature | This Actor | Competitors |
|---|---|---|
| Price per page | $0.002 | $0.005–$0.05 |
| Content extraction | Mozilla Readability (smart) | Basic HTML strip |
| Memory needed | 256 MB | 256–2048 MB |
| Metadata fields | 5 fields (author, description, siteName, date, lang) | None |
| Error details | Per-URL status code + message | Crash or skip silently |
| Word count | Yes | No |

The top competitor charges $0.05/page, 25x more for less output.
## Output data

For each URL, you receive a structured JSON object:

| Field | Type | Description |
|---|---|---|
| url | string | Input URL |
| title | string | Page title from HTML/Readability |
| markdown | string | Clean Markdown content |
| wordCount | integer | Word count of the Markdown |
| extractedAt | string | ISO 8601 timestamp |
| metadata.author | string\|null | Author from meta tags |
| metadata.description | string\|null | Meta description |
| metadata.siteName | string\|null | Site name (og:site_name) |
| metadata.publishedDate | string\|null | Publication date |
| metadata.language | string\|null | Content language code |
| statusCode | integer\|null | HTTP response code |
| success | boolean | Whether conversion succeeded |
| error | string\|null | Error message (null on success) |
## How much does it cost to convert webpages to Markdown?

$0.002 per successfully converted page. Failed URLs (404s, timeouts) are not charged.

Examples:

- 100 pages → ~$0.20
- 1,000 pages → ~$2.00
- 10,000 pages (a monthly RAG pipeline) → ~$20.00

This is 25x cheaper than the most popular competitor, with richer output and smarter content extraction.
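The cost math is just successful pages times the per-page rate; a minimal sketch, using the $0.002 rate stated above (the helper name is ours):

```python
# Quick cost estimate: only successfully converted pages are billed.
PRICE_PER_PAGE_USD = 0.002  # per-page rate from the pricing above

def estimate_cost(successful_pages: int) -> float:
    return successful_pages * PRICE_PER_PAGE_USD

print(estimate_cost(1_000))   # 2.0
print(estimate_cost(10_000))  # 20.0
```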
## How to use

### Step 1: Provide URLs

Add URLs in the URLs to convert field. You can add as many as you need; the actor processes them sequentially.

### Step 2: Configure options (optional)

- Include images: keep or strip image links in the Markdown output
- Include links: keep or strip hyperlinks (useful for plain-text LLM input)
- Max content length: limit Markdown characters per page (useful for LLM token budgets)

### Step 3: Run and retrieve results

Start the actor. Results appear in the dataset in real time. Download them as JSON, CSV, or JSONL.
## Input

```json
{
  "urls": [
    "https://en.wikipedia.org/wiki/Markdown",
    "https://docs.python.org/3/tutorial/",
    "https://news.ycombinator.com"
  ],
  "includeImages": true,
  "includeLinks": true,
  "maxContentLength": 0,
  "requestTimeout": 30,
  "maxRetries": 2
}
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| urls | string[] | required | List of URLs to convert |
| includeImages | boolean | true | Include image links in Markdown |
| includeLinks | boolean | true | Include hyperlinks in Markdown |
| maxContentLength | integer | 0 (unlimited) | Max characters per page (0 = unlimited) |
| requestTimeout | integer | 30 | HTTP timeout in seconds |
| maxRetries | integer | 2 | Retry attempts on network errors |
## Output

```json
{
  "url": "https://en.wikipedia.org/wiki/Markdown",
  "title": "Markdown",
  "markdown": "From Wikipedia, the free encyclopedia\n\n## Overview\n\nMarkdown is a lightweight markup language...",
  "wordCount": 2859,
  "extractedAt": "2026-04-06T12:00:00.000Z",
  "metadata": {
    "author": "Contributors to Wikimedia projects",
    "description": null,
    "siteName": "Wikimedia Foundation, Inc.",
    "publishedDate": "2005-08-09T19:56:00Z",
    "language": "en"
  },
  "statusCode": 200,
  "success": true,
  "error": null
}
```

Failed URL output:

```json
{
  "url": "https://example.com/page-not-found",
  "title": null,
  "markdown": null,
  "wordCount": 0,
  "extractedAt": "2026-04-06T12:00:01.000Z",
  "metadata": {
    "author": null,
    "description": null,
    "siteName": null,
    "publishedDate": null,
    "language": null
  },
  "statusCode": 404,
  "success": false,
  "error": "HTTP 404: Not Found"
}
```
## Tips

For LLM/RAG pipelines:

- Set `includeImages: false` and `includeLinks: false` for cleaner text input
- Use `maxContentLength` to match your LLM's context window (e.g., `50000` chars ≈ ~12k tokens)
- The `wordCount` field helps you estimate token usage before sending to an LLM (see the sketch after this list)
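A rough token estimate from `wordCount`, assuming the common ~1.3 tokens-per-word heuristic for English text (the ratio is our assumption, not something the actor reports):

```python
# Rough token estimate from the actor's wordCount field. The 1.3 tokens-per-word
# ratio is a common heuristic for English text, not an exact tokenizer count;
# verify with your model's tokenizer if the budget is tight.
TOKENS_PER_WORD = 1.3

def estimated_tokens(word_count: int) -> int:
    return int(word_count * TOKENS_PER_WORD)

print(estimated_tokens(2859))  # ~3716 tokens for the Wikipedia example above
```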
For content research:

- Keep both images and links enabled (default) for full-fidelity Markdown
- The `metadata.publishedDate` field is useful for freshness filtering (see the sketch after this list)
- Failed URLs are always included in results (with `success: false`), so you know exactly what didn't work
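A sketch of freshness filtering on `metadata.publishedDate`. Dates may be null or come in varying ISO 8601 forms, so the parse is defensive; the 90-day cutoff is an arbitrary example:

```python
from datetime import datetime, timedelta, timezone

CUTOFF = datetime.now(timezone.utc) - timedelta(days=90)  # arbitrary example window

def is_fresh(item: dict) -> bool:
    published = item["metadata"]["publishedDate"]
    if not published:
        return False  # many pages expose no date in their meta tags
    try:
        # fromisoformat accepts UTC offsets; normalize a trailing 'Z' first.
        dt = datetime.fromisoformat(published.replace("Z", "+00:00"))
    except ValueError:
        return False  # unparseable date strings are treated as stale
    return dt >= CUTOFF

# fresh = [item for item in items if item["success"] and is_fresh(item)]
```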
For Wikipedia / documentation:

- These convert especially well; Readability excels at article-format content
- Table content is preserved as Markdown tables
Performance:

- The actor processes URLs sequentially; for large batches (1,000+ URLs), consider running multiple instances in parallel via the API (see the fan-out sketch below)
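A sketch of that fan-out with `apify_client`: split the URL list into chunks and start one run per chunk. The chunk size of 200 is an arbitrary example, and `.start()` returns immediately rather than blocking like `.call()`:

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")
CHUNK_SIZE = 200  # arbitrary example; tune to your batch size

def chunks(seq: list, size: int):
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

urls = ["https://example.com/a", "https://example.com/b"]  # your full URL list
runs = []
for batch in chunks(urls, CHUNK_SIZE):
    # .start() launches the run without waiting; .call() would block per run.
    actor = client.actor("automation-lab/webpage-to-markdown-converter")
    runs.append(actor.start(run_input={"urls": batch}))
```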
## Integrations

### Zapier / Make / n8n

Use the Apify integration to trigger this actor from any workflow:

- Add an Apify step with actor `automation-lab/webpage-to-markdown-converter`
- Pass your URL list as input
- Read results from the dataset output
### LangChain

```python
from apify_client import ApifyClient
from langchain_core.documents import Document

client = ApifyClient("YOUR_APIFY_TOKEN")
run = client.actor("automation-lab/webpage-to-markdown-converter").call(
    run_input={"urls": ["https://example.com"], "includeLinks": False}
)

docs = []
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    if item["success"]:
        # Wrap each converted page as a LangChain Document.
        docs.append(Document(page_content=item["markdown"], metadata={"source": item["url"]}))
```
### LlamaIndex

```python
from llama_index.core import Document, VectorStoreIndex

# dataset_items is the list of records fetched from the run's dataset.
# Keep converted pages and wrap them as LlamaIndex documents.
pages = [item for item in dataset_items if item["success"]]
documents = [
    Document(text=p["markdown"], metadata={"url": p["url"], "title": p["title"]})
    for p in pages
]
index = VectorStoreIndex.from_documents(documents)
```
### Pinecone / Weaviate

The `markdown` field feeds directly into any embedding pipeline. The `wordCount` field helps you batch-split documents that exceed embedding-model token limits.
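A sketch of that size-based splitting before embedding, using paragraph boundaries and a character budget (the 4,000-character budget, roughly ~1k tokens, is our assumption; any text splitter works here):

```python
# Split a page's Markdown into chunks under a character budget before embedding.
# The 4,000-char budget (~1k tokens) is an example; size it to your embedding model.
MAX_CHARS = 4_000

def split_markdown(markdown: str, max_chars: int = MAX_CHARS) -> list[str]:
    chunks: list[str] = []
    current = ""
    for para in markdown.split("\n\n"):  # paragraph boundaries keep chunks coherent
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    # Note: a single paragraph longer than max_chars passes through intact.
    return chunks
```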
## API Usage

### Node.js

```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

const run = await client.actor('automation-lab/webpage-to-markdown-converter').call({
    urls: ['https://en.wikipedia.org/wiki/Web_scraping', 'https://docs.apify.com'],
    includeImages: false,
    maxContentLength: 50000,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
for (const item of items) {
    if (item.success) {
        console.log(`${item.url}: ${item.wordCount} words`);
        console.log(item.markdown.substring(0, 500));
    }
}
```

### Python

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("automation-lab/webpage-to-markdown-converter").call(run_input={
    "urls": ["https://en.wikipedia.org/wiki/Web_scraping", "https://docs.apify.com"],
    "includeImages": False,
    "maxContentLength": 50000,
})

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    if item["success"]:
        print(f"{item['url']}: {item['wordCount']} words")
```

### cURL

```bash
curl -X POST "https://api.apify.com/v2/acts/automation-lab~webpage-to-markdown-converter/runs?token=YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://en.wikipedia.org/wiki/Web_scraping"], "includeImages": false}'
```
## MCP (Claude Code & Desktop)

Use this actor directly from Claude Code or Claude Desktop via the Apify MCP server.

Claude Code (run in your terminal; `apify` is the server name you choose):

```bash
claude mcp add --transport http apify "https://mcp.apify.com?tools=automation-lab/webpage-to-markdown-converter"
```

Claude Desktop (add to your claude_desktop_config.json):

```json
{
  "mcpServers": {
    "apify": {
      "command": "npx",
      "args": ["-y", "@apify/actors-mcp-server"],
      "env": {
        "APIFY_TOKEN": "YOUR_APIFY_TOKEN",
        "ACTORS": "automation-lab/webpage-to-markdown-converter"
      }
    }
  }
}
```
Example Claude prompts:
- "Convert https://docs.python.org/3/tutorial/ to Markdown for my knowledge base"
- "Fetch these 5 URLs and convert them to clean text for RAG ingestion"
- "Extract the article content from this news page without images or links"
## Legality

This actor fetches publicly accessible webpages using standard HTTP requests (no browser automation, no captcha bypassing). It is the user's responsibility to comply with the target website's Terms of Service and robots.txt. Extracted content is for the user's own use; ensure compliance with copyright laws when storing or redistributing it.
## FAQ

**Q: Does it work on JavaScript-rendered pages?**
A: No. This actor uses HTTP-only requests for speed and cost efficiency. For JavaScript-rendered pages (React/Vue SPAs), you need a browser-based crawler that renders JavaScript before extracting content.

**Q: Why does my page return partial content?**
A: Some pages serve different content to bots. Try increasing `requestTimeout`. If the page relies heavily on JavaScript for content rendering, it will not work with this actor.

**Q: The Markdown has a lot of links/navigation. How do I fix it?**
A: Set `includeLinks: false` to strip all hyperlinks. The Readability engine should already remove most navigation; if you're still seeing noise, the page may have an unusual structure.

**Q: Can I convert PDFs or other file types?**
A: No. This actor only processes HTML pages; PDF conversion requires a different tool.

**Q: How many URLs can I process per run?**
A: There is no hard limit; the actor processes URLs sequentially with a 600-second timeout. For very large batches (1,000+ URLs), consider splitting across multiple runs.

**Q: Is my data private?**
A: Yes. Results are stored in your private Apify dataset; no extracted content is shared with or retained by the actor developer.
## Related actors

- Color Contrast Checker: validate WCAG 2.1 AA/AAA color contrast for accessibility
- JSON Schema Generator: generate JSON schemas from sample data