Webpage to Markdown Converter
Convert any webpage URL to clean Markdown format. Preserves headings, lists, tables, links, and code blocks. Optimized for LLM consumption, RAG pipelines, and vector database ingestion.
Webpage to Markdown
What is Webpage to Markdown?
Webpage to Markdown is an Apify actor that converts any webpage into clean, well-structured Markdown format. It fetches HTML from one or more URLs, strips away scripts, styles, navigation, headers, footers, and other non-content elements, then converts the remaining content into proper Markdown with headings, lists, tables, code blocks, bold, italic, links, and images. The output includes the page title, word count, character count, and a timestamp alongside the Markdown content. This actor is ideal for building RAG (Retrieval-Augmented Generation) pipelines, content archiving systems, knowledge bases, and any application where you need structured, LLM-ready text from web pages.
Unlike simple text extraction, Webpage to Markdown preserves the document structure. Headings remain as headings, tables remain as tables, code blocks keep their formatting, and lists maintain their hierarchy. This structural preservation is critical for LLMs that benefit from understanding document organization and for downstream applications that need to render or further process the content.
Why use Webpage to Markdown?
- Preserves document structure -- Headings (H1-H6), lists (ordered and unordered), tables, code blocks, blockquotes, bold, and italic formatting are all correctly converted to Markdown syntax.
- Clean content extraction -- Scripts, styles, navigation, headers, footers, forms, buttons, hidden elements, and iframes are all automatically removed before conversion.
- Batch processing -- Process multiple URLs in a single run. Just provide a list of URLs and the actor handles them sequentially.
- Configurable output -- Choose whether to include images and hyperlinks in the Markdown output. Control maximum content length to stay within LLM token limits.
- LLM-ready output -- The clean Markdown format is directly consumable by Claude, GPT, and other LLMs for question answering, summarization, and analysis.
- Full table support -- HTML tables are converted to pipe-delimited Markdown tables with proper header separation.
- Error resilience -- Failed URLs produce error records in the dataset rather than crashing the entire run, so batch processing always completes.
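To illustrate the table support: a two-column HTML table with header cells "Name" and "Role" and one data row (a hypothetical example, not from the actor's docs) would be converted to a pipe-delimited Markdown table with a header separator row:

```markdown
| Name | Role |
| --- | --- |
| Ada  | Engineer |
```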
How to use Webpage to Markdown
- Go to the actor page on Apify and click "Start".
- Add your URLs to the `urls` input field. You can add one URL or dozens.
- Configure options: toggle image inclusion, link preservation, and set the maximum content length.
- Run the actor and download results from the dataset in JSON, CSV, or Excel format.
Using the Apify API
```shell
curl -X POST "https://api.apify.com/v2/acts/YOUR_ACTOR_ID/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://en.wikipedia.org/wiki/Web_scraping", "https://docs.apify.com"],
    "includeImages": false,
    "includeLinks": true,
    "maxContentLength": 50000
  }'
```
Using the Apify SDK
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_TOKEN' });

const run = await client.actor('YOUR_ACTOR_ID').call({
  urls: ['https://en.wikipedia.org/wiki/Web_scraping'],
  includeLinks: true,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items[0].markdown);
```
Input configuration
| Field | Type | Default | Description |
|---|---|---|---|
| `urls` | String[] | `["https://example.com"]` | List of webpage URLs to convert to Markdown format |
| `includeImages` | Boolean | `false` | Whether to include image references (`![alt](src)`) in the Markdown output |
| `includeLinks` | Boolean | `true` | Whether to preserve hyperlinks (`[text](href)`) in the Markdown output |
| `maxContentLength` | Integer | `50000` | Maximum number of characters in the output Markdown content. Content exceeding this limit is truncated with a notice. |
Output data
Each processed URL produces one record in the dataset with the following fields:
```json
{
  "url": "https://en.wikipedia.org/wiki/Web_scraping",
  "title": "Web scraping - Wikipedia",
  "markdown": "# Web scraping\n\nWeb scraping, web harvesting, or web data extraction is [data scraping](https://en.wikipedia.org/wiki/Data_scraping) used for extracting data from websites...\n\n## Techniques\n\n- Human copy-and-paste\n- Text pattern matching\n- HTTP programming\n- DOM parsing...",
  "wordCount": 4523,
  "charCount": 28190,
  "timestamp": "2026-03-03T12:00:00.000Z"
}
```
| Field | Type | Description |
|---|---|---|
| `url` | String | The original URL that was processed |
| `title` | String | The page title extracted from the HTML `<title>` tag |
| `markdown` | String | The converted Markdown content with proper formatting |
| `wordCount` | Integer | Number of words in the Markdown output |
| `charCount` | Integer | Number of characters in the Markdown output |
| `timestamp` | String | ISO 8601 timestamp of when the conversion was performed |
| `error` | String | Error message if the URL could not be processed (only present on failures) |
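Because failed URLs still produce records (with an `error` field and empty content), it is worth separating successes from failures before downstream use. A minimal sketch, using hypothetical dataset items shaped like the records above:

```javascript
// Hypothetical items, shaped like those returned by
// client.dataset(run.defaultDatasetId).listItems()
const items = [
  { url: 'https://example.com', markdown: '# Example', wordCount: 1, charCount: 9 },
  { url: 'https://bad.example', markdown: '', error: 'Request timed out' },
];

// Keep only successful conversions before feeding them to an LLM or vector store.
const succeeded = items.filter((item) => !item.error && item.markdown);
const failed = items.filter((item) => item.error);

console.log(`${succeeded.length} ok, ${failed.length} failed`);
```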
Cost of usage
Webpage to Markdown is a lightweight actor that uses minimal compute resources. Each URL typically processes in 1-3 seconds depending on page size. With default memory (2048 MB), processing a batch of 10 URLs costs approximately $0.01-$0.03. For large-scale operations processing 1000 URLs, expect costs around $1-$3. The actor uses no external paid APIs or browser automation, keeping costs minimal. The per-event pricing is $0.05 per actor run plus $0.001 per result.
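Using the per-event prices quoted above, the event charges can be estimated directly (this sketch covers only the per-run and per-result events; compute costs vary with page size and come on top):

```javascript
// Event-based cost estimate: $0.05 per actor run plus $0.001 per result,
// per the pricing described above. Compute (memory-seconds) is extra.
function estimateEventCost(runs, results) {
  return runs * 0.05 + results * 0.001;
}

// One run converting 1000 URLs: $0.05 + $1.00 in event charges.
console.log(estimateEventCost(1, 1000).toFixed(2));
```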
Tips and tricks
- Set includeLinks to true for RAG: When building RAG pipelines, keeping hyperlinks in the Markdown helps LLMs provide source attribution and follow-up references.
- Set includeImages to false for text-only LLMs: If your LLM cannot process images, disable image inclusion to keep the Markdown clean and reduce token usage.
- Adjust maxContentLength for your LLM context window: If you are using a model with a 4K token limit, set maxContentLength to around 12000 characters. For 128K context models, the default 50000 is usually fine.
- Batch URLs for efficiency: Processing multiple URLs in one run is more efficient than starting separate runs for each URL due to reduced actor startup overhead.
- Handle errors gracefully: URLs that fail (404, timeout, etc.) produce records with an `error` field and empty content. Filter these out in your downstream processing.
- Combine with chunking: If you need text chunks for vector database ingestion, pair this actor with the URL to Clean Text actor, which supports automatic text chunking with configurable overlap.
- Export as CSV: The Apify dataset can be exported as CSV, making it easy to import into spreadsheets or databases for further analysis.
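If you prefer to chunk the Markdown yourself rather than using a separate actor, a minimal character-based chunker with overlap can look like this (an illustrative sketch; the chunk size and overlap values are arbitrary, and the URL to Clean Text actor's own chunking options may differ):

```javascript
// Split text into overlapping character chunks for vector database ingestion.
// Each chunk is at most `chunkSize` characters; consecutive chunks share
// `overlap` characters so that sentences cut at a boundary appear in both.
function chunkText(text, chunkSize = 1000, overlap = 100) {
  const chunks = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached the end
  }
  return chunks;
}

// 2500 characters with 1000-char chunks and 100-char overlap -> 3 chunks.
console.log(chunkText('a'.repeat(2500)).length);
```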