Webpage to Markdown Converter

Pricing: Pay per usage
Convert any webpage URL to clean Markdown format. Preserves headings, lists, tables, links, and code blocks. Optimized for LLM consumption, RAG pipelines, and vector database ingestion.


Developer: Donny (Maintained by Community)


Webpage to Markdown

What is Webpage to Markdown?

Webpage to Markdown is an Apify actor that converts any webpage into clean, well-structured Markdown format. It fetches HTML from one or more URLs, strips away scripts, styles, navigation, headers, footers, and other non-content elements, then converts the remaining content into proper Markdown with headings, lists, tables, code blocks, bold, italic, links, and images. The output includes the page title, word count, character count, and a timestamp alongside the Markdown content. This actor is ideal for building RAG (Retrieval-Augmented Generation) pipelines, content archiving systems, knowledge bases, and any application where you need structured, LLM-ready text from web pages.

Unlike simple text extraction, Webpage to Markdown preserves the document structure. Headings remain as headings, tables remain as tables, code blocks keep their formatting, and lists maintain their hierarchy. This structural preservation is critical for LLMs that benefit from understanding document organization and for downstream applications that need to render or further process the content.

Why use Webpage to Markdown?

  • Preserves document structure -- Headings (H1-H6), lists (ordered and unordered), tables, code blocks, blockquotes, bold, and italic formatting are all correctly converted to Markdown syntax.
  • Clean content extraction -- Scripts, styles, navigation, headers, footers, forms, buttons, hidden elements, and iframes are all automatically removed before conversion.
  • Batch processing -- Process multiple URLs in a single run. Just provide a list of URLs and the actor handles them sequentially.
  • Configurable output -- Choose whether to include images and hyperlinks in the Markdown output. Control maximum content length to stay within LLM token limits.
  • LLM-ready output -- The clean Markdown format is directly consumable by Claude, GPT, and other LLMs for question answering, summarization, and analysis.
  • Full table support -- HTML tables are converted to pipe-delimited Markdown tables with proper header separation.
  • Error resilience -- Failed URLs produce error records in the dataset rather than crashing the entire run, so batch processing always completes.
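To illustrate the table-support claim above, here is a minimal sketch of how an HTML table maps to a pipe-delimited Markdown table. This is not the actor's internal code; `htmlTableToMarkdown` is a hypothetical helper written for illustration only:

```javascript
// Minimal sketch of HTML-table-to-Markdown conversion (illustrative only;
// not the actor's actual implementation).
function htmlTableToMarkdown(html) {
  // Collect each <tr>, then each cell (<th> or <td>) inside it.
  const rows = [...html.matchAll(/<tr[^>]*>(.*?)<\/tr>/gs)].map(([, row]) =>
    [...row.matchAll(/<t[hd][^>]*>(.*?)<\/t[hd]>/gs)].map(([, cell]) =>
      cell.replace(/<[^>]+>/g, '').trim()
    )
  );
  if (rows.length === 0) return '';
  const line = (cells) => `| ${cells.join(' | ')} |`;
  const separator = line(rows[0].map(() => '---'));
  // First row becomes the header, followed by the separator row.
  return [line(rows[0]), separator, ...rows.slice(1).map(line)].join('\n');
}

const md = htmlTableToMarkdown(
  '<table><tr><th>Name</th><th>Role</th></tr><tr><td>Ada</td><td>Engineer</td></tr></table>'
);
// md === '| Name | Role |\n| --- | --- |\n| Ada | Engineer |'
```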

How to use Webpage to Markdown

  1. Go to the actor page on Apify and click "Start".
  2. Add your URLs to the urls input field. You can add one URL or dozens.
  3. Configure options: Toggle image inclusion, link preservation, and set the maximum content length.
  4. Run the actor and download results from the dataset in JSON, CSV, or Excel format.

Using the Apify API

curl -X POST "https://api.apify.com/v2/acts/YOUR_ACTOR_ID/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://en.wikipedia.org/wiki/Web_scraping", "https://docs.apify.com"],
    "includeImages": false,
    "includeLinks": true,
    "maxContentLength": 50000
  }'
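For small batches, Apify's synchronous run endpoint starts the run and returns the dataset items in a single request (same placeholders as above). Note that synchronous runs are subject to a timeout, so prefer the asynchronous endpoint above for large batches:

```shell
curl -X POST "https://api.apify.com/v2/acts/YOUR_ACTOR_ID/run-sync-get-dataset-items?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://docs.apify.com"], "includeLinks": true}'
```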

Using the Apify SDK

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_TOKEN' });

const run = await client.actor('YOUR_ACTOR_ID').call({
  urls: ['https://en.wikipedia.org/wiki/Web_scraping'],
  includeLinks: true,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items[0].markdown);

Input configuration

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| urls | String[] | ["https://example.com"] | List of webpage URLs to convert to Markdown format |
| includeImages | Boolean | false | Whether to include image references (![alt](src)) in the Markdown output |
| includeLinks | Boolean | true | Whether to preserve hyperlinks ([text](href)) in the Markdown output |
| maxContentLength | Integer | 50000 | Maximum number of characters in the output Markdown content. Content exceeding this limit is truncated with a notice. |
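The maxContentLength behavior can be sketched as follows. The exact truncation-notice text is an assumption for illustration; the actor's actual notice may differ:

```javascript
// Sketch of maxContentLength truncation (illustrative; the actor's
// actual truncation notice text is assumed here).
function truncate(markdown, maxContentLength) {
  if (markdown.length <= maxContentLength) return markdown;
  const notice = '\n\n[Content truncated]';
  return markdown.slice(0, maxContentLength - notice.length) + notice;
}

const short = truncate('hello', 50000); // unchanged: under the limit
const long = truncate('x'.repeat(60000), 50000);
// long.length === 50000 and long ends with the truncation notice
```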

Output data

Each processed URL produces one record in the dataset with the following fields:

{
  "url": "https://en.wikipedia.org/wiki/Web_scraping",
  "title": "Web scraping - Wikipedia",
  "markdown": "# Web scraping\n\nWeb scraping, web harvesting, or web data extraction is [data scraping](https://en.wikipedia.org/wiki/Data_scraping) used for extracting data from websites...\n\n## Techniques\n\n- Human copy-and-paste\n- Text pattern matching\n- HTTP programming\n- DOM parsing...",
  "wordCount": 4523,
  "charCount": 28190,
  "timestamp": "2026-03-03T12:00:00.000Z"
}

| Field | Type | Description |
| --- | --- | --- |
| url | String | The original URL that was processed |
| title | String | The page title extracted from the HTML <title> tag |
| markdown | String | The converted Markdown content with proper formatting |
| wordCount | Integer | Number of words in the Markdown output |
| charCount | Integer | Number of characters in the Markdown output |
| timestamp | String | ISO 8601 timestamp of when the conversion was performed |
| error | String | Error message if the URL could not be processed (only present on failures) |
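The wordCount and charCount fields can plausibly be derived from the Markdown string as shown below; the actor's exact word-splitting rule is an assumption here:

```javascript
// Plausible derivation of wordCount and charCount (the actor's exact
// word-splitting rule is an assumption).
const markdown = '# Web scraping\n\nWeb scraping is data extraction from websites.';
const wordCount = markdown.split(/\s+/).filter(Boolean).length;
const charCount = markdown.length;
// wordCount === 10, charCount === 62
```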

Cost of usage

Webpage to Markdown is a lightweight actor that uses minimal compute resources. Each URL typically processes in 1-3 seconds depending on page size. With default memory (2048 MB), processing a batch of 10 URLs costs approximately $0.01-$0.03. For large-scale operations processing 1000 URLs, expect costs around $1-$3. The actor uses no external paid APIs or browser automation, keeping costs minimal. The per-event pricing is $0.05 per actor run plus $0.001 per result.
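The per-event component of a run's cost can be estimated with simple arithmetic (this covers only the stated $0.05 per run plus $0.001 per result; compute usage is billed separately under pay-per-usage):

```javascript
// Rough per-event cost estimate from the stated pricing:
// $0.05 per actor run plus $0.001 per result.
const estimateCostUsd = (results, runs = 1) => runs * 0.05 + results * 0.001;

const tenUrls = estimateCostUsd(10);    // ≈ 0.06
const thousand = estimateCostUsd(1000); // ≈ 1.05
```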

Tips and tricks

  • Set includeLinks to true for RAG: When building RAG pipelines, keeping hyperlinks in the Markdown helps LLMs provide source attribution and follow-up references.
  • Set includeImages to false for text-only LLMs: If your LLM cannot process images, disable image inclusion to keep the Markdown clean and reduce token usage.
  • Adjust maxContentLength for your LLM context window: If you are using a model with a 4K token limit, set maxContentLength to around 12000 characters. For 128K context models, the default 50000 is usually fine.
  • Batch URLs for efficiency: Processing multiple URLs in one run is more efficient than starting separate runs for each URL due to reduced actor startup overhead.
  • Handle errors gracefully: URLs that fail (404, timeout, etc.) produce records with an error field and empty content. Filter these out in your downstream processing.
  • Combine with chunking: If you need text chunks for vector database ingestion, pair this actor with the URL to Clean Text actor which supports automatic text chunking with configurable overlap.
  • Export as CSV: The Apify dataset can be exported as CSV, making it easy to import into spreadsheets or databases for further analysis.
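Following the error-handling tip above, failed records can be separated from successful conversions before downstream processing. A minimal sketch, with item shapes mirroring the output fields documented earlier:

```javascript
// Partition dataset items into successful conversions and failures,
// using the `error` field that is only present on failed URLs.
function partitionResults(items) {
  const ok = items.filter((item) => !item.error);
  const failed = items.filter((item) => item.error);
  return { ok, failed };
}

const { ok, failed } = partitionResults([
  { url: 'https://docs.apify.com', markdown: '# Apify Docs', wordCount: 2 },
  { url: 'https://example.com/missing', error: 'Request failed with status 404' },
]);
// ok.length === 1, failed.length === 1
```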