Pricing

$5.99 / 1,000 results

Webpage To Markdown

Convert any webpage into clean, structured, LLM-ready Markdown. Handles JavaScript-rendered sites, strips ads and navigation clutter, and outputs metadata alongside the content. Built for RAG pipelines, AI training, SEO audits, and content archiving.

Rating

0.0

(0)

Developer

Kawsar

Last modified

a month ago

Webpage to Markdown Converter

Convert any public webpage into clean, structured, LLM-ready Markdown in seconds. This actor fetches fully rendered pages, strips away noise like ads, navigation, and cookie banners, and outputs high-quality Markdown alongside structured metadata — ready for RAG pipelines, AI training, SEO audits, and content archiving.

Why use this actor?

Most web pages are full of clutter: navigation bars, cookie notices, social share widgets, footer links. When you feed raw HTML into an LLM or a vector database, that noise degrades retrieval quality and inflates token usage. This actor does the heavy lifting — it fetches the page, extracts the meaningful content, and delivers clean Markdown that your pipelines can use directly.

Works on JavaScript-rendered pages (React, Vue, Next.js, Angular, and more)
Extracts semantic main content — isolates articles and body text from site chrome
Supports bulk processing — up to 1,000 URLs per run
Outputs structured metadata — title, description, URL, and timestamp alongside the Markdown
Fully configurable — control what gets included or excluded with CSS selector rules

Use cases

Use case	How this actor helps
RAG / vector search	Feed noise-free page text directly into embedding pipelines for higher retrieval accuracy
LLM fine-tuning	Compile large, clean web corpora without manual preprocessing
SEO auditing	Inspect heading structure, body copy, and semantic layout across multiple URLs
Content archiving	Save readable offline copies of blog posts, documentation, and news articles
AI agent memory	Convert reference pages into Markdown for use as context in agent workflows
Research automation	Batch-convert dozens of sources into a uniform format for analysis

What data does this actor extract?

Every processed URL yields one structured record in the output dataset:

Field	Type	Description
`url`	string	The original URL that was processed
`pageTitle`	string	The HTML `<title>` tag content
`pageDescription`	string	The `<meta name="description">` or Open Graph description
`markdown`	string	Clean, clutter-free Markdown of the page content
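`scrapedAt`	string	UTC ISO 8601 timestamp of when the page was processed
`error`	string	Error message, included when the URL could not be fetched (the content fields are then null)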

Input parameters

Parameter	Type	Default	Required	Description
`urls`	array	`["https://apify.com"]`	Yes	List of webpage URLs to convert. Enter one URL per line.
`onlyMainContent`	boolean	`true`	No	Extract only the core article or body, dropping navigation, headers, and footers.
`includeImages`	boolean	`true`	No	Keep image references in the Markdown output.
`includeLinks`	boolean	`true`	No	Keep hyperlinks in the Markdown output.
`removeSelectors`	array	See below	No	CSS selectors to strip from the page before conversion.
`maxItems`	integer	`100`	No	Maximum number of URLs to process in this run (cap: 1,000).
`requestTimeoutSecs`	integer	`30`	No	Per-request timeout in seconds (range: 5–120).

Default removeSelectors:

script, style, nav, footer, header, noscript, iframe, aside, .ads, .menu
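The input table does not say whether a custom `removeSelectors` list replaces the defaults or is merged with them. Assuming it replaces them, a small client-side helper can keep the defaults and append extra selectors. This is only a sketch: the name `extend_selectors` is hypothetical and not part of the actor.

```python
# Default list copied from the documentation above.
DEFAULT_REMOVE_SELECTORS = [
    "script", "style", "nav", "footer", "header",
    "noscript", "iframe", "aside", ".ads", ".menu",
]


def extend_selectors(extra):
    """Return the defaults plus any extra selectors, without duplicates.

    Assumption: a user-supplied removeSelectors value replaces the default
    list, so the defaults must be repeated to stay in effect.
    """
    merged = list(DEFAULT_REMOVE_SELECTORS)
    for selector in extra:
        if selector not in merged:
            merged.append(selector)
    return merged


selectors = extend_selectors([".cookie-banner", "nav"])
```

The result can be passed as the `removeSelectors` value in the run input.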

Example input

{
    "urls": [
        "https://apify.com",
        "https://docs.apify.com/academy/getting-started"
    ],
    "onlyMainContent": true,
    "includeImages": true,
    "includeLinks": true,
    "removeSelectors": [
        "script",
        "style",
        "nav",
        "footer",
        "header",
        "noscript",
        "iframe",
        "aside",
        ".ads",
        ".cookie-banner"
    ],
    "maxItems": 50,
    "requestTimeoutSecs": 30
}
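When the input is generated by a script rather than typed into the form, it may help to check the documented limits before starting a run. The sketch below builds the same structure as the example above. The function name `build_input` and its keyword arguments are hypothetical, and the lower bound of 1 for `maxItems` is an assumption (only the cap of 1,000 is documented).

```python
import json


def build_input(urls, max_items=100, request_timeout_secs=30,
                only_main_content=True, include_images=True,
                include_links=True):
    """Return an input dict that uses the documented parameter names."""
    if not urls:
        raise ValueError("urls must contain at least one URL")
    # Documented cap is 1,000; the minimum of 1 is an assumption.
    if not 1 <= max_items <= 1000:
        raise ValueError("maxItems must be between 1 and 1000")
    # Documented range for the per-request timeout is 5 to 120 seconds.
    if not 5 <= request_timeout_secs <= 120:
        raise ValueError("requestTimeoutSecs must be between 5 and 120")
    return {
        "urls": list(urls),
        "onlyMainContent": only_main_content,
        "includeImages": include_images,
        "includeLinks": include_links,
        "maxItems": max_items,
        "requestTimeoutSecs": request_timeout_secs,
    }


run_input = build_input(["https://apify.com"], max_items=50)
print(json.dumps(run_input, indent=4))
```

`removeSelectors` is left out here, so the actor's default list applies.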

Output example

Each converted page is saved as a dataset record. Here is a typical result:

{
    "url": "https://apify.com",
    "pageTitle": "Apify: The web scraping and automation platform",
    "pageDescription": "Apify is the platform where developers build, deploy, and share web scraping, data extraction, and automation tools.",
    "markdown": "# Apify\n\nApify is the platform where developers build, run, and share web scrapers and automation tools.\n\n## Get structured data from any website\n\nWe provide the hosting and infrastructure for scrapers...",
    "scrapedAt": "2026-06-10T04:15:00.000Z"
}

Failed records

If a URL cannot be fetched, the record is still saved with null content fields and an error message so your pipeline knows what to skip or retry:

{
    "url": "https://example.com/404-page",
    "pageTitle": null,
    "pageDescription": null,
    "markdown": null,
    "error": "Page not found: https://example.com/404-page",
    "scrapedAt": "2026-06-10T04:15:05.000Z"
}
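A downstream pipeline can use these fields to separate usable records from ones to retry. The following is a minimal sketch that assumes the dataset items have already been loaded as a list of dicts shaped like the two examples above. The helper name `split_records` is hypothetical.

```python
def split_records(records):
    """Split dataset records into (succeeded, failed) lists.

    A record counts as failed if it carries an error message or has
    no Markdown content.
    """
    succeeded, failed = [], []
    for record in records:
        if record.get("error") or record.get("markdown") is None:
            failed.append(record)
        else:
            succeeded.append(record)
    return succeeded, failed


records = [
    {"url": "https://apify.com", "markdown": "# Apify\n\n...",
     "scrapedAt": "2026-06-10T04:15:00.000Z"},
    {"url": "https://example.com/404-page", "markdown": None,
     "error": "Page not found: https://example.com/404-page",
     "scrapedAt": "2026-06-10T04:15:05.000Z"},
]

succeeded, failed = split_records(records)
retry_urls = [record["url"] for record in failed]
```

`retry_urls` can then be fed into a follow-up run.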

How it works

URL validation — Each URL is validated for a correct scheme and host before any request is made.
Page retrieval — Pages are fetched with full JavaScript rendering support, so single-page apps and dynamic sites work out of the box.
HTML cleaning — Unwanted elements are removed using the configured CSS selector list before any content analysis begins.
Main content extraction — When enabled, the actor locates semantic content containers (<main>, <article>, #content, .content, [role="main"]) and discards surrounding site chrome. If no semantic container is found, it falls back to the full page body.
Markdown conversion — The cleaned HTML is converted to properly structured ATX-style Markdown, with configurable handling for images and links.
Metadata extraction — The page title and meta description are captured alongside the Markdown.
Dataset output — Each result is pushed to the Apify dataset immediately, so you can inspect partial results during a long run.
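The URL validation step can be approximated on the client side to filter a list before submitting it. This sketch uses Python's standard `urllib.parse`. The actor's exact rules are not documented, so restricting schemes to `http` and `https` is an assumption, and `is_valid_url` is a hypothetical name.

```python
from urllib.parse import urlsplit


def is_valid_url(url):
    """Return True if the URL has an http(s) scheme and a host."""
    try:
        parts = urlsplit(url.strip())
        host = parts.hostname
    except ValueError:
        # urlsplit rejects some malformed inputs, such as bad IPv6 hosts.
        return False
    # Assumption: only http and https are accepted.
    return parts.scheme in ("http", "https") and bool(host)


candidates = ["https://apify.com", "apify.com", "ftp://example.com/file"]
valid = [url for url in candidates if is_valid_url(url)]
```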

FAQ

Does this actor handle JavaScript-rendered pages?
Yes. The actor retrieves fully rendered page content, so sites built with React, Vue, Next.js, Angular, or any other client-side framework are handled correctly.

How does main content extraction work?
When onlyMainContent is enabled, the actor scans the page for semantic HTML elements — <main>, <article> — and common class/ID patterns like #content, .content, #main. If a match is found, only that block is converted. If no match is found, the full page body is used as a fallback.
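The fallback logic can be illustrated with Python's standard `html.parser`. This is a simplified sketch, not the actor's implementation: it only reports which container would be chosen, the priority order among the candidates is an assumption, and `pick_container` is a hypothetical name.

```python
from html.parser import HTMLParser

# Candidates named in the answer above; the priority order is an assumption.
CANDIDATES = ["main", "article", "#content", ".content", "#main"]


class ContainerScanner(HTMLParser):
    """Records which candidate containers appear in the document."""

    def __init__(self):
        super().__init__()
        self.found = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in ("main", "article"):
            self.found.add(tag)
        element_id = attributes.get("id")
        if element_id in ("content", "main"):
            self.found.add("#" + element_id)
        # A class attribute may hold several space-separated names.
        if "content" in (attributes.get("class") or "").split():
            self.found.add(".content")


def pick_container(html):
    """Return the first matching candidate, or "body" as the fallback."""
    scanner = ContainerScanner()
    scanner.feed(html)
    for selector in CANDIDATES:
        if selector in scanner.found:
            return selector
    return "body"


choice = pick_container("<body><nav>Menu</nav><article><p>Text</p></article></body>")
```

A page with none of the candidates, such as `<body><p>Text</p></body>`, yields `"body"`, which mirrors the full-page fallback.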

Can I target specific sections to remove?
Yes. Use the removeSelectors input to provide any CSS selectors you want stripped before conversion. This works for custom widgets, related posts lists, tracking banners, comment sections, or any other element you want to exclude.

What is the URL limit per run?
The actor processes up to 1,000 URLs per run. For larger batches, split your list across multiple runs.
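Splitting a long list into run-sized batches takes only a few lines. The sketch below assumes the URLs are already in a Python list. `chunk_urls` is a hypothetical helper name.

```python
def chunk_urls(urls, size=1000):
    """Split a list of URLs into consecutive batches of at most `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [urls[i:i + size] for i in range(0, len(urls), size)]


all_urls = [f"https://example.com/page/{n}" for n in range(2500)]
batches = chunk_urls(all_urls)
batch_sizes = [len(batch) for batch in batches]
```

Each batch can be submitted as the `urls` input of a separate run, with `maxItems` set to at least the batch length.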

What happens if a page fails?
Failed pages are recorded in the dataset with null content and a descriptive error message. The run continues processing the remaining URLs rather than stopping on the first failure.

What Markdown format is used?
Headings use ATX style (#, ##, ###), lists use hyphens (-), and inline formatting follows CommonMark conventions. The output should render in any CommonMark-compatible Markdown renderer and can be passed to an LLM as plain text.

Can I increase the request timeout for slow sites?
Yes. Set requestTimeoutSecs to any value from 5 to 120 seconds (the default is 30) for sites that take longer to respond.

Integrations and webhooks

Connect this actor to your existing tools using Apify integrations:

Make (formerly Integromat) — trigger workflows when new results arrive
Zapier — connect to thousands of apps automatically
Google Sheets / Google Drive — export results directly to spreadsheets or Drive
Slack — send notifications when a run finishes
Airbyte / GitHub — sync output to data warehouses or version control
Webhooks — call any HTTP endpoint as soon as results are added to the dataset

Get started

Open the actor on Apify and click Try for free
Paste one or more URLs into the Webpage URLs field
Adjust content and selector options as needed
Click Start and view results in the Dataset tab

For programmatic runs and dataset retrieval, see the API docs.
