Webpage To Markdown
Pricing
Pay per usage
Convert any webpage into clean, structured, LLM-ready Markdown. Handles JavaScript-rendered sites, strips ads and navigation clutter, and outputs metadata alongside the content. Built for RAG pipelines, AI training, SEO audits, and content archiving.
Developer
Kawsar
Maintained by Community
Webpage to Markdown Converter
Convert any public webpage into clean, structured, LLM-ready Markdown in seconds. This actor fetches fully rendered pages, strips away noise like ads, navigation, and cookie banners, and outputs high-quality Markdown alongside structured metadata — ready for RAG pipelines, AI training, SEO audits, and content archiving.
Why use this actor?
Most web pages are full of clutter: navigation bars, cookie notices, social share widgets, footer links. When you feed raw HTML into an LLM or a vector database, that noise degrades retrieval quality and inflates token usage. This actor does the heavy lifting — it fetches the page, extracts the meaningful content, and delivers clean Markdown that your pipelines can use directly.
- Works on JavaScript-rendered pages (React, Vue, Next.js, Angular, and more)
- Extracts semantic main content — isolates articles and body text from site chrome
- Supports bulk processing — up to 1,000 URLs per run
- Outputs structured metadata — title, description, URL, and timestamp alongside the Markdown
- Fully configurable — control what gets included or excluded with CSS selector rules
Use cases
| Use case | How this actor helps |
|---|---|
| RAG / vector search | Feed noise-free page text directly into embedding pipelines for higher retrieval accuracy |
| LLM fine-tuning | Compile large, clean web corpora without manual preprocessing |
| SEO auditing | Inspect heading structure, body copy, and semantic layout across multiple URLs |
| Content archiving | Save readable offline copies of blog posts, documentation, and news articles |
| AI agent memory | Convert reference pages into Markdown for use as context in agent workflows |
| Research automation | Batch-convert dozens of sources into a uniform format for analysis |
What data does this actor extract?
Every processed URL yields one structured record in the output dataset:
| Field | Type | Description |
|---|---|---|
| `url` | string | The original URL that was processed |
| `pageTitle` | string | The content of the HTML `<title>` tag |
| `pageDescription` | string | The `<meta name="description">` or Open Graph description |
| `markdown` | string | Clean, clutter-free Markdown of the page content |
| `scrapedAt` | string | UTC ISO 8601 timestamp of when the page was processed |
Input parameters
| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| `urls` | array | `["https://apify.com"]` | Yes | List of webpage URLs to convert. Enter one URL per line. |
| `onlyMainContent` | boolean | `true` | No | Extract only the core article or body, dropping navigation, headers, and footers. |
| `includeImages` | boolean | `true` | No | Keep image references in the Markdown output. |
| `includeLinks` | boolean | `true` | No | Keep hyperlinks in the Markdown output. |
| `removeSelectors` | array | See below | No | CSS selectors to strip from the page before conversion. |
| `maxItems` | integer | `100` | No | Maximum number of URLs to process in this run (cap: 1,000). |
| `requestTimeoutSecs` | integer | `30` | No | Per-request timeout in seconds (range: 5–120). |
Default `removeSelectors`:
`script`, `style`, `nav`, `footer`, `header`, `noscript`, `iframe`, `aside`, `.ads`, `.menu`
Example input
{
  "urls": [
    "https://apify.com",
    "https://docs.apify.com/academy/getting-started"
  ],
  "onlyMainContent": true,
  "includeImages": true,
  "includeLinks": true,
  "removeSelectors": ["script", "style", "nav", "footer", "header", "noscript", "iframe", "aside", ".ads", ".cookie-banner"],
  "maxItems": 50,
  "requestTimeoutSecs": 30
}
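The documented limits on the input can be checked before a run is started. The following is a minimal sketch, not part of the actor itself: `build_input` is a hypothetical helper, and the lower bound of 1 for `maxItems` is an assumption (the table above only states the cap of 1,000).

```python
# Defaults copied from the "Default removeSelectors" list in this README.
DEFAULT_REMOVE_SELECTORS = [
    "script", "style", "nav", "footer", "header",
    "noscript", "iframe", "aside", ".ads", ".menu",
]


def build_input(urls, max_items=100, request_timeout_secs=30,
                only_main_content=True, include_images=True,
                include_links=True, remove_selectors=None):
    """Hypothetical helper: build an input dict and enforce the documented limits."""
    if not urls:
        raise ValueError("urls is required and must not be empty")
    # Assumption: at least 1 item; the README only documents the 1,000 cap.
    if not 1 <= max_items <= 1000:
        raise ValueError("maxItems must be between 1 and 1000")
    if not 5 <= request_timeout_secs <= 120:
        raise ValueError("requestTimeoutSecs must be between 5 and 120")
    if remove_selectors is None:
        remove_selectors = list(DEFAULT_REMOVE_SELECTORS)
    return {
        "urls": list(urls),
        "onlyMainContent": only_main_content,
        "includeImages": include_images,
        "includeLinks": include_links,
        "removeSelectors": remove_selectors,
        "maxItems": max_items,
        "requestTimeoutSecs": request_timeout_secs,
    }
```

Calling `build_input(["https://apify.com"], max_items=50)` returns a dictionary with the same keys as the example input, filled with the documented defaults where no value is given.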
Output example
Each converted page is saved as a dataset record. Here is a typical result:
{
  "url": "https://apify.com",
  "pageTitle": "Apify: The web scraping and automation platform",
  "pageDescription": "Apify is the platform where developers build, deploy, and share web scraping, data extraction, and automation tools.",
  "markdown": "# Apify\n\nApify is the platform where developers build, run, and share web scrapers and automation tools.\n\n## Get structured data from any website\n\nWe provide the hosting and infrastructure for scrapers...",
  "scrapedAt": "2026-06-10T04:15:00.000Z"
}
Failed records
If a URL cannot be fetched, the record is still saved with null content fields and an error message so your pipeline knows what to skip or retry:
{
  "url": "https://example.com/404-page",
  "pageTitle": null,
  "pageDescription": null,
  "markdown": null,
  "error": "Page not found: https://example.com/404-page",
  "scrapedAt": "2026-06-10T04:15:05.000Z"
}
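Because failed records share the dataset with successful ones, a downstream pipeline usually needs to separate the two. The sketch below assumes the dataset items have already been loaded as a list of dictionaries with the fields shown above; `split_records` is a hypothetical helper name.

```python
def split_records(records):
    """Separate successful records from failed ones.

    A record counts as failed when it carries an "error" message or its
    "markdown" field is null, matching the failed-record shape shown above.
    """
    succeeded, failed = [], []
    for record in records:
        if record.get("error") or record.get("markdown") is None:
            failed.append(record)
        else:
            succeeded.append(record)
    return succeeded, failed


def urls_to_retry(failed_records):
    """Collect the URLs of failed records so they can be submitted again."""
    return [record["url"] for record in failed_records]
```

The list returned by `urls_to_retry` can be passed as the `urls` input of a follow-up run.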
How it works
- URL validation: Each URL is validated for a correct scheme and host before any request is made.
- Page retrieval: Pages are fetched with full JavaScript rendering support, so single-page apps and dynamic sites work out of the box.
- HTML cleaning: Unwanted elements are removed using the configured CSS selector list before any content analysis begins.
- Main content extraction: When enabled, the actor locates semantic content containers (`<main>`, `<article>`, `#content`, `.content`, `[role="main"]`) and discards surrounding site chrome. If no semantic container is found, it falls back to the full page body.
- Markdown conversion: The cleaned HTML is converted to properly structured ATX-style Markdown, with configurable handling for images and links.
- Metadata extraction: The page title and meta description are captured alongside the Markdown.
- Dataset output: Each result is pushed to the Apify dataset immediately, so you can inspect partial results during a long run.
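The URL validation step can be approximated locally to filter a list before submitting it. This is a sketch under assumptions, not the actor's actual check: it accepts only `http` and `https` schemes and requires a non-empty host, and `is_valid_url` is a hypothetical name.

```python
from urllib.parse import urlsplit


def is_valid_url(url):
    """Return True when the URL has an http(s) scheme and a host.

    Assumption: only http and https are accepted; the actor's own rules
    may differ.
    """
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        # urlsplit raises ValueError for some malformed inputs,
        # such as an unclosed IPv6 bracket.
        return False
    return parts.scheme in ("http", "https") and bool(parts.hostname)


def filter_valid(urls):
    """Keep only the URLs that pass the check above, preserving order."""
    return [url for url in urls if is_valid_url(url)]
```

Filtering locally avoids spending a run slot on entries that would be rejected before any request is made.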
FAQ
Does this actor handle JavaScript-rendered pages?
Yes. The actor retrieves fully rendered page content, so sites built with React, Vue, Next.js, Angular, or any other client-side framework are handled correctly.
How does main content extraction work?
When `onlyMainContent` is enabled, the actor scans the page for semantic HTML elements (`<main>`, `<article>`) and common class/ID patterns such as `#content`, `.content`, and `#main`. If a match is found, only that block is converted. If no match is found, the full page body is used as a fallback.
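The match-then-fallback behavior can be illustrated with Python's standard `html.parser`. This is a simplified sketch and not the actor's implementation: it only looks for `<main>` and `<article>` tags, ignores the class/ID patterns, and returns plain text rather than Markdown. The names `MainContentText` and `extract_text` are hypothetical.

```python
from html.parser import HTMLParser


class MainContentText(HTMLParser):
    """Collect text inside the first <main> or <article>, plus all text as a fallback."""

    CONTAINERS = {"main", "article"}

    def __init__(self):
        super().__init__()
        self.container = None  # tag name of the container currently open
        self.depth = 0         # nesting depth of same-named tags inside it
        self.done = False      # True once the first container has closed
        self.main_text = []
        self.all_text = []

    def handle_starttag(self, tag, attrs):
        if self.container is None and not self.done and tag in self.CONTAINERS:
            self.container, self.depth = tag, 1
        elif tag == self.container:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.container:
            self.depth -= 1
            if self.depth == 0:
                self.container, self.done = None, True

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        self.all_text.append(text)
        if self.container is not None:
            self.main_text.append(text)


def extract_text(html):
    """Return the container text if one was found, else the text of the whole page."""
    parser = MainContentText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.main_text or parser.all_text)
```

For a page with a `<main>` element, the surrounding `<nav>` and `<footer>` text is dropped; for a page without one, all text is returned.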
Can I target specific sections to remove?
Yes. Use the removeSelectors input to provide any CSS selectors you want stripped before conversion. This works for custom widgets, related posts lists, tracking banners, comment sections, or any other element you want to exclude.
What is the URL limit per run?
The actor processes up to 1,000 URLs per run. For larger batches, split your list across multiple runs.
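Splitting a long list into run-sized batches takes only a few lines. A minimal sketch, with `chunk_urls` as a hypothetical helper and the batch size defaulting to the documented cap of 1,000:

```python
def chunk_urls(urls, size=1000):
    """Split a list of URLs into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [urls[i:i + size] for i in range(0, len(urls), size)]
```

Each batch can then be used as the `urls` input of a separate run, with `maxItems` set to at least the batch length.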
What happens if a page fails?
Failed pages are recorded in the dataset with null content and a descriptive error message. The run continues processing the remaining URLs rather than stopping on the first failure.
What Markdown format is used?
Headings use ATX style (`#`, `##`, `###`), lists use hyphens (`-`), and inline formatting follows standard CommonMark conventions. The output renders in any CommonMark-compatible Markdown renderer and can be passed to an LLM as plain text.
Can I increase the request timeout for slow sites?
Yes. Raise `requestTimeoutSecs` above the default of 30, up to a maximum of 120 seconds, for sites that take longer to respond.
Integrations and webhooks
Connect this actor to your existing tools using Apify integrations:
- Make (formerly Integromat) — trigger workflows when new results arrive
- Zapier — connect to thousands of apps automatically
- Google Sheets / Google Drive — export results directly to spreadsheets or Drive
- Slack — send notifications when a run finishes
- Airbyte / GitHub — sync output to data warehouses or version control
- Webhooks — call any HTTP endpoint as soon as results are added to the dataset
Get started
- Open the actor on Apify and click Try for free
- Paste one or more URLs into the Webpage URLs field
- Adjust content and selector options as needed
- Click Start and view results in the Dataset tab
For programmatic runs and dataset retrieval, see the API documentation.
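As a rough illustration of a programmatic run, the sketch below builds (but does not send) a request to the Apify API's synchronous "run actor and get dataset items" endpoint using only the standard library. The actor ID `username~actor-name` and the token are placeholders, since this README does not state the actor's ID; check the API documentation for the exact values and options.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id, token, run_input):
    """Build a POST request that runs an actor and returns its dataset items.

    actor_id and token are placeholders supplied by the caller; nothing is
    sent until the request is passed to urllib.request.urlopen.
    """
    url = f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Placeholder values for illustration only.
request = build_run_request(
    "username~actor-name",
    "YOUR_APIFY_TOKEN",
    {"urls": ["https://apify.com"], "maxItems": 1},
)
```

Sending the request with `urllib.request.urlopen(request)` should return the dataset items as JSON, assuming a valid token and actor ID. For long batches, an asynchronous run followed by a separate dataset download is likely a better fit than the synchronous endpoint.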