Website To Markdown

Pricing: Pay per usage
Rating: 0.0 (0 reviews)
Developer: ryan clinton (Maintained by Community)
Actor stats: 0 bookmarked · 2 total users · 1 monthly active user · last modified a day ago

Convert any website into clean, structured Markdown text ready for RAG pipelines, LLM context windows, and knowledge bases. Provide one or more URLs and get back well-formatted Markdown with metadata, links, and word counts.

Website to Markdown crawls web pages, strips out navigation, ads, scripts, and other non-content elements, then converts the remaining HTML into clean Markdown. It follows links to configurable depth, respects page limits, and outputs one structured record per page with the Markdown text plus optional metadata (title, description, H1, discovered links).

What data can you extract?

| Data point | Source | Example |
|---|---|---|
| Markdown content | HTML body converted to Markdown | `# Getting Started\n\nThis guide covers...` |
| Page title | `<title>` tag and meta tags | Getting Started - Acme Docs |
| Meta description | `<meta name="description">` | Learn how to set up Acme in 5 minutes |
| H1 heading | First `<h1>` on the page | Getting Started |
| Word count | Computed from Markdown output | 1,247 |
| Discovered links | All `<a>` href links on the page | `["https://docs.acme.com/install", ...]` |
| Page URL | Canonical URL after redirects | https://docs.acme.com/getting-started |

Why use Website to Markdown?

LLMs and RAG systems need clean text, not raw HTML full of navigation bars, cookie banners, and script tags. Manually copying and cleaning website content doesn't scale past a handful of pages. This actor automates the entire conversion pipeline: crawl pages, strip non-content elements, convert to Markdown, and return structured data with metadata.

Feed the output directly into vector databases, LLM context windows, fine-tuning datasets, or knowledge base systems without any post-processing.

Built on the Apify platform, Website to Markdown gives you capabilities you won't get from a simple script:

  • Scheduling -- run daily to keep your knowledge base in sync with source websites
  • API access -- trigger conversions programmatically from Python, JavaScript, or any HTTP client
  • Proxy rotation -- crawl at scale without IP blocks using Apify's built-in proxy infrastructure
  • Monitoring -- get notified when runs fail or produce unexpected results
  • Integrations -- connect directly to Zapier, Make, Google Sheets, or webhooks

Features

  • HTML to Markdown conversion that preserves headings, lists, tables, code blocks, bold/italic text, links, and images while stripping scripts, styles, navigation, footers, ads, and cookie banners
  • Configurable crawl depth (0-5 levels) to convert a single page or crawl an entire documentation site
  • Page limit control (1-1,000 pages) to manage costs and scope
  • Same-domain filtering to stay within the target website and avoid crawling external links
  • Custom CSS selector exclusion to remove specific page elements (sidebars, headers, footers) before conversion
  • Metadata extraction including page title, meta description, H1 heading, and word count
  • Link discovery that captures all links found on each page for further processing or site mapping
  • Automatic non-HTML filtering that skips PDFs, images, CSS, JavaScript, fonts, and other binary files
  • Pay-per-event pricing with per-page charging so you only pay for pages actually converted
  • Concurrent crawling at up to 10 simultaneous connections for fast processing of large sites

Use cases for converting websites to Markdown

RAG pipeline ingestion

Convert documentation sites, knowledge bases, or help centers into Markdown chunks ready for embedding into vector databases like Pinecone, Weaviate, or Chroma.
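Before embedding, the `markdown` field usually needs to be split into chunks. A minimal sketch of heading-based chunking (the function name, the 300-word budget, and the merging behavior are illustrative choices, not part of the actor):

```python
import re

def chunk_markdown(markdown: str, max_words: int = 300) -> list[str]:
    """Split Markdown into chunks at heading boundaries, merging short sections."""
    # Split immediately before any ATX heading (#, ##, ...)
    sections = re.split(r"\n(?=#{1,6} )", markdown)
    chunks, current = [], ""
    for section in sections:
        candidate = (current + "\n" + section).strip() if current else section.strip()
        if current and len(candidate.split()) > max_words:
            chunks.append(current)       # flush the chunk before it grows too large
            current = section.strip()
        else:
            current = candidate          # merge short sections together
    if current:
        chunks.append(current)
    return chunks

doc = "# Intro\n\nShort intro.\n\n## Setup\n\n" + ("word " * 400)
chunks = chunk_markdown(doc)             # two chunks: the intro, then the long Setup section
```

Each chunk then goes to your embedding model and vector store of choice.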

LLM context preparation

Extract clean text from web pages to use as context in ChatGPT, Claude, or other LLM prompts. Markdown is more token-efficient than HTML and preserves document structure.

Knowledge base building

Crawl competitor documentation, industry resources, or internal wikis and convert them into a structured Markdown corpus for search and retrieval systems.

Content migration

Moving content between CMS platforms? Convert existing web pages to Markdown for import into Hugo, Jekyll, Gatsby, Docusaurus, or any Markdown-based system.
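For static-site generators, each dataset item can be wrapped in YAML front matter. A sketch (the front-matter keys `sourceUrl` and `date` are illustrative; adapt them to your generator's conventions):

```python
def to_hugo_page(item: dict) -> str:
    """Render one dataset item as a Markdown file with YAML front matter.
    Field names (url, title, markdown, crawledAt) match the actor's output."""
    title = (item.get("title") or "Untitled").replace('"', "'")
    front_matter = "\n".join([
        "---",
        f'title: "{title}"',
        f'sourceUrl: "{item["url"]}"',
        f'date: "{item.get("crawledAt", "")}"',
        "---",
    ])
    return front_matter + "\n\n" + item["markdown"]

item = {
    "url": "https://docs.example.com/getting-started",
    "title": "Getting Started - Example Docs",
    "markdown": "# Getting Started\n\nWelcome.",
    "crawledAt": "2026-03-18T14:32:18.456Z",
}
page = to_hugo_page(item)  # write this string to content/getting-started.md
```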

Fine-tuning dataset creation

Build training datasets from web content by converting pages to clean Markdown. Filter by word count and metadata to select high-quality content.

Web research and archiving

Archive web content in a clean, readable format. Markdown is future-proof, version-controllable, and human-readable without any rendering engine.

How to convert websites to Markdown

  1. Provide URLs -- Enter one or more website URLs to convert. Each URL is a starting point for the crawl (e.g., https://docs.example.com).
  2. Set crawl depth -- Choose how deep to follow links. Depth 0 converts only the input URLs. Depth 1 also converts pages linked from those URLs. Depth 2 goes one level further, and so on (max 5).
  3. Set page limit -- Choose the maximum number of pages to convert (default 10, max 1,000). This applies globally across all input URLs.
  4. Run the actor -- Click "Start" to begin. The actor crawls each URL, strips non-content HTML, converts to Markdown, and outputs structured results.
  5. Download results -- Once finished, download your data as JSON, CSV, or Excel from the Dataset tab. Each row contains one page's Markdown content with metadata.

Input parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| urls | string[] | Yes | -- | Starting URLs to crawl and convert to Markdown. |
| maxPages | integer | No | 10 | Maximum total pages to crawl across all URLs (1-1,000). |
| maxDepth | integer | No | 1 | How deep to follow links. 0 = only input URLs. 1 = input URLs + pages they link to. Max 5. |
| sameDomainOnly | boolean | No | true | Only follow links within the same domain as the starting URL. Disable to crawl across domains. |
| includeMetadata | boolean | No | true | Include page title, description, H1, and discovered links in the output. |
| excludeSelectors | string[] | No | [] | Additional CSS selectors to remove before conversion (e.g. nav, footer, .sidebar). Common non-content elements are removed by default. |
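If you build inputs programmatically, the documented defaults and ranges can be enforced client-side before calling the actor. A sketch (the actor validates input itself; this just catches mistakes earlier):

```python
def validate_input(raw: dict) -> dict:
    """Apply the documented defaults and clamp values to the documented ranges."""
    urls = raw.get("urls")
    if not urls:
        raise ValueError("urls is required and must be a non-empty list")
    return {
        "urls": list(urls),
        "maxPages": min(max(int(raw.get("maxPages", 10)), 1), 1000),  # 1-1,000, default 10
        "maxDepth": min(max(int(raw.get("maxDepth", 1)), 0), 5),      # 0-5, default 1
        "sameDomainOnly": bool(raw.get("sameDomainOnly", True)),
        "includeMetadata": bool(raw.get("includeMetadata", True)),
        "excludeSelectors": list(raw.get("excludeSelectors", [])),
    }

cfg = validate_input({"urls": ["https://docs.example.com"], "maxDepth": 9})  # depth clamped to 5
```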

Input examples

Convert a single page:

{
  "urls": ["https://docs.example.com/getting-started"]
}

Crawl an entire documentation site:

{
  "urls": ["https://docs.example.com"],
  "maxPages": 200,
  "maxDepth": 3,
  "sameDomainOnly": true
}

Convert multiple pages with no link following:

{
  "urls": [
    "https://example.com/blog/post-1",
    "https://example.com/blog/post-2",
    "https://example.com/blog/post-3"
  ],
  "maxDepth": 0
}

Remove sidebar and banner before conversion:

{
  "urls": ["https://example.com"],
  "maxPages": 50,
  "maxDepth": 2,
  "excludeSelectors": [".sidebar", ".banner", "#cookie-notice"]
}

Input tips

  • Use depth 0 for specific pages -- if you have exact URLs, set maxDepth to 0 to skip link following and convert only the pages you specified.
  • Use depth 1-2 for documentation sites -- this captures the main content and one or two levels of subpages.
  • Exclude noisy elements -- if the default stripping leaves unwanted content (sidebars, promotional banners), add their CSS selectors to excludeSelectors.
  • Start with a small page limit to preview the output quality before crawling hundreds of pages.

Output example

Each item in the output dataset represents one converted page:

{
  "url": "https://docs.example.com/getting-started",
  "title": "Getting Started - Example Docs",
  "description": "Learn how to set up Example in 5 minutes.",
  "h1": "Getting Started",
  "markdown": "# Getting Started\n\nWelcome to Example. This guide walks you through setting up your first project.\n\n## Prerequisites\n\n- Node.js 18 or later\n- An Example account ([sign up here](https://example.com/signup))\n\n## Installation\n\n```bash\nnpm install @example/sdk\n```\n\n## Quick start\n\n1. Import the SDK\n2. Initialize with your API key\n3. Make your first request\n\n```javascript\nimport { Example } from '@example/sdk';\n\nconst client = new Example({ apiKey: 'YOUR_KEY' });\nconst result = await client.query('hello world');\nconsole.log(result);\n```",
  "wordCount": 87,
  "links": [
    "https://docs.example.com/installation",
    "https://docs.example.com/authentication",
    "https://docs.example.com/api-reference"
  ],
  "crawledAt": "2026-03-18T14:32:18.456Z"
}

Output fields

| Field | Type | Description |
|---|---|---|
| url | string | Canonical URL of the page (after redirects, fragment stripped) |
| title | string | Page title from `<title>` tag (only when includeMetadata is true) |
| description | string | Meta description from `<meta name="description">` (only when includeMetadata is true) |
| h1 | string | Text of the first `<h1>` heading on the page (only when includeMetadata is true) |
| markdown | string | Clean Markdown conversion of the page content with non-content elements stripped |
| wordCount | number | Number of words in the Markdown output |
| links | string[] | All links discovered on the page (only when includeMetadata is true) |
| crawledAt | string | ISO 8601 timestamp of when the page was crawled |

How much does it cost to convert websites to Markdown?

Website to Markdown uses pay-per-event pricing -- you pay $0.05 per page converted. Platform usage costs are included in the price.

| Scenario | Pages | Cost per page | Total cost |
|---|---|---|---|
| Quick test | 1 | $0.05 | $0.05 |
| Blog post batch | 10 | $0.05 | $0.50 |
| Documentation site | 100 | $0.05 | $5.00 |
| Large knowledge base | 500 | $0.05 | $25.00 |
| Full site archive | 1,000 | $0.05 | $50.00 |

You can set a maximum spending limit per run to control costs. The actor charges per page converted, so you only pay for pages that produce output.
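The arithmetic is straightforward to encode when sizing maxPages against a budget (a sketch; the integer-cent division avoids floating-point surprises):

```python
PRICE_PER_PAGE = 0.05  # USD per converted page, per the pricing table above

def estimate_cost(pages: int) -> float:
    """Total cost in USD for a given page count."""
    return round(pages * PRICE_PER_PAGE, 2)

def pages_within_budget(budget_usd: float) -> int:
    """How many pages a budget covers; computed in whole cents to stay exact."""
    return int(round(budget_usd * 100)) // 5

cost = estimate_cost(100)        # $5.00 for a documentation site
budget_pages = pages_within_budget(2.0)  # 40 pages for $2
```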

Convert websites to Markdown using the API

Python

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("ryanclinton/website-to-markdown").call(run_input={
    "urls": ["https://docs.example.com"],
    "maxPages": 50,
    "maxDepth": 2,
    "sameDomainOnly": True,
})

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"{item['url']} ({item['wordCount']} words)")
    # Save markdown to file, named after the last URL segment
    filename = item["url"].split("/")[-1] or "index"
    with open(f"{filename}.md", "w") as f:
        f.write(item["markdown"])

JavaScript

import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: "YOUR_API_TOKEN" });

const run = await client.actor("ryanclinton/website-to-markdown").call({
  urls: ["https://docs.example.com"],
  maxPages: 50,
  maxDepth: 2,
  sameDomainOnly: true,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
for (const item of items) {
  console.log(`${item.url} (${item.wordCount} words)`);
  console.log(item.markdown.substring(0, 200) + "...\n");
}

cURL

# Start the actor run
curl -X POST "https://api.apify.com/v2/acts/ryanclinton~website-to-markdown/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://docs.example.com"],
    "maxPages": 50,
    "maxDepth": 2,
    "sameDomainOnly": true
  }'

# Fetch results (replace DATASET_ID with the ID from the run response)
curl "https://api.apify.com/v2/datasets/DATASET_ID/items?token=YOUR_API_TOKEN&format=json"

How Website to Markdown works

The actor runs a pipeline with two main phases:

Phase 1: Crawl and discover

The actor normalizes each input URL (adding https:// when no scheme is given), validates it, and creates initial crawl requests at depth 0. CheerioCrawler processes pages with up to 10 concurrent connections, a 120 requests/minute rate limit, a 30-second timeout per page, and 2 automatic retries. For each page, the actor checks the global page counter against maxPages and deduplicates URLs by stripping fragments.
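The normalization and dedup key described above amounts to scheme-filling plus fragment-stripping. A sketch of the same idea (the function name is illustrative, not the actor's internal code):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Add https:// when no scheme is given and drop the #fragment."""
    if "://" not in url:
        url = "https://" + url
    scheme, netloc, path, query, _fragment = urlsplit(url)
    return urlunsplit((scheme, netloc, path, query, ""))  # empty fragment

# Dedup: the first two URLs collapse to the same key
seen = set()
for raw in ["docs.example.com/a#intro",
            "https://docs.example.com/a",
            "https://docs.example.com/b"]:
    seen.add(normalize_url(raw))
```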

Phase 2: Strip and convert

For each crawled page, the actor:

  1. Extracts links -- Discovers all <a> href links on the page for both crawl enqueuing and output. This happens before stripping, so navigation links are still available for crawling.
  2. Strips non-content elements -- Removes <script>, <style>, <nav>, <footer>, <header>, ads, cookie banners, and any elements matching excludeSelectors.
  3. Extracts metadata -- Pulls the page title from <title>, description from <meta name="description">, and H1 text from the first <h1> element.
  4. Converts to Markdown -- Transforms the cleaned HTML into Markdown, preserving headings, lists, tables, code blocks, links, images, bold, and italic formatting.
  5. Counts words -- Computes the word count from the Markdown output. Pages with fewer than 5 words are skipped as empty.
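The stripping, conversion, and word-count steps can be illustrated with a toy stdlib-only converter. This is a deliberately minimal sketch, not the actor's actual converter (which handles many more elements):

```python
from html.parser import HTMLParser

SKIP = {"script", "style", "nav", "footer", "header"}  # non-content elements

class MiniMarkdown(HTMLParser):
    """Toy converter: skip non-content tags, emit headings/paragraphs as Markdown."""
    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside a skipped element
        self.parts = []
        self.prefix = ""
    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.depth += 1
        elif tag in ("h1", "h2", "h3"):
            self.prefix = "#" * int(tag[1]) + " "  # h2 -> "## "
    def handle_endtag(self, tag):
        if tag in SKIP:
            self.depth -= 1
    def handle_data(self, data):
        text = data.strip()
        if text and self.depth == 0:
            self.parts.append(self.prefix + text)
            self.prefix = ""

def to_markdown(html: str) -> str:
    parser = MiniMarkdown()
    parser.feed(html)
    return "\n\n".join(parser.parts)

html = ("<nav>Home | Docs</nav><h1>Getting Started</h1>"
        "<p>Welcome to Example.</p><script>x()</script>")
md = to_markdown(html)          # nav and script content are gone
word_count = len(md.split())    # pages under 5 words would be skipped as empty
```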

If maxDepth allows further crawling, discovered links are enqueued with incremented depth. When sameDomainOnly is enabled, links to other domains are filtered out. Non-HTML URLs (PDFs, images, CSS, JS, fonts, archives) are automatically skipped.

Limitations

  • HTTP-only scraping -- uses CheerioCrawler (server-side HTML parsing), not a browser. JavaScript-rendered content (React, Angular, Vue SPAs) will not be captured. Only the server-rendered HTML is converted.
  • Maximum 1,000 pages per run -- this is a hard cap to prevent runaway crawls. For larger sites, run multiple batches with different starting URLs.
  • Maximum crawl depth of 5 -- deep link chains beyond 5 levels are not followed.
  • No authentication -- only processes publicly accessible pages. Login-gated content is not supported.
  • Same-domain default -- by default, only follows links within the same domain. Cross-domain crawling can be enabled but requires careful page limits to avoid unbounded crawling.
  • Markdown fidelity varies -- complex HTML layouts (multi-column, heavily styled content, interactive elements) may not convert perfectly to Markdown. The output prioritizes clean text extraction over pixel-perfect layout preservation.
  • No image downloading -- images are represented as Markdown image links (![alt](url)) but the actual image files are not downloaded or stored.

Integrations

  • Zapier -- Trigger a Zap when new Markdown content is extracted. Push to Notion, Google Docs, or any connected app.
  • Make -- Build automated workflows that feed converted Markdown into knowledge bases, CMS platforms, or AI pipelines.
  • Google Sheets -- Export page URLs, titles, and word counts to Google Sheets for content audits.
  • Apify API -- Call the actor programmatically. Start runs, poll for completion, and download results in JSON, CSV, XML, or Excel format.
  • Webhooks -- Receive notifications when a run completes, then automatically process results in your backend.
  • LangChain / LlamaIndex -- Feed Markdown output directly into LLM agent frameworks for RAG, summarization, or question-answering workflows.
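For the LangChain/LlamaIndex path, dataset items map onto the page_content/metadata pair that LangChain's Document expects. A sketch using plain dicts (pass these fields to `Document(page_content=..., metadata=...)` in real code):

```python
def to_documents(items):
    """Shape dataset items as {page_content, metadata} records for RAG loaders."""
    docs = []
    for item in items:
        docs.append({
            "page_content": item["markdown"],
            "metadata": {
                "source": item["url"],          # conventional provenance key
                "title": item.get("title"),
                "word_count": item.get("wordCount"),
            },
        })
    return docs

items = [{"url": "https://docs.example.com/a",
          "markdown": "# A\n\nBody.", "title": "A", "wordCount": 3}]
docs = to_documents(items)
```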

Responsible use

  • This actor only accesses publicly visible web pages.
  • Respect website terms of service and robots.txt directives.
  • Do not use this actor to scrape copyrighted content for redistribution without permission.
  • Be mindful of crawl rates when targeting small websites -- the default rate limits are designed to be respectful.
  • For guidance on web scraping legality, see Apify's guide.

FAQ

How many pages can I convert in one run? Up to 1,000 pages per run. The default is 10. Set maxPages to control the total. For sites with more than 1,000 pages, run multiple batches starting from different sections of the site.

Does this work with JavaScript-heavy websites (React, Vue, Angular)? No. The actor uses CheerioCrawler which parses server-rendered HTML only. If a website loads its main content via JavaScript, the Markdown output will be empty or incomplete. For JS-rendered sites, consider using a browser-based scraper first and then converting the rendered HTML.

What HTML elements are stripped before conversion? By default, the actor removes <script>, <style>, <nav>, <footer>, <header>, <iframe>, cookie consent banners, ad containers, and other non-content elements. You can add additional elements to remove using the excludeSelectors parameter.

Can I use the Markdown output directly in ChatGPT or Claude? Yes. The Markdown output is clean text optimized for LLM consumption. It preserves document structure (headings, lists, code blocks) while being more token-efficient than raw HTML. Copy the markdown field directly into your prompt or use the API to feed it programmatically.

How is this different from just copying text from a webpage? Copy-paste loses structure -- headings become plain text, lists lose formatting, code blocks merge with surrounding text. This actor preserves Markdown formatting, handles multiple pages automatically, extracts metadata, and outputs structured JSON ready for programmatic use.

Can I schedule this to run periodically? Yes. Use Apify Schedules to run the actor daily, weekly, or at any custom interval. This is useful for keeping knowledge bases in sync with source documentation that changes frequently.

What happens if a page fails to load? The crawler automatically retries failed pages up to 2 times. If a page still fails, it is skipped and a warning is logged. Other pages continue processing normally.

Can I crawl across multiple domains? Yes. Set sameDomainOnly to false to follow links to any domain. Be sure to set a reasonable maxPages limit to prevent the crawl from expanding indefinitely.

Related actors

| Actor | How to combine |
|---|---|
| Website Contact Scraper | Extract emails, phones, and team members from websites instead of converting to Markdown |
| Google Maps Email Extractor | Find businesses on Google Maps and extract their contact information |
| Website Tech Stack Detector | Identify what technologies a website uses alongside its content |
| WHOIS Domain Lookup | Look up domain registration details for websites you're converting |

Support

Found a bug or have a feature request? Open an issue in the Issues tab on this actor's page. For custom scraping solutions or enterprise integrations, reach out through the Apify platform.