Website To Markdown
Pricing: Pay per usage
Developer: ryan clinton
Convert any website into clean, structured Markdown text ready for RAG pipelines, LLM context windows, and knowledge bases. Provide one or more URLs and get back well-formatted Markdown with metadata, links, and word counts.
Website to Markdown crawls web pages, strips out navigation, ads, scripts, and other non-content elements, then converts the remaining HTML into clean Markdown. It follows links to configurable depth, respects page limits, and outputs one structured record per page with the Markdown text plus optional metadata (title, description, H1, discovered links).
What data can you extract?
| Data Point | Source | Example |
|---|---|---|
| Markdown content | HTML body converted to Markdown | `# Getting Started\n\nThis guide covers...` |
| Page title | `<title>` tag and meta tags | Getting Started - Acme Docs |
| Meta description | `<meta name="description">` | Learn how to set up Acme in 5 minutes |
| H1 heading | First `<h1>` on the page | Getting Started |
| Word count | Computed from Markdown output | 1,247 |
| Discovered links | All `<a>` href links on the page | `["https://docs.acme.com/install", ...]` |
| Page URL | Canonical URL after redirects | https://docs.acme.com/getting-started |
Why use Website to Markdown?
LLMs and RAG systems need clean text, not raw HTML full of navigation bars, cookie banners, and script tags. Manually copying and cleaning website content doesn't scale past a handful of pages. This actor automates the entire conversion pipeline: crawl pages, strip non-content elements, convert to Markdown, and return structured data with metadata.
Feed the output directly into vector databases, LLM context windows, fine-tuning datasets, or knowledge base systems without any post-processing.
Built on the Apify platform, Website to Markdown gives you capabilities you won't get from a simple script:
- Scheduling -- run daily to keep your knowledge base in sync with source websites
- API access -- trigger conversions programmatically from Python, JavaScript, or any HTTP client
- Proxy rotation -- crawl at scale without IP blocks using Apify's built-in proxy infrastructure
- Monitoring -- get notified when runs fail or produce unexpected results
- Integrations -- connect directly to Zapier, Make, Google Sheets, or webhooks
Features
- HTML to Markdown conversion that preserves headings, lists, tables, code blocks, bold/italic text, links, and images while stripping scripts, styles, navigation, footers, ads, and cookie banners
- Configurable crawl depth (0-5 levels) to convert a single page or crawl an entire documentation site
- Page limit control (1-1,000 pages) to manage costs and scope
- Same-domain filtering to stay within the target website and avoid crawling external links
- Custom CSS selector exclusion to remove specific page elements (sidebars, headers, footers) before conversion
- Metadata extraction including page title, meta description, H1 heading, and word count
- Link discovery that captures all links found on each page for further processing or site mapping
- Automatic non-HTML filtering that skips PDFs, images, CSS, JavaScript, fonts, and other binary files
- Pay-per-event pricing with per-page charging so you only pay for pages actually converted
- Concurrent crawling at up to 10 simultaneous connections for fast processing of large sites
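The automatic non-HTML filtering above amounts to an extension check on each discovered URL. A minimal sketch of that idea, assuming an illustrative extension list (the actor's actual list is not documented here):

```python
from urllib.parse import urlparse

# Illustrative set of non-HTML extensions -- an assumption, not the actor's exact list.
SKIP_EXTENSIONS = {
    ".pdf", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico",
    ".css", ".js", ".woff", ".woff2", ".ttf", ".zip",
}

def is_probably_html(url: str) -> bool:
    """Return False for URLs whose path ends in a known binary/asset extension."""
    path = urlparse(url).path.lower()
    return not any(path.endswith(ext) for ext in SKIP_EXTENSIONS)
```

Filtering by extension is cheap but approximate; a crawler can additionally check the `Content-Type` response header for pages without a file extension.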
Use cases for converting websites to Markdown
RAG pipeline ingestion
Convert documentation sites, knowledge bases, or help centers into Markdown chunks ready for embedding into vector databases like Pinecone, Weaviate, or Chroma.
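Before embedding, the Markdown usually needs to be split into chunks. A minimal stdlib sketch that splits on headings and caps chunk size (real pipelines typically use a dedicated splitter with overlap; the 300-word cap is an arbitrary example value):

```python
import re

def chunk_markdown(markdown: str, max_words: int = 300) -> list[str]:
    """Split Markdown into heading-delimited chunks of at most max_words words."""
    # Split at every line that starts a Markdown heading, keeping the heading
    # with the text that follows it.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks = []
    for section in sections:
        words = section.split()
        if not words:
            continue
        # Further split long sections so no chunk exceeds max_words.
        for i in range(0, len(words), max_words):
            chunks.append(" ".join(words[i:i + max_words]))
    return chunks
```

Each chunk keeps its heading as context, which generally improves retrieval quality when the chunks are embedded individually.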
LLM context preparation
Extract clean text from web pages to use as context in ChatGPT, Claude, or other LLM prompts. Markdown is more token-efficient than HTML and preserves document structure.
Knowledge base building
Crawl competitor documentation, industry resources, or internal wikis and convert them into a structured Markdown corpus for search and retrieval systems.
Content migration
Moving content between CMS platforms? Convert existing web pages to Markdown for import into Hugo, Jekyll, Gatsby, Docusaurus, or any Markdown-based system.
Fine-tuning dataset creation
Build training datasets from web content by converting pages to clean Markdown. Filter by word count and metadata to select high-quality content.
Web research and archiving
Archive web content in a clean, readable format. Markdown is future-proof, version-controllable, and human-readable without any rendering engine.
How to convert websites to Markdown
- Provide URLs -- Enter one or more website URLs to convert. Each URL is a starting point for the crawl (e.g., https://docs.example.com).
- Set crawl depth -- Choose how deep to follow links. Depth 0 converts only the input URLs. Depth 1 also converts pages linked from those URLs. Depth 2 goes one level further, and so on (max 5).
- Set page limit -- Choose the maximum number of pages to convert (default 10, max 1,000). This applies globally across all input URLs.
- Run the actor -- Click "Start" to begin. The actor crawls each URL, strips non-content HTML, converts to Markdown, and outputs structured results.
- Download results -- Once finished, download your data as JSON, CSV, or Excel from the Dataset tab. Each row contains one page's Markdown content with metadata.
Input parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `urls` | string[] | Yes | -- | Starting URLs to crawl and convert to Markdown. |
| `maxPages` | integer | No | 10 | Maximum total pages to crawl across all URLs (1-1,000). |
| `maxDepth` | integer | No | 1 | How deep to follow links. 0 = only input URLs. 1 = input URLs + pages they link to. Max 5. |
| `sameDomainOnly` | boolean | No | true | Only follow links within the same domain as the starting URL. Disable to crawl across domains. |
| `includeMetadata` | boolean | No | true | Include page title, description, H1, and discovered links in the output. |
| `excludeSelectors` | string[] | No | [] | Additional CSS selectors to remove before conversion (e.g. `nav`, `footer`, `.sidebar`). Common non-content elements are removed by default. |
Input examples
Convert a single page:
```json
{
  "urls": ["https://docs.example.com/getting-started"]
}
```
Crawl an entire documentation site:
```json
{
  "urls": ["https://docs.example.com"],
  "maxPages": 200,
  "maxDepth": 3,
  "sameDomainOnly": true
}
```
Convert multiple pages with no link following:
```json
{
  "urls": [
    "https://example.com/blog/post-1",
    "https://example.com/blog/post-2",
    "https://example.com/blog/post-3"
  ],
  "maxDepth": 0
}
```
Remove sidebar and banner before conversion:
```json
{
  "urls": ["https://example.com"],
  "maxPages": 50,
  "maxDepth": 2,
  "excludeSelectors": [".sidebar", ".banner", "#cookie-notice"]
}
```
Input tips
- Use depth 0 for specific pages -- if you have exact URLs, set `maxDepth` to 0 to skip link following and convert only the pages you specified.
- Use depth 1-2 for documentation sites -- this captures the main content and one or two levels of subpages.
- Exclude noisy elements -- if the default stripping leaves unwanted content (sidebars, promotional banners), add their CSS selectors to `excludeSelectors`.
- Start with a small page limit to preview the output quality before crawling hundreds of pages.
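The depth and page-limit semantics can be modeled as a breadth-first traversal. In this sketch, `links` is a stand-in mapping from URL to discovered links, replacing real page fetches:

```python
from collections import deque

def crawl_order(start_urls: list[str], links: dict[str, list[str]],
                max_depth: int = 1, max_pages: int = 10) -> list[str]:
    """Return the URLs a depth-limited, page-limited BFS crawl would visit.

    Depth 0 yields only the inputs; depth 1 adds pages they link to, and so on.
    """
    seen: set[str] = set()
    order: list[str] = []
    queue = deque((url, 0) for url in start_urls)
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        if url in seen:
            continue  # deduplicate across all branches of the crawl
        seen.add(url)
        order.append(url)
        if depth < max_depth:
            queue.extend((nxt, depth + 1) for nxt in links.get(url, []))
    return order
```

Running it on a tiny link graph makes the depth rules concrete: with `max_depth=0` only the start URL is converted; each extra level adds the pages one hop further out.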
Output example
Each item in the output dataset represents one converted page:
```json
{
  "url": "https://docs.example.com/getting-started",
  "title": "Getting Started - Example Docs",
  "description": "Learn how to set up Example in 5 minutes.",
  "h1": "Getting Started",
  "markdown": "# Getting Started\n\nWelcome to Example. This guide walks you through setting up your first project.\n\n## Prerequisites\n\n- Node.js 18 or later\n- An Example account ([sign up here](https://example.com/signup))\n\n## Installation\n\n```bash\nnpm install @example/sdk\n```\n\n## Quick start\n\n1. Import the SDK\n2. Initialize with your API key\n3. Make your first request\n\n```javascript\nimport { Example } from '@example/sdk';\n\nconst client = new Example({ apiKey: 'YOUR_KEY' });\nconst result = await client.query('hello world');\nconsole.log(result);\n```",
  "wordCount": 87,
  "links": [
    "https://docs.example.com/installation",
    "https://docs.example.com/authentication",
    "https://docs.example.com/api-reference"
  ],
  "crawledAt": "2026-03-18T14:32:18.456Z"
}
```
Output fields
| Field | Type | Description |
|---|---|---|
| `url` | string | Canonical URL of the page (after redirects, fragment stripped) |
| `title` | string | Page title from the `<title>` tag (only when `includeMetadata` is true) |
| `description` | string | Meta description from `<meta name="description">` (only when `includeMetadata` is true) |
| `h1` | string | Text of the first `<h1>` heading on the page (only when `includeMetadata` is true) |
| `markdown` | string | Clean Markdown conversion of the page content with non-content elements stripped |
| `wordCount` | number | Number of words in the Markdown output |
| `links` | string[] | All links discovered on the page (only when `includeMetadata` is true) |
| `crawledAt` | string | ISO 8601 timestamp of when the page was crawled |
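As one example of post-processing these fields, you can filter out thin pages and write each item's Markdown to disk. The 50-word threshold and slug-based file naming below are illustrative choices, not part of the actor:

```python
import pathlib

def save_markdown(items: list[dict], out_dir: str = "markdown",
                  min_words: int = 50) -> int:
    """Write each dataset item's markdown to <out_dir>/<slug>.md.

    Items below min_words are skipped; returns the number of files written.
    """
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    kept = 0
    for item in items:
        if item.get("wordCount", 0) < min_words:
            continue  # skip near-empty pages
        # Derive a filename from the last URL path segment.
        slug = item["url"].rstrip("/").split("/")[-1] or "index"
        (out / f"{slug}.md").write_text(item["markdown"], encoding="utf-8")
        kept += 1
    return kept
```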
How much does it cost to convert websites to Markdown?
Website to Markdown uses pay-per-event pricing -- you pay $0.05 per page converted. Platform usage costs are included in the price.
| Scenario | Pages | Cost per page | Total cost |
|---|---|---|---|
| Quick test | 1 | $0.05 | $0.05 |
| Blog post batch | 10 | $0.05 | $0.50 |
| Documentation site | 100 | $0.05 | $5.00 |
| Large knowledge base | 500 | $0.05 | $25.00 |
| Full site archive | 1,000 | $0.05 | $50.00 |
You can set a maximum spending limit per run to control costs. The actor charges per page converted, so you only pay for pages that produce output.
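Since billing is per page, the worst-case cost of a run is simply `maxPages` times the per-page price:

```python
PRICE_PER_PAGE = 0.05  # USD per converted page, as listed above

def estimated_cost(max_pages: int) -> float:
    """Upper-bound cost for a run. The actual charge can be lower,
    since you only pay for pages that actually produce output."""
    return round(max_pages * PRICE_PER_PAGE, 2)
```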
Convert websites to Markdown using the API
Python
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("ryanclinton/website-to-markdown").call(
    run_input={
        "urls": ["https://docs.example.com"],
        "maxPages": 50,
        "maxDepth": 2,
        "sameDomainOnly": True,
    }
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"{item['url']} ({item['wordCount']} words)")
    # Save markdown to file
    filename = item["url"].split("/")[-1] or "index"
    with open(f"{filename}.md", "w") as f:
        f.write(item["markdown"])
```
JavaScript
```javascript
import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: "YOUR_API_TOKEN" });

const run = await client.actor("ryanclinton/website-to-markdown").call({
    urls: ["https://docs.example.com"],
    maxPages: 50,
    maxDepth: 2,
    sameDomainOnly: true,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
for (const item of items) {
    console.log(`${item.url} (${item.wordCount} words)`);
    console.log(item.markdown.substring(0, 200) + "...\n");
}
```
cURL
```bash
# Start the actor run
curl -X POST "https://api.apify.com/v2/acts/ryanclinton~website-to-markdown/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://docs.example.com"], "maxPages": 50, "maxDepth": 2, "sameDomainOnly": true}'

# Fetch results (replace DATASET_ID from the run response)
curl "https://api.apify.com/v2/datasets/DATASET_ID/items?token=YOUR_API_TOKEN&format=json"
```
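The raw HTTP flow starts the run asynchronously, so you need to wait for it to finish before fetching the dataset. A minimal polling sketch using only the Python standard library (the run-status endpoint shape here follows the public Apify API; the official client libraries handle this waiting for you):

```python
import json
import time
import urllib.request

# States in which a run can no longer change.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}

def is_terminal(status: str) -> bool:
    """Return True once a run has reached a final state."""
    return status in TERMINAL_STATUSES

def wait_for_run(run_id: str, token: str, poll_seconds: float = 5.0) -> str:
    """Poll the Apify run-status endpoint until the run finishes."""
    url = f"https://api.apify.com/v2/actor-runs/{run_id}?token={token}"
    while True:
        with urllib.request.urlopen(url) as resp:
            status = json.load(resp)["data"]["status"]
        if is_terminal(status):
            return status
        time.sleep(poll_seconds)
```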
How Website to Markdown works
The actor runs a pipeline with two main phases:
Phase 1: Crawl and discover
The actor normalizes each input URL (forces HTTPS if missing), validates it, and creates initial crawl requests at depth 0. CheerioCrawler processes pages with up to 10 concurrent connections, 120 requests/minute rate limit, 30-second timeout per page, and 2 automatic retries. For each page, the actor checks the global page counter against maxPages and deduplicates URLs by stripping fragments.
Phase 2: Strip and convert
For each crawled page, the actor:
- Strips non-content elements -- Removes `<script>`, `<style>`, `<nav>`, `<footer>`, `<header>`, ads, cookie banners, and any elements matching `excludeSelectors`. Link extraction runs before this stripping step, so navigation links are still discovered for crawling.
- Extracts links -- Discovers all `<a>` href links on the page for both crawl enqueuing and output.
- Extracts metadata -- Pulls the page title from `<title>`, the description from `<meta name="description">`, and the H1 text from the first `<h1>` element.
- Converts to Markdown -- Transforms the cleaned HTML into Markdown, preserving headings, lists, tables, code blocks, links, images, bold, and italic formatting.
- Counts words -- Computes the word count from the Markdown output. Pages with fewer than 5 words are skipped as empty.
If maxDepth allows further crawling, discovered links are enqueued with incremented depth. When sameDomainOnly is enabled, links to other domains are filtered out. Non-HTML URLs (PDFs, images, CSS, JS, fonts, archives) are automatically skipped.
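The normalization, fragment-stripping, and same-domain checks described above can be sketched with the standard library (a simplified model, not the actor's exact code):

```python
from urllib.parse import urldefrag, urlparse

def normalize(url: str) -> str:
    """Force https when no scheme is given and strip the #fragment,
    so the same page is never crawled twice under different fragments."""
    if "://" not in url:
        url = "https://" + url
    return urldefrag(url)[0]

def same_domain(url_a: str, url_b: str) -> bool:
    """True when both URLs share a host, as used by sameDomainOnly filtering."""
    return urlparse(url_a).netloc == urlparse(url_b).netloc
```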
Limitations
- HTTP-only scraping -- uses CheerioCrawler (server-side HTML parsing), not a browser. JavaScript-rendered content (React, Angular, Vue SPAs) will not be captured. Only the server-rendered HTML is converted.
- Maximum 1,000 pages per run -- this is a hard cap to prevent runaway crawls. For larger sites, run multiple batches with different starting URLs.
- Maximum crawl depth of 5 -- deep link chains beyond 5 levels are not followed.
- No authentication -- only processes publicly accessible pages. Login-gated content is not supported.
- Same-domain default -- by default, only follows links within the same domain. Cross-domain crawling can be enabled but requires careful page limits to avoid unbounded crawling.
- Markdown fidelity varies -- complex HTML layouts (multi-column, heavily styled content, interactive elements) may not convert perfectly to Markdown. The output prioritizes clean text extraction over pixel-perfect layout preservation.
- No image downloading -- images are represented as Markdown image links (`![alt](url)`) but the actual image files are not downloaded or stored.
Integrations
- Zapier -- Trigger a Zap when new Markdown content is extracted. Push to Notion, Google Docs, or any connected app.
- Make -- Build automated workflows that feed converted Markdown into knowledge bases, CMS platforms, or AI pipelines.
- Google Sheets -- Export page URLs, titles, and word counts to Google Sheets for content audits.
- Apify API -- Call the actor programmatically. Start runs, poll for completion, and download results in JSON, CSV, XML, or Excel format.
- Webhooks -- Receive notifications when a run completes, then automatically process results in your backend.
- LangChain / LlamaIndex -- Feed Markdown output directly into LLM agent frameworks for RAG, summarization, or question-answering workflows.
Responsible use
- This actor only accesses publicly visible web pages.
- Respect website terms of service and `robots.txt` directives.
- Do not use this actor to scrape copyrighted content for redistribution without permission.
- Be mindful of crawl rates when targeting small websites -- the default rate limits are designed to be respectful.
- For guidance on web scraping legality, see Apify's guide.
FAQ
How many pages can I convert in one run?
Up to 1,000 pages per run. The default is 10. Set maxPages to control the total. For sites with more than 1,000 pages, run multiple batches starting from different sections of the site.
Does this work with JavaScript-heavy websites (React, Vue, Angular)?
No. The actor uses CheerioCrawler, which parses server-rendered HTML only. If a website loads its main content via JavaScript, the Markdown output will be empty or incomplete. For JS-rendered sites, consider using a browser-based scraper first and then converting the rendered HTML.
What HTML elements are stripped before conversion?
By default, the actor removes <script>, <style>, <nav>, <footer>, <header>, <iframe>, cookie consent banners, ad containers, and other non-content elements. You can add additional elements to remove using the excludeSelectors parameter.
Can I use the Markdown output directly in ChatGPT or Claude?
Yes. The Markdown output is clean text optimized for LLM consumption. It preserves document structure (headings, lists, code blocks) while being more token-efficient than raw HTML. Copy the markdown field directly into your prompt or use the API to feed it programmatically.
How is this different from just copying text from a webpage?
Copy-paste loses structure -- headings become plain text, lists lose formatting, code blocks merge with surrounding text. This actor preserves Markdown formatting, handles multiple pages automatically, extracts metadata, and outputs structured JSON ready for programmatic use.
Can I schedule this to run periodically?
Yes. Use Apify Schedules to run the actor daily, weekly, or at any custom interval. This is useful for keeping knowledge bases in sync with source documentation that changes frequently.
What happens if a page fails to load?
The crawler automatically retries failed pages up to 2 times. If a page still fails, it is skipped and a warning is logged. Other pages continue processing normally.
Can I crawl across multiple domains?
Yes. Set sameDomainOnly to false to follow links to any domain. Be sure to set a reasonable maxPages limit to prevent the crawl from expanding indefinitely.
Related actors
| Actor | How to combine |
|---|---|
| Website Contact Scraper | Extract emails, phones, and team members from websites instead of converting to Markdown |
| Google Maps Email Extractor | Find businesses on Google Maps and extract their contact information |
| Website Tech Stack Detector | Identify what technologies a website uses alongside its content |
| WHOIS Domain Lookup | Look up domain registration details for websites you're converting |
Support
Found a bug or have a feature request? Open an issue in the Issues tab on this actor's page. For custom scraping solutions or enterprise integrations, reach out through the Apify platform.