Website To Markdown

Pricing: Pay per usage
Rating: 0.0 (0 reviews)
Developer: ryan clinton (Maintained by Community)
Actor stats: 0 bookmarked · 2 total users · 1 monthly active user · last modified a day ago

Convert any website into clean, structured Markdown text ready for RAG pipelines, LLM context windows, and knowledge bases. Provide one or more URLs and get back well-formatted Markdown with metadata, links, and word counts.

Website to Markdown crawls web pages, strips out navigation, ads, scripts, and other non-content elements, then converts the remaining HTML into clean Markdown. It follows links to configurable depth, respects page limits, and outputs one structured record per page with the Markdown text plus optional metadata (title, description, H1, discovered links).

What data can you extract?

| Data point | Source | Example |
|---|---|---|
| Markdown content | HTML body converted to Markdown | `# Getting Started\n\nThis guide covers...` |
| Page title | `<title>` tag and meta tags | Getting Started - Acme Docs |
| Meta description | `<meta name="description">` | Learn how to set up Acme in 5 minutes |
| H1 heading | First `<h1>` on the page | Getting Started |
| Word count | Computed from Markdown output | 1,247 |
| Discovered links | All `<a>` href links on the page | `["https://docs.acme.com/install", ...]` |
| Page URL | Canonical URL after redirects | https://docs.acme.com/getting-started |

Why use Website to Markdown?

LLMs and RAG systems need clean text, not raw HTML full of navigation bars, cookie banners, and script tags. Manually copying and cleaning website content doesn't scale past a handful of pages. This actor automates the entire conversion pipeline: crawl pages, strip non-content elements, convert to Markdown, and return structured data with metadata.

Feed the output directly into vector databases, LLM context windows, fine-tuning datasets, or knowledge base systems without any post-processing.

Built on the Apify platform, Website to Markdown gives you capabilities you won't get from a simple script:

  • Scheduling -- run daily to keep your knowledge base in sync with source websites
  • API access -- trigger conversions programmatically from Python, JavaScript, or any HTTP client
  • Proxy rotation -- crawl at scale without IP blocks using Apify's built-in proxy infrastructure
  • Monitoring -- get notified when runs fail or produce unexpected results
  • Integrations -- connect directly to Zapier, Make, Google Sheets, or webhooks

Features

  • HTML to Markdown conversion that preserves headings, lists, tables, code blocks, bold/italic text, links, and images while stripping scripts, styles, navigation, footers, ads, and cookie banners
  • Configurable crawl depth (0-5 levels) to convert a single page or crawl an entire documentation site
  • Page limit control (1-1,000 pages) to manage costs and scope
  • Same-domain filtering to stay within the target website and avoid crawling external links
  • Custom CSS selector exclusion to remove specific page elements (sidebars, headers, footers) before conversion
  • Metadata extraction including page title, meta description, H1 heading, and word count
  • Link discovery that captures all links found on each page for further processing or site mapping
  • Automatic non-HTML filtering that skips PDFs, images, CSS, JavaScript, fonts, and other binary files
  • Pay-per-event pricing with per-page charging so you only pay for pages actually converted
  • Concurrent crawling at up to 10 simultaneous connections for fast processing of large sites

Use cases for converting websites to Markdown

RAG pipeline ingestion

Convert documentation sites, knowledge bases, or help centers into Markdown chunks ready for embedding into vector databases like Pinecone, Weaviate, or Chroma.
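Before embedding, the `markdown` field usually needs to be split into chunks. A minimal sketch of heading-based chunking (the function name, the 300-word budget, and the merging behavior are illustrative choices, not part of the actor):

```python
import re

def chunk_markdown(markdown: str, max_words: int = 300) -> list[str]:
    """Split Markdown into chunks at heading boundaries, merging short sections."""
    # Split immediately before any ATX heading (#, ##, ...)
    sections = re.split(r"\n(?=#{1,6} )", markdown)
    chunks, current = [], ""
    for section in sections:
        candidate = (current + "\n" + section).strip() if current else section.strip()
        if current and len(candidate.split()) > max_words:
            chunks.append(current)       # flush the chunk before it grows too large
            current = section.strip()
        else:
            current = candidate          # merge short sections together
    if current:
        chunks.append(current)
    return chunks

doc = "# Intro\n\nShort intro.\n\n## Setup\n\n" + ("word " * 400)
chunks = chunk_markdown(doc)             # two chunks: the intro, then the long Setup section
```

Each chunk then goes to your embedding model and vector store of choice.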

LLM context preparation

Extract clean text from web pages to use as context in ChatGPT, Claude, or other LLM prompts. Markdown is more token-efficient than HTML and preserves document structure.

Knowledge base building

Crawl competitor documentation, industry resources, or internal wikis and convert them into a structured Markdown corpus for search and retrieval systems.

Content migration

Moving content between CMS platforms? Convert existing web pages to Markdown for import into Hugo, Jekyll, Gatsby, Docusaurus, or any Markdown-based system.
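For static-site generators, each dataset item can be wrapped in YAML front matter. A sketch (the front-matter keys `sourceUrl` and `date` are illustrative; adapt them to your generator's conventions):

```python
def to_hugo_page(item: dict) -> str:
    """Render one dataset item as a Markdown file with YAML front matter.
    Field names (url, title, markdown, crawledAt) match the actor's output."""
    title = (item.get("title") or "Untitled").replace('"', "'")
    front_matter = "\n".join([
        "---",
        f'title: "{title}"',
        f'sourceUrl: "{item["url"]}"',
        f'date: "{item.get("crawledAt", "")}"',
        "---",
    ])
    return front_matter + "\n\n" + item["markdown"]

item = {
    "url": "https://docs.example.com/getting-started",
    "title": "Getting Started - Example Docs",
    "markdown": "# Getting Started\n\nWelcome.",
    "crawledAt": "2026-03-18T14:32:18.456Z",
}
page = to_hugo_page(item)  # write this string to content/getting-started.md
```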

Fine-tuning dataset creation

Build training datasets from web content by converting pages to clean Markdown. Filter by word count and metadata to select high-quality content.

Web research and archiving

Archive web content in a clean, readable format. Markdown is future-proof, version-controllable, and human-readable without any rendering engine.

How to convert websites to Markdown

  1. Provide URLs -- Enter one or more website URLs to convert. Each URL is a starting point for the crawl (e.g., https://docs.example.com).
  2. Set crawl depth -- Choose how deep to follow links. Depth 0 converts only the input URLs. Depth 1 also converts pages linked from those URLs. Depth 2 goes one level further, and so on (max 5).
  3. Set page limit -- Choose the maximum number of pages to convert (default 10, max 1,000). This applies globally across all input URLs.
  4. Run the actor -- Click "Start" to begin. The actor crawls each URL, strips non-content HTML, converts to Markdown, and outputs structured results.
  5. Download results -- Once finished, download your data as JSON, CSV, or Excel from the Dataset tab. Each row contains one page's Markdown content with metadata.

Input parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| urls | string[] | Yes | -- | Starting URLs to crawl and convert to Markdown. |
| maxPages | integer | No | 10 | Maximum total pages to crawl across all URLs (1-1,000). |
| maxDepth | integer | No | 1 | How deep to follow links. 0 = only input URLs. 1 = input URLs + pages they link to. Max 5. |
| sameDomainOnly | boolean | No | true | Only follow links within the same domain as the starting URL. Disable to crawl across domains. |
| includeMetadata | boolean | No | true | Include page title, description, H1, and discovered links in the output. |
| excludeSelectors | string[] | No | [] | Additional CSS selectors to remove before conversion (e.g. nav, footer, .sidebar). Common non-content elements are removed by default. |
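If you build inputs programmatically, the documented defaults and ranges can be enforced client-side before calling the actor. A sketch (the actor validates input itself; this just catches mistakes earlier):

```python
def validate_input(raw: dict) -> dict:
    """Apply the documented defaults and clamp values to the documented ranges."""
    urls = raw.get("urls")
    if not urls:
        raise ValueError("urls is required and must be a non-empty list")
    return {
        "urls": list(urls),
        "maxPages": min(max(int(raw.get("maxPages", 10)), 1), 1000),  # 1-1,000, default 10
        "maxDepth": min(max(int(raw.get("maxDepth", 1)), 0), 5),      # 0-5, default 1
        "sameDomainOnly": bool(raw.get("sameDomainOnly", True)),
        "includeMetadata": bool(raw.get("includeMetadata", True)),
        "excludeSelectors": list(raw.get("excludeSelectors", [])),
    }

cfg = validate_input({"urls": ["https://docs.example.com"], "maxDepth": 9})  # depth clamped to 5
```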

Input examples

Convert a single page:

{
  "urls": ["https://docs.example.com/getting-started"]
}

Crawl an entire documentation site:

{
  "urls": ["https://docs.example.com"],
  "maxPages": 200,
  "maxDepth": 3,
  "sameDomainOnly": true
}

Convert multiple pages with no link following:

{
  "urls": [
    "https://example.com/blog/post-1",
    "https://example.com/blog/post-2",
    "https://example.com/blog/post-3"
  ],
  "maxDepth": 0
}

Remove sidebar and banner before conversion:

{
  "urls": ["https://example.com"],
  "maxPages": 50,
  "maxDepth": 2,
  "excludeSelectors": [".sidebar", ".banner", "#cookie-notice"]
}

Input tips

  • Use depth 0 for specific pages -- if you have exact URLs, set maxDepth to 0 to skip link following and convert only the pages you specified.
  • Use depth 1-2 for documentation sites -- this captures the main content and one or two levels of subpages.
  • Exclude noisy elements -- if the default stripping leaves unwanted content (sidebars, promotional banners), add their CSS selectors to excludeSelectors.
  • Start with a small page limit to preview the output quality before crawling hundreds of pages.

Output example

Each item in the output dataset represents one converted page:

{
  "url": "https://docs.example.com/getting-started",
  "title": "Getting Started - Example Docs",
  "description": "Learn how to set up Example in 5 minutes.",
  "h1": "Getting Started",
  "markdown": "# Getting Started\n\nWelcome to Example. This guide walks you through setting up your first project.\n\n## Prerequisites\n\n- Node.js 18 or later\n- An Example account ([sign up here](https://example.com/signup))\n\n## Installation\n\n```bash\nnpm install @example/sdk\n```\n\n## Quick start\n\n1. Import the SDK\n2. Initialize with your API key\n3. Make your first request\n\n```javascript\nimport { Example } from '@example/sdk';\n\nconst client = new Example({ apiKey: 'YOUR_KEY' });\nconst result = await client.query('hello world');\nconsole.log(result);\n```",
  "wordCount": 87,
  "links": [
    "https://docs.example.com/installation",
    "https://docs.example.com/authentication",
    "https://docs.example.com/api-reference"
  ],
  "crawledAt": "2026-03-18T14:32:18.456Z"
}

Output fields

| Field | Type | Description |
|---|---|---|
| url | string | Canonical URL of the page (after redirects, fragment stripped) |
| title | string | Page title from `<title>` tag (only when includeMetadata is true) |
| description | string | Meta description from `<meta name="description">` (only when includeMetadata is true) |
| h1 | string | Text of the first `<h1>` heading on the page (only when includeMetadata is true) |
| markdown | string | Clean Markdown conversion of the page content with non-content elements stripped |
| wordCount | number | Number of words in the Markdown output |
| links | string[] | All links discovered on the page (only when includeMetadata is true) |
| crawledAt | string | ISO 8601 timestamp of when the page was crawled |

How much does it cost to convert websites to Markdown?

Website to Markdown uses pay-per-event pricing -- you pay $0.05 per page converted. Platform usage costs are included in the price.

| Scenario | Pages | Cost per page | Total cost |
|---|---|---|---|
| Quick test | 1 | $0.05 | $0.05 |
| Blog post batch | 10 | $0.05 | $0.50 |
| Documentation site | 100 | $0.05 | $5.00 |
| Large knowledge base | 500 | $0.05 | $25.00 |
| Full site archive | 1,000 | $0.05 | $50.00 |

You can set a maximum spending limit per run to control costs. The actor charges per page converted, so you only pay for pages that produce output.
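The arithmetic is straightforward to encode when sizing maxPages against a budget (a sketch; the integer-cent division avoids floating-point surprises):

```python
PRICE_PER_PAGE = 0.05  # USD per converted page, per the pricing table above

def estimate_cost(pages: int) -> float:
    """Total cost in USD for a given page count."""
    return round(pages * PRICE_PER_PAGE, 2)

def pages_within_budget(budget_usd: float) -> int:
    """How many pages a budget covers; computed in whole cents to stay exact."""
    return int(round(budget_usd * 100)) // 5

cost = estimate_cost(100)        # $5.00 for a documentation site
budget_pages = pages_within_budget(2.0)  # 40 pages for $2
```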

Convert websites to Markdown using the API

Python

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("ryanclinton/website-to-markdown").call(run_input={
    "urls": ["https://docs.example.com"],
    "maxPages": 50,
    "maxDepth": 2,
    "sameDomainOnly": True,
})

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"{item['url']} ({item['wordCount']} words)")
    # Save markdown to file, named after the last URL segment
    filename = item["url"].split("/")[-1] or "index"
    with open(f"{filename}.md", "w") as f:
        f.write(item["markdown"])

JavaScript

import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: "YOUR_API_TOKEN" });

const run = await client.actor("ryanclinton/website-to-markdown").call({
  urls: ["https://docs.example.com"],
  maxPages: 50,
  maxDepth: 2,
  sameDomainOnly: true,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
for (const item of items) {
  console.log(`${item.url} (${item.wordCount} words)`);
  console.log(item.markdown.substring(0, 200) + "...\n");
}

cURL

# Start the actor run
curl -X POST "https://api.apify.com/v2/acts/ryanclinton~website-to-markdown/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://docs.example.com"],
    "maxPages": 50,
    "maxDepth": 2,
    "sameDomainOnly": true
  }'

# Fetch results (replace DATASET_ID with the ID from the run response)
curl "https://api.apify.com/v2/datasets/DATASET_ID/items?token=YOUR_API_TOKEN&format=json"

How Website to Markdown works

The actor runs a pipeline with two main phases:

Phase 1: Crawl and discover

The actor normalizes each input URL (adding https:// when no scheme is given), validates it, and creates initial crawl requests at depth 0. CheerioCrawler processes pages with up to 10 concurrent connections, a 120 requests/minute rate limit, a 30-second timeout per page, and 2 automatic retries. For each page, the actor checks the global page counter against maxPages and deduplicates URLs by stripping fragments.
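The normalization and dedup key described above amounts to scheme-filling plus fragment-stripping. A sketch of the same idea (the function name is illustrative, not the actor's internal code):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Add https:// when no scheme is given and drop the #fragment."""
    if "://" not in url:
        url = "https://" + url
    scheme, netloc, path, query, _fragment = urlsplit(url)
    return urlunsplit((scheme, netloc, path, query, ""))  # empty fragment

# Dedup: the first two URLs collapse to the same key
seen = set()
for raw in ["docs.example.com/a#intro",
            "https://docs.example.com/a",
            "https://docs.example.com/b"]:
    seen.add(normalize_url(raw))
```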

Phase 2: Strip and convert

For each crawled page, the actor:

  1. Extracts links -- Discovers all <a> href links on the page for both crawl enqueuing and output. This happens before stripping, so navigation links are still available for crawling.
  2. Strips non-content elements -- Removes <script>, <style>, <nav>, <footer>, <header>, ads, cookie banners, and any elements matching excludeSelectors.
  3. Extracts metadata -- Pulls the page title from <title>, description from <meta name="description">, and H1 text from the first <h1> element.
  4. Converts to Markdown -- Transforms the cleaned HTML into Markdown, preserving headings, lists, tables, code blocks, links, images, bold, and italic formatting.
  5. Counts words -- Computes the word count from the Markdown output. Pages with fewer than 5 words are skipped as empty.
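The stripping, conversion, and word-count steps can be illustrated with a toy stdlib-only converter. This is a deliberately minimal sketch, not the actor's actual converter (which handles many more elements):

```python
from html.parser import HTMLParser

SKIP = {"script", "style", "nav", "footer", "header"}  # non-content elements

class MiniMarkdown(HTMLParser):
    """Toy converter: skip non-content tags, emit headings/paragraphs as Markdown."""
    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside a skipped element
        self.parts = []
        self.prefix = ""
    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.depth += 1
        elif tag in ("h1", "h2", "h3"):
            self.prefix = "#" * int(tag[1]) + " "  # h2 -> "## "
    def handle_endtag(self, tag):
        if tag in SKIP:
            self.depth -= 1
    def handle_data(self, data):
        text = data.strip()
        if text and self.depth == 0:
            self.parts.append(self.prefix + text)
            self.prefix = ""

def to_markdown(html: str) -> str:
    parser = MiniMarkdown()
    parser.feed(html)
    return "\n\n".join(parser.parts)

html = ("<nav>Home | Docs</nav><h1>Getting Started</h1>"
        "<p>Welcome to Example.</p><script>x()</script>")
md = to_markdown(html)          # nav and script content are gone
word_count = len(md.split())    # pages under 5 words would be skipped as empty
```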

If maxDepth allows further crawling, discovered links are enqueued with incremented depth. When sameDomainOnly is enabled, links to other domains are filtered out. Non-HTML URLs (PDFs, images, CSS, JS, fonts, archives) are automatically skipped.

Limitations

  • HTTP-only scraping -- uses CheerioCrawler (server-side HTML parsing), not a browser. JavaScript-rendered content (React, Angular, Vue SPAs) will not be captured. Only the server-rendered HTML is converted.
  • Maximum 1,000 pages per run -- this is a hard cap to prevent runaway crawls. For larger sites, run multiple batches with different starting URLs.
  • Maximum crawl depth of 5 -- deep link chains beyond 5 levels are not followed.
  • No authentication -- only processes publicly accessible pages. Login-gated content is not supported.
  • Same-domain default -- by default, only follows links within the same domain. Cross-domain crawling can be enabled but requires careful page limits to avoid unbounded crawling.
  • Markdown fidelity varies -- complex HTML layouts (multi-column, heavily styled content, interactive elements) may not convert perfectly to Markdown. The output prioritizes clean text extraction over pixel-perfect layout preservation.
  • No image downloading -- images are represented as Markdown image links (![alt](url)) but the actual image files are not downloaded or stored.

Integrations

  • Zapier -- Trigger a Zap when new Markdown content is extracted. Push to Notion, Google Docs, or any connected app.
  • Make -- Build automated workflows that feed converted Markdown into knowledge bases, CMS platforms, or AI pipelines.
  • Google Sheets -- Export page URLs, titles, and word counts to Google Sheets for content audits.
  • Apify API -- Call the actor programmatically. Start runs, poll for completion, and download results in JSON, CSV, XML, or Excel format.
  • Webhooks -- Receive notifications when a run completes, then automatically process results in your backend.
  • LangChain / LlamaIndex -- Feed Markdown output directly into LLM agent frameworks for RAG, summarization, or question-answering workflows.
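For the LangChain/LlamaIndex path, dataset items map onto the page_content/metadata pair that LangChain's Document expects. A sketch using plain dicts (pass these fields to `Document(page_content=..., metadata=...)` in real code):

```python
def to_documents(items):
    """Shape dataset items as {page_content, metadata} records for RAG loaders."""
    docs = []
    for item in items:
        docs.append({
            "page_content": item["markdown"],
            "metadata": {
                "source": item["url"],          # conventional provenance key
                "title": item.get("title"),
                "word_count": item.get("wordCount"),
            },
        })
    return docs

items = [{"url": "https://docs.example.com/a",
          "markdown": "# A\n\nBody.", "title": "A", "wordCount": 3}]
docs = to_documents(items)
```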

Responsible use

  • This actor only accesses publicly visible web pages.
  • Respect website terms of service and robots.txt directives.
  • Do not use this actor to scrape copyrighted content for redistribution without permission.
  • Be mindful of crawl rates when targeting small websites -- the default rate limits are designed to be respectful.
  • For guidance on web scraping legality, see Apify's guide.

FAQ

How many pages can I convert in one run? Up to 1,000 pages per run. The default is 10. Set maxPages to control the total. For sites with more than 1,000 pages, run multiple batches starting from different sections of the site.

Does this work with JavaScript-heavy websites (React, Vue, Angular)? No. The actor uses CheerioCrawler which parses server-rendered HTML only. If a website loads its main content via JavaScript, the Markdown output will be empty or incomplete. For JS-rendered sites, consider using a browser-based scraper first and then converting the rendered HTML.

What HTML elements are stripped before conversion? By default, the actor removes <script>, <style>, <nav>, <footer>, <header>, <iframe>, cookie consent banners, ad containers, and other non-content elements. You can add additional elements to remove using the excludeSelectors parameter.

Can I use the Markdown output directly in ChatGPT or Claude? Yes. The Markdown output is clean text optimized for LLM consumption. It preserves document structure (headings, lists, code blocks) while being more token-efficient than raw HTML. Copy the markdown field directly into your prompt or use the API to feed it programmatically.

How is this different from just copying text from a webpage? Copy-paste loses structure -- headings become plain text, lists lose formatting, code blocks merge with surrounding text. This actor preserves Markdown formatting, handles multiple pages automatically, extracts metadata, and outputs structured JSON ready for programmatic use.

Can I schedule this to run periodically? Yes. Use Apify Schedules to run the actor daily, weekly, or at any custom interval. This is useful for keeping knowledge bases in sync with source documentation that changes frequently.

What happens if a page fails to load? The crawler automatically retries failed pages up to 2 times. If a page still fails, it is skipped and a warning is logged. Other pages continue processing normally.

Can I crawl across multiple domains? Yes. Set sameDomainOnly to false to follow links to any domain. Be sure to set a reasonable maxPages limit to prevent the crawl from expanding indefinitely.

Related actors

| Actor | How to combine |
|---|---|
| Website Contact Scraper | Extract emails, phones, and team members from websites instead of converting to Markdown |
| Google Maps Email Extractor | Find businesses on Google Maps and extract their contact information |
| Website Tech Stack Detector | Identify what technologies a website uses alongside its content |
| WHOIS Domain Lookup | Look up domain registration details for websites you're converting |

Support

Found a bug or have a feature request? Open an issue in the Issues tab on this actor's page. For custom scraping solutions or enterprise integrations, reach out through the Apify platform.