Pricing

Pay per usage

Website Content to Markdown

Convert any website to clean Markdown for RAG pipelines, LLM training, and AI apps. The Actor crawls pages, strips boilerplate, and preserves headings, tables, and code blocks, with GitHub Flavored Markdown (GFM) support.

Developer

ryan clinton

Last modified

4 days ago

What it does

Give it any URL and it:

Extracts the main content (strips nav, footer, sidebar, ads)
Converts to clean Markdown with proper heading hierarchy
Preserves code blocks, tables, lists, and links
Returns per-page metadata (title, description, word count, language)
Auto-discovers pages via sitemap.xml and link following

Key features

Main content extraction — intelligent stripping of navigation, footers, sidebars, cookie banners, and ads
Semantic detection — finds <main>, <article>, [role="main"] before falling back to body
GFM support — tables, strikethrough, and task lists converted properly
Sitemap auto-discovery — finds all pages on a domain via sitemap.xml
Depth-controlled crawling — BFS from starting page with configurable depth
Per-page output — each page is its own dataset item, ready for vector ingestion
Metadata — title, description, language, word count per page
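
The depth-controlled crawling feature can be illustrated with a minimal breadth-first sketch. This is not the Actor's actual implementation: the `bfs_crawl` function, the `get_links` callback, and the in-memory `SITE` graph are hypothetical stand-ins for real HTTP fetching, and only the two limits (`max_depth`, `max_pages`) mirror the documented input fields.

```python
from collections import deque


def bfs_crawl(start, get_links, max_depth=2, max_pages=10):
    """Visit pages breadth-first and return (url, depth) pairs in visit order."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:  # never enqueue the same URL twice
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Hypothetical link graph standing in for a real website.
SITE = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/api", "/"],
    "/blog": ["/blog/post-1"],
    "/docs/api": ["/docs/api/v2"],
}

pages = bfs_crawl("/", lambda url: SITE.get(url, []), max_depth=1)
print(pages)
```

With `max_depth=0` only the starting page is visited, which matches the documented meaning of `maxCrawlDepth` = 0.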

Example output

{
    "url": "https://docs.apify.com/academy/web-scraping-for-beginners",
    "title": "Web scraping for beginners",
    "description": "Learn the basics of web scraping and data extraction.",
    "markdown": "# Web scraping for beginners\n\nWeb scraping is the process of extracting data from websites...",
    "wordCount": 1250,
    "language": "en",
    "crawlDepth": 0,
    "crawledAt": "2026-02-07T12:00:00.000Z"
}
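
For downstream code it can help to load each dataset item into a typed record. The sketch below is one possible way to do that, based only on the fields in the example output above. The `PageItem` class and `parse_item` helper are hypothetical names, and treating the metadata fields as optional is an assumption about what happens when `includeMetadata` is turned off.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class PageItem:
    url: str
    markdown: str
    title: str
    description: str
    language: str
    word_count: int
    crawl_depth: int
    crawled_at: datetime


def parse_item(raw):
    """Convert one raw dataset item (a dict) into a PageItem."""
    # Swap the trailing "Z" for an explicit offset so that
    # datetime.fromisoformat also accepts it on Python versions before 3.11.
    crawled_at = datetime.fromisoformat(raw["crawledAt"].replace("Z", "+00:00"))
    return PageItem(
        url=raw["url"],
        markdown=raw["markdown"],
        # Assumption: metadata fields may be absent, so fall back to defaults.
        title=raw.get("title", ""),
        description=raw.get("description", ""),
        language=raw.get("language", ""),
        word_count=raw.get("wordCount", 0),
        crawl_depth=raw.get("crawlDepth", 0),
        crawled_at=crawled_at,
    )


sample = {
    "url": "https://docs.apify.com/academy/web-scraping-for-beginners",
    "title": "Web scraping for beginners",
    "description": "Learn the basics of web scraping and data extraction.",
    "markdown": "# Web scraping for beginners\n\nWeb scraping is ...",
    "wordCount": 1250,
    "language": "en",
    "crawlDepth": 0,
    "crawledAt": "2026-02-07T12:00:00.000Z",
}
item = parse_item(sample)
print(item.title, item.word_count, item.crawled_at.isoformat())
```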

Input

Field	Type	Default	Description
`urls`	string[]	required	Starting URLs to crawl and convert
`maxPagesPerDomain`	integer (1-100)	10	Maximum pages per domain
`maxCrawlDepth`	integer (0-5)	2	Link levels to follow (0 = starting page only)
`includeMetadata`	boolean	true	Include title, description, language
`onlyMainContent`	boolean	true	Strip nav/footer/sidebar/ads
`proxyConfiguration`	object	Apify Proxy	Proxy settings
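
The ranges and defaults in the table can be checked on the client side before a run is started. The following is a sketch under the assumption that the table is complete. `build_run_input` is a hypothetical helper, not part of the Apify client, and the Actor presumably performs its own validation as well.

```python
# Defaults and ranges copied from the input table above.
DEFAULTS = {
    "maxPagesPerDomain": 10,
    "maxCrawlDepth": 2,
    "includeMetadata": True,
    "onlyMainContent": True,
}
RANGES = {"maxPagesPerDomain": (1, 100), "maxCrawlDepth": (0, 5)}
ALLOWED = set(DEFAULTS) | {"proxyConfiguration"}


def build_run_input(urls, **overrides):
    """Merge overrides into the documented defaults and validate the ranges."""
    if not urls:
        raise ValueError("urls is required and must contain at least one URL")
    unknown = set(overrides) - ALLOWED
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")
    run_input = {"urls": list(urls), **DEFAULTS, **overrides}
    for field, (low, high) in RANGES.items():
        value = run_input[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{field} must be an integer, got {value!r}")
        if not low <= value <= high:
            raise ValueError(f"{field} must be between {low} and {high}, got {value}")
    return run_input


config = build_run_input(["https://example.com"], maxCrawlDepth=0)
print(config)
```

The returned dictionary can be passed as `run_input` to the Python client call shown under API usage.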

Use cases

RAG pipelines — Feed clean content into vector databases (Pinecone, Weaviate, Qdrant)
LLM fine-tuning — Build training datasets from web content
Knowledge bases — Convert documentation sites to searchable markdown
Content migration — Move website content between platforms
AI agents — Give agents access to structured web page content
Research — Extract readable content from multiple sources

API usage

Python

from apify_client import ApifyClient

# Authenticate with your Apify API token.
client = ApifyClient("YOUR_API_TOKEN")

# Start the Actor and wait for the run to finish.
run = client.actor("ryanclinton/website-content-to-markdown").call(
    run_input={
        "urls": ["https://docs.apify.com/academy/web-scraping-for-beginners"],
        "maxPagesPerDomain": 10,
        "maxCrawlDepth": 2,
    }
)
# Each dataset item is one converted page.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"{item['url']} — {item['wordCount']} words")
    print(item["markdown"][:200])

JavaScript

import { ApifyClient } from 'apify-client';

// Authenticate with your Apify API token.
const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Start the Actor and wait for the run to finish.
const run = await client.actor('ryanclinton/website-content-to-markdown').call({
    urls: ['https://docs.apify.com/academy/web-scraping-for-beginners'],
    maxPagesPerDomain: 10,
    maxCrawlDepth: 2,
});
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach(item => {
    console.log(`${item.url} — ${item.wordCount} words`);
});

Pipeline integration

Chain with LLM processing for AI workflows:

Website Content to Markdown — Extract clean content
LLM API — Summarize, classify, or extract entities
Vector database — Store embeddings for RAG retrieval

Or combine with the B2B lead generation pipeline:

Google Maps Lead Enricher — Find businesses
Website Content to Markdown — Extract their content
Website Tech Stack Detector — Analyze their tech
B2B Lead Qualifier — Score and qualify leads

Limitations

Uses CheerioCrawler (HTTP-only) — JavaScript-rendered SPAs may return minimal content
Rate-limited to 120 requests/minute per domain
Maximum 100 pages per domain
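
Because JavaScript-rendered pages may come back nearly empty, it can be useful to flag suspiciously short results after a run. The sketch below uses the `wordCount` field from the output. The `flag_thin_pages` helper and the 50-word threshold are illustrative assumptions, not values defined by the Actor.

```python
def flag_thin_pages(items, min_words=50):
    """Return URLs of items whose wordCount is below min_words."""
    return [item["url"] for item in items if item.get("wordCount", 0) < min_words]


# Hypothetical results: the second page looks like an empty SPA shell.
results = [
    {"url": "https://example.com/docs", "wordCount": 1250},
    {"url": "https://example.com/app", "wordCount": 7},
]
thin = flag_thin_pages(results)
print(thin)
```

Flagged URLs are candidates for re-crawling with a browser-based tool.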
