
Website Content to Markdown

Pricing

from $20.00 / 1,000 pages converted

Crawl any website and convert its HTML pages into clean, well-structured Markdown text. Purpose-built for AI workflows, RAG pipelines, LLM fine-tuning, and documentation archival. Give it URLs and get back Markdown documents with navigation, ads, and boilerplate stripped away.

Rating: 0.0 (0 reviews)

Developer

ryan clinton

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 10
  • Monthly active users: 4
  • Last modified: a day ago

Why use Website Content to Markdown?

Large language models, retrieval-augmented generation (RAG) pipelines, and documentation tools need clean text — not raw HTML full of navigation bars, cookie banners, and ad scripts. This actor does the tedious work of fetching, parsing, cleaning, and formatting web content into Markdown that LLMs can consume directly. It discovers pages via sitemap.xml and internal link following, extracts main content using semantic HTML selectors, strips 30+ categories of boilerplate, and outputs GitHub Flavored Markdown with headings, tables, code blocks, and lists faithfully preserved.

Features

  • Smart content extraction — identifies the main content area using semantic selectors (<main>, <article>, [role="main"], .content, .post-content). Falls back gracefully for non-standard markup
  • Aggressive boilerplate removal — strips 30+ categories of non-content elements including navigation, headers, footers, sidebars, ads, cookie banners, social widgets, comment sections, modals, breadcrumbs, and hidden elements
  • GitHub Flavored Markdown — full GFM support via Turndown with turndown-plugin-gfm. Tables, strikethrough, task lists, and fenced code blocks render correctly
  • Sitemap discovery — automatically fetches and parses sitemap.xml (including sitemap index files) to discover pages beyond internal links
  • Crawl depth control — configure how many link levels deep the crawler follows (0 = starting page only, up to 5 levels)
  • Per-domain page limits — cap how many pages to convert per domain to control costs and output size
  • Page metadata — optionally extract title, meta description, language, and word count per page
  • URL deduplication — tracks visited URLs per domain to avoid processing the same page twice
  • Clean image handling — preserves meaningful images with alt text while stripping data-URI images and tracking pixels
  • Proxy support — built-in Apify proxy configuration for sites that block direct requests
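The semantic-selector fallback described above can be sketched in a few lines. This is an illustrative Python stand-in, not the actor's code (the actor runs on Cheerio in Node.js); it uses only the standard library and grabs the text inside the first `<main>` or `<article>` element, ignoring everything outside it:

```python
from html.parser import HTMLParser

class MainContentFinder(HTMLParser):
    """Collects text inside the first <main> or <article> element.

    Simplified sketch: assumes well-formed HTML without unclosed void tags
    inside the content area.
    """
    TARGETS = ("main", "article")

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside the target element
        self.chunks = []
        self.done = False   # stop after the first target element closes

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth or tag in self.TARGETS:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth and not self.done:
            self.chunks.append(data)

def extract_main_text(html: str) -> str:
    parser = MainContentFinder()
    parser.feed(html)
    return "".join(parser.chunks).strip()

html = '<body><nav>Menu</nav><main><h1>Title</h1><p>Body text</p></main></body>'
print(extract_main_text(html))  # navigation text is excluded
```

A real implementation would also try the class-based selectors (`.content`, `.post-content`, etc.) and fall back to `<body>` when no match exceeds the minimum length, as the feature list notes.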

Use Cases

RAG pipeline ingestion

Convert documentation sites into Markdown chunks for vector database storage (Pinecone, Weaviate, Qdrant). The word count field helps estimate token usage for chunking strategies.
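As a sketch of how the wordCount field can drive a chunking strategy, here is a hypothetical word-based splitter. The 400-word chunk size, 50-word overlap, and 1.3 tokens-per-word ratio are illustrative assumptions (the ratio is the rough rule quoted later in this README), not part of the actor:

```python
def chunk_markdown(markdown: str, max_words: int = 400, overlap: int = 50):
    """Split Markdown into overlapping word-based chunks for embedding.

    Assumes max_words > overlap so the window always advances.
    """
    words = markdown.split()
    chunks, step = [], max_words - overlap
    for start in range(0, len(words), step):
        chunk_words = words[start:start + max_words]
        chunks.append({
            "text": " ".join(chunk_words),
            "word_count": len(chunk_words),
            "est_tokens": round(len(chunk_words) * 1.3),  # rough LLM token estimate
        })
        if start + max_words >= len(words):
            break  # last window already covered the tail
    return chunks

# A page with wordCount 1247 (as in the output example below) yields 4 chunks.
page_markdown = "word " * 1247
print([c["word_count"] for c in chunk_markdown(page_markdown)])  # → [400, 400, 400, 197]
```

Each chunk dict can be embedded and upserted into Pinecone, Weaviate, or Qdrant as-is.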

LLM training data

Prepare fine-tuning datasets by converting blog posts, knowledge bases, and technical documentation into clean text. The boilerplate removal ensures training data isn't polluted with navigation and ads.

Documentation archival

Snapshot entire documentation sites as Markdown files for offline access, version control, or migration between platforms.

Competitive content analysis

Convert competitor blog posts and landing pages to Markdown for content gap analysis, keyword extraction, and tone analysis using LLMs.

Knowledge base building

Feed web content into chatbot knowledge bases. The clean Markdown output integrates directly with LangChain and LlamaIndex document loaders.

Content migration

Extract content from legacy CMS platforms as Markdown for migration to static site generators (Hugo, Jekyll, Astro) or modern CMS tools.

How to Use

  1. Open the actor in the Apify Console and click Start
  2. Enter one or more website URLs (e.g., https://docs.example.com or just example.com)
  3. Adjust Max pages per domain (default: 10) and Max crawl depth (default: 2)
  4. Toggle Main content only to control boilerplate stripping (enabled by default)
  5. Click Start and wait for the run to complete
  6. Download results from the Dataset tab in JSON, CSV, or Excel format

Input Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| urls | string[] | Yes | – | Website URLs to crawl and convert. Bare domains are auto-prefixed with https:// |
| maxPagesPerDomain | integer | No | 10 | Maximum pages to convert per domain (1–100) |
| maxCrawlDepth | integer | No | 2 | How many link levels deep to follow (0–5). Set 0 for starting pages only |
| includeMetadata | boolean | No | true | Include page title, meta description, language, and word count |
| onlyMainContent | boolean | No | true | Extract only main content, stripping nav, footers, sidebars, ads |
| proxyConfiguration | object | No | Apify Proxy | Proxy settings for crawling |

Input Examples

Convert a documentation site:

{
  "urls": ["https://docs.apify.com/academy"],
  "maxPagesPerDomain": 50,
  "maxCrawlDepth": 3,
  "includeMetadata": true,
  "onlyMainContent": true
}

Single page only (no crawling):

{
  "urls": ["https://example.com/blog/my-article"],
  "maxPagesPerDomain": 1,
  "maxCrawlDepth": 0,
  "includeMetadata": true,
  "onlyMainContent": true
}

Multiple sites, raw content (no boilerplate stripping):

{
  "urls": ["example.com", "competitor.com"],
  "maxPagesPerDomain": 20,
  "maxCrawlDepth": 2,
  "includeMetadata": false,
  "onlyMainContent": false
}

Input Tips

  • Start with 3-5 pages to test output quality before scaling up.
  • Use crawl depth 0 when you only need specific pages — provide exact URLs.
  • Keep "Main content only" enabled for AI/LLM workflows — the boilerplate removal produces much cleaner text.
  • Provide section-specific URLs rather than homepages. Use https://example.com/docs instead of https://example.com to target relevant content directly.

Output Example

Each page is stored as a separate item in the dataset:

{
  "url": "https://docs.apify.com/academy/web-scraping-for-beginners",
  "title": "Web scraping for beginners | Apify Academy",
  "description": "Learn the basics of web scraping with this beginner-friendly guide.",
  "markdown": "# Web scraping for beginners\n\nWeb scraping is the process of extracting data from websites...\n\n## Why scrape the web?\n\n- **Market research** -- Track competitor pricing\n- **Lead generation** -- Build prospect lists\n\n## Getting started\n\nTo get started with web scraping...",
  "wordCount": 1247,
  "language": "en",
  "crawlDepth": 0,
  "crawledAt": "2025-01-15T14:30:00.000Z"
}

Output Fields

| Field | Type | Description |
|---|---|---|
| url | string | Full URL of the converted page |
| title | string | Page title from OpenGraph, `<title>` tag, or first `<h1>` (empty if metadata disabled) |
| description | string | Meta description from OpenGraph or `<meta name="description">` |
| markdown | string | Full converted Markdown content of the page |
| wordCount | number | Word count of the Markdown output, useful for estimating LLM token usage (~1.3 tokens per word) |
| language | string/null | Language code from the `<html lang>` attribute (e.g., "en", "fr") |
| crawlDepth | number | How many links deep this page was from the starting URL (0 = starting page) |
| crawledAt | string | ISO 8601 timestamp of when the page was crawled |

Programmatic Access (API)

Python

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")
run = client.actor("ryanclinton/website-content-to-markdown").call(run_input={
    "urls": ["https://docs.apify.com/academy"],
    "maxPagesPerDomain": 20,
    "maxCrawlDepth": 2,
    "onlyMainContent": True,
})
for page in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"{page['url']}: {page['wordCount']} words")
    # Feed into RAG pipeline, save to file, etc.
    filename = page["title"].replace("/", "-") or "untitled"  # titles may contain "/"
    with open(f"{filename}.md", "w") as f:
        f.write(page["markdown"])

JavaScript

import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: "YOUR_API_TOKEN" });
const run = await client.actor("ryanclinton/website-content-to-markdown").call({
    urls: ["https://docs.apify.com/academy"],
    maxPagesPerDomain: 20,
    maxCrawlDepth: 2,
    onlyMainContent: true,
});
const { items } = await client.dataset(run.defaultDatasetId).listItems();
for (const page of items) {
    console.log(`${page.url}: ${page.wordCount} words`);
    // Feed into LangChain, save to vector DB, etc.
}

cURL

# Start the actor run (runs are asynchronous)
curl -X POST "https://api.apify.com/v2/acts/ryanclinton~website-content-to-markdown/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://docs.apify.com/academy"],
    "maxPagesPerDomain": 20,
    "maxCrawlDepth": 2,
    "onlyMainContent": true
  }'

# Fetch results once the run has finished (replace DATASET_ID from the run response)
curl "https://api.apify.com/v2/datasets/DATASET_ID/items?token=YOUR_API_TOKEN&format=json"

How It Works

The actor runs a three-phase pipeline:

Phase 1: URL Discovery

For each input URL, the actor:

  1. Normalizes the URL (adds https:// if missing, validates format)
  2. Deduplicates by domain — only the first URL per domain is used as a starting point
  3. Fetches sitemap.xml (and sitemap index files) to discover additional pages beyond what internal links expose
  4. Builds a request queue combining the starting URL with sitemap-discovered URLs
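Steps 1–2 can be sketched in a few lines of Python. This is an illustrative stand-in for the discovery phase, not the actor's actual (Node.js) code:

```python
from urllib.parse import urlparse

def normalize_start_urls(urls):
    """Prefix bare domains with https:// and keep only the first URL per domain."""
    seen_domains, starts = set(), []
    for raw in urls:
        # Step 1: normalize (add the scheme if missing)
        url = raw if raw.startswith(("http://", "https://")) else f"https://{raw}"
        # Step 2: deduplicate by domain
        domain = urlparse(url).netloc.lower()
        if domain and domain not in seen_domains:
            seen_domains.add(domain)
            starts.append(url)
    return starts

print(normalize_start_urls(["example.com", "https://example.com/docs", "other.com"]))
# → ['https://example.com', 'https://other.com']
```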

Phase 2: Page Crawling

A CheerioCrawler processes pages with 10 concurrent workers at up to 120 requests/minute:

  1. Skips non-HTML responses (XML sitemaps, PDFs, feeds)
  2. Enforces per-domain page limits and URL deduplication (trailing slash normalization)
  3. Follows internal links using breadth-first search (BFS) up to maxCrawlDepth levels
  4. Filters out links to binary files (images, videos, fonts, archives)
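The dedup-and-depth bookkeeping in steps 2–3 can be sketched as follows. Here `fetch_links` is a hypothetical stand-in for the crawler's link extraction; the real CheerioCrawler logic differs in detail:

```python
from collections import deque

def crawl_plan(start_url, fetch_links, max_pages=10, max_depth=2):
    """BFS over internal links with trailing-slash dedup and a per-domain page cap."""
    def norm(url):
        # Treat /docs and /docs/ as the same page
        return url.rstrip("/") or url

    visited = {norm(start_url)}
    order = []
    queue = deque([(start_url, 0)])
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # don't follow links beyond maxCrawlDepth
        for link in fetch_links(url):
            if norm(link) not in visited:
                visited.add(norm(link))
                queue.append((link, depth + 1))
    return order

# Toy link graph: /x/ and /x collapse to one page after normalization.
links = {"https://a.com": ["https://a.com/x/", "https://a.com/x"], "https://a.com/x/": []}
print(crawl_plan("https://a.com", lambda u: links.get(u, [])))
# → ['https://a.com', 'https://a.com/x/']
```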

Phase 3: Content Extraction & Conversion

For each page, the actor:

  1. Content extraction — if "main content only" is enabled, tries semantic selectors in order: <main>, <article>, [role="main"], #content, .content, .post-content, .entry-content, .article-body, .page-content, .main-content. Uses the first match with 200+ characters. Falls back to <body> with non-content removal.
  2. Boilerplate stripping — removes 30+ categories of non-content elements: navigation, headers, footers, sidebars, ads, cookie banners, social widgets, comment sections, modals, breadcrumbs, hidden elements, scripts, styles, and iframes.
  3. HTML-to-Markdown conversion — the Turndown library converts clean HTML to GitHub Flavored Markdown with ATX headings, fenced code blocks, and inline links. Custom rules strip data-URI images, images without alt text, and empty anchor tags.
  4. Cleanup — collapses triple+ newlines to double, trims trailing whitespace per line, removes blank-only lines.
  5. Metadata extraction — pulls title (OpenGraph > <title> > <h1>), description (OpenGraph > meta description), and language (<html lang>). Counts words for token estimation.
  6. Quality filter — pages producing less than 50 characters of Markdown are skipped as near-empty.
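The cleanup in step 4 is straightforward to sketch; this illustrative Python version (not the actor's code) applies the same two rules:

```python
import re

def clean_markdown(md: str) -> str:
    """Trim trailing whitespace per line, then collapse 3+ newlines to 2."""
    md = "\n".join(line.rstrip() for line in md.split("\n"))
    md = re.sub(r"\n{3,}", "\n\n", md)
    return md.strip()

raw = "# Title   \n\n\n\nBody text.  \n\n"
print(clean_markdown(raw))  # → "# Title\n\nBody text." (blank runs collapsed)
```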

How Much Does It Cost?

The actor runs on minimal resources (256 MB memory) and is very cost-efficient:

| Scenario | Pages | Estimated cost | Run time |
|---|---|---|---|
| Single page | 1 | < $0.01 | ~5 seconds |
| Small site | 10 | ~$0.01 | ~20 seconds |
| Medium site | 50 | ~$0.03 | ~2 minutes |
| Large site | 100 | ~$0.05 | ~5 minutes |

Proxy usage is the primary cost driver. The actor processes pages concurrently (up to 10 at a time) for fast throughput.

Tips

  • Check word counts to estimate token usage. Rough rule: 1 word ≈ 1.3 tokens for most LLMs.
  • Disable metadata if you only need raw Markdown text and want smaller output.
  • Use depth 0 for targeted extraction. Provide exact URLs and skip link following entirely.
  • Keep "Main content only" enabled for AI workflows — it dramatically improves text quality.
  • Provide section-specific URLs (e.g., /docs, /blog) rather than homepages to target relevant content.
  • Schedule periodic runs to keep knowledge base snapshots up to date.

Combine with Other Actors

| Actor | How to combine |
|---|---|
| AI Training Data Curator | Convert websites to Markdown, then use the curator to clean, deduplicate, and format for LLM training |
| Website Change Monitor | Detect when pages change, then re-convert to Markdown for updated knowledge bases |
| Website Contact Scraper | Extract contacts from the same sites you're converting to Markdown |
| Company Deep Research | Feed company website Markdown into deep research workflows |
| Wikipedia Article Search | Combine Wikipedia content with website content for comprehensive knowledge bases |

Limitations

  • No JavaScript rendering — uses CheerioCrawler (server-side HTML parsing), not a headless browser. Single-page applications (SPAs) that load content via JavaScript are not supported.
  • No authenticated content — only publicly accessible pages are processed. Pages behind login walls or paywalls yield the gate page itself, not the protected content.
  • English-optimized selectors — content extraction selectors use English class names (.content, .post-content). Sites with non-English class names may need "main content only" disabled for best results.
  • Same-domain only — the crawler never follows links to external domains.
  • No PDF/image content — only converts HTML pages. PDFs, images, and other binary files are skipped.
  • Sitemap dependent — page discovery depends on sitemap.xml and internal links. Orphaned pages not linked from anywhere won't be discovered.

Responsible Use

  • This actor only accesses publicly visible web pages.
  • Respect robots.txt and website terms of service regarding automated access.
  • Do not use converted content in ways that violate the original website's copyright or licensing terms.
  • For guidance on web scraping legality, see Apify's guide.

FAQ

What types of content does this actor handle best? Text-heavy pages: documentation, blog posts, articles, knowledge bases, product descriptions, and informational pages. Not designed for SPAs requiring JavaScript rendering.

Can I use the output directly with ChatGPT, Claude, or other LLMs? Yes. The Markdown output is specifically designed for LLM consumption. Feed the markdown field directly into prompts, build vector databases for RAG, or store as training data. The word count helps estimate context window fit.

Does the actor follow links to other domains? No. It only follows internal links within the same domain as each starting URL.

What happens if a page fails to load? The actor retries failed requests up to 2 times. If still failing, it's skipped with a warning. Failed pages don't count toward the per-domain page limit.

How does "main content only" work? It tries semantic HTML selectors in order (<main>, <article>, [role="main"], etc.) to find the content area. If found, it strips non-content elements within that area. If not found, it falls back to the full <body> with 30+ categories of boilerplate removed.

Integrations

  • LangChain / LlamaIndex — use the Apify document loader to feed Markdown directly into your RAG pipeline
  • Zapier — send converted Markdown to Google Docs, Notion, Slack, or any Zapier-supported destination
  • Make — chain with other steps in automated workflows
  • Google Sheets — export to spreadsheets for team review
  • Apify API — trigger runs programmatically and retrieve results via REST API
  • Webhooks — get notified when conversion completes
  • GitHub Actions — schedule periodic runs to keep documentation snapshots up to date
  • Vector databases (Pinecone, Weaviate, Qdrant, Chroma) — push Markdown output for semantic search