Website Content to Markdown
Pricing
from $20.00 / 1,000 pages converted
Crawl any website and convert its HTML pages into clean, well-structured Markdown text. Purpose-built for AI workflows, RAG pipelines, LLM fine-tuning, and documentation archival. Give it URLs and get back Markdown documents with navigation, ads, and boilerplate stripped away.
Developer
ryan clinton
Why use Website Content to Markdown?
Large language models, retrieval-augmented generation (RAG) pipelines, and documentation tools need clean text — not raw HTML full of navigation bars, cookie banners, and ad scripts. This actor does the tedious work of fetching, parsing, cleaning, and formatting web content into Markdown that LLMs can consume directly. It discovers pages via sitemap.xml and internal link following, extracts main content using semantic HTML selectors, strips 30+ categories of boilerplate, and outputs GitHub Flavored Markdown with headings, tables, code blocks, and lists faithfully preserved.
Features
- Smart content extraction — identifies the main content area using semantic selectors (`<main>`, `<article>`, `[role="main"]`, `.content`, `.post-content`) and falls back gracefully for non-standard markup
- Aggressive boilerplate removal — strips 30+ categories of non-content elements, including navigation, headers, footers, sidebars, ads, cookie banners, social widgets, comment sections, modals, breadcrumbs, and hidden elements
- GitHub Flavored Markdown — full GFM support via Turndown with turndown-plugin-gfm. Tables, strikethrough, task lists, and fenced code blocks render correctly
- Sitemap discovery — automatically fetches and parses sitemap.xml (including sitemap index files) to discover pages beyond internal links
- Crawl depth control — configure how many link levels deep the crawler follows (0 = starting page only, up to 5 levels)
- Per-domain page limits — cap how many pages to convert per domain to control costs and output size
- Page metadata — optionally extract title, meta description, language, and word count per page
- URL deduplication — tracks visited URLs per domain to avoid processing the same page twice
- Clean image handling — preserves meaningful images with alt text while stripping data-URI images and tracking pixels
- Proxy support — built-in Apify proxy configuration for sites that block direct requests
Use Cases
RAG pipeline ingestion
Convert documentation sites into Markdown chunks for vector database storage (Pinecone, Weaviate, Qdrant). The word count field helps estimate token usage for chunking strategies.
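The chunking step can be sketched as a word-budgeted splitter. This is an illustrative sketch, not the actor's behavior: it splits the `markdown` field on paragraph boundaries and sizes chunks using the rough ~1.3 tokens-per-word rule.

```python
def chunk_markdown(markdown: str, max_tokens: int = 512, tokens_per_word: float = 1.3):
    """Split Markdown into word-budgeted chunks for vector-DB ingestion.
    Paragraph boundaries are preserved; budgets use ~1.3 tokens per word."""
    max_words = int(max_tokens / tokens_per_word)
    chunks, current, count = [], [], 0
    for para in markdown.split("\n\n"):
        words = len(para.split())
        # Flush the current chunk before it would exceed the word budget.
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Paragraph-level splitting keeps headings and list items intact, which tends to produce more coherent embeddings than fixed-size character windows.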
LLM training data
Prepare fine-tuning datasets by converting blog posts, knowledge bases, and technical documentation into clean text. The boilerplate removal ensures training data isn't polluted with navigation and ads.
Documentation archival
Snapshot entire documentation sites as Markdown files for offline access, version control, or migration between platforms.
Competitive content analysis
Convert competitor blog posts and landing pages to Markdown for content gap analysis, keyword extraction, and tone analysis using LLMs.
Knowledge base building
Feed web content into chatbot knowledge bases. The clean Markdown output integrates directly with LangChain and LlamaIndex document loaders.
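As a rough sketch (the helper name and sample items are illustrative, not part of the actor), dataset items map naturally onto the `(page_content, metadata)` pairs that LangChain- and LlamaIndex-style document loaders consume:

```python
def to_loader_records(items):
    """Reshape actor dataset items into (text, metadata) pairs,
    the shape most document loaders expect."""
    records = []
    for item in items:
        metadata = {
            "source": item["url"],
            "title": item.get("title", ""),
            "language": item.get("language"),
            "word_count": item.get("wordCount", 0),
        }
        records.append((item["markdown"], metadata))
    return records

# Hypothetical sample item mirroring the actor's output fields.
items = [{"url": "https://example.com/docs", "title": "Docs",
          "language": "en", "wordCount": 42, "markdown": "# Docs\n\nHello."}]
records = to_loader_records(items)
```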
Content migration
Extract content from legacy CMS platforms as Markdown for migration to static site generators (Hugo, Jekyll, Astro) or modern CMS tools.
How to Use
- Open the actor in the Apify Console and click Start
- Enter one or more website URLs (e.g., `https://docs.example.com` or just `example.com`)
- Adjust Max pages per domain (default: 10) and Max crawl depth (default: 2)
- Toggle Main content only to control boilerplate stripping (enabled by default)
- Click Start and wait for the run to complete
- Download results from the Dataset tab in JSON, CSV, or Excel format
Input Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `urls` | string[] | Yes | — | Website URLs to crawl and convert. Bare domains are auto-prefixed with `https://` |
| `maxPagesPerDomain` | integer | No | 10 | Maximum pages to convert per domain (1–100) |
| `maxCrawlDepth` | integer | No | 2 | How many link levels deep to follow (0–5). Set 0 for starting pages only |
| `includeMetadata` | boolean | No | true | Include page title, meta description, language, and word count |
| `onlyMainContent` | boolean | No | true | Extract only main content, stripping nav, footers, sidebars, ads |
| `proxyConfiguration` | object | No | Apify Proxy | Proxy settings for crawling |
Input Examples
Convert a documentation site:
```json
{
  "urls": ["https://docs.apify.com/academy"],
  "maxPagesPerDomain": 50,
  "maxCrawlDepth": 3,
  "includeMetadata": true,
  "onlyMainContent": true
}
```
Single page only (no crawling):
```json
{
  "urls": ["https://example.com/blog/my-article"],
  "maxPagesPerDomain": 1,
  "maxCrawlDepth": 0,
  "includeMetadata": true,
  "onlyMainContent": true
}
```
Multiple sites, raw content (no boilerplate stripping):
```json
{
  "urls": ["example.com", "competitor.com"],
  "maxPagesPerDomain": 20,
  "maxCrawlDepth": 2,
  "includeMetadata": false,
  "onlyMainContent": false
}
```
Input Tips
- Start with 3-5 pages to test output quality before scaling up.
- Use crawl depth 0 when you only need specific pages — provide exact URLs.
- Keep "Main content only" enabled for AI/LLM workflows — the boilerplate removal produces much cleaner text.
- Provide section-specific URLs rather than homepages. Use `https://example.com/docs` instead of `https://example.com` to target relevant content directly.
Output Example
Each page is stored as a separate item in the dataset:
```json
{
  "url": "https://docs.apify.com/academy/web-scraping-for-beginners",
  "title": "Web scraping for beginners | Apify Academy",
  "description": "Learn the basics of web scraping with this beginner-friendly guide.",
  "markdown": "# Web scraping for beginners\n\nWeb scraping is the process of extracting data from websites...\n\n## Why scrape the web?\n\n- **Market research** -- Track competitor pricing\n- **Lead generation** -- Build prospect lists\n\n## Getting started\n\nTo get started with web scraping...",
  "wordCount": 1247,
  "language": "en",
  "crawlDepth": 0,
  "crawledAt": "2025-01-15T14:30:00.000Z"
}
```
Output Fields
| Field | Type | Description |
|---|---|---|
| `url` | string | Full URL of the converted page |
| `title` | string | Page title from OpenGraph, the `<title>` tag, or the first `<h1>` (empty if metadata disabled) |
| `description` | string | Meta description from OpenGraph or `<meta name="description">` |
| `markdown` | string | Full converted Markdown content of the page |
| `wordCount` | number | Word count of the Markdown output — useful for estimating LLM token usage (~1.3 tokens per word) |
| `language` | string/null | Language code from the `<html lang>` attribute (e.g., "en", "fr") |
| `crawlDepth` | number | How many links deep this page was from the starting URL (0 = starting page) |
| `crawledAt` | string | ISO 8601 timestamp of when the page was crawled |
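For context-window planning, `wordCount` converts to an approximate token count. A small sketch (the 1.3 multiplier is the rough rule of thumb from the Tips section, not an exact tokenizer, and the helper names are illustrative):

```python
def estimate_tokens(word_count: int, tokens_per_word: float = 1.3) -> int:
    """Approximate LLM token usage from the actor's wordCount field."""
    return round(word_count * tokens_per_word)

def fits_context(pages, context_window: int = 128_000, reserve: int = 4_096) -> bool:
    """Check whether a set of pages fits a model's context window,
    reserving headroom for the prompt and response."""
    total = sum(estimate_tokens(p["wordCount"]) for p in pages)
    return total <= context_window - reserve
```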
Programmatic Access (API)
Python
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("ryanclinton/website-content-to-markdown").call(
    run_input={
        "urls": ["https://docs.apify.com/academy"],
        "maxPagesPerDomain": 20,
        "maxCrawlDepth": 2,
        "onlyMainContent": True,
    }
)

for page in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"{page['url']} — {page['wordCount']} words")
    # Feed into RAG pipeline, save to file, etc.
    with open(f"{page['title']}.md", "w") as f:
        f.write(page["markdown"])
```
JavaScript
```javascript
import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: "YOUR_API_TOKEN" });

const run = await client.actor("ryanclinton/website-content-to-markdown").call({
    urls: ["https://docs.apify.com/academy"],
    maxPagesPerDomain: 20,
    maxCrawlDepth: 2,
    onlyMainContent: true,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
for (const page of items) {
    console.log(`${page.url} — ${page.wordCount} words`);
    // Feed into LangChain, save to vector DB, etc.
}
```
cURL
```bash
# Start the actor run
curl -X POST "https://api.apify.com/v2/acts/ryanclinton~website-content-to-markdown/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://docs.apify.com/academy"], "maxPagesPerDomain": 20, "maxCrawlDepth": 2, "onlyMainContent": true}'

# Fetch results (replace DATASET_ID from the run response)
curl "https://api.apify.com/v2/datasets/DATASET_ID/items?token=YOUR_API_TOKEN&format=json"
```
How It Works
The actor runs a three-phase pipeline:
Phase 1: URL Discovery
For each input URL, the actor:
- Normalizes the URL (adds `https://` if missing, validates format)
- Deduplicates by domain — only the first URL per domain is used as a starting point
- Fetches `sitemap.xml` (and sitemap index files) to discover additional pages beyond what internal links expose
- Builds a request queue combining the starting URL with sitemap-discovered URLs
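The normalization and per-domain dedup steps can be sketched as follows. This is a hypothetical re-implementation using only the Python standard library, not the actor's actual code:

```python
from urllib.parse import urlparse

def normalize_start_urls(raw_urls):
    """Prefix bare domains with https:// and keep only the
    first URL per domain as a crawl starting point."""
    seen_domains = set()
    starts = []
    for raw in raw_urls:
        url = raw if raw.startswith(("http://", "https://")) else f"https://{raw}"
        domain = urlparse(url).netloc.lower()
        if domain and domain not in seen_domains:
            seen_domains.add(domain)
            starts.append(url)
    return starts
```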
Phase 2: Page Crawling
A CheerioCrawler processes pages with 10 concurrent workers at up to 120 requests/minute:
- Skips non-HTML responses (XML sitemaps, PDFs, feeds)
- Enforces per-domain page limits and URL deduplication (trailing slash normalization)
- Follows internal links using breadth-first search (BFS) up to `maxCrawlDepth` levels
- Filters out links to binary files (images, videos, fonts, archives)
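The crawl loop above can be sketched as a depth-limited BFS with trailing-slash deduplication. `get_links` is a hypothetical stand-in for fetching and parsing a page, and the whole function is an illustration of the described behavior rather than the actor's code:

```python
from collections import deque

def bfs_crawl(start, get_links, max_depth=2, max_pages=10):
    """Depth-limited BFS over internal links. URLs are deduplicated
    after trailing-slash normalization; enqueueing stops at max_depth."""
    norm = lambda u: u.rstrip("/") or u
    visited = {norm(start)}
    queue = deque([(start, 0)])
    pages = []
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        pages.append((url, depth))
        if depth >= max_depth:
            continue  # don't follow links beyond the depth limit
        for link in get_links(url):
            key = norm(link)
            if key not in visited:
                visited.add(key)
                queue.append((link, depth + 1))
    return pages
```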
Phase 3: Content Extraction & Conversion
For each page, the actor:
- Content extraction — if "main content only" is enabled, tries semantic selectors in order: `<main>`, `<article>`, `[role="main"]`, `#content`, `.content`, `.post-content`, `.entry-content`, `.article-body`, `.page-content`, `.main-content`. Uses the first match with 200+ characters; falls back to `<body>` with non-content removal.
- Boilerplate stripping — removes 30+ categories of non-content elements: navigation, headers, footers, sidebars, ads, cookie banners, social widgets, comment sections, modals, breadcrumbs, hidden elements, scripts, styles, and iframes.
- HTML-to-Markdown conversion — the Turndown library converts clean HTML to GitHub Flavored Markdown with ATX headings, fenced code blocks, and inline links. Custom rules strip data-URI images, images without alt text, and empty anchor tags.
- Cleanup — collapses triple+ newlines to double, trims trailing whitespace per line, removes blank-only lines.
- Metadata extraction — pulls title (OpenGraph >
<title>><h1>), description (OpenGraph > meta description), and language (<html lang>). Counts words for token estimation. - Quality filter — pages producing less than 50 characters of Markdown are skipped as near-empty.
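The cleanup pass can be illustrated in a few lines of Python (an approximation of the behavior described above, not the actor's code):

```python
import re

def cleanup_markdown(text: str) -> str:
    """Trim trailing whitespace per line, then collapse runs of
    three or more newlines down to a single blank line."""
    text = "\n".join(line.rstrip() for line in text.splitlines())
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```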
How Much Does It Cost?
The actor runs on minimal resources (256 MB memory) and is very cost-efficient:
| Scenario | Pages | Estimated Cost | Run Time |
|---|---|---|---|
| Single page | 1 | < $0.01 | ~5 seconds |
| Small site | 10 | ~$0.01 | ~20 seconds |
| Medium site | 50 | ~$0.03 | ~2 minutes |
| Large site | 100 | ~$0.05 | ~5 minutes |
Proxy usage is the primary cost driver. The actor processes pages concurrently (up to 10 at a time) for fast throughput.
Tips
- Check word counts to estimate token usage. Rough rule: 1 word ≈ 1.3 tokens for most LLMs.
- Disable metadata if you only need raw Markdown text and want smaller output.
- Use depth 0 for targeted extraction. Provide exact URLs and skip link following entirely.
- Keep "Main content only" enabled for AI workflows — it dramatically improves text quality.
- Provide section-specific URLs (e.g., `/docs`, `/blog`) rather than homepages to target relevant content.
- Schedule periodic runs to keep knowledge base snapshots up to date.
Combine with Other Actors
| Actor | How to combine |
|---|---|
| AI Training Data Curator | Convert websites to Markdown, then use the curator to clean, deduplicate, and format for LLM training |
| Website Change Monitor | Detect when pages change, then re-convert to Markdown for updated knowledge bases |
| Website Contact Scraper | Extract contacts from the same sites you're converting to Markdown |
| Company Deep Research | Feed company website Markdown into deep research workflows |
| Wikipedia Article Search | Combine Wikipedia content with website content for comprehensive knowledge bases |
Limitations
- No JavaScript rendering — uses CheerioCrawler (server-side HTML parsing), not a headless browser. Single-page applications (SPAs) that load content via JavaScript are not supported.
- No authenticated content — only processes publicly accessible pages. Login walls and paywalls yield the gate page itself, not the protected content behind it.
- English-optimized selectors — content extraction selectors use English class names (`.content`, `.post-content`). Sites with non-English class names may need "main content only" disabled for best results.
- Same-domain only — the crawler never follows links to external domains.
- No PDF/image content — only converts HTML pages. PDFs, images, and other binary files are skipped.
- Sitemap dependent — page discovery depends on sitemap.xml and internal links. Orphaned pages not linked from anywhere won't be discovered.
Responsible Use
- This actor only accesses publicly visible web pages.
- Respect `robots.txt` and website terms of service regarding automated access.
- Do not use converted content in ways that violate the original website's copyright or licensing terms.
- For guidance on web scraping legality, see Apify's guide.
FAQ
What types of content does this actor handle best?
Text-heavy pages: documentation, blog posts, articles, knowledge bases, product descriptions, and informational pages. Not designed for SPAs requiring JavaScript rendering.
Can I use the output directly with ChatGPT, Claude, or other LLMs?
Yes. The Markdown output is specifically designed for LLM consumption. Feed the markdown field directly into prompts, build vector databases for RAG, or store as training data. The word count helps estimate context window fit.
Does the actor follow links to other domains?
No. It only follows internal links within the same domain as each starting URL.
What happens if a page fails to load?
The actor retries failed requests up to 2 times. If a page still fails, it is skipped with a warning. Failed pages don't count toward the per-domain page limit.
How does "main content only" work?
It tries semantic HTML selectors in order (<main>, <article>, [role="main"], etc.) to find the content area. If found, it strips non-content elements within that area. If not found, it falls back to the full <body> with 30+ categories of boilerplate removed.
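The fallback order can be illustrated with a small sketch. `matches_for` is a hypothetical stand-in for running a CSS selector and returning the matched element's text; the selector list is abbreviated from the full list above:

```python
SELECTOR_PRIORITY = ["main", "article", '[role="main"]', "#content",
                     ".content", ".post-content"]

def pick_content_node(matches_for, min_chars=200):
    """Return the first selector whose match has min_chars+ characters
    of text, or None to signal falling back to <body>."""
    for selector in SELECTOR_PRIORITY:
        text = matches_for(selector)
        if text and len(text) >= min_chars:
            return selector, text
    return None
```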
Integrations
- LangChain / LlamaIndex — use the Apify document loader to feed Markdown directly into your RAG pipeline
- Zapier — send converted Markdown to Google Docs, Notion, Slack, or any Zapier-supported destination
- Make — chain with other steps in automated workflows
- Google Sheets — export to spreadsheets for team review
- Apify API — trigger runs programmatically and retrieve results via REST API
- Webhooks — get notified when conversion completes
- GitHub Actions — schedule periodic runs to keep documentation snapshots up to date
- Vector databases (Pinecone, Weaviate, Qdrant, Chroma) — push Markdown output for semantic search