Website Content To Markdown

Pricing

Pay per usage

Convert any website to clean Markdown for RAG pipelines, LLM training, and AI apps. Extracts main content, strips navigation and ads, preserves headings, code blocks, and tables. Sitemap auto-discovery. Lightweight Firecrawl alternative.

Rating

0.0

(0)

Developer

ryan clinton

Maintained by Community

Actor stats

Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 18 hours ago

Convert any website to clean Markdown for RAG pipelines, LLM training data, and AI applications. A lightweight, transparent-pricing alternative to Firecrawl.

What it does

Give it any URL and it:

  • Extracts the main content (strips nav, footer, sidebar, ads)
  • Converts to clean Markdown with proper heading hierarchy
  • Preserves code blocks, tables, lists, and links
  • Returns per-page metadata (title, description, word count, language)
  • Auto-discovers pages via sitemap.xml and link following
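Sitemap discovery can be illustrated with a minimal sketch. It assumes a standard sitemaps.org `<urlset>` document and uses only the Python standard library; the function name `urls_from_sitemap` and the sample XML are hypothetical and are not taken from the Actor's source.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol for <urlset> documents.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap document (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Illustrative sitemap; a real one would be fetched from <domain>/sitemap.xml.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs</loc></url>
</urlset>"""

discovered = urls_from_sitemap(SAMPLE_SITEMAP)
print(discovered)
```

Sitemap index files (`<sitemapindex>`) are not handled here; how the Actor treats them is not stated in this listing.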

Key features

  • Main content extraction — intelligent stripping of navigation, footers, sidebars, cookie banners, and ads
  • Semantic detection — finds <main>, <article>, [role="main"] before falling back to body
  • GFM support — tables, strikethrough, and task lists converted properly
  • Sitemap auto-discovery — finds all pages on a domain via sitemap.xml
  • Depth-controlled crawling — BFS from starting page with configurable depth
  • Per-page output — each page is its own dataset item, ready for vector ingestion
  • Metadata — title, description, language, word count per page
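The depth-controlled crawl described above can be sketched as a breadth-first traversal with a depth cap and a page cap. This is only an illustration under assumptions: `bfs_crawl_order` and the in-memory `SITE` link map are hypothetical stand-ins for real HTTP fetches, and the Actor's actual implementation may differ.

```python
from collections import deque


def bfs_crawl_order(start, links, max_depth=2, max_pages=10):
    """Return (url, depth) pairs in breadth-first order.

    `links` is a hypothetical map of page -> outgoing links that stands in
    for fetching and parsing real pages.
    """
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # depth limit reached: do not follow links from this page
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited


# Illustrative site structure (assumption, not real data).
SITE = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/intro", "/"],
    "/docs/intro": ["/docs/deep"],
}

order = bfs_crawl_order("/", SITE, max_depth=1)
print(order)
```

With `max_depth=0` only the starting page is visited, which matches the documented meaning of `maxCrawlDepth` = 0.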

Example output

{
  "url": "https://docs.apify.com/academy/web-scraping-for-beginners",
  "title": "Web scraping for beginners",
  "description": "Learn the basics of web scraping and data extraction.",
  "markdown": "# Web scraping for beginners\n\nWeb scraping is the process of extracting data from websites...",
  "wordCount": 1250,
  "language": "en",
  "crawlDepth": 0,
  "crawledAt": "2026-02-07T12:00:00.000Z"
}
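Because each item carries its Markdown with the heading hierarchy intact, one plausible way to prepare it for vector ingestion is to split on headings. The sketch below is an assumption about downstream use, not part of the Actor: `chunk_by_heading` and the `chunkIndex` and `text` keys are hypothetical names.

```python
import re

# Matches ATX-style Markdown headings such as "# Title" or "### Section".
HEADING = re.compile(r"^#{1,6}\s")


def chunk_by_heading(item: dict) -> list[dict]:
    """Split an item's Markdown into one chunk per heading section.

    Caveat: a line starting with "# " inside a code block would also be
    treated as a heading; a real pipeline should track code fences.
    """
    sections, current = [], []
    for line in item["markdown"].splitlines():
        if HEADING.match(line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    non_empty = [text for text in sections if text]
    return [
        {"url": item["url"], "chunkIndex": i, "text": text}
        for i, text in enumerate(non_empty)
    ]


# Illustrative item shaped like the example output above.
item = {
    "url": "https://example.com/docs",
    "markdown": "# Title\n\nIntro text.\n\n## Setup\n\nInstall it.",
}
chunks = chunk_by_heading(item)
```

Each chunk keeps the source URL so that retrieved passages can be traced back to their page.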

Input

Field | Type | Default | Description
--- | --- | --- | ---
urls | string[] | required | Starting URLs to crawl and convert
maxPagesPerDomain | integer (1-100) | 10 | Maximum pages per domain
maxCrawlDepth | integer (0-5) | 2 | Link levels to follow (0 = starting page only)
includeMetadata | boolean | true | Include title, description, language
onlyMainContent | boolean | true | Strip nav/footer/sidebar/ads
proxyConfiguration | object | Apify Proxy | Proxy settings
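The ranges and defaults in the input table can be checked client-side before starting a run. The following is a hypothetical helper (`validate_input` is not part of the Apify client or the Actor), and it assumes that `urls` must be a non-empty list.

```python
def validate_input(run_input: dict) -> dict:
    """Apply the documented defaults and check the documented ranges."""
    urls = run_input.get("urls")
    if (
        not isinstance(urls, list)
        or not urls
        or not all(isinstance(u, str) for u in urls)
    ):
        raise ValueError("urls is required and must be a non-empty list of strings")

    # Defaults taken from the input table; caller-supplied values win.
    merged = {
        "maxPagesPerDomain": 10,
        "maxCrawlDepth": 2,
        "includeMetadata": True,
        "onlyMainContent": True,
        **run_input,
    }
    if not 1 <= merged["maxPagesPerDomain"] <= 100:
        raise ValueError("maxPagesPerDomain must be between 1 and 100")
    if not 0 <= merged["maxCrawlDepth"] <= 5:
        raise ValueError("maxCrawlDepth must be between 0 and 5")
    return merged


checked = validate_input({"urls": ["https://example.com"], "maxCrawlDepth": 0})
```

The returned dictionary could then be passed as `run_input` when calling the Actor.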

Use cases

  • RAG pipelines — Feed clean content into vector databases (Pinecone, Weaviate, Qdrant)
  • LLM fine-tuning — Build training datasets from web content
  • Knowledge bases — Convert documentation sites to searchable markdown
  • Content migration — Move website content between platforms
  • AI agents — Give agents access to structured web page content
  • Research — Extract readable content from multiple sources

API usage

Python

from apify_client import ApifyClient

# Authenticate with your Apify API token.
client = ApifyClient("YOUR_API_TOKEN")

# Start the Actor and wait for the run to finish.
run = client.actor("ryanclinton/website-content-to-markdown").call(
    run_input={
        "urls": ["https://docs.apify.com/academy/web-scraping-for-beginners"],
        "maxPagesPerDomain": 10,
        "maxCrawlDepth": 2,
    }
)

# Each crawled page is one item in the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"{item['url']}: {item['wordCount']} words")
    print(item["markdown"][:200])

JavaScript

import { ApifyClient } from 'apify-client';

// Authenticate with your Apify API token.
const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Start the Actor and wait for the run to finish.
const run = await client.actor('ryanclinton/website-content-to-markdown').call({
    urls: ['https://docs.apify.com/academy/web-scraping-for-beginners'],
    maxPagesPerDomain: 10,
    maxCrawlDepth: 2,
});

// Each crawled page is one item in the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.log(`${item.url}: ${item.wordCount} words`);
});

Pipeline integration

Chain with LLM processing for AI workflows:

  1. Website Content to Markdown — Extract clean content
  2. LLM API — Summarize, classify, or extract entities
  3. Vector database — Store embeddings for RAG retrieval

Or combine with the B2B lead generation pipeline:

  1. Google Maps Lead Enricher — Find businesses
  2. Website Content to Markdown — Extract their content
  3. Website Tech Stack Detector — Analyze their tech
  4. B2B Lead Qualifier — Score and qualify leads

Limitations

  • Uses CheerioCrawler (HTTP-only) — JavaScript-rendered SPAs may return minimal content
  • Rate-limited to 120 requests/minute per domain
  • Maximum 100 pages per domain
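Since JavaScript-rendered pages may come back nearly empty, it can help to flag suspiciously thin results after a run. The check below is a rough heuristic, not something the Actor provides: `likely_js_rendered` is a hypothetical helper and the 50-word threshold is an arbitrary assumption to tune per site.

```python
def likely_js_rendered(item: dict, min_words: int = 50) -> bool:
    """Flag items whose extracted content is suspiciously short.

    Falls back to counting words in the Markdown if wordCount is absent.
    The threshold is an assumption, not a documented value.
    """
    word_count = item.get("wordCount")
    if word_count is None:
        word_count = len(item.get("markdown", "").split())
    return word_count < min_words


# Illustrative items shaped like the Actor's output.
items = [
    {"url": "https://example.com/docs", "wordCount": 1250},
    {"url": "https://example.com/app", "wordCount": 4},
]
thin_pages = [item["url"] for item in items if likely_js_rendered(item)]
print(thin_pages)
```

Pages flagged this way are candidates for re-crawling with a browser-based tool.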