# Website Content To Markdown
Convert any website to clean Markdown for RAG pipelines, LLM training data, and AI applications. A lightweight, transparent-pricing alternative to Firecrawl.
## What it does
Give it any URL and it:
- Extracts the main content (strips nav, footer, sidebar, ads)
- Converts to clean Markdown with proper heading hierarchy
- Preserves code blocks, tables, lists, and links
- Returns per-page metadata (title, description, word count, language)
- Auto-discovers pages via sitemap.xml and link following
## Key features
- Main content extraction — intelligent stripping of navigation, footers, sidebars, cookie banners, and ads
- Semantic detection — finds `<main>`, `<article>`, and `[role="main"]` before falling back to `<body>`
- GFM support — tables, strikethrough, and task lists converted properly
- Sitemap auto-discovery — finds all pages on a domain via sitemap.xml
- Depth-controlled crawling — BFS from the starting page with configurable depth (a minimal sketch follows this list)
- Per-page output — each page is its own dataset item, ready for vector ingestion
- Metadata — title, description, language, word count per page
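
To make the crawl-control inputs concrete, here is a minimal sketch of a depth-limited BFS crawl. It is not the actor's implementation (the actor uses CheerioCrawler and also reads sitemap.xml); the `requests` and `beautifulsoup4` dependencies and the helper below are assumptions for illustration only.

```python
# Illustrative depth-limited BFS crawl, NOT the actor's source code.
# Shows how maxCrawlDepth and maxPagesPerDomain bound the crawl.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def same_domain_links(url: str) -> list[str]:
    """Fetch a page and return absolute links that stay on the same domain."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    domain = urlparse(url).netloc
    candidates = (urljoin(url, a["href"]) for a in soup.find_all("a", href=True))
    return [link for link in candidates if urlparse(link).netloc == domain]


def crawl(start_url: str, max_crawl_depth: int = 2, max_pages_per_domain: int = 10) -> list[str]:
    """Breadth-first crawl; depth 0 means the starting page only."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited: list[str] = []
    while queue and len(visited) < max_pages_per_domain:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_crawl_depth:
            continue
        for link in same_domain_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```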
## Example output

```json
{
  "url": "https://docs.apify.com/academy/web-scraping-for-beginners",
  "title": "Web scraping for beginners",
  "description": "Learn the basics of web scraping and data extraction.",
  "markdown": "# Web scraping for beginners\n\nWeb scraping is the process of extracting data from websites...",
  "wordCount": 1250,
  "language": "en",
  "crawlDepth": 0,
  "crawledAt": "2026-02-07T12:00:00.000Z"
}
```
## Input

| Field | Type | Default | Description |
|---|---|---|---|
| urls | string[] | required | Starting URLs to crawl and convert |
| maxPagesPerDomain | integer (1-100) | 10 | Maximum pages per domain |
| maxCrawlDepth | integer (0-5) | 2 | Link levels to follow (0 = starting page only) |
| includeMetadata | boolean | true | Include title, description, language |
| onlyMainContent | boolean | true | Strip nav/footer/sidebar/ads |
| proxyConfiguration | object | Apify Proxy | Proxy settings |
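
For reference, a full run input using these fields might look like the following; the values are illustrative, and the `proxyConfiguration` object is shown in the common Apify Proxy shape.

```json
{
  "urls": ["https://docs.apify.com/academy/web-scraping-for-beginners"],
  "maxPagesPerDomain": 10,
  "maxCrawlDepth": 2,
  "includeMetadata": true,
  "onlyMainContent": true,
  "proxyConfiguration": { "useApifyProxy": true }
}
```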
## Use cases
- RAG pipelines — Feed clean content into vector databases (Pinecone, Weaviate, Qdrant); a chunking sketch follows this list
- LLM fine-tuning — Build training datasets from web content
- Knowledge bases — Convert documentation sites to searchable markdown
- Content migration — Move website content between platforms
- AI agents — Give agents access to structured web page content
- Research — Extract readable content from multiple sources
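
As an illustration of the RAG use case, the sketch below pulls the actor's dataset items and splits each page's markdown on headings into chunks ready for embedding. The heading-based splitting is one possible strategy, not something the actor does for you; replace the `print` with the embedding and vector-store client of your choice.

```python
# Illustrative RAG ingestion: chunk each page's markdown by heading.
import re

from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")
run = client.actor("ryanclinton/website-content-to-markdown").call(
    run_input={"urls": ["https://docs.apify.com/academy/web-scraping-for-beginners"]}
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    # Split before every markdown heading so chunks stay topically coherent.
    chunks = [c for c in re.split(r"\n(?=#{1,6} )", item["markdown"]) if c.strip()]
    for chunk in chunks:
        # Replace this print with an embedding call + vector DB upsert.
        print(item["url"], len(chunk), "chars")
```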
## API usage

### Python

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("ryanclinton/website-content-to-markdown").call(
    run_input={
        "urls": ["https://docs.apify.com/academy/web-scraping-for-beginners"],
        "maxPagesPerDomain": 10,
        "maxCrawlDepth": 2,
    }
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"{item['url']} — {item['wordCount']} words")
    print(item["markdown"][:200])
```
### JavaScript

```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('ryanclinton/website-content-to-markdown').call({
    urls: ['https://docs.apify.com/academy/web-scraping-for-beginners'],
    maxPagesPerDomain: 10,
    maxCrawlDepth: 2,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.log(`${item.url} — ${item.wordCount} words`);
});
```
## Pipeline integration

Chain with LLM processing for AI workflows (a minimal sketch follows this list):
- Website Content to Markdown — Extract clean content
- LLM API — Summarize, classify, or extract entities
- Vector database — Store embeddings for RAG retrieval
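
A minimal sketch of steps 1 and 2, assuming summarization via the OpenAI Python SDK (any LLM client or LLM actor works here; the model name, prompt, and truncation limit are placeholders, and the vector-store step is omitted):

```python
# Sketch of chaining the markdown actor into an LLM summarization step.
# The OpenAI SDK, model name, and prompt are assumptions for illustration.
from apify_client import ApifyClient
from openai import OpenAI

apify = ApifyClient("YOUR_API_TOKEN")
llm = OpenAI()  # reads OPENAI_API_KEY from the environment

run = apify.actor("ryanclinton/website-content-to-markdown").call(
    run_input={"urls": ["https://docs.apify.com/academy/web-scraping-for-beginners"]}
)

for item in apify.dataset(run["defaultDatasetId"]).iterate_items():
    completion = llm.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{
            "role": "user",
            "content": f"Summarize this page in three sentences:\n\n{item['markdown'][:8000]}",
        }],
    )
    print(item["url"], "->", completion.choices[0].message.content)
```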
Or combine with the B2B lead generation pipeline:
- Google Maps Lead Enricher — Find businesses
- Website Content to Markdown — Extract their content
- Website Tech Stack Detector — Analyze their tech
- B2B Lead Qualifier — Score and qualify leads
## Limitations
- Uses CheerioCrawler (HTTP-only) — JavaScript-rendered SPAs may return minimal content (a quick check follows this list)
- Rate-limited to 120 requests/minute per domain
- Maximum 100 pages per domain
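
Because the crawler is HTTP-only, one practical guard is to flag dataset items whose word count is suspiciously low and re-crawl those URLs with a browser-based scraper. The 50-word threshold below is an arbitrary illustration, and `YOUR_DATASET_ID` is a placeholder.

```python
# Flag pages that may be JS-rendered and came back nearly empty.
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

suspicious = [
    item["url"]
    for item in client.dataset("YOUR_DATASET_ID").iterate_items()
    if item.get("wordCount", 0) < 50  # arbitrary threshold for illustration
]
print("Pages that may need a browser-based crawler:", suspicious)
```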