Pricing

from $4.00 / 1,000 results

Website to Markdown Crawler for LLM & RAG

Crawl any website to clean Markdown and plain text for LLM training and RAG. HTML to Markdown, no API or login. Export website text to CSV or JSON.

Developer

Logiover

Last modified: 7 days ago

Website to Markdown & Text Crawler — AI, RAG & LLM Data 📄

Turn any website into clean Markdown and plain text for AI. This website content crawler crawls an entire site, strips away navigation, headers, footers, ads and scripts, and exports the boilerplate-free main content of every page as Markdown and plain text — ready to feed straight into LLM training sets, RAG pipelines, embeddings, vector databases and AI agents.

Give it one URL — it discovers and extracts every page automatically. No login, no headless browser, one clean row per page.

Looking to scrape a website for an LLM, convert HTML to Markdown, build RAG data, or extract text from a website at scale? That's exactly what this actor does.

✨ Key features

🕷️ Full-site crawl — start from one URL and follow internal links across the whole domain.
📝 Clean Markdown + plain text — main content only, with nav/header/footer/sidebar/scripts removed.
🔗 Absolute links & images — relative URLs are rewritten to absolute, so the Markdown is portable.
🧠 Built for AI / RAG / LLM — chunk-ready output for embeddings, fine-tuning and retrieval.
🏷️ Rich page metadata — title, meta description, H1, language, canonical and word count.
⚡ Fast & cheap — pure HTTP, no browser, high concurrency.

💡 Use cases

RAG & knowledge bases — turn docs, blogs and help centers into clean Markdown chunks for retrieval-augmented generation.
LLM fine-tuning datasets — collect high-quality text at scale from any set of websites.
AI agents & chatbots — feed your agent fresh, structured website content.
Content migration & archiving — export an entire website to Markdown.
Semantic search & embeddings — generate clean text to embed into a vector database (Pinecone, Weaviate, pgvector, …).

📦 What you get

One row per crawled page:

| Field | Description |
| --- | --- |
| `url` | Page URL |
| `title` | Page title |
| `metaDescription` | Meta description |
| `h1` | First H1 heading |
| `lang` | Page language |
| `canonical` | Canonical URL |
| `wordCount` | Word count of the main content |
| `text` | Clean main-content text (boilerplate removed) |
| `markdown` | The same content converted to Markdown |
| `html` | Cleaned main-content HTML (optional) |
| `crawledAt` | ISO 8601 timestamp |

Example output

{
  "url": "https://docs.example.com/getting-started",
  "title": "Getting Started",
  "metaDescription": "Set up the SDK in 5 minutes.",
  "h1": "Getting Started",
  "wordCount": 812,
  "text": "Getting Started Install the package...",
  "markdown": "# Getting Started\n\nInstall the package...",
  "crawledAt": "2026-05-25T14:13:00.000Z"
}
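
To illustrate what "chunk-ready" can mean downstream, here is a minimal sketch that splits the `markdown` field of one row into heading-based chunks before embedding. The function name `chunk_markdown` and the `max_words` limit are assumptions made for this example; the actor itself does not ship a chunker, and the sample row is shortened from the example output above.

```python
import re


def chunk_markdown(markdown: str, max_words: int = 200) -> list[str]:
    """Split Markdown at ATX headings, then cap oversized sections by word count.

    Limitation of this sketch: a line starting with "# " inside a fenced
    code block is also treated as a heading.
    """
    # Zero-width split before every heading line so each heading stays
    # attached to the section that follows it.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks = []
    for section in sections:
        words = section.split()
        if not words:
            continue  # skip the empty piece produced before the first heading
        if len(words) <= max_words:
            chunks.append(section.strip())
        else:
            # Long section: fall back to consecutive windows of max_words words.
            for start in range(0, len(words), max_words):
                chunks.append(" ".join(words[start:start + max_words]))
    return chunks


# Hypothetical row, shortened from the example output.
row = {
    "url": "https://docs.example.com/getting-started",
    "markdown": "# Getting Started\n\nInstall the package.\n\n## Next steps\n\nRead the guide.",
}
chunks = chunk_markdown(row["markdown"])
for chunk in chunks:
    print(repr(chunk))
```

Splitting at headings keeps each chunk topically coherent; the word cap is only a fallback for very long sections.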

🚀 How to use it

Click Try for free / Start.
Paste one or more website URLs into Start URLs.
(Optional) Set Max pages to crawl — use 0 to crawl the whole site.
(Optional) Toggle Save Markdown, Save plain text, Save HTML.
Click Save & Start.
Export your dataset as JSON, CSV, Excel or via API, or pull it straight into your AI pipeline.

⚙️ Input

| Option | Description | Default |
| --- | --- | --- |
| `startUrls` | Websites to crawl | – (required) |
| `maxPagesToCrawl` | Max pages per run (`0` = whole site) | `1000` |
| `saveMarkdown` | Include Markdown output | `true` |
| `saveText` | Include plain-text output | `true` |
| `saveHtml` | Include cleaned main-content HTML | `false` |
| `maxConcurrency` | Parallel requests | `10` |

Example input

{
  "startUrls": [{ "url": "https://docs.apify.com" }],
  "maxPagesToCrawl": 2000,
  "saveMarkdown": true,
  "saveText": true
}
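
If you build the input programmatically, a small helper can catch typos in option names before a run starts. The sketch below is only an illustration: `build_run_input` is a hypothetical helper, not part of the actor, and the defaults are copied from the Input table above (the actor applies its own defaults, so spelling them out is optional).

```python
import json

# Defaults copied from the Input table; the actor applies them itself.
DEFAULTS = {
    "maxPagesToCrawl": 1000,
    "saveMarkdown": True,
    "saveText": True,
    "saveHtml": False,
    "maxConcurrency": 10,
}


def build_run_input(start_urls, **overrides):
    """Return an input dict in the shape shown in the example input."""
    if not start_urls:
        raise ValueError("startUrls is required")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    run_input = {
        "startUrls": [{"url": url} for url in start_urls],
        **DEFAULTS,
        **overrides,
    }
    if run_input["maxPagesToCrawl"] < 0:
        raise ValueError("maxPagesToCrawl must be 0 (whole site) or positive")
    return run_input


run_input = build_run_input(["https://docs.apify.com"], maxPagesToCrawl=2000)
print(json.dumps(run_input, indent=2))
```

The resulting dict can be pasted into the actor's JSON input editor or sent as the run input through the API.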

🔍 How it works

The crawler follows internal links within the same domain as your Start URLs. For each page it removes scripts, styles, navigation, headers, footers and sidebars, isolates the main content (<main> / <article> / body), rewrites relative links and images to absolute URLs, and exports the result as clean text and Markdown. It's pure HTTP — fast and cheap, with no headless browser.
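
The steps above can be sketched with the Python standard library. This is not the actor's implementation, only a simplified illustration of the idea: skip text inside boilerplate tags, resolve relative links against the page URL, and keep same-domain links for the crawl queue. The class name `MainTextExtractor` and the sample HTML are assumptions, and the sketch skips the `<main>` / `<article>` isolation and the Markdown conversion.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Tags whose contents are treated as boilerplate in this sketch.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class MainTextExtractor(HTMLParser):
    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.skip_depth = 0        # > 0 while inside a boilerplate tag
        self.text_parts = []       # visible text outside boilerplate
        self.internal_links = []   # absolute same-domain URLs to crawl next

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Rewrite the relative URL to an absolute one.
                absolute = urljoin(self.base_url, href)
                if urlparse(absolute).netloc == urlparse(self.base_url).netloc:
                    self.internal_links.append(absolute)

    def handle_endtag(self, tag):
        # Assumes well-formed HTML; unclosed boilerplate tags would leave
        # skip_depth stuck above zero.
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


html = """
<html><body>
<nav><a href="/pricing">Pricing</a></nav>
<main><h1>Getting Started</h1><p>Install the <a href="install">package</a>.</p></main>
<script>track()</script>
<footer>Copyright <a href="https://other.example.org/">Partner</a></footer>
</body></html>
"""
parser = MainTextExtractor("https://docs.example.com/guide/")
parser.feed(html)
text = " ".join(parser.text_parts)
print(text)
print(parser.internal_links)
```

Links are collected from the whole page, including navigation, because navigation is often where a crawler discovers other pages, while text is taken only from outside the boilerplate tags.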

🧰 Tips & best practices

Set maxPagesToCrawl to 0 to capture an entire site for a knowledge base.
Keep saveText and saveMarkdown on for maximum flexibility downstream; turn on saveHtml if you need raw HTML.
Use the wordCount field to filter out thin pages before embedding.
Lower maxConcurrency if a site rate-limits you.
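
As a small example of the `wordCount` tip, the sketch below drops thin pages from a list of dataset rows before embedding. The function name `drop_thin_pages`, the 50-word threshold, and the sample rows are assumptions chosen for illustration; pick a threshold that suits your content.

```python
def drop_thin_pages(rows, min_words=50):
    """Keep only rows whose main content has at least min_words words."""
    # Rows without a wordCount field are treated as empty and dropped.
    return [row for row in rows if row.get("wordCount", 0) >= min_words]


# Hypothetical dataset rows, reduced to the fields used here.
rows = [
    {"url": "https://docs.example.com/getting-started", "wordCount": 812},
    {"url": "https://docs.example.com/tag/misc", "wordCount": 12},
]
kept = drop_thin_pages(rows)
print([row["url"] for row in kept])
```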

❓ FAQ

Does it render JavaScript? No. It parses server-rendered HTML, which keeps runs fast and cheap and works for most websites and documentation sites. Pages that load their content client-side with JavaScript may come back empty or incomplete.

Is the Markdown clean enough for RAG? Yes — navigation, headers, footers, ads and scripts are stripped, and links/images are absolute, so the output is ready to chunk and embed.

How do I crawl the whole site? Set maxPagesToCrawl to 0.

Can I crawl multiple sites at once? Yes — add several Start URLs.

What formats can I export? JSON, CSV, Excel and HTML, plus programmatic access through the REST API.

Can I convert a website to Markdown without an API or login?

Yes. Paste a URL and the crawler converts every page to clean Markdown. No website API, login or headless browser is required.

Is this an HTML to Markdown crawler for RAG?

Yes. It strips nav, headers, footers, ads and scripts, then converts the main content from HTML to Markdown so the output is ready to chunk and embed for RAG pipelines.

How do I export website text to CSV or JSON?

Run the crawl, then export the dataset as JSON, CSV or Excel, or fetch it through the REST API, for example to use the website text as LLM training data.

Sitemap to URL Crawler — extract every URL from a sitemap.xml to feed this crawler.
Website SEO Audit Crawler — on-page SEO audit for every page.
Website Image & Media Crawler — extract all images and media for multimodal datasets.
JSON-LD Schema & Meta Tag Extractor — structured data and meta tags from any page.

📝 Changelog

2026-06-07

Docs: added coverage for converting a website to Markdown without an API or login, HTML to Markdown for RAG, and exporting website text to CSV/JSON.

2026-06-05

🛡️ Reliability fix: results are no longer dropped by strict output validation — runs now complete cleanly even at high volume (thousands of results).
⚡ Stability & performance hardening; fresh rebuild.

2026-06-04

Verified live & refreshed build — reliability/maintenance pass.
