Pricing

from $4.00 / 1,000 results

Website to Markdown Crawler for LLM & RAG

Crawl any website to clean Markdown and plain text for LLM training and RAG. HTML to Markdown, no API or login. Export website text to CSV or JSON.

Developer

Logiover

Website Text & Markdown Crawler — Clean Content for AI, RAG & LLM (No API)

Turn any website into clean Markdown and plain text for AI. This website content crawler starts from a single URL, follows internal links across the whole domain, and strips away navigation, headers, footers, sidebars, ads and scripts — exporting the boilerplate-free main content of every page as Markdown and plain text. You get title, metaDescription, h1, lang, canonical, wordCount, text and markdown per page — ready to feed straight into LLM training sets, RAG pipelines, embeddings, vector databases and AI agents. Fast, pure HTTP, no headless browser, no login, no API key.

🏆 Why this website-to-Markdown crawler?

11 fields per page · thousands of pages per run · pure HTTP + main-content extraction (no browser) · relative links rewritten to absolute · export to JSON / CSV / Excel. The unofficial "scrape a website for an LLM" API alternative for RAG data, embeddings and content migration.

✨ What this Actor does / Key features

🕷️ Full-site crawl — start from one URL and automatically follow internal links across the entire domain; no sitemap or URL list required.
📝 Clean Markdown + plain text — returns main content only, with nav, header, footer, sidebar, ads and scripts removed, so output is ready to chunk and embed.
🔗 Absolute links & images — relative URLs are rewritten to absolute, so the Markdown is portable and every link/image still resolves outside the site.
🧠 Built for AI / RAG / LLM — chunk-ready text and markdown for embeddings, fine-tuning and retrieval-augmented generation.
🏷️ Rich page metadata — title, metaDescription, h1, lang, canonical and wordCount on every row for easy filtering and deduplication.
🎚️ Per-format toggles — turn saveMarkdown, saveText and saveHtml on or off independently to keep datasets lean.
⚡ Fast & cheap — pure HTTP with configurable maxConcurrency; no headless browser, so runs stay quick and inexpensive at scale.
🌐 Multi-site in one run — pass several Start URLs to crawl many sites into a single dataset.
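
The absolute-link rewriting mentioned in the feature list can be illustrated with a small sketch. This is not the Actor's own code; it assumes simple inline Markdown links and images, and the function name `absolutize_markdown_links` is hypothetical.

```python
import re
from urllib.parse import urljoin

# Matches inline Markdown links and images: [text](target) or ![alt](target).
# Simplified on purpose: no link titles, nested brackets or reference-style links.
_LINK_RE = re.compile(r"(!?\[[^\]]*\]\()([^)\s]+)(\))")


def absolutize_markdown_links(markdown: str, page_url: str) -> str:
    """Rewrite relative link and image targets against the page URL."""

    def _fix(match):
        prefix, target, suffix = match.groups()
        # urljoin resolves relative targets and leaves absolute URLs untouched.
        return f"{prefix}{urljoin(page_url, target)}{suffix}"

    return _LINK_RE.sub(_fix, markdown)


sample = "See [setup](../install) and ![logo](/img/logo.png)."
print(absolutize_markdown_links(sample, "https://docs.example.com/guide/intro"))
```

Resolving against the page URL rather than the site root is what keeps `../` style targets correct for pages nested several levels deep.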

🚀 Quick start (3 steps)

1. Configure: paste one or more website URLs into Start URLs, then (optionally) set Max pages to crawl (0 = whole site) and toggle Save Markdown / Save plain text / Save HTML.
2. Run: click Start. The crawler discovers internal links and streams one clean row per page into your dataset.
3. Get your data: open the Output tab and export to JSON, CSV, Excel or HTML, or pull it via the Apify API straight into your AI pipeline.
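
For the API route in the last step, a minimal sketch of building a request for a run's dataset items might look like the following. The dataset ID and token are placeholders, the helper name `dataset_items_request` is hypothetical, and the request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

API_BASE = "https://api.apify.com/v2"


def dataset_items_request(dataset_id: str, token: str, fmt: str = "json") -> Request:
    """Build (but do not send) a request for the items of one dataset."""
    query = urlencode({"format": fmt})
    url = f"{API_BASE}/datasets/{dataset_id}/items?{query}"
    # The API token is passed as a bearer token in the Authorization header.
    return Request(url, headers={"Authorization": f"Bearer {token}"})


req = dataset_items_request("YOUR_DATASET_ID", "YOUR_APIFY_TOKEN", fmt="csv")
print(req.full_url)
# To download for real: urllib.request.urlopen(req).read()
```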

📥 Input

Give the Actor at least one Start URL. Everything else is optional.

Example — crawl an entire documentation site for RAG

{
  "startUrls": [{ "url": "https://docs.apify.com" }],
  "maxPagesToCrawl": 0,
  "saveMarkdown": true,
  "saveText": true,
  "saveHtml": false,
  "maxConcurrency": 10
}

Example — quick sample of a blog (first 50 pages, Markdown only)

{
  "startUrls": [{ "url": "https://blog.example.com" }],
  "maxPagesToCrawl": 50,
  "saveMarkdown": true,
  "saveText": false
}

Example — crawl several sites into one knowledge-base dataset

{
  "startUrls": [
    { "url": "https://help.example.com" },
    { "url": "https://docs.example.io" }
  ],
  "maxPagesToCrawl": 2000,
  "saveMarkdown": true,
  "saveText": true,
  "saveHtml": true,
  "maxConcurrency": 5
}

| Field | Type | Description |
| --- | --- | --- |
| `startUrls` | array | Websites to crawl. The crawler follows internal links from each start URL and extracts every page. Required. |
| `maxPagesToCrawl` | integer | Maximum pages per run. `0` = no limit (crawl the whole site). Default `1000`. |
| `saveMarkdown` | boolean | Include the page's main content converted to Markdown. Default `true`. |
| `saveText` | boolean | Include the page's main content as plain text. Default `true`. |
| `saveHtml` | boolean | Include the cleaned main-content HTML. Default `false`. |
| `maxConcurrency` | integer | Number of parallel requests. Lower it if the target site rate-limits you. Default `10`. |

Tip: For a complete knowledge base, set maxPagesToCrawl to 0 to capture the whole site. Keep saveText and saveMarkdown on for maximum downstream flexibility, and turn on saveHtml only when you need the raw main-content HTML. Lower maxConcurrency if a site starts rate-limiting.
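
If you generate inputs from code, a small helper along these lines can apply the documented defaults and catch obvious mistakes before a run. It is a sketch: the field names and defaults come from the input reference above, while the function name `build_run_input` and its validation rules are assumptions.

```python
import json


def build_run_input(start_urls, max_pages=1000, save_markdown=True,
                    save_text=True, save_html=False, max_concurrency=10):
    """Assemble an input object using the documented field names and defaults."""
    if not start_urls:
        raise ValueError("startUrls needs at least one URL")
    if max_pages < 0:
        raise ValueError("maxPagesToCrawl must be 0 (no limit) or a positive integer")
    if max_concurrency < 1:
        raise ValueError("maxConcurrency must be at least 1")
    return {
        "startUrls": [{"url": url} for url in start_urls],
        "maxPagesToCrawl": max_pages,
        "saveMarkdown": save_markdown,
        "saveText": save_text,
        "saveHtml": save_html,
        "maxConcurrency": max_concurrency,
    }


# Whole-site crawl of a docs site, everything else left at its default.
print(json.dumps(build_run_input(["https://docs.apify.com"], max_pages=0), indent=2))
```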

📤 Output

One row per crawled page — up to 11 fields — exportable to JSON, CSV, Excel or HTML. Here is a trimmed sample record:

{
  "url": "https://docs.example.com/getting-started",
  "title": "Getting Started",
  "metaDescription": "Set up the SDK in 5 minutes.",
  "h1": "Getting Started",
  "lang": "en",
  "canonical": "https://docs.example.com/getting-started",
  "wordCount": 812,
  "text": "Getting Started Install the package with npm install example-sdk. Then import the client and call connect()...",
  "markdown": "# Getting Started\n\nInstall the package with `npm install example-sdk`. Then import the client and call `connect()`...",
  "html": "<main><h1>Getting Started</h1><p>Install the package...</p></main>",
  "crawledAt": "2026-07-06T14:13:00.000Z"
}
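
When consuming rows downstream, it can help to map them onto a small typed record. The sketch below assumes that a content field may be absent when its format toggle is off (the README does not say whether such fields are omitted or empty), so it reads them defensively. `PageRecord` and `parse_record` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageRecord:
    url: str
    title: str
    word_count: int
    markdown: Optional[str] = None
    text: Optional[str] = None
    html: Optional[str] = None

    def best_content(self) -> str:
        """Prefer Markdown, then plain text, then HTML; empty string if none."""
        return self.markdown or self.text or self.html or ""


def parse_record(row: dict) -> PageRecord:
    """Map one dataset row to a PageRecord, tolerating absent optional fields."""
    return PageRecord(
        url=row["url"],
        title=row.get("title") or "",
        word_count=int(row.get("wordCount") or 0),
        markdown=row.get("markdown"),
        text=row.get("text"),
        html=row.get("html"),
    )


row = {"url": "https://docs.example.com/getting-started", "title": "Getting Started",
       "wordCount": 812, "text": "Getting Started Install the package..."}
record = parse_record(row)
print(record.word_count, record.best_content()[:15])
```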

💡 Use cases

RAG & knowledge bases — turn docs, blogs and help centers into clean Markdown chunks for retrieval-augmented generation.
LLM fine-tuning datasets — collect high-quality text at scale from any set of websites.
AI agents & chatbots — feed your agent fresh, structured website content it can reason over.
Semantic search & embeddings — generate clean text to embed into a vector database (Pinecone, Weaviate, pgvector, Qdrant…).
Content migration & archiving — export an entire website to Markdown for a rebuild or offline archive.
Content audits — use wordCount, title and metaDescription to spot thin, duplicate or missing-metadata pages before embedding.
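
As an illustration of the content-audit use case, the following sketch flags thin, duplicate and missing-metadata pages from exported rows. The 200-word threshold and the choice of `canonical` as the duplicate key are assumptions for the example, not behaviour of the Actor.

```python
def audit_pages(rows, min_words=200):
    """Group page URLs by audit finding: thin, duplicate, missing metadata."""
    seen_keys = set()
    report = {"thin": [], "duplicate": [], "missing_metadata": []}
    for row in rows:
        url = row["url"]
        # Pages sharing a canonical URL are treated as duplicates of the first one seen.
        key = row.get("canonical") or url
        if key in seen_keys:
            report["duplicate"].append(url)
        else:
            seen_keys.add(key)
        if (row.get("wordCount") or 0) < min_words:
            report["thin"].append(url)
        if not row.get("title") or not row.get("metaDescription"):
            report["missing_metadata"].append(url)
    return report


rows = [
    {"url": "https://x.test/a", "canonical": "https://x.test/a",
     "wordCount": 500, "title": "A", "metaDescription": "About A"},
    {"url": "https://x.test/a?ref=1", "canonical": "https://x.test/a",
     "wordCount": 500, "title": "A", "metaDescription": "About A"},
    {"url": "https://x.test/b", "wordCount": 20, "title": "B", "metaDescription": ""},
]
print(audit_pages(rows))
```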

👥 Who uses it

AI/ML engineers building RAG & LLM datasets · data teams & analysts · chatbot and AI-agent developers · technical writers & docs teams · SEO and content teams · agencies running content migrations.

💰 Pricing

This Actor runs on a simple pay-per-result model — you pay for the pages you extract, with no separate Apify platform fees to calculate. Try it on the free tier first, then scale up. See the Pricing tab on this page for the current rate.

❓ Frequently Asked Questions

Does it render JavaScript? No. It parses server-rendered HTML, which keeps runs fast and cheap and covers most websites and documentation sites. Pages that build their content client-side with JavaScript may return little or no main content.

Is the Markdown clean enough for RAG? Yes — navigation, headers, footers, sidebars, ads and scripts are stripped, and links/images are rewritten to absolute URLs, so the output is ready to chunk and embed.
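
To show what "ready to chunk" can mean in practice, here is one possible chunking sketch: split the `markdown` field at headings, then pack sections into chunks up to a size limit. The 1000-character limit and the function name `chunk_markdown` are assumptions; a real pipeline would likely size chunks by tokens instead.

```python
import re
from typing import List


def chunk_markdown(markdown: str, max_chars: int = 1000) -> List[str]:
    """Split Markdown at headings, then pack sections into chunks of about max_chars."""
    # Split just before every ATX heading so each heading stays with its body.
    sections = [s.strip() for s in re.split(r"(?m)^(?=#{1,6} )", markdown) if s.strip()]
    chunks, current = [], ""
    for section in sections:
        # Start a new chunk when adding this section would exceed the limit.
        if current and len(current) + len(section) + 2 > max_chars:
            chunks.append(current)
            current = ""
        current = f"{current}\n\n{section}" if current else section
    if current:
        chunks.append(current)
    # Note: a single section longer than max_chars is kept whole, not split further.
    return chunks


doc = "# A\n\nalpha\n\n## B\n\nbeta\n\n## C\n\ngamma"
for chunk in chunk_markdown(doc, max_chars=20):
    print(repr(chunk))
```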

How do I crawl the whole site? Set maxPagesToCrawl to 0 and the crawler will follow internal links until it has captured every reachable page on the domain.
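
Conceptually, following internal links until every reachable page is captured is a breadth-first traversal restricted to one site. The sketch below illustrates the idea on an in-memory link map instead of real HTTP requests. The Actor's actual crawl order and its definition of "internal" (for example, how subdomains are treated) are not documented here, so the same-host rule is an assumption.

```python
from collections import deque
from urllib.parse import urlparse


def crawl_order(start_url, links_by_page, max_pages=0):
    """Return pages in breadth-first order, staying on the start URL's host.

    links_by_page stands in for real fetches: it maps a URL to the absolute
    links found on that page. max_pages=0 means no limit.
    """
    host = urlparse(start_url).netloc
    seen, queue, visited = {start_url}, deque([start_url]), []
    while queue and (max_pages == 0 or len(visited) < max_pages):
        url = queue.popleft()
        visited.append(url)
        for link in links_by_page.get(url, []):
            # Skip external hosts and anything already queued or visited.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


site = {
    "https://docs.example.com/": ["https://docs.example.com/a",
                                  "https://other.org/x",
                                  "https://docs.example.com/b"],
    "https://docs.example.com/a": ["https://docs.example.com/",
                                   "https://docs.example.com/c"],
}
print(crawl_order("https://docs.example.com/", site))
```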

Can I crawl multiple sites at once? Yes — add several Start URLs and they are all crawled into the same dataset.

How much data can I get? You can extract thousands of pages per run. Raise maxPagesToCrawl (or set it to 0) to cover large sites; lower maxConcurrency if a site rate-limits you.

Can I convert a website to Markdown without an API or login?

Yes. Paste a URL and the crawler converts every page to clean Markdown. No website API, login, API key or headless browser is required; you only need an Apify account.

Is this an HTML-to-Markdown crawler for RAG?

Yes. It strips nav, headers, footers, ads and scripts, then converts the main content from HTML to Markdown so the output is ready to chunk and embed for RAG pipelines.

How do I export website text to CSV or JSON?

Run the crawl, then export the dataset as JSON, CSV, Excel or HTML from the Apify console — or pull it via the REST API — to scrape website text for LLM training data at scale.

How do I convert a documentation site to Markdown for RAG?

Paste the docs start URL, keep saveMarkdown on and set maxPagesToCrawl to 0 (or high enough to cover the site), then export clean per-page Markdown ready to chunk and embed.

Does it extract only the main content of each page?

Yes. It isolates the main content (<main> / <article> / body) and removes nav, headers, footers, sidebars, ads and scripts, keeping only the boilerplate-free content as Markdown and plain text.
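
The tag-stripping half of that idea can be sketched with Python's standard `html.parser`. This is an illustration only, not the Actor's extraction logic: the list of skipped tags is an assumption, and the sketch does not try to pick `<main>` or `<article>` first.

```python
from html.parser import HTMLParser

# Tags treated as boilerplate in this sketch (an assumption, not the Actor's list).
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collect text that appears outside boilerplate elements."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a boilerplate element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_main_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


page = ("<body><nav>Home</nav><main><h1>Title</h1><p>Body text.</p></main>"
        "<script>var x = 1;</script><footer>Contact</footer></body>")
print(extract_main_text(page))
```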

Is it legal to crawl a website with this Actor?

The Actor only collects publicly available web-page content. You are responsible for respecting each site's terms of service, robots.txt and applicable laws (including GDPR) when you crawl and reuse the data.

🔗 More web-content & crawler tools by logiover

Building a full content or lead pipeline? Pair this crawler with the rest of the web-data suite:

| Actor | What it does |
| --- | --- |
| URL to Markdown | Convert a single URL to clean Markdown |
| Sitemap to URL Crawler | Extract every URL from a sitemap.xml to feed this crawler |
| Website Image & Media Extractor | Pull all images and media for multimodal datasets |
| Website Link Graph Crawler | Map internal/external link structure across a site |
| Website SEO Audit Crawler | On-page SEO audit for every page |
| Website Tech Stack Detector | Detect the technologies a website runs on |
| Website Contact Scraper | Emails, phones and socials from any site |
| JSON-LD Schema & Meta Tag Extractor | Structured data and meta tags from any page |
| Broken Link Checker | Find broken internal & external links |
| Website Change Monitor | Track content changes on any page over time |
| Wayback Machine URL Extractor | Pull historical URLs from the Wayback Machine |

👉 Browse all logiover scrapers on Apify Store — 180+ actors across real estate, jobs, crypto, social media & B2B data.

⏰ Scheduling & integration

Schedule this Actor on Apify to re-crawl a site daily or weekly and keep your knowledge base fresh. Export results to JSON, CSV, Excel or HTML, sync to Google Sheets, or push straight to your vector database, data warehouse, BI tools and webhooks through the Apify API. Connect it to Make, n8n or Zapier to build automated RAG and content pipelines that refresh embeddings whenever pages change.
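
One way to refresh embeddings only for pages that changed between scheduled crawls is to compare content fingerprints. The sketch below assumes you keep the previous run's rows (or just their hashes) somewhere; the function names are hypothetical and the comparison uses the `markdown` field, falling back to `text`.

```python
import hashlib


def fingerprint(row: dict) -> str:
    """Hash the page content so two crawls can be compared cheaply."""
    content = row.get("markdown") or row.get("text") or ""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def pages_to_reembed(previous_rows, current_rows):
    """Return URLs that are new or whose content changed since the previous crawl."""
    old = {row["url"]: fingerprint(row) for row in previous_rows}
    return [row["url"] for row in current_rows
            if old.get(row["url"]) != fingerprint(row)]


last_week = [{"url": "https://x.test/a", "markdown": "# A\n\nold"},
             {"url": "https://x.test/b", "markdown": "# B"}]
today = [{"url": "https://x.test/a", "markdown": "# A\n\nnew"},
         {"url": "https://x.test/b", "markdown": "# B"},
         {"url": "https://x.test/c", "markdown": "# C"}]
print(pages_to_reembed(last_week, today))
```

Pages that disappeared from the site are not reported by this sketch; handling deletions would need a separate comparison of the two URL sets.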

⭐ Support & feedback

Found a bug or need an extra field? Open an issue on the Issues tab — response is usually fast. If this Actor saves you time, a ★★★★★ review on the Store page genuinely helps and is hugely appreciated. 🙏

⚖️ Legal

This Actor extracts only publicly available web-page content and is intended for legitimate research, AI/ML and content-migration use. You are responsible for complying with each target site's terms of service, robots.txt, copyright and any applicable laws such as GDPR.

📝 Changelog

2026-07-06

✨ README overhaul: richer per-page output sample, ready-to-run example scenarios (docs/blog/multi-site), web-content & crawler cross-promo links, expanded FAQ and clearer quick-start.

2026-07-01

Maintenance pass: re-verified end-to-end on live data and confirmed successful runs within the 5-minute quality window on the default input.
Sharpened Store metadata (SEO title & description) and expanded the FAQ with high-intent, long-tail questions for easier discovery in Google and Apify Store search.
Added ready-to-run example tasks that cover common real-world use cases.

2026-06-15

Reliability pass: re-verified end-to-end on live data with real-world inputs. Routine maintenance build.

2026-06-07

Docs: added coverage for converting a website to Markdown without an API or login, HTML to Markdown for RAG, and exporting website text to CSV/JSON.

2026-06-05

🛡️ Reliability fix: results are no longer dropped by strict output validation — runs now complete cleanly even at high volume (thousands of results).
⚡ Stability & performance hardening; fresh rebuild.

2026-06-04

Verified live & refreshed build — reliability/maintenance pass.
