Website to Markdown Crawler for LLM & RAG

Pricing: from $4.00 / 1,000 results

Crawl any website to clean Markdown and plain text for LLM training and RAG. HTML to Markdown, no API or login. Export website text to CSV or JSON.

Rating: 0.0 (0)

Developer: Logiover (maintained by Community)

Actor stats

  • Bookmarked: 0
  • Total users: 5
  • Monthly active users: 2
  • Last modified: 7 days ago

Website to Markdown & Text Crawler: AI, RAG & LLM Data 📄

Turn any website into clean Markdown and plain text for AI. The crawler walks an entire site, strips navigation, headers, footers, ads and scripts, and exports the main content of every page as Markdown and plain text, ready to feed into LLM training sets, RAG pipelines, embeddings, vector databases and AI agents.

Give it one URL and it discovers and extracts every page it can reach through internal links. No login, no headless browser, one clean row per page.

Looking to scrape a website for an LLM, convert HTML to Markdown, build RAG data, or extract text from a website at scale? That's exactly what this actor does.


✨ Key features

  • 🕷️ Full-site crawl: start from one URL and follow internal links across the whole domain.
  • 📝 Clean Markdown + plain text: main content only, with nav/header/footer/sidebar/scripts removed.
  • 🔗 Absolute links & images: relative URLs are rewritten to absolute, so the Markdown is portable.
  • 🧠 Built for AI / RAG / LLM: chunk-ready output for embeddings, fine-tuning and retrieval.
  • 🏷️ Rich page metadata: title, meta description, H1, language, canonical and word count.
  • ⚡ Fast & cheap: pure HTTP, no browser, high concurrency.
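As a rough illustration of what "chunk-ready" can mean downstream, the sketch below splits a page's markdown field at headings and caps each chunk by word count. The function name chunk_markdown and the max_words limit are illustrative assumptions, not part of the actor.

```python
import re


def chunk_markdown(markdown: str, max_words: int = 200) -> list[str]:
    """Split Markdown at ATX headings, then cap each chunk at max_words words.

    Hypothetical helper for post-processing the `markdown` field.
    """
    # Zero-width split before every heading line, so each heading stays with its body.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks: list[str] = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        words = section.split()
        if len(words) <= max_words:
            # Short sections are kept verbatim, including line breaks.
            chunks.append(section)
        else:
            # Long sections are cut into consecutive word windows (formatting is flattened).
            for start in range(0, len(words), max_words):
                chunks.append(" ".join(words[start:start + max_words]))
    return chunks


doc = "# Getting Started\n\nInstall the package.\n\n## Usage\n\nRun it."
for chunk in chunk_markdown(doc):
    print(repr(chunk))
```

A heading-based split is only one option. Lines starting with "#" inside fenced code blocks would also be treated as headings here, so a production chunker would need to track fences.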

💡 Use cases

  • RAG & knowledge bases: turn docs, blogs and help centers into clean Markdown chunks for retrieval-augmented generation.
  • LLM fine-tuning datasets: collect high-quality text at scale from any set of websites.
  • AI agents & chatbots: feed your agent fresh, structured website content.
  • Content migration & archiving: export an entire website to Markdown.
  • Semantic search & embeddings: generate clean text to embed into a vector database (Pinecone, Weaviate, pgvector, …).

📦 What you get

One row per crawled page:

| Field | Description |
| --- | --- |
| url | Page URL |
| title | Page title |
| metaDescription | Meta description |
| h1 | First H1 heading |
| lang | Page language |
| canonical | Canonical URL |
| wordCount | Word count of the main content |
| text | Clean main-content text (boilerplate removed) |
| markdown | The same content converted to Markdown |
| html | Cleaned main-content HTML (optional) |
| crawledAt | ISO 8601 timestamp |

Example output

{
  "url": "https://docs.example.com/getting-started",
  "title": "Getting Started",
  "metaDescription": "Set up the SDK in 5 minutes.",
  "h1": "Getting Started",
  "wordCount": 812,
  "text": "Getting Started Install the package...",
  "markdown": "# Getting Started\n\nInstall the package...",
  "crawledAt": "2026-05-25T14:13:00.000Z"
}
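A row like the one above can be loaded into a typed structure before further processing. The following is a minimal sketch: the PageRecord class and parse_record function are hypothetical helpers, and treating every field except url as optional is an assumption based on the example row omitting lang, canonical and html.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class PageRecord:
    """One dataset row; attribute names mirror the output fields (hypothetical helper)."""
    url: str
    title: Optional[str] = None
    metaDescription: Optional[str] = None
    h1: Optional[str] = None
    lang: Optional[str] = None
    canonical: Optional[str] = None
    wordCount: int = 0
    text: Optional[str] = None
    markdown: Optional[str] = None
    html: Optional[str] = None
    crawledAt: Optional[datetime] = None


def parse_record(item: dict) -> PageRecord:
    """Build a PageRecord from a raw row, ignoring keys that are not in the table."""
    known = {k: v for k, v in item.items() if k in PageRecord.__dataclass_fields__}
    timestamp = known.get("crawledAt")
    if timestamp:
        # Replace the trailing "Z" so older Python versions can parse the ISO 8601 value.
        known["crawledAt"] = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    return PageRecord(**known)


raw = """{
  "url": "https://docs.example.com/getting-started",
  "title": "Getting Started",
  "wordCount": 812,
  "crawledAt": "2026-05-25T14:13:00.000Z"
}"""
record = parse_record(json.loads(raw))
print(record.title, record.wordCount, record.crawledAt.isoformat())
```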

🚀 How to use it

  1. Click Try for free / Start.
  2. Paste one or more website URLs into Start URLs.
  3. (Optional) Set Max pages to crawl; use 0 to crawl the whole site.
  4. (Optional) Toggle Save Markdown, Save plain text, Save HTML.
  5. Click Save & Start.
  6. Export your dataset as JSON, CSV, Excel or via API, or pull it straight into your AI pipeline.

⚙️ Input

| Option | Description | Default |
| --- | --- | --- |
| startUrls | Websites to crawl | – (required) |
| maxPagesToCrawl | Max pages per run (0 = whole site) | 1000 |
| saveMarkdown | Include Markdown output | true |
| saveText | Include plain-text output | true |
| saveHtml | Include cleaned main-content HTML | false |
| maxConcurrency | Parallel requests | 10 |

Example input

{
  "startUrls": [{ "url": "https://docs.apify.com" }],
  "maxPagesToCrawl": 2000,
  "saveMarkdown": true,
  "saveText": true
}
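If you generate the input programmatically, a small builder can apply the documented defaults and catch typos in option names before a run starts. This is a sketch only: build_input is a hypothetical helper, and the range checks (non-negative page limit, concurrency of at least 1) are assumptions rather than rules stated by the actor.

```python
import json

# Defaults copied from the input table.
DEFAULTS = {
    "maxPagesToCrawl": 1000,
    "saveMarkdown": True,
    "saveText": True,
    "saveHtml": False,
    "maxConcurrency": 10,
}


def build_input(start_urls: list[str], **overrides) -> dict:
    """Return an input object with defaults filled in (hypothetical helper)."""
    if not start_urls:
        raise ValueError("startUrls is required")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    run_input = {
        "startUrls": [{"url": url} for url in start_urls],
        **DEFAULTS,
        **overrides,
    }
    # Assumed sanity checks; 0 is allowed and means "whole site".
    if run_input["maxPagesToCrawl"] < 0:
        raise ValueError("maxPagesToCrawl must be 0 or greater")
    if run_input["maxConcurrency"] < 1:
        raise ValueError("maxConcurrency must be at least 1")
    return run_input


run_input = build_input(["https://docs.apify.com"], maxPagesToCrawl=2000)
print(json.dumps(run_input, indent=2))
```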

🔍 How it works

The crawler follows internal links within the same domain as your Start URLs. For each page it removes scripts, styles, navigation, headers, footers and sidebars, isolates the main content (<main> / <article> / <body>), rewrites relative links and images to absolute URLs, and exports the result as clean text and Markdown. It uses plain HTTP requests with no headless browser, which keeps runs fast and cheap.
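The cleaning steps described above can be approximated with the Python standard library. This is not the actor's implementation, only a simplified sketch of the same idea: the tag list in SKIP_TAGS (including treating <aside> as the sidebar and skipping <head>), the class MainTextExtractor and the function extract_main_text are all assumptions, and the sketch skips boilerplate tags rather than isolating <main> or <article>, and does not convert to Markdown.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed set of boilerplate containers; the real actor may use different rules.
SKIP_TAGS = {"head", "script", "style", "nav", "header", "footer", "aside"}


class MainTextExtractor(HTMLParser):
    """Collect visible text and absolute links outside boilerplate tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.skip_depth = 0  # > 0 while inside any skipped container
        self.parts: list[str] = []
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative URLs against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_main_text(html: str, base_url: str) -> tuple[str, list[str]]:
    parser = MainTextExtractor(base_url)
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts), parser.links


page = (
    "<html><head><title>T</title><script>var x = 1;</script></head><body>"
    '<nav><a href="/home">Home</a></nav>'
    '<main><h1>Getting Started</h1><p>Install the <a href="install">package</a> first</p></main>'
    "<footer>Example footer</footer></body></html>"
)
text, links = extract_main_text(page, "https://docs.example.com/guide/")
print(text)
print(links)
```

Depth counting assumes well-formed markup with explicit closing tags; pages that omit them would need a more tolerant parser.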

🧰 Tips & best practices

  • Set maxPagesToCrawl to 0 to capture an entire site for a knowledge base.
  • Keep saveText and saveMarkdown on for maximum flexibility downstream; turn on saveHtml if you need raw HTML.
  • Use the wordCount field to filter out thin pages before embedding.
  • Lower maxConcurrency if a site rate-limits you.
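The wordCount tip can be applied with a short filter before embedding. This is a sketch: filter_thin_pages is a hypothetical helper and the 50-word threshold is an arbitrary assumption to tune for your content.

```python
def filter_thin_pages(items: list[dict], min_words: int = 50) -> list[dict]:
    """Keep only rows whose wordCount meets the threshold (hypothetical helper)."""
    # Rows without a wordCount are treated as empty and dropped.
    return [item for item in items if item.get("wordCount", 0) >= min_words]


rows = [
    {"url": "https://docs.example.com/getting-started", "wordCount": 812},
    {"url": "https://docs.example.com/tags", "wordCount": 12},
    {"url": "https://docs.example.com/empty"},
]
kept = filter_thin_pages(rows)
print([row["url"] for row in kept])
```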

❓ FAQ

Does it render JavaScript? No β€” it parses server-rendered HTML, which keeps runs fast and cheap and works for the large majority of websites and documentation sites.

Is the Markdown clean enough for RAG? Yes β€” navigation, headers, footers, ads and scripts are stripped, and links/images are absolute, so the output is ready to chunk and embed.

How do I crawl the whole site? Set maxPagesToCrawl to 0.

Can I crawl multiple sites at once? Yes. Add several Start URLs.

What formats can I export? JSON, CSV, Excel, HTML and a full REST API.

Can I convert a website to Markdown without an API or login?

Yes. Paste a URL and the crawler converts every page to clean Markdown, with no website API, no login and no headless browser required.

Is this an HTML to Markdown crawler for RAG?

Yes. It strips nav, headers, footers, ads and scripts, then converts the main content from HTML to Markdown so the output is ready to chunk and embed for RAG pipelines.

How do I export website text to CSV or JSON?

Run the crawl, then export the dataset as JSON, CSV or Excel, or fetch it through the REST API.
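For the API route, the export format is chosen with a query parameter. The sketch below only builds the download URL and makes no request. It assumes the Apify API v2 dataset items endpoint with its format, fields and clean parameters; DATASET_ID is a placeholder for your run's dataset ID, and dataset_export_url is a hypothetical helper. Private datasets also need an API token, which is omitted here.

```python
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"


def dataset_export_url(dataset_id: str, fmt: str = "json", fields: tuple[str, ...] = ()) -> str:
    """Build a dataset download URL for the given format (hypothetical helper)."""
    params = {"format": fmt}
    if fields:
        # Restrict the export to selected columns, e.g. for a compact CSV.
        params["fields"] = ",".join(fields)
    params["clean"] = "true"  # skip empty items
    return f"{API_BASE}/datasets/{dataset_id}/items?{urlencode(params)}"


csv_url = dataset_export_url("DATASET_ID", fmt="csv", fields=("url", "title", "wordCount"))
json_url = dataset_export_url("DATASET_ID")
print(csv_url)
print(json_url)
```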

Related actors

  • Sitemap to URL Crawler: extract every URL from a sitemap.xml to feed this crawler.
  • Website SEO Audit Crawler: on-page SEO audit for every page.
  • Website Image & Media Crawler: extract all images and media for multimodal datasets.
  • JSON-LD Schema & Meta Tag Extractor: structured data and meta tags from any page.

📝 Changelog

2026-06-07

  • Docs: added coverage for converting a website to Markdown without an API or login, HTML to Markdown for RAG, and exporting website text to CSV/JSON.

2026-06-05

  • 🛡️ Reliability fix: results are no longer dropped by strict output validation, so runs now complete cleanly even at high volume (thousands of results).
  • ⚡ Stability & performance hardening; fresh rebuild.

2026-06-04

  • Verified live & refreshed build: reliability/maintenance pass.