Website to Markdown Crawler for LLM & RAG
Pricing
from $4.00 / 1,000 results
Crawl any website to clean Markdown and plain text for LLM training and RAG. HTML to Markdown, no API or login. Export website text to CSV or JSON.
Rating: 0.0 (0)
Developer: Logiover
Maintained by Community
Actor stats
- Bookmarked: 0
- Total users: 5
- Monthly active users: 2
- Last modified: 7 days ago
Website to Markdown & Text Crawler: AI, RAG & LLM Data
Turn any website into clean Markdown and plain text for AI. This website content crawler crawls an entire site, strips away navigation, headers, footers, ads and scripts, and exports the boilerplate-free main content of every page as Markdown and plain text, ready to feed straight into LLM training sets, RAG pipelines, embeddings, vector databases and AI agents.
Give it one URL and it discovers and extracts every page it can reach through internal links, up to the Max pages limit. No login, no headless browser, one clean row per page.
Looking to scrape a website for an LLM, convert HTML to Markdown, build RAG data, or extract text from a website at scale? That's exactly what this actor does.
Key features
- Full-site crawl: start from one URL and follow internal links across the whole domain.
- Clean Markdown + plain text: main content only, with nav/header/footer/sidebar/scripts removed.
- Absolute links & images: relative URLs are rewritten to absolute, so the Markdown is portable.
- Built for AI / RAG / LLM: chunk-ready output for embeddings, fine-tuning and retrieval.
- Rich page metadata: title, meta description, H1, language, canonical and word count.
- Fast & cheap: pure HTTP, no browser, high concurrency.
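To make "follow internal links across the whole domain" concrete, here is a minimal Python sketch of how a crawler might decide which links to queue. It is not the actor's actual code. The function name `internal_links` is hypothetical, and treating "same domain" as an exact host match is an assumption; the real crawler may handle subdomains differently.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def internal_links(page_url: str, hrefs: list[str]) -> list[str]:
    """Resolve hrefs against page_url and keep only same-host http(s) links."""
    host = urlparse(page_url).netloc.lower()
    seen: set[str] = set()
    result: list[str] = []
    for href in hrefs:
        # Resolve relative links, then drop the #fragment so anchors on the
        # same page do not count as separate URLs.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, tel:, javascript: and similar
        if parts.netloc.lower() != host:
            continue  # assumption: "same domain" means an exact host match
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


links = internal_links(
    "https://docs.example.com/guide/",
    ["intro", "/api#auth", "https://other.example.org/", "mailto:hi@example.com", "intro"],
)
print(links)
```

The external link, the `mailto:` link and the duplicate are dropped, leaving two absolute URLs on the starting host.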
Use cases
- RAG & knowledge bases: turn docs, blogs and help centers into clean Markdown chunks for retrieval-augmented generation.
- LLM fine-tuning datasets: collect high-quality text at scale from any set of websites.
- AI agents & chatbots: feed your agent fresh, structured website content.
- Content migration & archiving: export an entire website to Markdown.
- Semantic search & embeddings: generate clean text to embed into a vector database (Pinecone, Weaviate, pgvector, …).
What you get
One row per crawled page:
| Field | Description |
|---|---|
| url | Page URL |
| title | Page title |
| metaDescription | Meta description |
| h1 | First H1 heading |
| lang | Page language |
| canonical | Canonical URL |
| wordCount | Word count of the main content |
| text | Clean main-content text (boilerplate removed) |
| markdown | The same content converted to Markdown |
| html | Cleaned main-content HTML (optional) |
| crawledAt | ISO 8601 timestamp |
Example output
{"url": "https://docs.example.com/getting-started","title": "Getting Started","metaDescription": "Set up the SDK in 5 minutes.","h1": "Getting Started","wordCount": 812,"text": "Getting Started Install the package...","markdown": "# Getting Started\n\nInstall the package...","crawledAt": "2026-05-25T14:13:00.000Z"}
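If you consume rows in Python, a small typed record can make missing fields explicit. The sketch below is one possible shape. The class `PageRow` and helper `parse_row` are hypothetical names, and the choice of which fields default to empty or `None` is an assumption, based only on the fact that `text`, `markdown` and `html` can be switched off and that the example row omits `lang` and `canonical`.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class PageRow:
    """One dataset row; field names follow the output table."""
    url: str
    title: str = ""
    metaDescription: str = ""
    h1: str = ""
    lang: Optional[str] = None
    canonical: Optional[str] = None
    wordCount: int = 0
    text: Optional[str] = None       # absent when saveText is off
    markdown: Optional[str] = None   # absent when saveMarkdown is off
    html: Optional[str] = None       # absent unless saveHtml is on
    crawledAt: str = ""


def parse_row(raw_json: str) -> PageRow:
    """Parse one JSON row, ignoring keys that PageRow does not declare."""
    data = json.loads(raw_json)
    known = {f.name for f in fields(PageRow)}
    return PageRow(**{key: value for key, value in data.items() if key in known})


row = parse_row(
    '{"url": "https://docs.example.com/getting-started", '
    '"title": "Getting Started", "wordCount": 812, '
    '"markdown": "# Getting Started\\n\\nInstall the package..."}'
)
print(row.title, row.wordCount, row.html)
```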
How to use it
- Click Try for free / Start.
- Paste one or more website URLs into Start URLs.
- (Optional) Set Max pages to crawl; use 0 to crawl the whole site.
- (Optional) Toggle Save Markdown, Save plain text, Save HTML.
- Click Save & Start.
- Export your dataset as JSON, CSV, Excel or via API, or pull it straight into your AI pipeline.
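Once the rows are in your pipeline, a common next step for RAG is to split each page's Markdown into chunks. The following is a minimal sketch of heading-based chunking, not something the actor does for you. The function `chunk_markdown`, the `max_chars` limit and the `chunkIndex` key are all hypothetical, and the inline `rows` list stands in for an exported dataset.

```python
import re


def chunk_markdown(markdown: str, max_chars: int = 1000) -> list[str]:
    """Split Markdown at headings, then pack sections into chunks up to max_chars."""
    # Split before every ATX heading line ("# ", "## ", ...), keeping each
    # heading together with the section that follows it.
    sections = [s.strip() for s in re.split(r"(?m)^(?=#{1,6} )", markdown) if s.strip()]
    chunks: list[str] = []
    current = ""
    for section in sections:
        candidate = f"{current}\n\n{section}" if current else section
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = section
        else:
            current = candidate
    if current:
        chunks.append(current)
    # Note: a single section longer than max_chars is kept whole in this sketch.
    return chunks


# Stand-in for rows loaded from an exported JSON dataset.
rows = [{"url": "https://docs.example.com/a", "markdown": "# A\n\nalpha\n\n## B\n\nbeta\n\n## C\n\ngamma"}]

records = [
    {"url": row["url"], "chunkIndex": index, "text": chunk}
    for row in rows
    for index, chunk in enumerate(chunk_markdown(row["markdown"], max_chars=25))
]
for record in records:
    print(record)
```

Keeping the source `url` on every chunk lets a retrieval step cite the page a passage came from.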
Input
| Option | Description | Default |
|---|---|---|
| startUrls | Websites to crawl | none (required) |
| maxPagesToCrawl | Max pages per run (0 = whole site) | 1000 |
| saveMarkdown | Include Markdown output | true |
| saveText | Include plain-text output | true |
| saveHtml | Include cleaned main-content HTML | false |
| maxConcurrency | Parallel requests | 10 |
Example input
{"startUrls": [{ "url": "https://docs.apify.com" }],"maxPagesToCrawl": 2000,"saveMarkdown": true,"saveText": true}
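If you generate the input programmatically, a small helper can apply the defaults from the table and catch obvious mistakes before a run. This is a sketch only. The function `build_input` is hypothetical, and its validation rules (non-empty URLs, non-negative page limit, concurrency of at least 1) are assumptions rather than documented behavior of the actor.

```python
import json


def build_input(
    start_urls: list[str],
    max_pages: int = 1000,
    save_markdown: bool = True,
    save_text: bool = True,
    save_html: bool = False,
    max_concurrency: int = 10,
) -> dict:
    """Build an input object using the option names and defaults from the table."""
    if not start_urls:
        raise ValueError("startUrls is required")
    if max_pages < 0:
        raise ValueError("maxPagesToCrawl must be 0 (whole site) or a positive number")
    if max_concurrency < 1:
        raise ValueError("maxConcurrency must be at least 1")
    return {
        "startUrls": [{"url": url} for url in start_urls],
        "maxPagesToCrawl": max_pages,
        "saveMarkdown": save_markdown,
        "saveText": save_text,
        "saveHtml": save_html,
        "maxConcurrency": max_concurrency,
    }


run_input = build_input(["https://docs.apify.com"], max_pages=2000)
print(json.dumps(run_input, indent=2))
```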
How it works
The crawler follows internal links within the same domain as your Start URLs. For each page it removes scripts, styles, navigation, headers, footers and sidebars, isolates the main content (<main> / <article> / body), rewrites relative links and images to absolute URLs, and exports the result as clean text and Markdown. It's pure HTTP: fast and cheap, with no headless browser.
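The cleaning step described above can be approximated with Python's standard library. The sketch below is an illustration under assumptions, not the actor's implementation: the class and function names are hypothetical, the `<main>`, then `<article>`, then `<body>` preference order is assumed from the description, and it produces plain text plus absolute links rather than Markdown.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BOILERPLATE = {"script", "style", "nav", "header", "footer", "aside"}
BLOCK = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


class MainContentParser(HTMLParser):
    """Collect text and absolute links inside one container tag, skipping boilerplate."""

    def __init__(self, container: str, base_url: str) -> None:
        super().__init__()
        self.container = container
        self.base_url = base_url
        self.inside = 0    # nesting depth inside the container tag
        self.skipping = 0  # nesting depth inside boilerplate tags
        self.parts: list[str] = []
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE:
            self.skipping += 1
        elif tag == self.container:
            self.inside += 1
        elif tag == "a" and self.inside and not self.skipping:
            href = dict(attrs).get("href")
            if href:
                # Rewrite relative links to absolute URLs.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in BOILERPLATE:
            self.skipping = max(0, self.skipping - 1)
        elif tag == self.container:
            self.inside = max(0, self.inside - 1)
        elif tag in BLOCK and self.inside and not self.skipping:
            self.parts.append("\n")  # keep adjacent blocks from fusing

    def handle_data(self, data):
        if self.inside and not self.skipping:
            self.parts.append(data)


def extract_main(html: str, base_url: str) -> tuple[str, list[str]]:
    """Return (whitespace-normalized main text, absolute links found in it)."""
    lowered = html.lower()
    # Assumed preference order: <main>, then <article>, then <body>.
    container = next((t for t in ("main", "article") if f"<{t}" in lowered), "body")
    parser = MainContentParser(container, base_url)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split()), parser.links


sample = """<html><body>
<nav><a href="/">Home</a></nav>
<main><h1>Getting Started</h1>
<p>Install the <a href="../sdk">SDK</a>.</p>
<script>track()</script></main>
<footer>Copyright</footer>
</body></html>"""

text, links = extract_main(sample, "https://docs.example.com/guide/start")
print(text)
print(links)
```

The depth counters assume reasonably well-formed HTML; a production extractor would need to be more tolerant of unclosed tags.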
Tips & best practices
- Set maxPagesToCrawl to 0 to capture an entire site for a knowledge base.
- Keep saveText and saveMarkdown on for maximum flexibility downstream; turn on saveHtml if you also need the cleaned main-content HTML.
- Use the wordCount field to filter out thin pages before embedding.
- Lower maxConcurrency if a site rate-limits you.
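The wordCount tip can be applied in a line or two once the rows are loaded. In this sketch the function name `drop_thin_pages` and the threshold of 100 words are hypothetical; pick a cutoff that suits your own content.

```python
def drop_thin_pages(rows: list[dict], min_words: int = 100) -> list[dict]:
    """Keep rows whose wordCount meets the threshold; missing counts are treated as 0."""
    return [row for row in rows if int(row.get("wordCount") or 0) >= min_words]


pages = [
    {"url": "https://docs.example.com/getting-started", "wordCount": 812},
    {"url": "https://docs.example.com/tags", "wordCount": 14},
    {"url": "https://docs.example.com/empty"},  # no wordCount field at all
]
kept = drop_thin_pages(pages)
print([page["url"] for page in kept])
```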
FAQ
Does it render JavaScript? No. It parses server-rendered HTML, which keeps runs fast and cheap and suits most websites and documentation sites. Pages that build their content in the browser with JavaScript may come back empty or incomplete.
Is the Markdown clean enough for RAG? Yes. Navigation, headers, footers, ads and scripts are stripped, and links/images are absolute, so the output is ready to chunk and embed.
How do I crawl the whole site? Set maxPagesToCrawl to 0.
Can I crawl multiple sites at once? Yes. Add several Start URLs.
What formats can I export? JSON, CSV, Excel and HTML, or pull results through the REST API.
Can I convert a website to Markdown without an API or login?
Yes. Paste a URL and the crawler converts every page to clean Markdown, with no website API, login or headless browser required.
Is this an HTML to Markdown crawler for RAG?
Yes. It strips nav, headers, footers, ads and scripts, then converts the main content from HTML to Markdown so the output is ready to chunk and embed for RAG pipelines.
How do I export website text to CSV or JSON?
Run the crawl, then export the dataset as JSON, CSV or Excel, or fetch it through the REST API, to use the website text as LLM training data.
Related actors by the same author
- Sitemap to URL Crawler: extract every URL from a sitemap.xml to feed this crawler.
- Website SEO Audit Crawler: on-page SEO audit for every page.
- Website Image & Media Crawler: extract all images and media for multimodal datasets.
- JSON-LD Schema & Meta Tag Extractor: structured data and meta tags from any page.
Changelog
2026-06-07
- Docs: added coverage for converting a website to Markdown without an API or login, HTML to Markdown for RAG, and exporting website text to CSV/JSON.
2026-06-05
- Reliability fix: results are no longer dropped by strict output validation; runs now complete cleanly even at high volume (thousands of results).
- Stability & performance hardening; fresh rebuild.
2026-06-04
- Verified live & refreshed build; reliability/maintenance pass.