Website to Markdown & Text Crawler: AI / RAG Data
Pricing
from $4.00 / 1,000 results
Crawl an entire website and extract clean, boilerplate-free main content as Markdown and plain text, ready for LLM training, RAG pipelines, embeddings and AI agents. No login, no browser, one row per page.
Developer
Logiover
Maintained by Community
Website to Markdown & Text Crawler: AI, RAG & LLM Data
Turn any website into clean Markdown and plain text for AI. This website content crawler crawls an entire site, strips away navigation, headers, footers, ads and scripts, and exports the boilerplate-free main content of every page as Markdown and plain text, ready to feed straight into LLM training sets, RAG pipelines, embeddings, vector databases and AI agents.
Give it one URL and it discovers and extracts every page automatically. No login, no headless browser, one clean row per page.
Looking to scrape a website for an LLM, convert HTML to Markdown, build RAG data, or extract text from a website at scale? That's exactly what this actor does.
Key features
- Full-site crawl: start from one URL and follow internal links across the whole domain.
- Clean Markdown + plain text: main content only, with nav/header/footer/sidebar/scripts removed.
- Absolute links & images: relative URLs are rewritten to absolute, so the Markdown is portable.
- Built for AI / RAG / LLM: chunk-ready output for embeddings, fine-tuning and retrieval.
- Rich page metadata: title, meta description, H1, language, canonical and word count.
- Fast & cheap: pure HTTP, no browser, high concurrency.
Use cases
- RAG & knowledge bases: turn docs, blogs and help centers into clean Markdown chunks for retrieval-augmented generation.
- LLM fine-tuning datasets: collect high-quality text at scale from any set of websites.
- AI agents & chatbots: feed your agent fresh, structured website content.
- Content migration & archiving: export an entire website to Markdown.
- Semantic search & embeddings: generate clean text to embed into a vector database (Pinecone, Weaviate, pgvector, ...).
What you get
One row per crawled page:
| Field | Description |
|---|---|
| `url` | Page URL |
| `title` | Page title |
| `metaDescription` | Meta description |
| `h1` | First H1 heading |
| `lang` | Page language |
| `canonical` | Canonical URL |
| `wordCount` | Word count of the main content |
| `text` | Clean main-content text (boilerplate removed) |
| `markdown` | The same content converted to Markdown |
| `html` | Cleaned main-content HTML (optional) |
| `crawledAt` | ISO 8601 timestamp |
Example output
{"url": "https://docs.example.com/getting-started","title": "Getting Started","metaDescription": "Set up the SDK in 5 minutes.","h1": "Getting Started","wordCount": 812,"text": "Getting Started Install the package...","markdown": "# Getting Started\n\nInstall the package...","crawledAt": "2026-05-25T14:13:00.000Z"}
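Since the output is described as chunk-ready, here is a minimal sketch of one way to split a row's `markdown` field into chunks before embedding. The helper `chunk_markdown` and its `max_words` parameter are hypothetical and not part of the Actor. The sketch assumes paragraphs are separated by blank lines, and it only splits on those boundaries, so a single oversized paragraph is kept whole.

```python
def chunk_markdown(markdown: str, max_words: int = 200) -> list[str]:
    """Group blank-line-separated blocks into chunks of roughly max_words words.

    Hypothetical helper for illustration; not part of the Actor.
    """
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for block in markdown.split("\n\n"):
        if not block.strip():
            continue  # skip empty blocks left by runs of blank lines
        words = len(block.split())
        # Flush the running chunk when adding this block would exceed the budget.
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(block)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# A row shaped like the example output above (only the fields used here).
row = {
    "url": "https://docs.example.com/getting-started",
    "markdown": "# Getting Started\n\nInstall the package...",
}

# Keep the source URL next to each chunk so retrieved text can be attributed.
records = [{"url": row["url"], "chunk": c} for c in chunk_markdown(row["markdown"])]
print(records)
```

Splitting on headings or by token count are equally reasonable choices; the right chunk size depends on the embedding model you use.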
How to use it
- Click Try for free / Start.
- Paste one or more website URLs into Start URLs.
- (Optional) Set Max pages to crawl; use `0` to crawl the whole site.
- (Optional) Toggle Save Markdown, Save plain text, Save HTML.
- Click Save & Start.
- Export your dataset as JSON, CSV, Excel or via API, or pull it straight into your AI pipeline.
Input
| Option | Description | Default |
|---|---|---|
| `startUrls` | Websites to crawl | (required) |
| `maxPagesToCrawl` | Max pages per run (0 = whole site) | 1000 |
| `saveMarkdown` | Include Markdown output | true |
| `saveText` | Include plain-text output | true |
| `saveHtml` | Include cleaned main-content HTML | false |
| `maxConcurrency` | Parallel requests | 10 |
Example input
{"startUrls": [{ "url": "https://docs.apify.com" }],"maxPagesToCrawl": 2000,"saveMarkdown": true,"saveText": true}
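If you assemble the input programmatically, a small builder can apply the defaults from the Input table and catch obvious mistakes before a run. This is only a sketch: `build_run_input` is a hypothetical helper, and its validation rules (non-empty URL list, non-negative page limit, at least one parallel request) are assumptions rather than checks documented for the Actor.

```python
import json


def build_run_input(
    urls: list[str],
    max_pages: int = 1000,      # default from the Input table; 0 = whole site
    save_markdown: bool = True,
    save_text: bool = True,
    save_html: bool = False,
    max_concurrency: int = 10,
) -> dict:
    """Build an input object using the option names from the Input table.

    Hypothetical helper; the validation below is an assumption.
    """
    if not urls:
        raise ValueError("startUrls is required: pass at least one URL")
    if max_pages < 0:
        raise ValueError("maxPagesToCrawl must be 0 (whole site) or positive")
    if max_concurrency < 1:
        raise ValueError("maxConcurrency must be at least 1")
    return {
        # Same shape as the example input: a list of {"url": ...} objects.
        "startUrls": [{"url": u} for u in urls],
        "maxPagesToCrawl": max_pages,
        "saveMarkdown": save_markdown,
        "saveText": save_text,
        "saveHtml": save_html,
        "maxConcurrency": max_concurrency,
    }


run_input = build_run_input(["https://docs.apify.com"], max_pages=2000)
print(json.dumps(run_input, indent=2))
```

The resulting JSON can be pasted into the Actor's input editor or sent with whatever API client you already use.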
How it works
The crawler follows internal links within the same domain as your Start URLs. For each page it removes scripts, styles, navigation, headers, footers and sidebars, isolates the main content (`<main>` / `<article>` / `<body>`), rewrites relative links and images to absolute URLs, and exports the result as clean text and Markdown. It uses plain HTTP requests, which keeps it fast and cheap, with no headless browser.
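To make the link handling concrete, here is a rough sketch of how relative links can be resolved to absolute URLs and filtered to the start domain using only Python's standard library. It is not the Actor's source code. The function names are hypothetical, and treating "same domain" as an exact hostname match (so subdomains count as external) is an assumption, since the Actor's exact rule is not stated here.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def to_absolute(page_url: str, href: str) -> str:
    """Resolve a possibly relative href against the page it was found on."""
    return urljoin(page_url, href)


def is_internal(start_url: str, url: str) -> bool:
    """Assumption: internal means the hostname matches the start URL exactly."""
    return urlparse(url).hostname == urlparse(start_url).hostname


def links_to_follow(start_url: str, page_url: str, hrefs: list[str]) -> list[str]:
    """Return unique, absolute, internal links in the order they were found."""
    seen: set[str] = set()
    result: list[str] = []
    for href in hrefs:
        # Drop the #fragment so /pricing and /pricing#plans count as one page.
        absolute, _fragment = urldefrag(to_absolute(page_url, href))
        if is_internal(start_url, absolute) and absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


found = links_to_follow(
    "https://docs.example.com",
    "https://docs.example.com/guide/intro",
    ["../api/", "/pricing#plans", "/pricing", "https://other.example.org/x", "mailto:hi@example.com"],
)
print(found)
# ['https://docs.example.com/api/', 'https://docs.example.com/pricing']
```

The external link and the `mailto:` link are dropped because their hostname does not match the start URL.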
Tips & best practices
- Set `maxPagesToCrawl` to `0` to capture an entire site for a knowledge base.
- Keep `saveText` and `saveMarkdown` on for maximum flexibility downstream; turn on `saveHtml` if you also need the cleaned main-content HTML.
- Use the `wordCount` field to filter out thin pages before embedding.
- Lower `maxConcurrency` if a site rate-limits you.
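The `wordCount` tip can be applied with a one-line filter over the exported rows. In this sketch the helper `drop_thin_pages`, the 50-word threshold and the second sample row are all illustrative assumptions; pick a cutoff that suits your content.

```python
def drop_thin_pages(rows: list[dict], min_words: int = 50) -> list[dict]:
    """Keep rows whose main content has at least min_words words.

    Hypothetical helper; rows without a wordCount are treated as empty.
    """
    return [row for row in rows if row.get("wordCount", 0) >= min_words]


rows = [
    {"url": "https://docs.example.com/getting-started", "wordCount": 812},
    {"url": "https://docs.example.com/tags", "wordCount": 12},  # made-up thin page
]

kept = drop_thin_pages(rows)
print([row["url"] for row in kept])
# ['https://docs.example.com/getting-started']
```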
FAQ
Does it render JavaScript? No. It parses server-rendered HTML, which keeps runs fast and cheap and works for the large majority of websites and documentation sites.
Is the Markdown clean enough for RAG? Yes. Navigation, headers, footers, ads and scripts are stripped, and links/images are absolute, so the output is ready to chunk and embed.
How do I crawl the whole site? Set `maxPagesToCrawl` to `0`.
Can I crawl multiple sites at once? Yes, add several Start URLs.
What formats can I export? JSON, CSV, Excel, HTML and a full REST API.
Related actors by the same author
- Sitemap to URL Crawler: extract every URL from a sitemap.xml to feed this crawler.
- Website SEO Audit Crawler: on-page SEO audit for every page.
- Website Image & Media Crawler: extract all images and media for multimodal datasets.
- JSON-LD Schema & Meta Tag Extractor: structured data and meta tags from any page.
Changelog
- 2026-05-25: Maintenance and reliability pass. Pulled the latest source and rebuilt the Actor on the current base image; build verified.
Last reviewed: 2026-05-25.