Website to Markdown & Text Crawler - AI / RAG Data

Pricing

from $4.00 / 1,000 results

Crawl an entire website and extract clean, boilerplate-free main content as Markdown and plain text, ready for LLM training, RAG pipelines, embeddings and AI agents. No login, no browser, one row per page.

Developer

Logiover

Maintained by Community

Website to Markdown & Text Crawler - AI, RAG & LLM Data 📄

Turn any website into clean Markdown and plain text for AI. This crawler walks an entire site, strips away navigation, headers, footers, ads and scripts, and exports the boilerplate-free main content of every page as Markdown and plain text, ready to feed straight into LLM training sets, RAG pipelines, embeddings, vector databases and AI agents.

Give it one URL and it discovers and extracts every page automatically. No login, no headless browser, one clean row per page.

Looking to scrape a website for an LLM, convert HTML to Markdown, build RAG data, or extract text from a website at scale? That's exactly what this actor does.


✨ Key features

  • 🕷️ Full-site crawl: start from one URL and follow internal links across the whole domain.
  • 📝 Clean Markdown + plain text: main content only, with nav/header/footer/sidebar/scripts removed.
  • 🔗 Absolute links & images: relative URLs are rewritten to absolute, so the Markdown is portable.
  • 🧠 Built for AI / RAG / LLM: chunk-ready output for embeddings, fine-tuning and retrieval.
  • 🏷️ Rich page metadata: title, meta description, H1, language, canonical and word count.
  • ⚡ Fast & cheap: pure HTTP, no browser, high concurrency.

💡 Use cases

  • RAG & knowledge bases: turn docs, blogs and help centers into clean Markdown chunks for retrieval-augmented generation.
  • LLM fine-tuning datasets: collect high-quality text at scale from any set of websites.
  • AI agents & chatbots: feed your agent fresh, structured website content.
  • Content migration & archiving: export an entire website to Markdown.
  • Semantic search & embeddings: generate clean text to embed into a vector database (Pinecone, Weaviate, pgvector, …).

📦 What you get

One row per crawled page:

| Field | Description |
|---|---|
| url | Page URL |
| title | Page title |
| metaDescription | Meta description |
| h1 | First H1 heading |
| lang | Page language |
| canonical | Canonical URL |
| wordCount | Word count of the main content |
| text | Clean main-content text (boilerplate removed) |
| markdown | The same content converted to Markdown |
| html | Cleaned main-content HTML (optional) |
| crawledAt | ISO 8601 timestamp |

Example output

{
  "url": "https://docs.example.com/getting-started",
  "title": "Getting Started",
  "metaDescription": "Set up the SDK in 5 minutes.",
  "h1": "Getting Started",
  "wordCount": 812,
  "text": "Getting Started Install the package...",
  "markdown": "# Getting Started\n\nInstall the package...",
  "crawledAt": "2026-05-25T14:13:00.000Z"
}
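For the content migration use case, rows shaped like the example above can be written out as one Markdown file per page. The following is a minimal sketch, not part of the Actor: the helper names `row_to_relative_path` and `write_rows`, and the host/path file naming scheme, are assumptions made for illustration.

```python
from pathlib import Path
from urllib.parse import urlparse


def row_to_relative_path(url: str) -> Path:
    """Map a page URL to a relative .md path (hypothetical naming scheme)."""
    parsed = urlparse(url)
    # Drop empty, "." and ".." segments so the path cannot escape the output folder.
    segments = [s for s in parsed.path.split("/") if s and s not in (".", "..")]
    if not segments:
        segments = ["index"]  # the site root becomes index.md
    segments[-1] += ".md"
    return Path(parsed.netloc, *segments)


def write_rows(rows, out_dir):
    """Write each row's `markdown` field to disk and return the written paths."""
    written = []
    for row in rows:
        markdown = row.get("markdown")
        if not markdown:
            continue  # the field is absent when saveMarkdown was turned off
        target = Path(out_dir) / row_to_relative_path(row["url"])
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(markdown, encoding="utf-8")
        written.append(target)
    return written
```

With this scheme, the example row would land in `docs.example.com/getting-started.md` under the chosen output folder.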

🚀 How to use it

  1. Click Try for free / Start.
  2. Paste one or more website URLs into Start URLs.
  3. (Optional) Set Max pages to crawl; use 0 to crawl the whole site.
  4. (Optional) Toggle Save Markdown, Save plain text, Save HTML.
  5. Click Save & Start.
  6. Export your dataset as JSON, CSV, Excel or via API, or pull it straight into your AI pipeline.

⚙️ Input

| Option | Description | Default |
|---|---|---|
| startUrls | Websites to crawl | (required) |
| maxPagesToCrawl | Max pages per run (0 = whole site) | 1000 |
| saveMarkdown | Include Markdown output | true |
| saveText | Include plain-text output | true |
| saveHtml | Include cleaned main-content HTML | false |
| maxConcurrency | Parallel requests | 10 |

Example input

{
  "startUrls": [{ "url": "https://docs.apify.com" }],
  "maxPagesToCrawl": 2000,
  "saveMarkdown": true,
  "saveText": true
}
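If you start runs from a script rather than the form, it can help to build the input object in code and fill in the defaults from the table above. This is a small sketch only: `build_run_input` is a hypothetical helper, and its validation rules (non-empty URL list, non-negative page limit, concurrency of at least 1) are assumptions rather than documented behavior of the Actor.

```python
import json


def build_run_input(start_urls, max_pages=1000, save_markdown=True,
                    save_text=True, save_html=False, max_concurrency=10):
    """Build an input dict using the option names and defaults from the Input table."""
    if not start_urls:
        raise ValueError("startUrls is required")
    if max_pages < 0:
        raise ValueError("maxPagesToCrawl must be 0 (whole site) or a positive number")
    if max_concurrency < 1:
        raise ValueError("maxConcurrency must be at least 1")  # assumed lower bound
    return {
        "startUrls": [{"url": url} for url in start_urls],
        "maxPagesToCrawl": max_pages,
        "saveMarkdown": save_markdown,
        "saveText": save_text,
        "saveHtml": save_html,
        "maxConcurrency": max_concurrency,
    }


if __name__ == "__main__":
    # Reproduces the example input above, plus the default values.
    print(json.dumps(build_run_input(["https://docs.apify.com"], max_pages=2000), indent=2))
```

The resulting dict can be pasted into the JSON input editor or passed as the run input when starting the Actor through the API.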

🔍 How it works

The crawler follows internal links within the same domain as your Start URLs. For each page it removes scripts, styles, navigation, headers, footers and sidebars, isolates the main content (<main> / <article> / body), rewrites relative links and images to absolute URLs, and exports the result as clean text and Markdown. It's pure HTTP: fast and cheap, with no headless browser.
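Two of the steps described above, skipping boilerplate elements and resolving links to absolute same-domain URLs, can be illustrated with Python's standard library. This is a simplified sketch and not the Actor's actual code: the tag list in `BOILERPLATE_TAGS`, the `PageSketch` class and the `extract` function are assumptions, and the sketch does not isolate <main> or <article>, nor does it produce Markdown.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed tag list; the Actor's real selectors are not documented in this section.
BOILERPLATE_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class PageSketch(HTMLParser):
    """Collect visible text outside boilerplate tags, plus every link as an absolute URL."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.skip_depth = 0  # > 0 while inside a boilerplate element
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.skip_depth += 1
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative hrefs against the page URL.
                self.links.append(urljoin(self.page_url, href))

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract(page_url, html):
    """Return (text, internal_links) for one server-rendered HTML page."""
    parser = PageSketch(page_url)
    parser.feed(html)
    parser.close()
    host = urlparse(page_url).netloc
    internal = []
    for link in parser.links:
        link = link.split("#", 1)[0]  # ignore fragments when deduplicating
        if urlparse(link).netloc == host and link not in internal:
            internal.append(link)
    return " ".join(parser.text_parts), internal
```

Links are collected from the whole page, including navigation, because they drive page discovery; only the text is filtered. Real-world HTML with unclosed tags would need a more forgiving parser than this sketch.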

🧰 Tips & best practices

  • Set maxPagesToCrawl to 0 to capture an entire site for a knowledge base.
  • Keep saveText and saveMarkdown on for maximum flexibility downstream; turn on saveHtml if you also need the cleaned main-content HTML.
  • Use the wordCount field to filter out thin pages before embedding.
  • Lower maxConcurrency if a site rate-limits you.
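The wordCount tip can be applied right before chunking. Below is a rough sketch under stated assumptions: the 50-word threshold, the 200-word chunk size and the helper names `drop_thin_pages` and `chunk_markdown` are illustrative choices, not recommendations from the Actor.

```python
def drop_thin_pages(rows, min_words=50):
    """Keep rows whose wordCount meets the threshold; rows without the field are dropped."""
    return [row for row in rows if row.get("wordCount", 0) >= min_words]


def chunk_markdown(markdown, max_words=200):
    """Greedily pack blank-line-separated paragraphs into chunks of up to max_words words.

    A single paragraph longer than max_words is kept whole as its own chunk.
    """
    chunks, current, count = [], [], 0
    for paragraph in markdown.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        words = len(paragraph.split())
        # Close the current chunk if adding this paragraph would exceed the limit.
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(paragraph)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting on blank lines keeps headings and paragraphs intact, which tends to matter more for retrieval quality than hitting an exact chunk size.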

โ“ FAQ

Does it render JavaScript? No โ€” it parses server-rendered HTML, which keeps runs fast and cheap and works for the large majority of websites and documentation sites.

Is the Markdown clean enough for RAG? Yes โ€” navigation, headers, footers, ads and scripts are stripped, and links/images are absolute, so the output is ready to chunk and embed.

How do I crawl the whole site? Set maxPagesToCrawl to 0.

Can I crawl multiple sites at once? Yes โ€” add several Start URLs.

What formats can I export? JSON, CSV, Excel, HTML and a full REST API.

  • Sitemap to URL Crawler: extract every URL from a sitemap.xml to feed this crawler.
  • Website SEO Audit Crawler: on-page SEO audit for every page.
  • Website Image & Media Crawler: extract all images and media for multimodal datasets.
  • JSON-LD Schema & Meta Tag Extractor: structured data and meta tags from any page.

Changelog

  • 2026-05-25: Maintenance & reliability pass: pulled the latest source and rebuilt the Actor on the current base image; build verified.

Last reviewed: 2026-05-25.