Pricing

from $0.70 / 1,000 pages crawled

Website to Markdown Crawler - Full-Site Text for LLMs & RAG

Crawl any website from a start URL and get every page as clean text + Markdown for LLMs, RAG, and AI agents. Follows internal links with depth and page limits, strips nav and ads, and returns one structured record per page. A fast, no-config site-to-Markdown crawler.

Developer

AIDevs

Last modified

19 days ago

🕸️ Website to Markdown Crawler

Crawl any website from a single start URL and get every page back as clean text + Markdown — ready for LLMs, RAG pipelines, and AI agents. No config, no selectors, no headless-browser setup. Give it a URL, it follows the internal links, strips the navigation and ads, and returns one structured record per page.

This is the bulk, whole-site companion to the single-page AI Web Page Reader. Point it at a docs site, a blog, a knowledge base, or a marketing site and turn the entire thing into LLM-ready Markdown in one run.

🤔 What can Website to Markdown Crawler do?

🔗 Follow internal links automatically — breadth-first crawl from your start URL, with depth and page limits you control.
🧹 Return clean content — removes nav bars, headers, footers, cookie banners, scripts, and ads, keeping the real page text.
📝 Output Markdown + plain text — headings, lists, links, and emphasis preserved as Markdown; plain text for simple ingestion.
🏠 Stay on-domain — same-domain scoping by default so you don't wander off into the wider web.
🗂️ One record per page — title, description, word count, links found, depth, content, and Markdown for every page.
⚙️ Run with zero configuration — sensible defaults; the only required field is the start URL.
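The link-following behavior described above can be illustrated with a minimal breadth-first sketch. This is not the Actor's actual implementation: `crawl_order`, `get_links`, and the toy `SITE` graph are hypothetical names introduced here, and "same domain" is assumed to mean an exact hostname match.

```python
from collections import deque
from urllib.parse import urlparse


def crawl_order(start_url, get_links, max_pages=25, max_depth=3, same_domain_only=True):
    """Return [(url, depth), ...] in breadth-first order.

    get_links is a hypothetical callable: url -> list of absolute URLs on that page.
    """
    start_host = urlparse(start_url).hostname
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # depth limit reached: record the page but do not follow its links
        for link in get_links(url):
            if link in seen:
                continue
            # Assumption: "same domain" means the hostnames are identical.
            if same_domain_only and urlparse(link).hostname != start_host:
                continue
            seen.add(link)
            queue.append((link, depth + 1))
    return visited


# Toy link graph standing in for real HTTP fetches (illustrative data only).
SITE = {
    "https://example.com/": [
        "https://example.com/a",
        "https://example.com/b",
        "https://other.org/x",
    ],
    "https://example.com/a": ["https://example.com/c"],
    "https://example.com/b": [],
    "https://example.com/c": ["https://example.com/d"],
}

pages = crawl_order("https://example.com/", lambda u: SITE.get(u, []), max_pages=10, max_depth=2)
for url, depth in pages:
    print(depth, url)
```

In this toy run the off-domain link is skipped and the page at depth 2 is recorded, but its own links are not followed.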

📊 What data do I get?

For every page crawled, you get one dataset record:

| Field | Description |
| --- | --- |
| `url` | The page URL that was crawled. |
| `depth` | How many links deep from the start URL (start = 0). |
| `title` | Page title. |
| `description` | Meta description, if present. |
| `wordCount` | Word count of the extracted text. |
| `content` | Clean plain text of the main content. |
| `markdown` | LLM-ready Markdown version (when enabled). |
| `linksFound` | Number of links discovered on the page. |
| `crawledAt` | ISO timestamp of when the page was crawled. |
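For typed downstream code, the fields above could be mapped onto a small dataclass. This is a sketch only: `PageRecord` and `parse_record` are hypothetical names, and it assumes that `description` and `markdown` may be missing or empty (the listing does not say whether they are omitted or null).

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class PageRecord:
    url: str
    depth: int
    title: str
    description: Optional[str]  # meta description, if the page has one
    word_count: int
    content: str
    markdown: Optional[str]     # assumed absent/empty when includeMarkdown is false
    links_found: int
    crawled_at: str             # ISO timestamp, kept as a string here


def parse_record(item: dict[str, Any]) -> PageRecord:
    """Hypothetical helper: convert one dataset item into a PageRecord."""
    return PageRecord(
        url=item["url"],
        depth=item["depth"],
        title=item.get("title", ""),
        description=item.get("description") or None,
        word_count=item.get("wordCount", 0),
        content=item.get("content", ""),
        markdown=item.get("markdown") or None,
        links_found=item.get("linksFound", 0),
        crawled_at=item["crawledAt"],
    )


record = parse_record({
    "url": "https://docs.example.com/intro",
    "depth": 1,
    "title": "Intro",
    "wordCount": 120,
    "content": "Intro\n\nWelcome.",
    "linksFound": 5,
    "crawledAt": "2026-06-30T09:42:11.004Z",
})
print(record.url, record.word_count, record.markdown)
```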

💰 How much will it cost?

This Actor uses pay-per-event pricing — you only pay for what you crawl, and platform/compute usage is included (no surprise infrastructure bill):

Page crawled — $1.00 per 1,000 pages (the primary event). One charge per page successfully returned to the dataset.
Actor start — $0.00005 (a negligible per-run fee).
Platform usage / compute — included.

A 25-page docs site costs about $0.025; a 1,000-page site costs about $1.00. Because you set maxPages, your spend is capped by your own limit. Unlike compute-metered crawlers, the per-page price is flat, so you know the maximum cost before you run.
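The arithmetic above can be checked with a short sketch. `estimate_max_cost` is a hypothetical helper; the rates are the ones listed in this section and are passed as parameters so a different rate can be substituted.

```python
PRICE_PER_1000_PAGES = 1.00   # "Page crawled" event, as listed above
ACTOR_START_FEE = 0.00005     # per-run fee, as listed above


def estimate_max_cost(max_pages: int,
                      price_per_1000: float = PRICE_PER_1000_PAGES,
                      start_fee: float = ACTOR_START_FEE) -> float:
    """Upper bound on the cost of one run, in USD, for a given maxPages."""
    return max_pages * price_per_1000 / 1000 + start_fee


for n in (25, 1000):
    print(f"{n} pages: at most ${estimate_max_cost(n):.5f}")
```

Since only pages successfully returned to the dataset are charged, the actual cost of a run can be lower than this bound.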

🚀 How do I use Website to Markdown Crawler?

1. Create a free Apify account (new accounts get free monthly usage credits).
2. Open the Actor and paste a website into Start URL (e.g. your docs or blog homepage).
3. Optionally set Max pages, Max depth, and whether to stay on the same domain.
4. Click Start and watch pages stream into the dataset.
5. Export the results as JSON, CSV, or Excel, or pull them via the API.

That's it — no proxies, browsers, or selectors to configure.

⬇️ Input

Configure the crawl from the Console Input tab or via the API. The only required field is startUrl.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `startUrl` | string | Yes | - | The page to start crawling from. |
| `maxPages` | integer | No | `25` | Maximum number of pages to crawl. |
| `maxDepth` | integer | No | `3` | How many links deep to follow from the start URL. |
| `sameDomainOnly` | boolean | No | `true` | Only follow links on the start URL's domain. |
| `includeMarkdown` | boolean | No | `true` | Also return a Markdown version of each page. |
| `maxCharsPerPage` | integer | No | `0` | Cap the text/Markdown length per page (`0` = no limit). |
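When building the input programmatically, a small helper can merge overrides onto the documented defaults and catch typos before a run is started. This is a sketch: `build_input` is a hypothetical helper, and its validation rules (absolute http(s) URL, non-negative integers) are assumptions rather than the Actor's own input schema.

```python
# Defaults copied from the input table above.
DEFAULTS = {
    "maxPages": 25,
    "maxDepth": 3,
    "sameDomainOnly": True,
    "includeMarkdown": True,
    "maxCharsPerPage": 0,
}


def build_input(start_url: str, **overrides) -> dict:
    """Hypothetical helper: merge overrides onto the documented defaults."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown input fields: {sorted(unknown)}")
    if not start_url.startswith(("http://", "https://")):
        raise ValueError("startUrl must be an absolute http(s) URL")
    run_input = {"startUrl": start_url, **DEFAULTS, **overrides}
    # Assumed rule: numeric limits must be non-negative integers (bool is rejected).
    for key in ("maxPages", "maxDepth", "maxCharsPerPage"):
        value = run_input[key]
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError(f"{key} must be a non-negative integer")
    return run_input


run_input = build_input("https://docs.example.com", maxPages=100)
print(run_input)
```

The resulting dictionary has the same shape as the example input and can be passed as `run_input` to the API clients.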

Example input

{
  "startUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
  "maxPages": 25,
  "maxDepth": 3,
  "sameDomainOnly": true,
  "includeMarkdown": true,
  "maxCharsPerPage": 0
}

⬆️ Output

The Actor pushes one record per page to the dataset. In the Console Output tab you get a clean table (URL, depth, title, word count, links found, crawled-at); via API you get JSON/CSV/Excel. Example record:

{
  "url": "https://docs.apify.com/academy/web-scraping-for-beginners",
  "depth": 0,
  "title": "Web scraping for beginners",
  "description": "Learn the basics of web scraping with a step-by-step course.",
  "wordCount": 642,
  "content": "Web scraping for beginners\n\nThis course teaches you...",
  "markdown": "# Web scraping for beginners\n\nThis course teaches you...",
  "linksFound": 38,
  "crawledAt": "2026-06-30T09:42:11.004Z"
}

🎯 Use cases

📚 RAG knowledge bases — ingest an entire docs site or help center into a vector database in one run.
🤖 AI agents — give an agent a clean Markdown snapshot of a whole site instead of one page at a time.
🔁 Content migrations — pull a site's pages into Markdown for a new CMS or static-site generator.
🧠 LLM fine-tuning / context — build a clean text corpus from a domain you control.
🔍 Site audits — get titles, word counts, and link counts for every page to spot thin or orphaned content.
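For the RAG use case, each record's Markdown usually needs to be split into chunks before embedding. The following is one simple way to do that, splitting at heading lines; `split_by_heading` and `record_to_chunks` are hypothetical helpers, the sample record is made up, and a line starting with `#` inside a code block would be mistaken for a heading.

```python
def split_by_heading(markdown: str) -> list[str]:
    """Split Markdown into sections, starting a new one at each heading line."""
    sections: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]


def record_to_chunks(record: dict) -> list[dict]:
    """Turn one page record into chunk dicts carrying the source URL."""
    # Prefer Markdown; fall back to plain text if Markdown is missing or empty.
    text = record.get("markdown") or record.get("content") or ""
    return [
        {"source": record["url"], "chunk": i, "text": section}
        for i, section in enumerate(split_by_heading(text))
    ]


# Illustrative record, not real crawl output.
record = {
    "url": "https://docs.example.com/intro",
    "markdown": "# Intro\n\nWelcome.\n\n## Install\n\nRun the installer.",
}
chunks = record_to_chunks(record)
for c in chunks:
    print(c["chunk"], repr(c["text"]))
```

Each chunk keeps the page URL, so retrieved passages can be traced back to their source page.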

🔌 Integrations & code examples

Call it from the API

curl "https://api.apify.com/v2/acts/entranced_gelato~website-to-markdown-crawler/run-sync-get-dataset-items?token=<APIFY_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{ "startUrl": "https://example.com", "maxPages": 50 }'

Python (Apify client)

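from apify_client import ApifyClient

# Replace the placeholder with your own API token.
client = ApifyClient("<APIFY_TOKEN>")

# Start the Actor and wait for the run to finish.
run = client.actor("entranced_gelato/website-to-markdown-crawler").call(
    run_input={"startUrl": "https://docs.example.com", "maxPages": 100}
)

# Iterate over the per-page records in the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    # "markdown" may be missing or empty when includeMarkdown is disabled.
    markdown = item.get("markdown") or ""
    print(item["url"], "->", len(markdown), "chars")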

LangChain (RAG ingestion)

from langchain_community.utilities import ApifyWrapper
from langchain_core.documents import Document

# ApifyWrapper reads the API token from the APIFY_API_TOKEN environment variable.
apify = ApifyWrapper()
loader = apify.call_actor(
    actor_id="entranced_gelato/website-to-markdown-crawler",
    run_input={"startUrl": "https://docs.example.com", "maxPages": 100},
    # Prefer Markdown; fall back to plain text when Markdown is missing or empty.
    dataset_mapping_function=lambda item: Document(
        page_content=item.get("markdown") or item.get("content") or "",
        metadata={"source": item["url"], "title": item.get("title")},
    ),
)
docs = loader.load()  # feed straight into a vector store

MCP — add it to Claude, Cursor, or any agent

The Actor is exposed over the Model Context Protocol, so AI agents can call it as a tool. Point your MCP client at Apify's MCP server:

{
  "mcpServers": {
    "apify": {
      "command": "npx",
      "args": ["-y", "@apify/actors-mcp-server", "--actors", "entranced_gelato/website-to-markdown-crawler"],
      "env": { "APIFY_TOKEN": "<APIFY_TOKEN>" }
    }
  }
}

Also integrates with LlamaIndex, Make, Zapier, and n8n — start a crawl from any flow and route the clean output anywhere.

🧰 Want more? Pair it with the rest of the suite

📄 AI Web Page Reader — read a single URL into clean text + Markdown (the per-page version of this crawler).
📑 AI Document Reader — turn a PDF, DOCX, TXT, or HTML document into LLM-ready text + Markdown.
🧭 AI Competitive Brief Generator — turn a competitor or prospect URL into a structured competitive, SEO, or sales brief.

❓ FAQ

Is web crawling legal? Crawling publicly available pages is generally legal, but how you use the data may be subject to the target site's terms and applicable laws (copyright, privacy, etc.). This Actor reads only public pages and respects the limits you set. You are responsible for how you use the output — when in doubt, consult a lawyer.

Does it run JavaScript-heavy sites? It fetches server-rendered HTML. Pages that render entirely client-side may return limited content; for those, a headless-browser crawler is a better fit.
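One rough way to spot pages that may have been rendered client-side is to look for records with almost no extracted text. This is a heuristic sketch, not a feature of the Actor: `needs_browser_recrawl` and the 30-word threshold are assumptions, and the sample records are made up.

```python
def needs_browser_recrawl(record: dict, min_words: int = 30) -> bool:
    """Heuristic: a page with very little extracted text may be client-rendered."""
    return record.get("wordCount", 0) < min_words


# Illustrative records, not real crawl output.
records = [
    {"url": "https://example.com/docs", "wordCount": 642},
    {"url": "https://example.com/app", "wordCount": 4},
]
recrawl = [r["url"] for r in records if needs_browser_recrawl(r)]
print(recrawl)
```

Flagged URLs can then be passed to a headless-browser crawler; genuinely short pages will also be flagged, so the threshold is worth tuning per site.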

How do I keep costs down? Set maxPages and maxDepth. The crawler stops once it has crawled maxPages pages and does not follow links deeper than maxDepth, so your spend is capped by your own configuration.

Will it leave the site I gave it? Not by default — sameDomainOnly is true, so it only follows links on the start URL's domain. Turn it off to follow external links too.
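To make "links on the start URL's domain" concrete, here is a sketch of how a link could be resolved and checked. The Actor's exact rule is not documented here: `resolve_internal_link` is a hypothetical helper that assumes an exact hostname match, so subdomains count as external.

```python
from typing import Optional
from urllib.parse import urldefrag, urljoin, urlparse


def resolve_internal_link(page_url: str, href: str, start_url: str) -> Optional[str]:
    """Return the absolute URL for href if it stays on the start URL's host, else None."""
    # Resolve relative links against the current page and drop any #fragment.
    absolute, _fragment = urldefrag(urljoin(page_url, href))
    parsed = urlparse(absolute)
    if parsed.scheme not in ("http", "https"):
        return None  # mailto:, javascript:, tel:, ...
    # Assumption: same domain means identical hostnames.
    if parsed.hostname != urlparse(start_url).hostname:
        return None
    return absolute


start = "https://example.com/"
print(resolve_internal_link("https://example.com/docs/", "intro#setup", start))
print(resolve_internal_link("https://example.com/docs/", "https://other.org/x", start))
print(resolve_internal_link("https://example.com/docs/", "mailto:team@example.com", start))
```

Under this assumption the first call yields an on-site URL and the other two yield `None`.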

Markdown or plain text? Both. markdown keeps formatting for LLMs; content is clean plain text. Set includeMarkdown: false if you only want plain text.

Can I call it from my AI agent? Yes — it's exposed over MCP and the Apify API, so agents and automations can invoke it as a tool.

Built for AI engineers, RAG/LLM developers, and automation builders who need a fast, reliable "website → Markdown" primitive.
