Website to Markdown Crawler - Full-Site Text for LLMs & RAG

Pricing

from $1.00 / 1,000 pages crawled

Crawl any website from a start URL and get every page as clean text + Markdown for LLMs, RAG, and AI agents. Follows internal links with depth and page limits, strips nav and ads, and returns one structured record per page. A fast, no-config site-to-Markdown crawler.

Developer

AIDevs

Maintained by Community

🕸️ Website to Markdown Crawler

Crawl any website from a single start URL and get every page back as clean text + Markdown, ready for LLMs, RAG pipelines, and AI agents. No config, no selectors, no headless-browser setup. Give it a URL, and it follows the internal links, strips the navigation and ads, and returns one structured record per page.

This is the bulk, whole-site companion to the single-page AI Web Page Reader. Point it at a docs site, a blog, a knowledge base, or a marketing site and turn the entire thing into LLM-ready Markdown in one run.


🤔 What can Website to Markdown Crawler do?

  • 🔗 Follow internal links automatically: breadth-first crawl from your start URL, with depth and page limits you control.
  • 🧹 Return clean content: removes nav bars, headers, footers, cookie banners, scripts, and ads, keeping the real page text.
  • 📝 Output Markdown + plain text: headings, lists, links, and emphasis preserved as Markdown; plain text for simple ingestion.
  • 🏠 Stay on-domain: same-domain scoping by default so you don't wander off into the wider web.
  • 🗂️ One record per page: title, description, word count, links found, depth, content, and Markdown for every page.
  • ⚙️ Run with zero configuration: sensible defaults; the only required field is the start URL.
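
The crawl behaviour described above (breadth-first, depth and page limits, same-domain scoping) can be sketched as follows. This is an illustration, not the Actor's actual implementation: `crawl_order`, the `get_links` callable, and the toy `SITE` map are hypothetical names introduced here, and the sketch walks an in-memory link graph instead of fetching real pages.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def crawl_order(start_url, get_links, max_pages=25, max_depth=3, same_domain_only=True):
    """Return [(url, depth), ...] in breadth-first order.

    get_links is a hypothetical callable: url -> iterable of hrefs on that page.
    """
    start_host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # pages at the depth limit are kept, but their links are not followed
        for href in get_links(url):
            # Resolve relative links and drop any #fragment.
            link = urljoin(url, href).split("#", 1)[0]
            if same_domain_only and urlparse(link).netloc != start_host:
                continue  # off-domain link
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Toy link graph standing in for a real site (assumption for illustration).
SITE = {
    "https://example.com/": ["/docs", "/blog", "https://other.example.org/"],
    "https://example.com/docs": ["/docs/a", "/"],
    "https://example.com/blog": [],
    "https://example.com/docs/a": ["/docs/b"],
    "https://example.com/docs/b": [],
}

order = crawl_order("https://example.com/", lambda u: SITE.get(u, []), max_pages=10, max_depth=2)
for url, depth in order:
    print(depth, url)
```

With `max_depth=2`, the page `/docs/b` is never visited because it is only linked from a page that already sits at depth 2, and the external link is skipped by the same-domain check.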

📊 What data do I get?

For every page crawled, you get one dataset record:

| Field | Description |
| --- | --- |
| url | The page URL that was crawled. |
| depth | How many links deep from the start URL (start = 0). |
| title | Page title. |
| description | Meta description, if present. |
| wordCount | Word count of the extracted text. |
| content | Clean plain text of the main content. |
| markdown | LLM-ready Markdown version (when enabled). |
| linksFound | Number of links discovered on the page. |
| crawledAt | ISO timestamp of when the page was crawled. |

💰 How much will it cost?

This Actor uses pay-per-event pricing. You pay only for what you crawl, and platform/compute usage is included (no surprise infrastructure bill):

  • Page crawled: $1.00 per 1,000 pages (the primary event). One charge per page successfully returned to the dataset.
  • Actor start: $0.00005 (a negligible per-run fee).
  • Platform usage / compute: included.

A 25-page docs site costs about $0.025. A 1,000-page site costs about $1.00. You set maxPages, so your spend is always capped by your own limit. Unlike compute-metered crawlers, the price is flat and predictable, so you know the maximum cost before you run.
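
The arithmetic behind those figures can be written as a small helper. This is a sketch using only the two prices listed above; `estimate_max_cost` is a hypothetical name, and the result is an upper bound because only pages actually returned to the dataset are charged.

```python
PRICE_PER_PAGE = 1.00 / 1000   # "Page crawled" event: $1.00 per 1,000 pages
ACTOR_START_FEE = 0.00005      # "Actor start" event, charged once per run


def estimate_max_cost(max_pages, runs=1):
    """Upper bound in USD for `runs` crawls that each return up to `max_pages` pages."""
    return round(runs * (max_pages * PRICE_PER_PAGE + ACTOR_START_FEE), 5)


for pages in (25, 1000):
    print(f"{pages} pages: at most ${estimate_max_cost(pages):.5f}")
```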

🚀 How do I use Website to Markdown Crawler?

  1. Create a free Apify account (new accounts get free monthly usage credits).
  2. Open the Actor and paste a website into Start URL (e.g. your docs or blog homepage).
  3. Optionally set Max pages, Max depth, and whether to stay on the same domain.
  4. Click Start and watch pages stream into the dataset.
  5. Export the results as JSON, CSV, or Excel, or pull them via the API.

That's it: no proxies, browsers, or selectors to configure.

⬇️ Input

Configure the crawl from the Console Input tab or via the API. The only required field is startUrl.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| startUrl | string | Yes | (none) | The page to start crawling from. |
| maxPages | integer | No | 25 | Maximum number of pages to crawl. |
| maxDepth | integer | No | 3 | How many links deep to follow from the start URL. |
| sameDomainOnly | boolean | No | true | Only follow links on the start URL's domain. |
| includeMarkdown | boolean | No | true | Also return a Markdown version of each page. |
| maxCharsPerPage | integer | No | 0 | Cap the text/Markdown length per page (0 = no limit). |
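
If you build the input in code, a small helper can catch typos in field names before a run starts. This is a client-side sketch only: the Actor applies its own defaults, and `build_input` and `DEFAULTS` are hypothetical names that simply mirror the table above.

```python
# Defaults copied from the input table; the Actor applies them server-side as well.
DEFAULTS = {
    "maxPages": 25,
    "maxDepth": 3,
    "sameDomainOnly": True,
    "includeMarkdown": True,
    "maxCharsPerPage": 0,
}


def build_input(start_url, **overrides):
    """Return a complete input dict, rejecting unknown fields and non-HTTP start URLs."""
    if not start_url.startswith(("http://", "https://")):
        raise ValueError(f"startUrl must be an http(s) URL, got {start_url!r}")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown input field(s): {sorted(unknown)}")
    return {"startUrl": start_url, **DEFAULTS, **overrides}


run_input = build_input("https://docs.example.com", maxPages=100)
print(run_input)
```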

Example input

{
  "startUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
  "maxPages": 25,
  "maxDepth": 3,
  "sameDomainOnly": true,
  "includeMarkdown": true,
  "maxCharsPerPage": 0
}

⬆️ Output

The Actor pushes one record per page to the dataset. In the Console Output tab you get a clean table (URL, depth, title, word count, links found, crawled-at); via API you get JSON/CSV/Excel. Example record:

{
  "url": "https://docs.apify.com/academy/web-scraping-for-beginners",
  "depth": 0,
  "title": "Web scraping for beginners",
  "description": "Learn the basics of web scraping with a step-by-step course.",
  "wordCount": 642,
  "content": "Web scraping for beginners\n\nThis course teaches you...",
  "markdown": "# Web scraping for beginners\n\nThis course teaches you...",
  "linksFound": 38,
  "crawledAt": "2026-06-30T09:42:11.004Z"
}
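
Before embedding, a record's `markdown` field is often split into smaller sections. The sketch below splits at ATX headings (`#` to `######`). It is one possible approach rather than part of the Actor: `split_by_headings` is a hypothetical helper, the sample record is made up for illustration, and the splitter does not handle `#` lines inside fenced code blocks.

```python
import re


def split_by_headings(markdown):
    """Split Markdown into (heading, body) pairs at ATX headings.

    Text before the first heading is returned with an empty heading.
    """
    sections = []
    heading, lines = "", []
    for line in markdown.splitlines():
        if re.match(r"#{1,6} ", line):
            # Close the previous section if it has a heading or any text.
            if heading or any(l.strip() for l in lines):
                sections.append((heading, "\n".join(lines).strip()))
            heading, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    if heading or any(l.strip() for l in lines):
        sections.append((heading, "\n".join(lines).strip()))
    return sections


# Made-up record in the same shape as the example output.
record = {
    "url": "https://docs.example.com/intro",
    "markdown": "# Getting started\n\nInstall the tool.\n\n## Prerequisites\n\nNone.",
}
sections = split_by_headings(record["markdown"])
for title, body in sections:
    print(f"{record['url']} | {title}: {len(body)} chars")
```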

🎯 Use cases

  • πŸ“š RAG knowledge bases β€” ingest an entire docs site or help center into a vector database in one run.
  • πŸ€– AI agents β€” give an agent a clean Markdown snapshot of a whole site instead of one page at a time.
  • πŸ” Content migrations β€” pull a site's pages into Markdown for a new CMS or static-site generator.
  • 🧠 LLM fine-tuning / context β€” build a clean text corpus from a domain you control.
  • πŸ” Site audits β€” get titles, word counts, and link counts for every page to spot thin or orphaned content.
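
For the site-audit use case, the `wordCount` field is enough to flag thin pages. A minimal sketch, assuming the records have already been downloaded from the dataset; `find_thin_pages`, the 200-word default threshold, and the sample records are illustrative choices, not part of the Actor.

```python
def find_thin_pages(records, min_words=200):
    """Return (url, wordCount) pairs below min_words, thinnest first."""
    thin = [(r["url"], r.get("wordCount", 0)) for r in records if r.get("wordCount", 0) < min_words]
    return sorted(thin, key=lambda pair: pair[1])


# Made-up records with only the fields this audit needs.
records = [
    {"url": "https://example.com/", "wordCount": 642},
    {"url": "https://example.com/tags", "wordCount": 12},
    {"url": "https://example.com/about", "wordCount": 85},
]
thin = find_thin_pages(records, min_words=100)
for url, words in thin:
    print(f"{words:>5} words  {url}")
```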

🔌 Integrations & code examples

Call it from the API

curl "https://api.apify.com/v2/acts/entranced_gelato~website-to-markdown-crawler/run-sync-get-dataset-items?token=<APIFY_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{ "startUrl": "https://example.com", "maxPages": 50 }'

Python (Apify client)

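from apify_client import ApifyClient

client = ApifyClient("<APIFY_TOKEN>")

# Start the Actor and wait for the run to finish.
run = client.actor("entranced_gelato/website-to-markdown-crawler").call(
    run_input={"startUrl": "https://docs.example.com", "maxPages": 100}
)

# Iterate over the crawled pages in the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    # "markdown" is only returned when includeMarkdown is enabled.
    print(item["url"], "->", len(item.get("markdown") or ""), "chars")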

LangChain (RAG ingestion)

from langchain_community.utilities import ApifyWrapper
from langchain_core.documents import Document

# ApifyWrapper reads the token from the APIFY_API_TOKEN environment variable.
apify = ApifyWrapper()

loader = apify.call_actor(
    actor_id="entranced_gelato/website-to-markdown-crawler",
    run_input={"startUrl": "https://docs.example.com", "maxPages": 100},
    # Map each dataset record to a Document, preferring Markdown over plain text.
    dataset_mapping_function=lambda item: Document(
        page_content=item.get("markdown") or item.get("content") or "",
        metadata={"source": item["url"], "title": item.get("title")},
    ),
)
docs = loader.load()  # feed straight into a vector store

MCP: add it to Claude, Cursor, or any agent

The Actor is exposed over the Model Context Protocol, so AI agents can call it as a tool. Point your MCP client at Apify's MCP server:

{
  "mcpServers": {
    "apify": {
      "command": "npx",
      "args": ["-y", "@apify/actors-mcp-server", "--actors", "entranced_gelato/website-to-markdown-crawler"],
      "env": { "APIFY_TOKEN": "<APIFY_TOKEN>" }
    }
  }
}

Also integrates with LlamaIndex, Make, Zapier, and n8n: start a crawl from any flow and route the clean output anywhere.

🧰 Want more? Pair it with the rest of the suite

  • πŸ“„ AI Web Page Reader β€” read a single URL into clean text + Markdown (the per-page version of this crawler).
  • πŸ“‘ AI Document Reader β€” turn a PDF, DOCX, TXT, or HTML document into LLM-ready text + Markdown.
  • 🧭 AI Competitive Brief Generator β€” turn a competitor or prospect URL into a structured competitive, SEO, or sales brief.

❓ FAQ

Is web crawling legal? Crawling publicly available pages is generally legal, but how you use the data may be subject to the target site's terms and applicable laws (copyright, privacy, etc.). This Actor reads only public pages and respects the limits you set. You are responsible for how you use the output. When in doubt, consult a lawyer.

Does it run JavaScript-heavy sites? It fetches server-rendered HTML. Pages that render entirely client-side may return limited content; for those, a headless-browser crawler is a better fit.

How do I keep costs down? Set maxPages and maxDepth. The crawler stops as soon as it hits either limit, so your spend is capped by your own configuration.

Will it leave the site I gave it? Not by default. sameDomainOnly is true, so it only follows links on the start URL's domain. Set it to false to follow external links too.

Markdown or plain text? Both. markdown keeps formatting for LLMs; content is clean plain text. Set includeMarkdown: false if you only want plain text.

Can I call it from my AI agent? Yes. It's exposed over MCP and the Apify API, so agents and automations can invoke it as a tool.


Built for AI engineers, RAG/LLM developers, and automation builders who need a fast, reliable "website → Markdown" primitive.