Website to Markdown Crawler - Full-Site Text for LLMs & RAG
Pricing
from $1.00 / 1,000 pages crawled
Crawl any website from a start URL and get every page as clean text + Markdown for LLMs, RAG, and AI agents. Follows internal links with depth and page limits, strips nav and ads, and returns one structured record per page. A fast, no-config site-to-Markdown crawler.
Developer
AIDevs
Maintained by Community
Website to Markdown Crawler
Crawl any website from a single start URL and get every page back as clean text + Markdown, ready for LLMs, RAG pipelines, and AI agents. No config, no selectors, no headless-browser setup. Give it a URL, and it follows the internal links, strips the navigation and ads, and returns one structured record per page.
This is the bulk, whole-site companion to the single-page AI Web Page Reader. Point it at a docs site, a blog, a knowledge base, or a marketing site and turn the entire thing into LLM-ready Markdown in one run.
What can Website to Markdown Crawler do?
- Follow internal links automatically: breadth-first crawl from your start URL, with depth and page limits you control.
- Return clean content: removes nav bars, headers, footers, cookie banners, scripts, and ads, keeping the real page text.
- Output Markdown + plain text: headings, lists, links, and emphasis preserved as Markdown; plain text for simple ingestion.
- Stay on-domain: same-domain scoping by default so you don't wander off into the wider web.
- One record per page: title, description, word count, links found, depth, content, and Markdown for every page.
- Run with zero configuration: sensible defaults; the only required field is the start URL.
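The breadth-first traversal with depth and page limits described above can be sketched as follows. This is only an illustration over an in-memory link graph, not the Actor's actual implementation: the `crawl_order` function and the `links` mapping are hypothetical, and "same domain" is assumed to mean an identical hostname.

```python
from collections import deque
from urllib.parse import urlparse


def crawl_order(start_url, links, max_pages=25, max_depth=3, same_domain_only=True):
    """Return (url, depth) pairs in breadth-first order.

    `links` is a hypothetical dict mapping each URL to the URLs it links to;
    a real crawler would fetch and parse each page instead.
    """
    start_host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []

    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(url, []):
            if target in seen:
                continue
            # Assumption: "same domain" means an identical hostname.
            if same_domain_only and urlparse(target).netloc != start_host:
                continue
            seen.add(target)
            queue.append((target, depth + 1))
    return visited
```

Because the queue is first-in, first-out, every page at depth 1 is visited before any page at depth 2, so a low page limit favors pages closest to the start URL.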
What data do I get?
For every page crawled, you get one dataset record:
| Field | Description |
|---|---|
| url | The page URL that was crawled. |
| depth | How many links deep from the start URL (start = 0). |
| title | Page title. |
| description | Meta description, if present. |
| wordCount | Word count of the extracted text. |
| content | Clean plain text of the main content. |
| markdown | LLM-ready Markdown version (when enabled). |
| linksFound | Number of links discovered on the page. |
| crawledAt | ISO timestamp of when the page was crawled. |
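For typed handling downstream, the record can be described with a `TypedDict`. This is a sketch: the field names come from the table above, while the Python types, the `PageRecord` name, and the `summarize` helper are assumptions inferred from the example record, not a published schema.

```python
from typing import Optional, TypedDict


class PageRecord(TypedDict):
    """One dataset record per crawled page (types are assumed)."""
    url: str
    depth: int
    title: str
    description: Optional[str]  # assumed empty or None when the page has no meta description
    wordCount: int
    content: str
    markdown: Optional[str]     # assumed empty or None when Markdown output is disabled
    linksFound: int
    crawledAt: str              # ISO timestamp string


def summarize(record: PageRecord) -> str:
    """Hypothetical helper: a one-line summary of a record."""
    return f"[depth {record['depth']}] {record['title']} ({record['wordCount']} words)"
```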
How much will it cost?
This Actor uses pay-per-event pricing. You pay only for what you crawl, and platform and compute usage is included, so there is no separate infrastructure bill:
- Page crawled: $1.00 per 1,000 pages (the primary event). One charge per page successfully returned to the dataset.
- Actor start: $0.00005 (a negligible per-run fee).
- Platform usage / compute: included.
A 25-page docs site costs about $0.025, and a 1,000-page site costs about $1.00. You set maxPages, so your spend is always capped by your own limit. Unlike compute-metered crawlers, the price is flat and predictable: you know the maximum cost before you run.
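The arithmetic behind those figures can be checked with a few lines of Python. The two prices are taken from the list above; the `estimate_cost` helper and its constant names are hypothetical.

```python
PRICE_PER_PAGE = 1.00 / 1000   # $1.00 per 1,000 pages crawled
ACTOR_START_FEE = 0.00005      # per-run start fee


def estimate_cost(pages: int, runs: int = 1) -> float:
    """Estimated spend in USD for `pages` pages returned across `runs` runs."""
    if pages < 0 or runs < 0:
        raise ValueError("pages and runs must be non-negative")
    return round(pages * PRICE_PER_PAGE + runs * ACTOR_START_FEE, 5)


print(estimate_cost(25))    # 0.02505
print(estimate_cost(1000))  # 1.00005
```

Passing your maxPages value as `pages` gives the upper bound on a single run's cost.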
How do I use Website to Markdown Crawler?
- Create a free Apify account (new accounts get free monthly usage credits).
- Open the Actor and paste a website into Start URL (e.g. your docs or blog homepage).
- Optionally set Max pages, Max depth, and whether to stay on the same domain.
- Click Start and watch pages stream into the dataset.
- Export the results as JSON, CSV, or Excel, or pull them via the API.
That's it: no proxies, browsers, or selectors to configure.
Input
Configure the crawl from the Console Input tab or via the API. The only required field is startUrl.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| startUrl | string | Yes | (none) | The page to start crawling from. |
| maxPages | integer | No | 25 | Maximum number of pages to crawl. |
| maxDepth | integer | No | 3 | How many links deep to follow from the start URL. |
| sameDomainOnly | boolean | No | true | Only follow links on the start URL's domain. |
| includeMarkdown | boolean | No | true | Also return a Markdown version of each page. |
| maxCharsPerPage | integer | No | 0 | Cap the text/Markdown length per page (0 = no limit). |
Example input
```json
{
  "startUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
  "maxPages": 25,
  "maxDepth": 3,
  "sameDomainOnly": true,
  "includeMarkdown": true,
  "maxCharsPerPage": 0
}
```
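When building the input programmatically, a small helper can apply the documented defaults and catch obvious mistakes before a run starts. This is a sketch: `build_input` is a hypothetical client-side helper, and the requirement of an absolute http(s) URL is an assumption rather than a documented rule.

```python
from urllib.parse import urlparse

# Defaults copied from the input table above.
DEFAULTS = {
    "maxPages": 25,
    "maxDepth": 3,
    "sameDomainOnly": True,
    "includeMarkdown": True,
    "maxCharsPerPage": 0,
}


def build_input(start_url: str, **overrides) -> dict:
    """Return a run input dict with defaults filled in."""
    parsed = urlparse(start_url)
    # Assumption: the Actor expects an absolute http(s) URL.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"startUrl must be an absolute http(s) URL: {start_url!r}")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")
    return {"startUrl": start_url, **DEFAULTS, **overrides}
```

For example, `build_input("https://example.com", maxPages=50)` keeps every default except the page limit.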
Output
The Actor pushes one record per page to the dataset. In the Console Output tab you get a clean table (URL, depth, title, word count, links found, crawled-at); via API you get JSON/CSV/Excel. Example record:
```json
{
  "url": "https://docs.apify.com/academy/web-scraping-for-beginners",
  "depth": 0,
  "title": "Web scraping for beginners",
  "description": "Learn the basics of web scraping with a step-by-step course.",
  "wordCount": 642,
  "content": "Web scraping for beginners\n\nThis course teaches you...",
  "markdown": "# Web scraping for beginners\n\nThis course teaches you...",
  "linksFound": 38,
  "crawledAt": "2026-06-30T09:42:11.004Z"
}
```
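If you need the crawl time as a real timestamp, for example to detect stale pages, the `crawledAt` string can be parsed with the standard library. This sketch assumes every timestamp has the same shape as the example record (fractional seconds and a trailing `Z`); the `crawled_at` helper is hypothetical.

```python
import json
from datetime import datetime

# A trimmed copy of the example record, as it might appear in a JSON export.
raw = (
    '{"url": "https://docs.apify.com/academy/web-scraping-for-beginners", '
    '"depth": 0, "wordCount": 642, "crawledAt": "2026-06-30T09:42:11.004Z"}'
)


def crawled_at(record: dict) -> datetime:
    """Parse crawledAt into a timezone-aware datetime."""
    # On Python 3.7+, %z accepts the trailing "Z" as UTC.
    return datetime.strptime(record["crawledAt"], "%Y-%m-%dT%H:%M:%S.%f%z")


record = json.loads(raw)
ts = crawled_at(record)
```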
Use cases
- RAG knowledge bases: ingest an entire docs site or help center into a vector database in one run.
- AI agents: give an agent a clean Markdown snapshot of a whole site instead of one page at a time.
- Content migrations: pull a site's pages into Markdown for a new CMS or static-site generator.
- LLM fine-tuning / context: build a clean text corpus from a domain you control.
- Site audits: get titles, word counts, and link counts for every page to spot thin or orphaned content.
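As an illustration of the site-audit use case, the `wordCount` field is enough to flag thin pages. This is a sketch: `thin_pages` is a hypothetical helper and the 150-word threshold is an arbitrary assumption you would tune for your own site.

```python
def thin_pages(records, min_words=150):
    """Return (url, wordCount) for pages below `min_words`, shortest first."""
    flagged = [
        (r["url"], r["wordCount"])
        for r in records
        if r["wordCount"] < min_words
    ]
    return sorted(flagged, key=lambda pair: pair[1])


# Made-up sample records for demonstration only.
sample = [
    {"url": "https://example.com/", "wordCount": 642},
    {"url": "https://example.com/contact", "wordCount": 40},
    {"url": "https://example.com/tags", "wordCount": 12},
]
flagged = thin_pages(sample)
```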
Integrations & code examples
Call it from the API
```bash
curl "https://api.apify.com/v2/acts/entranced_gelato~website-to-markdown-crawler/run-sync-get-dataset-items?token=<APIFY_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{ "startUrl": "https://example.com", "maxPages": 50 }'
```
Python (Apify client)
```python
from apify_client import ApifyClient

client = ApifyClient("<APIFY_TOKEN>")

# Start the Actor and wait for the run to finish.
run = client.actor("entranced_gelato/website-to-markdown-crawler").call(
    run_input={"startUrl": "https://docs.example.com", "maxPages": 100}
)

# Iterate over the crawled pages in the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"], "->", len(item["markdown"]), "chars")
```
LangChain (RAG ingestion)
```python
from langchain_community.utilities import ApifyWrapper
from langchain_core.documents import Document

# ApifyWrapper reads the API token from the APIFY_API_TOKEN environment variable.
apify = ApifyWrapper()

loader = apify.call_actor(
    actor_id="entranced_gelato/website-to-markdown-crawler",
    run_input={"startUrl": "https://docs.example.com", "maxPages": 100},
    # Map each dataset record to a LangChain Document, preferring Markdown.
    dataset_mapping_function=lambda item: Document(
        page_content=item["markdown"] or item["content"] or "",
        metadata={"source": item["url"], "title": item.get("title")},
    ),
)

docs = loader.load()  # feed straight into a vector store
```
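Before embedding, long Markdown pages are often split into smaller chunks. Here is a minimal sketch, assuming heading-based splitting suits your content. `split_by_headings` and its 2,000-character default are hypothetical choices, not part of the Actor or of LangChain, and a `#` at the start of a line inside a fenced code block would be treated as a heading.

```python
import re

# Zero-width match at the start of any line that begins an ATX heading.
HEADING_START = re.compile(r"^(?=#{1,6} )", re.MULTILINE)


def split_by_headings(markdown: str, max_chars: int = 2000):
    """Split Markdown before each heading, then cap each piece at `max_chars`."""
    chunks = []
    for section in HEADING_START.split(markdown):
        section = section.strip()
        if not section:
            continue
        # Oversized sections are cut into fixed-size slices.
        for i in range(0, len(section), max_chars):
            chunks.append(section[i:i + max_chars])
    return chunks
```

Each chunk keeps its own heading as the first line, which tends to give the embedding some context about where the text came from.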
MCP: add it to Claude, Cursor, or any agent
The Actor is exposed over the Model Context Protocol, so AI agents can call it as a tool. Point your MCP client at Apify's MCP server:
```json
{
  "mcpServers": {
    "apify": {
      "command": "npx",
      "args": ["-y", "@apify/actors-mcp-server", "--actors", "entranced_gelato/website-to-markdown-crawler"],
      "env": { "APIFY_TOKEN": "<APIFY_TOKEN>" }
    }
  }
}
```
Also integrates with LlamaIndex, Make, Zapier, and n8n: start a crawl from any flow and route the clean output anywhere.
Want more? Pair it with the rest of the suite
- AI Web Page Reader: read a single URL into clean text + Markdown (the per-page version of this crawler).
- AI Document Reader: turn a PDF, DOCX, TXT, or HTML document into LLM-ready text + Markdown.
- AI Competitive Brief Generator: turn a competitor or prospect URL into a structured competitive, SEO, or sales brief.
FAQ
Is web crawling legal? Crawling publicly available pages is generally legal, but how you use the data may be subject to the target site's terms and applicable laws (copyright, privacy, etc.). This Actor reads only public pages and respects the limits you set. You are responsible for how you use the output. When in doubt, consult a lawyer.
Does it run JavaScript-heavy sites? It fetches server-rendered HTML. Pages that render entirely client-side may return limited content; for those, a headless-browser crawler is a better fit.
How do I keep costs down? Set maxPages and maxDepth. The crawler stops once it has returned maxPages pages and never follows links deeper than maxDepth, so your spend is capped by your own configuration.
Will it leave the site I gave it? Not by default. sameDomainOnly is true, so it only follows links on the start URL's domain. Turn it off to follow external links too.
Markdown or plain text? Both. markdown keeps formatting for LLMs; content is clean plain text. Set includeMarkdown: false if you only want plain text.
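Downstream code that should work with either setting can fall back from one field to the other. This sketch assumes that with includeMarkdown disabled the markdown field is either missing or empty, which the field table does not specify; `page_text` is a hypothetical helper.

```python
def page_text(item: dict) -> str:
    """Prefer Markdown, fall back to plain text, else an empty string."""
    return item.get("markdown") or item.get("content") or ""
```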
Can I call it from my AI agent? Yes. It's exposed over MCP and the Apify API, so agents and automations can invoke it as a tool.
Built for AI engineers, RAG/LLM developers, and automation builders who need a fast, reliable "website → Markdown" primitive.