Pricing

from $5.00 / 1,000 results

LLM Markdown Crawler

Crawl any website and extract clean, boilerplate-free Markdown optimized for LLMs, RAG pipelines, and AI training datasets. Uses Mozilla Readability to strip navigation and ads, then converts to clean Markdown. No browser required — fast and cheap.

Pricing

from $5.00 / 1,000 results

Rating

0.0

(0)

Developer

Daniel Dimitrov

Actor stats

Bookmarked

Total users

Monthly active users

6 days ago

Last modified

What does LLM Markdown Crawler do?

LLM Markdown Crawler is an Apify Actor that crawls any website and converts its pages into pristine Markdown optimized for LLM and AI workflows. It uses CheerioCrawler (no headless browser) for blazing-fast, low-cost extraction. LLM Markdown Crawler can extract:

Clean Markdown content — navigation, footers, sidebars, and ads are stripped automatically
Page metadata — title, author, excerpt, and word count for each page
Deep crawl support — follow links up to a configurable depth with URL glob filtering
Structured output — each page becomes a single JSON record ready for vector databases, fine-tuning, or RAG ingestion

Why scrape websites for LLM data?

The web contains billions of pages of high-quality content — documentation, blog posts, research articles, and knowledge bases. Converting this content to clean Markdown is essential for modern AI workflows.

Here are just some of the ways you could use that data:

RAG pipelines — build retrieval-augmented generation systems with clean, chunked content from any domain
LLM fine-tuning — create high-quality training datasets from curated websites and documentation
Knowledge base archiving — capture and preserve website content in a portable, searchable format
Content analysis — analyze writing patterns, topics, and structure across large content collections
Documentation extraction — convert technical docs, wikis, and help centers into Markdown for internal tools

If you would like more inspiration on how scraping websites for LLM data could help your business, check out our industry pages.

How to scrape websites for LLM data

Click on Try for free.
Enter one or more website URLs in the startUrls field (e.g., https://docs.example.com/).
Set maxRequestsPerCrawl, maxDepth, and optionally add URL globs to focus the crawl.
Click on Run.
When LLM Markdown Crawler has finished, preview or download your data from the Dataset tab.

How much will it cost to scrape websites for LLM data?

Apify gives you $5 free usage credits every month on the Apify Free plan. Because this Actor uses CheerioCrawler with no browser overhead, you can crawl approximately 10,000 pages per month for that, so those 10,000 pages will be completely free!

But if you need to get more data regularly from websites, you should grab an Apify subscription. We recommend our $49/month Personal plan — you can get up to 100,000 pages every month with the $49 monthly plan!

Or get 1,000,000+ pages for $499 with the Team plan — wow!

Input parameters for LLM Markdown Crawler

Parameter	Type	Required	Default	Description
`startUrls`	Array	✅	—	List of URLs to start crawling
`maxRequestsPerCrawl`	Integer	❌	100	Maximum number of pages to process per crawl
`maxDepth`	Integer	❌	3	How many link levels deep to follow from start URLs
`globs`	Array	❌	`[]`	URL glob patterns to restrict which pages are crawled
`includeMetadata`	Boolean	❌	`true`	Whether to extract author, excerpt, and word count metadata

Output from LLM Markdown Crawler

Each crawled page is stored as a JSON record in the Actor's dataset:

{
  "url": "https://docs.example.com/getting-started",
  "title": "Getting Started Guide",
  "markdown": "# Getting Started\n\nThis guide walks you through setting up...",
  "excerpt": "This guide walks you through setting up the platform in under 5 minutes.",
  "author": "Jane Smith",
  "wordCount": 1247
}

Field	Type	Description
`url`	String	The page URL that was crawled
`title`	String	Page title extracted from the document
`markdown`	String	Clean Markdown content with boilerplate removed
`excerpt`	String	Short summary of the page content (when metadata enabled)
`author`	String	Author name if detected in the page (when metadata enabled)
`wordCount`	Number	Estimated word count of the extracted content

Tips for scraping websites for LLM data

Use URL globs to restrict crawling to relevant sections (e.g., ["https://docs.example.com/guide/**"]) and avoid irrelevant pages
Start with maxDepth: 1 to preview results before running deeper crawls — this saves credits and lets you validate output quality
Set includeMetadata: false if you only need the Markdown content, for a slight speed improvement
Combine multiple start URLs to build datasets spanning multiple websites in a single run

Is it legal to scrape websites for LLM data?

Note that personal data is protected by GDPR in the European Union and by other regulations around the world. You should not scrape personal data unless you have a legitimate reason to do so. If you're unsure whether your reason is legitimate, consult your lawyers. We also recommend that you read our blog post: is web scraping legal?

Webhook Integration

Pass an optional webhookUrl in the input to receive a POST notification when the run finishes:

{
  "webhookUrl": "https://your-server.com/webhook"
}

Payload sent by Apify:

{
  "eventType": "ACTOR.RUN.SUCCEEDED",
  "eventData": { "actorId": "...", "actorRunId": "..." },
  "resource": { "id": "...", "status": "SUCCEEDED", "defaultDatasetId": "..." }
}

The webhook fires on SUCCEEDED, FAILED, TIMED_OUT, and ABORTED events. Use it to trigger downstream pipelines, Zapier, Make.com, or any HTTP endpoint.

Integrations — connect LLM Markdown data to your AI stack

The Markdown output plugs directly into every major AI framework:

LangChain

from apify_client import ApifyClient
from langchain.docstore.document import Document
from langchain.text_splitter import MarkdownTextSplitter

client = ApifyClient("YOUR_APIFY_TOKEN")
run = client.actor("sleek_waveform/llm-markdown-crawler").call(run_input={
    "startUrls": [{"url": "https://docs.example.com/"}],
    "maxRequestsPerCrawl": 50,
    "maxDepth": 2
})

items = client.dataset(run["defaultDatasetId"]).list_items().items
splitter = MarkdownTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = [Document(page_content=chunk, metadata={"url": item["url"], "title": item["title"]})
        for item in items for chunk in splitter.split_text(item["markdown"])]

Pinecone / Weaviate / Qdrant

Feed the markdown field directly to any vector store's upsert endpoint. Each record includes url and title as ready-made metadata fields — no pre-processing required.

OpenAI fine-tuning

Export the dataset as JSONL and use url, title, and markdown to build prompt-completion pairs for domain-specific fine-tuning on any OpenAI model.

FAQ about LLM Markdown Crawler

How many pages can I crawl on the free plan? With Apify's $5 monthly free credit and CheerioCrawler's low compute cost, you can process approximately 10,000 pages per month at no cost.

Can I crawl login-required pages? No. LLM Markdown Crawler only accesses publicly available pages. Pages behind authentication, paywalls, or login walls will return empty content or an error row.

How is this different from a generic web scraper? LLM Markdown Crawler uses Mozilla's Readability algorithm to strip navigation, ads, footers, and sidebar content before converting to Markdown — the same algorithm Firefox Reader View uses. Generic scrapers return raw HTML and require you to write extraction logic per site. This Actor returns clean prose that LLMs can immediately process.

Can I control which pages are crawled? Yes. Use globs to restrict the crawl to specific URL patterns (e.g., ["https://docs.example.com/api/**"]), and set maxDepth to control how many link levels deep to follow.

Does it handle JavaScript-rendered content? LLM Markdown Crawler uses CheerioCrawler (HTTP-only, no browser) for maximum speed and minimum cost. Pages that require JavaScript rendering to display content may return incomplete Markdown. For JS-heavy sites, consider using a full Playwright-based crawler.

What's the difference between excerpt and markdown? excerpt is a short auto-generated summary of the page (typically the first paragraph or meta description). markdown is the full extracted content of the page.

How do I split the Markdown into chunks for vector databases? Use a Markdown-aware text splitter (e.g., MarkdownTextSplitter from LangChain). Split at heading boundaries to keep semantic meaning intact. A chunk size of 500–1,000 tokens works well for most RAG use cases.

Is there an output limit per page? No hard limit. Very long pages (e.g., a full documentation site crawled at depth 0) may produce large Markdown strings. Set maxDepth: 0 and target specific deep URLs for more focused results.

Other sleek_waveform Actors you might like

Substack Newsletter Scraper — scrape Substack newsletters and extract clean post content in Markdown or HTML. Perfect for building RAG datasets from newsletter archives.
YouTube Trend Scraper — extract YouTube video data and trend scores by keyword. Combine with LLM Markdown Crawler to enrich video descriptions with scraped article content.
SaaS Competitor Scraper — scrape competitor SaaS websites for tech stacks, pricing, and features. Feeds naturally into an LLM analysis pipeline built with this crawler.

Found this Actor useful? Leave a review on the Apify Store — it takes 30 seconds and helps other developers discover it.

AI Web Content Crawler - Markdown for LLMs

intelscrape/ai-web-content-crawler

Crawl any website and extract clean Markdown optimized for LLM training, RAG pipelines, and AI knowledge bases - removes boilerplate and outputs structured JSON with URL, title, markdown, and metadata.

IntelScrape

Website to Markdown Crawler for LLM & RAG

logiover/website-text-markdown-crawler

Crawl any website to clean Markdown and plain text for LLM training and RAG. HTML to Markdown, no API or login. Export website text to CSV or JSON.

Logiover

Website to Markdown - Clean LLM-Ready Content

ambitious_door/web-to-markdown

Convert any webpage into clean markdown stripped of navigation, ads, and boilerplate. Perfect for RAG pipelines, LLM context, and content extraction. Token counts included.

C. K.

Website Markdown Crawler

moorish-dev/website-markdown-crawler

Crawls a website and converts every page to clean Markdown optimized for LLM ingestion.

Ziad Tarik

URL to Markdown for LLMs (polite, robots-respecting)

weltverbenzer/url-to-markdown-for-llms

Turn any URL into clean, LLM-ready Markdown for AI agents and RAG pipelines. Enforces robots.txt, extracts main content (Readability) and converts to Markdown. Returns title, byline and markdown.

Johannes Witt

Website To Markdown

smart_api/website-to-markdown

Convert any webpage into clean, LLM-ready Markdown in seconds — perfect for AI training data, RAG pipelines, and content archiving.

SmartApi

5.0

AI Website Content Extractor

scrapeai/ai-website-content-extractor

Crawl website pages, strip noise, and convert the main content to clean Markdown for RAG/LLM training.

ScrapeAI

5.0

RAG Spider - Web to Markdown Crawler for AI Training Data

lenient_grove/RAG-Spider

Enterprise-grade web crawler that converts messy websites into clean, chunked Markdown for AI systems. Uses Mozilla Readability for 95% cleaner extraction than competitors. Outputs RAG-ready data with metadata and token estimates. Perfect for building knowledge bases and training AI chatbots.

Tejas Rawool

5.0

Website to Markdown for LLM and RAG

jeweled_jockstrap/my-actor-3

Convert any URL to clean Markdown text for AI applications. Strips HTML extracts content. For LLM training RAG pipelines and vector databases. Free Firecrawl alternative.

Juan Triviño

Website to Markdown Crawler - Full-Site Text for LLMs & RAG

entranced_gelato/website-to-markdown-crawler

Crawl any website from a start URL and get every page as clean text + Markdown for LLMs, RAG, and AI agents. Follows internal links with depth and page limits, strips nav and ads, and returns one structured record per page. A fast, no-config site-to-Markdown crawler.