LLM Markdown Crawler avatar

LLM Markdown Crawler

Pricing

from $5.00 / 1,000 results

Go to Apify Store
LLM Markdown Crawler

LLM Markdown Crawler

Crawl any website and extract clean, boilerplate-free Markdown optimized for LLMs, RAG pipelines, and AI training datasets. Uses Mozilla Readability to strip navigation and ads, then converts to clean Markdown. No browser required — fast and cheap.

Pricing

from $5.00 / 1,000 results

Rating

0.0

(0)

Developer

Daniel Dimitrov

Daniel Dimitrov

Maintained by Community

Actor stats

1

Bookmarked

4

Total users

2

Monthly active users

2 days ago

Last modified

Share

Turn any website into clean, structured Markdown ready for Large Language Models, RAG pipelines, and AI training datasets. LLM Markdown Crawler uses Mozilla's Readability algorithm to strip away navigation, ads, and boilerplate — leaving you with pure content that LLMs can consume directly.

What does LLM Markdown Crawler do?

LLM Markdown Crawler is an Apify Actor that crawls any website and converts its pages into pristine Markdown optimized for LLM and AI workflows. It uses CheerioCrawler (no headless browser) for blazing-fast, low-cost extraction. LLM Markdown Crawler can extract:

  • Clean Markdown content — navigation, footers, sidebars, and ads are stripped automatically
  • Page metadata — title, author, excerpt, and word count for each page
  • Deep crawl support — follow links up to a configurable depth with URL glob filtering
  • Structured output — each page becomes a single JSON record ready for vector databases, fine-tuning, or RAG ingestion

Why scrape websites for LLM data?

The web contains billions of pages of high-quality content — documentation, blog posts, research articles, and knowledge bases. Converting this content to clean Markdown is essential for modern AI workflows.

Here are just some of the ways you could use that data:

  • RAG pipelines — build retrieval-augmented generation systems with clean, chunked content from any domain
  • LLM fine-tuning — create high-quality training datasets from curated websites and documentation
  • Knowledge base archiving — capture and preserve website content in a portable, searchable format
  • Content analysis — analyze writing patterns, topics, and structure across large content collections
  • Documentation extraction — convert technical docs, wikis, and help centers into Markdown for internal tools

If you would like more inspiration on how scraping websites for LLM data could help your business, check out our industry pages.

How to scrape websites for LLM data

  1. Click on Try for free.
  2. Enter one or more website URLs in the startUrls field (e.g., https://docs.example.com/).
  3. Set maxRequestsPerCrawl, maxDepth, and optionally add URL globs to focus the crawl.
  4. Click on Run.
  5. When LLM Markdown Crawler has finished, preview or download your data from the Dataset tab.

How much will it cost to scrape websites for LLM data?

Apify gives you $5 free usage credits every month on the Apify Free plan. Because this Actor uses CheerioCrawler with no browser overhead, you can crawl approximately 10,000 pages per month for that, so those 10,000 pages will be completely free!

But if you need to get more data regularly from websites, you should grab an Apify subscription. We recommend our $49/month Personal plan — you can get up to 100,000 pages every month with the $49 monthly plan!

Or get 1,000,000+ pages for $499 with the Team plan — wow!

Input parameters for LLM Markdown Crawler

ParameterTypeRequiredDefaultDescription
startUrlsArrayList of URLs to start crawling
maxRequestsPerCrawlInteger100Maximum number of pages to process per crawl
maxDepthInteger3How many link levels deep to follow from start URLs
globsArray[]URL glob patterns to restrict which pages are crawled
includeMetadataBooleantrueWhether to extract author, excerpt, and word count metadata

Output from LLM Markdown Crawler

Each crawled page is stored as a JSON record in the Actor's dataset:

{
"url": "https://docs.example.com/getting-started",
"title": "Getting Started Guide",
"markdown": "# Getting Started\n\nThis guide walks you through setting up...",
"excerpt": "This guide walks you through setting up the platform in under 5 minutes.",
"author": "Jane Smith",
"wordCount": 1247
}
FieldTypeDescription
urlStringThe page URL that was crawled
titleStringPage title extracted from the document
markdownStringClean Markdown content with boilerplate removed
excerptStringShort summary of the page content (when metadata enabled)
authorStringAuthor name if detected in the page (when metadata enabled)
wordCountNumberEstimated word count of the extracted content

Tips for scraping websites for LLM data

  • Use URL globs to restrict crawling to relevant sections (e.g., ["https://docs.example.com/guide/**"]) and avoid irrelevant pages
  • Start with maxDepth: 1 to preview results before running deeper crawls — this saves credits and lets you validate output quality
  • Set includeMetadata: false if you only need the Markdown content, for a slight speed improvement
  • Combine multiple start URLs to build datasets spanning multiple websites in a single run

Note that personal data is protected by GDPR in the European Union and by other regulations around the world. You should not scrape personal data unless you have a legitimate reason to do so. If you're unsure whether your reason is legitimate, consult your lawyers. We also recommend that you read our blog post: is web scraping legal?

Webhook Integration

Pass an optional webhookUrl in the input to receive a POST notification when the run finishes:

{
"webhookUrl": "https://your-server.com/webhook"
}

Payload sent by Apify:

{
"eventType": "ACTOR.RUN.SUCCEEDED",
"eventData": { "actorId": "...", "actorRunId": "..." },
"resource": { "id": "...", "status": "SUCCEEDED", "defaultDatasetId": "..." }
}

The webhook fires on SUCCEEDED, FAILED, TIMED_OUT, and ABORTED events. Use it to trigger downstream pipelines, Zapier, Make.com, or any HTTP endpoint.

Integrations — connect LLM Markdown data to your AI stack

The Markdown output plugs directly into every major AI framework:

LangChain

from apify_client import ApifyClient
from langchain.docstore.document import Document
from langchain.text_splitter import MarkdownTextSplitter
client = ApifyClient("YOUR_APIFY_TOKEN")
run = client.actor("sleek_waveform/llm-markdown-crawler").call(run_input={
"startUrls": [{"url": "https://docs.example.com/"}],
"maxRequestsPerCrawl": 50,
"maxDepth": 2
})
items = client.dataset(run["defaultDatasetId"]).list_items().items
splitter = MarkdownTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = [Document(page_content=chunk, metadata={"url": item["url"], "title": item["title"]})
for item in items for chunk in splitter.split_text(item["markdown"])]

Pinecone / Weaviate / Qdrant

Feed the markdown field directly to any vector store's upsert endpoint. Each record includes url and title as ready-made metadata fields — no pre-processing required.

OpenAI fine-tuning

Export the dataset as JSONL and use url, title, and markdown to build prompt-completion pairs for domain-specific fine-tuning on any OpenAI model.

FAQ about LLM Markdown Crawler

How many pages can I crawl on the free plan? With Apify's $5 monthly free credit and CheerioCrawler's low compute cost, you can process approximately 10,000 pages per month at no cost.

Can I crawl login-required pages? No. LLM Markdown Crawler only accesses publicly available pages. Pages behind authentication, paywalls, or login walls will return empty content or an error row.

How is this different from a generic web scraper? LLM Markdown Crawler uses Mozilla's Readability algorithm to strip navigation, ads, footers, and sidebar content before converting to Markdown — the same algorithm Firefox Reader View uses. Generic scrapers return raw HTML and require you to write extraction logic per site. This Actor returns clean prose that LLMs can immediately process.

Can I control which pages are crawled? Yes. Use globs to restrict the crawl to specific URL patterns (e.g., ["https://docs.example.com/api/**"]), and set maxDepth to control how many link levels deep to follow.

Does it handle JavaScript-rendered content? LLM Markdown Crawler uses CheerioCrawler (HTTP-only, no browser) for maximum speed and minimum cost. Pages that require JavaScript rendering to display content may return incomplete Markdown. For JS-heavy sites, consider using a full Playwright-based crawler.

What's the difference between excerpt and markdown? excerpt is a short auto-generated summary of the page (typically the first paragraph or meta description). markdown is the full extracted content of the page.

How do I split the Markdown into chunks for vector databases? Use a Markdown-aware text splitter (e.g., MarkdownTextSplitter from LangChain). Split at heading boundaries to keep semantic meaning intact. A chunk size of 500–1,000 tokens works well for most RAG use cases.

Is there an output limit per page? No hard limit. Very long pages (e.g., a full documentation site crawled at depth 0) may produce large Markdown strings. Set maxDepth: 0 and target specific deep URLs for more focused results.

Other sleek_waveform Actors you might like

  • Substack Newsletter Scraper — scrape Substack newsletters and extract clean post content in Markdown or HTML. Perfect for building RAG datasets from newsletter archives.
  • YouTube Trend Scraper — extract YouTube video data and trend scores by keyword. Combine with LLM Markdown Crawler to enrich video descriptions with scraped article content.
  • SaaS Competitor Scraper — scrape competitor SaaS websites for tech stacks, pricing, and features. Feeds naturally into an LLM analysis pipeline built with this crawler.

Found this Actor useful? Leave a review on the Apify Store — it takes 30 seconds and helps other developers discover it.