
LLM Markdown Crawler

Pricing

from $5.00 / 1,000 results

Crawl any website and extract clean, boilerplate-free Markdown optimized for LLMs, RAG pipelines, and AI training datasets. Uses Mozilla Readability to strip navigation and ads, then converts to clean Markdown. No browser required — fast and cheap.


Developer

Daniel Dimitrov (Maintained by Community)

Actor stats

0 bookmarks · 1 total user · 0 monthly active users · last modified 6 hours ago


Turn any website into clean, structured Markdown ready for Large Language Models, RAG pipelines, and AI training datasets. LLM Markdown Crawler uses Mozilla's Readability algorithm to strip away navigation, ads, and boilerplate — leaving you with pure content that LLMs can consume directly.

What does LLM Markdown Crawler do?

LLM Markdown Crawler is an Apify Actor that crawls any website and converts its pages into pristine Markdown optimized for LLM and AI workflows. It uses CheerioCrawler (no headless browser) for blazing-fast, low-cost extraction. LLM Markdown Crawler can extract:

  • Clean Markdown content — navigation, footers, sidebars, and ads are stripped automatically
  • Page metadata — title, author, excerpt, and word count for each page
  • Deep crawl support — follow links up to a configurable depth with URL glob filtering
  • Structured output — each page becomes a single JSON record ready for vector databases, fine-tuning, or RAG ingestion
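To make the boilerplate-stripping idea concrete, here is a deliberately tiny sketch. The Actor itself uses Mozilla Readability, which is far more sophisticated; the tag names stripped below (`nav`, `footer`, `aside`) and the regex approach are illustrative assumptions only, not the Actor's actual implementation:

```javascript
// Toy sketch: strip common boilerplate containers, then reduce the
// remaining HTML to rough Markdown. Readability does this properly
// via content scoring; this only shows the shape of the transformation.
function stripBoilerplate(html) {
  return html
    // Drop whole nav/footer/aside/script/style blocks.
    .replace(/<(nav|footer|aside|script|style)[\s\S]*?<\/\1>/gi, '')
    // Convert a couple of content tags to Markdown.
    .replace(/<h1[^>]*>([\s\S]*?)<\/h1>/gi, '# $1\n\n')
    .replace(/<p[^>]*>([\s\S]*?)<\/p>/gi, '$1\n\n')
    // Remove any remaining tags and tidy up.
    .replace(/<[^>]+>/g, '')
    .trim();
}
```

Running it on `'<nav><a href="/">Home</a></nav><h1>Hello</h1><p>World</p><footer>x</footer>'` yields `'# Hello\n\nWorld'` — the navigation and footer are gone, and only the content survives as Markdown.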

Why scrape websites for LLM data?

The web contains billions of pages of high-quality content — documentation, blog posts, research articles, and knowledge bases. Converting this content to clean Markdown is essential for modern AI workflows.

Here are just some of the ways you could use that data:

  • RAG pipelines — build retrieval-augmented generation systems with clean, chunked content from any domain
  • LLM fine-tuning — create high-quality training datasets from curated websites and documentation
  • Knowledge base archiving — capture and preserve website content in a portable, searchable format
  • Content analysis — analyze writing patterns, topics, and structure across large content collections
  • Documentation extraction — convert technical docs, wikis, and help centers into Markdown for internal tools

If you would like more inspiration on how scraping websites for LLM data could help your business, check out our industry pages.

How to scrape websites for LLM data

  1. Click on Try for free.
  2. Enter one or more website URLs in the startUrls field (e.g., https://docs.example.com/).
  3. Set maxRequestsPerCrawl, maxDepth, and optionally add URL globs to focus the crawl.
  4. Click on Run.
  5. When LLM Markdown Crawler has finished, preview or download your data from the Dataset tab.
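The values you enter in step 3 correspond to a JSON input object. The validator below is hypothetical (the platform performs its own validation); it only illustrates what a well-formed input looks like, based on the parameter table further down:

```javascript
// Hypothetical helper showing the expected shape of the Actor input.
function validateInput(input) {
  const errors = [];
  if (!Array.isArray(input.startUrls) || input.startUrls.length === 0) {
    errors.push('startUrls must be a non-empty array of URLs');
  }
  if (input.maxDepth !== undefined
      && (!Number.isInteger(input.maxDepth) || input.maxDepth < 0)) {
    errors.push('maxDepth must be a non-negative integer');
  }
  return errors;
}

// Example input matching the how-to steps above.
const exampleInput = {
  startUrls: ['https://docs.example.com/'],
  maxRequestsPerCrawl: 100,
  maxDepth: 1,
  globs: ['https://docs.example.com/**'],
  includeMetadata: true,
};
```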

How much will it cost to scrape websites for LLM data?

Apify gives you $5 in free usage credits every month on the Apify Free plan. Because this Actor uses CheerioCrawler with no browser overhead, those credits cover approximately 10,000 crawled pages per month — so those pages cost you nothing.

If you need more data from websites on a regular basis, consider an Apify subscription. The $49/month Personal plan covers up to 100,000 pages every month, and the $499/month Team plan covers 1,000,000+ pages.

Input parameters for LLM Markdown Crawler

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| startUrls | Array | | | List of URLs to start crawling |
| maxRequestsPerCrawl | Integer | | 100 | Maximum number of pages to process per crawl |
| maxDepth | Integer | | 1 | How many link levels deep to follow from start URLs |
| globs | Array | | [] | URL glob patterns to restrict which pages are crawled |
| includeMetadata | Boolean | | true | Whether to extract author, excerpt, and word count metadata |

Output from LLM Markdown Crawler

Each crawled page is stored as a JSON record in the Actor's dataset:

{
  "url": "https://docs.example.com/getting-started",
  "title": "Getting Started Guide",
  "markdown": "# Getting Started\n\nThis guide walks you through setting up...",
  "excerpt": "This guide walks you through setting up the platform in under 5 minutes.",
  "author": "Jane Smith",
  "wordCount": 1247
}
| Field | Type | Description |
| --- | --- | --- |
| url | String | The page URL that was crawled |
| title | String | Page title extracted from the document |
| markdown | String | Clean Markdown content with boilerplate removed |
| excerpt | String | Short summary of the page content (when metadata enabled) |
| author | String | Author name if detected in the page (when metadata enabled) |
| wordCount | Number | Estimated word count of the extracted content |
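Since each record carries the full page as one Markdown string, RAG pipelines typically split it before embedding. The chunker below is an assumption-laden sketch: real pipelines usually split on headings or token counts, and the 200-word chunk size is an arbitrary example, not a recommendation from this Actor:

```javascript
// Naive word-count chunker for dataset records, keeping url/title
// metadata on every chunk so retrieved passages stay attributable.
function chunkRecord(record, wordsPerChunk = 200) {
  const words = record.markdown.split(/\s+/).filter(Boolean);
  const chunks = [];
  for (let i = 0; i < words.length; i += wordsPerChunk) {
    chunks.push({
      url: record.url,
      title: record.title,
      text: words.slice(i, i + wordsPerChunk).join(' '),
    });
  }
  return chunks;
}
```

A 450-word `markdown` field would come back as three chunks, each tagged with the source `url` and `title`.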

Tips for scraping websites for LLM data

  • Use URL globs to restrict crawling to relevant sections (e.g., ["https://docs.example.com/guide/**"]) and avoid irrelevant pages
  • Start with maxDepth: 1 to preview results before running deeper crawls — this saves credits and lets you validate output quality
  • Set includeMetadata: false if you only need the Markdown content, for a slight speed improvement
  • Combine multiple start URLs to build datasets spanning multiple websites in a single run
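As a rough sketch of how the glob tip works: the Actor's crawler matches URLs against minimatch-style patterns, but the simplified matcher below, supporting only `**` (any characters) and `*` (any characters except `/`), is enough to reason about patterns like the one in the first tip:

```javascript
// Simplified glob matcher for illustration only; the real crawler
// uses full minimatch-style semantics.
function globToRegExp(glob) {
  // Escape regex metacharacters except '*'.
  const escaped = glob.replace(/[.+^${}()|[\]\\]/g, '\\$&');
  // '**' matches anything; '*' matches anything except '/'.
  const pattern = escaped
    .replace(/\*\*/g, '\u0000')
    .replace(/\*/g, '[^/]*')
    .replace(/\u0000/g, '.*');
  return new RegExp(`^${pattern}$`);
}

function matchesAnyGlob(url, globs) {
  return globs.some((g) => globToRegExp(g).test(url));
}
```

With `["https://docs.example.com/guide/**"]`, a page under `/guide/` matches and gets crawled, while a `/blog/` page does not — which is exactly how globs keep a crawl focused.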

Note that personal data is protected by GDPR in the European Union and by other regulations around the world. You should not scrape personal data unless you have a legitimate reason to do so. If you're unsure whether your reason is legitimate, consult your lawyers. We also recommend that you read our blog post: is web scraping legal?

Webhook Integration

Pass an optional webhookUrl in the input to receive a POST notification when the run finishes:

{
  "webhookUrl": "https://your-server.com/webhook"
}

Payload sent by Apify:

{
  "eventType": "ACTOR.RUN.SUCCEEDED",
  "eventData": { "actorId": "...", "actorRunId": "..." },
  "resource": { "id": "...", "status": "SUCCEEDED", "defaultDatasetId": "..." }
}

The webhook fires on SUCCEEDED, FAILED, TIMED_OUT, and ABORTED events. Use it to trigger downstream pipelines, Zapier, Make.com, or any HTTP endpoint.
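A consumer of that payload might look like the sketch below. The payload shape follows the example above, and the dataset-items URL uses Apify's public API endpoint (`/v2/datasets/{datasetId}/items`); how you act on non-success events is up to your pipeline:

```javascript
// Decide what to do with an incoming webhook payload: on success,
// build the URL for fetching the run's dataset items; otherwise
// surface the run status so the caller can alert or retry.
function handleWebhook(payload) {
  if (payload.eventType !== 'ACTOR.RUN.SUCCEEDED') {
    return { fetch: false, reason: payload.resource && payload.resource.status };
  }
  const datasetId = payload.resource.defaultDatasetId;
  return {
    fetch: true,
    itemsUrl: `https://api.apify.com/v2/datasets/${datasetId}/items?format=json`,
  };
}
```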