Pricing

$3.00 / 1,000 website page rows

Website Content Scraper: Clean Markdown for AI and RAG

Crawl any website and get clean markdown, text, or HTML per page, ready for RAG pipelines, chatbots, and LLM fine-tuning. Plain HTTP, no browser, no API key. Pay per page.

Developer

Ken M

Last modified

20 days ago

What you get

One row per crawled page, with:

content (main page content as markdown by default; set outputFormat to switch to plain text or cleaned HTML)
format (the output format used for content)
title, description (meta), lang, canonical
url, finalUrl (after redirects), depth, wordCount, crawledAt

Boilerplate is removed before extraction: scripts, styles, nav bars, headers, footers, sidebars, forms, and cookie banners. The main content region (main, article, or role="main") is preferred when the page declares one.
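
The extraction rule above can be sketched with Python's standard library alone. This is an illustration, not the actor's actual code: the tag list, the `MainTextExtractor` class, and the `extract_main_text` function are hypothetical names, and the sketch skips cookie-banner detection and markdown conversion.

```python
from html.parser import HTMLParser

# Assumed boilerplate containers, based on the description above.
# "head" is added so <title> text does not leak into the body (assumption).
SKIP_TAGS = {"head", "script", "style", "nav", "header", "footer", "aside", "form"}
MAIN_TAGS = {"main", "article"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class MainTextExtractor(HTMLParser):
    """Collects visible text, tracking whether we are inside a main region."""

    def __init__(self):
        super().__init__()
        self.stack = []       # (tag, is_skip, is_main) per open element
        self.skip_depth = 0   # >0 while inside any boilerplate container
        self.main_depth = 0   # >0 while inside main/article/role="main"
        self.all_text = []
        self.main_text = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get a closing tag
        is_skip = tag in SKIP_TAGS
        is_main = tag in MAIN_TAGS or dict(attrs).get("role") == "main"
        self.stack.append((tag, is_skip, is_main))
        self.skip_depth += is_skip
        self.main_depth += is_main

    def handle_endtag(self, tag):
        if not any(open_tag == tag for open_tag, _, _ in self.stack):
            return  # stray closing tag, ignore it
        # Pop until the matching open tag, tolerating unclosed <p>, <li>, etc.
        while self.stack:
            open_tag, is_skip, is_main = self.stack.pop()
            self.skip_depth -= is_skip
            self.main_depth -= is_main
            if open_tag == tag:
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or self.skip_depth:
            return
        self.all_text.append(text)
        if self.main_depth:
            self.main_text.append(text)


def extract_main_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    # Prefer the declared main region; fall back to the whole cleaned page.
    return "\n".join(parser.main_text or parser.all_text)
```

When a page declares no main region, the sketch falls back to all text outside the skipped containers, which matches the "preferred when the page declares one" wording.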

Input

startUrls (pages to start from)
maxPages (hard cap per run, default 20, up to 500)
maxDepth (0 = start URLs only, default 2)
sameDomainOnly (default true, subdomains included)
includePatterns / excludePatterns (URL substring filters, e.g. only /docs/)
outputFormat (markdown, text, or html)
useSitemap (also seed the crawl from sitemap.xml)
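
To make the filter options concrete, here is a rough sketch of how a crawler might decide whether to follow a link. The function names `same_site` and `should_crawl` are hypothetical, and two behaviours are assumptions rather than documented facts: "subdomains included" is read as "the start host or any subdomain of it", and include patterns are applied to every URL.

```python
from urllib.parse import urlsplit


def same_site(url: str, start_url: str) -> bool:
    """True when url's host is the start host or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    start_host = (urlsplit(start_url).hostname or "").lower()
    return bool(host) and (host == start_host or host.endswith("." + start_host))


def should_crawl(url, start_url, depth, *, max_depth=2, same_domain_only=True,
                 include_patterns=(), exclude_patterns=()):
    """Apply maxDepth, sameDomainOnly, and the substring filters in turn."""
    if depth > max_depth:
        return False
    if same_domain_only and not same_site(url, start_url):
        return False
    # An empty include list means "no restriction".
    if include_patterns and not any(p in url for p in include_patterns):
        return False
    if any(p in url for p in exclude_patterns):
        return False
    return True
```

With the example input shown in the next section, `https://docs.apify.com/platform/actors` at depth 1 passes, while `https://docs.apify.com/api/v2` is dropped because it does not contain `/platform`.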

Example input

{
  "startUrls": ["https://docs.apify.com/platform"],
  "maxPages": 20,
  "maxDepth": 2,
  "includePatterns": ["/platform"],
  "outputFormat": "markdown"
}

Example output

{
  "url": "https://docs.apify.com/platform/actors",
  "finalUrl": "https://docs.apify.com/platform/actors",
  "depth": 1,
  "title": "Actors | Platform | Apify Documentation",
  "description": "Learn how to develop, run and share serverless cloud programs.",
  "lang": "en",
  "format": "markdown",
  "content": "# Actors\n\nActors are serverless cloud programs that can do almost anything a human can do in a web browser...",
  "wordCount": 412,
  "crawledAt": "2026-07-05T20:00:00.000Z"
}

Uses

Feed a RAG pipeline or vector database with a docs site, help center, or blog
Build fine-tuning or evaluation datasets from real site content
Keep a chatbot's knowledge base in sync with your product docs on a schedule
Monitor competitor marketing and docs pages as clean diffs instead of HTML soup
Archive a site's content as structured, searchable rows
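
For the RAG use case, output rows usually need to be split into smaller chunks before embedding. The sketch below shows one possible approach, splitting the `content` field at markdown headings and at a word budget. The `chunk_row` function and the 200-word default are assumptions for illustration, and the heading check is naive: a line starting with `# ` inside a fenced code block would also trigger a split.

```python
import re

HEADING = re.compile(r"^#{1,6} ")


def chunk_row(row: dict, max_words: int = 200) -> list[dict]:
    """Split one output row into heading-aligned chunks with source metadata."""
    chunks, current = [], []

    def flush():
        text = "\n".join(current).strip()
        if text:
            chunks.append({
                "url": row["url"],
                "title": row.get("title", ""),
                "chunk": len(chunks),  # zero-based position within the page
                "text": text,
            })
        current.clear()

    for line in row["content"].splitlines():
        words_so_far = sum(len(l.split()) for l in current)
        too_long = words_so_far + len(line.split()) > max_words
        # Start a new chunk at each heading, or when the budget would overflow.
        if current and (HEADING.match(line) or too_long):
            flush()
        current.append(line)
    flush()
    return chunks
```

Each chunk keeps the page `url` and `title`, so a retrieval hit can be traced back to its source page.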

Pricing

Pay per page. Only pages that return real content are pushed and charged; failed fetches, redirects to non-HTML resources, and empty pages cost nothing. The first 2 pages of every run are free so you can validate output before you scale up.
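
As a rough worked example of the arithmetic, the sketch below estimates the per-page charge for a run. It assumes the two free pages are simply subtracted from the number of pushed pages and ignores any platform usage fees; `estimate_run_cost` is a hypothetical helper, not part of the actor.

```python
from decimal import Decimal

PRICE_PER_1000_ROWS = Decimal("3.00")  # listed price in USD
FREE_PAGES_PER_RUN = 2                 # first 2 pages of every run are free


def estimate_run_cost(pushed_pages: int) -> Decimal:
    """Estimated charge in USD for a run that pushed `pushed_pages` rows."""
    billable = max(0, pushed_pages - FREE_PAGES_PER_RUN)
    return (PRICE_PER_1000_ROWS * billable / 1000).quantize(Decimal("0.001"))
```

Under these assumptions, a run at the default cap of 20 pages bills 18 pages, about $0.054, and a run at the 500-page maximum bills 498 pages, about $1.494.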

Notes

Plain HTTP fetching keeps runs fast and cheap. JavaScript-only sites (content rendered entirely client-side) are out of scope; most docs sites, blogs, and marketing sites work fine.
The crawler identifies itself with a descriptive User-Agent and fetches politely with capped concurrency.
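
Capped concurrency is commonly implemented with a semaphore. The sketch below shows the general pattern only: the `fetch_all` name, the injected `fetch` coroutine, and the default limit of 4 are assumptions, since the actor's real limit and fetch code are not documented here.

```python
import asyncio


async def fetch_all(urls, fetch, max_concurrency=4):
    """Run `fetch(url)` for every URL with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with semaphore:  # wait for a free slot before fetching
            return await fetch(url)

    # gather preserves input order in its result list
    return await asyncio.gather(*(bounded(u) for u in urls))
```

Passing the fetch function in as a parameter keeps the limiter independent of any particular HTTP client.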
