Pricing

$5.00/month + usage

AI Website Content Extractor

Crawl website pages, strip noise, and convert the main content to clean Markdown for RAG/LLM training.

Rating

5.0

(2)

Developer

ScrapeAI

Actor stats

Last modified: 4 months ago

Features

Crawl one or more public web pages
Automatically dismiss cookie / consent dialogs
Strip navigation bars, headers, footers, sidebars, ads, and modals
Detect the main content area using semantic HTML selectors (main, article, [role="main"], etc.)
Convert HTML to clean Markdown via turndown
Skip low-content pages (login walls, redirects) automatically
Output a structured dataset ready for AI use cases
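
To illustrate the noise-stripping and main-content detection described above, here is a minimal sketch using only the Python standard library. It is not the Actor's implementation (which is described as using semantic selectors and turndown); the tag sets, the `MainContentParser` class, and the `extract_main_text` helper are assumptions made for this example.

```python
from html.parser import HTMLParser

# Assumed tag sets for illustration; the Actor's full selector list is not documented here.
NOISE_TAGS = {"nav", "header", "footer", "aside", "script", "style"}
MAIN_TAGS = {"main", "article"}


class MainContentParser(HTMLParser):
    """Collect text inside <main>, <article>, or [role="main"], skipping noise tags."""

    def __init__(self):
        super().__init__()
        self.stack = []       # (tag, kind) for each open element
        self.main_depth = 0   # > 0 while inside a main-content container
        self.noise_depth = 0  # > 0 while inside a noise element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        kind = None
        if tag in NOISE_TAGS:
            kind = "noise"
            self.noise_depth += 1
        elif tag in MAIN_TAGS or dict(attrs).get("role") == "main":
            kind = "main"
            self.main_depth += 1
        self.stack.append((tag, kind))

    def handle_endtag(self, tag):
        if not any(open_tag == tag for open_tag, _ in self.stack):
            return  # stray closing tag, ignore it
        # Pop until the matching element; this also clears unclosed void tags like <br>.
        while self.stack:
            open_tag, kind = self.stack.pop()
            if kind == "noise":
                self.noise_depth -= 1
            elif kind == "main":
                self.main_depth -= 1
            if open_tag == tag:
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.main_depth > 0 and self.noise_depth == 0:
            self.chunks.append(text)


def extract_main_text(html: str) -> str:
    """Return the main-content text blocks joined by blank lines."""
    parser = MainContentParser()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.chunks)
```

One limitation of this sketch: a `<header>` nested inside an `<article>` is treated as noise, so an article title placed there would be dropped. A production extractor would need finer rules.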

Input

Field	Type	Description	Default
startUrls	Array	List of `{url}` objects or plain URL strings to crawl	`[{url: "https://example.com"}]`
maxPages	Number	Maximum number of pages to process	`20`
proxyConfiguration	Object	Apify proxy settings (optional)	`{}`

Example Input

{
    "startUrls": [
        { "url": "https://en.wikipedia.org/wiki/Artificial_intelligence" },
        { "url": "https://openai.com/blog" }
    ],
    "maxPages": 10
}
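
Because `startUrls` accepts both `{url}` objects and plain strings, it can help to normalize the input before starting a run. The sketch below shows one way to do that and to apply the defaults from the Input table. The `normalize_input` helper and its validation rules are hypothetical, and the Actor may validate input differently.

```python
from urllib.parse import urlparse

DEFAULT_START_URLS = [{"url": "https://example.com"}]  # default from the Input table
DEFAULT_MAX_PAGES = 20                                  # default from the Input table


def normalize_input(raw: dict) -> dict:
    """Return the input with startUrls as {"url": ...} dicts and defaults applied."""
    start_urls = []
    for entry in raw.get("startUrls", DEFAULT_START_URLS):
        if isinstance(entry, str):
            url = entry
        elif isinstance(entry, dict):
            url = entry.get("url")
        else:
            url = None
        # Assumed rule: only absolute http(s) URLs are accepted.
        if not isinstance(url, str) or urlparse(url).scheme not in ("http", "https"):
            raise ValueError(f"Invalid start URL: {entry!r}")
        start_urls.append({"url": url})

    max_pages = raw.get("maxPages", DEFAULT_MAX_PAGES)
    if not isinstance(max_pages, int) or max_pages < 1:
        raise ValueError("maxPages must be a positive integer")

    return {
        "startUrls": start_urls,
        "maxPages": max_pages,
        "proxyConfiguration": raw.get("proxyConfiguration", {}),
    }
```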

Output

Each extracted page produces one dataset record:

Field	Type	Description
url	String	URL of the crawled page
title	String	Page `<title>`
markdown	String	Clean Markdown of the main content
text	String	Plain text of the main content
wordCount	Number	Approximate word count of the Markdown
extractedAt	String	ISO 8601 timestamp

Example Output

{
    "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
    "title": "Artificial intelligence - Wikipedia",
    "markdown": "# Artificial intelligence\n\nArtificial intelligence (AI) is the simulation of human intelligence...",
    "text": "Artificial intelligence\n\nArtificial intelligence (AI) is the simulation of human intelligence...",
    "wordCount": 4312,
    "extractedAt": "2026-03-13T08:00:00.000Z"
}
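
As a rough illustration of how the derived fields relate to the content, the sketch below assembles a record with the same shape. The `build_record` helper is hypothetical, and splitting on whitespace is only one assumed way to get an approximate word count; the Actor's exact counting method is not documented here.

```python
from datetime import datetime, timezone


def build_record(url, title, markdown, text, now=None):
    """Assemble one dataset record in the documented shape."""
    now = now or datetime.now(timezone.utc)
    # ISO 8601 in UTC with millisecond precision and a trailing "Z".
    extracted_at = (
        now.astimezone(timezone.utc)
        .isoformat(timespec="milliseconds")
        .replace("+00:00", "Z")
    )
    return {
        "url": url,
        "title": title,
        "markdown": markdown,
        "text": text,
        # Approximate: whitespace tokens, so Markdown markers such as "#" are counted too.
        "wordCount": len(markdown.split()),
        "extractedAt": extracted_at,
    }
```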

Use Cases

RAG pipelines — ingest Markdown directly into your vector store
LLM fine-tuning — build clean text corpora from any website
AI chatbots — feed domain knowledge to your assistant
Research — extract and archive article content at scale
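
For the RAG use case, records are usually split into smaller chunks before embedding. The following is a minimal sketch of heading-based chunking over the `markdown` field. The `chunk_by_heading` helper and the chunk dictionary layout are assumptions for this example and are not part of the Actor's output.

```python
import re

# ATX headings: one to six "#" characters followed by whitespace.
HEADING = re.compile(r"^#{1,6}\s")


def chunk_by_heading(record: dict) -> list[dict]:
    """Split record["markdown"] into one chunk per heading section."""
    sections = []
    current = []
    for line in record["markdown"].splitlines():
        # Start a new section at each heading (lines inside code fences are not special-cased).
        if HEADING.match(line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())

    sections = [s for s in sections if s]  # drop empty sections
    return [
        {"url": record["url"], "chunk": index, "text": section}
        for index, section in enumerate(sections)
    ]
```

Each chunk keeps the source `url`, so retrieved passages can be traced back to the page they came from.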

Tips

Supply multiple startUrls to crawl several pages in one run
Increase maxPages to crawl an entire site (combine with Apify's link-following features)
For authenticated pages, configure a proxy or session in proxyConfiguration
