Deprecated

Pricing

from $2.00 / 1,000 websites analyzed

Website Markdown Crawler

Crawls a website and converts every page to clean Markdown optimized for LLM ingestion.

Rating

0.0

(0)

Developer

Ziad Tarik

Last modified

a month ago

Features

Clean Markdown Extraction: Strips noise (navigation, footers) to extract just the main content.
Smart Chunking: Splits content into token-bounded chunks that respect paragraph boundaries.
Language Filtering: Can automatically detect and filter pages by language (e.g., only en or fr).
Domain Control: Keeps the crawler scoped to the seed URL's domain.
Regex Exclusions: Skip non-valuable URLs like tags or author pages.
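
The Smart Chunking feature above can be illustrated with a minimal sketch. This is not the Actor's actual implementation: the function names `estimate_tokens` and `chunk_markdown`, the 4-characters-per-token heuristic, and the default `max_tokens=500` are all assumptions made for illustration. Only the output keys (`index`, `content`, `tokenEstimate`) mirror the record format shown on this page.

```python
import re


def estimate_tokens(text: str) -> int:
    """Rough token estimate. Assumes about 4 characters per token (a heuristic)."""
    if not text:
        return 0
    return max(1, len(text) // 4)


def chunk_markdown(markdown: str, max_tokens: int = 500) -> list:
    """Greedily pack whole paragraphs into chunks of at most max_tokens (estimated).

    Paragraphs are never split, so a single paragraph larger than
    max_tokens becomes its own oversized chunk.
    """
    # Paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", markdown) if p.strip()]
    chunks = []
    current = []

    def flush() -> None:
        # Emit the paragraphs collected so far as one chunk.
        if current:
            content = "\n\n".join(current)
            chunks.append({
                "index": len(chunks),
                "content": content,
                "tokenEstimate": estimate_tokens(content),
            })
            current.clear()

    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        # Start a new chunk if adding this paragraph would exceed the budget.
        if current and estimate_tokens(candidate) > max_tokens:
            flush()
        current.append(para)
    flush()
    return chunks
```

Packing whole paragraphs keeps each chunk readable on its own, at the cost of occasionally exceeding the budget when one paragraph is very long. A real tokenizer could replace the character heuristic if exact counts matter.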

Output Example

Each crawled page yields a structured JSON record:

{
  "url": "https://docs.example.com/getting-started",
  "title": "Getting Started — Example Docs",
  "description": "Learn how to set up Example in 5 minutes.",
  "language": "en",
  "wordCount": 842,
  "tokenEstimate": 1120,
  "headings": [
    { "level": 1, "text": "Getting Started" },
    { "level": 2, "text": "Installation" }
  ],
  "markdown": "# Getting Started\n\nLearn how to...",
  "chunks": [
    { "index": 0, "content": "# Getting Started\n\nLearn how to...", "tokenEstimate": 498 }
  ],
  "chunkCount": 1,
  "depth": 1,
  "crawledAt": "2026-06-10T14:32:00.000Z"
}
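
As a hedged sketch of how records in this shape might be post-processed, the snippet below filters a list of records by language and by URL pattern, then flattens them into one row per chunk. The function name `iter_chunk_rows`, its parameters, and the `id` format are hypothetical and not part of the Actor. Only the record keys (`url`, `title`, `language`, `chunks`, `index`, `content`, `tokenEstimate`) come from the example above.

```python
import re
from typing import Iterable, Iterator, Optional, Set


def iter_chunk_rows(
    records: Iterable[dict],
    languages: Optional[Set[str]] = None,
    exclude_patterns: Iterable[str] = (),
) -> Iterator[dict]:
    """Yield one flat row per chunk, skipping unwanted records.

    languages: keep only records whose "language" is in this set (None keeps all).
    exclude_patterns: regexes; a record is skipped if any of them matches its URL.
    """
    compiled = [re.compile(pattern) for pattern in exclude_patterns]
    for record in records:
        if languages is not None and record.get("language") not in languages:
            continue
        if any(rx.search(record["url"]) for rx in compiled):
            continue
        for chunk in record.get("chunks", []):
            yield {
                # Hypothetical ID scheme: page URL plus chunk index.
                "id": f'{record["url"]}#chunk-{chunk["index"]}',
                "url": record["url"],
                "title": record.get("title", ""),
                "content": chunk["content"],
                "tokenEstimate": chunk["tokenEstimate"],
            }
```

For example, `list(iter_chunk_rows(dataset_items, languages={"en"}, exclude_patterns=[r"/tag/", r"/author/"]))` would keep only English pages outside tag and author paths, assuming `dataset_items` holds records like the one shown above.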

Integrations

Connect the crawler directly into your RAG stack.

LlamaIndex

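import json

from llama_index.core import Document

# After running the Actor, download the dataset as JSON.
# "dataset_items.json" is an assumed local filename; use your own export path.
with open("dataset_items.json", encoding="utf-8") as f:
    dataset_items = json.load(f)

# One Document per chunk, keeping the source URL and chunk index as metadata
docs = [
    Document(text=chunk['content'], metadata={'url': item['url'], 'chunk': chunk['index']})
    for item in dataset_items
    for chunk in item['chunks']
]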

LangChain

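# Current LangChain releases expose Document from langchain_core;
# the older langchain.docstore.document path is deprecated.
from langchain_core.documents import Document as LCDoc

# dataset_items is the list of records from the Actor's dataset export
lc_docs = [
    LCDoc(page_content=chunk['content'], metadata={'source': item['url']})
    for item in dataset_items
    for chunk in item['chunks']
]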

🔥 FireScrape AI Website Content Markdown Scraper

mohamedgb00714/fireScraper-AI-Website-Content-Markdown-Scraper

Advanced web scraper powered by Crawlee and Puppeteer — extracts website content, converts it to Markdown, and structures it for LLM training datasets.

mohamed el hadi msaid

303

1.9

(2)

AI Website Content Markdown Scraper

quaking_pail/ai-website-content-markdown-scraper

This Apify Actor, "Website Content Crawler with Markdown Extraction," is designed to perform a comprehensive crawl of specified websites, extract their text content, convert it into Markdown format, and store it in a structured dataset. The extracted content is suitable for feeding LLMs.

AI_Builder

939

2.3

(3)

Website to Markdown Crawler for LLM & RAG

logiover/website-text-markdown-crawler

Crawl any website to clean Markdown and plain text for LLM training and RAG. HTML to Markdown, no API or login. Export website text to CSV or JSON.

Logiover

Website to Markdown Crawler - Full-Site Text for LLMs & RAG

entranced_gelato/website-to-markdown-crawler

Crawl any website from a start URL and get every page as clean text + Markdown for LLMs, RAG, and AI agents. Follows internal links with depth and page limits, strips nav and ads, and returns one structured record per page. A fast, no-config site-to-Markdown crawler.

AIDevs

Website to Markdown Scraper - HTML to MD for RAG API

pink_comic/website-content-to-markdown

Convert web pages and HTML to clean Markdown for RAG, LLM training, AI knowledge bases, and content migration. Strips nav, ads, scripts, and styling while preserving structure. Bulk output includes word/link/image counts.

Ava Torres

Website to Markdown Scraper — LLM & RAG Ready

perforated_hummingbird/url-to-markdown

Scrape any website into clean, LLM-ready Markdown. This URL-to-Markdown converter strips ads, nav, and boilerplate with Mozilla Readability — feed your AI models and RAG pipelines only real content. Batch URLs, optional JavaScript rendering, pay only for pages scraped successfully.

Damon Williams

Website to Markdown Scraper

receptional_blender/website-to-markdown-scraper

Crawl any website and turn its pages into clean Markdown — plus optional plain text, raw HTML and full-page screenshots. Built for LLM, RAG and AI training datasets.

Assia Fadli

Get Site to Markdown

b-w/get-site

Website to Markdown Crawler: an asynchronous web crawler that mirrors a website into a single organized Markdown file, handling images and preserving the directory structure. Designed to run at low cost, it works well for building context for AI agents.

b-w.pro

Simple Website Scrapper (markdown format)

manojaditya64/simple-website-scrapper-markdown-format

A simple website scraper that scrapes websites and converts them into Markdown, which is easy to use with LLMs. You can feed the Markdown to an LLM for easy analysis.

Manojaditya Nadar

5.0

(1)

Website to Text & Markdown — AI / RAG Content Crawler

inexhaustible_glass/rag-website-crawler

Scrape any website into clean text & Markdown with RAG-ready chunks and token counts for LLMs, vector databases (Pinecone, Qdrant) and AI chatbots. Also extracts linked PDF/Word/Excel. Anti-block, robots.txt-aware. Website-to-text for beginners, full RAG pipeline for pros. CPU only, no API key.