Website to Text & Markdown — AI / RAG Content Crawler

Scrape any website into clean text and Markdown, with RAG-ready chunks and token counts for LLMs, vector databases (Pinecone, Qdrant), and AI chatbots. Also extracts text from linked PDF, Word, and Excel files. Uses anti-block fetching and respects robots.txt. Simple website-to-text for beginners, a full RAG pipeline for pros. Runs on CPU only; no API key required.

Pricing

from $5.00 / 1,000 results

Developer

Hitman studio

🕷️ RAG Website Crawler — Markdown + Chunks + PDFs for AI

Turn any website into clean, LLM-ready data in one run. Built for RAG pipelines, AI chatbots, and vector databases (Pinecone, Qdrant, Weaviate…).

Why this one is better

Feature	Plain content crawlers	RAG Website Crawler
Clean Markdown	✅	✅
Auto chunks + token counts	❌ (extra step)	✅ built-in
PDF / Word / Excel extraction	❌ skipped	✅ included
Anti-block fetching	sometimes	✅ browser TLS + proxy
AI summary per page	❌	✅ optional, your own key
robots.txt + trap protection	varies	✅ built-in
GPU needed	—	❌ no, runs 100% on CPU

What you get per page

{
  "url": "https://site.com/docs/intro",
  "title": "Introduction",
  "markdown": "# Introduction\n\n...",
  "word_count": 812,
  "token_count": 1043,
  "chunk_count": 3,
  "chunks": [{ "index": 0, "text": "...", "tokens": 500 }],
  "is_document": false,
  "depth": 1,
  "content_hash": "…",
  "crawled_at": "2026-06-08T07:00:00Z"
}

Chunks are ready to embed straight into a vector DB.
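
As a rough illustration of what "ready to embed" can look like in practice, the sketch below flattens one page record into one record per chunk, carrying the page URL and title along as metadata. The function name `page_to_records`, the ID scheme, and the sample hash value `abc123` are assumptions made for this example; they are not part of the Actor's output. The embedding and upsert calls are left out because they depend on the vector database you use.

```python
from typing import Any


def page_to_records(page: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten one crawled page into one record per chunk.

    Hypothetical helper: the field names read from `page` follow the
    sample output above; the returned record layout is an assumption.
    """
    records = []
    for chunk in page.get("chunks", []):
        records.append({
            # Assumed ID scheme: page content hash plus chunk index.
            "id": f'{page["content_hash"]}-{chunk["index"]}',
            "text": chunk["text"],
            "metadata": {
                "url": page["url"],
                "title": page["title"],
                "tokens": chunk["tokens"],
                "crawled_at": page["crawled_at"],
            },
        })
    return records


# Sample page shaped like the output above; the hash is a placeholder.
example_page = {
    "url": "https://site.com/docs/intro",
    "title": "Introduction",
    "chunks": [{"index": 0, "text": "...", "tokens": 500}],
    "content_hash": "abc123",
    "crawled_at": "2026-06-08T07:00:00Z",
}

records = page_to_records(example_page)
```

Each record can then be embedded from its `text` field and stored with its `metadata`, so a retrieved chunk can always be traced back to its source page.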

Robust by design

Handles the classic crawler traps automatically:

Infinite loops / calendar traps → depth + page caps, trap heuristics
Duplicate URLs / content → URL normalisation + content-hash dedup
robots.txt & crawl-delay → respected (toggle)
Rate limits / blocks → polite delay + jitter + proxy + 429 backoff
Huge pages / memory → size cap, HTTP-only (no heavy browser)
Dead URLs → limited retries, never re-queued
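
To make the deduplication idea above concrete, here is a minimal sketch of URL normalisation combined with content-hash dedup, using only the Python standard library. It is an illustration under assumptions, not the Actor's actual implementation: the normalisation rules, the `TRACKING_PARAMS` set, the `Deduper` class, and the choice of SHA-256 are all hypothetical.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of query parameters that do not change page content.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}


def normalize_url(url: str) -> str:
    """Map equivalent URLs to one canonical string."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    default_port = (scheme == "http" and port == 80) or (scheme == "https" and port == 443)
    if port and not default_port:
        host = f"{host}:{port}"  # keep only non-default ports
    path = parts.path.rstrip("/") or "/"
    # Drop tracking parameters and sort the rest for a stable order.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    query = urlencode(sorted((k, v) for k, v in pairs if k not in TRACKING_PARAMS))
    return urlunsplit((scheme, host, path, query, ""))  # fragment removed


def content_hash(text: str) -> str:
    """Hash page text after collapsing whitespace."""
    collapsed = " ".join(text.split())
    return hashlib.sha256(collapsed.encode("utf-8")).hexdigest()


class Deduper:
    """Tracks URLs and page contents that were already seen."""

    def __init__(self) -> None:
        self.seen_urls: set[str] = set()
        self.seen_hashes: set[str] = set()

    def is_new_url(self, url: str) -> bool:
        key = normalize_url(url)
        if key in self.seen_urls:
            return False
        self.seen_urls.add(key)
        return True

    def is_new_content(self, text: str) -> bool:
        key = content_hash(text)
        if key in self.seen_hashes:
            return False
        self.seen_hashes.add(key)
        return True


dedup = Deduper()
first = dedup.is_new_url("https://site.com/docs/?utm_source=x&b=2&a=1#top")
second = dedup.is_new_url("HTTPS://Site.com:443/docs?a=1&b=2")
```

Here `first` is `True` and `second` is `False`, because both URLs normalise to the same string. The URL check avoids fetching a page twice, while the content hash catches the same page served under different URLs.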

Input (key options)

startUrls — where to begin
maxPages, maxDepth, sameDomainOnly, allowSubdomains
chunkSizeTokens, chunkOverlapTokens
includeDocuments — also crawl linked PDFs/Office files
respectRobotsTxt, crawlDelaySeconds, useProxy
aiProvider + aiApiKey (BYOK) — optional per-page AI summary
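
To show how chunkSizeTokens and chunkOverlapTokens interact, the sketch below splits a token list into overlapping windows. It is a simplified illustration: it treats whitespace-separated words as tokens, whereas the Actor's real tokenizer and chunking rules are not documented here. The function `chunk_tokens` and the tiny sample values are assumptions made for this example.

```python
def chunk_tokens(tokens: list[str], chunk_size: int, overlap: int) -> list[dict]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append({
            "index": len(chunks),
            "text": " ".join(window),
            "tokens": len(window),
        })
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks


# Deliberately tiny values so the overlap is easy to see by eye.
run_input = {"chunkSizeTokens": 4, "chunkOverlapTokens": 1}

words = "one two three four five six seven eight nine ten".split()
chunks = chunk_tokens(words, run_input["chunkSizeTokens"], run_input["chunkOverlapTokens"])
for chunk in chunks:
    print(chunk["index"], chunk["text"])
```

With these values each chunk starts on the last word of the previous one ("four", then "seven"). A larger overlap preserves more context across chunk boundaries at the cost of more chunks, and therefore more embeddings to store.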

Privacy

The optional AI summary uses your own API key. The aiApiKey input is marked isSecret, so it is stored encrypted and never logged. The Actor ships with no built-in key, so there is no key of ours that could be exposed.

What people use this for (search terms)

Whether you are a beginner who just wants to copy a website's text, or a developer building a production RAG pipeline, this Actor fits:

website to text · website to markdown · scrape website content · copy all pages of a website · website content downloader · website reader · extract text from a website · web page to text
data for AI · LLM-ready data · RAG crawler · vector database ingestion · embeddings input · knowledge base builder · AI chatbot training data · documentation scraper · docs to markdown
works with ChatGPT, Claude, Gemini, LangChain, LlamaIndex, Pinecone, Qdrant, Weaviate, Milvus, Supabase Vector
also: PDF scraper · crawl PDFs on a website · Word/Excel text extraction · sitemap crawler · whole-site crawler

Common use cases

Build an AI chatbot that answers questions about your website or docs
Feed a company knowledge base into a vector database for RAG
Turn documentation / help centers into clean Markdown for LLMs
Collect research content from many pages into one structured dataset
Extract text from PDFs and documents linked across a site
