Website to Text & Markdown — AI / RAG Content Crawler
Pricing
from $5.00 / 1,000 results
Scrape any website into clean text and Markdown, with RAG-ready chunks and token counts for LLMs, vector databases (Pinecone, Qdrant) and AI chatbots. Also extracts text from linked PDF, Word, and Excel files. Fetching is anti-block and robots.txt-aware. Simple website-to-text for beginners, a full RAG pipeline for pros. Runs on CPU only, and no API key is required (the optional AI summary uses your own key).
Developer
Hitman studio
Maintained by Community
🕷️ RAG Website Crawler — Markdown + Chunks + PDFs for AI
Turn any website into clean, LLM-ready data in one run. Built for RAG pipelines, AI chatbots, and vector databases (Pinecone, Qdrant, Weaviate…).
Why this one is better
| Feature | Plain content crawlers | RAG Website Crawler |
|---|---|---|
| Clean Markdown | ✅ | ✅ |
| Auto chunks + token counts | ❌ (extra step) | ✅ built-in |
| PDF / Word / Excel extraction | ❌ skipped | ✅ included |
| Anti-block fetching | sometimes | ✅ browser TLS + proxy |
| AI summary per page | ❌ | ✅ optional, your own key |
| robots.txt + trap protection | varies | ✅ built-in |
| GPU needed | — | ❌ 100% CPU |
What you get per page
{
  "url": "https://site.com/docs/intro",
  "title": "Introduction",
  "markdown": "# Introduction\n\n...",
  "word_count": 812,
  "token_count": 1043,
  "chunk_count": 3,
  "chunks": [
    { "index": 0, "text": "...", "tokens": 500 }
  ],
  "is_document": false,
  "depth": 1,
  "content_hash": "…",
  "crawled_at": "2026-06-08T07:00:00Z"
}
Chunks are ready to embed straight into a vector DB.
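As a rough illustration of that last step, the sketch below flattens one output record into one row per chunk, in the generic `id` / `text` / `metadata` shape that most vector stores accept. The function name `record_to_rows`, the ID scheme, and the metadata layout are assumptions made for this example, not part of the Actor; the embedding call and the database upsert are left out because they depend on the provider you choose.

```python
import hashlib


def record_to_rows(record):
    """Flatten one crawler record into one row per chunk (hypothetical helper)."""
    rows = []
    for chunk in record.get("chunks", []):
        # Assumed ID scheme: hash of URL + chunk index, so a re-crawl of the
        # same page overwrites existing rows instead of adding duplicates.
        raw_id = f"{record['url']}#{chunk['index']}"
        row_id = hashlib.sha256(raw_id.encode("utf-8")).hexdigest()[:32]
        rows.append({
            "id": row_id,
            "text": chunk["text"],  # this is the string you would embed
            "metadata": {
                "url": record["url"],
                "title": record.get("title", ""),
                "chunk_index": chunk["index"],
                "tokens": chunk.get("tokens"),
                "crawled_at": record.get("crawled_at"),
            },
        })
    return rows


# Abridged record using the field names from the sample output above.
example = {
    "url": "https://site.com/docs/intro",
    "title": "Introduction",
    "chunks": [{"index": 0, "text": "...", "tokens": 500}],
    "crawled_at": "2026-06-08T07:00:00Z",
}
rows = record_to_rows(example)
```

Each row's `text` would then be passed to an embedding model, and the resulting vector stored together with `id` and `metadata`.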
Robust by design
Handles the classic crawler traps automatically:
- Infinite loops / calendar traps → depth + page caps, trap heuristics
- Duplicate URLs / content → URL normalisation + content-hash dedup
- robots.txt & crawl-delay → respected (toggle via respectRobotsTxt)
- Rate limits / blocks → polite delay + jitter + proxy + 429 backoff
- Huge pages / memory → size cap, HTTP-only (no heavy browser)
- Dead URLs → limited retries, never re-queued
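To make the "URL normalisation + content-hash dedup" idea concrete, here is a minimal sketch of how such a filter could work. It is not the Actor's actual implementation: the normalisation rules, the `TRACKING_PARAMS` set, and the `Deduper` class are all assumptions chosen for illustration.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of query parameters that do not change page content.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content"}


def normalize_url(url):
    """Return a canonical form so trivially different URLs compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it is not the default for the scheme.
    if port and (scheme, port) not in {("http", 80), ("https", 443)}:
        host = f"{host}:{port}"
    # Treat "/docs/" and "/docs" as the same page; keep "/" for the root.
    path = parts.path.rstrip("/") or "/"
    # Drop tracking parameters and sort the rest for a stable order.
    pairs = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k not in TRACKING_PARAMS]
    query = urlencode(sorted(pairs))
    # The fragment is discarded: it never reaches the server.
    return urlunsplit((scheme, host, path, query, ""))


class Deduper:
    """Remember normalised URLs and content hashes seen during a crawl."""

    def __init__(self):
        self.seen_urls = set()
        self.seen_hashes = set()

    def is_new_url(self, url):
        key = normalize_url(url)
        if key in self.seen_urls:
            return False
        self.seen_urls.add(key)
        return True

    def is_new_content(self, text):
        # Collapse whitespace so formatting-only differences hash the same.
        digest = hashlib.sha256(" ".join(text.split()).encode("utf-8")).hexdigest()
        if digest in self.seen_hashes:
            return False
        self.seen_hashes.add(digest)
        return True
```

A crawl loop would call `is_new_url` before fetching a link and `is_new_content` after extracting the page text, skipping the page when either returns `False`.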
Input (key options)
- `startUrls`: where to begin
- `maxPages`, `maxDepth`, `sameDomainOnly`, `allowSubdomains`
- `chunkSizeTokens`, `chunkOverlapTokens`
- `includeDocuments`: also crawl linked PDFs/Office files
- `respectRobotsTxt`, `crawlDelaySeconds`, `useProxy`
- `aiProvider` + `aiApiKey` (BYOK): optional per-page AI summary
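The two chunking options are easiest to understand as a sliding window: each chunk holds up to `chunkSizeTokens` tokens, and consecutive chunks share `chunkOverlapTokens` tokens. The sketch below shows that idea under stated assumptions: it treats whitespace-separated words as tokens (the Actor's real tokenizer is not documented here), and `chunk_tokens` is a hypothetical helper, not the Actor's code.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split a token list into overlapping windows (illustrative only)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append({
            "index": len(chunks),
            "text": " ".join(window),
            "tokens": len(window),
        })
        # Stop once the window has reached the end of the input, so we do
        # not emit a trailing chunk that is fully contained in the last one.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Stand-in tokenizer: whitespace split (an assumption for this example).
words = "a b c d e f g h i j".split()
chunks = chunk_tokens(words, chunk_size=4, overlap=1)
```

With ten tokens, a chunk size of 4, and an overlap of 1, the window advances three tokens at a time, so the last token of each chunk reappears as the first token of the next. A larger overlap preserves more context across chunk boundaries at the cost of storing and embedding more tokens overall.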
Privacy
The AI summary uses your own key (isSecret, encrypted, never logged).
The Actor never ships any built-in key, so nothing of ours can be exposed.
What people use this for (search terms)
Whether you are a beginner who just wants to copy a website's text, or a developer building a production RAG pipeline, this Actor fits:
- website to text · website to markdown · scrape website content · copy all pages of a website · website content downloader · website reader · extract text from a website · web page to text
- data for AI · LLM-ready data · RAG crawler · vector database ingestion · embeddings input · knowledge base builder · AI chatbot training data · documentation scraper · docs to markdown
- works with ChatGPT, Claude, Gemini, LangChain, LlamaIndex, Pinecone, Qdrant, Weaviate, Milvus, Supabase Vector
- also: PDF scraper · crawl PDFs on a website · Word/Excel text extraction · sitemap crawler · whole-site crawler
Common use cases
- Build an AI chatbot that answers questions about your website or docs
- Feed a company knowledge base into a vector database for RAG
- Turn documentation / help centers into clean Markdown for LLMs
- Collect research content from many pages into one structured dataset
- Extract text from PDFs and documents linked across a site