Website to Text & Markdown — AI / RAG Content Crawler

Pricing

from $5.00 / 1,000 results

Scrape any website into clean text and Markdown, with RAG-ready chunks and token counts for LLMs, vector databases (Pinecone, Qdrant), and AI chatbots. It also extracts text from linked PDF, Word, and Excel files. Fetching is anti-block and robots.txt-aware. Beginners get simple website-to-text output; advanced users get a full RAG pipeline. Runs on CPU only, with no API key required.

Developer

Hitman studio

Maintained by Community

🕷️ RAG Website Crawler — Markdown + Chunks + PDFs for AI

Turn any website into clean, LLM-ready data in one run. Built for RAG pipelines, AI chatbots, and vector databases (Pinecone, Qdrant, Weaviate…).

Why this one is better

Feature                       | Plain content crawlers | RAG Website Crawler
Clean Markdown                |                        |
Auto chunks + token counts    | ❌ (extra step)        | ✅ built-in
PDF / Word / Excel extraction | ❌ skipped             | ✅ included
Anti-block fetching           | sometimes              | ✅ browser TLS + proxy
AI summary per page           |                        | ✅ optional, your own key
robots.txt + trap protection  | varies                 | ✅ built-in
GPU needed                    |                        | ❌ 100% CPU

What you get per page

{
  "url": "https://site.com/docs/intro",
  "title": "Introduction",
  "markdown": "# Introduction\n\n...",
  "word_count": 812,
  "token_count": 1043,
  "chunk_count": 3,
  "chunks": [{ "index": 0, "text": "...", "tokens": 500 }],
  "is_document": false,
  "depth": 1,
  "content_hash": "…",
  "crawled_at": "2026-06-08T07:00:00Z"
}

Chunks are ready to embed straight into a vector DB.
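
As a rough illustration of what "ready to embed" can look like in practice, the sketch below flattens one page record into one record per chunk, with a stable ID and some metadata. The field names come from the sample output above; the helper name `chunks_to_records`, the ID scheme, and the sample values are assumptions made for this example, not part of the Actor.

```python
import hashlib


def chunks_to_records(page: dict) -> list[dict]:
    """Flatten one crawled page into one record per chunk (hypothetical helper)."""
    records = []
    for chunk in page.get("chunks", []):
        # Deterministic ID: the same URL + chunk index always gives the same ID,
        # so re-ingesting a re-crawled page can overwrite instead of duplicating.
        raw_id = f"{page['url']}#{chunk['index']}"
        records.append({
            "id": hashlib.sha256(raw_id.encode("utf-8")).hexdigest()[:16],
            "text": chunk["text"],
            "metadata": {
                "url": page["url"],
                "title": page.get("title", ""),
                "chunk_index": chunk["index"],
                "tokens": chunk["tokens"],
            },
        })
    return records


# Made-up sample shaped like the output above (values are illustrative only).
page = {
    "url": "https://site.com/docs/intro",
    "title": "Introduction",
    "chunks": [
        {"index": 0, "text": "First chunk of text.", "tokens": 5},
        {"index": 1, "text": "Second chunk of text.", "tokens": 5},
    ],
}

records = chunks_to_records(page)
for record in records:
    print(record["id"], record["metadata"]["chunk_index"], record["text"])
```

Each record's `text` would then go to the embedding model of your choice, and the resulting vector can be stored next to `id` and `metadata` in whichever vector database you use.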

Robust by design

Handles the classic crawler traps automatically:

  • Infinite loops / calendar traps → depth + page caps, trap heuristics
  • Duplicate URLs / content → URL normalisation + content-hash dedup
  • robots.txt & crawl-delay → respected (toggle)
  • Rate limits / blocks → polite delay + jitter + proxy + 429 backoff
  • Huge pages / memory → size cap, HTTP-only (no heavy browser)
  • Dead URLs → limited retries, never re-queued
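
To make the "URL normalisation + content-hash dedup" item more concrete, here is a minimal sketch of one common way to do it. It is not the Actor's actual code: the normalisation rules (lowercase host, drop fragment and default port, sort query parameters, trim trailing slash) and the names `normalize_url`, `content_hash`, and `SeenFilter` are assumptions chosen for illustration.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Map trivially different spellings of a URL to one canonical form."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    if parts.port and DEFAULT_PORTS.get(scheme) != parts.port:
        host = f"{host}:{parts.port}"
    # Treat "/docs/" and "/docs" as the same page; keep "/" for the root.
    path = parts.path.rstrip("/") or "/"
    # Sort query parameters so "?b=2&a=1" and "?a=1&b=2" compare equal.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # The fragment never reaches the server, so it is dropped.
    return urlunsplit((scheme, host, path, query, ""))


def content_hash(text: str) -> str:
    """Hash the text after collapsing whitespace, so layout noise is ignored."""
    collapsed = " ".join(text.split())
    return hashlib.sha256(collapsed.encode("utf-8")).hexdigest()


class SeenFilter:
    """Remembers which URLs and page bodies have already been processed."""

    def __init__(self) -> None:
        self.urls: set[str] = set()
        self.hashes: set[str] = set()

    def is_new_url(self, url: str) -> bool:
        key = normalize_url(url)
        if key in self.urls:
            return False
        self.urls.add(key)
        return True

    def is_new_content(self, text: str) -> bool:
        digest = content_hash(text)
        if digest in self.hashes:
            return False
        self.hashes.add(digest)
        return True


seen = SeenFilter()
print(seen.is_new_url("https://Site.com/docs/?b=2&a=1#top"))  # first visit
print(seen.is_new_url("https://site.com/docs?a=1&b=2"))       # same page, different spelling
print(seen.is_new_content("Hello   world\n"))
print(seen.is_new_content("Hello world"))                     # same text, different whitespace
```

The URL check avoids fetching the same page twice, while the content check catches the case where two genuinely different URLs serve identical text.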

Input (key options)

  • startUrls — where to begin
  • maxPages, maxDepth, sameDomainOnly, allowSubdomains
  • chunkSizeTokens, chunkOverlapTokens
  • includeDocuments — also crawl linked PDFs/Office files
  • respectRobotsTxt, crawlDelaySeconds, useProxy
  • aiProvider + aiApiKey (BYOK) — optional per-page AI summary
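
The interaction between chunkSizeTokens and chunkOverlapTokens can be sketched as a sliding window. This is only an illustration under stated assumptions: the whitespace split stands in for a real tokenizer, the tiny option values are chosen so the output is easy to read, and `chunk_tokens` is a hypothetical helper, not the Actor's implementation.

```python
def chunk_tokens(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the window has reached the end of the text, so no
        # trailing chunk made only of already-covered tokens is emitted.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Option names are from the input list above; the values are illustrative.
options = {"chunkSizeTokens": 4, "chunkOverlapTokens": 1}

text = "one two three four five six seven eight nine ten"
tokens = text.split()  # stand-in for a real tokenizer

chunks = chunk_tokens(tokens, options["chunkSizeTokens"], options["chunkOverlapTokens"])
for index, chunk in enumerate(chunks):
    print(index, " ".join(chunk))
```

A larger overlap repeats more context at chunk boundaries, which can help retrieval when an answer straddles two chunks, at the cost of more chunks to embed and store.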

Privacy

The optional AI summary uses your own API key. The key is declared as a secret input (isSecret), so it is encrypted and never logged. The Actor ships no built-in key, so there is no key of ours that could be exposed.

What people use this for (search terms)

Whether you are a beginner who just wants to copy a website's text, or a developer building a production RAG pipeline, this Actor fits:

  • website to text · website to markdown · scrape website content · copy all pages of a website · website content downloader · website reader · extract text from a website · web page to text
  • data for AI · LLM-ready data · RAG crawler · vector database ingestion · embeddings input · knowledge base builder · AI chatbot training data · documentation scraper · docs to markdown
  • works with ChatGPT, Claude, Gemini, LangChain, LlamaIndex, Pinecone, Qdrant, Weaviate, Milvus, Supabase Vector
  • also: PDF scraper · crawl PDFs on a website · Word/Excel text extraction · sitemap crawler · whole-site crawler

Common use cases

  • Build an AI chatbot that answers questions about your website or docs
  • Feed a company knowledge base into a vector database for RAG
  • Turn documentation / help centers into clean Markdown for LLMs
  • Collect research content from many pages into one structured dataset
  • Extract text from PDFs and documents linked across a site