Pricing

from $5.00 / 1,000 results

Website to Text & Markdown — AI / RAG Content Crawler

Scrape any website into clean text and Markdown, with RAG-ready chunks and token counts for LLMs, vector databases (Pinecone, Qdrant), and AI chatbots. Also extracts text from linked PDF, Word, and Excel files. Anti-block fetching, robots.txt-aware. Simple website-to-text for beginners, a full RAG pipeline for pros. CPU only, no API key required.

Developer

Hitman studio

Last modified

3 days ago

🕷️ RAG Website Crawler — Markdown + Chunks + PDFs for AI

Turn any website into clean, LLM-ready data in one run. Built for RAG pipelines, AI chatbots, and vector databases (Pinecone, Qdrant, Weaviate…).

Why this one is better

Feature	Plain content crawlers	RAG Website Crawler
Clean Markdown	✅	✅
Auto chunks + token counts	❌ (extra step)	✅ built-in
PDF / Word / Excel extraction	❌ skipped	✅ included
Anti-block fetching	sometimes	✅ browser TLS + proxy
AI summary per page	❌	✅ optional, your own key
robots.txt + trap protection	varies	✅ built-in
No GPU needed	—	✅ runs 100% on CPU

What you get per page

{
  "url": "https://site.com/docs/intro",
  "title": "Introduction",
  "markdown": "# Introduction\n\n...",
  "word_count": 812,
  "token_count": 1043,
  "chunk_count": 3,
  "chunks": [{ "index": 0, "text": "...", "tokens": 500 }],
  "is_document": false,
  "depth": 1,
  "content_hash": "…",
  "crawled_at": "2026-06-08T07:00:00Z"
}
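
As a rough illustration of how `chunks`, `chunk_count`, and the per-chunk `tokens` field relate, here is a minimal sketch of fixed-size chunking with overlap. It is not the Actor's actual implementation: the `chunk_tokens` helper is hypothetical, and whitespace-separated words stand in for real tokens, so a real tokenizer would give different counts.

```python
from typing import Dict, List


def chunk_tokens(tokens: List[str], size: int, overlap: int) -> List[Dict]:
    """Split a token list into windows of at most `size` tokens.

    Consecutive windows share `overlap` tokens. Hypothetical helper,
    not the Actor's own chunker.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")

    step = size - overlap
    chunks: List[Dict] = []
    start = 0
    while start < len(tokens):
        window = tokens[start:start + size]
        chunks.append({
            "index": len(chunks),
            "text": " ".join(window),
            "tokens": len(window),
        })
        if start + size >= len(tokens):
            break  # this window already reached the end of the page
        start += step
    return chunks


# Assumption: whitespace-separated words stand in for real tokens.
page_tokens = ("word " * 1043).split()
chunks = chunk_tokens(page_tokens, size=500, overlap=50)
print(len(chunks), [c["tokens"] for c in chunks])  # 3 [500, 500, 143]
```

With a window of 500 and an overlap of 50, a 1,043-token page yields three chunks, and the last one holds the remainder.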

Chunks are ready to embed straight into a vector DB.
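
To make that concrete, the sketch below flattens one page record into one record per chunk, a shape many vector stores accept once an embedding is attached. Field names come from the sample output above; the `to_chunk_records` helper, the ID scheme, and the metadata layout are assumptions for illustration. The embedding and upsert calls are left out because they depend on your provider.

```python
import hashlib
from typing import Dict, List


def to_chunk_records(page: Dict) -> List[Dict]:
    """Turn one crawled page into per-chunk records ready for embedding."""
    records = []
    for chunk in page.get("chunks", []):
        # Stable ID: hash of the page URL plus the chunk index (assumed scheme).
        raw_id = f'{page["url"]}#{chunk["index"]}'
        records.append({
            "id": hashlib.sha256(raw_id.encode("utf-8")).hexdigest()[:16],
            "text": chunk["text"],
            "metadata": {
                "url": page["url"],
                "title": page.get("title", ""),
                "chunk_index": chunk["index"],
                "tokens": chunk["tokens"],
            },
        })
    return records


# Sample page with made-up chunk text, shaped like the Actor's output.
page = {
    "url": "https://site.com/docs/intro",
    "title": "Introduction",
    "chunks": [
        {"index": 0, "text": "first chunk", "tokens": 2},
        {"index": 1, "text": "second chunk", "tokens": 2},
    ],
}
records = to_chunk_records(page)
print(len(records), records[0]["metadata"]["url"])
```

Deriving the ID from the URL and chunk index means a re-crawl of the same page overwrites its old records instead of duplicating them, provided the chunk boundaries stay the same.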

Robust by design

Handles the classic crawler traps automatically:

Infinite loops / calendar traps → depth + page caps, trap heuristics
Duplicate URLs / content → URL normalisation + content-hash dedup
robots.txt & crawl-delay → respected (toggle)
Rate limits / blocks → polite delay + jitter + proxy + 429 backoff
Huge pages / memory → size cap, HTTP-only (no heavy browser)
Dead URLs → limited retries, never re-queued
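
The URL normalisation and content-hash dedup mentioned in the list can be sketched roughly as follows. This is an illustration only, not the Actor's code: the list of tracking parameters, the trailing-slash rule, and the `Deduper` class are all assumptions.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these tracking parameters never change page content.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop fragment and tracking params, sort the query.

    Assumes the URL carries no userinfo and that a trailing slash is not significant.
    """
    parts = urlsplit(url)
    query = sorted(
        (k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in TRACKING_PARAMS
    )
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        path,
        urlencode(query),
        "",  # fragment removed
    ))


def content_hash(text: str) -> str:
    """Hash whitespace-normalised text so trivially different copies match."""
    canonical = " ".join(text.split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class Deduper:
    """Tracks seen URLs and seen content hashes (hypothetical helper)."""

    def __init__(self) -> None:
        self.seen_urls = set()
        self.seen_hashes = set()

    def is_new_url(self, url: str) -> bool:
        key = normalize_url(url)
        if key in self.seen_urls:
            return False
        self.seen_urls.add(key)
        return True

    def is_new_content(self, text: str) -> bool:
        digest = content_hash(text)
        if digest in self.seen_hashes:
            return False
        self.seen_hashes.add(digest)
        return True


dedup = Deduper()
print(dedup.is_new_url("https://Site.com/docs/?utm_source=x#top"))  # True
print(dedup.is_new_url("https://site.com/docs"))  # False, same page after normalisation
```

Checking the URL before fetching saves a request; checking the content hash after fetching catches the same page served under genuinely different URLs.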

Input (key options)

startUrls — where to begin
maxPages, maxDepth, sameDomainOnly, allowSubdomains
chunkSizeTokens, chunkOverlapTokens
includeDocuments — also crawl linked PDFs/Office files
respectRobotsTxt, crawlDelaySeconds, useProxy
aiProvider + aiApiKey (BYOK, bring your own key) — optional per-page AI summary
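
For illustration, the options above could be assembled into a run input like the sketch below. Option names are taken from the list; every value, the value types, and the `{"url": ...}` shape of `startUrls` are assumptions, so check the Actor's input schema before relying on them. The `build_run_input` helper is hypothetical.

```python
import json
from typing import Dict, List


def build_run_input(start_urls: List[str], chunk_size: int = 500,
                    chunk_overlap: int = 50) -> Dict:
    """Build a run input dict with basic sanity checks (assumed value types)."""
    if not start_urls:
        raise ValueError("at least one start URL is required")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk overlap must be >= 0 and smaller than chunk size")
    return {
        "startUrls": [{"url": u} for u in start_urls],  # assumed shape
        "maxPages": 100,
        "maxDepth": 3,
        "sameDomainOnly": True,
        "allowSubdomains": False,
        "chunkSizeTokens": chunk_size,
        "chunkOverlapTokens": chunk_overlap,
        "includeDocuments": True,
        "respectRobotsTxt": True,
        "crawlDelaySeconds": 1,
        "useProxy": False,
        # aiProvider / aiApiKey omitted: the per-page AI summary is optional.
    }


run_input = build_run_input(["https://site.com/docs/intro"])
print(json.dumps(run_input, indent=2))
```

The printed JSON is the kind of object you would supply as the Actor's input, whether through the Console or an API client.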

Privacy

The optional AI summary uses your own API key, supplied through the aiApiKey input (marked isSecret, so it is encrypted and never logged). The Actor ships no built-in key, so there is no key of ours that could be exposed.

What people use this for (search terms)

Whether you are a beginner who just wants to copy a website's text, or a developer building a production RAG pipeline, this Actor fits:

website to text · website to markdown · scrape website content · copy all pages of a website · website content downloader · website reader · extract text from a website · web page to text
data for AI · LLM-ready data · RAG crawler · vector database ingestion · embeddings input · knowledge base builder · AI chatbot training data · documentation scraper · docs to markdown
works with ChatGPT, Claude, Gemini, LangChain, LlamaIndex, Pinecone, Qdrant, Weaviate, Milvus, Supabase Vector
also: PDF scraper · crawl PDFs on a website · Word/Excel text extraction · sitemap crawler · whole-site crawler

Common use cases

Build an AI chatbot that answers questions about your website or docs
Feed a company knowledge base into a vector database for RAG
Turn documentation / help centers into clean Markdown for LLMs
Collect research content from many pages into one structured dataset
Extract text from PDFs and documents linked across a site
