RAG Website Intelligence Crawler

Pricing: from $0.30 / 1,000 results

Crawl websites into clean Markdown + RAG-ready chunks. Outputs structural sitemap (KV), SimHash deduplication, and optional change detection (baseline + diff) for docs monitoring & KB sync. Built on Crawlee (Playwright) with optional Stagehand schema extraction.


Rating: 0.0 (0 reviews)

Developer: Hayder Al-Khalissi (Maintained by Community)

Actor stats: 0 bookmarks · 2 total users · 1 monthly active user · last modified 3 days ago


RAG-Ready Website Intelligence Crawler

What is RAG-Ready Website Intelligence Crawler?

RAG-Ready Website Intelligence Crawler is an Apify Actor that crawls websites and turns pages into clean Markdown, RAG-ready chunks, and a structured sitemap. It is designed for teams that need reliable content extraction for LLM search, internal knowledge bases, or documentation monitoring.

You provide one or more start URLs and basic crawl rules, and the Actor handles traversal, extraction, deduplication, and optional change detection. Runs can be triggered manually, on schedule, or via API.

What can this Actor do?

  • Extract clean, boilerplate-reduced page content using Readability + Turndown.
  • Generate RAG chunks using two strategies:
    • byHeading for semantic chunks aligned with heading hierarchy.
    • fixedTokens (character window + overlap) for uniform chunk sizing.
  • Detect near-duplicate pages (SimHash/MinHash options) to reduce noisy indexing.
  • Track content changes across runs with baseline + diff outputs.
  • Build a crawl graph/sitemap (sitemap.json) for structural analysis.
  • Optionally use Stagehand + Zod schema extraction for hard-to-parse layouts.
  • Optionally download linked files (for example PDF/DOCX) within configured limits.
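To make the fixedTokens strategy above concrete, here is a minimal sketch of character-window chunking with overlap. The function name, default sizes, and chunkId format are illustrative, not the Actor's internals:

```python
def fixed_window_chunks(text, max_chars=200, overlap_chars=50):
    """Split text into overlapping character windows.

    Consecutive windows advance by (max_chars - overlap_chars), so each
    chunk shares its last `overlap_chars` characters with the next one.
    """
    if max_chars <= overlap_chars:
        raise ValueError("max_chars must exceed overlap_chars")
    step = max_chars - overlap_chars
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + max_chars]
        chunks.append({
            "chunkId": f"doc::{len(chunks) + 1}",  # illustrative ID scheme
            "text": piece,
            "charCount": len(piece),
        })
        if start + max_chars >= len(text):
            break  # last window already reached the end of the text
    return chunks
```

Uniform windows like this trade semantic alignment for predictable chunk sizes, which is why byHeading is usually preferable for well-structured docs.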

Why use it on Apify?

Using this Actor on Apify gives you platform capabilities out of the box:

  • Scheduled crawls for continuous content sync.
  • API-first operation for pipeline automation.
  • Key-value store + dataset storage for downstream processing.
  • Proxy configuration (including Apify Proxy groups) for difficult targets.
  • Monitoring, logs, and run history for debugging and reliability.

What data can this Actor extract?

  • url / finalUrl: Original and final resolved URL
  • title: Page title
  • markdown / text: Clean extracted content
  • chunks[]: RAG chunks with chunkId, headingPath, text, charCount
  • links.internal / links.external: Links discovered on the page
  • meta: Metadata such as h1, canonical, robots, timestamp
  • diff: Added/removed/modified chunks (when change detection is enabled)
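Dataset items with these fields are straightforward to flatten into one record per chunk for an embedding pipeline. A minimal sketch, assuming the page/chunk fields listed above (the output record layout is illustrative):

```python
def flatten_for_embedding(pages):
    """Turn page-level dataset items into one record per chunk.

    Each record carries the chunk text plus URL/title/heading metadata,
    ready to hand to an embedding or vector-store client.
    """
    records = []
    for page in pages:
        for chunk in page.get("chunks", []):
            records.append({
                "id": chunk["chunkId"],
                "text": chunk["text"],
                "metadata": {
                    "url": page["url"],
                    "title": page.get("title", ""),
                    "headingPath": " > ".join(chunk.get("headingPath", [])),
                },
            })
    return records
```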

How do I use RAG-Ready Website Intelligence Crawler to scrape website data?

  1. Open the Actor in Apify Console and click Try for free.
  2. Add one or more URLs to startUrls.
  3. Set crawl scope (maxPages, maxDepth, sameDomainOnly, include/exclude patterns).
  4. Choose extraction mode:
    • rag-chunks (default)
    • clean-markdown
    • stagehand-schema (experimental, requires LLM key)
  5. Configure optional features (dedup, changeDetection, downloadFiles, proxy).
  6. Run the Actor and inspect:
    • Dataset for page-level output
    • Key-value store for sitemap.json, stats.json, and optional diff.json

How much does it cost to scrape website data?

Costs depend on site complexity, JavaScript rendering, proxy usage, and enabled features.

  • Lower-cost runs: smaller maxPages, no downloads, no change detection, standard extraction.
  • Higher-cost runs: large sites, heavy rendering, residential proxy usage, or Stagehand schema extraction.

If you are testing, start with a small crawl (for example 20-50 pages), review output quality, then scale gradually.

Input example

See example_input.json for a complete template.

Common input fields:

  • startUrls
  • maxPages, maxDepth
  • sameDomainOnly
  • includePatterns, excludePatterns
  • render: auto | playwright | http
  • extractMode: clean-markdown | rag-chunks | stagehand-schema
  • chunking: strategy, maxChars, overlapChars
  • dedup: enabled, method, threshold
  • changeDetection: enabled, baselineRunKvKey, emitDiffOnly, saveBaseline
  • downloadFiles
  • proxy
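A hedged input sketch using the fields above. The nested option names and all values are illustrative; consult example_input.json for the authoritative template:

```json
{
  "startUrls": [{ "url": "https://example.com/docs" }],
  "maxPages": 50,
  "maxDepth": 3,
  "sameDomainOnly": true,
  "includePatterns": ["/docs/"],
  "excludePatterns": ["/blog/"],
  "render": "auto",
  "extractMode": "rag-chunks",
  "chunking": { "strategy": "byHeading", "maxChars": 1200, "overlapChars": 100 },
  "dedup": { "enabled": true, "method": "simhash", "threshold": 3 },
  "changeDetection": { "enabled": false },
  "downloadFiles": false,
  "proxy": { "useApifyProxy": true }
}
```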

Output example

Dataset output (per page)

{
  "url": "https://example.com/docs/getting-started",
  "finalUrl": "https://example.com/docs/getting-started",
  "title": "Getting Started",
  "httpStatus": 200,
  "markdown": "# Getting Started\n\nWelcome to the docs...",
  "chunks": [
    {
      "chunkId": "getting-started::1",
      "headingPath": ["Getting Started"],
      "text": "Welcome to the docs...",
      "charCount": 118
    }
  ],
  "links": {
    "internal": ["https://example.com/docs/install"],
    "external": []
  },
  "meta": {
    "h1": "Getting Started",
    "canonical": "https://example.com/docs/getting-started",
    "robots": "index,follow",
    "timestamp": "2026-03-02T12:00:00.000Z"
  }
}

Key-value store artifacts

  • sitemap.json: crawl graph and section grouping
  • stats.json: crawl/chunk/dedup summary metrics
  • diff.json: URL/chunk-level differences (when enabled)
  • baseline key (changeDetection.baselineRunKvKey): prior snapshot for future diffs
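Chunk-level change detection can be approximated as a set comparison between a baseline snapshot and the current run. A minimal sketch, where the {chunkId: text} snapshot shape is an assumption for illustration (the Actor's baseline format may differ):

```python
import hashlib

def chunk_diff(baseline, current):
    """Compare two {chunkId: text} snapshots.

    Returns sorted lists of chunk IDs that were added, removed, or whose
    content hash changed between the baseline and the current crawl.
    """
    digest = lambda t: hashlib.sha256(t.encode("utf-8")).hexdigest()
    base = {cid: digest(text) for cid, text in baseline.items()}
    cur = {cid: digest(text) for cid, text in current.items()}
    return {
        "added": sorted(cur.keys() - base.keys()),
        "removed": sorted(base.keys() - cur.keys()),
        "modified": sorted(
            cid for cid in base.keys() & cur.keys() if base[cid] != cur[cid]
        ),
    }
```

Hashing the text (rather than storing it) keeps the baseline compact while still catching any content change.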

FAQ, disclaimers, and support

When should I use rag-chunks vs clean-markdown?

Use rag-chunks when you plan to embed and retrieve content in a RAG system. Use clean-markdown when you only need cleaned page content without chunk metadata.
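To make the byHeading idea concrete, here is a sketch of heading-aligned chunking over Markdown. It handles ATX (#) headings only and is an illustration, not the Actor's actual splitter:

```python
import re

def chunks_by_heading(markdown):
    """Split Markdown into chunks at ATX headings, tracking the heading path.

    A level-N heading truncates the path to depth N-1 before appending
    itself, so each chunk records its full heading hierarchy.
    """
    path, buf, chunks = [], [], []

    def flush():
        text = "\n".join(buf).strip()
        if text:
            chunks.append({"headingPath": list(path), "text": text})
        buf.clear()

    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)$", line)
        if m:
            flush()
            level = len(m.group(1))
            del path[level - 1:]
            path.append(m.group(2).strip())
        else:
            buf.append(line)
    flush()
    return chunks
```

Because each chunk follows the document's own hierarchy, retrieval hits come back with meaningful context such as ["Getting Started", "Install"].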

What is the Stagehand mode for?

stagehand-schema is useful for pages where static extraction is fragile. It can improve resilience on changing layouts, but it needs an LLM API key and may increase run cost.

How does deduplication work?

The Actor compares content similarity and can skip near-duplicate pages using configured method/threshold values, helping reduce duplicate embeddings and storage overhead.
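SimHash-style near-duplicate detection can be sketched as follows. Token hashing via MD5 and the default threshold of 3 differing bits are illustrative assumptions, not the Actor's exact configuration:

```python
import hashlib
import re

def simhash(text, bits=64):
    """64-bit SimHash: each token votes +1/-1 per bit position.

    The final fingerprint sets each bit to the sign of its vote total, so
    documents sharing most tokens land on nearly identical fingerprints.
    """
    votes = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def is_near_duplicate(text_a, text_b, threshold=3):
    """Treat two pages as near-duplicates when their fingerprints differ
    in at most `threshold` bits."""
    return hamming_distance(simhash(text_a), simhash(text_b)) <= threshold
```

Unlike exact hashing, this tolerates small edits (a changed footer, an extra sentence) while still separating genuinely different pages.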

How do I handle Cloudflare or bot protection?

Use Apify Proxy (often residential groups), tune retries/concurrency, and enable signed crawler identification where available.

You are responsible for complying with target-site terms, applicable laws, and data-protection requirements in your jurisdiction and use case.

Where can I get help or request improvements?

  • Check run logs and stats.json first for troubleshooting.
  • Use the Actor Issues tab for bug reports and feature requests.
  • For custom extraction requirements, open an issue with sample URLs and expected output shape.

Rate this Actor

If this Actor saved you time or improved your pipeline, please rate it on Apify. Your rating helps other users discover it and helps prioritize future improvements.