RAG Website Intelligence Crawler

Pricing: from $0.30 / 1,000 results

Crawl websites into clean Markdown + RAG-ready chunks. Outputs structural sitemap (KV), SimHash deduplication, and optional change detection (baseline + diff) for docs monitoring & KB sync. Built on Crawlee (Playwright) with optional Stagehand schema extraction.


Rating: 0.0 (0 reviews)

Developer: Hayder Al-Khalissi (Maintained by Community)

Actor stats: 0 bookmarks · 2 total users · 1 monthly active user · last modified 3 days ago


RAG-Ready Website Intelligence Crawler

What is RAG-Ready Website Intelligence Crawler?

RAG-Ready Website Intelligence Crawler is an Apify Actor that crawls websites and turns pages into clean Markdown, RAG-ready chunks, and a structured sitemap. It is designed for teams that need reliable content extraction for LLM search, internal knowledge bases, or documentation monitoring.

You provide one or more start URLs and basic crawl rules, and the Actor handles traversal, extraction, deduplication, and optional change detection. Runs can be triggered manually, on schedule, or via API.

What can this Actor do?

  • Extract clean, boilerplate-reduced page content using Readability + Turndown.
  • Generate RAG chunks using two strategies:
    • byHeading for semantic chunks aligned with heading hierarchy.
    • fixedTokens (character window + overlap) for uniform chunk sizing.
  • Detect near-duplicate pages (SimHash/MinHash options) to reduce noisy indexing.
  • Track content changes across runs with baseline + diff outputs.
  • Build a crawl graph/sitemap (sitemap.json) for structural analysis.
  • Optionally use Stagehand + Zod schema extraction for hard-to-parse layouts.
  • Optionally download linked files (for example PDF/DOCX) within configured limits.
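To make the fixedTokens strategy above concrete, here is a minimal sketch of character-window chunking with overlap. The function name, default sizes, and chunkId format are illustrative, not the Actor's internals:

```python
def fixed_window_chunks(text, max_chars=200, overlap_chars=50):
    """Split text into overlapping character windows.

    Consecutive windows advance by (max_chars - overlap_chars), so each
    chunk shares its last `overlap_chars` characters with the next one.
    """
    if max_chars <= overlap_chars:
        raise ValueError("max_chars must exceed overlap_chars")
    step = max_chars - overlap_chars
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + max_chars]
        chunks.append({
            "chunkId": f"doc::{len(chunks) + 1}",  # illustrative ID scheme
            "text": piece,
            "charCount": len(piece),
        })
        if start + max_chars >= len(text):
            break  # last window already reached the end of the text
    return chunks
```

Uniform windows like this trade semantic alignment for predictable chunk sizes, which is why byHeading is usually preferable for well-structured docs.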

Why use it on Apify?

Using this Actor on Apify gives you platform capabilities out of the box:

  • Scheduled crawls for continuous content sync.
  • API-first operation for pipeline automation.
  • Key-value store + dataset storage for downstream processing.
  • Proxy configuration (including Apify Proxy groups) for difficult targets.
  • Monitoring, logs, and run history for debugging and reliability.

What data can this Actor extract?

  • url / finalUrl: Original and final resolved URL
  • title: Page title
  • markdown / text: Clean extracted content
  • chunks[]: RAG chunks with chunkId, headingPath, text, charCount
  • links.internal / links.external: Links discovered on the page
  • meta: Metadata such as h1, canonical, robots, timestamp
  • diff: Added/removed/modified chunks (when change detection is enabled)
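Dataset items with these fields are straightforward to flatten into one record per chunk for an embedding pipeline. A minimal sketch, assuming the page/chunk fields listed above (the output record layout is illustrative):

```python
def flatten_for_embedding(pages):
    """Turn page-level dataset items into one record per chunk.

    Each record carries the chunk text plus URL/title/heading metadata,
    ready to hand to an embedding or vector-store client.
    """
    records = []
    for page in pages:
        for chunk in page.get("chunks", []):
            records.append({
                "id": chunk["chunkId"],
                "text": chunk["text"],
                "metadata": {
                    "url": page["url"],
                    "title": page.get("title", ""),
                    "headingPath": " > ".join(chunk.get("headingPath", [])),
                },
            })
    return records
```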

How do I use RAG-Ready Website Intelligence Crawler to scrape website data?

  1. Open the Actor in Apify Console and click Try for free.
  2. Add one or more URLs to startUrls.
  3. Set crawl scope (maxPages, maxDepth, sameDomainOnly, include/exclude patterns).
  4. Choose extraction mode:
    • rag-chunks (default)
    • clean-markdown
    • stagehand-schema (experimental, requires LLM key)
  5. Configure optional features (dedup, changeDetection, downloadFiles, proxy).
  6. Run the Actor and inspect:
    • Dataset for page-level output
    • Key-value store for sitemap.json, stats.json, and optional diff.json

How much does it cost to scrape website data?

Costs depend on site complexity, JavaScript rendering, proxy usage, and enabled features.

  • Lower-cost runs: smaller maxPages, no downloads, no change detection, standard extraction.
  • Higher-cost runs: large sites, heavy rendering, residential proxy usage, or Stagehand schema extraction.

If you are testing, start with a small crawl (for example 20-50 pages), review output quality, then scale gradually.

Input example

See example_input.json for a complete template.

Common input fields:

  • startUrls
  • maxPages, maxDepth
  • sameDomainOnly
  • includePatterns, excludePatterns
  • render: auto | playwright | http
  • extractMode: clean-markdown | rag-chunks | stagehand-schema
  • chunking: strategy, maxChars, overlapChars
  • dedup: enabled, method, threshold
  • changeDetection: enabled, baselineRunKvKey, emitDiffOnly, saveBaseline
  • downloadFiles
  • proxy
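A hedged input sketch using the fields above. The nested option names and all values are illustrative; consult example_input.json for the authoritative template:

```json
{
  "startUrls": [{ "url": "https://example.com/docs" }],
  "maxPages": 50,
  "maxDepth": 3,
  "sameDomainOnly": true,
  "includePatterns": ["/docs/"],
  "excludePatterns": ["/blog/"],
  "render": "auto",
  "extractMode": "rag-chunks",
  "chunking": { "strategy": "byHeading", "maxChars": 1200, "overlapChars": 100 },
  "dedup": { "enabled": true, "method": "simhash", "threshold": 3 },
  "changeDetection": { "enabled": false },
  "downloadFiles": false,
  "proxy": { "useApifyProxy": true }
}
```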

Output example

Dataset output (per page)

{
  "url": "https://example.com/docs/getting-started",
  "finalUrl": "https://example.com/docs/getting-started",
  "title": "Getting Started",
  "httpStatus": 200,
  "markdown": "# Getting Started\n\nWelcome to the docs...",
  "chunks": [
    {
      "chunkId": "getting-started::1",
      "headingPath": ["Getting Started"],
      "text": "Welcome to the docs...",
      "charCount": 118
    }
  ],
  "links": {
    "internal": ["https://example.com/docs/install"],
    "external": []
  },
  "meta": {
    "h1": "Getting Started",
    "canonical": "https://example.com/docs/getting-started",
    "robots": "index,follow",
    "timestamp": "2026-03-02T12:00:00.000Z"
  }
}

Key-value store artifacts

  • sitemap.json: crawl graph and section grouping
  • stats.json: crawl/chunk/dedup summary metrics
  • diff.json: URL/chunk-level differences (when enabled)
  • baseline key (changeDetection.baselineRunKvKey): prior snapshot for future diffs
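Chunk-level change detection can be approximated as a set comparison between a baseline snapshot and the current run. A minimal sketch, where the {chunkId: text} snapshot shape is an assumption for illustration (the Actor's baseline format may differ):

```python
import hashlib

def chunk_diff(baseline, current):
    """Compare two {chunkId: text} snapshots.

    Returns sorted lists of chunk IDs that were added, removed, or whose
    content hash changed between the baseline and the current crawl.
    """
    digest = lambda t: hashlib.sha256(t.encode("utf-8")).hexdigest()
    base = {cid: digest(text) for cid, text in baseline.items()}
    cur = {cid: digest(text) for cid, text in current.items()}
    return {
        "added": sorted(cur.keys() - base.keys()),
        "removed": sorted(base.keys() - cur.keys()),
        "modified": sorted(
            cid for cid in base.keys() & cur.keys() if base[cid] != cur[cid]
        ),
    }
```

Hashing the text (rather than storing it) keeps the baseline compact while still catching any content change.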

FAQ, disclaimers, and support

When should I use rag-chunks vs clean-markdown?

Use rag-chunks when you plan to embed and retrieve content in a RAG system. Use clean-markdown when you only need cleaned page content without chunk metadata.
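To make the byHeading idea concrete, here is a sketch of heading-aligned chunking over Markdown. It handles ATX (#) headings only and is an illustration, not the Actor's actual splitter:

```python
import re

def chunks_by_heading(markdown):
    """Split Markdown into chunks at ATX headings, tracking the heading path.

    A level-N heading truncates the path to depth N-1 before appending
    itself, so each chunk records its full heading hierarchy.
    """
    path, buf, chunks = [], [], []

    def flush():
        text = "\n".join(buf).strip()
        if text:
            chunks.append({"headingPath": list(path), "text": text})
        buf.clear()

    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)$", line)
        if m:
            flush()
            level = len(m.group(1))
            del path[level - 1:]
            path.append(m.group(2).strip())
        else:
            buf.append(line)
    flush()
    return chunks
```

Because each chunk follows the document's own hierarchy, retrieval hits come back with meaningful context such as ["Getting Started", "Install"].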

What is the Stagehand mode for?

stagehand-schema is useful for pages where static extraction is fragile. It can improve resilience on changing layouts, but it needs an LLM API key and may increase run cost.

How does deduplication work?

The Actor compares content similarity and can skip near-duplicate pages using configured method/threshold values, helping reduce duplicate embeddings and storage overhead.
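SimHash-style near-duplicate detection can be sketched as follows. Token hashing via MD5 and the default threshold of 3 differing bits are illustrative assumptions, not the Actor's exact configuration:

```python
import hashlib
import re

def simhash(text, bits=64):
    """64-bit SimHash: each token votes +1/-1 per bit position.

    The final fingerprint sets each bit to the sign of its vote total, so
    documents sharing most tokens land on nearly identical fingerprints.
    """
    votes = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def is_near_duplicate(text_a, text_b, threshold=3):
    """Treat two pages as near-duplicates when their fingerprints differ
    in at most `threshold` bits."""
    return hamming_distance(simhash(text_a), simhash(text_b)) <= threshold
```

Unlike exact hashing, this tolerates small edits (a changed footer, an extra sentence) while still separating genuinely different pages.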

How do I handle Cloudflare or bot protection?

Use Apify Proxy (often residential groups), tune retries/concurrency, and enable signed crawler identification where available.

You are responsible for complying with target-site terms, applicable laws, and data-protection requirements in your jurisdiction and use case.

Where can I get help or request improvements?

  • Check run logs and stats.json first for troubleshooting.
  • Use the Actor Issues tab for bug reports and feature requests.
  • For custom extraction requirements, open an issue with sample URLs and expected output shape.

Rate this Actor

If this Actor saved you time or improved your pipeline, please rate it on Apify. Your rating helps other users discover it and helps prioritize future improvements.