RAG Web Extractor

Pricing

from $2.00 / 1,000 pages extracted

Extract clean markdown from websites for RAG pipelines. Strip nav, ads, boilerplate. Preserve headings, links, images. Recursive crawling with depth control. Chunked output for embedding pipelines. Build AI knowledge bases.

Rating: 0.0 (0)

Developer: junipr (Maintained by Community)

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 15 hours ago

RAG Web Content Extractor

Introduction

RAG Web Content Extractor is a production-grade web scraping actor that extracts clean, structured content from any web page and outputs it in formats optimized for LLM ingestion and RAG (Retrieval-Augmented Generation) pipelines. It handles JavaScript-rendered pages (SPAs, Next.js, Nuxt), infinite scroll, pagination, and complex DOM structures out of the box.

Primary use cases:

  • Feeding web content into vector databases (Pinecone, Weaviate, Qdrant, Chroma)
  • Building RAG pipelines for LLM applications
  • Structured content analysis and competitive intelligence
  • LLM fine-tuning data collection at scale

Key differentiators: Built-in configurable content chunking with overlap control, multi-format output (markdown, plain text, structured JSON) in a single run, full JavaScript rendering via Playwright, schema.org extraction, and content deduplication — all with zero-config defaults.

Why Use This Actor

| Feature | RAG Web Extractor | Firecrawl | web-content-crawler (Apify) | Website Content Crawler |
| --- | --- | --- | --- | --- |
| JS rendering | Full (Playwright) | Full | Partial | Partial |
| Markdown output | Native | Native | Plugin | No |
| Content chunking | Built-in w/ overlap | API only | No | No |
| RAG-optimized JSON | Native | Partial | No | No |
| Infinite scroll | Full support | Limited | Buggy | No |
| schema.org extraction | Full | Partial | No | No |
| PPE pricing | $3.50/1K | $38/1K equiv | $4.90/1K | Free (low quality) |
| Zero-config | Yes | Requires API key | Mostly | Yes |
| Content deduplication | Built-in | No | No | No |

Cost comparison: At 10,000 pages/month, this actor costs $35 vs Firecrawl's ~$380 equivalent — a 90% cost reduction with more features included.
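That comparison is simple per-page arithmetic; a quick sketch using the rates from the table above:

```python
def monthly_cost(pages: int, rate_per_1k: float) -> float:
    """Cost in USD for a given page volume at a per-1,000-pages rate."""
    return pages / 1000 * rate_per_1k

actor = monthly_cost(10_000, 3.50)      # $35.00
firecrawl = monthly_cost(10_000, 38.0)  # ~$380.00 equivalent
savings = 1 - actor / firecrawl         # ~0.908, i.e. roughly a 90% reduction
```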

How to Use

Zero-Config Quick Start

Just provide URLs and run. Everything else has sensible defaults:

{
  "startUrls": [
    { "url": "https://example.com/blog" }
  ]
}

That's it. The actor will extract the page content as clean markdown with full metadata. No API keys, no complex configuration.

Step-by-Step

  1. Go to the actor's page on Apify Console
  2. Add one or more URLs to the Start URLs field
  3. (Optional) Select additional output formats, enable chunking, or adjust other settings
  4. Click Start to run the actor
  5. When complete, download results from the Dataset tab

Common Configuration Recipes

RAG Pipeline Basic — Markdown output with chunking for vector databases:

{
  "startUrls": [{ "url": "https://docs.example.com" }],
  "outputFormats": ["markdown", "plainText"],
  "enableChunking": true,
  "chunkSize": 1000,
  "chunkOverlap": 200,
  "chunkStrategy": "semantic",
  "maxDepth": 2
}

Full Site Crawl — Crawl an entire domain for comprehensive content extraction:

{
  "startUrls": [{ "url": "https://example.com" }],
  "maxPages": 5000,
  "maxDepth": 5,
  "outputFormats": ["markdown", "structuredJson"],
  "extractTables": true,
  "renderJs": false
}

JS-Heavy SPA — Extract content from React/Next.js/Vue apps:

{
  "startUrls": [{ "url": "https://app.example.com/docs" }],
  "renderJs": true,
  "waitForSelector": "#main-content",
  "waitForTimeout": 10000,
  "outputFormats": ["markdown", "plainText"],
  "enableChunking": true
}

Input Configuration

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| startUrls | array | required | URLs to scrape |
| maxPages | integer | 100 | Max pages per run (1-100,000) |
| maxDepth | integer | 0 | Link-following depth (0 = start URLs only) |
| outputFormats | array | ["markdown"] | Output formats: markdown, plainText, structuredJson, html |
| enableChunking | boolean | false | Split content into RAG-ready chunks |
| chunkSize | integer | 1000 | Target chunk size in characters (100-10,000) |
| chunkOverlap | integer | 200 | Overlap between chunks |
| chunkStrategy | string | "semantic" | Chunking strategy: semantic, fixed, sentence |
| renderJs | boolean | true | Use Playwright for JS rendering |
| waitForSelector | string | null | CSS selector to wait for before extraction |
| handleInfiniteScroll | boolean | false | Scroll to load lazy content |
| handlePagination | boolean | false | Follow pagination automatically |
| removeNavigation | boolean | true | Auto-remove nav/header/footer |
| removeAds | boolean | true | Auto-remove ad elements |
| extractMetadata | boolean | true | Extract OG tags, meta, JSON-LD |
| extractTables | boolean | false | Extract HTML tables as structured data |
| deduplicateContent | boolean | true | Skip duplicate pages |

See the Input Schema tab for the complete list of parameters with detailed descriptions.
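Among those parameters, deduplicateContent skips pages whose content has already been seen. A minimal sketch of one common approach, hashing whitespace-normalized text (an illustration of the idea, not the actor's actual algorithm):

```python
import hashlib

def content_fingerprint(text: str) -> str:
    """Hash of lowercased, whitespace-normalized text."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

seen: set[str] = set()

def is_duplicate(text: str) -> bool:
    """True if an equivalent page body was already processed in this run."""
    fp = content_fingerprint(text)
    if fp in seen:
        return True
    seen.add(fp)
    return False
```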

Output Format

Each scraped page produces a result object with the following structure:

Markdown Output

{
  "url": "https://example.com/blog/post-1",
  "statusCode": 200,
  "metadata": {
    "title": "How to Build a RAG Pipeline",
    "author": "Jane Doe",
    "wordCount": 2500,
    "readingTimeMinutes": 10.5
  },
  "content": {
    "markdown": "# How to Build a RAG Pipeline\n\nRAG (Retrieval-Augmented Generation) is..."
  }
}

Chunk Output

{
  "chunks": [
    {
      "chunkIndex": 0,
      "totalChunks": 5,
      "text": "RAG (Retrieval-Augmented Generation) is a technique...",
      "charCount": 980,
      "tokenEstimate": 245,
      "headingContext": "Introduction",
      "metadata": {
        "sourceUrl": "https://example.com/blog/post-1",
        "chunkStrategy": "semantic",
        "chunkSize": 1000,
        "overlap": 200
      }
    }
  ]
}

Integration with Vector Databases

LangChain (Python):

from langchain.docstore.document import Document
from langchain.document_loaders import ApifyDatasetLoader

loader = ApifyDatasetLoader(
    dataset_id="your-dataset-id",
    dataset_mapping_function=lambda item: Document(
        page_content=item["content"]["markdown"],
        metadata={"source": item["url"], "title": item["metadata"]["title"]}
    )
)
docs = loader.load()
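Because dataset_mapping_function is just a plain callable over dataset items, it can be checked offline against the output shape shown earlier. A sketch using plain dicts in place of LangChain Document objects:

```python
def map_item(item: dict) -> dict:
    """Mirror of the dataset_mapping_function, returning a plain dict."""
    return {
        "page_content": item["content"]["markdown"],
        "metadata": {"source": item["url"], "title": item["metadata"]["title"]},
    }

# Sample item shaped like the actor's markdown output documented above
sample = {
    "url": "https://example.com/blog/post-1",
    "metadata": {"title": "How to Build a RAG Pipeline"},
    "content": {"markdown": "# How to Build a RAG Pipeline\n\n..."},
}
doc = map_item(sample)
```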

LlamaIndex (Python):

from llama_index import Document, download_loader

ApifyActor = download_loader("ApifyActor")
reader = ApifyActor()
documents = reader.load_data(
    actor_id="junipr/rag-web-extractor",
    run_input={"startUrls": [{"url": "https://example.com"}]},
    dataset_mapping_function=lambda item: Document(
        text=item["content"]["markdown"],
        extra_info={"url": item["url"]}
    )
)

Tips and Advanced Usage

Performance Optimization

  • Set renderJs: false for static sites — it's 10x faster and uses less compute
  • Use includeSelectors to target specific content areas instead of processing the entire page
  • For large crawls, start with maxPages: 10 to verify output quality before scaling up
  • Set maxDepth: 0 if you only need the start URLs (no link following)

Proxy Configuration

  • Default: Apify datacenter proxies (fastest, cheapest)
  • For sites that block datacenter IPs, switch to residential proxies via the proxy settings
  • You can also provide your own proxy URLs

Chunking Strategy Guide

  • Semantic (default): Best for most RAG use cases. Splits on paragraph/heading boundaries, preserving context. Each chunk is self-contained.
  • Fixed: Best for uniform embedding sizes. Splits at exact character counts regardless of content structure.
  • Sentence: Best for Q&A and chat applications. Preserves complete sentences within each chunk.
  • Chunk size tip for OpenAI: Use 500-1000 characters (125-250 tokens) for text-embedding-ada-002. Use 1000-2000 for text-embedding-3-large.
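To make the fixed strategy concrete, here is a minimal character-window chunker with overlap. It is a sketch of the general technique under the documented defaults (chunkSize 1000, chunkOverlap 200), not the actor's actual implementation:

```python
def fixed_chunks(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows; consecutive windows
    share `overlap` characters so context survives chunk boundaries."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```

With a 2,500-character document and the defaults, this yields three chunks, and the first 200 characters of each chunk repeat the last 200 of the previous one.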

Custom Selectors

For complex layouts, use includeSelectors to extract only the main content:

{
  "includeSelectors": ["article.post-content", "div.documentation-body"],
  "removeSelectors": [".comments", ".related-posts", ".social-share"]
}

Pricing

This actor uses Pay-Per-Event (PPE) pricing at $3.50 per 1,000 extracted pages.

A billable event occurs when the actor successfully loads a URL, extracts content, and pushes the result to the dataset. You are NOT charged for failed requests, CAPTCHAs, paywalls, filtered pages, or duplicates.

Cost Examples

| Scenario | Pages | Cost |
| --- | --- | --- |
| Blog extraction (50 posts) | 50 | $0.18 |
| Documentation site (500 pages) | 500 | $1.75 |
| News site daily scrape (200 articles) | 200 | $0.70 |
| Full site crawl (10,000 pages) | 10,000 | $35.00 |
| Enterprise RAG pipeline (100K pages/mo) | 100,000 | $350.00 |

Plus standard Apify platform compute costs based on memory and runtime.

FAQ

How does this compare to Firecrawl?

This actor is 75-90% cheaper than Firecrawl at scale ($3.50/1K vs ~$38/1K equivalent) with no monthly subscription. It includes built-in chunking with configurable strategies, content deduplication, and runs on Apify infrastructure so there's no API key management. Firecrawl requires separate API calls for chunking and charges monthly fees on top of per-page costs.

Can it handle JavaScript-rendered pages?

Yes. When renderJs is enabled (the default), the actor uses a full Playwright browser to render pages. This handles React, Next.js, Vue, Angular, and any other SPA framework. You can also use waitForSelector to wait for specific elements to load before extraction.

What chunk size should I use for OpenAI embeddings?

For text-embedding-ada-002, use 500-1000 characters (roughly 125-250 tokens). For text-embedding-3-large, you can go up to 2000 characters. Set chunkOverlap to 100-200 characters (10-20% of chunk size) to maintain context across chunk boundaries.
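The character-to-token conversion here, and the tokenEstimate field in the chunk output (980 characters → 245 tokens), are consistent with the common rule of thumb of roughly 4 characters per token for English text. A sketch of that heuristic; for exact counts you would use a real tokenizer such as tiktoken:

```python
def estimate_tokens(text: str) -> int:
    """Rough token count via the ~4 characters per token heuristic.
    Good enough for sizing chunks, not for billing-exact token counts."""
    return max(1, len(text) // 4)
```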

Does it respect robots.txt?

Yes. The respectRobotsTxt option is enabled by default. Pages blocked by robots.txt will be skipped with a ROBOTS_BLOCKED error code. You can disable this if needed, but please be responsible.

How do I scrape pages behind a login?

Use the cookies input parameter to provide session cookies, or use httpHeaders to pass authentication tokens. For complex auth flows, consider using a pre-login actor to establish a session first.

What's the maximum number of pages per run?

Up to 100,000 pages per run. For very large crawls, increase the actor memory to 8192 MB and set an appropriate timeout (up to 24 hours).

Can I use my own proxies?

Yes. In the proxyConfiguration input, you can provide your own proxy URLs instead of using Apify's built-in proxies.

How is a "result" defined for pricing?

A result is one successfully extracted page that produces at least one non-empty output format and is pushed to the dataset. Failed requests, CAPTCHAs, paywalls, filtered pages (below minContentLength), and deduplicated pages are not charged.
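Expressed as code, that billing rule is a predicate over each crawled page. A hypothetical sketch: the parameter names are illustrative, and only minContentLength and the deduplication behavior come from the documentation above:

```python
def is_billable(status_code: int, content: str,
                min_content_length: int = 0, is_duplicate: bool = False) -> bool:
    """A page is billed only if it loaded successfully, produced non-empty
    content at or above minContentLength, and was not deduplicated."""
    return (
        200 <= status_code < 300
        and len(content.strip()) >= max(min_content_length, 1)
        and not is_duplicate
    )
```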