
Website Content Crawler for AI & RAG - Clean Text & Markdown

What does Website Content Crawler for RAG do?

Website Content Crawler for RAG crawls any website and extracts clean text content optimized for AI and RAG (Retrieval-Augmented Generation) pipelines. It converts HTML pages into clean Markdown or plain text, automatically strips navigation menus, ads, and boilerplate, then chunks the content into semantic segments. Feed the output directly into LLMs, vector databases like Pinecone or Weaviate, or any RAG system for accurate knowledge retrieval.

Why use Website Content Crawler for RAG?

  • AI-optimized output — Content is cleaned, structured, and chunked specifically for embedding models and LLM context windows
  • Flexible formats — Choose between Markdown (preserves headings, links, code blocks) or plain text output
  • Smart chunking — Splits content at paragraph and sentence boundaries to maintain semantic coherence within each chunk (see the chunking sketch after this list)
  • Navigation stripping — Automatically removes headers, footers, sidebars, cookie banners, and ads for cleaner content
  • Full site crawling — Follows links within the same domain to crawl entire documentation sites, blogs, or knowledge bases
  • Scalable extraction — Process up to 10,000 pages per run using Apify Proxy for reliable access
  • API integration — Access results programmatically via the Apify API to build automated RAG pipelines
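To make the smart-chunking bullet concrete, here is a minimal sketch of boundary-aware chunking. It is an approximation under stated assumptions, not the actor's internal algorithm; the 1000-character limit mirrors the default chunkSize input.

```python
# Minimal sketch of boundary-aware chunking (not the actor's exact algorithm):
# pack whole paragraphs into chunks of at most `chunk_size` characters, and
# split oversized paragraphs at sentence boundaries.

def chunk_text(text: str, chunk_size: int = 1000) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single paragraph longer than chunk_size is cut at the last
        # sentence boundary that still fits, or hard-cut as a fallback.
        while len(para) > chunk_size:
            cut = para.rfind(". ", 0, chunk_size)
            cut = cut + 1 if cut != -1 else chunk_size
            chunks.append(para[:cut].strip())
            para = para[cut:].strip()
        current = para
    if current:
        chunks.append(current)
    return chunks
```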

How to use Website Content Crawler for RAG

  1. Find Website Content Crawler for RAG on the Apify Store
  2. Enter one or more starting URLs in the input configuration
  3. Set the maximum number of pages to crawl (default: 100)
  4. Choose your preferred output format: Markdown or plain text
  5. Configure the chunk size based on your embedding model requirements (default: 1000 characters)
  6. Click Start and wait for the crawler to finish
  7. Download the chunked content in JSON format, or connect via the Apify API to feed your vector database (see the API sketch below)
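Runs can also be started programmatically instead of through the Console. The sketch below uses the official apify-client Python package; the actor ID donnycodesdefi/website-content-crawler-for-ai-and-rag is an assumption inferred from the developer name on this page, so verify the real ID on the actor's store page.

```python
from apify_client import ApifyClient

# Your Apify API token (Console -> Settings -> Integrations).
client = ApifyClient("<YOUR_APIFY_TOKEN>")

# NOTE: this actor ID is assumed; check the store page for the real one.
run = client.actor("donnycodesdefi/website-content-crawler-for-ai-and-rag").call(
    run_input={
        "startUrls": ["https://docs.apify.com"],
        "maxPages": 100,
        "outputFormat": "markdown",
        "chunkSize": 1000,
    }
)

# Each dataset item is one content chunk (see "Output data" below).
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"], item["chunkIndex"], item["contentLength"])
```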

Input configuration

| Field | Type | Description | Default |
|-------|------|-------------|---------|
| startUrls | array | List of URLs to start crawling from | ["https://docs.apify.com"] |
| maxPages | integer | Maximum number of pages to crawl | 100 |
| outputFormat | string | Output format: "markdown" or "text" | "markdown" |
| chunkSize | integer | Size of content chunks in characters | 1000 |
| includeLinks | boolean | Preserve hyperlinks in extracted content | true |
| stripNavigation | boolean | Remove navigation menus, headers, footers | true |
| useResidentialProxy | boolean | Enable residential proxy for blocked sites | false |
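Put together, a full input object might look like the following; the values are illustrative, not recommendations.

```json
{
  "startUrls": ["https://docs.apify.com"],
  "maxPages": 500,
  "outputFormat": "markdown",
  "chunkSize": 1000,
  "includeLinks": true,
  "stripNavigation": true,
  "useResidentialProxy": false
}
```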

Output data

The actor produces a dataset where each item represents one content chunk from a crawled page. Pages with more content produce multiple chunks. Here is an example output:

```json
{
  "url": "https://docs.apify.com/platform/actors",
  "title": "Actors - Apify Documentation",
  "description": "Learn about Apify Actors and how to use them.",
  "content": "# Actors\n\nActors are serverless cloud programs that can run for a few seconds to hours. They accept input, perform a task, and produce output...",
  "chunkIndex": 0,
  "totalChunks": 5,
  "contentLength": 987,
  "outputFormat": "markdown",
  "scrapedAt": "2026-02-19T12:00:00.000Z"
}
```
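Since chunkIndex and totalChunks record each chunk's position within its page, downstream code can reassemble full pages when a RAG pipeline needs more context. A minimal sketch, assuming the client and run from the earlier example:

```python
from collections import defaultdict

# Fetch all chunks, group them by page URL, and rebuild each page in order.
items = list(client.dataset(run["defaultDatasetId"]).iterate_items())

pages: dict[str, list[dict]] = defaultdict(list)
for item in items:
    pages[item["url"]].append(item)

full_pages = {
    url: "\n\n".join(
        c["content"] for c in sorted(chunks, key=lambda c: c["chunkIndex"])
    )
    for url, chunks in pages.items()
}
```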

Cost of usage

Website Content Crawler for RAG uses pay-per-event pricing at $0.75 per 1,000 results. Each content chunk counts as one result. A typical documentation site with 100 pages averaging 3 chunks each produces roughly 300 results, costing approximately $0.225 in platform fees. Actual compute costs depend on the number of pages and proxy usage.
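The same arithmetic generalizes into a quick estimator (a sketch; it covers only the per-result event fee, not compute or proxy costs):

```python
# Event cost at $0.75 per 1,000 results, where each chunk is one result.
def estimate_event_cost(pages: int, chunks_per_page: float, price_per_1k: float = 0.75) -> float:
    return pages * chunks_per_page * price_per_1k / 1000

print(estimate_event_cost(100, 3))  # 0.225, matching the example above
```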

Tips and advanced usage

  • Tune chunk size to match your embedding model. OpenAI text-embedding-3 works well with 500-1500 character chunks. Larger context window models can handle 3000+ characters
  • Schedule recurring crawls using Apify Schedules to keep your RAG knowledge base up to date with the latest content
  • Combine with vector databases by using Apify integrations to automatically push new chunks to Pinecone, Weaviate, or Qdrant
  • Use plain text format when feeding content to models that do not understand Markdown syntax
  • Disable link preservation for cleaner text when URLs are not needed in your embeddings
  • Start with a sitemap URL to ensure comprehensive coverage of large documentation sites
  • Set higher maxPages (1000+) for complete site coverage when building comprehensive knowledge bases
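For the vector-database tip above, a manual pipeline (as an alternative to Apify's built-in integrations) could look like the sketch below. It assumes the openai and qdrant-client packages, a running Qdrant instance at localhost:6333, and an existing dataset ID; the collection name and embedding model are illustrative choices.

```python
from apify_client import ApifyClient
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

apify = ApifyClient("<YOUR_APIFY_TOKEN>")
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
qdrant = QdrantClient(url="http://localhost:6333")

# Illustrative collection; text-embedding-3-small yields 1536-dim vectors.
qdrant.recreate_collection(
    collection_name="docs_chunks",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

items = list(apify.dataset("<DATASET_ID>").iterate_items())

# Embedding everything in one call is fine for small crawls; batch the
# input for large ones to stay under API limits.
response = openai_client.embeddings.create(
    model="text-embedding-3-small",
    input=[item["content"] for item in items],
)

qdrant.upsert(
    collection_name="docs_chunks",
    points=[
        PointStruct(
            id=i,
            vector=emb.embedding,
            payload={"url": items[i]["url"], "title": items[i]["title"]},
        )
        for i, emb in enumerate(response.data)
    ],
)
```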

Built with Crawlee and Apify SDK. See more scrapers by donnycodesdefi on Apify Store.