
Website Content Crawler for AI & RAG - Clean Text & Markdown

What does Website Content Crawler for RAG do?

Website Content Crawler for RAG crawls any website and extracts clean text content optimized for AI and RAG (Retrieval-Augmented Generation) pipelines. It converts HTML pages into clean Markdown or plain text, automatically strips navigation menus, ads, and boilerplate, then chunks the content into semantic segments. Feed the output directly into LLMs, vector databases like Pinecone or Weaviate, or any RAG system for accurate knowledge retrieval.

Why use Website Content Crawler for RAG?

  • AI-optimized output — Content is cleaned, structured, and chunked specifically for embedding models and LLM context windows
  • Flexible formats — Choose between Markdown (preserves headings, links, code blocks) or plain text output
  • Smart chunking — Splits content at paragraph and sentence boundaries to maintain semantic coherence within each chunk (see the chunking sketch after this list)
  • Navigation stripping — Automatically removes headers, footers, sidebars, cookie banners, and ads for cleaner content
  • Full site crawling — Follows links within the same domain to crawl entire documentation sites, blogs, or knowledge bases
  • Scalable extraction — Process up to 10,000 pages per run using Apify Proxy for reliable access
  • API integration — Access results programmatically via the Apify API to build automated RAG pipelines
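To make the smart-chunking bullet concrete, here is a minimal sketch of boundary-aware chunking. It is an approximation under stated assumptions, not the actor's internal algorithm; the 1000-character limit mirrors the default chunkSize input.

```python
# Minimal sketch of boundary-aware chunking (not the actor's exact algorithm):
# pack whole paragraphs into chunks of at most `chunk_size` characters, and
# split oversized paragraphs at sentence boundaries.

def chunk_text(text: str, chunk_size: int = 1000) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single paragraph longer than chunk_size is cut at the last
        # sentence boundary that still fits, or hard-cut as a fallback.
        while len(para) > chunk_size:
            cut = para.rfind(". ", 0, chunk_size)
            cut = cut + 1 if cut != -1 else chunk_size
            chunks.append(para[:cut].strip())
            para = para[cut:].strip()
        current = para
    if current:
        chunks.append(current)
    return chunks
```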

How to use Website Content Crawler for RAG

  1. Find Website Content Crawler for RAG on the Apify Store
  2. Enter one or more starting URLs in the input configuration
  3. Set the maximum number of pages to crawl (default: 100)
  4. Choose your preferred output format: Markdown or plain text
  5. Configure the chunk size based on your embedding model requirements (default: 1000 characters)
  6. Click Start and wait for the crawler to finish
  7. Download the chunked content in JSON format, or connect via the Apify API to feed your vector database (see the API sketch below)
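Runs can also be started programmatically instead of through the Console. The sketch below uses the official apify-client Python package; the actor ID donnycodesdefi/website-content-crawler-for-ai-and-rag is an assumption inferred from the developer name on this page, so verify the real ID on the actor's store page.

```python
from apify_client import ApifyClient

# Your Apify API token (Console -> Settings -> Integrations).
client = ApifyClient("<YOUR_APIFY_TOKEN>")

# NOTE: this actor ID is assumed; check the store page for the real one.
run = client.actor("donnycodesdefi/website-content-crawler-for-ai-and-rag").call(
    run_input={
        "startUrls": ["https://docs.apify.com"],
        "maxPages": 100,
        "outputFormat": "markdown",
        "chunkSize": 1000,
    }
)

# Each dataset item is one content chunk (see "Output data" below).
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"], item["chunkIndex"], item["contentLength"])
```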

Input configuration

| Field | Type | Description | Default |
|-------|------|-------------|---------|
| startUrls | array | List of URLs to start crawling from | ["https://docs.apify.com"] |
| maxPages | integer | Maximum number of pages to crawl | 100 |
| outputFormat | string | Output format: "markdown" or "text" | "markdown" |
| chunkSize | integer | Size of content chunks in characters | 1000 |
| includeLinks | boolean | Preserve hyperlinks in extracted content | true |
| stripNavigation | boolean | Remove navigation menus, headers, footers | true |
| useResidentialProxy | boolean | Enable residential proxy for blocked sites | false |
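Put together, a full input object might look like the following; the values are illustrative, not recommendations.

```json
{
  "startUrls": ["https://docs.apify.com"],
  "maxPages": 500,
  "outputFormat": "markdown",
  "chunkSize": 1000,
  "includeLinks": true,
  "stripNavigation": true,
  "useResidentialProxy": false
}
```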

Output data

The actor produces a dataset where each item represents one content chunk from a crawled page. Pages with more content produce multiple chunks. Here is an example output:

```json
{
  "url": "https://docs.apify.com/platform/actors",
  "title": "Actors - Apify Documentation",
  "description": "Learn about Apify Actors and how to use them.",
  "content": "# Actors\n\nActors are serverless cloud programs that can run for a few seconds to hours. They accept input, perform a task, and produce output...",
  "chunkIndex": 0,
  "totalChunks": 5,
  "contentLength": 987,
  "outputFormat": "markdown",
  "scrapedAt": "2026-02-19T12:00:00.000Z"
}
```
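Since chunkIndex and totalChunks record each chunk's position within its page, downstream code can reassemble full pages when a RAG pipeline needs more context. A minimal sketch, assuming the client and run from the earlier example:

```python
from collections import defaultdict

# Fetch all chunks, group them by page URL, and rebuild each page in order.
items = list(client.dataset(run["defaultDatasetId"]).iterate_items())

pages: dict[str, list[dict]] = defaultdict(list)
for item in items:
    pages[item["url"]].append(item)

full_pages = {
    url: "\n\n".join(
        c["content"] for c in sorted(chunks, key=lambda c: c["chunkIndex"])
    )
    for url, chunks in pages.items()
}
```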

Cost of usage

Website Content Crawler for RAG uses pay-per-event pricing at $0.75 per 1,000 results. Each content chunk counts as one result. A typical documentation site with 100 pages averaging 3 chunks each produces roughly 300 results, costing approximately $0.225 in platform fees. Actual compute costs depend on the number of pages and proxy usage.
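The same arithmetic generalizes into a quick estimator (a sketch; it covers only the per-result event fee, not compute or proxy costs):

```python
# Event cost at $0.75 per 1,000 results, where each chunk is one result.
def estimate_event_cost(pages: int, chunks_per_page: float, price_per_1k: float = 0.75) -> float:
    return pages * chunks_per_page * price_per_1k / 1000

print(estimate_event_cost(100, 3))  # 0.225, matching the example above
```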

Tips and advanced usage

  • Tune chunk size to match your embedding model. OpenAI text-embedding-3 works well with 500-1500 character chunks. Larger context window models can handle 3000+ characters
  • Schedule recurring crawls using Apify Schedules to keep your RAG knowledge base up to date with the latest content
  • Combine with vector databases by using Apify integrations to automatically push new chunks to Pinecone, Weaviate, or Qdrant
  • Use plain text format when feeding content to models that do not understand Markdown syntax
  • Disable link preservation for cleaner text when URLs are not needed in your embeddings
  • Start with a sitemap URL to ensure comprehensive coverage of large documentation sites
  • Set higher maxPages (1000+) for complete site coverage when building comprehensive knowledge bases
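For the vector-database tip above, a manual pipeline (as an alternative to Apify's built-in integrations) could look like the sketch below. It assumes the openai and qdrant-client packages, a running Qdrant instance at localhost:6333, and an existing dataset ID; the collection name and embedding model are illustrative choices.

```python
from apify_client import ApifyClient
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

apify = ApifyClient("<YOUR_APIFY_TOKEN>")
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
qdrant = QdrantClient(url="http://localhost:6333")

# Illustrative collection; text-embedding-3-small yields 1536-dim vectors.
qdrant.recreate_collection(
    collection_name="docs_chunks",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

items = list(apify.dataset("<DATASET_ID>").iterate_items())

# Embedding everything in one call is fine for small crawls; batch the
# input for large ones to stay under API limits.
response = openai_client.embeddings.create(
    model="text-embedding-3-small",
    input=[item["content"] for item in items],
)

qdrant.upsert(
    collection_name="docs_chunks",
    points=[
        PointStruct(
            id=i,
            vector=emb.embedding,
            payload={"url": items[i]["url"], "title": items[i]["title"]},
        )
        for i, emb in enumerate(response.data)
    ],
)
```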

Built with Crawlee and Apify SDK. See more scrapers by donnycodesdefi on Apify Store.