Website Content Crawler for AI & RAG - Clean Text & Markdown
What does Website Content Crawler for RAG do?
Website Content Crawler for RAG crawls any website and extracts clean text content optimized for AI and RAG (Retrieval-Augmented Generation) pipelines. It converts HTML pages into clean Markdown or plain text, automatically strips navigation menus, ads, and boilerplate, then chunks the content into semantic segments. Feed the output directly into LLMs, vector databases like Pinecone or Weaviate, or any RAG system for accurate knowledge retrieval.
Why use Website Content Crawler for RAG?
- AI-optimized output — Content is cleaned, structured, and chunked specifically for embedding models and LLM context windows
- Flexible formats — Choose between Markdown (preserves headings, links, code blocks) or plain text output
- Smart chunking — Splits content at paragraph and sentence boundaries to maintain semantic coherence within each chunk
- Navigation stripping — Automatically removes headers, footers, sidebars, cookie banners, and ads for cleaner content
- Full site crawling — Follows links within the same domain to crawl entire documentation sites, blogs, or knowledge bases
- Scalable extraction — Process up to 10,000 pages per run using Apify Proxy for reliable access
- API integration — Access results programmatically via the Apify API to build automated RAG pipelines
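Here is a minimal sketch of that API workflow using the official apify-client package. The actor ID string is a placeholder; substitute the actual ID from the Apify Store:

```ts
import { ApifyClient } from 'apify-client';

// Authenticate with your Apify API token.
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Run the actor and wait for the crawl to finish.
// 'username/website-content-crawler-for-rag' is a placeholder actor ID.
const run = await client.actor('username/website-content-crawler-for-rag').call({
    startUrls: ['https://docs.apify.com'],
    maxPages: 100,
    outputFormat: 'markdown',
    chunkSize: 1000,
});

// Fetch the chunked content from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Retrieved ${items.length} content chunks`);
```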
How to use Website Content Crawler for RAG
- Find Website Content Crawler for RAG on the Apify Store
- Enter one or more starting URLs in the input configuration
- Set the maximum number of pages to crawl (default: 100)
- Choose your preferred output format: Markdown or plain text
- Configure the chunk size based on your embedding model requirements (default: 1000 characters)
- Click Start and wait for the crawler to finish
- Download the chunked content in JSON format or connect via API to your vector database
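For the download step, you can also skip the client library and read the dataset straight from Apify's REST API, which returns the chunks as JSON. A short sketch, where DATASET_ID is a placeholder for the run's defaultDatasetId:

```ts
// Fetch all chunks from a finished run as JSON via the Apify dataset items endpoint.
// DATASET_ID is a placeholder; use the defaultDatasetId of your run.
const res = await fetch(
    `https://api.apify.com/v2/datasets/DATASET_ID/items?format=json&token=${process.env.APIFY_TOKEN}`,
);
const chunks = await res.json();
console.log(chunks[0]?.content);
```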
Input configuration
| Field | Type | Description | Default |
|---|---|---|---|
| startUrls | array | List of URLs to start crawling from | ["https://docs.apify.com"] |
| maxPages | integer | Maximum number of pages to crawl | 100 |
| outputFormat | string | Output format: "markdown" or "text" | "markdown" |
| chunkSize | integer | Size of content chunks in characters | 1000 |
| includeLinks | boolean | Preserve hyperlinks in extracted content | true |
| stripNavigation | boolean | Remove navigation menus, headers, footers | true |
| useResidentialProxy | boolean | Enable residential proxy for blocked sites | false |
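For example, a complete input that crawls the Apify docs with the defaults above looks like this:

```json
{
  "startUrls": ["https://docs.apify.com"],
  "maxPages": 100,
  "outputFormat": "markdown",
  "chunkSize": 1000,
  "includeLinks": true,
  "stripNavigation": true,
  "useResidentialProxy": false
}
```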
Output data
The actor produces a dataset where each item represents one content chunk from a crawled page. Pages with more content produce multiple chunks. Here is an example output:
{"url": "https://docs.apify.com/platform/actors","title": "Actors - Apify Documentation","description": "Learn about Apify Actors and how to use them.","content": "# Actors\n\nActors are serverless cloud programs that can run for a few seconds to hours. They accept input, perform a task, and produce output...","chunkIndex": 0,"totalChunks": 5,"contentLength": 987,"outputFormat": "markdown","scrapedAt": "2026-02-19T12:00:00.000Z"}
Cost of usage
Website Content Crawler for RAG uses pay-per-event pricing at $0.75 per 1,000 results. Each content chunk counts as one result. A typical documentation site with 100 pages averaging 3 chunks each produces roughly 300 results, costing approximately $0.225 in platform fees. Actual compute costs depend on the number of pages and proxy usage.
Tips and advanced usage
- Tune chunk size to match your embedding model. OpenAI text-embedding-3 works well with 500-1500 character chunks. Larger context window models can handle 3000+ characters
- Schedule recurring crawls using Apify Schedules to keep your RAG knowledge base up to date with the latest content
- Combine with vector databases by using Apify integrations to automatically push new chunks to Pinecone, Weaviate, or Qdrant (see the sketch after this list for a client-side alternative)
- Use plain text format when feeding content to models that do not understand Markdown syntax
- Disable link preservation for cleaner text when URLs are not needed in your embeddings
- Start with a sitemap URL to ensure comprehensive coverage of large documentation sites
- Set higher maxPages (1000+) for complete site coverage when building comprehensive knowledge bases
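If you would rather push chunks to a vector database from your own code instead of an Apify integration, the following sketch embeds each chunk and upserts it to Pinecone. It assumes the openai and @pinecone-database/pinecone client packages; the index name, embedding model, and the `chunks` array (dataset items fetched as shown earlier) are placeholders, not part of this actor:

```ts
import OpenAI from 'openai';
import { Pinecone } from '@pinecone-database/pinecone';

const openai = new OpenAI();
const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
// 'rag-knowledge-base' is a placeholder index name.
const index = pc.index('rag-knowledge-base');

// Placeholder: fill with items fetched from the actor's dataset.
const chunks: { url: string; chunkIndex: number; content: string }[] = [];

// Embed every chunk in one batch request.
const { data } = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: chunks.map((c) => c.content),
});

// Upsert each vector with its source URL and text as metadata.
await index.upsert(
    chunks.map((c, i) => ({
        id: `${c.url}#${c.chunkIndex}`,
        values: data[i].embedding,
        metadata: { url: c.url, text: c.content },
    })),
);
```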
Built with Crawlee and the Apify SDK. See more scrapers by donnycodesdefi on the Apify Store.