Website Content Crawler
Deep crawl websites and extract clean text, Markdown, or HTML for LLMs, RAG, and AI apps. Removes navigation, ads, cookie banners. Supports headless browser & HTTP. Sitemap discovery, URL scoping, file downloads. Feed ChatGPT, LangChain, LlamaIndex, Pinecone. The cheapest content crawler on Apify.
Pricing: from $0.70 / 1,000 pages scraped
Website Content Crawler is an Apify Actor that performs deep crawls of websites and extracts clean text content from web pages. It is designed for feeding large language models (LLMs), RAG pipelines, vector databases, and AI applications with high-quality web data.
Key Features
- Multiple crawler engines - Adaptive mode tries headless Firefox first and automatically falls back to HTTP if the site blocks browsers; you can also pick a specific engine manually.
- Clean content extraction - Automatically removes navigation, headers, footers, cookie banners, ads, modals, and other irrelevant page elements.
- Flexible output formats - Save content as Markdown, plain text, or HTML.
- Smart URL scoping - Stays within the start URL path. Supports include/exclude glob patterns for fine-grained control.
- Sitemap discovery - Automatically finds and parses sitemaps to discover more pages.
- Canonical URL deduplication - Skips duplicate pages identified by the same canonical URL.
- Dynamic content support - Wait for JavaScript rendering, scroll to trigger lazy loading, expand accordions and tabs.
- Cookie banner dismissal - Automatically detects and dismisses cookie consent popups.
- File downloads - Optionally download linked PDF, DOC, DOCX, XLS, XLSX, and CSV files.
- Rich metadata extraction - Extracts title, description, author, keywords, language, and canonical URL from every page.
Use Cases
Feed LLMs and AI Applications
Crawl documentation sites, knowledge bases, help centers, or blogs and feed the extracted content directly into your LLM, ChatGPT, or custom AI assistant.
Retrieval Augmented Generation (RAG)
Build a knowledge base from any website. Use the crawled content with vector databases like Pinecone, Qdrant, or Weaviate to power RAG-based question answering.
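As a minimal sketch of this pattern, the snippet below chunks crawled pages and indexes them for retrieval. It assumes `docs` is a list of LangChain Documents built from the crawler's dataset (see the Integration Examples section below) and uses a local FAISS index as a stand-in for a hosted store like Pinecone, Qdrant, or Weaviate:

```python
# RAG indexing sketch. Assumes `docs` holds LangChain Documents built
# from this Actor's dataset, and an OpenAI API key is configured.
# FAISS stands in here for a hosted vector database.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Split crawled pages into chunks sized for embedding and retrieval.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# Embed the chunks and build a searchable index.
index = FAISS.from_documents(chunks, OpenAIEmbeddings())

# Retrieve context for a question.
hits = index.similarity_search("How do I get started?", k=4)
print([h.metadata["source"] for h in hits])
```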
Custom GPTs and AI Assistants
Export crawled data as JSON and upload it as knowledge files to your custom OpenAI GPTs or AI assistants.
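For example, a short script with the apify-client Python package can dump a finished run's dataset to a JSON file you can upload as a knowledge file (the token, dataset ID, and file name below are placeholders):

```python
# Export a finished run's dataset as a JSON knowledge file.
import json
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Fetch all items from the run's default dataset.
items = client.dataset("YOUR_DATASET_ID").list_items().items

# Write them to a file you can upload to a custom GPT.
with open("knowledge.json", "w", encoding="utf-8") as f:
    json.dump(items, f, ensure_ascii=False, indent=2)
```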
Content Processing at Scale
Scrape content for summarization, translation, proofreading, or style transformation using LLMs.
LangChain and LlamaIndex Integration
Use the Apify integration with LangChain or LlamaIndex to feed crawled content directly into your AI pipeline.
How It Works
The crawler operates in three stages:
- Crawling - Discovers and downloads web pages starting from your URLs, following links within scope.
- HTML Processing - Cleans the DOM by removing navigation, ads, cookie warnings, and other noise.
- Output - Converts the cleaned HTML to your chosen format (Markdown, text, or HTML) with metadata.
Input Configuration
The only required input is Start URLs. All other settings have sensible defaults.
| Setting | Description | Default |
|---|---|---|
| Start URLs | URLs to begin crawling from | (required) |
| Crawler type | Engine: Adaptive, Firefox browser, or Cheerio HTTP | Adaptive |
| Max pages | Maximum number of pages to crawl | 100 |
| Max crawling depth | How deep to follow links from start URLs | 20 |
| Output format | Markdown, plain text, or HTML | Markdown |
| Exclude URLs (globs) | Glob patterns for URLs to skip | (none) |
| Include URLs (globs) | Only crawl URLs matching these globs | (none) |
| Remove elements (CSS) | Additional CSS selectors to remove | (none, defaults always applied) |
| Extract elements (CSS) | Only keep content from these elements | (none) |
| Remove cookie warnings | Auto-dismiss cookie consent banners | Yes |
| Wait for dynamic content | Time to wait for JS rendering (ms) | 1000 |
| Scroll height | Scroll to trigger lazy loading (px) | 0 |
| Expand clickables | Click accordions/tabs to expand | No |
| Save files | Download linked PDF/DOC/XLS files | No |
| Use sitemaps | Discover URLs from sitemaps | Yes |
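As an illustration, a minimal input that overrides a few of these defaults might look like this (the startUrls, maxCrawlPages, and outputFormat field names are the ones used in the integration examples below; all other settings keep their defaults):

```json
{
  "startUrls": [{ "url": "https://docs.example.com/" }],
  "maxCrawlPages": 200,
  "outputFormat": "markdown"
}
```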
Output Format
Each crawled page produces a JSON object:
{"url": "https://example.com/docs/getting-started","crawl": {"loadedUrl": "https://example.com/docs/getting-started","loadedTime": "2024-01-15T10:30:00.000Z","depth": 1},"metadata": {"canonicalUrl": "https://example.com/docs/getting-started","title": "Getting Started | Example Docs","description": "Learn how to get started with Example.","author": "Example Team","keywords": "docs, getting started","languageCode": "en"},"text": null,"markdown": "# Getting Started\n\nWelcome to Example...","html": null}
The content field (text, markdown, or html) is populated based on your chosen output format. The other two fields will be null.
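A convenient pattern when consuming results is to fall back across the three content fields, for example (assuming `item` is one dataset record as shown above):

```python
# Use whichever content field the run populated; the other two are null.
content = item.get("markdown") or item.get("text") or item.get("html") or ""
```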
Pricing
Only $0.001 per page ($1.00 per 1,000 pages) via pay-per-event billing.
| | This Actor | Official Apify Crawler | Firecrawl-based Actors |
|---|---|---|---|
| Price per page | $0.001 | $0.005 - $0.05 | $0.004 |
| 1,000 pages | $1.00 | $5.00 - $50.00 | $4.00 |
| 10,000 pages | $10.00 | $50.00 - $500.00 | $40.00 |
- 4x cheaper than Firecrawl-based alternatives
- 5x to 50x cheaper than the official Apify crawler, depending on its crawler type
- You only pay for pages successfully crawled and saved to the dataset
Apify's free plan includes $5/month in credits, enough to crawl ~5,000 pages for free.
Integration Examples
Python (LangChain)
```python
from langchain_community.utilities import ApifyWrapper
from langchain_core.documents import Document

apify = ApifyWrapper()

# Run the Actor and wrap its dataset as a LangChain document loader.
loader = apify.call_actor(
    actor_id="worshipful_knife/website-content-crawler",
    run_input={
        "startUrls": [{"url": "https://docs.example.com/"}],
        "maxCrawlPages": 50,
    },
    dataset_mapping_function=lambda item: Document(
        page_content=item["markdown"] or item["text"] or "",
        metadata={"source": item["url"]},
    ),
)

# Load the crawled pages as LangChain documents.
docs = loader.load()
```
Node.js (Apify Client)
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_TOKEN' });

const run = await client.actor('worshipful_knife/website-content-crawler').call({
    startUrls: [{ url: 'https://docs.example.com/' }],
    maxCrawlPages: 50,
    outputFormat: 'markdown',
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```
Troubleshooting
- Missing content? Try switching to the headless browser crawler type, which renders JavaScript.
- Too much noise in output? Use the "Remove elements (CSS)" or "Extract elements (CSS)" settings to fine-tune what is kept.
- Crawler too slow? Increase "Max concurrency" or switch to the Cheerio crawler type for static sites.
- Getting blocked? Use the headless browser crawler type with residential proxies.
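As a concrete example of the noise fix: if every crawled page carries a promotional sidebar, adding a selector like `aside.sidebar, .newsletter-signup` to "Remove elements (CSS)" strips those nodes before conversion, while setting "Extract elements (CSS)" to something like `article.main-content` keeps only the article body. These selectors are illustrative; inspect your target site to find the right ones.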
Support
If you have any questions or feedback, please open an issue on the Actor's GitHub page or contact us through Apify support.