Website Content Crawler
Pricing
$10.00/month + usage
Deep-crawl websites to extract clean text, Markdown, or HTML for AI/LLM apps, RAG pipelines, and vector databases. Supports adaptive crawling, HTML cleaning, file downloads, and structured dataset output. Easily integrates with LangChain, LlamaIndex, and other LLM tools.
Developer: mikolabs
Last modified: 3 days ago
Website Content Crawler — AI-Ready Web Scraping Actor
Extract clean text, Markdown, and structured content from any website for LLMs, RAG pipelines, vector databases, and AI applications — at a fraction of the cost.
$10 per 1,000 pages · Pay only for what you use · No subscriptions
What Is Website Content Crawler?
Website Content Crawler is a powerful Apify Actor that deep-crawls entire websites and extracts clean, structured content optimized for AI consumption. Whether you're building a RAG (Retrieval-Augmented Generation) pipeline, training an LLM, populating a vector database, or creating a custom AI chatbot, this actor delivers publication-ready content with no manual cleaning required.
Unlike generic scrapers, Website Content Crawler is purpose-built for AI workflows — it strips navigation menus, headers, footers, cookie banners, ads, and all other "noise" from every page, leaving only the meaningful article content. The result is high-quality text that feeds directly into your models without preprocessing.
Why Choose This Actor?
💰 Up to 10× Cheaper Than Alternatives
Most web crawling actors on the Apify Store charge $5–$50 per 1,000 pages. Website Content Crawler delivers the same high-quality AI-ready output at $10 per 1,000 pages — with no minimum spend, no monthly commitment, and no wasted compute on empty or duplicate pages.
| Actor | Price per 1,000 pages | AI-ready output | Markdown support | File downloads |
|---|---|---|---|---|
| Website Content Crawler | $10 | ✅ | ✅ | ✅ |
| Typical competitor A | $25–$50 | ❌ | ❌ | ❌ |
| Typical competitor B | $15–$30 | Partial | ❌ | ❌ |
⚡ Built for Scale
Crawl a single blog post or an entire documentation site with millions of pages — the actor scales automatically using Apify's cloud infrastructure. Concurrency, throttling, and retries are all managed for you.
🤖 LangChain, LlamaIndex & Vector DB Ready
The output schema matches what LangChain's `ApifyWrapper`, LlamaIndex's `ApifyActor` reader, and the Pinecone/Qdrant integration actors expect out of the box — zero configuration required.
Key Features
🕷️ Intelligent Crawling
Multiple crawler types for every situation:
- Adaptive (recommended) — automatically switches between fast HTTP requests and a headless Firefox browser depending on whether a page requires JavaScript rendering. You get maximum speed where possible and full JS support where needed.
- Firefox + Playwright — headless browser that renders JavaScript, bypasses common anti-bot protections, and handles single-page applications. Best for modern websites.
- Chrome + Playwright — alternative browser option for sites that respond differently to Chrome vs Firefox fingerprints.
- Cheerio (raw HTTP) — the fastest option for static websites. No browser overhead, extremely low cost, ideal for documentation sites, blogs, and news sites.
Smart URL management:
- Crawls all sub-pages under your start URLs automatically — provide `https://docs.example.com/` and it discovers every page beneath it
- Include URL globs — use wildcard patterns like `https://{docs,blog}.example.com/**` to expand the crawl scope across multiple subdomains or sections
- Exclude URL globs — skip login pages, pagination, or any URL pattern with glob rules like `https://example.com/tag/**`
- Sitemap discovery — automatically reads `sitemap.xml` files to find pages that aren't linked from the main navigation
- llms.txt support — the emerging standard for AI-readable site indexes; discovers and crawls URLs listed in `/llms.txt` files
Deduplication:
- Canonical URL deduplication — pages that share a `<link rel="canonical">` are stored only once, preventing duplicate content in your dataset
- ETag deduplication — unchanged pages (same `ETag` header) are automatically skipped on re-crawls, saving cost
- URL fragment control — optionally treat `page#section` as a unique URL for single-page applications
Depth and size controls:
- Set maximum crawl depth (how many links deep to follow)
- Set maximum total pages crawled
- Set maximum dataset results saved (independent of pages fetched)
- Initial concurrency + max concurrency with AutoThrottle for polite crawling
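Putting the crawl-scope controls together, here is a sketch of a run input; the URLs and limits are placeholders, and the parameter names follow the Input Parameters tables later in this README.

```python
import json

# Sketch of a run input combining scope controls: start URLs, include
# and exclude globs, and depth/page caps. Pass this dict to the Actor
# via the Apify Console JSON editor or any Apify API client.
run_input = {
    "startUrls": [{"url": "https://docs.example.com/"}],
    "includeUrlGlobs": ["https://docs.example.com/**"],
    "excludeUrlGlobs": ["https://docs.example.com/tag/**"],
    "crawlerType": "playwright:adaptive",
    "maxCrawlDepth": 5,
    "maxCrawlPages": 2000,
    "useSitemaps": True,
}

print(json.dumps(run_input, indent=2))
```

The same structure works as plain JSON in the Console editor (use `true` instead of Python's `True`).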
🧹 Advanced HTML Cleaning
This is where Website Content Crawler stands apart from raw scrapers. Every page goes through a multi-stage cleaning pipeline before any text is extracted:
Stage 1 — Noise removal: Automatically removes navigation bars, headers, footers, sidebars, advertisements, modals, ARIA dialogs, cookie consent banners, and inline scripts. The default removal rules mirror industry best practices used by major AI data pipelines.
Stage 2 — Content scoping:
Use a CSS selector to keep only the elements you care about — for example, `article.post-content` to extract just the blog body, ignoring related posts, author bios, and share buttons.
Stage 3 — Readability extraction: Applies Mozilla's Readability algorithm (the same one used by Firefox Reader Mode) to strip page chrome and isolate the primary article content. A configurable character threshold ensures the algorithm only applies when it produces a meaningful result.
Stage 4 — Aggressive pruning (optional): An extra cleaning pass that removes widgets, pagination controls, social share buttons, newsletter signups, and breadcrumbs — useful for sites with heavy supplementary content.
Cookie banner removal: Uses keyword-matching heuristics to detect and remove cookie consent notices that appear in the page body, keeping your extracted text clean.
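As a rough illustration of what Stage 1 noise removal does, here is a toy stdlib-only sketch that drops a few noise tags and keeps the remaining text. The Actor's real pipeline is far more sophisticated; the tag list here is purely illustrative.

```python
from html.parser import HTMLParser

# Tags whose entire subtree is treated as "noise" in this sketch.
NOISE_TAGS = {"nav", "header", "footer", "aside", "script", "style"}

class NoiseStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting level inside a noise subtree
        self.chunks = []    # text fragments kept from the page

    def handle_starttag(self, tag, attrs):
        # Enter (or go deeper into) a noise subtree.
        if tag in NOISE_TAGS or self.depth:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every noise subtree.
        if not self.depth and data.strip():
            self.chunks.append(data.strip())

def clean_text(html: str) -> str:
    parser = NoiseStripper()
    parser.feed(html)
    return "\n".join(parser.chunks)

page = ("<nav>Home | About</nav>"
        "<article><h1>Title</h1><p>Body text.</p></article>"
        "<footer>© 2025</footer>")
print(clean_text(page))  # prints: Title / Body text. (nav and footer dropped)
```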
📄 Output Formats
Every crawled page produces a structured record with:
Text — always included. Clean plain text with no HTML, no markup, no noise. Ready to paste into a prompt or embed into a vector store.
Markdown — preserves document structure (headings, lists, bold/italic, code blocks, links) in a format that LLMs understand natively. Ideal for retrieval pipelines where structure matters.
HTML snippet — the cleaned HTML after all noise removal, useful if you need to render the content or do further processing downstream.
Raw HTML file — the complete original page HTML uploaded to Apify's Key-Value Store, with a public URL in the output record. Useful for archiving or re-processing.
Screenshots — full-page PNG screenshots captured by the browser (Playwright crawlers only), stored in the Key-Value Store.
File downloads — PDF, Word (DOC/DOCX), Excel (XLS/XLSX), and CSV files linked from crawled pages are automatically downloaded and stored in the Key-Value Store. Files respect your exclude URL rules but are not limited to your start URL domain — cross-domain documents are collected too.
📊 Rich Metadata Extraction
Every output record includes structured metadata automatically extracted from the page:
| Field | Source |
|---|---|
| `title` | `<title>`, `og:title`, `twitter:title`, or first `<h1>` |
| `description` | `meta[name=description]`, `og:description`, `twitter:description` |
| `author` | `meta[name=author]`, `dc.creator`, `article:author`, `twitter:creator` |
| `keywords` | `meta[name=keywords]` |
| `canonicalUrl` | `<link rel=canonical>`, `og:url`, or request URL |
| `languageCode` | `<html lang="...">` attribute, with automatic detection fallback |
| `publishedAt` | `article:published_time`, `datePublished`, `pubdate`, or `<time datetime>` |
The crawl object in every record also includes the loaded URL (after any redirects), timestamp, referring URL, crawl depth, and HTTP status code.
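Given a record shaped like the output example later in this README, filtering on these fields is straightforward. The record below is hand-written sample data, not real crawl output:

```python
# Sample record mirroring the Actor's output shape (sample data only).
record = {
    "url": "https://docs.example.com/getting-started",
    "crawl": {"loadedUrl": "https://docs.example.com/getting-started",
              "depth": 1, "httpStatus": 200},
    "metadata": {"title": "Getting Started — Example Docs",
                 "languageCode": "en",
                 "publishedAt": "2024-11-01T00:00:00Z"},
    "text": "Getting Started\nRun the following command to install...",
}

def is_usable(rec) -> bool:
    """Keep only successfully loaded, English-language pages with text."""
    return (rec["crawl"]["httpStatus"] == 200
            and rec["metadata"].get("languageCode") == "en"
            and bool(rec.get("text")))

print(is_usable(record))  # prints: True
```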
🔐 Authentication & Session Support
Login with cookies — provide session cookies extracted from your browser (using tools like EditThisCookie) and the crawler injects them on every request. Supports `name`, `value`, `domain`, and `path` fields per cookie.
Custom HTTP headers — add any header to every request: Bearer tokens for API authentication, custom User-Agent strings, or any proprietary header your target site requires.
Proxy support:
- Apify Proxy — access residential and datacenter IPs in 100+ countries with automatic rotation
- Custom proxy URLs — bring your own proxies with round-robin rotation
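An authenticated run input combining these options might look like the sketch below. The cookie and token values are made up for illustration; substitute your own exported session cookies.

```python
# Sketch of an authenticated, proxied run input. The cookie name/value
# and the bearer token are placeholders, not working credentials.
run_input = {
    "startUrls": [{"url": "https://portal.example.com/kb/"}],
    "initialCookies": [
        {"name": "sessionid", "value": "abc123",
         "domain": ".example.com", "path": "/"},
    ],
    "customHttpHeaders": {"Authorization": "Bearer YOUR_TOKEN"},
    "proxyConfiguration": {"useApifyProxy": True},
}
```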
⚙️ Browser Rendering Controls
For JavaScript-heavy websites, fine-tune how the browser processes each page:
- Wait for selector — don't extract content until a specific CSS selector appears in the DOM (useful for lazy-loaded content)
- Dynamic content wait — pause a fixed number of seconds after page load for animations or async data fetches to complete
- Infinite scroll — scroll down to a configurable pixel height to trigger lazy-loaded content sections
- Click elements — click expandable DOM elements (accordions, "Read more" buttons, tabs) using a CSS selector before extracting content
- Expand iframes — include content from embedded iframes in the extracted text
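For a JavaScript-heavy single-page app, these controls combine naturally in one run input. The CSS selectors below are placeholders; use selectors from your target site's actual markup.

```python
# Sketch of rendering controls for a JS-heavy site. Selectors and wait
# times are illustrative and must be tuned per target site.
run_input = {
    "startUrls": [{"url": "https://app.example.com/docs"}],
    "crawlerType": "playwright:firefox",
    "waitForSelector": "main article",        # wait for content to mount
    "dynamicContentWaitSecs": 5,              # extra settle time
    "maxScrollHeightPixels": 5000,            # trigger lazy loading
    "clickElementsCssSelector": "button.read-more, .accordion-toggle",
    "expandIframes": True,
}
```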
🔍 robots.txt Compliance
Enable `respectRobotsTxtFile` to have the crawler consult and obey `robots.txt` rules on every domain it visits. It is disabled by default for maximum reach; enable it when crawling third-party sites where compliance is required.
Output Record Example
```json
{
  "url": "https://docs.example.com/getting-started",
  "crawl": {
    "loadedUrl": "https://docs.example.com/getting-started",
    "loadedTime": "2025-03-15T10:30:00.000Z",
    "referrerUrl": "https://docs.example.com/",
    "depth": 1,
    "httpStatus": 200
  },
  "metadata": {
    "canonicalUrl": "https://docs.example.com/getting-started",
    "title": "Getting Started — Example Docs",
    "description": "Learn how to get up and running in under 5 minutes.",
    "author": "Example Team",
    "keywords": null,
    "languageCode": "en",
    "publishedAt": "2024-11-01T00:00:00Z"
  },
  "screenshotUrl": null,
  "text": "Getting Started\nLearn how to get up and running in under 5 minutes.\n\nInstallation\nRun the following command to install...",
  "markdown": "# Getting Started\n\nLearn how to get up and running in under 5 minutes.\n\n## Installation\n\nRun the following command...",
  "html": null
}
```
Use Cases
🧠 RAG (Retrieval-Augmented Generation)
Crawl your product documentation, knowledge base, or blog and feed the extracted text directly into a vector database like Pinecone, Qdrant, or Chroma. Your AI assistant can then answer questions grounded in your actual content rather than hallucinating.
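A typical preprocessing step between the crawl and the vector store is chunking. A minimal fixed-size chunker with overlap might look like this sketch; the chunk sizes are illustrative, and the embedding and upsert calls are left out:

```python
def chunk_text(text: str, size: int = 800, overlap: int = 100):
    """Split a record's extracted text into overlapping chunks for embedding."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap   # step forward, keeping some overlap
    return chunks

page_text = ("word " * 500).strip()   # stand-in for a record's "text" field
chunks = chunk_text(page_text, size=800, overlap=100)
print(len(chunks), len(chunks[0]))    # prints: 4 800
```

Each chunk would then be embedded and upserted into your vector database alongside the record's `url` as metadata.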
🤖 Custom AI Chatbots
Let customers onboard by typing their website URL. The crawler indexes their content in minutes, giving your chatbot deep product knowledge instantly — without any manual data entry.
📚 LLM Fine-Tuning Datasets
Collect large volumes of high-quality, clean text from curated websites to build domain-specific fine-tuning datasets. The Markdown output preserves document structure that modern LLMs handle well.
🔎 Semantic Search
Crawl your internal wikis, support docs, or any website and build a semantic search engine powered by embeddings. The clean text output embeds cleanly without noise diluting the semantic signal.
📝 Content Summarization at Scale
Crawl an entire blog archive and batch-process the text through the OpenAI API for summarization, translation, proofreading, or tone-of-voice analysis.
🏢 Competitive Intelligence
Monitor competitor websites, product pages, and documentation for changes. Combine with a scheduled run to detect updates automatically.
📖 Custom GPT Knowledge Files
Export the crawled dataset as JSON and upload it directly to your custom OpenAI GPT as a knowledge file — no reformatting required.
🗃️ Content Archiving
Create searchable archives of websites, news sources, or any online content for compliance, research, or historical preservation.
🔗 LangChain & LlamaIndex Integration
The output schema is identical to what Apify's official LangChain and LlamaIndex integrations expect — drop this actor in as a direct replacement with no code changes.
Pricing
| Volume | Price | Per page |
|---|---|---|
| First 1,000 pages | $10 | $0.010 |
| 1,000–10,000 pages | $10/1k | $0.010 |
| 10,000–100,000 pages | $10/1k | $0.010 |
| 100,000+ pages | Contact us | Volume discount |
What counts as a page? One crawled URL — whether it returns content or not. Duplicate pages that are skipped by canonical or ETag deduplication are not charged. File downloads count as one item each.
Compared to Apify's native actor:
The official apify/website-content-crawler bills in Apify Compute Units (CUs), which typically works out to $0.50–$5.00 per 1,000 pages with a browser crawler and around $0.20 per 1,000 pages with raw HTTP. Website Content Crawler gives you the same output at a flat, predictable $10 per 1,000 pages regardless of crawler type — no surprises.
Input Parameters
Crawling
| Parameter | Default | Description |
|---|---|---|
| `startUrls` | — | One or more URLs to start crawling from (required) |
| `crawlerType` | `playwright:firefox` | Crawler engine: `playwright:adaptive`, `playwright:firefox`, `playwright:chrome`, `cheerio`, `jsdom` |
| `includeUrlGlobs` | `[]` | Glob patterns for URLs to include (overrides scope when set) |
| `excludeUrlGlobs` | `[]` | Glob patterns for URLs to skip |
| `maxCrawlDepth` | 20 | Maximum link-following depth |
| `maxCrawlPages` | unlimited | Hard cap on total pages fetched |
| `maxResults` | unlimited | Cap on dataset records saved |
| `maxConcurrency` | 16 | Maximum parallel requests |
| `initialConcurrency` | 1 | Starting concurrency (ramps up automatically) |
| `maxRequestRetries` | 3 | Retry attempts per failed request |
| `useSitemaps` | `false` | Parse `sitemap.xml` for extra URL discovery |
| `useLlmsTxt` | `false` | Parse `/llms.txt` for AI-curated URL lists |
| `respectRobotsTxtFile` | `false` | Obey `robots.txt` exclusion rules |
| `keepUrlFragments` | `false` | Treat `#fragment` as part of URL identity |
| `ignoreCanonicalUrl` | `false` | Deduplicate by actual URL, not canonical |
Browser Rendering
| Parameter | Default | Description |
|---|---|---|
| `dynamicContentWaitSecs` | 0 | Seconds to wait after page load for dynamic content |
| `maxScrollHeightPixels` | 0 | Scroll height in pixels to trigger infinite scroll |
| `waitForSelector` | — | Wait for this CSS selector before extracting |
| `clickElementsCssSelector` | — | Click these elements to expand content |
| `expandIframes` | `true` | Include iframe content in extraction |
HTML Processing
| Parameter | Default | Description |
|---|---|---|
| `htmlTransformer` | `readableText` | `readableText` (article extraction) or `none` |
| `readableTextCharThreshold` | 100 | Minimum characters for readability extraction to succeed |
| `aggressivePrune` | `false` | Extra removal of widgets, sidebars, pagination |
| `removeElementsCssSelector` | built-in | CSS selector for additional elements to strip |
| `keepElementsCssSelector` | — | Keep only these elements, discard everything else |
| `removeCookieWarnings` | `true` | Remove cookie consent banners |
Output
| Parameter | Default | Description |
|---|---|---|
| `saveMarkdown` | `true` | Include Markdown in output records |
| `saveHtml` | `false` | Include cleaned HTML snippet |
| `saveHtmlAsFile` | `false` | Upload raw HTML to Key-Value Store |
| `saveScreenshots` | `false` | Capture full-page screenshot (browser only) |
| `saveFiles` | `false` | Download linked PDF/DOCX/XLSX/CSV files |
| `minFileDownloadSpeedKBps` | 64 | Abort file downloads slower than this speed |
Authentication & Proxy
| Parameter | Default | Description |
|---|---|---|
| `proxyConfiguration` | — | Apify Proxy or custom proxy URLs |
| `initialCookies` | `[]` | Cookies for authenticated crawling |
| `customHttpHeaders` | `{}` | Custom headers on every request |
Debug
| Parameter | Default | Description |
|---|---|---|
| `debugMode` | `false` | Add `cleanHtml` and response headers to records |
| `debugLog` | `false` | Enable verbose crawler debug logging |
Integrations
LangChain (Python)
```python
from langchain_community.utilities import ApifyWrapper
from langchain_core.documents import Document

apify = ApifyWrapper()

loader = apify.call_actor(
    actor_id="YOUR_ACTOR_ID",
    run_input={
        "startUrls": [{"url": "https://docs.yoursite.com/"}],
        "maxCrawlPages": 500,
    },
    dataset_mapping_function=lambda item: Document(
        page_content=item["text"] or "",
        metadata={
            "source": item["url"],
            "title": item["metadata"]["title"],
        },
    ),
)
```
LlamaIndex (Python)
```python
from llama_index.core import Document
from llama_index.readers.apify import ApifyActor

reader = ApifyActor("<YOUR_APIFY_TOKEN>")

documents = reader.load_data(
    actor_id="YOUR_ACTOR_ID",
    run_input={"startUrls": [{"url": "https://docs.yoursite.com/"}]},
    dataset_mapping_function=lambda item: Document(
        text=item.get("text"),
        metadata={"url": item.get("url")},
    ),
)
```
Pinecone / Qdrant
Use the Apify Pinecone or Qdrant integration actors to stream crawl results directly into your vector database with incremental updates — only changed pages are re-embedded on subsequent runs.
OpenAI Custom GPTs
Export the dataset as JSON and upload directly as a knowledge file to any custom GPT in OpenAI's interface.
Troubleshooting
No content extracted / text is empty
Switch to `playwright:firefox` or `playwright:adaptive`. Many modern sites require JavaScript to render their content and will return an empty shell page to raw HTTP requests.
Content includes too much noise (navigation, sidebars)
Use the `keepElementsCssSelector` input to target only the main content element (e.g. `main`, `article`, `.post-body`). Alternatively, add unwanted element selectors to `removeElementsCssSelector`.
Crawl is too slow
Increase `maxConcurrency` (try 32 or 64) and set `initialConcurrency` to the same value to skip the ramp-up phase. For large static sites, `cheerio` is 3–5× faster than browser crawlers.
Site is blocking requests
Use `playwright:firefox` with Apify's residential proxies (`proxyConfiguration: { "useApifyProxy": true, "apifyProxyGroups": ["RESIDENTIAL"] }`). The combination of browser fingerprinting and IP rotation bypasses most commercial anti-bot systems.
Crawl misses pages
Enable `useSitemaps: true` to discover pages that aren't linked from the main navigation. Also check your `excludeUrlGlobs` — an overly broad pattern may be filtering out valid pages.
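One quick way to sanity-check an exclude pattern locally is Python's `fnmatch`. Note this is only an approximation: `fnmatch` semantics differ from the minimatch-style globs commonly used on Apify (its `*` crosses `/` boundaries, so `**` and `*` behave alike here), so treat it as a rough check rather than a faithful reproduction of the Actor's matching.

```python
from fnmatch import fnmatch

# Roughly check which candidate URLs an exclude glob would catch.
exclude = "https://example.com/tag/**"
urls = [
    "https://example.com/tag/python/page-2",    # should be excluded
    "https://example.com/blog/globs-explained", # should be kept
]
for url in urls:
    print(url, "EXCLUDED" if fnmatch(url, exclude) else "kept")
```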
Login-protected pages not crawled
Export your session cookies using the EditThisCookie browser extension and paste them into `initialCookies`. The crawler injects them on every request, maintaining your authenticated session throughout the crawl.
Legal Notice
Web scraping is generally legal when applied to publicly available, non-personal data. Always review the target website's Terms of Service before crawling. Content extracted from websites (documentation, articles, blog posts) is typically subject to copyright — ensure your use case complies with applicable law. When in doubt, seek qualified legal advice.
Support
Have a question, found a bug, or need a custom feature? Open an issue in the Apify Console issue tracker or contact us directly. We respond to all issues within 24 hours on business days.
Website Content Crawler — the most cost-effective way to turn any website into AI-ready content.