Website Content Crawler

Crawl websites and extract clean Markdown/text content for RAG pipelines and LLMs. HTTP-first, 10x faster than browser-based crawlers.

Pricing: Pay per usage
Developer: Tugelbay Konabayev (Maintained by Community)

Website Content Crawler — Fast, Parallel Website Scraping for LLMs & RAG

Crawl entire websites by following links with configurable depth using breadth-first search (BFS). Extract clean Markdown/text/HTML content from every page with Mozilla Readability. The HTTP-first architecture (no browser overhead) makes it roughly 10x faster than Playwright-based crawlers.

Perfect for: Building knowledge bases, RAG pipelines, AI training datasets, competitive intelligence, SEO analysis, and content archiving at scale.

What does Website Content Crawler do?

This actor starts from one or more seed URLs, crawls the website following same-domain links, and extracts clean content from every page discovered. It:

  • Follows links intelligently — BFS crawling with configurable depth, max pages, and URL pattern matching (include/exclude globs)
  • Extracts clean content — Uses Mozilla Readability algorithm (same tech as Firefox Reader View) to extract just the main content, removing navigation, ads, sidebars, and boilerplate
  • Produces structured output — Markdown (optimized for LLMs), plain text, or clean HTML with auto-extracted metadata
  • Crawls fast — HTTP-first (no browser) with up to 50 concurrent requests. Crawl 50+ pages in under 1 minute
  • Extracts metadata — Title, description, author, language, Open Graph image, word count, depth, and HTTP status
  • Handles sitemaps — Optionally load URLs from XML sitemaps to seed crawling faster
  • Supports proxies — Datacenter, residential, or ISP proxies for geo-restricted or IP-blocked sites
  • PPE pricing — Pay only for successfully extracted pages (first 100 free)

No custom CSS selectors, no per-site configuration, no browser headaches. Just add URLs and let it crawl.
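The crawl loop described above can be sketched in a few lines of Python. This is an illustrative model, not the actor's actual implementation; `get_links` stands in for the fetch-and-extract step:

```python
from collections import deque
from urllib.parse import urlparse

def bfs_crawl(start_url, get_links, max_depth=3, max_pages=50):
    """Breadth-first crawl: visit pages level by level, same-domain only.

    get_links(url) -> list of absolute URLs found on that page
    (a stand-in for the actor's HTTP fetch + link extraction).
    """
    domain = urlparse(start_url).netloc
    queue = deque([(start_url, 0)])  # (url, depth)
    seen = {start_url}               # deduplicate before enqueueing
    order = []
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth >= max_depth:
            continue                 # never expand past the depth limit
        for link in get_links(url):
            # same-domain filter + dedup, mirroring the actor's behavior
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order
```

In the real actor, the enqueue step additionally applies URL normalization and the include/exclude glob filters before a link is accepted.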

Why use this instead of alternatives?

| Feature | Generic scraper | apify/website-content-crawler (free) | Website Content Crawler (ours) |
|---|---|---|---|
| Speed (50 pages) | 10–30 min (Playwright) | 14 minutes (Playwright + BFS) | 30–60 seconds (HTTP-only) |
| Architecture | Varies (browser/HTTP mix) | Playwright (slow, memory-heavy) | HTTP + concurrent requests (10x) |
| Content extraction | Raw HTML / CSS selectors | Full page content | Clean article text via Readability |
| Output quality | Includes ads, nav, footers | Includes boilerplate | Clean, LLM-ready Markdown |
| Concurrent requests | 1–5 (default) | Limited (browser overhead) | Up to 50 parallel (configurable) |
| Link following | Manual or custom logic | BFS with depth control | BFS with depth, max pages, glob patterns |
| Sitemap support | No | No | Yes (faster seeding) |
| Pricing | Varies | Free (5,743 users) | PPE (pay per extracted page) |
| User count | Varies | 5,743 | Building (PPE + MCP) |
| AI/MCP compatible | No | No (free tier not optimized) | Yes (PPE native) |
| Proxy support | Varies | Yes (Apify proxy only) | Yes (any proxy, smart escalation) |

When to use each:

  • Use Website Content Crawler (ours) when you need fast crawling + clean content for LLMs, RAG, or knowledge bases
  • Use apify/website-content-crawler (free) if you need full-page HTML and can tolerate slower speeds
  • Use a generic scraper if you need custom selectors or non-article content

The math: crawling 50 pages takes about 14 minutes with the free actor (browser overhead) versus 30–60 seconds with ours, roughly 13 minutes saved per crawl. Across 1,000 such crawls, that adds up to 200+ hours of compute time and cuts infrastructure costs by about 90%.

Features

  • Breadth-First Search (BFS) crawling with configurable depth and maximum page limits
  • Same-domain link following with automatic URL normalization and deduplication
  • URL pattern filtering — include/exclude URLs via glob patterns (e.g., include **/blog/**, exclude **/admin/**)
  • Sitemap support — optional XML sitemap loading to seed crawling faster
  • Clean content extraction using Mozilla Readability algorithm (no custom selectors needed)
  • Multiple output formats — Markdown (optimized for LLMs), plain text, or clean HTML
  • Automatic metadata extraction — title, description, author, language, Open Graph image, word count
  • Concurrent crawling — up to 50 parallel HTTP requests for speed
  • Proxy support — Apify proxy, datacenter, residential, or ISP proxies with smart escalation
  • Graceful error handling — retries failed requests, logs errors, returns partial results
  • HTTP/2 and connection pooling for maximum efficiency
  • First 100 pages free to evaluate the actor
  • PPE pricing — pay only for successfully extracted pages
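"Automatic URL normalization and deduplication" roughly means canonicalizing each discovered URL before checking it against the seen-set. A minimal sketch follows; the actor's exact rules are not published, and the tracking-parameter list here is an assumption:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters dropped if their name starts with any of these
# (an assumed list; adjust to taste).
TRACKING_PREFIXES = ("utm_", "fbclid", "gclid")

def normalize_url(url):
    """Canonicalize a URL for deduplication: lowercase scheme and host,
    drop the fragment, strip tracking query params, trim trailing slashes."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if not k.lower().startswith(TRACKING_PREFIXES)]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, urlencode(query), ""))
```

With this, `https://Example.com/blog/?utm_source=x#top` and `https://example.com/blog` collapse to the same key, so the page is fetched only once.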

Input examples

Crawl a website and extract all content as Markdown

{
  "startUrls": [
    {
      "url": "https://example.com"
    }
  ],
  "maxCrawlDepth": 3,
  "maxCrawlPages": 50,
  "outputFormat": "markdown"
}

Crawl a blog with depth limit, exclude admin pages

{
  "startUrls": [
    {
      "url": "https://blog.example.com"
    }
  ],
  "maxCrawlDepth": 2,
  "maxCrawlPages": 100,
  "outputFormat": "markdown",
  "includeUrlGlobs": ["**/blog/**", "**/post/**"],
  "excludeUrlGlobs": ["**/admin/**", "**/preview/**", "**/?utm_*"]
}

Crawl with sitemap and proxy for geo-restricted content

{
  "startUrls": [
    {
      "url": "https://geo-restricted.example.com",
      "userData": {
        "label": "main"
      }
    }
  ],
  "useSitemap": true,
  "maxCrawlPages": 500,
  "outputFormat": "markdown",
  "maxConcurrency": 30,
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}

Crawl documentation site as plain text with high concurrency

{
  "startUrls": [
    {
      "url": "https://docs.example.com"
    }
  ],
  "maxCrawlDepth": 4,
  "maxCrawlPages": 200,
  "outputFormat": "text",
  "maxConcurrency": 50,
  "pageTimeout": 30
}

Crawl multiple domains with depth limits

{
  "startUrls": [
    { "url": "https://site1.example.com" },
    { "url": "https://site2.example.com" },
    { "url": "https://docs.example.com/api" }
  ],
  "maxCrawlDepth": 2,
  "maxCrawlPages": 100,
  "outputFormat": "markdown",
  "includeUrlGlobs": ["**"],
  "excludeUrlGlobs": ["**/login", "**/signup", "**/*.pdf"]
}

Input parameters

| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| startUrls | Array | — | Yes | List of seed URLs to start crawling from (requestListSources format) |
| maxCrawlDepth | Integer | 10 | No | Maximum link depth to follow (0 = seed URLs only, 1 = seed + direct links, etc.) |
| maxCrawlPages | Integer | 50 | No | Maximum pages to crawl per domain (1–10,000); crawling stops when this limit is reached |
| outputFormat | String | markdown | No | Output format: "markdown", "text", or "html" |
| includeUrlGlobs | Array | ["**"] | No | Glob patterns to include (e.g., ["**/blog/**", "**/docs/**"]); default includes all URLs |
| excludeUrlGlobs | Array | [] | No | Glob patterns to exclude (e.g., ["**/admin/**", "**/?utm_*"]); overrides include patterns |
| useSitemap | Boolean | false | No | Load URLs from the XML sitemap (sitemap.xml) at the domain root; speeds up discovery |
| maxConcurrency | Integer | 20 | No | Number of pages to process simultaneously (1–50); higher is faster but more resource-intensive |
| pageTimeout | Integer | 30 | No | Timeout per page request in seconds (5–120); increase for slow servers |
| proxyConfiguration | Object | None | No | Proxy settings for accessing IP-blocked or geo-restricted content |

Output format

Each item in the dataset contains extracted content from one crawled page:

| Field | Type | Description |
|---|---|---|
| url | String | Final page URL (after redirects) |
| title | String | Page title (from the <title> tag or h1) |
| description | String | Meta description or auto-generated summary |
| author | String | Author (from meta tags or JSON-LD, if available) |
| language | String | Detected content language code (e.g., "en", "de", "fr") |
| content | String | Extracted page content in the requested format (Markdown/text/HTML) |
| wordCount | Integer | Number of words in the extracted content |
| depth | Integer | Link depth from the seed URL (0 = seed, 1 = one link away, etc.) |
| statusCode | Integer | HTTP response status code (200, 404, 403, etc.) |
| crawledAt | String | Crawl timestamp (ISO 8601) |
| error | String | Error message if crawling failed (null on success) |

Example output

{
  "url": "https://example.com/about",
  "title": "About Us — Example Company",
  "description": "Learn about Example Company's mission, team, and history.",
  "author": "Example Team",
  "language": "en",
  "content": "# About Us\n\nExample Company was founded in 2015 with a mission to...\n\n## Our Team\n\n- **John Smith** — CEO & Founder\n- **Jane Doe** — VP of Engineering\n- **Bob Johnson** — Product Manager\n\n## History\n\nWe started as a small startup...",
  "wordCount": 850,
  "depth": 1,
  "statusCode": 200,
  "crawledAt": "2026-03-29T14:23:45Z",
  "error": null
}

{
  "url": "https://example.com/docs/quickstart",
  "title": "Quick Start — Example API",
  "description": "Get up and running with the Example API in 5 minutes.",
  "author": null,
  "language": "en",
  "content": "# Quick Start\n\n## Installation\n\n1. Install via npm:\n   npm install example-api\n\n2. Initialize the client:\n   const client = new Example();\n\n3. Make your first request:\n   const data = await client.getData();\n\nThat's it! You're ready to use the Example API.",
  "wordCount": 420,
  "depth": 2,
  "statusCode": 200,
  "crawledAt": "2026-03-29T14:23:52Z",
  "error": null
}
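Records like these are easy to post-process before indexing, for example by dropping failed or near-empty pages. A small helper (the thresholds here are illustrative choices, not enforced by the actor):

```python
def successful_pages(items, min_words=50):
    """Filter crawled dataset items: keep pages that extracted cleanly.

    Drops items with an error, a non-200 status, or too little content.
    min_words=50 is an arbitrary quality floor; tune for your corpus.
    """
    return [it for it in items
            if it.get("error") is None
            and it.get("statusCode") == 200
            and it.get("wordCount", 0) >= min_words]
```

Apply it to the dataset items returned by the integrations below before feeding content into a vector store.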

Integrations

Apify MCP Server (Claude, AI agents)

Use as a tool in Claude Desktop, Claude Code, or any MCP-compatible AI agent. PPE pricing makes it native to AI workflows.

# Claude Code + Apify MCP Server
# The actor is available as a tool in your agent context

Python integration

from apify_client import ApifyClient

client = ApifyClient("your-apify-api-token")

# Crawl a website
run = client.actor("tugelbay/website-content-crawler").call(
    run_input={
        "startUrls": [{"url": "https://example.com"}],
        "maxCrawlDepth": 2,
        "maxCrawlPages": 50,
        "outputFormat": "markdown",
    }
)

# Read results
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"URL: {item['url']}")
    print(f"Title: {item['title']}")
    print(f"Words: {item['wordCount']}")
    print(f"Content preview: {item['content'][:300]}...")
    print()

JavaScript/TypeScript integration

import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: "your-apify-api-token" });

const run = await client.actor("tugelbay/website-content-crawler").call({
  startUrls: [{ url: "https://example.com" }],
  maxCrawlDepth: 2,
  maxCrawlPages: 50,
  outputFormat: "markdown",
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
for (const item of items) {
  console.log(`${item.title} (${item.wordCount} words, depth: ${item.depth})`);
  console.log(`URL: ${item.url}`);
  console.log(`Content preview: ${item.content?.substring(0, 300)}...`);
  console.log();
}

LangChain integration (RAG pipeline)

from langchain_community.utilities import ApifyWrapper
from langchain_core.documents import Document

apify = ApifyWrapper(apify_api_token="your-apify-api-token")

# call_actor returns a dataset loader; load() yields the Documents
loader = apify.call_actor(
    actor_id="tugelbay/website-content-crawler",
    run_input={
        "startUrls": [{"url": "https://docs.example.com"}],
        "maxCrawlDepth": 3,
        "maxCrawlPages": 200,
        "outputFormat": "markdown",
    },
    dataset_mapping_function=lambda item: Document(
        page_content=item.get("content", ""),
        metadata={
            "url": item.get("url"),
            "title": item.get("title"),
            "author": item.get("author"),
            "depth": item.get("depth"),
            "wordCount": item.get("wordCount"),
        },
    ),
)
docs = loader.load()

# Now use docs in your RAG pipeline
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(docs, embeddings)

# Query the knowledge base
results = vectorstore.similarity_search("How do I configure X?")

Webhooks and integrations

The actor integrates with Apify's ecosystem:

  • Google Sheets — export crawled content directly to a spreadsheet
  • Zapier / Make — trigger workflows when crawling completes
  • Slack — notify your team with crawl summary (pages found, errors, etc.)
  • Email — receive dataset as CSV/JSON attachment
  • REST API — call programmatically from any application
  • Apify Schedules — run crawls on a schedule (hourly, daily, weekly, custom cron)

Use cases

  1. Knowledge base building — crawl documentation sites, internal wikis, or company knowledge bases and feed content into a vector database for semantic search
  2. LLM training data — extract clean text from websites for fine-tuning datasets or pre-training
  3. RAG pipelines — crawl public documentation (API docs, guides, tutorials) and make it searchable via retrieval-augmented generation
  4. Competitive intelligence — crawl competitor websites to monitor features, pricing, and messaging changes
  5. SEO analysis — extract all page titles, meta descriptions, and h1/h2 headers for gap analysis and content strategy
  6. Content archiving — automatically archive entire website snapshots for compliance, legal holds, or historical records
  7. Content migration — extract content from legacy sites during CMS migrations to new platforms
  8. AI agent enhancement — give your AI agent the ability to read and understand entire websites, not just single pages
  9. News and blog aggregation — crawl news sites or blog networks to collect articles at scale
  10. Price monitoring — crawl e-commerce sites to extract product pages, prices, and availability (per ToS)

Cost estimation (PPE pricing)

Event: page-extracted — triggered for each page successfully extracted

Example costs:

| Scenario | Pages | Cost |
|---|---|---|
| 10-page documentation site | 10 | ~$0.05 |
| 50-page company website | 50 | ~$0.25 |
| 100-page blog with archives | 100 | ~$0.50 |
| 500-page documentation + tutorials | 500 | ~$2.50 |
| 1,000-page knowledge base | 1,000 | ~$5.00 |
| Daily crawls (50 pages/day, 30 days) | 1,500 | ~$7.50/month |
| Weekly competitor monitoring (10 sites, 20 pages each) | 200/week | ~$1.00/week |
| Large-scale extraction (10,000 pages) | 10,000 | ~$50.00 |

First 100 pages extracted are free to help you evaluate the actor.

💡 Pro tip: Exclude large file downloads (PDFs, images) and non-content pages (admin panels, login forms) via excludeUrlGlobs to reduce extraction costs and improve data quality.
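The figures in the table work out to roughly $0.005 per extracted page. A small helper for budgeting crawls; the rate is inferred from the examples above, so verify it against the actor's pricing tab before relying on it:

```python
def estimate_cost(pages, rate_per_page=0.005, free_pages=0):
    """Estimate PPE cost for a crawl.

    rate_per_page is assumed from the example table (~$0.005/page).
    The one-time 100-free-page allowance can be modeled by passing
    free_pages=100 for your very first run.
    """
    billable = max(0, pages - free_pages)
    return round(billable * rate_per_page, 2)
```

For example, a 1,000-page knowledge base estimates to $5.00, matching the table, while your first 120-page evaluation run with the free allowance applied estimates to $0.10.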

FAQ

How fast is the crawling?

Very fast. HTTP-only architecture with up to 50 concurrent requests means you can crawl 50 pages in 30–60 seconds with default settings. Increase maxConcurrency to 50 for even faster crawling on small/medium sites. Compare: the free Playwright-based actor takes 14 minutes for the same 50 pages.

What's the difference between this and apify/website-content-crawler?

  • Speed: Ours is 10–20x faster (HTTP vs. Playwright)
  • Content quality: Ours uses Readability to extract clean article text; the free one returns full page HTML
  • Pricing: Ours uses PPE (pay per extracted page, first 100 free); the free one has no usage fee but isn't optimized for AI/MCP workflows
  • Features: Both support BFS crawling, but ours adds sitemap support and better URL filtering
  • Users: Free has 5,743 users; ours is new but PPE-native for AI agents

Choose ours if you need speed, clean content, and LLM optimization. Choose the free one if you need full page HTML and can tolerate slow speeds.

Does it handle JavaScript-rendered content?

No. Website Content Crawler uses HTTP requests (no browser). If a site relies on JavaScript to render content (React SPAs, Angular apps, dynamic comments), you'll get incomplete or empty content. For JS-heavy sites, use RAG Web Browser, which has Playwright fallback.

Can I crawl password-protected or paywalled sites?

No. Website Content Crawler only works with publicly accessible content. It cannot bypass login walls, paywalls, HTTP Basic Auth, or CAPTCHA-protected pages. Use a different tool for authenticated access.

What happens if a page fails to load?

The actor logs the error and continues crawling other pages. Failed pages are included in the dataset with an error field explaining the failure (timeout, 404, blocked, etc.) and null content. Partial results are always returned.

Can I crawl multiple domains?

Yes. Add multiple startUrls and the crawler will crawl each domain independently, following links within each domain only (not cross-domain).

How do URL glob patterns work?

  • includeUrlGlobs: Whitelist — only crawl URLs matching these patterns (default: ["**"] = all)
  • excludeUrlGlobs: Blacklist — skip URLs matching these patterns

Examples:

  • includeUrlGlobs: ["**/blog/**"] — crawl only blog URLs
  • excludeUrlGlobs: ["**/admin/**"] — skip admin pages
  • includeUrlGlobs: ["**/docs/**"] — crawl only documentation
  • excludeUrlGlobs: ["**/?utm_*"] — skip URLs with UTM tracking parameters

Both can be used together. Excludes override includes.
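In code, this include-then-exclude logic looks roughly like the following. Python's fnmatch is used here as a stand-in for the actor's glob engine, so edge cases (e.g. how `**` treats path separators) may differ:

```python
from fnmatch import fnmatch

def url_allowed(url, include=("**",), exclude=()):
    """Apply include/exclude globs as described above: a URL must match
    at least one include pattern and no exclude pattern (excludes win)."""
    if not any(fnmatch(url, pat) for pat in include):
        return False  # not whitelisted
    return not any(fnmatch(url, pat) for pat in exclude)
```

So `https://x.com/blog/post` passes `["**/blog/**"]`, while anything under `/admin/` is rejected regardless of the include list.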

What output formats are available?

  • Markdown (default) — clean, semantic, optimized for LLMs with preserved headers, lists, links, emphasis
  • Plain text — raw text with minimal formatting, good for NLP/text analysis
  • HTML — clean semantic HTML (not raw page HTML), good for rendering or further processing

Can I run this on a schedule?

Yes. Create a Schedule in Apify Console to run the crawler at any interval — hourly, daily, weekly, or custom cron. Perfect for monitoring website changes, tracking competitor updates, or archiving content regularly.

What's the maximum crawl size?

Soft limit: 10,000 pages per run (configurable via maxCrawlPages). No hard technical limit, but very large crawls (100K+ pages) will take a long time and incur higher costs. For massive crawls, split into multiple runs targeting specific sections of the site.
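Splitting a massive crawl into section-scoped runs can be scripted by generating one run input per site section. The key names below follow the parameter table above; how you partition the site is up to you:

```python
def section_run_inputs(base_input, sections, pages_per_section=2000):
    """Split one huge crawl into per-section runs (e.g. /docs, /blog).

    Returns one run input per section URL, each seeded at that section
    and restricted to it via includeUrlGlobs.
    """
    runs = []
    for section in sections:
        root = section.rstrip("/")
        run = dict(base_input)                    # shared settings
        run["startUrls"] = [{"url": root}]        # seed at the section root
        run["includeUrlGlobs"] = [root + "/**"]   # stay inside the section
        run["maxCrawlPages"] = pages_per_section
        runs.append(run)
    return runs
```

Each resulting dict can then be passed as `run_input` to the actor via the Python or JavaScript clients shown earlier.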

How does it handle redirects and canonicals?

The actor follows HTTP redirects and respects canonical tags (rel="canonical"). The final url field shows the final URL after any redirects.

Troubleshooting

Empty or very short content extraction

  • Cause: The page is a single-page application (SPA) that requires JavaScript to render
  • Fix: Use RAG Web Browser instead, which falls back to browser rendering
  • Note: Very short pages (under 100 words) may not contain enough content for Readability to identify a main article. This is expected.

Crawling stops prematurely

  • Cause: Hit maxCrawlPages limit before exploring all links
  • Fix: Increase maxCrawlPages in the run input
  • Alternative: Reduce maxCrawlDepth to focus on top-level pages only
  • Cause: URL glob patterns are excluding the missing pages, or their links point outside the start domain
  • Fix: Check includeUrlGlobs and excludeUrlGlobs — verify they match intended URLs
  • Note: Cross-domain links are never followed (same-domain only for security)

Timeout errors on slow servers

  • Cause: Server is slow to respond and pageTimeout (default 30s) is exceeded
  • Fix: Increase pageTimeout to 60–120 seconds for very slow servers
  • Alternative: Reduce maxConcurrency to avoid overwhelming the target server
  • Cause: Target site is blocking requests from datacenter IPs
  • Fix: Enable Apify residential proxy in proxyConfiguration:
    {
      "proxyConfiguration": {
        "useApifyProxy": true,
        "apifyProxyGroups": ["RESIDENTIAL"]
      }
    }
  • Note: Residential proxies cost more but bypass IP blocks. Start with datacenter, escalate only if needed.

Limitations

  • JavaScript-rendered content: Only extracts server-side rendered HTML. JS-heavy SPAs will return empty/incomplete content.
  • Authentication: Cannot access login-protected or paywalled content
  • Maximum page size: 5MB per page (larger pages are truncated to prevent memory issues)
  • Cross-domain crawling: Only follows links within the same domain (security & performance)
  • Rate limiting: Respects robots.txt and Crawl-delay directives; may slow down on strictly rate-limited sites
  • Real-time data: Extracted content is a point-in-time snapshot; dynamic or frequently updated content requires re-crawling
  • Maximum concurrent requests: Limited to 50 for stability; higher concurrency may trigger IP blocks on some sites
  • Storage: Dataset size depends on site size; very large crawls (10K+ pages with lots of content) may hit storage limits
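The robots.txt behavior mentioned above can be reproduced with Python's stdlib parser. This is a sketch of the kind of check involved, not the actor's code; the rules are parsed from inline lines here instead of being fetched from the site:

```python
from urllib.robotparser import RobotFileParser

# Normally you would rp.set_url("https://example.com/robots.txt") and
# rp.read(); inline rules are used here for illustration.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Crawl-delay: 2",
    "Disallow: /admin/",
])

def can_crawl(url):
    """True if robots.txt permits fetching this URL for any user agent."""
    return rp.can_fetch("*", url)
```

A polite crawler would also read `rp.crawl_delay("*")` (2 seconds in this example) and throttle requests to that host accordingly.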

Changelog

v1.0 (2026-03-29)

  • Initial release
  • Breadth-First Search (BFS) crawling with configurable depth and max pages
  • Same-domain link following with URL normalization
  • URL glob pattern filtering (include/exclude)
  • XML sitemap support for faster discovery
  • Mozilla Readability-based content extraction
  • Multiple output formats: Markdown, plain text, clean HTML
  • Metadata extraction: title, description, author, language, word count
  • Concurrent crawling (up to 50 parallel requests)
  • Proxy support (Apify, datacenter, residential)
  • PPE pricing (first 100 pages free)
  • Full Apify SDK integration