Website Content Crawler

Deep-crawl websites to extract clean text, Markdown, or HTML for AI/LLM apps, RAG pipelines, and vector databases. Supports adaptive crawling, HTML cleaning, file downloads, and structured dataset output. Easily integrates with LangChain, LlamaIndex, and other LLM tools.

Pricing: $10.00/month + usage
Developer: mikolabs · Maintained by Community

Website Content Crawler — AI-Ready Web Scraping Actor

Extract clean text, Markdown, and structured content from any website for LLMs, RAG pipelines, vector databases, and AI applications — at a fraction of the cost.

$10 per 1,000 pages · Pay only for what you use · No subscriptions



What Is Website Content Crawler?

Website Content Crawler is a powerful Apify Actor that deep-crawls entire websites and extracts clean, structured content optimized for AI consumption. Whether you're building a RAG (Retrieval-Augmented Generation) pipeline, training an LLM, populating a vector database, or creating a custom AI chatbot, this actor delivers publication-ready content with no manual cleaning required.

Unlike generic scrapers, Website Content Crawler is purpose-built for AI workflows — it strips navigation menus, headers, footers, cookie banners, ads, and all other "noise" from every page, leaving only the meaningful article content. The result is high-quality text that feeds directly into your models without preprocessing.


Why Choose This Actor?

💰 Up to 10× Cheaper Than Alternatives

Most web crawling actors on the Apify Store charge $5–$50 per 1,000 pages. Website Content Crawler delivers the same high-quality AI-ready output at $10 per 1,000 pages — with no minimum spend, no monthly commitment, and no wasted compute on empty or duplicate pages.

| Actor | Price per 1,000 pages | AI-ready output | Markdown support | File downloads |
|---|---|---|---|---|
| Website Content Crawler | $10 | ✅ | ✅ | ✅ |
| Typical competitor A | $25–$50 | | | |
| Typical competitor B | $15–$30 | Partial | | |

⚡ Built for Scale

Crawl a single blog post or an entire documentation site with millions of pages — the actor scales automatically using Apify's cloud infrastructure. Concurrency, throttling, and retries are all managed for you.

🤖 LangChain, LlamaIndex & Vector DB Ready

The output schema matches what LangChain's ApifyWrapper, LlamaIndex's ApifyActor reader, and Pinecone/Qdrant integration actors expect out of the box — zero configuration required.


Key Features

🕷️ Intelligent Crawling

Multiple crawler types for every situation:

  • Adaptive (recommended) — automatically switches between fast HTTP requests and a headless Firefox browser depending on whether a page requires JavaScript rendering. You get maximum speed where possible and full JS support where needed.
  • Firefox + Playwright — headless browser that renders JavaScript, bypasses common anti-bot protections, and handles single-page applications. Best for modern websites.
  • Chrome + Playwright — alternative browser option for sites that respond differently to Chrome vs Firefox fingerprints.
  • Cheerio (raw HTTP) — the fastest option for static websites. No browser overhead, extremely low cost, ideal for documentation sites, blogs, and news sites.

Smart URL management:

  • Crawls all sub-pages under your start URLs automatically — provide https://docs.example.com/ and it discovers every page beneath it
  • Include URL globs — use wildcard patterns like https://{docs,blog}.example.com/** to expand the crawl scope across multiple subdomains or sections
  • Exclude URL globs — skip login pages, pagination, or any URL pattern with glob rules like https://example.com/tag/**
  • Sitemap discovery — automatically reads sitemap.xml files to find pages that aren't linked from the main navigation
  • llms.txt support — the emerging standard for AI-readable site indexes; discovers and crawls URLs listed in /llms.txt files
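The include/exclude logic above can be sketched as a simple filter over discovered URLs, with excludes taking precedence. This is an illustration only, not the actor's internal code, and Python's `fnmatch` supports plain `*` wildcards but not the `{a,b}` brace expansion or `**` semantics of real glob patterns:

```python
from fnmatch import fnmatch

def url_allowed(url, include_globs, exclude_globs):
    """Return True if a discovered URL should be enqueued.

    Exclude patterns win over include patterns; an empty include
    list means "everything in scope is allowed".
    """
    if any(fnmatch(url, pat) for pat in exclude_globs):
        return False
    if include_globs:
        return any(fnmatch(url, pat) for pat in include_globs)
    return True

print(url_allowed("https://example.com/tag/news",
                  [], ["https://example.com/tag/*"]))   # False
print(url_allowed("https://docs.example.com/api/intro",
                  ["https://docs.example.com/*"], []))  # True
```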

Deduplication:

  • Canonical URL deduplication — pages that share a <link rel="canonical"> are stored only once, preventing duplicate content in your dataset
  • ETag deduplication — unchanged pages (same ETag header) are automatically skipped on re-crawls, saving cost
  • URL fragment control — optionally treat page#section as a unique URL for single-page applications
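Conceptually, canonical deduplication keeps one record per canonical URL and falls back to the request URL when no canonical is declared. A minimal sketch (field names follow this actor's output schema):

```python
def dedupe_by_canonical(records):
    """Keep the first record seen for each canonical URL.

    Records without a canonical fall back to their request URL,
    mirroring the "canonical or request URL" rule described above.
    """
    seen = {}
    for rec in records:
        key = rec.get("canonicalUrl") or rec["url"]
        if key not in seen:
            seen[key] = rec
    return list(seen.values())

pages = [
    {"url": "https://example.com/post?ref=tw", "canonicalUrl": "https://example.com/post"},
    {"url": "https://example.com/post", "canonicalUrl": "https://example.com/post"},
    {"url": "https://example.com/about", "canonicalUrl": None},
]
print(len(dedupe_by_canonical(pages)))  # 2
```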

Depth and size controls:

  • Set maximum crawl depth (how many links deep to follow)
  • Set maximum total pages crawled
  • Set maximum dataset results saved (independent of pages fetched)
  • Initial concurrency + max concurrency with AutoThrottle for polite crawling
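The depth and page caps interact as in this breadth-first sketch. It is a toy model (the real crawler discovers links by fetching pages), but the limit logic is the same:

```python
from collections import deque

def crawl_plan(start_url, link_graph, max_depth=20, max_pages=None):
    """Breadth-first URL ordering honoring maxCrawlDepth / maxCrawlPages.

    `link_graph` maps each URL to the URLs it links to.
    """
    queue = deque([(start_url, 0)])
    visited = []
    seen = {start_url}
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if max_pages is not None and len(visited) >= max_pages:
            break  # hard cap on total pages fetched
        if depth < max_depth:
            for link in link_graph.get(url, []):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return visited

graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
print(crawl_plan("a", graph, max_depth=1))  # ['a', 'b', 'c']
```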

🧹 Advanced HTML Cleaning

This is where Website Content Crawler stands apart from raw scrapers. Every page goes through a multi-stage cleaning pipeline before any text is extracted:

Stage 1 — Noise removal: Automatically removes navigation bars, headers, footers, sidebars, advertisements, modals, ARIA dialogs, cookie consent banners, and inline scripts. The default removal rules mirror industry best practices used by major AI data pipelines.

Stage 2 — Content scoping: Use a CSS selector to keep only the elements you care about — for example, article.post-content to extract just the blog body, ignoring related posts, author bios, and share buttons.

Stage 3 — Readability extraction: Applies Mozilla's Readability algorithm (the same one used by Firefox Reader Mode) to strip page chrome and isolate the primary article content. A configurable character threshold ensures the algorithm only applies when it produces a meaningful result.

Stage 4 — Aggressive pruning (optional): An extra cleaning pass that removes widgets, pagination controls, social share buttons, newsletter signups, and breadcrumbs — useful for sites with heavy supplementary content.

Cookie banner removal: Uses keyword-matching heuristics to detect and remove cookie consent notices that appear in the page body, keeping your extracted text clean.
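Stage 1 can be pictured as dropping whole noise elements before any text is extracted. A deliberately crude regex sketch, not the actor's implementation (which uses a far longer selector list and a real HTML parser):

```python
import re

# A simplified subset of tags treated as "noise" by the default rules.
NOISE_TAGS = ["nav", "header", "footer", "aside", "script", "style"]

def strip_noise(html):
    """Crude stage-1 cleaning: drop noise elements, then strip tags."""
    for tag in NOISE_TAGS:
        html = re.sub(rf"<{tag}\b.*?</{tag}>", "", html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", html)      # strip remaining tags
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

page = ("<nav>Home | About</nav>"
        "<article><h1>Title</h1><p>Body text.</p></article>"
        "<footer>© 2025</footer>")
print(strip_noise(page))  # Title Body text.
```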


📄 Output Formats

Every crawled page produces a structured record with:

Text — always included. Clean plain text with no HTML, no markup, no noise. Ready to paste into a prompt or embed into a vector store.

Markdown — preserves document structure (headings, lists, bold/italic, code blocks, links) in a format that LLMs understand natively. Ideal for retrieval pipelines where structure matters.

HTML snippet — the cleaned HTML after all noise removal, useful if you need to render the content or do further processing downstream.

Raw HTML file — the complete original page HTML uploaded to Apify's Key-Value Store, with a public URL in the output record. Useful for archiving or re-processing.

Screenshots — full-page PNG screenshots captured by the browser (Playwright crawlers only), stored in the Key-Value Store.

File downloads — PDF, Word (DOC/DOCX), Excel (XLS/XLSX), and CSV files linked from crawled pages are automatically downloaded and stored in the Key-Value Store. Files respect your exclude URL rules but are not limited to your start URL domain — cross-domain documents are collected too.


📊 Rich Metadata Extraction

Every output record includes structured metadata automatically extracted from the page:

| Field | Source |
|---|---|
| title | `<title>`, `og:title`, `twitter:title`, or first `<h1>` |
| description | `meta[name=description]`, `og:description`, `twitter:description` |
| author | `meta[name=author]`, `dc.creator`, `article:author`, `twitter:creator` |
| keywords | `meta[name=keywords]` |
| canonicalUrl | `<link rel=canonical>`, `og:url`, or request URL |
| languageCode | `<html lang="...">` attribute, with automatic detection fallback |
| publishedAt | `article:published_time`, `datePublished`, `pubdate`, or `<time datetime>` |

The crawl object in every record also includes the loaded URL (after any redirects), timestamp, referring URL, crawl depth, and HTTP status code.
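Each fallback chain behaves like a first-non-empty lookup. The title chain, for instance, can be sketched as (an illustration of the rule in the table above, not the actor's code):

```python
def pick_title(meta):
    """First non-empty source wins: <title>, og:title, twitter:title, <h1>."""
    for key in ("title", "og:title", "twitter:title", "h1"):
        value = (meta.get(key) or "").strip()
        if value:
            return value
    return None

# An empty <title> falls through to og:title.
print(pick_title({"title": "", "og:title": "Getting Started — Example Docs"}))
```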


🔐 Authentication & Session Support

Login with cookies — provide session cookies extracted from your browser (using tools like EditThisCookie) and the crawler injects them on every request. Supports name, value, domain, and path fields per cookie.

Custom HTTP headers — add any header to every request: Bearer tokens for API authentication, custom User-Agent strings, or any proprietary header your target site requires.

Proxy support:

  • Apify Proxy — access residential and datacenter IPs in 100+ countries with automatic rotation
  • Custom proxy URLs — bring your own proxies with round-robin rotation
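Putting the three options together, a run input might look like the following. Field names are taken from the Input Parameters section of this README; the URL, cookie, and token values are placeholders:

```python
import json

run_input = {
    "startUrls": [{"url": "https://app.example.com/docs"}],
    # Session cookies exported from your browser (placeholder values).
    "initialCookies": [
        {"name": "session", "value": "abc123",
         "domain": ".example.com", "path": "/"},
    ],
    # Sent on every request, e.g. for API-token-protected sites.
    "customHttpHeaders": {"Authorization": "Bearer YOUR_API_TOKEN"},
    # Residential IPs via Apify Proxy with automatic rotation.
    "proxyConfiguration": {"useApifyProxy": True,
                           "apifyProxyGroups": ["RESIDENTIAL"]},
}
print(json.dumps(run_input, indent=2))
```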

⚙️ Browser Rendering Controls

For JavaScript-heavy websites, fine-tune how the browser processes each page:

  • Wait for selector — don't extract content until a specific CSS selector appears in the DOM (useful for lazy-loaded content)
  • Dynamic content wait — pause a fixed number of seconds after page load for animations or async data fetches to complete
  • Infinite scroll — scroll down to a configurable pixel height to trigger lazy-loaded content sections
  • Click elements — click expandable DOM elements (accordions, "Read more" buttons, tabs) using a CSS selector before extracting content
  • Expand iframes — include content from embedded iframes in the extracted text

🔍 robots.txt Compliance

Enable respectRobotsTxtFile to have the crawler consult and obey robots.txt rules on every domain it visits. Disabled by default for maximum reach; enable it when crawling third-party sites where compliance is required.
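Python's standard `urllib.robotparser` shows what "consult and obey robots.txt" means in practice — a sketch of the concept, not the actor's implementation:

```python
from urllib.robotparser import RobotFileParser

# The crawler fetches each domain's robots.txt and checks every URL
# against it before requesting the page.
parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /private/",
])
print(parser.can_fetch("*", "https://example.com/docs"))       # True
print(parser.can_fetch("*", "https://example.com/private/x"))  # False
```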


Output Record Example

```json
{
  "url": "https://docs.example.com/getting-started",
  "crawl": {
    "loadedUrl": "https://docs.example.com/getting-started",
    "loadedTime": "2025-03-15T10:30:00.000Z",
    "referrerUrl": "https://docs.example.com/",
    "depth": 1,
    "httpStatus": 200
  },
  "metadata": {
    "canonicalUrl": "https://docs.example.com/getting-started",
    "title": "Getting Started — Example Docs",
    "description": "Learn how to get up and running in under 5 minutes.",
    "author": "Example Team",
    "keywords": null,
    "languageCode": "en",
    "publishedAt": "2024-11-01T00:00:00Z"
  },
  "screenshotUrl": null,
  "text": "Getting Started\nLearn how to get up and running in under 5 minutes.\n\nInstallation\nRun the following command to install...",
  "markdown": "# Getting Started\n\nLearn how to get up and running in under 5 minutes.\n\n## Installation\n\nRun the following command...",
  "html": null
}
```
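A downstream consumer typically picks one content field per record. For instance, preferring Markdown with a plain-text fallback (using a trimmed version of the record above):

```python
record = {
    "url": "https://docs.example.com/getting-started",
    "metadata": {"title": "Getting Started — Example Docs", "languageCode": "en"},
    "text": "Getting Started\nLearn how to get up and running...",
    "markdown": "# Getting Started\n\nLearn how to get up and running...",
}

# Markdown preserves headings and lists; fall back to plain text
# when saveMarkdown was disabled for the run.
content = record.get("markdown") or record["text"]
doc = {
    "source": record["url"],
    "title": record["metadata"]["title"],
    "content": content,
}
print(doc["title"])  # Getting Started — Example Docs
```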

Use Cases

🧠 RAG (Retrieval-Augmented Generation)

Crawl your product documentation, knowledge base, or blog and feed the extracted text directly into a vector database like Pinecone, Qdrant, or Chroma. Your AI assistant can then answer questions grounded in your actual content rather than hallucinating.
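A common preprocessing step before embedding is splitting each record's text into overlapping chunks. A minimal sketch with illustrative (not prescribed) sizes:

```python
def chunk_text(text, size=500, overlap=100):
    """Split crawler output into overlapping chunks for embedding.

    Overlap keeps sentences that straddle a boundary retrievable
    from either side, a common RAG preprocessing choice.
    """
    step = size - overlap
    return [text[i:i + size]
            for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("x" * 1200, size=500, overlap=100)
print([len(c) for c in chunks])  # [500, 500, 400]
```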

🤖 Custom AI Chatbots

Let customers onboard by typing their website URL. The crawler indexes their content in minutes, giving your chatbot deep product knowledge instantly — without any manual data entry.

📚 LLM Fine-Tuning Datasets

Collect large volumes of high-quality, clean text from curated websites to build domain-specific fine-tuning datasets. The Markdown output preserves document structure that modern LLMs handle well.

🔎 Semantic Search

Crawl your internal wikis, support docs, or any website and build a semantic search engine powered by embeddings. The clean text output embeds cleanly without noise diluting the semantic signal.

📝 Content Summarization at Scale

Crawl an entire blog archive and batch-process the text through the OpenAI API for summarization, translation, proofreading, or tone-of-voice analysis.

🏢 Competitive Intelligence

Monitor competitor websites, product pages, and documentation for changes. Combine with a scheduled run to detect updates automatically.

📖 Custom GPT Knowledge Files

Export the crawled dataset as JSON and upload it directly to your custom OpenAI GPT as a knowledge file — no reformatting required.

🗃️ Content Archiving

Create searchable archives of websites, news sources, or any online content for compliance, research, or historical preservation.

🔗 LangChain & LlamaIndex Integration

The output schema is identical to what Apify's official LangChain and LlamaIndex integrations expect — drop this actor in as a direct replacement with no code changes.


Pricing

| Volume | Price | Per page |
|---|---|---|
| First 1,000 pages | $10 | $0.010 |
| 1,000–10,000 pages | $10/1k | $0.010 |
| 10,000–100,000 pages | $10/1k | $0.010 |
| 100,000+ pages | Contact us | Volume discount |

What counts as a page? One crawled URL — whether it returns content or not. Duplicate pages that are skipped by canonical or ETag deduplication are not charged. File downloads count as one item each.
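At a flat rate, cost estimation is a one-liner:

```python
def crawl_cost(pages_charged, rate_per_1000=10.00):
    """Estimated cost at the flat $10 per 1,000 pages rate.

    `pages_charged` excludes pages skipped by canonical/ETag
    deduplication, which are free per the note above.
    """
    return round(pages_charged * rate_per_1000 / 1000, 2)

print(crawl_cost(2500))  # 25.0
```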

Compared to Apify's native actor: the official apify/website-content-crawler bills in Apify Compute Units (CUs), which typically works out to $0.50–$5.00 per 1,000 pages with a browser crawler and around $0.20 per 1,000 pages with raw HTTP. Website Content Crawler gives you the same output at a flat, predictable $10 per 1,000 pages regardless of crawler type — no surprises.


Input Parameters

Crawling

| Parameter | Default | Description |
|---|---|---|
| startUrls | — | One or more URLs to start crawling from (required) |
| crawlerType | playwright:firefox | Crawler engine: playwright:adaptive, playwright:firefox, playwright:chrome, cheerio, jsdom |
| includeUrlGlobs | [] | Glob patterns for URLs to include (overrides scope when set) |
| excludeUrlGlobs | [] | Glob patterns for URLs to skip |
| maxCrawlDepth | 20 | Maximum link-following depth |
| maxCrawlPages | unlimited | Hard cap on total pages fetched |
| maxResults | unlimited | Cap on dataset records saved |
| maxConcurrency | 16 | Maximum parallel requests |
| initialConcurrency | 1 | Starting concurrency (ramps up automatically) |
| maxRequestRetries | 3 | Retry attempts per failed request |
| useSitemaps | false | Parse sitemap.xml for extra URL discovery |
| useLlmsTxt | false | Parse /llms.txt for AI-curated URL lists |
| respectRobotsTxtFile | false | Obey robots.txt exclusion rules |
| keepUrlFragments | false | Treat #fragment as part of URL identity |
| ignoreCanonicalUrl | false | Deduplicate by actual URL, not canonical |

Browser Rendering

| Parameter | Default | Description |
|---|---|---|
| dynamicContentWaitSecs | 0 | Seconds to wait after page load for dynamic content |
| maxScrollHeightPixels | 0 | Scroll height in pixels to trigger infinite scroll |
| waitForSelector | — | Wait for this CSS selector before extracting |
| clickElementsCssSelector | — | Click these elements to expand content |
| expandIframes | true | Include iframe content in extraction |

HTML Processing

| Parameter | Default | Description |
|---|---|---|
| htmlTransformer | readableText | readableText (article extraction) or none |
| readableTextCharThreshold | 100 | Minimum chars for readability to succeed |
| aggressivePrune | false | Extra removal of widgets, sidebars, pagination |
| removeElementsCssSelector | built-in | CSS selector for additional elements to strip |
| keepElementsCssSelector | — | Keep only these elements, discard everything else |
| removeCookieWarnings | true | Remove cookie consent banners |

Output

| Parameter | Default | Description |
|---|---|---|
| saveMarkdown | true | Include Markdown in output records |
| saveHtml | false | Include cleaned HTML snippet |
| saveHtmlAsFile | false | Upload raw HTML to Key-Value Store |
| saveScreenshots | false | Capture full-page screenshot (browser only) |
| saveFiles | false | Download linked PDF/DOCX/XLSX/CSV files |
| minFileDownloadSpeedKBps | 64 | Abort file downloads slower than this speed |

Authentication & Proxy

| Parameter | Default | Description |
|---|---|---|
| proxyConfiguration | — | Apify Proxy or custom proxy URLs |
| initialCookies | [] | Cookies for authenticated crawling |
| customHttpHeaders | {} | Custom headers on every request |

Debug

| Parameter | Default | Description |
|---|---|---|
| debugMode | false | Add cleanHtml and response headers to records |
| debugLog | false | Enable verbose debug logging |

Integrations

LangChain (Python)

```python
import os

from langchain_community.utilities import ApifyWrapper
from langchain_core.documents import Document

os.environ["APIFY_API_TOKEN"] = "<YOUR_APIFY_TOKEN>"  # or set in your shell

apify = ApifyWrapper()
loader = apify.call_actor(
    actor_id="YOUR_ACTOR_ID",
    run_input={
        "startUrls": [{"url": "https://docs.yoursite.com/"}],
        "maxCrawlPages": 500,
    },
    # Map each dataset record to a LangChain Document.
    dataset_mapping_function=lambda item: Document(
        page_content=item["text"] or "",
        metadata={
            "source": item["url"],
            "title": item["metadata"]["title"],
        },
    ),
)
documents = loader.load()
```

LlamaIndex (Python)

```python
from llama_index.core import Document
from llama_index.readers.apify import ApifyActor

reader = ApifyActor("<YOUR_APIFY_TOKEN>")
documents = reader.load_data(
    actor_id="YOUR_ACTOR_ID",
    run_input={"startUrls": [{"url": "https://docs.yoursite.com/"}]},
    # Map each dataset record to a LlamaIndex Document.
    dataset_mapping_function=lambda item: Document(
        text=item.get("text") or "",
        metadata={"url": item.get("url")},
    ),
)
```

Pinecone / Qdrant

Use the Apify Pinecone or Qdrant integration actors to stream crawl results directly into your vector database with incremental updates — only changed pages are re-embedded on subsequent runs.

OpenAI Custom GPTs

Export the dataset as JSON and upload directly as a knowledge file to any custom GPT in OpenAI's interface.


Troubleshooting

**No content extracted / text is empty**
Switch to playwright:firefox or playwright:adaptive. Many modern sites require JavaScript to render their content and will return an empty shell page to raw HTTP requests.

**Content includes too much noise (navigation, sidebars)**
Use the keepElementsCssSelector input to target only the main content element (e.g. main, article, .post-body). Alternatively, add unwanted element selectors to removeElementsCssSelector.

**Crawl is too slow**
Increase maxConcurrency (try 32 or 64) and set initialConcurrency to the same value to skip the ramp-up phase. For large sites, cheerio is 3–5× faster than browser crawlers.

**Site is blocking requests**
Use playwright:firefox with Apify's residential proxies (proxyConfiguration: { useApifyProxy: true, apifyProxyGroups: ["RESIDENTIAL"] }). The combination of browser fingerprinting and IP rotation bypasses most commercial anti-bot systems.

**Crawl misses pages**
Enable useSitemaps: true to discover pages that aren't linked from the main navigation. Also check your excludeUrlGlobs — an overly broad pattern may be filtering out valid pages.

**Login-protected pages not crawled**
Export your session cookies using the EditThisCookie browser extension and paste them into initialCookies. The crawler injects them on every request, maintaining your authenticated session throughout the crawl.


Web scraping is generally legal when applied to publicly available, non-personal data. Always review the target website's Terms of Service before crawling. Content extracted from websites (documentation, articles, blog posts) is typically subject to copyright — ensure your use case complies with applicable law. When in doubt, seek qualified legal advice.


Support

Have a question, found a bug, or need a custom feature? Open an issue in the Apify Console issue tracker or contact us directly. We respond to all issues within 24 hours on business days.


Website Content Crawler — the most cost-effective way to turn any website into AI-ready content.