Website Content Crawler

Deep-crawl websites to extract clean text, Markdown, or HTML for AI/LLM apps, RAG pipelines, and vector databases. Supports adaptive crawling, HTML cleaning, file downloads, and structured dataset output. Easily integrates with LangChain, LlamaIndex, and other LLM tools.

Pricing: $10.00/month + usage
Developer: mikolabs · Maintained by Community

Website Content Crawler — AI-Ready Web Scraping Actor

Extract clean text, Markdown, and structured content from any website for LLMs, RAG pipelines, vector databases, and AI applications — at a fraction of the cost.

$10 per 1,000 pages · Pay only for what you use · No subscriptions



What Is Website Content Crawler?

Website Content Crawler is a powerful Apify Actor that deep-crawls entire websites and extracts clean, structured content optimized for AI consumption. Whether you're building a RAG (Retrieval-Augmented Generation) pipeline, training an LLM, populating a vector database, or creating a custom AI chatbot, this actor delivers publication-ready content with no manual cleaning required.

Unlike generic scrapers, Website Content Crawler is purpose-built for AI workflows — it strips navigation menus, headers, footers, cookie banners, ads, and all other "noise" from every page, leaving only the meaningful article content. The result is high-quality text that feeds directly into your models without preprocessing.


Why Choose This Actor?

💰 Up to 10× Cheaper Than Alternatives

Most web crawling actors on the Apify Store charge $5–$50 per 1,000 pages. Website Content Crawler delivers the same high-quality AI-ready output at $10 per 1,000 pages — with no minimum spend, no monthly commitment, and no wasted compute on empty or duplicate pages.

| Actor | Price per 1,000 pages | AI-ready output | Markdown support | File downloads |
|---|---|---|---|---|
| Website Content Crawler | $10 | ✅ | ✅ | ✅ |
| Typical competitor A | $25–$50 | | | |
| Typical competitor B | $15–$30 | Partial | | |

⚡ Built for Scale

Crawl a single blog post or an entire documentation site with millions of pages — the actor scales automatically using Apify's cloud infrastructure. Concurrency, throttling, and retries are all managed for you.

🤖 LangChain, LlamaIndex & Vector DB Ready

The output schema matches what LangChain's ApifyWrapper, LlamaIndex's ApifyActor reader, and Pinecone/Qdrant integration actors expect out of the box — zero configuration required.


Key Features

🕷️ Intelligent Crawling

Multiple crawler types for every situation:

  • Adaptive (recommended) — automatically switches between fast HTTP requests and a headless Firefox browser depending on whether a page requires JavaScript rendering. You get maximum speed where possible and full JS support where needed.
  • Firefox + Playwright — headless browser that renders JavaScript, bypasses common anti-bot protections, and handles single-page applications. Best for modern websites.
  • Chrome + Playwright — alternative browser option for sites that respond differently to Chrome vs Firefox fingerprints.
  • Cheerio (raw HTTP) — the fastest option for static websites. No browser overhead, extremely low cost, ideal for documentation sites, blogs, and news sites.

Smart URL management:

  • Crawls all sub-pages under your start URLs automatically — provide https://docs.example.com/ and it discovers every page beneath it
  • Include URL globs — use wildcard patterns like https://{docs,blog}.example.com/** to expand the crawl scope across multiple subdomains or sections
  • Exclude URL globs — skip login pages, pagination, or any URL pattern with glob rules like https://example.com/tag/**
  • Sitemap discovery — automatically reads sitemap.xml files to find pages that aren't linked from the main navigation
  • llms.txt support — the emerging standard for AI-readable site indexes; discovers and crawls URLs listed in /llms.txt files
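The include/exclude logic above can be sketched as a simple filter over discovered URLs, with excludes taking precedence. This is an illustration only, not the actor's internal code, and Python's `fnmatch` supports plain `*` wildcards but not the `{a,b}` brace expansion or `**` semantics of real glob patterns:

```python
from fnmatch import fnmatch

def url_allowed(url, include_globs, exclude_globs):
    """Return True if a discovered URL should be enqueued.

    Exclude patterns win over include patterns; an empty include
    list means "everything in scope is allowed".
    """
    if any(fnmatch(url, pat) for pat in exclude_globs):
        return False
    if include_globs:
        return any(fnmatch(url, pat) for pat in include_globs)
    return True

print(url_allowed("https://example.com/tag/news",
                  [], ["https://example.com/tag/*"]))   # False
print(url_allowed("https://docs.example.com/api/intro",
                  ["https://docs.example.com/*"], []))  # True
```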

Deduplication:

  • Canonical URL deduplication — pages that share a <link rel="canonical"> are stored only once, preventing duplicate content in your dataset
  • ETag deduplication — unchanged pages (same ETag header) are automatically skipped on re-crawls, saving cost
  • URL fragment control — optionally treat page#section as a unique URL for single-page applications
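Conceptually, canonical deduplication keeps one record per canonical URL and falls back to the request URL when no canonical is declared. A minimal sketch (field names follow this actor's output schema):

```python
def dedupe_by_canonical(records):
    """Keep the first record seen for each canonical URL.

    Records without a canonical fall back to their request URL,
    mirroring the "canonical or request URL" rule described above.
    """
    seen = {}
    for rec in records:
        key = rec.get("canonicalUrl") or rec["url"]
        if key not in seen:
            seen[key] = rec
    return list(seen.values())

pages = [
    {"url": "https://example.com/post?ref=tw", "canonicalUrl": "https://example.com/post"},
    {"url": "https://example.com/post", "canonicalUrl": "https://example.com/post"},
    {"url": "https://example.com/about", "canonicalUrl": None},
]
print(len(dedupe_by_canonical(pages)))  # 2
```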

Depth and size controls:

  • Set maximum crawl depth (how many links deep to follow)
  • Set maximum total pages crawled
  • Set maximum dataset results saved (independent of pages fetched)
  • Initial concurrency + max concurrency with AutoThrottle for polite crawling
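The depth and page caps interact as in this breadth-first sketch. It is a toy model (the real crawler discovers links by fetching pages), but the limit logic is the same:

```python
from collections import deque

def crawl_plan(start_url, link_graph, max_depth=20, max_pages=None):
    """Breadth-first URL ordering honoring maxCrawlDepth / maxCrawlPages.

    `link_graph` maps each URL to the URLs it links to.
    """
    queue = deque([(start_url, 0)])
    visited = []
    seen = {start_url}
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if max_pages is not None and len(visited) >= max_pages:
            break  # hard cap on total pages fetched
        if depth < max_depth:
            for link in link_graph.get(url, []):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return visited

graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
print(crawl_plan("a", graph, max_depth=1))  # ['a', 'b', 'c']
```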

🧹 Advanced HTML Cleaning

This is where Website Content Crawler stands apart from raw scrapers. Every page goes through a multi-stage cleaning pipeline before any text is extracted:

Stage 1 — Noise removal: Automatically removes navigation bars, headers, footers, sidebars, advertisements, modals, ARIA dialogs, cookie consent banners, and inline scripts. The default removal rules mirror industry best practices used by major AI data pipelines.

Stage 2 — Content scoping: Use a CSS selector to keep only the elements you care about — for example, article.post-content to extract just the blog body, ignoring related posts, author bios, and share buttons.

Stage 3 — Readability extraction: Applies Mozilla's Readability algorithm (the same one used by Firefox Reader Mode) to strip page chrome and isolate the primary article content. A configurable character threshold ensures the algorithm only applies when it produces a meaningful result.

Stage 4 — Aggressive pruning (optional): An extra cleaning pass that removes widgets, pagination controls, social share buttons, newsletter signups, and breadcrumbs — useful for sites with heavy supplementary content.

Cookie banner removal: Uses keyword-matching heuristics to detect and remove cookie consent notices that appear in the page body, keeping your extracted text clean.
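Stage 1 can be pictured as dropping whole noise elements before any text is extracted. A deliberately crude regex sketch, not the actor's implementation (which uses a far longer selector list and a real HTML parser):

```python
import re

# A simplified subset of tags treated as "noise" by the default rules.
NOISE_TAGS = ["nav", "header", "footer", "aside", "script", "style"]

def strip_noise(html):
    """Crude stage-1 cleaning: drop noise elements, then strip tags."""
    for tag in NOISE_TAGS:
        html = re.sub(rf"<{tag}\b.*?</{tag}>", "", html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", html)      # strip remaining tags
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

page = ("<nav>Home | About</nav>"
        "<article><h1>Title</h1><p>Body text.</p></article>"
        "<footer>© 2025</footer>")
print(strip_noise(page))  # Title Body text.
```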


📄 Output Formats

Every crawled page produces a structured record with:

Text — always included. Clean plain text with no HTML, no markup, no noise. Ready to paste into a prompt or embed into a vector store.

Markdown — preserves document structure (headings, lists, bold/italic, code blocks, links) in a format that LLMs understand natively. Ideal for retrieval pipelines where structure matters.

HTML snippet — the cleaned HTML after all noise removal, useful if you need to render the content or do further processing downstream.

Raw HTML file — the complete original page HTML uploaded to Apify's Key-Value Store, with a public URL in the output record. Useful for archiving or re-processing.

Screenshots — full-page PNG screenshots captured by the browser (Playwright crawlers only), stored in the Key-Value Store.

File downloads — PDF, Word (DOC/DOCX), Excel (XLS/XLSX), and CSV files linked from crawled pages are automatically downloaded and stored in the Key-Value Store. Files respect your exclude URL rules but are not limited to your start URL domain — cross-domain documents are collected too.


📊 Rich Metadata Extraction

Every output record includes structured metadata automatically extracted from the page:

| Field | Source |
|---|---|
| title | `<title>`, `og:title`, `twitter:title`, or first `<h1>` |
| description | `meta[name=description]`, `og:description`, `twitter:description` |
| author | `meta[name=author]`, `dc.creator`, `article:author`, `twitter:creator` |
| keywords | `meta[name=keywords]` |
| canonicalUrl | `<link rel=canonical>`, `og:url`, or request URL |
| languageCode | `<html lang="...">` attribute, with automatic detection fallback |
| publishedAt | `article:published_time`, `datePublished`, `pubdate`, or `<time datetime>` |

The crawl object in every record also includes the loaded URL (after any redirects), timestamp, referring URL, crawl depth, and HTTP status code.
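Each fallback chain behaves like a first-non-empty lookup. The title chain, for instance, can be sketched as (an illustration of the rule in the table above, not the actor's code):

```python
def pick_title(meta):
    """First non-empty source wins: <title>, og:title, twitter:title, <h1>."""
    for key in ("title", "og:title", "twitter:title", "h1"):
        value = (meta.get(key) or "").strip()
        if value:
            return value
    return None

# An empty <title> falls through to og:title.
print(pick_title({"title": "", "og:title": "Getting Started — Example Docs"}))
```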


🔐 Authentication & Session Support

Login with cookies — provide session cookies extracted from your browser (using tools like EditThisCookie) and the crawler injects them on every request. Supports name, value, domain, and path fields per cookie.

Custom HTTP headers — add any header to every request: Bearer tokens for API authentication, custom User-Agent strings, or any proprietary header your target site requires.

Proxy support:

  • Apify Proxy — access residential and datacenter IPs in 100+ countries with automatic rotation
  • Custom proxy URLs — bring your own proxies with round-robin rotation
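Putting the three options together, a run input might look like the following. Field names are taken from the Input Parameters section of this README; the URL, cookie, and token values are placeholders:

```python
import json

run_input = {
    "startUrls": [{"url": "https://app.example.com/docs"}],
    # Session cookies exported from your browser (placeholder values).
    "initialCookies": [
        {"name": "session", "value": "abc123",
         "domain": ".example.com", "path": "/"},
    ],
    # Sent on every request, e.g. for API-token-protected sites.
    "customHttpHeaders": {"Authorization": "Bearer YOUR_API_TOKEN"},
    # Residential IPs via Apify Proxy with automatic rotation.
    "proxyConfiguration": {"useApifyProxy": True,
                           "apifyProxyGroups": ["RESIDENTIAL"]},
}
print(json.dumps(run_input, indent=2))
```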

⚙️ Browser Rendering Controls

For JavaScript-heavy websites, fine-tune how the browser processes each page:

  • Wait for selector — don't extract content until a specific CSS selector appears in the DOM (useful for lazy-loaded content)
  • Dynamic content wait — pause a fixed number of seconds after page load for animations or async data fetches to complete
  • Infinite scroll — scroll down to a configurable pixel height to trigger lazy-loaded content sections
  • Click elements — click expandable DOM elements (accordions, "Read more" buttons, tabs) using a CSS selector before extracting content
  • Expand iframes — include content from embedded iframes in the extracted text

🔍 robots.txt Compliance

Enable respectRobotsTxtFile to have the crawler consult and obey robots.txt rules on every domain it visits. Disabled by default for maximum reach; enable it when crawling third-party sites where compliance is required.
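Python's standard `urllib.robotparser` shows what "consult and obey robots.txt" means in practice — a sketch of the concept, not the actor's implementation:

```python
from urllib.robotparser import RobotFileParser

# The crawler fetches each domain's robots.txt and checks every URL
# against it before requesting the page.
parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /private/",
])
print(parser.can_fetch("*", "https://example.com/docs"))       # True
print(parser.can_fetch("*", "https://example.com/private/x"))  # False
```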


Output Record Example

```json
{
  "url": "https://docs.example.com/getting-started",
  "crawl": {
    "loadedUrl": "https://docs.example.com/getting-started",
    "loadedTime": "2025-03-15T10:30:00.000Z",
    "referrerUrl": "https://docs.example.com/",
    "depth": 1,
    "httpStatus": 200
  },
  "metadata": {
    "canonicalUrl": "https://docs.example.com/getting-started",
    "title": "Getting Started — Example Docs",
    "description": "Learn how to get up and running in under 5 minutes.",
    "author": "Example Team",
    "keywords": null,
    "languageCode": "en",
    "publishedAt": "2024-11-01T00:00:00Z"
  },
  "screenshotUrl": null,
  "text": "Getting Started\nLearn how to get up and running in under 5 minutes.\n\nInstallation\nRun the following command to install...",
  "markdown": "# Getting Started\n\nLearn how to get up and running in under 5 minutes.\n\n## Installation\n\nRun the following command...",
  "html": null
}
```
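A downstream consumer typically picks one content field per record. For instance, preferring Markdown with a plain-text fallback (using a trimmed version of the record above):

```python
record = {
    "url": "https://docs.example.com/getting-started",
    "metadata": {"title": "Getting Started — Example Docs", "languageCode": "en"},
    "text": "Getting Started\nLearn how to get up and running...",
    "markdown": "# Getting Started\n\nLearn how to get up and running...",
}

# Markdown preserves headings and lists; fall back to plain text
# when saveMarkdown was disabled for the run.
content = record.get("markdown") or record["text"]
doc = {
    "source": record["url"],
    "title": record["metadata"]["title"],
    "content": content,
}
print(doc["title"])  # Getting Started — Example Docs
```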

Use Cases

🧠 RAG (Retrieval-Augmented Generation)

Crawl your product documentation, knowledge base, or blog and feed the extracted text directly into a vector database like Pinecone, Qdrant, or Chroma. Your AI assistant can then answer questions grounded in your actual content rather than hallucinating.
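A common preprocessing step before embedding is splitting each record's text into overlapping chunks. A minimal sketch with illustrative (not prescribed) sizes:

```python
def chunk_text(text, size=500, overlap=100):
    """Split crawler output into overlapping chunks for embedding.

    Overlap keeps sentences that straddle a boundary retrievable
    from either side, a common RAG preprocessing choice.
    """
    step = size - overlap
    return [text[i:i + size]
            for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("x" * 1200, size=500, overlap=100)
print([len(c) for c in chunks])  # [500, 500, 400]
```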

🤖 Custom AI Chatbots

Let customers onboard by typing their website URL. The crawler indexes their content in minutes, giving your chatbot deep product knowledge instantly — without any manual data entry.

📚 LLM Fine-Tuning Datasets

Collect large volumes of high-quality, clean text from curated websites to build domain-specific fine-tuning datasets. The Markdown output preserves document structure that modern LLMs handle well.

🔎 Semantic Search

Crawl your internal wikis, support docs, or any website and build a semantic search engine powered by embeddings. The clean text output embeds cleanly without noise diluting the semantic signal.

📝 Content Summarization at Scale

Crawl an entire blog archive and batch-process the text through the OpenAI API for summarization, translation, proofreading, or tone-of-voice analysis.

🏢 Competitive Intelligence

Monitor competitor websites, product pages, and documentation for changes. Combine with a scheduled run to detect updates automatically.

📖 Custom GPT Knowledge Files

Export the crawled dataset as JSON and upload it directly to your custom OpenAI GPT as a knowledge file — no reformatting required.

🗃️ Content Archiving

Create searchable archives of websites, news sources, or any online content for compliance, research, or historical preservation.

🔗 LangChain & LlamaIndex Integration

The output schema is identical to what Apify's official LangChain and LlamaIndex integrations expect — drop this actor in as a direct replacement with no code changes.


Pricing

| Volume | Price | Per page |
|---|---|---|
| First 1,000 pages | $10 | $0.010 |
| 1,000–10,000 pages | $10/1k | $0.010 |
| 10,000–100,000 pages | $10/1k | $0.010 |
| 100,000+ pages | Contact us | Volume discount |

What counts as a page? One crawled URL — whether it returns content or not. Duplicate pages that are skipped by canonical or ETag deduplication are not charged. File downloads count as one item each.
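At a flat rate, cost estimation is a one-liner:

```python
def crawl_cost(pages_charged, rate_per_1000=10.00):
    """Estimated cost at the flat $10 per 1,000 pages rate.

    `pages_charged` excludes pages skipped by canonical/ETag
    deduplication, which are free per the note above.
    """
    return round(pages_charged * rate_per_1000 / 1000, 2)

print(crawl_cost(2500))  # 25.0
```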

Compared to Apify's native actor: the official apify/website-content-crawler bills in Apify Compute Units (CUs), which typically works out to $0.50–$5.00 per 1,000 pages with a browser crawler and around $0.20 per 1,000 pages with raw HTTP. Website Content Crawler gives you the same output at a flat, predictable $10 per 1,000 pages regardless of crawler type — no surprises.


Input Parameters

Crawling

| Parameter | Default | Description |
|---|---|---|
| startUrls | — | One or more URLs to start crawling from (required) |
| crawlerType | playwright:firefox | Crawler engine: playwright:adaptive, playwright:firefox, playwright:chrome, cheerio, jsdom |
| includeUrlGlobs | [] | Glob patterns for URLs to include (overrides scope when set) |
| excludeUrlGlobs | [] | Glob patterns for URLs to skip |
| maxCrawlDepth | 20 | Maximum link-following depth |
| maxCrawlPages | unlimited | Hard cap on total pages fetched |
| maxResults | unlimited | Cap on dataset records saved |
| maxConcurrency | 16 | Maximum parallel requests |
| initialConcurrency | 1 | Starting concurrency (ramps up automatically) |
| maxRequestRetries | 3 | Retry attempts per failed request |
| useSitemaps | false | Parse sitemap.xml for extra URL discovery |
| useLlmsTxt | false | Parse /llms.txt for AI-curated URL lists |
| respectRobotsTxtFile | false | Obey robots.txt exclusion rules |
| keepUrlFragments | false | Treat #fragment as part of URL identity |
| ignoreCanonicalUrl | false | Deduplicate by actual URL, not canonical |

Browser Rendering

| Parameter | Default | Description |
|---|---|---|
| dynamicContentWaitSecs | 0 | Seconds to wait after page load for dynamic content |
| maxScrollHeightPixels | 0 | Scroll height in pixels to trigger infinite scroll |
| waitForSelector | — | Wait for this CSS selector before extracting |
| clickElementsCssSelector | — | Click these elements to expand content |
| expandIframes | true | Include iframe content in extraction |

HTML Processing

| Parameter | Default | Description |
|---|---|---|
| htmlTransformer | readableText | readableText (article extraction) or none |
| readableTextCharThreshold | 100 | Minimum chars for readability to succeed |
| aggressivePrune | false | Extra removal of widgets, sidebars, pagination |
| removeElementsCssSelector | built-in | CSS selector for additional elements to strip |
| keepElementsCssSelector | — | Keep only these elements, discard everything else |
| removeCookieWarnings | true | Remove cookie consent banners |

Output

| Parameter | Default | Description |
|---|---|---|
| saveMarkdown | true | Include Markdown in output records |
| saveHtml | false | Include cleaned HTML snippet |
| saveHtmlAsFile | false | Upload raw HTML to Key-Value Store |
| saveScreenshots | false | Capture full-page screenshot (browser only) |
| saveFiles | false | Download linked PDF/DOCX/XLSX/CSV files |
| minFileDownloadSpeedKBps | 64 | Abort file downloads slower than this speed |

Authentication & Proxy

| Parameter | Default | Description |
|---|---|---|
| proxyConfiguration | — | Apify Proxy or custom proxy URLs |
| initialCookies | [] | Cookies for authenticated crawling |
| customHttpHeaders | {} | Custom headers on every request |

Debug

| Parameter | Default | Description |
|---|---|---|
| debugMode | false | Add cleanHtml and response headers to records |
| debugLog | false | Enable verbose debug logging |

Integrations

LangChain (Python)

```python
import os

from langchain_community.utilities import ApifyWrapper
from langchain_core.documents import Document

os.environ["APIFY_API_TOKEN"] = "<YOUR_APIFY_TOKEN>"  # or set in your shell

apify = ApifyWrapper()
loader = apify.call_actor(
    actor_id="YOUR_ACTOR_ID",
    run_input={
        "startUrls": [{"url": "https://docs.yoursite.com/"}],
        "maxCrawlPages": 500,
    },
    # Map each dataset record to a LangChain Document.
    dataset_mapping_function=lambda item: Document(
        page_content=item["text"] or "",
        metadata={
            "source": item["url"],
            "title": item["metadata"]["title"],
        },
    ),
)
documents = loader.load()
```

LlamaIndex (Python)

```python
from llama_index.core import Document
from llama_index.readers.apify import ApifyActor

reader = ApifyActor("<YOUR_APIFY_TOKEN>")
documents = reader.load_data(
    actor_id="YOUR_ACTOR_ID",
    run_input={"startUrls": [{"url": "https://docs.yoursite.com/"}]},
    # Map each dataset record to a LlamaIndex Document.
    dataset_mapping_function=lambda item: Document(
        text=item.get("text") or "",
        metadata={"url": item.get("url")},
    ),
)
```

Pinecone / Qdrant

Use the Apify Pinecone or Qdrant integration actors to stream crawl results directly into your vector database with incremental updates — only changed pages are re-embedded on subsequent runs.

OpenAI Custom GPTs

Export the dataset as JSON and upload directly as a knowledge file to any custom GPT in OpenAI's interface.


Troubleshooting

**No content extracted / text is empty**
Switch to playwright:firefox or playwright:adaptive. Many modern sites require JavaScript to render their content and will return an empty shell page to raw HTTP requests.

**Content includes too much noise (navigation, sidebars)**
Use the keepElementsCssSelector input to target only the main content element (e.g. main, article, .post-body). Alternatively, add unwanted element selectors to removeElementsCssSelector.

**Crawl is too slow**
Increase maxConcurrency (try 32 or 64) and set initialConcurrency to the same value to skip the ramp-up phase. For large sites, cheerio is 3–5× faster than browser crawlers.

**Site is blocking requests**
Use playwright:firefox with Apify's residential proxies (proxyConfiguration: { useApifyProxy: true, apifyProxyGroups: ["RESIDENTIAL"] }). The combination of browser fingerprinting and IP rotation bypasses most commercial anti-bot systems.

**Crawl misses pages**
Enable useSitemaps: true to discover pages that aren't linked from the main navigation. Also check your excludeUrlGlobs — an overly broad pattern may be filtering out valid pages.

**Login-protected pages not crawled**
Export your session cookies using the EditThisCookie browser extension and paste them into initialCookies. The crawler injects them on every request, maintaining your authenticated session throughout the crawl.


Web scraping is generally legal when applied to publicly available, non-personal data. Always review the target website's Terms of Service before crawling. Content extracted from websites (documentation, articles, blog posts) is typically subject to copyright — ensure your use case complies with applicable law. When in doubt, seek qualified legal advice.


Support

Have a question, found a bug, or need a custom feature? Open an issue in the Apify Console issue tracker or contact us directly. We respond to all issues within 24 hours on business days.


Website Content Crawler — the most cost-effective way to turn any website into AI-ready content.