
Website Content Crawler

Pricing

from $0.70 / 1,000 pages scraped


Deep crawl websites and extract clean text, Markdown, or HTML for LLMs, RAG, and AI apps. Removes navigation, ads, cookie banners. Supports headless browser & HTTP. Sitemap discovery, URL scoping, file downloads. Feed ChatGPT, LangChain, LlamaIndex, Pinecone. The cheapest content crawler on Apify.


Rating: 0.0 (0 reviews)

Developer: kata Kuri (Maintained by Community)

Actor stats

Bookmarked: 0

Total users: 2

Monthly active users: 1

Last modified: 18 hours ago


Website Content Crawler is an Apify Actor that performs deep crawls of websites and extracts clean text content from web pages. It is designed for feeding large language models (LLMs), RAG pipelines, vector databases, and AI applications with high-quality web data.

Key Features

  • Multiple crawler engines - Adaptive mode tries headless Firefox first and automatically falls back to HTTP if the site blocks browsers. Or choose a specific engine manually.
  • Clean content extraction - Automatically removes navigation, headers, footers, cookie banners, ads, modals, and other irrelevant page elements.
  • Flexible output formats - Save content as Markdown, plain text, or HTML.
  • Smart URL scoping - Stays within the start URL path. Supports include/exclude glob patterns for fine-grained control.
  • Sitemap discovery - Automatically finds and parses sitemaps to discover more pages.
  • Canonical URL deduplication - Skips duplicate pages identified by the same canonical URL.
  • Dynamic content support - Wait for JavaScript rendering, scroll to trigger lazy loading, expand accordions and tabs.
  • Cookie banner dismissal - Automatically detects and dismisses cookie consent popups.
  • File downloads - Optionally download linked PDF, DOC, DOCX, XLS, XLSX, and CSV files.
  • Rich metadata extraction - Extracts title, description, author, keywords, language, and canonical URL from every page.
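
Canonical URL deduplication can be illustrated with a short sketch. This is not the Actor's internal code, just the idea: keep the first page seen per canonical URL, falling back to the loaded URL when no canonical URL is declared.

```python
def dedupe_by_canonical(pages):
    """Keep only the first page seen for each canonical URL.

    `pages` is an iterable of dicts shaped like the Actor's output,
    i.e. {"url": ..., "metadata": {"canonicalUrl": ...}}.
    """
    seen = set()
    unique = []
    for page in pages:
        # Fall back to the loaded URL when no canonical URL is declared.
        canonical = page.get("metadata", {}).get("canonicalUrl") or page["url"]
        if canonical in seen:
            continue  # duplicate content reachable under another URL
        seen.add(canonical)
        unique.append(page)
    return unique
```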

Use Cases

Feed LLMs and AI Applications

Crawl documentation sites, knowledge bases, help centers, or blogs and feed the extracted content directly into your LLM, ChatGPT, or custom AI assistant.

Retrieval Augmented Generation (RAG)

Build a knowledge base from any website. Use the crawled content with vector databases like Pinecone, Qdrant, or Weaviate to power RAG-based question answering.
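
Before embedding crawled pages into a vector database, you typically split the text into overlapping chunks. A minimal sketch (the chunk size and overlap values here are arbitrary illustrative choices, not Actor settings):

```python
def chunk_text(text, size=500, overlap=50):
    """Split text into chunks of at most `size` characters,
    overlapping by `overlap` characters so context is not lost
    at chunk boundaries."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```

Each chunk would then be embedded and upserted into Pinecone, Qdrant, or Weaviate alongside its source URL.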

Custom GPTs and AI Assistants

Export crawled data as JSON and upload it as knowledge files to your custom OpenAI GPTs or AI assistants.

Content Processing at Scale

Scrape content for summarization, translation, proofreading, or style transformation using LLMs.

LangChain and LlamaIndex Integration

Use the Apify integration with LangChain or LlamaIndex to feed crawled content directly into your AI pipeline.

How It Works

The crawler operates in three stages:

  1. Crawling - Discovers and downloads web pages starting from your URLs, following links within scope.
  2. HTML Processing - Cleans the DOM by removing navigation, ads, cookie warnings, and other noise.
  3. Output - Converts the cleaned HTML to your chosen format (Markdown, text, or HTML) with metadata.
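
Stage 2 can be illustrated with a toy cleaner. The real Actor uses far more sophisticated heuristics, but the core idea is dropping boilerplate elements before format conversion:

```python
import re

# Tags whose entire subtree is treated as boilerplate in this toy example.
NOISE_TAGS = ("nav", "header", "footer", "aside", "script", "style")

def clean_html(html):
    """Remove common boilerplate elements from an HTML string.

    This regex approach only handles non-nested tags and is for
    illustration; a production cleaner would walk the DOM instead.
    """
    for tag in NOISE_TAGS:
        html = re.sub(rf"<{tag}\b.*?</{tag}>", "", html, flags=re.S | re.I)
    return html
```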

Input Configuration

The only required input is Start URLs. All other settings have sensible defaults.

| Setting | Description | Default |
|---|---|---|
| Start URLs | URLs to begin crawling from | (required) |
| Crawler type | Engine: Adaptive, Firefox browser, or Cheerio HTTP | Adaptive |
| Max pages | Maximum number of pages to crawl | 100 |
| Max crawling depth | How deep to follow links from start URLs | 20 |
| Output format | Markdown, plain text, or HTML | Markdown |
| Exclude URLs (globs) | Glob patterns for URLs to skip | (none) |
| Include URLs (globs) | Only crawl URLs matching these globs | (none) |
| Remove elements (CSS) | Additional CSS selectors to remove | (none; defaults always applied) |
| Extract elements (CSS) | Only keep content from these elements | (none) |
| Remove cookie warnings | Auto-dismiss cookie consent banners | Yes |
| Wait for dynamic content | Time to wait for JS rendering (ms) | 1000 |
| Scroll height | Scroll to trigger lazy loading (px) | 0 |
| Expand clickables | Click accordions/tabs to expand | No |
| Save files | Download linked PDF/DOC/XLS files | No |
| Use sitemaps | Discover URLs from sitemaps | Yes |
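
The include/exclude glob settings behave like standard glob matching on full URLs. A sketch of the scoping decision, modeled with Python's `fnmatch` (the exclude-wins precedence is an assumption mirroring common crawler semantics, not the Actor's documented logic):

```python
from fnmatch import fnmatch

def should_crawl(url, includes=None, excludes=None):
    """Decide whether a discovered URL is in scope.

    Excludes win over includes; an empty include list means
    "everything is included". This is an illustrative assumption,
    not the Actor's exact implementation.
    """
    if excludes and any(fnmatch(url, pat) for pat in excludes):
        return False
    if includes:
        return any(fnmatch(url, pat) for pat in includes)
    return True
```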

Output Format

Each crawled page produces a JSON object:

{
  "url": "https://example.com/docs/getting-started",
  "crawl": {
    "loadedUrl": "https://example.com/docs/getting-started",
    "loadedTime": "2024-01-15T10:30:00.000Z",
    "depth": 1
  },
  "metadata": {
    "canonicalUrl": "https://example.com/docs/getting-started",
    "title": "Getting Started | Example Docs",
    "description": "Learn how to get started with Example.",
    "author": "Example Team",
    "keywords": "docs, getting started",
    "languageCode": "en"
  },
  "text": null,
  "markdown": "# Getting Started\n\nWelcome to Example...",
  "html": null
}

The content field (text, markdown, or html) is populated based on your chosen output format. The other two fields will be null.
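
Since only one of the three content fields is populated, downstream code usually just picks whichever is non-null. A small helper (field names match the output above):

```python
def get_content(item):
    """Return the populated content field of a crawled item,
    whichever of markdown / text / html the run produced."""
    for field in ("markdown", "text", "html"):
        value = item.get(field)
        if value is not None:
            return value
    return ""
```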

Pricing

Only $0.001 per page ($1.00 per 1,000 pages) via pay-per-event billing.

| | This Actor | Official Apify Crawler | Firecrawl-based Actors |
|---|---|---|---|
| Price per page | $0.001 | $0.005 - $0.05 | $0.004 |
| 1,000 pages | $1.00 | $5.00 - $50.00 | $4.00 |
| 10,000 pages | $10.00 | $50.00 - $500.00 | $40.00 |
  • 4x cheaper than Firecrawl-based alternatives
  • 5-50x cheaper than the official browser crawler
  • You only pay for pages successfully crawled and saved to the dataset

Apify's free plan includes $5/month in credits, enough to crawl ~5,000 pages for free.
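
At $0.001 per saved page, estimating a run's cost is simple arithmetic:

```python
PRICE_PER_PAGE = 0.001  # USD, billed per page saved to the dataset

def estimate_cost(pages):
    """Estimated run cost in USD; only successfully crawled
    and saved pages are billed."""
    return pages * PRICE_PER_PAGE
```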

Integration Examples

Python (LangChain)

from langchain_community.utilities import ApifyWrapper
from langchain_core.documents import Document

apify = ApifyWrapper()
loader = apify.call_actor(
    actor_id="worshipful_knife/website-content-crawler",
    run_input={
        "startUrls": [{"url": "https://docs.example.com/"}],
        "maxCrawlPages": 50,
    },
    dataset_mapping_function=lambda item: Document(
        page_content=item["markdown"] or item["text"] or "",
        metadata={"source": item["url"]},
    ),
)
docs = loader.load()

Node.js (Apify Client)

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_TOKEN' });

const run = await client.actor('worshipful_knife/website-content-crawler').call({
    startUrls: [{ url: 'https://docs.example.com/' }],
    maxCrawlPages: 50,
    outputFormat: 'markdown',
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);

Troubleshooting

  • Missing content? Try switching to the headless browser crawler type, which renders JavaScript.
  • Too much noise in output? Use the "Remove HTML elements" or "Extract HTML elements" CSS selectors to fine-tune.
  • Crawler too slow? Increase "Max concurrency" or switch to Cheerio crawler for static sites.
  • Getting blocked? Use the headless browser crawler type with residential proxies.
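
For the blocked-site case, the run input might look like the sketch below. The `crawlerType` value and `proxyConfiguration` shape follow common Apify conventions but are assumptions here; verify them against the Actor's input schema before use.

```python
# Hypothetical run input for a site that blocks plain HTTP crawling;
# check field names and values against the Actor's input schema.
blocked_site_input = {
    "startUrls": [{"url": "https://docs.example.com/"}],
    "crawlerType": "playwright:firefox",  # assumed value for the browser engine
    "proxyConfiguration": {
        "useApifyProxy": True,
        "apifyProxyGroups": ["RESIDENTIAL"],
    },
}
```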

Support

If you have any questions or feedback, please open an issue on the Actor's GitHub page or contact us through Apify support.