Website Content Crawler

A powerful web crawler that extracts text content from websites, optimized for AI models, Large Language Models (LLMs), vector databases, and Retrieval-Augmented Generation (RAG) pipelines.

Features

  • Multiple Crawling Engines: Choose from Playwright (Chrome or Firefox), Cheerio (a fast HTTP client), or JSDOM, depending on your needs
  • Markdown Output: Automatically converts HTML content to clean Markdown format
  • Smart Content Extraction: Removes unwanted elements like cookie banners, navigation, ads, and more
  • Customizable Selectors: Keep or remove specific elements using CSS selectors
  • Deep Crawling: Recursively crawl websites with configurable depth limits
  • AI-Ready Output: Structured data perfect for feeding into AI models and vector databases
  • Proxy Support: Built-in proxy configuration for reliable crawling
  • Screenshot Capture: Optional screenshot capture for visual documentation (Playwright only)
  • File Downloads: Download and save linked files like PDFs and documents

Use Cases

  • Knowledge Base Extraction: Crawl documentation sites and help centers
  • Content Aggregation: Collect articles, blog posts, and web content at scale
  • AI Training Data: Extract clean text for training or fine-tuning language models
  • RAG Pipelines: Feed content into retrieval-augmented generation systems
  • Vector Database Population: Prepare text content for embedding and semantic search
  • Website Migration: Extract content from existing websites for migration
  • Competitive Analysis: Monitor and analyze competitor content

Input Parameters

Required

  • Start URLs (startUrls): Array of URLs where the crawler will begin. The crawler will only process pages under these URLs.
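
Each start URL is provided as an object with a url property, matching the format used in the examples further below (the URL here is only a placeholder):

{
  "startUrls": [
    { "url": "https://example.com/docs" }
  ]
}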

Crawler Configuration

  • Crawler Type (crawlerType): Select the crawling engine

    • cheerio (default): Fast HTTP client, best for static websites
    • playwright:chrome: Chrome browser with full JavaScript support
    • playwright:firefox: Firefox browser, useful for sites with anti-bot measures
    • jsdom: Experimental JavaScript-capable crawler
  • Max Crawling Depth (maxCrawlDepth): Maximum link depth from start URLs (default: 1)

    • 0 = Only crawl start URLs
    • 1 = Crawl start URLs and pages directly linked from them
    • 2+ = Continue crawling to specified depth
  • Max Pages (maxCrawlPages): Maximum number of pages to crawl (default: 100)

  • Max Requests Per Minute (maxRequestsPerMinute): Maximum number of requests per minute, used for rate limiting (default: 0 = unlimited)
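
As an illustration, the following input combines these settings for a shallow, rate-limited crawl with the default Cheerio engine (the values are placeholders to tune for your site):

{
  "startUrls": [
    { "url": "https://example.com" }
  ],
  "crawlerType": "cheerio",
  "maxCrawlDepth": 1,
  "maxCrawlPages": 25,
  "maxRequestsPerMinute": 60
}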

Content Extraction

  • Readable Text Char Threshold (readableTextCharThreshold): Minimum characters required to save a page (default: 100)

  • Remove Cookie Warnings (removeCookieWarnings): Automatically remove cookie consent dialogs (default: true)

  • Click Elements CSS Selector (clickElementsCssSelector): CSS selector for elements to click before extraction (e.g., "Show more" buttons)

  • HTML Transformer (htmlTransformer): How to process HTML

    • readableText (default): Remove scripts, styles, navigation
    • none: Keep original HTML
  • Remove Elements CSS Selector (removeElementsCssSelector): CSS selector for elements to remove (e.g., nav, footer, .ads)

  • Keep Elements CSS Selector (keepElementsCssSelector): CSS selector for elements to keep (removes everything else)
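
For instance, a configuration that expands "Show more" buttons, strips navigation and ads, and skips near-empty pages could look like this (the CSS selectors are examples only and depend on the target site's markup; a Playwright engine is chosen here on the assumption that clicking elements needs a real browser):

{
  "startUrls": [
    { "url": "https://example.com/help" }
  ],
  "crawlerType": "playwright:chrome",
  "clickElementsCssSelector": "button.show-more",
  "removeElementsCssSelector": "nav, footer, .ads",
  "htmlTransformer": "readableText",
  "readableTextCharThreshold": 200,
  "removeCookieWarnings": true
}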

Output Options

  • Save Markdown (saveMarkdown): Convert content to Markdown format (default: true)

  • Save HTML (saveHtml): Save raw HTML to key-value store (default: false)

  • Save Screenshots (saveScreenshots): Capture page screenshots (Playwright only, default: false)

  • Save Files (saveFiles): Download linked files like PDFs (default: false)
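
For example, to keep Markdown, capture screenshots, and download linked PDFs in a single run, the output options can be combined like this (screenshots require a Playwright crawler type, as noted above):

{
  "startUrls": [
    { "url": "https://example.com/docs" }
  ],
  "crawlerType": "playwright:chrome",
  "saveMarkdown": true,
  "saveHtml": false,
  "saveScreenshots": true,
  "saveFiles": true
}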

Advanced Options

  • Max Scroll Height (maxScrollHeightPixels): Maximum height in pixels to scroll on pages with infinite scroll (default: 0 = disabled)

  • Proxy Configuration (proxyConfiguration): Proxy settings for the crawler

  • Max Request Retries (maxRequestRetries): Number of retry attempts for failed requests (default: 3)

  • Debug Mode (debugMode): Enable detailed logging (default: false)
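
A possible setup for a JavaScript-heavy page with infinite scroll, routed through Apify Proxy, with extra retries and verbose logging (the scroll height and retry count are illustrative):

{
  "startUrls": [
    { "url": "https://example.com/feed" }
  ],
  "crawlerType": "playwright:chrome",
  "maxScrollHeightPixels": 5000,
  "proxyConfiguration": {
    "useApifyProxy": true
  },
  "maxRequestRetries": 5,
  "debugMode": true
}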

Output Format

Each crawled page produces a dataset item with the following structure:

{
  "url": "https://example.com/page",
  "title": "Page Title",
  "description": "Page meta description",
  "canonicalUrl": "https://example.com/page",
  "text": "Extracted plain text content...",
  "markdown": "# Page Title\n\nExtracted content in Markdown...",
  "crawl": {
    "loadedUrl": "https://example.com/page",
    "depth": 1,
    "httpStatusCode": 200,
    "loadedAt": "2024-01-01T12:00:00.000Z"
  }
}

Example Usage

Basic Crawl

{
  "startUrls": [
    { "url": "https://example.com/docs" }
  ],
  "crawlerType": "cheerio",
  "maxCrawlDepth": 2,
  "maxCrawlPages": 50
}

Advanced Configuration

{
  "startUrls": [
    { "url": "https://example.com/blog" }
  ],
  "crawlerType": "playwright:chrome",
  "maxCrawlDepth": 3,
  "maxCrawlPages": 200,
  "removeElementsCssSelector": "nav, footer, .sidebar, .comments",
  "removeCookieWarnings": true,
  "saveMarkdown": true,
  "saveScreenshots": false,
  "maxRequestRetries": 5,
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}

Extract Specific Content

{
  "startUrls": [
    { "url": "https://example.com" }
  ],
  "keepElementsCssSelector": "article, .content, main",
  "htmlTransformer": "readableText",
  "readableTextCharThreshold": 500,
  "saveMarkdown": true
}

How It Works

The crawler starts from your specified URLs and:

  1. Fetches and processes each page using your selected crawling engine
  2. Extracts and cleans the content by removing unwanted elements
  3. Converts the content to your preferred format (Markdown, plain text, or HTML)
  4. Follows links to discover and crawl additional pages (up to your depth limit)
  5. Saves all extracted data to the dataset for easy access