Website Content Crawler

A powerful web crawler that extracts text content from websites, optimized for AI models, Large Language Models (LLMs), vector databases, and Retrieval-Augmented Generation (RAG) pipelines.

Features

  • Multiple Crawling Engines: Choose from Playwright (Chrome or Firefox), Cheerio (a fast HTTP client), or JSDOM, depending on your needs
  • Markdown Output: Automatically converts HTML content to clean Markdown format
  • Smart Content Extraction: Removes unwanted elements like cookie banners, navigation, ads, and more
  • Customizable Selectors: Keep or remove specific elements using CSS selectors
  • Deep Crawling: Recursively crawl websites with configurable depth limits
  • AI-Ready Output: Structured data perfect for feeding into AI models and vector databases
  • Proxy Support: Built-in proxy configuration for reliable crawling
  • Screenshot Capture: Optional screenshot capture for visual documentation (Playwright only)
  • File Downloads: Download and save linked files like PDFs and documents

Use Cases

  • Knowledge Base Extraction: Crawl documentation sites and help centers
  • Content Aggregation: Collect articles, blog posts, and web content at scale
  • AI Training Data: Extract clean text for training or fine-tuning language models
  • RAG Pipelines: Feed content into retrieval-augmented generation systems
  • Vector Database Population: Prepare text content for embedding and semantic search
  • Website Migration: Extract content from existing websites for migration
  • Competitive Analysis: Monitor and analyze competitor content

Input Parameters

Required

  • Start URLs (startUrls): Array of URLs where the crawler will begin. The crawler will only process pages under these URLs.
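
Each start URL is provided as an object with a url property, matching the format used in the examples further below (the URL here is only a placeholder):

{
  "startUrls": [
    { "url": "https://example.com/docs" }
  ]
}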

Crawler Configuration

  • Crawler Type (crawlerType): Select the crawling engine

    • cheerio (default): Fast HTTP client, best for static websites
    • playwright:chrome: Chrome browser with full JavaScript support
    • playwright:firefox: Firefox browser, useful for sites with anti-bot measures
    • jsdom: Experimental JavaScript-capable crawler
  • Max Crawling Depth (maxCrawlDepth): Maximum link depth from start URLs (default: 1)

    • 0 = Only crawl start URLs
    • 1 = Crawl start URLs and pages directly linked from them
    • 2+ = Continue crawling to specified depth
  • Max Pages (maxCrawlPages): Maximum number of pages to crawl (default: 100)

  • Max Requests Per Minute (maxRequestsPerMinute): Maximum number of requests per minute, used for rate limiting (default: 0 = unlimited)
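
As an illustration, the following input combines these settings for a shallow, rate-limited crawl with the default Cheerio engine (the values are placeholders to tune for your site):

{
  "startUrls": [
    { "url": "https://example.com" }
  ],
  "crawlerType": "cheerio",
  "maxCrawlDepth": 1,
  "maxCrawlPages": 25,
  "maxRequestsPerMinute": 60
}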

Content Extraction

  • Readable Text Char Threshold (readableTextCharThreshold): Minimum characters required to save a page (default: 100)

  • Remove Cookie Warnings (removeCookieWarnings): Automatically remove cookie consent dialogs (default: true)

  • Click Elements CSS Selector (clickElementsCssSelector): CSS selector for elements to click before extraction (e.g., "Show more" buttons)

  • HTML Transformer (htmlTransformer): How to process HTML

    • readableText (default): Remove scripts, styles, navigation
    • none: Keep original HTML
  • Remove Elements CSS Selector (removeElementsCssSelector): CSS selector for elements to remove (e.g., nav, footer, .ads)

  • Keep Elements CSS Selector (keepElementsCssSelector): CSS selector for elements to keep (removes everything else)
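
For instance, a configuration that expands "Show more" buttons, strips navigation and ads, and skips near-empty pages could look like this (the CSS selectors are examples only and depend on the target site's markup; a Playwright engine is chosen here on the assumption that clicking elements needs a real browser):

{
  "startUrls": [
    { "url": "https://example.com/help" }
  ],
  "crawlerType": "playwright:chrome",
  "clickElementsCssSelector": "button.show-more",
  "removeElementsCssSelector": "nav, footer, .ads",
  "htmlTransformer": "readableText",
  "readableTextCharThreshold": 200,
  "removeCookieWarnings": true
}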

Output Options

  • Save Markdown (saveMarkdown): Convert content to Markdown format (default: true)

  • Save HTML (saveHtml): Save raw HTML to key-value store (default: false)

  • Save Screenshots (saveScreenshots): Capture page screenshots (Playwright only, default: false)

  • Save Files (saveFiles): Download linked files like PDFs (default: false)
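
For example, to keep Markdown, capture screenshots, and download linked PDFs in a single run, the output options can be combined like this (screenshots require a Playwright crawler type, as noted above):

{
  "startUrls": [
    { "url": "https://example.com/docs" }
  ],
  "crawlerType": "playwright:chrome",
  "saveMarkdown": true,
  "saveHtml": false,
  "saveScreenshots": true,
  "saveFiles": true
}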

Advanced Options

  • Max Scroll Height (maxScrollHeightPixels): Maximum height in pixels to scroll on pages with infinite scroll (default: 0 = disabled)

  • Proxy Configuration (proxyConfiguration): Proxy settings for the crawler

  • Max Request Retries (maxRequestRetries): Number of retry attempts for failed requests (default: 3)

  • Debug Mode (debugMode): Enable detailed logging (default: false)
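
A possible setup for a JavaScript-heavy page with infinite scroll, routed through Apify Proxy, with extra retries and verbose logging (the scroll height and retry count are illustrative):

{
  "startUrls": [
    { "url": "https://example.com/feed" }
  ],
  "crawlerType": "playwright:chrome",
  "maxScrollHeightPixels": 5000,
  "proxyConfiguration": {
    "useApifyProxy": true
  },
  "maxRequestRetries": 5,
  "debugMode": true
}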

Output Format

Each crawled page produces a dataset item with the following structure:

{
  "url": "https://example.com/page",
  "title": "Page Title",
  "description": "Page meta description",
  "canonicalUrl": "https://example.com/page",
  "text": "Extracted plain text content...",
  "markdown": "# Page Title\n\nExtracted content in Markdown...",
  "crawl": {
    "loadedUrl": "https://example.com/page",
    "depth": 1,
    "httpStatusCode": 200,
    "loadedAt": "2024-01-01T12:00:00.000Z"
  }
}

Example Usage

Basic Crawl

{
  "startUrls": [
    { "url": "https://example.com/docs" }
  ],
  "crawlerType": "cheerio",
  "maxCrawlDepth": 2,
  "maxCrawlPages": 50
}

Advanced Configuration

{
  "startUrls": [
    { "url": "https://example.com/blog" }
  ],
  "crawlerType": "playwright:chrome",
  "maxCrawlDepth": 3,
  "maxCrawlPages": 200,
  "removeElementsCssSelector": "nav, footer, .sidebar, .comments",
  "removeCookieWarnings": true,
  "saveMarkdown": true,
  "saveScreenshots": false,
  "maxRequestRetries": 5,
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}

Extract Specific Content

{
  "startUrls": [
    { "url": "https://example.com" }
  ],
  "keepElementsCssSelector": "article, .content, main",
  "htmlTransformer": "readableText",
  "readableTextCharThreshold": 500,
  "saveMarkdown": true
}

How It Works

The crawler starts from your specified URLs and:

  1. Fetches and processes each page using your selected crawling engine
  2. Extracts and cleans the content by removing unwanted elements
  3. Converts the content to your preferred format (Markdown, plain text, or HTML)
  4. Follows links to discover and crawl additional pages (up to your depth limit)
  5. Saves all extracted data to the dataset for easy access