AI Web Crawler — LLM-Ready Content Extractor
Turn any website into clean, structured content for AI pipelines, RAG systems, and data workflows. Uses a real browser with stealth rendering to bypass Cloudflare, anti-bot systems, and JavaScript-heavy pages that basic scrapers can't touch.
🚀 What does this do?
This Actor crawls websites and extracts page content in formats built for LLMs:
- Clean Markdown — boilerplate stripped, ready to feed into any AI model
- AI-Optimized Markdown — noise removed via intelligent content filtering, maximizes signal-to-noise for RAG and embeddings
- Full-site crawling — follow links automatically with BFS or DFS traversal
- Stealth browser extraction — Camoufox-based rendering improves success on Cloudflare-challenged, anti-bot, and JavaScript-heavy pages
- Structured metadata — title, description, Open Graph, author, language per page
- Token estimation — word count and estimated token count for every page
Runs via Apify API, webhooks, and schedules — no code required to get started.
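If you do want code, a run can be triggered programmatically. A minimal Python sketch using only the standard library and the run endpoint shown in the API Usage section below (the actor ID and input fields come from this document; token handling and error handling are left to you):

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"

def run_url(actor_id: str, token: str) -> str:
    # Run endpoint for an Actor, as used in the API Usage section.
    return f"{API_BASE}/acts/{actor_id}/runs?token={token}"

def build_input(start_url: str) -> dict:
    # A minimal input: crawl only the start URL, request both Markdown formats.
    return {
        "startUrls": [{"url": start_url}],
        "crawlMode": "single",
        "outputFormats": ["markdown", "fitMarkdown"],
    }

def start_crawl(actor_id: str, token: str, start_url: str) -> dict:
    # POST the input JSON to start a run; returns the run object.
    req = urllib.request.Request(
        run_url(actor_id, token),
        data=json.dumps(build_input(start_url)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The official `apify-client` package wraps these endpoints with retries and pagination; the raw-HTTP version above is only meant to show the request shape.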
📦 Output Data
| Field | Description |
|---|---|
| url | The crawled page URL |
| title | Page <title> tag |
| statusCode | HTTP response status code |
| markdown | Full page content as clean Markdown |
| fitMarkdown | AI-optimized Markdown with boilerplate filtered out |
| rawHtml | Original HTML (optional) |
| cleanedHtml | HTML with boilerplate removed (optional) |
| screenshot | Base64 PNG screenshot of the page (optional) |
| wordCount | Number of words in the extracted content |
| estimatedTokens | Rough token count (~4 chars/token) |
| contentLength | Character count of extracted content |
| metadata.description | Meta description |
| metadata.keywords | Meta keywords |
| metadata.author | Page author |
| metadata.language | Page language |
| metadata.ogTitle | Open Graph title |
| metadata.ogDescription | Open Graph description |
| metadata.ogImage | Open Graph image URL |
Example output
```json
{
  "url": "https://docs.example.com/getting-started",
  "title": "Getting Started — Example Docs",
  "statusCode": 200,
  "markdown": "# Getting Started\n\nWelcome to Example...",
  "fitMarkdown": "# Getting Started\n\nWelcome to Example...",
  "wordCount": 843,
  "estimatedTokens": 1124,
  "contentLength": 4498,
  "metadata": {
    "description": "Learn how to get started with Example in minutes.",
    "keywords": "getting started, tutorial, example",
    "author": "Example Team",
    "language": "en",
    "ogTitle": "Getting Started — Example Docs",
    "ogDescription": "Learn how to get started with Example in minutes.",
    "ogImage": "https://docs.example.com/og-getting-started.png"
  }
}
```
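The `estimatedTokens` field follows the ~4 characters/token heuristic noted in the table, which matches the example above (4498 characters ≈ 1124 tokens). A minimal sketch of that heuristic, assuming floor division; the Actor's exact rounding may differ:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic from the output fields: ~4 characters per token.
    return len(text) // 4

def summarize(item: dict) -> dict:
    # Recompute the size fields from a dataset item, preferring the
    # AI-optimized Markdown when present.
    content = item.get("fitMarkdown") or item.get("markdown", "")
    return {
        "url": item["url"],
        "wordCount": len(content.split()),
        "estimatedTokens": estimate_tokens(content),
        "contentLength": len(content),
    }
```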
💡 Use Cases
- RAG Pipelines — Ingest documentation, blogs, or knowledge bases into vector stores
- AI Research — Gather clean text from multiple pages for analysis or summarization
- Documentation Scraping — Extract entire doc sites into Markdown for offline use or fine-tuning
- Competitive Intelligence — Monitor competitor pages and detect content changes
- Content Migration — Convert any website to Markdown for import into Notion, Obsidian, or CMS tools
- LLM Context Prep — Feed live web content into AI agents and chatbots
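For the RAG use case, the Markdown output still needs to be chunked before embedding. A simple sketch (not part of the Actor) that splits on headings and packs sections into chunks sized by the same ~4 chars/token estimate:

```python
def chunk_markdown(markdown: str, max_tokens: int = 512) -> list[str]:
    """Split Markdown on headings, then pack sections into ~max_tokens chunks."""
    max_chars = max_tokens * 4  # same rough 4-chars/token estimate

    # Pass 1: split into sections at heading lines.
    sections, current = [], ""
    for line in markdown.splitlines(keepends=True):
        if line.startswith("#") and current:
            sections.append(current)
            current = line
        else:
            current += line
    if current:
        sections.append(current)

    # Pass 2: greedily pack whole sections into size-bounded chunks.
    chunks, buf = [], ""
    for section in sections:
        if buf and len(buf) + len(section) > max_chars:
            chunks.append(buf)
            buf = ""
        buf += section
    if buf:
        chunks.append(buf)
    return chunks
```

Heading-aware chunking keeps each embedded chunk on one topic, which tends to retrieve better than fixed-size windows; a production pipeline would also attach the page's `url` and `metadata` to each chunk for citations.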
⚙️ Options
| Option | Description |
|---|---|
| startUrls | One or more URLs to crawl |
| crawlMode | single (start URLs only), bfs (breadth-first), or dfs (depth-first) |
| maxCrawlDepth | How many link-hops deep to follow from start URLs (BFS/DFS only) |
| maxCrawlPages | Maximum total pages to crawl per run |
| sameDomainOnly | Only follow links within the same domain (default: on) |
| includeUrlPatterns | Regex patterns — only follow URLs that match |
| excludeUrlPatterns | Regex patterns — skip URLs that match (e.g. /login, \.pdf$) |
| outputFormats | Choose any combination: markdown, fitMarkdown, rawHtml, cleanedHtml, screenshot |
| cssSelector | Restrict extraction to a specific part of the page (e.g. article, main, #content) |
| excludeSelectors | CSS selectors for elements to strip before extraction (e.g. nav, .sidebar) |
| waitForSelector | Wait for a CSS selector to appear before extracting — useful for JS-rendered pages |
| waitForTimeout | Extra wait time in ms after page load (for lazy-loaded content) |
| executeJavaScript | Custom JS to run on each page before extraction (dismiss popups, click "show more", etc.) |
| scrollToBottom | Scroll the full page to trigger lazy-loaded and infinite-scroll content |
| includeLinks | Preserve hyperlinks in Markdown output (default: on) |
| includeImages | Include image references in Markdown output (default: on) |
| includeMetadata | Extract and include page metadata block (default: on) |
| maxConcurrency | Pages to crawl in parallel in standard mode (default: 5, max: 20). Stealth mode crawls sequentially for reliability. |
| requestTimeout | Max total seconds to spend on a page before giving up. In stealth mode this budget includes page load, challenge waits, selector waits, and retries (default: 30) |
| stealthMode | Enable stealth browser rendering to bypass bot detection (default: on, recommended) |
| proxyConfiguration | Optional proxy settings — Residential proxies are recommended for protected sites, but not required for ordinary public pages |
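A sample input combining several of these options, with illustrative values (URLs and selectors are placeholders):

```json
{
  "startUrls": [{ "url": "https://docs.example.com" }],
  "crawlMode": "bfs",
  "maxCrawlDepth": 2,
  "maxCrawlPages": 100,
  "sameDomainOnly": true,
  "excludeUrlPatterns": ["/login", "\\.pdf$"],
  "outputFormats": ["markdown", "fitMarkdown"],
  "cssSelector": "main",
  "excludeSelectors": ["nav", ".sidebar"],
  "stealthMode": true
}
```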
🛡️ Anti-Bot & Cloudflare Bypass
Most scrapers fail on modern websites because they're caught by bot detection at the browser and IP reputation layers. This Actor uses a hardened stealth browser path plus proxy support to reduce fingerprint-based detection and improve extraction success on tougher targets.
For the best results on Cloudflare-protected or heavily guarded sites:
- Enable Stealth Mode (default: on) — uses the Camoufox-based path for lower-friction browser fingerprinting
- Use Residential Proxies for guarded targets — datacenter IPs are blocked much more aggressively by systems like Cloudflare and Akamai, but many ordinary public sites do not need proxy spend at all
These settings materially improve compatibility with sites protected by systems like Cloudflare, Akamai, DataDome, and PerimeterX, but some sites may still challenge or block requests depending on IP reputation and challenge type.
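If you enable Residential Proxies, the `proxyConfiguration` input typically follows Apify's standard proxy shape; the exact fields accepted by this Actor may vary, so treat this as an illustrative sketch:

```json
{
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}
```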
🔗 API Usage
Trigger a crawl via the Apify API:
```shell
curl -X POST "https://api.apify.com/v2/acts/hounderd~ai-web-crawler/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "startUrls": [{"url": "https://docs.example.com"}],
    "crawlMode": "bfs",
    "maxCrawlDepth": 2,
    "maxCrawlPages": 50,
    "outputFormats": ["markdown", "fitMarkdown"]
  }'
```
Results are available in the run's default dataset once the status is SUCCEEDED.
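The run object returned by the call above includes a `defaultDatasetId`; items can then be fetched from the dataset-items endpoint. A minimal Python sketch using the standard library (query parameters per Apify's dataset API; pagination and auth handling are left out):

```python
import json
import urllib.request

def items_url(dataset_id: str, token: str) -> str:
    # Dataset items endpoint; dataset_id is the run's defaultDatasetId.
    return (
        f"https://api.apify.com/v2/datasets/{dataset_id}/items"
        f"?token={token}&format=json&clean=true"
    )

def fetch_items(dataset_id: str, token: str) -> list[dict]:
    # Returns one dict per crawled page, matching the Output Data table.
    with urllib.request.urlopen(items_url(dataset_id, token)) as resp:
        return json.load(resp)
```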