AI Web Crawler

Pricing: Pay per usage

Crawl websites and extract clean, LLM-ready markdown content with stealth browser rendering, anti-bot hardening, smart content filtering, and structured metadata extraction. Built for RAG pipelines, AI agents, and data workflows.

Rating: 0.0 (0 reviews)
Developer: Hounderd (Maintained by Community)
Actor stats: 0 bookmarks · 2 total users · 1 monthly active user · last modified a day ago

AI Web Crawler — LLM-Ready Content Extractor

Turn any website into clean, structured content for AI pipelines, RAG systems, and data workflows. Uses a real browser with stealth rendering to bypass Cloudflare, anti-bot systems, and JavaScript-heavy pages that basic scrapers can't touch.


🚀 What does this do?

This Actor crawls websites and extracts page content in formats built for LLMs:

  • Clean Markdown — boilerplate stripped, ready to feed into any AI model
  • AI-Optimized Markdown — noise removed via intelligent content filtering, maximizes signal-to-noise for RAG and embeddings
  • Full-site crawling — follow links automatically with BFS or DFS traversal
  • Stealth browser extraction — Camoufox-based rendering improves success on Cloudflare-challenged, anti-bot, and JavaScript-heavy pages
  • Structured metadata — title, description, Open Graph, author, language per page
  • Token estimation — word count and estimated token count for every page

Runs via Apify API, webhooks, and schedules — no code required to get started.
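To illustrate how the `bfs` and `dfs` crawl modes differ, here is a minimal, hypothetical sketch of link traversal (the `crawl_order` function and the toy `site` graph are illustrative, not the Actor's actual implementation):

```python
from collections import deque

def crawl_order(links, start, mode="bfs", max_pages=10):
    """Toy illustration of BFS vs DFS link traversal.
    `links` maps each URL to the URLs it links to."""
    frontier = deque([start])
    visited = []
    while frontier and len(visited) < max_pages:
        # BFS takes the oldest discovered URL; DFS takes the newest.
        url = frontier.popleft() if mode == "bfs" else frontier.pop()
        if url in visited:
            continue
        visited.append(url)
        for nxt in links.get(url, []):
            if nxt not in visited:
                frontier.append(nxt)
    return visited

site = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/install"],
    "/blog": ["/blog/post-1"],
}
print(crawl_order(site, "/", mode="bfs"))  # visits level by level
print(crawl_order(site, "/", mode="dfs"))  # follows each branch deep first
```

BFS is usually the better default for doc sites (shallow pages first); DFS reaches deep pages sooner when combined with a tight `maxCrawlPages` budget.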


📦 Output Data

| Field | Description |
|---|---|
| `url` | The crawled page URL |
| `title` | Page `<title>` tag |
| `statusCode` | HTTP response status code |
| `markdown` | Full page content as clean Markdown |
| `fitMarkdown` | AI-optimized Markdown with boilerplate filtered out |
| `rawHtml` | Original HTML (optional) |
| `cleanedHtml` | HTML with boilerplate removed (optional) |
| `screenshot` | Base64 PNG screenshot of the page (optional) |
| `wordCount` | Number of words in the extracted content |
| `estimatedTokens` | Rough token count (~4 chars/token) |
| `contentLength` | Character count of extracted content |
| `metadata.description` | Meta description |
| `metadata.keywords` | Meta keywords |
| `metadata.author` | Page author |
| `metadata.language` | Page language |
| `metadata.ogTitle` | Open Graph title |
| `metadata.ogDescription` | Open Graph description |
| `metadata.ogImage` | Open Graph image URL |
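The size fields are cheap to reproduce on your side, which is handy for sanity-checking results. A hypothetical re-implementation of the ~4 characters-per-token heuristic noted above (not the Actor's exact code):

```python
def content_metrics(markdown: str) -> dict:
    """Approximate the wordCount / estimatedTokens / contentLength fields.
    estimatedTokens uses the rough ~4 chars/token heuristic."""
    return {
        "wordCount": len(markdown.split()),
        "estimatedTokens": len(markdown) // 4,
        "contentLength": len(markdown),
    }

content_metrics("# Getting Started\n\nWelcome to Example...")
```

Treat `estimatedTokens` as a budgeting estimate only; real tokenizers (e.g. for a specific LLM) will differ by model.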

Example output

```json
{
  "url": "https://docs.example.com/getting-started",
  "title": "Getting Started — Example Docs",
  "statusCode": 200,
  "markdown": "# Getting Started\n\nWelcome to Example...",
  "fitMarkdown": "# Getting Started\n\nWelcome to Example...",
  "wordCount": 843,
  "estimatedTokens": 1124,
  "contentLength": 4498,
  "metadata": {
    "description": "Learn how to get started with Example in minutes.",
    "keywords": "getting started, tutorial, example",
    "author": "Example Team",
    "language": "en",
    "ogTitle": "Getting Started — Example Docs",
    "ogDescription": "Learn how to get started with Example in minutes.",
    "ogImage": "https://docs.example.com/og-getting-started.png"
  }
}
```
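When consuming dataset items shaped like the example above, a common pattern is to prefer `fitMarkdown` and fall back to `markdown` when the filtered field is absent or empty (a consumer-side sketch, not part of the Actor itself):

```python
def best_text(item: dict) -> str:
    """Prefer the AI-optimized fitMarkdown field when present and
    non-empty; otherwise fall back to the full markdown field."""
    return item.get("fitMarkdown") or item.get("markdown", "")

best_text({"markdown": "# Full page", "fitMarkdown": ""})  # falls back to full markdown
```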

💡 Use Cases

  • RAG Pipelines — Ingest documentation, blogs, or knowledge bases into vector stores
  • AI Research — Gather clean text from multiple pages for analysis or summarization
  • Documentation Scraping — Extract entire doc sites into Markdown for offline use or fine-tuning
  • Competitive Intelligence — Monitor competitor pages and detect content changes
  • Content Migration — Convert any website to Markdown for import into Notion, Obsidian, or CMS tools
  • LLM Context Prep — Feed live web content into AI agents and chatbots

⚙️ Options

| Option | Description |
|---|---|
| `startUrls` | One or more URLs to crawl |
| `crawlMode` | `single` (start URLs only), `bfs` (breadth-first), or `dfs` (depth-first) |
| `maxCrawlDepth` | How many link-hops deep to follow from start URLs (BFS/DFS only) |
| `maxCrawlPages` | Maximum total pages to crawl per run |
| `sameDomainOnly` | Only follow links within the same domain (default: on) |
| `includeUrlPatterns` | Regex patterns — only follow URLs that match |
| `excludeUrlPatterns` | Regex patterns — skip URLs that match (e.g. `/login`, `\.pdf$`) |
| `outputFormats` | Choose any combination: `markdown`, `fitMarkdown`, `rawHtml`, `cleanedHtml`, `screenshot` |
| `cssSelector` | Restrict extraction to a specific part of the page (e.g. `article`, `main`, `#content`) |
| `excludeSelectors` | CSS selectors for elements to strip before extraction (e.g. `nav`, `.sidebar`) |
| `waitForSelector` | Wait for a CSS selector to appear before extracting — useful for JS-rendered pages |
| `waitForTimeout` | Extra wait time in ms after page load (for lazy-loaded content) |
| `executeJavaScript` | Custom JS to run on each page before extraction (dismiss popups, click "show more", etc.) |
| `scrollToBottom` | Scroll the full page to trigger lazy-loaded and infinite-scroll content |
| `includeLinks` | Preserve hyperlinks in Markdown output (default: on) |
| `includeImages` | Include image references in Markdown output (default: on) |
| `includeMetadata` | Extract and include the page metadata block (default: on) |
| `maxConcurrency` | Pages to crawl in parallel in standard mode (default: 5, max: 20). Stealth mode crawls sequentially for reliability. |
| `requestTimeout` | Max total seconds to spend on a page before giving up. In stealth mode this budget includes page load, challenge waits, selector waits, and retries (default: 30) |
| `stealthMode` | Enable stealth browser rendering to bypass bot detection (default: on, recommended) |
| `proxyConfiguration` | Optional proxy settings — residential proxies are recommended for protected sites, but not required for ordinary public pages |
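The include/exclude pattern options behave like ordinary regex filters. A hypothetical sketch of the likely semantics (exclude wins; with no include patterns, everything else is followed), useful for testing your patterns locally before a run:

```python
import re

def should_follow(url: str, include=None, exclude=None) -> bool:
    """Sketch of how includeUrlPatterns / excludeUrlPatterns could be
    applied: an exclude match rejects the URL; otherwise, if include
    patterns exist, at least one must match."""
    if exclude and any(re.search(p, url) for p in exclude):
        return False
    if include:
        return any(re.search(p, url) for p in include)
    return True

exclude = [r"/login", r"\.pdf$"]
include = [r"^https://docs\.example\.com/"]
should_follow("https://docs.example.com/guide", include, exclude)        # followed
should_follow("https://docs.example.com/files/a.pdf", include, exclude)  # skipped
```

The exact precedence inside the Actor may differ; when in doubt, make include and exclude patterns non-overlapping.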

🛡️ Anti-Bot & Cloudflare Bypass

Most scrapers fail on modern websites because they're caught by bot detection at the browser and IP reputation layers. This Actor uses a hardened stealth browser path plus proxy support to reduce fingerprint-based detection and improve extraction success on tougher targets.

For the best results on Cloudflare-protected or heavily guarded sites:

  1. Enable Stealth Mode (default: on) — uses the Camoufox-based path for lower-friction browser fingerprinting
  2. Use Residential Proxies for guarded targets — datacenter IPs are blocked much more aggressively by systems like Cloudflare and Akamai, but many ordinary public sites do not need proxy spend at all

These settings materially improve compatibility with sites protected by systems like Cloudflare, Akamai, DataDome, and PerimeterX, but some sites may still challenge or block requests depending on IP reputation and challenge type.


🔗 API Usage

Trigger a crawl via the Apify API:

```bash
curl -X POST \
  "https://api.apify.com/v2/acts/hounderd~ai-web-crawler/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "startUrls": [{"url": "https://docs.example.com"}],
    "crawlMode": "bfs",
    "maxCrawlDepth": 2,
    "maxCrawlPages": 50,
    "outputFormats": ["markdown", "fitMarkdown"]
  }'
```

Results are available in the run's default dataset once the status is SUCCEEDED.
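The same request can be assembled from any language. A minimal Python sketch that builds the endpoint URL and JSON body used by the curl example (sending it is left to your HTTP client of choice; `build_run_request` is an illustrative helper, not an official SDK function):

```python
import json

def build_run_request(actor_id: str, token: str, run_input: dict) -> tuple[str, str]:
    """Build the URL and JSON body for the Apify "run Actor" endpoint,
    mirroring the curl example above."""
    url = f"https://api.apify.com/v2/acts/{actor_id}/runs?token={token}"
    return url, json.dumps(run_input)

url, body = build_run_request(
    "hounderd~ai-web-crawler",
    "YOUR_TOKEN",
    {
        "startUrls": [{"url": "https://docs.example.com"}],
        "crawlMode": "bfs",
        "maxCrawlDepth": 2,
        "maxCrawlPages": 50,
        "outputFormats": ["markdown", "fitMarkdown"],
    },
)
# POST `body` to `url` with a Content-Type: application/json header,
# e.g. requests.post(url, data=body, headers={"Content-Type": "application/json"})
```

For production use, the official Apify client libraries also handle polling the run status and fetching the default dataset for you.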