AI Web Crawler — LLM-Ready Content Extractor
Turn any website into clean, structured content for AI pipelines, RAG systems, and data workflows. Uses a real browser with stealth rendering to bypass Cloudflare, anti-bot systems, and JavaScript-heavy pages that basic scrapers can't touch.
🚀 What does this do?
This Actor crawls websites and extracts page content in formats built for LLMs:
- Clean Markdown — boilerplate stripped, ready to feed into any AI model
- AI-Optimized Markdown — noise removed via intelligent content filtering, maximizes signal-to-noise for RAG and embeddings
- Full-site crawling — follow links automatically with BFS or DFS traversal
- Stealth browser extraction — Camoufox-based rendering improves success on Cloudflare-challenged, anti-bot, and JavaScript-heavy pages
- Structured metadata — title, description, Open Graph, author, language per page
- Token estimation — word count and estimated token count for every page
Runs via Apify API, webhooks, and schedules — no code required to get started.
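If you do want code, a run can be triggered programmatically. A minimal Python sketch using only the standard library and the run endpoint shown in the API Usage section below (the actor ID and input fields come from this document; token handling and error handling are left to you):

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"

def run_url(actor_id: str, token: str) -> str:
    # Run endpoint for an Actor, as used in the API Usage section.
    return f"{API_BASE}/acts/{actor_id}/runs?token={token}"

def build_input(start_url: str) -> dict:
    # A minimal input: crawl only the start URL, request both Markdown formats.
    return {
        "startUrls": [{"url": start_url}],
        "crawlMode": "single",
        "outputFormats": ["markdown", "fitMarkdown"],
    }

def start_crawl(actor_id: str, token: str, start_url: str) -> dict:
    # POST the input JSON to start a run; returns the run object.
    req = urllib.request.Request(
        run_url(actor_id, token),
        data=json.dumps(build_input(start_url)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The official `apify-client` package wraps these endpoints with retries and pagination; the raw-HTTP version above is only meant to show the request shape.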
📦 Output Data
| Field | Description |
|---|---|
| url | The crawled page URL |
| title | Page <title> tag |
| statusCode | HTTP response status code |
| markdown | Full page content as clean Markdown |
| fitMarkdown | AI-optimized Markdown with boilerplate filtered out |
| rawHtml | Original HTML (optional) |
| cleanedHtml | HTML with boilerplate removed (optional) |
| screenshot | Base64 PNG screenshot of the page (optional) |
| wordCount | Number of words in the extracted content |
| estimatedTokens | Rough token count (~4 chars/token) |
| contentLength | Character count of extracted content |
| metadata.description | Meta description |
| metadata.keywords | Meta keywords |
| metadata.author | Page author |
| metadata.language | Page language |
| metadata.ogTitle | Open Graph title |
| metadata.ogDescription | Open Graph description |
| metadata.ogImage | Open Graph image URL |
Example output
```json
{
  "url": "https://docs.example.com/getting-started",
  "title": "Getting Started — Example Docs",
  "statusCode": 200,
  "markdown": "# Getting Started\n\nWelcome to Example...",
  "fitMarkdown": "# Getting Started\n\nWelcome to Example...",
  "wordCount": 843,
  "estimatedTokens": 1124,
  "contentLength": 4498,
  "metadata": {
    "description": "Learn how to get started with Example in minutes.",
    "keywords": "getting started, tutorial, example",
    "author": "Example Team",
    "language": "en",
    "ogTitle": "Getting Started — Example Docs",
    "ogDescription": "Learn how to get started with Example in minutes.",
    "ogImage": "https://docs.example.com/og-getting-started.png"
  }
}
```
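The `estimatedTokens` field follows the ~4 characters/token heuristic noted in the table, which matches the example above (4498 characters ≈ 1124 tokens). A minimal sketch of that heuristic, assuming floor division; the Actor's exact rounding may differ:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic from the output fields: ~4 characters per token.
    return len(text) // 4

def summarize(item: dict) -> dict:
    # Recompute the size fields from a dataset item, preferring the
    # AI-optimized Markdown when present.
    content = item.get("fitMarkdown") or item.get("markdown", "")
    return {
        "url": item["url"],
        "wordCount": len(content.split()),
        "estimatedTokens": estimate_tokens(content),
        "contentLength": len(content),
    }
```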
💡 Use Cases
- RAG Pipelines — Ingest documentation, blogs, or knowledge bases into vector stores
- AI Research — Gather clean text from multiple pages for analysis or summarization
- Documentation Scraping — Extract entire doc sites into Markdown for offline use or fine-tuning
- Competitive Intelligence — Monitor competitor pages and detect content changes
- Content Migration — Convert any website to Markdown for import into Notion, Obsidian, or CMS tools
- LLM Context Prep — Feed live web content into AI agents and chatbots
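For the RAG use case, the Markdown output still needs to be chunked before embedding. A simple sketch (not part of the Actor) that splits on headings and packs sections into chunks sized by the same ~4 chars/token estimate:

```python
def chunk_markdown(markdown: str, max_tokens: int = 512) -> list[str]:
    """Split Markdown on headings, then pack sections into ~max_tokens chunks."""
    max_chars = max_tokens * 4  # same rough 4-chars/token estimate

    # Pass 1: split into sections at heading lines.
    sections, current = [], ""
    for line in markdown.splitlines(keepends=True):
        if line.startswith("#") and current:
            sections.append(current)
            current = line
        else:
            current += line
    if current:
        sections.append(current)

    # Pass 2: greedily pack whole sections into size-bounded chunks.
    chunks, buf = [], ""
    for section in sections:
        if buf and len(buf) + len(section) > max_chars:
            chunks.append(buf)
            buf = ""
        buf += section
    if buf:
        chunks.append(buf)
    return chunks
```

Heading-aware chunking keeps each embedded chunk on one topic, which tends to retrieve better than fixed-size windows; a production pipeline would also attach the page's `url` and `metadata` to each chunk for citations.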
⚙️ Options
| Option | Description |
|---|---|
| startUrls | One or more URLs to crawl |
| crawlMode | single (start URLs only), bfs (breadth-first), or dfs (depth-first) |
| maxCrawlDepth | How many link-hops deep to follow from start URLs (BFS/DFS only) |
| maxCrawlPages | Maximum total pages to crawl per run |
| sameDomainOnly | Only follow links within the same domain (default: on) |
| includeUrlPatterns | Regex patterns — only follow URLs that match |
| excludeUrlPatterns | Regex patterns — skip URLs that match (e.g. /login, \.pdf$) |
| outputFormats | Choose any combination: markdown, fitMarkdown, rawHtml, cleanedHtml, screenshot |
| cssSelector | Restrict extraction to a specific part of the page (e.g. article, main, #content) |
| excludeSelectors | CSS selectors for elements to strip before extraction (e.g. nav, .sidebar) |
| waitForSelector | Wait for a CSS selector to appear before extracting — useful for JS-rendered pages |
| waitForTimeout | Extra wait time in ms after page load (for lazy-loaded content) |
| executeJavaScript | Custom JS to run on each page before extraction (dismiss popups, click "show more", etc.) |
| scrollToBottom | Scroll the full page to trigger lazy-loaded and infinite-scroll content |
| includeLinks | Preserve hyperlinks in Markdown output (default: on) |
| includeImages | Include image references in Markdown output (default: on) |
| includeMetadata | Extract and include page metadata block (default: on) |
| maxConcurrency | Pages to crawl in parallel in standard mode (default: 5, max: 20). Stealth mode crawls sequentially for reliability. |
| requestTimeout | Max total seconds to spend on a page before giving up. In stealth mode this budget includes page load, challenge waits, selector waits, and retries (default: 30) |
| stealthMode | Enable stealth browser rendering to bypass bot detection (default: on, recommended) |
| proxyConfiguration | Optional proxy settings — Residential proxies are recommended for protected sites, but not required for ordinary public pages |
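A sample input combining several of these options, with illustrative values (URLs and selectors are placeholders):

```json
{
  "startUrls": [{ "url": "https://docs.example.com" }],
  "crawlMode": "bfs",
  "maxCrawlDepth": 2,
  "maxCrawlPages": 100,
  "sameDomainOnly": true,
  "excludeUrlPatterns": ["/login", "\\.pdf$"],
  "outputFormats": ["markdown", "fitMarkdown"],
  "cssSelector": "main",
  "excludeSelectors": ["nav", ".sidebar"],
  "stealthMode": true
}
```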
🛡️ Anti-Bot & Cloudflare Bypass
Most scrapers fail on modern websites because they're caught by bot detection at the browser and IP reputation layers. This Actor uses a hardened stealth browser path plus proxy support to reduce fingerprint-based detection and improve extraction success on tougher targets.
For the best results on Cloudflare-protected or heavily guarded sites:
- Enable Stealth Mode (default: on) — uses the Camoufox-based path for lower-friction browser fingerprinting
- Use Residential Proxies for guarded targets — datacenter IPs are blocked much more aggressively by systems like Cloudflare and Akamai, but many ordinary public sites do not need proxy spend at all
These settings materially improve compatibility with sites protected by systems like Cloudflare, Akamai, DataDome, and PerimeterX, but some sites may still challenge or block requests depending on IP reputation and challenge type.
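If you enable Residential Proxies, the `proxyConfiguration` input typically follows Apify's standard proxy shape; the exact fields accepted by this Actor may vary, so treat this as an illustrative sketch:

```json
{
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}
```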
🔗 API Usage
Trigger a crawl via the Apify API:
```shell
curl -X POST "https://api.apify.com/v2/acts/hounderd~ai-web-crawler/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "startUrls": [{"url": "https://docs.example.com"}],
    "crawlMode": "bfs",
    "maxCrawlDepth": 2,
    "maxCrawlPages": 50,
    "outputFormats": ["markdown", "fitMarkdown"]
  }'
```
Results are available in the run's default dataset once the status is SUCCEEDED.
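The run object returned by the call above includes a `defaultDatasetId`; items can then be fetched from the dataset-items endpoint. A minimal Python sketch using the standard library (query parameters per Apify's dataset API; pagination and auth handling are left out):

```python
import json
import urllib.request

def items_url(dataset_id: str, token: str) -> str:
    # Dataset items endpoint; dataset_id is the run's defaultDatasetId.
    return (
        f"https://api.apify.com/v2/datasets/{dataset_id}/items"
        f"?token={token}&format=json&clean=true"
    )

def fetch_items(dataset_id: str, token: str) -> list[dict]:
    # Returns one dict per crawled page, matching the Output Data table.
    with urllib.request.urlopen(items_url(dataset_id, token)) as resp:
        return json.load(resp)
```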