# AI-Ready Website Crawler

Crawl websites and convert to clean markdown for AI/RAG, LLM fine-tuning, and document pipelines.

Pricing: Pay per usage
Developer: Fulcria Labs
Last modified: 2 days ago
Crawls websites and converts pages to clean markdown suitable for AI/RAG knowledge bases, LLM fine-tuning, and document pipelines.
## What it does
This actor takes a starting URL, crawls the website following same-domain links, and outputs each page as clean markdown with metadata. It strips out navigation, ads, scripts, and other non-content elements to produce AI-ready text.
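The crawl described above is a breadth-first traversal bounded by page count, depth, and the start URL's domain. The following is a minimal sketch of that loop, not the actor's actual code; the in-memory `PAGES` site and the `crawl` function name are hypothetical stand-ins for real HTTP fetching and link extraction.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "site": URL -> links found on that page.
PAGES = {
    "https://docs.example.com/": ["https://docs.example.com/a", "https://other.com/x"],
    "https://docs.example.com/a": ["https://docs.example.com/b"],
    "https://docs.example.com/b": [],
}

def crawl(start_url, max_pages=50, max_depth=3):
    """BFS over same-domain links, bounded by page count and depth."""
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    crawled = []
    while queue and len(crawled) < max_pages:
        url, depth = queue.popleft()
        crawled.append((url, depth))
        if depth >= max_depth:
            continue
        for link in PAGES.get(url, []):
            # Same-domain restriction and deduplication.
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return crawled
```

Note how the external `https://other.com/x` link is discovered but never enqueued, which is the same-domain restriction in action.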
## Input

| Field | Type | Default | Description |
|---|---|---|---|
| startUrl | string | required | Primary URL to start crawling |
| additionalUrls | string[] | [] | Extra URLs to include in the crawl |
| maxPages | integer | 50 | Maximum pages to crawl (1-10000) |
| maxDepth | integer | 3 | Maximum link depth from start URL |
| requestsPerSecond | number | 2 | Rate limit for politeness |
| respectRobotsTxt | boolean | true | Honor robots.txt rules |
| includeUrlPatterns | string[] | [] | Regex patterns; only crawl matching URLs |
| excludeUrlPatterns | string[] | see below | Regex patterns; skip matching URLs |
| removeSelectors | string[] | see below | CSS selectors for elements to remove |
| contentSelectors | string[] | [] | CSS selectors to isolate main content |
| requestTimeoutSecs | integer | 30 | Per-request timeout (seconds) |
| userAgent | string | AIReadyWebsiteCrawler/1.0 | User-Agent header |
### Default exclude patterns

`\.(pdf|zip|tar|gz|mp4|mp3|...)$`, `/api/`, `/login`, `/logout`, `/signin`, `/signup`, `/auth/`
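The include/exclude options are plain regexes matched against each URL. A minimal sketch of how such filtering could work (the `should_crawl` name and exact matching semantics are assumptions, not the actor's internals):

```python
import re

def should_crawl(url, include_patterns=(), exclude_patterns=()):
    """Skip URLs matching any exclude pattern; if include patterns
    are given, only crawl URLs matching at least one of them."""
    if any(re.search(p, url) for p in exclude_patterns):
        return False
    if include_patterns:
        return any(re.search(p, url) for p in include_patterns)
    return True

# Example exclude list in the spirit of the defaults above.
excludes = [r"\.(pdf|zip|tar|gz|mp4|mp3)$", r"/api/", r"/login"]
```

With this logic, excludes always win, and an empty include list means "crawl everything not excluded".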
### Default remove selectors

`nav`, `footer`, `header`, `aside`, `.sidebar`, `.advertisement`, `.cookie-banner`, `script`, `style`, `noscript`, `iframe`, `svg`, and more.
## Output

Each crawled page produces a dataset item with:

```json
{
  "url": "https://docs.example.com/getting-started",
  "title": "Getting Started - Example Docs",
  "markdown": "---\ntitle: \"Getting Started\"\nurl: https://...\ncrawl_date: 2026-02-23T12:00:00Z\n---\n\n# Getting Started\n\nWelcome to...",
  "crawl_date": "2026-02-23T12:00:00+00:00",
  "depth": 1,
  "word_count": 342
}
```
The markdown field includes YAML frontmatter with title, URL, and crawl date, followed by the cleaned content.
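Assembling that frontmatter-prefixed markdown could look roughly like this (a sketch; the actor's exact field order and escaping rules may differ, and the function name is illustrative):

```python
def with_frontmatter(title, url, crawl_date, body):
    """Prefix cleaned markdown content with YAML frontmatter."""
    escaped_title = title.replace('"', '\\"')
    return (
        "---\n"
        f'title: "{escaped_title}"\n'
        f"url: {url}\n"
        f"crawl_date: {crawl_date}\n"
        "---\n\n"
        f"{body}"
    )
```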
## Example input

### Crawl documentation site

```json
{
  "startUrl": "https://docs.example.com",
  "maxPages": 100,
  "maxDepth": 5,
  "requestsPerSecond": 2
}
```

### Crawl specific section only

```json
{
  "startUrl": "https://example.com/docs/api",
  "maxPages": 50,
  "maxDepth": 3,
  "includeUrlPatterns": ["/docs/api/"],
  "contentSelectors": [".docs-content", "article"]
}
```

### Crawl multiple sites

```json
{
  "startUrl": "https://docs.example.com",
  "additionalUrls": ["https://blog.example.com", "https://wiki.example.com"],
  "maxPages": 200
}
```
## How the content cleaning works

- HTML fetching - Uses httpx with HTTP/2 support and configurable timeouts
- Element removal - Strips nav, footer, ads, scripts, and styles via CSS selectors
- Content isolation - Auto-detects `<main>`, `<article>`, or content divs (or uses your custom selectors)
- Markdown conversion - Converts to markdown, preserving headings, lists, tables, code blocks, and links
- Whitespace cleanup - Removes excessive blank lines and trailing whitespace
- Quality filter - Skips pages with fewer than 10 words of content
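The last two steps of the pipeline can be sketched with the standard library alone (the 10-word threshold comes from the description above; the function names are illustrative, not the actor's API):

```python
import re

def clean_whitespace(markdown):
    """Collapse runs of 3+ newlines and strip trailing whitespace."""
    lines = [line.rstrip() for line in markdown.splitlines()]
    text = "\n".join(lines)
    return re.sub(r"\n{3,}", "\n\n", text).strip() + "\n"

def passes_quality_filter(markdown, min_words=10):
    """Skip pages with fewer than min_words words of content."""
    return len(markdown.split()) >= min_words
```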
## Use cases
- Build RAG knowledge bases from documentation sites
- Create training datasets for LLM fine-tuning
- Index product documentation for AI assistants
- Archive website content in a portable format
- Feed content into vector databases (Pinecone, Weaviate, etc.)
## Technical details
- Async crawling with httpx for fast performance
- BFS traversal with configurable depth limits
- URL deduplication with fragment removal and normalization
- robots.txt compliance with per-domain caching
- Token bucket rate limiting for polite crawling
- Same-domain restriction prevents crawling external sites
- lxml parser for fast, robust HTML parsing
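URL deduplication with fragment removal might be implemented roughly like this, using only the standard library (normalization rules beyond fragment stripping, such as trailing-slash handling, are assumptions):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Drop the fragment, lowercase scheme and host, and strip a
    trailing slash on the path so near-duplicates dedupe to one key."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))
```

Keeping a `set` of normalized URLs then prevents the crawler from fetching `/guide`, `/guide/`, and `/guide/#intro` as three separate pages.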