AI-Ready Website Crawler

Pricing: Pay per usage

Crawl websites and convert to clean markdown for AI/RAG, LLM fine-tuning, and document pipelines.


Developer: Fulcria Labs (Maintained by Community)


Last modified: 2 days ago


What it does

This actor takes a starting URL, crawls the website following same-domain links, and outputs each page as clean markdown with metadata. It strips out navigation, ads, scripts, and other non-content elements to produce AI-ready text.
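The crawl described above is essentially a breadth-first traversal over same-domain links with page and depth limits. The sketch below is illustrative, not the actor's actual code; `links_for` is a hypothetical stand-in for fetching a page and extracting its links.

```python
from collections import deque
from urllib.parse import urlsplit

def crawl_order(start_url, links_for, max_pages=50, max_depth=3):
    """Visit same-domain pages breadth-first, honoring page and depth limits.

    `links_for(url)` is a placeholder for "fetch the page and return its links".
    """
    domain = urlsplit(start_url).netloc
    seen = {start_url}          # dedup set: never enqueue a URL twice
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:  # don't follow links past the depth limit
            continue
        for link in links_for(url):
            if urlsplit(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

External domains (a different `netloc`) are skipped at enqueue time, which is what keeps the crawl from wandering off-site.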

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| startUrl | string | (required) | Primary URL to start crawling |
| additionalUrls | string[] | [] | Extra URLs to include in the crawl |
| maxPages | integer | 50 | Maximum pages to crawl (1-10000) |
| maxDepth | integer | 3 | Maximum link depth from the start URL |
| requestsPerSecond | number | 2 | Rate limit for politeness |
| respectRobotsTxt | boolean | true | Honor robots.txt rules |
| includeUrlPatterns | string[] | [] | Regex patterns; only crawl matching URLs |
| excludeUrlPatterns | string[] | see below | Regex patterns; skip matching URLs |
| removeSelectors | string[] | see below | CSS selectors for elements to remove |
| contentSelectors | string[] | [] | CSS selectors to isolate the main content |
| requestTimeoutSecs | integer | 30 | Per-request timeout in seconds |
| userAgent | string | AIReadyWebsiteCrawler/1.0 | User-Agent header |

Default exclude patterns

\.(pdf|zip|tar|gz|mp4|mp3|...)$
/api/
/login, /logout, /signin, /signup, /auth/
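The include/exclude semantics from the input table can be sketched as a simple regex filter. This is an illustration, not the actor's code, and the exclude list below is abridged to the patterns shown above (the `...` in the extension pattern stands for further file types that are not reproduced here):

```python
import re

# Abridged version of the default exclude patterns listed above
DEFAULT_EXCLUDE = [
    r"\.(pdf|zip|tar|gz|mp4|mp3)$",
    r"/api/",
    r"/(login|logout|signin|signup)",
    r"/auth/",
]

def should_crawl(url, include=None, exclude=DEFAULT_EXCLUDE):
    """includeUrlPatterns: if set, the URL must match at least one.
    excludeUrlPatterns: the URL must match none."""
    if include and not any(re.search(p, url) for p in include):
        return False
    return not any(re.search(p, url) for p in exclude)
```

Exclude patterns win over include patterns here, which matches the usual "allowlist then blocklist" ordering.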

Default remove selectors

nav, footer, header, aside, .sidebar, .advertisement, .cookie-banner, script, style, noscript, iframe, svg, and more.

Output

Each crawled page produces a dataset item with:

{
  "url": "https://docs.example.com/getting-started",
  "title": "Getting Started - Example Docs",
  "markdown": "---\ntitle: \"Getting Started\"\nurl: https://...\ncrawl_date: 2026-02-23T12:00:00Z\n---\n\n# Getting Started\n\nWelcome to...",
  "crawl_date": "2026-02-23T12:00:00+00:00",
  "depth": 1,
  "word_count": 342
}

The markdown field includes YAML frontmatter with title, URL, and crawl date, followed by the cleaned content.
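A minimal sketch of how such a frontmatter block could be assembled (illustrative only; the field layout follows the sample item above, and the helper name is made up):

```python
from datetime import datetime, timezone

def build_markdown(title, url, body):
    """Prepend YAML frontmatter (title, URL, crawl date) to cleaned markdown."""
    crawl_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    escaped = title.replace('"', '\\"')  # keep the quoted YAML string valid
    frontmatter = (
        "---\n"
        f'title: "{escaped}"\n'
        f"url: {url}\n"
        f"crawl_date: {crawl_date}\n"
        "---\n\n"
    )
    return frontmatter + body
```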

Example inputs

Crawl documentation site

{
  "startUrl": "https://docs.example.com",
  "maxPages": 100,
  "maxDepth": 5,
  "requestsPerSecond": 2
}

Crawl specific section only

{
  "startUrl": "https://example.com/docs/api",
  "maxPages": 50,
  "maxDepth": 3,
  "includeUrlPatterns": ["/docs/api/"],
  "contentSelectors": [".docs-content", "article"]
}

Crawl multiple sites

{
  "startUrl": "https://docs.example.com",
  "additionalUrls": [
    "https://blog.example.com",
    "https://wiki.example.com"
  ],
  "maxPages": 200
}

How the content cleaning works

  1. HTML fetching - Uses httpx with HTTP/2 support and configurable timeouts
  2. Element removal - Strips nav, footer, ads, scripts, styles via CSS selectors
  3. Content isolation - Auto-detects <main>, <article>, or content divs (or uses your custom selectors)
  4. Markdown conversion - Converts to markdown preserving headings, lists, tables, code blocks, and links
  5. Whitespace cleanup - Removes excessive blank lines and trailing whitespace
  6. Quality filter - Skips pages with fewer than 10 words of content
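Steps 2, 5, and 6 can be sketched with the standard library alone. The actor itself uses lxml and CSS selectors; this simplified version filters by tag name only, so it is a rough approximation, not the real pipeline:

```python
from html.parser import HTMLParser
import re

# Tag-name subset of the default remove selectors listed above
REMOVE_TAGS = {"nav", "footer", "header", "aside", "script", "style",
               "noscript", "iframe", "svg"}

class ContentExtractor(HTMLParser):
    """Collects text that is not inside a boilerplate tag (step 2)."""
    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a removed element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in REMOVE_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in REMOVE_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

def extract_text(html, min_words=10):
    parser = ContentExtractor()
    parser.feed(html)
    text = "".join(parser.parts)
    text = re.sub(r"\n{3,}", "\n\n", text)                      # step 5: collapse blank lines
    text = "\n".join(line.rstrip() for line in text.splitlines()).strip()
    if len(text.split()) < min_words:
        return None                                             # step 6: quality filter
    return text
```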

Use cases

  • Build RAG knowledge bases from documentation sites
  • Create training datasets for LLM fine-tuning
  • Index product documentation for AI assistants
  • Archive website content in a portable format
  • Feed content into vector databases (Pinecone, Weaviate, etc.)
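Before embedding dataset items into a vector database, the `markdown` field is typically split into overlapping chunks. A minimal sketch (the chunk and overlap sizes are illustrative choices, not recommendations from this actor):

```python
def chunk_markdown(markdown, max_words=200, overlap=20):
    """Split text into word-based chunks with a small overlap between neighbors,
    so a sentence cut at a boundary still appears whole in one of the chunks."""
    words = markdown.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), step)]
```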

Technical details

  • Async crawling with httpx for fast performance
  • BFS traversal with configurable depth limits
  • URL deduplication with fragment removal and normalization
  • robots.txt compliance with per-domain caching
  • Token bucket rate limiting for polite crawling
  • Same-domain restriction prevents crawling external sites
  • lxml parser for fast, robust HTML parsing
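Two of the bullets above, token-bucket rate limiting and URL normalization with fragment removal, can be sketched as follows. This is a simplified, synchronous illustration (the actor itself crawls asynchronously), not its actual implementation:

```python
import time
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Drop the fragment and lowercase the host so duplicates collapse to one key."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))

class TokenBucket:
    """Minimal token bucket: `rate` requests/second, bursts up to `capacity`."""
    def __init__(self, rate, capacity=None):
        self.rate = rate
        self.capacity = capacity if capacity is not None else rate
        self.tokens = float(self.capacity)
        self.last = time.monotonic()

    def acquire(self):
        # Refill tokens for the time elapsed since the last call
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < 1:
            time.sleep((1 - self.tokens) / self.rate)  # wait for one token to accrue
            self.last = time.monotonic()
            self.tokens = 0.0
        else:
            self.tokens -= 1
```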