AI-Ready Website Crawler

Pricing: Pay per usage

Crawl websites and convert to clean markdown for AI/RAG, LLM fine-tuning, and document pipelines.


Developer: Fulcria Labs (Maintained by Community)


Last modified: 2 days ago


What it does

This actor takes a starting URL, crawls the website following same-domain links, and outputs each page as clean markdown with metadata. It strips out navigation, ads, scripts, and other non-content elements to produce AI-ready text.
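The crawl described above is essentially a breadth-first traversal over same-domain links with page and depth limits. The sketch below is illustrative, not the actor's actual code; `links_for` is a hypothetical stand-in for fetching a page and extracting its links.

```python
from collections import deque
from urllib.parse import urlsplit

def crawl_order(start_url, links_for, max_pages=50, max_depth=3):
    """Visit same-domain pages breadth-first, honoring page and depth limits.

    `links_for(url)` is a placeholder for "fetch the page and return its links".
    """
    domain = urlsplit(start_url).netloc
    seen = {start_url}          # dedup set: never enqueue a URL twice
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:  # don't follow links past the depth limit
            continue
        for link in links_for(url):
            if urlsplit(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

External domains (a different `netloc`) are skipped at enqueue time, which is what keeps the crawl from wandering off-site.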

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| startUrl | string | (required) | Primary URL to start crawling |
| additionalUrls | string[] | [] | Extra URLs to include in the crawl |
| maxPages | integer | 50 | Maximum pages to crawl (1-10000) |
| maxDepth | integer | 3 | Maximum link depth from the start URL |
| requestsPerSecond | number | 2 | Rate limit for politeness |
| respectRobotsTxt | boolean | true | Honor robots.txt rules |
| includeUrlPatterns | string[] | [] | Regex patterns; only crawl matching URLs |
| excludeUrlPatterns | string[] | see below | Regex patterns; skip matching URLs |
| removeSelectors | string[] | see below | CSS selectors for elements to remove |
| contentSelectors | string[] | [] | CSS selectors to isolate the main content |
| requestTimeoutSecs | integer | 30 | Per-request timeout in seconds |
| userAgent | string | AIReadyWebsiteCrawler/1.0 | User-Agent header |

Default exclude patterns

\.(pdf|zip|tar|gz|mp4|mp3|...)$
/api/
/login, /logout, /signin, /signup, /auth/
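The include/exclude semantics from the input table can be sketched as a simple regex filter. This is an illustration, not the actor's code, and the exclude list below is abridged to the patterns shown above (the `...` in the extension pattern stands for further file types that are not reproduced here):

```python
import re

# Abridged version of the default exclude patterns listed above
DEFAULT_EXCLUDE = [
    r"\.(pdf|zip|tar|gz|mp4|mp3)$",
    r"/api/",
    r"/(login|logout|signin|signup)",
    r"/auth/",
]

def should_crawl(url, include=None, exclude=DEFAULT_EXCLUDE):
    """includeUrlPatterns: if set, the URL must match at least one.
    excludeUrlPatterns: the URL must match none."""
    if include and not any(re.search(p, url) for p in include):
        return False
    return not any(re.search(p, url) for p in exclude)
```

Exclude patterns win over include patterns here, which matches the usual "allowlist then blocklist" ordering.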

Default remove selectors

nav, footer, header, aside, .sidebar, .advertisement, .cookie-banner, script, style, noscript, iframe, svg, and more.

Output

Each crawled page produces a dataset item with:

{
  "url": "https://docs.example.com/getting-started",
  "title": "Getting Started - Example Docs",
  "markdown": "---\ntitle: \"Getting Started\"\nurl: https://...\ncrawl_date: 2026-02-23T12:00:00Z\n---\n\n# Getting Started\n\nWelcome to...",
  "crawl_date": "2026-02-23T12:00:00+00:00",
  "depth": 1,
  "word_count": 342
}

The markdown field includes YAML frontmatter with title, URL, and crawl date, followed by the cleaned content.
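A minimal sketch of how such a frontmatter block could be assembled (illustrative only; the field layout follows the sample item above, and the helper name is made up):

```python
from datetime import datetime, timezone

def build_markdown(title, url, body):
    """Prepend YAML frontmatter (title, URL, crawl date) to cleaned markdown."""
    crawl_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    escaped = title.replace('"', '\\"')  # keep the quoted YAML string valid
    frontmatter = (
        "---\n"
        f'title: "{escaped}"\n'
        f"url: {url}\n"
        f"crawl_date: {crawl_date}\n"
        "---\n\n"
    )
    return frontmatter + body
```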

Example inputs

Crawl documentation site

{
  "startUrl": "https://docs.example.com",
  "maxPages": 100,
  "maxDepth": 5,
  "requestsPerSecond": 2
}

Crawl specific section only

{
  "startUrl": "https://example.com/docs/api",
  "maxPages": 50,
  "maxDepth": 3,
  "includeUrlPatterns": ["/docs/api/"],
  "contentSelectors": [".docs-content", "article"]
}

Crawl multiple sites

{
  "startUrl": "https://docs.example.com",
  "additionalUrls": [
    "https://blog.example.com",
    "https://wiki.example.com"
  ],
  "maxPages": 200
}

How the content cleaning works

  1. HTML fetching - Uses httpx with HTTP/2 support and configurable timeouts
  2. Element removal - Strips nav, footer, ads, scripts, styles via CSS selectors
  3. Content isolation - Auto-detects <main>, <article>, or content divs (or uses your custom selectors)
  4. Markdown conversion - Converts to markdown preserving headings, lists, tables, code blocks, and links
  5. Whitespace cleanup - Removes excessive blank lines and trailing whitespace
  6. Quality filter - Skips pages with fewer than 10 words of content
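Steps 2, 5, and 6 can be sketched with the standard library alone. The actor itself uses lxml and CSS selectors; this simplified version filters by tag name only, so it is a rough approximation, not the real pipeline:

```python
from html.parser import HTMLParser
import re

# Tag-name subset of the default remove selectors listed above
REMOVE_TAGS = {"nav", "footer", "header", "aside", "script", "style",
               "noscript", "iframe", "svg"}

class ContentExtractor(HTMLParser):
    """Collects text that is not inside a boilerplate tag (step 2)."""
    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a removed element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in REMOVE_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in REMOVE_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

def extract_text(html, min_words=10):
    parser = ContentExtractor()
    parser.feed(html)
    text = "".join(parser.parts)
    text = re.sub(r"\n{3,}", "\n\n", text)                      # step 5: collapse blank lines
    text = "\n".join(line.rstrip() for line in text.splitlines()).strip()
    if len(text.split()) < min_words:
        return None                                             # step 6: quality filter
    return text
```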

Use cases

  • Build RAG knowledge bases from documentation sites
  • Create training datasets for LLM fine-tuning
  • Index product documentation for AI assistants
  • Archive website content in a portable format
  • Feed content into vector databases (Pinecone, Weaviate, etc.)
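Before embedding dataset items into a vector database, the `markdown` field is typically split into overlapping chunks. A minimal sketch (the chunk and overlap sizes are illustrative choices, not recommendations from this actor):

```python
def chunk_markdown(markdown, max_words=200, overlap=20):
    """Split text into word-based chunks with a small overlap between neighbors,
    so a sentence cut at a boundary still appears whole in one of the chunks."""
    words = markdown.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), step)]
```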

Technical details

  • Async crawling with httpx for fast performance
  • BFS traversal with configurable depth limits
  • URL deduplication with fragment removal and normalization
  • robots.txt compliance with per-domain caching
  • Token bucket rate limiting for polite crawling
  • Same-domain restriction prevents crawling external sites
  • lxml parser for fast, robust HTML parsing
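Two of the bullets above, token-bucket rate limiting and URL normalization with fragment removal, can be sketched as follows. This is a simplified, synchronous illustration (the actor itself crawls asynchronously), not its actual implementation:

```python
import time
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Drop the fragment and lowercase the host so duplicates collapse to one key."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))

class TokenBucket:
    """Minimal token bucket: `rate` requests/second, bursts up to `capacity`."""
    def __init__(self, rate, capacity=None):
        self.rate = rate
        self.capacity = capacity if capacity is not None else rate
        self.tokens = float(self.capacity)
        self.last = time.monotonic()

    def acquire(self):
        # Refill tokens for the time elapsed since the last call
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens < 1:
            time.sleep((1 - self.tokens) / self.rate)  # wait for one token to accrue
            self.last = time.monotonic()
            self.tokens = 0.0
        else:
            self.tokens -= 1
```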