Website Content Crawler for AI — Clean Markdown, 4x Cheaper

Crawl any website and extract clean text/markdown for LLMs, RAG pipelines, and vector databases. BFS crawl with depth control, robots.txt support, and boilerplate removal. Perfect for feeding AI models. $0.001/page — 4x cheaper than the official Apify crawler.

Pricing: Pay per usage
Developer: Ken Digital (Maintained by Community)

🕷️ Website Content Crawler for AI — Clean Markdown, 4x Cheaper

Crawl any website and extract clean, structured content as Markdown, plain text, or HTML. Built specifically for feeding AI models, LLM applications, vector databases, and RAG pipelines.

Why This Actor?

| Feature | This Actor | Apify Web Scraper | Generic Crawlers |
|---|---|---|---|
| Price per page | $0.001 | $0.004+ | $0.005+ |
| Output format | Markdown, Text, HTML | Raw HTML | Raw HTML |
| AI-ready content | ✅ Clean, no boilerplate | ❌ Manual cleaning needed | ❌ Manual cleaning needed |
| Strips ads/nav/scripts | ✅ Automatic | ❌ No | ❌ No |
| robots.txt | ✅ Respected | ⚠️ Optional | ❌ Often ignored |
| Zero config | ✅ Just add URLs | ❌ Needs selectors | ❌ Needs setup |

4x cheaper than alternatives. Same quality output. No configuration needed.

🎯 Perfect For

  • RAG pipelines — Feed clean documents into your retrieval system
  • LLM fine-tuning — Training data without HTML noise
  • Vector databases — Chunk clean markdown for embeddings (Pinecone, Weaviate, Qdrant)
  • Knowledge bases — Build structured content libraries
  • Content analysis — Word counts, link graphs, language detection
  • AI agents — Give your agents access to any website's content

🚀 Quick Start

Input

```json
{
  "startUrls": [
    { "url": "https://docs.python.org/3/" }
  ],
  "maxPages": 50,
  "maxDepth": 3,
  "outputFormat": "markdown"
}
```

Output (per page)

```json
{
  "url": "https://docs.python.org/3/tutorial/index.html",
  "title": "The Python Tutorial",
  "content": "# The Python Tutorial\n\nPython is an easy to learn, powerful programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming...\n\n## An Informal Introduction to Python\n\nIn the following examples, input and output are distinguished by the presence or absence of prompts...\n\n- [Whetting Your Appetite](appetite.html)\n- [Using the Python Interpreter](interpreter.html)\n- [An Informal Introduction to Python](introduction.html)\n- [More Control Flow Tools](controlflow.html)",
  "wordCount": 1247,
  "language": "en",
  "links": [
    "https://docs.python.org/3/tutorial/appetite.html",
    "https://docs.python.org/3/tutorial/interpreter.html"
  ],
  "crawledAt": "2026-03-28T21:00:00.000Z",
  "statusCode": 200
}
```

⚙️ Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| `startUrls` | Array | required | URLs to start crawling from |
| `maxPages` | Number | `50` | Maximum pages to crawl |
| `maxDepth` | Number | `3` | How deep to follow links (0 = start URLs only) |
| `sameDomainOnly` | Boolean | `true` | Only follow links on the same domain |
| `includeGlobs` | Array | `[]` | Only crawl URLs matching these glob patterns |
| `excludeGlobs` | Array | `[]` | Skip URLs matching these glob patterns |
| `outputFormat` | Enum | `"markdown"` | Output format: `markdown`, `text`, or `html` |
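For instance, the glob parameters can narrow a crawl to one section of a site while skipping uninteresting pages. The patterns below are illustrative; the exact glob syntax supported is an assumption here, so check the input schema:

```json
{
  "startUrls": [{ "url": "https://docs.python.org/3/" }],
  "maxPages": 100,
  "maxDepth": 4,
  "includeGlobs": ["https://docs.python.org/3/tutorial/**"],
  "excludeGlobs": ["**/whatsnew/**", "**/*.pdf"],
  "outputFormat": "markdown"
}
```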

🧹 What Gets Cleaned

The crawler automatically removes:

  • ✂️ Navigation bars (<nav>, menu classes)
  • ✂️ Headers & footers (site-wide, not content headings)
  • ✂️ Scripts & styles (JavaScript, CSS)
  • ✂️ Ads & tracking (common ad container patterns)
  • ✂️ Cookie banners & popups
  • ✂️ Social share buttons
  • ✂️ Sidebars & widgets
  • ✂️ Comment sections

What's preserved:

  • ✅ Headings (H1-H6 → # to ######)
  • ✅ Paragraphs with proper spacing
  • ✅ Lists (ordered and unordered)
  • ✅ Links with URLs
  • ✅ Code blocks
  • ✅ Bold and italic text
  • ✅ Tables
  • ✅ Image alt text
  • ✅ Blockquotes
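To illustrate the idea (this is a minimal sketch, not the actor's actual implementation), boilerplate removal with the stdlib `html.parser` can be approximated by dropping entire subtrees of known-boilerplate tags while collecting the remaining visible text:

```python
from html.parser import HTMLParser

# Tags whose entire subtree is treated as boilerplate (illustrative set).
SKIP_TAGS = {"nav", "script", "style", "footer", "aside"}

class BoilerplateStripper(HTMLParser):
    """Collect visible text, ignoring everything inside SKIP_TAGS."""
    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a boilerplate subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is outside every boilerplate subtree.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def strip_boilerplate(html: str) -> str:
    parser = BoilerplateStripper()
    parser.feed(html)
    return "\n".join(parser.chunks)
```

The real cleaner also converts the surviving elements (headings, lists, links, tables) to Markdown; the sketch only shows the subtree-skipping part.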

🔗 Integration Examples

Feed into OpenAI / LangChain

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from apify_client import ApifyClient

client = ApifyClient("YOUR_TOKEN")
run = client.actor("your-username/website-content-crawler").call(
    run_input={"startUrls": [{"url": "https://example.com"}], "maxPages": 100}
)

splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    chunks = splitter.split_text(item["content"])
    # Feed chunks to your LLM / vector DB
```

Load into Pinecone

```python
import pinecone
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

# After running the crawler: `index` is your Pinecone index and
# `dataset` is the crawler run's default dataset.
for item in dataset.iterate_items():
    embedding = openai_client.embeddings.create(
        input=item["content"][:8000],  # stay under the embedding input limit
        model="text-embedding-3-small"
    ).data[0].embedding
    index.upsert([(item["url"], embedding, {"title": item["title"], "content": item["content"]})])
```

💰 Pricing

$0.001 per page crawled — that's it.

| Pages | Cost | vs. Alternatives |
|---|---|---|
| 100 | $0.10 | Save $0.30+ |
| 1,000 | $1.00 | Save $3.00+ |
| 10,000 | $10.00 | Save $30.00+ |
| 100,000 | $100.00 | Save $300.00+ |

No monthly fees. No minimum commitment. Pay only for what you crawl.

🛡️ Responsible Crawling

  • ✅ Respects robots.txt directives
  • ✅ Rate-limited requests (max ~2 req/sec per domain)
  • ✅ Proper User-Agent identification
  • ✅ Follows redirects correctly
  • ✅ Skips binary files automatically
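A per-domain rate limit like the ~2 req/sec described above can be sketched as a last-request timestamp per host. This is an illustration, not the actor's code; the clock and sleep functions are injectable only to keep the sketch testable:

```python
import time
from urllib.parse import urlparse

class DomainRateLimiter:
    """Allow at most one request per `min_interval` seconds per host."""
    def __init__(self, min_interval=0.5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # 0.5 s ~= 2 requests/sec
        self.clock = clock
        self.sleep = sleep
        self.last_request = {}  # host -> timestamp of its last request

    def wait(self, url: str) -> None:
        """Block until it's safe to hit this URL's host, then record the hit."""
        host = urlparse(url).netloc
        last = self.last_request.get(host)
        if last is not None:
            remaining = self.min_interval - (self.clock() - last)
            if remaining > 0:
                self.sleep(remaining)
        self.last_request[host] = self.clock()
```

Calling `limiter.wait(url)` before each fetch spaces requests to the same host while leaving requests to different hosts unthrottled.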

📊 Technical Details

  • Engine: httpx with HTTP/2 support
  • Parser: Python stdlib html.parser (fast, no heavy dependencies)
  • Crawl strategy: Breadth-first search (BFS) with depth control
  • Deduplication: URL normalization prevents re-crawling
  • Encoding: Auto-detected from Content-Type headers
  • Language detection: Heuristic-based from content analysis
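The BFS-with-depth-control and URL-normalization points above can be sketched as follows. This is an outline, not the actor's source; `fetch_links` stands in for the real HTTP fetch plus link extraction:

```python
from collections import deque
from urllib.parse import urldefrag

def normalize_url(url: str) -> str:
    """Drop fragments and trailing slashes so equivalent URLs dedupe."""
    url, _fragment = urldefrag(url)
    return url.rstrip("/")

def bfs_crawl(start_urls, fetch_links, max_pages=50, max_depth=3):
    """Breadth-first crawl: visit each normalized URL once, up to max_depth."""
    queue = deque((normalize_url(u), 0) for u in start_urls)
    seen = {u for u, _ in queue}
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # don't expand links past the depth limit
        for link in fetch_links(url):
            link = normalize_url(link)
            if link not in seen:  # normalization-based deduplication
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

Because the queue is FIFO, all pages at depth *d* are visited before any page at depth *d+1*, which is what makes `maxDepth` a hard cutoff rather than a heuristic.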

Changelog

v1.0 (2026-03-28)

  • Initial release
  • BFS crawling with depth control
  • Markdown/text/HTML output formats
  • robots.txt compliance
  • Boilerplate removal (nav, footer, ads, scripts)
  • Link extraction and same-domain filtering
  • Glob pattern matching for URL inclusion/exclusion
  • Pay-per-event pricing at $0.001/page