Website Content Crawler for AI — Clean Markdown, 4x Cheaper

Crawl any website and extract clean text/markdown for LLMs, RAG pipelines, and vector databases. BFS crawl with depth control, robots.txt support, and boilerplate removal. Perfect for feeding AI models. $0.001/page — 4x cheaper than the official Apify crawler.

Pricing: Pay per usage
Developer: Ken Digital (Maintained by Community)

🕷️ Website Content Crawler for AI — Clean Markdown, 4x Cheaper

Crawl any website and extract clean, structured content as Markdown, plain text, or HTML. Built specifically for feeding AI models, LLM applications, vector databases, and RAG pipelines.

Why This Actor?

| Feature | This Actor | Apify Web Scraper | Generic Crawlers |
|---|---|---|---|
| Price per page | $0.001 | $0.004+ | $0.005+ |
| Output format | Markdown, Text, HTML | Raw HTML | Raw HTML |
| AI-ready content | ✅ Clean, no boilerplate | ❌ Manual cleaning needed | ❌ Manual cleaning needed |
| Strips ads/nav/scripts | ✅ Automatic | ❌ No | ❌ No |
| robots.txt | ✅ Respected | ⚠️ Optional | ❌ Often ignored |
| Zero config | ✅ Just add URLs | ❌ Needs selectors | ❌ Needs setup |

4x cheaper than alternatives. Same quality output. No configuration needed.

🎯 Perfect For

  • RAG pipelines — Feed clean documents into your retrieval system
  • LLM fine-tuning — Training data without HTML noise
  • Vector databases — Chunk clean markdown for embeddings (Pinecone, Weaviate, Qdrant)
  • Knowledge bases — Build structured content libraries
  • Content analysis — Word counts, link graphs, language detection
  • AI agents — Give your agents access to any website's content

🚀 Quick Start

Input

```json
{
  "startUrls": [
    { "url": "https://docs.python.org/3/" }
  ],
  "maxPages": 50,
  "maxDepth": 3,
  "outputFormat": "markdown"
}
```

Output (per page)

```json
{
  "url": "https://docs.python.org/3/tutorial/index.html",
  "title": "The Python Tutorial",
  "content": "# The Python Tutorial\n\nPython is an easy to learn, powerful programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming...\n\n## An Informal Introduction to Python\n\nIn the following examples, input and output are distinguished by the presence or absence of prompts...\n\n- [Whetting Your Appetite](appetite.html)\n- [Using the Python Interpreter](interpreter.html)\n- [An Informal Introduction to Python](introduction.html)\n- [More Control Flow Tools](controlflow.html)",
  "wordCount": 1247,
  "language": "en",
  "links": [
    "https://docs.python.org/3/tutorial/appetite.html",
    "https://docs.python.org/3/tutorial/interpreter.html"
  ],
  "crawledAt": "2026-03-28T21:00:00.000Z",
  "statusCode": 200
}
```

⚙️ Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| `startUrls` | Array | required | URLs to start crawling from |
| `maxPages` | Number | `50` | Maximum pages to crawl |
| `maxDepth` | Number | `3` | How deep to follow links (0 = start URLs only) |
| `sameDomainOnly` | Boolean | `true` | Only follow links on the same domain |
| `includeGlobs` | Array | `[]` | Only crawl URLs matching these glob patterns |
| `excludeGlobs` | Array | `[]` | Skip URLs matching these glob patterns |
| `outputFormat` | Enum | `"markdown"` | Output format: `markdown`, `text`, or `html` |
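For instance, the glob parameters can narrow a crawl to one section of a site while skipping uninteresting pages. The patterns below are illustrative; the exact glob syntax supported is an assumption here, so check the input schema:

```json
{
  "startUrls": [{ "url": "https://docs.python.org/3/" }],
  "maxPages": 100,
  "maxDepth": 4,
  "includeGlobs": ["https://docs.python.org/3/tutorial/**"],
  "excludeGlobs": ["**/whatsnew/**", "**/*.pdf"],
  "outputFormat": "markdown"
}
```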

🧹 What Gets Cleaned

The crawler automatically removes:

  • ✂️ Navigation bars (<nav>, menu classes)
  • ✂️ Headers & footers (site-wide, not content headings)
  • ✂️ Scripts & styles (JavaScript, CSS)
  • ✂️ Ads & tracking (common ad container patterns)
  • ✂️ Cookie banners & popups
  • ✂️ Social share buttons
  • ✂️ Sidebars & widgets
  • ✂️ Comment sections

What's preserved:

  • ✅ Headings (H1-H6 → # to ######)
  • ✅ Paragraphs with proper spacing
  • ✅ Lists (ordered and unordered)
  • ✅ Links with URLs
  • ✅ Code blocks
  • ✅ Bold and italic text
  • ✅ Tables
  • ✅ Image alt text
  • ✅ Blockquotes
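To illustrate the idea (this is a minimal sketch, not the actor's actual implementation), boilerplate removal with the stdlib `html.parser` can be approximated by dropping entire subtrees of known-boilerplate tags while collecting the remaining visible text:

```python
from html.parser import HTMLParser

# Tags whose entire subtree is treated as boilerplate (illustrative set).
SKIP_TAGS = {"nav", "script", "style", "footer", "aside"}

class BoilerplateStripper(HTMLParser):
    """Collect visible text, ignoring everything inside SKIP_TAGS."""
    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a boilerplate subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is outside every boilerplate subtree.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def strip_boilerplate(html: str) -> str:
    parser = BoilerplateStripper()
    parser.feed(html)
    return "\n".join(parser.chunks)
```

The real cleaner also converts the surviving elements (headings, lists, links, tables) to Markdown; the sketch only shows the subtree-skipping part.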

🔗 Integration Examples

Feed into OpenAI / LangChain

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from apify_client import ApifyClient

client = ApifyClient("YOUR_TOKEN")
run = client.actor("your-username/website-content-crawler").call(
    run_input={"startUrls": [{"url": "https://example.com"}], "maxPages": 100}
)

splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    chunks = splitter.split_text(item["content"])
    # Feed chunks to your LLM / vector DB
```

Load into Pinecone

```python
import pinecone
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

# After running the crawler: `index` is your Pinecone index and
# `dataset` is the crawler run's default dataset.
for item in dataset.iterate_items():
    embedding = openai_client.embeddings.create(
        input=item["content"][:8000],  # stay under the embedding input limit
        model="text-embedding-3-small"
    ).data[0].embedding
    index.upsert([(item["url"], embedding, {"title": item["title"], "content": item["content"]})])
```

💰 Pricing

$0.001 per page crawled — that's it.

| Pages | Cost | vs. Alternatives |
|---|---|---|
| 100 | $0.10 | Save $0.30+ |
| 1,000 | $1.00 | Save $3.00+ |
| 10,000 | $10.00 | Save $30.00+ |
| 100,000 | $100.00 | Save $300.00+ |

No monthly fees. No minimum commitment. Pay only for what you crawl.

🛡️ Responsible Crawling

  • ✅ Respects robots.txt directives
  • ✅ Rate-limited requests (max ~2 req/sec per domain)
  • ✅ Proper User-Agent identification
  • ✅ Follows redirects correctly
  • ✅ Skips binary files automatically
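A per-domain rate limit like the ~2 req/sec described above can be sketched as a last-request timestamp per host. This is an illustration, not the actor's code; the clock and sleep functions are injectable only to keep the sketch testable:

```python
import time
from urllib.parse import urlparse

class DomainRateLimiter:
    """Allow at most one request per `min_interval` seconds per host."""
    def __init__(self, min_interval=0.5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # 0.5 s ~= 2 requests/sec
        self.clock = clock
        self.sleep = sleep
        self.last_request = {}  # host -> timestamp of its last request

    def wait(self, url: str) -> None:
        """Block until it's safe to hit this URL's host, then record the hit."""
        host = urlparse(url).netloc
        last = self.last_request.get(host)
        if last is not None:
            remaining = self.min_interval - (self.clock() - last)
            if remaining > 0:
                self.sleep(remaining)
        self.last_request[host] = self.clock()
```

Calling `limiter.wait(url)` before each fetch spaces requests to the same host while leaving requests to different hosts unthrottled.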

📊 Technical Details

  • Engine: httpx with HTTP/2 support
  • Parser: Python stdlib html.parser (fast, no heavy dependencies)
  • Crawl strategy: Breadth-first search (BFS) with depth control
  • Deduplication: URL normalization prevents re-crawling
  • Encoding: Auto-detected from Content-Type headers
  • Language detection: Heuristic-based from content analysis
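The BFS-with-depth-control and URL-normalization points above can be sketched as follows. This is an outline, not the actor's source; `fetch_links` stands in for the real HTTP fetch plus link extraction:

```python
from collections import deque
from urllib.parse import urldefrag

def normalize_url(url: str) -> str:
    """Drop fragments and trailing slashes so equivalent URLs dedupe."""
    url, _fragment = urldefrag(url)
    return url.rstrip("/")

def bfs_crawl(start_urls, fetch_links, max_pages=50, max_depth=3):
    """Breadth-first crawl: visit each normalized URL once, up to max_depth."""
    queue = deque((normalize_url(u), 0) for u in start_urls)
    seen = {u for u, _ in queue}
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # don't expand links past the depth limit
        for link in fetch_links(url):
            link = normalize_url(link)
            if link not in seen:  # normalization-based deduplication
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

Because the queue is FIFO, all pages at depth *d* are visited before any page at depth *d+1*, which is what makes `maxDepth` a hard cutoff rather than a heuristic.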

Changelog

v1.0 (2026-03-28)

  • Initial release
  • BFS crawling with depth control
  • Markdown/text/HTML output formats
  • robots.txt compliance
  • Boilerplate removal (nav, footer, ads, scripts)
  • Link extraction and same-domain filtering
  • Glob pattern matching for URL inclusion/exclusion
  • Pay-per-event pricing at $0.001/page