Website Content Crawler for AI — Clean Markdown, 4x Cheaper
Pricing: Pay per usage
Crawl any website and extract clean text/Markdown for LLMs, RAG pipelines, and vector databases. BFS crawling with depth control, robots.txt support, and boilerplate removal. Built for feeding AI models. $0.001/page: 4x cheaper than the official Apify crawler.
Developer: Ken Digital
# 🕷️ Website Content Crawler for AI — Clean Markdown, 4x Cheaper
Crawl any website and extract clean, structured content as Markdown, plain text, or HTML. Built specifically for feeding AI models, LLM applications, vector databases, and RAG pipelines.
## Why This Actor?
| Feature | This Actor | Apify Web Scraper | Generic Crawlers |
|---|---|---|---|
| Price per page | $0.001 | $0.004+ | $0.005+ |
| Output format | Markdown, Text, HTML | Raw HTML | Raw HTML |
| AI-ready content | ✅ Clean, no boilerplate | ❌ Manual cleaning needed | ❌ Manual cleaning needed |
| Strips ads/nav/scripts | ✅ Automatic | ❌ No | ❌ No |
| robots.txt | ✅ Respected | ⚠️ Optional | ❌ Often ignored |
| Zero config | ✅ Just add URLs | ❌ Needs selectors | ❌ Needs setup |
4x cheaper than alternatives. Same quality output. No configuration needed.
## 🎯 Perfect For
- RAG pipelines — Feed clean documents into your retrieval system
- LLM fine-tuning — Training data without HTML noise
- Vector databases — Chunk clean markdown for embeddings (Pinecone, Weaviate, Qdrant)
- Knowledge bases — Build structured content libraries
- Content analysis — Word counts, link graphs, language detection
- AI agents — Give your agents access to any website's content
## 🚀 Quick Start

### Input

```json
{
  "startUrls": [{ "url": "https://docs.python.org/3/" }],
  "maxPages": 50,
  "maxDepth": 3,
  "outputFormat": "markdown"
}
```
### Output (per page)

```json
{
  "url": "https://docs.python.org/3/tutorial/index.html",
  "title": "The Python Tutorial",
  "content": "# The Python Tutorial\n\nPython is an easy to learn, powerful programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming...\n\n## An Informal Introduction to Python\n\nIn the following examples, input and output are distinguished by the presence or absence of prompts...\n\n- [Whetting Your Appetite](appetite.html)\n- [Using the Python Interpreter](interpreter.html)\n- [An Informal Introduction to Python](introduction.html)\n- [More Control Flow Tools](controlflow.html)",
  "wordCount": 1247,
  "language": "en",
  "links": [
    "https://docs.python.org/3/tutorial/appetite.html",
    "https://docs.python.org/3/tutorial/interpreter.html"
  ],
  "crawledAt": "2026-03-28T21:00:00.000Z",
  "statusCode": 200
}
```
## ⚙️ Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| `startUrls` | Array | *(required)* | URLs to start crawling from |
| `maxPages` | Number | `50` | Maximum pages to crawl |
| `maxDepth` | Number | `3` | How deep to follow links (`0` = start URLs only) |
| `sameDomainOnly` | Boolean | `true` | Only follow links on the same domain |
| `includeGlobs` | Array | `[]` | Only crawl URLs matching these glob patterns |
| `excludeGlobs` | Array | `[]` | Skip URLs matching these glob patterns |
| `outputFormat` | Enum | `"markdown"` | Output format: `markdown`, `text`, or `html` |
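A run that stays inside a single docs section might combine these options as follows (the URLs and glob patterns here are illustrative, not defaults):

```json
{
  "startUrls": [{ "url": "https://example.com/docs/" }],
  "maxPages": 200,
  "maxDepth": 2,
  "sameDomainOnly": true,
  "includeGlobs": ["https://example.com/docs/**"],
  "excludeGlobs": ["**/changelog/**"],
  "outputFormat": "markdown"
}
```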
## 🧹 What Gets Cleaned
The crawler automatically removes:
- ✂️ Navigation bars (`<nav>`, menu classes)
- ✂️ Headers & footers (site-wide, not content headings)
- ✂️ Scripts & styles (JavaScript, CSS)
- ✂️ Ads & tracking (common ad container patterns)
- ✂️ Cookie banners & popups
- ✂️ Social share buttons
- ✂️ Sidebars & widgets
- ✂️ Comment sections
**What's preserved:**

- ✅ Headings (H1-H6 → `#` to `######`)
- ✅ Paragraphs with proper spacing
- ✅ Lists (ordered and unordered)
- ✅ Links with URLs
- ✅ Code blocks
- ✅ Bold and italic text
- ✅ Tables
- ✅ Image alt text
- ✅ Blockquotes
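To make the approach concrete, here is a minimal sketch (not the actor's actual source) of how boilerplate stripping with the stdlib `html.parser` can work: text inside tags like `<nav>` or `<script>` is skipped, everything else is kept.

```python
from html.parser import HTMLParser

# Tags whose entire contents are treated as boilerplate (illustrative set).
BOILERPLATE_TAGS = {"nav", "script", "style", "footer", "aside"}

class BoilerplateStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a boilerplate element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a boilerplate element.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def strip_boilerplate(html: str) -> str:
    parser = BoilerplateStripper()
    parser.feed(html)
    return " ".join(parser.parts)

print(strip_boilerplate(
    "<nav>Home | About</nav><h1>Title</h1><p>Body text.</p><script>x()</script>"
))
# → Title Body text.
```

A real implementation also needs class-based heuristics (ad containers, cookie banners) and Markdown conversion, but the skip-depth pattern above is the core idea.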
## 🔗 Integration Examples

### Feed into OpenAI / LangChain

```python
from apify_client import ApifyClient
from langchain.text_splitter import RecursiveCharacterTextSplitter

client = ApifyClient("YOUR_TOKEN")
run = client.actor("your-username/website-content-crawler").call(
    run_input={"startUrls": [{"url": "https://example.com"}], "maxPages": 100}
)

splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    chunks = splitter.split_text(item["content"])
    # Feed chunks to your LLM / vector DB
```
### Load into Pinecone

```python
import pinecone
from openai import OpenAI

# After running the crawler (assumes `openai_client`, `index`, and
# `dataset` have already been initialized)...
for item in dataset.iterate_items():
    embedding = openai_client.embeddings.create(
        input=item["content"][:8000],  # stay within the embedding input limit
        model="text-embedding-3-small",
    ).data[0].embedding
    index.upsert([
        (item["url"], embedding, {"title": item["title"], "content": item["content"]})
    ])
```
## 💰 Pricing
$0.001 per page crawled — that's it.
| Pages | Cost | vs. Alternatives |
|---|---|---|
| 100 | $0.10 | Save $0.30+ |
| 1,000 | $1.00 | Save $3.00+ |
| 10,000 | $10.00 | Save $30.00+ |
| 100,000 | $100.00 | Save $300.00+ |
No monthly fees. No minimum commitment. Pay only for what you crawl.
## 🛡️ Responsible Crawling

- ✅ Respects `robots.txt` directives
- ✅ Rate-limited requests (max ~2 req/sec per domain)
- ✅ Proper User-Agent identification
- ✅ Follows redirects correctly
- ✅ Skips binary files automatically
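The ~2 req/sec cap amounts to enforcing a minimum interval between requests to the same domain. A sketch of that idea (assumed behavior, not the actor's source):

```python
import time
from urllib.parse import urlparse

MIN_INTERVAL = 0.5   # seconds between requests to one domain (~2 req/sec)
_last_request = {}   # domain -> monotonic timestamp of the last request

def throttle(url: str) -> None:
    """Sleep just long enough to keep the per-domain rate under the cap."""
    domain = urlparse(url).netloc
    now = time.monotonic()
    last = _last_request.get(domain)
    if last is not None:
        wait = MIN_INTERVAL - (now - last)
        if wait > 0:
            time.sleep(wait)
    _last_request[domain] = time.monotonic()
```

Because the timestamps are keyed by domain, crawling several sites in one run does not slow any single site's budget down further than necessary.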
## 📊 Technical Details

- Engine: httpx with HTTP/2 support
- Parser: Python stdlib `html.parser` (fast, no heavy dependencies)
- Crawl strategy: Breadth-first search (BFS) with depth control
- Deduplication: URL normalization prevents re-crawling
- Encoding: Auto-detected from Content-Type headers
- Language detection: Heuristic-based from content analysis
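URL normalization for deduplication typically means canonicalizing the pieces of a URL that can vary without changing the page. A sketch of plausible rules (lowercase host, drop fragments and default ports, trim trailing slashes); the exact rules the actor applies may differ:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    parts = urlsplit(url)
    host = parts.hostname or ""          # .hostname is already lowercased
    if parts.port and parts.port not in (80, 443):
        host = f"{host}:{parts.port}"    # keep only non-default ports
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment; it never changes the fetched document.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))

seen = set()
for url in ["https://Example.com/docs/", "https://example.com:443/docs#intro"]:
    seen.add(normalize_url(url))
# Both variants collapse to https://example.com/docs, so the page
# is fetched only once.
```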
## Changelog

### v1.0 (2026-03-28)
- Initial release
- BFS crawling with depth control
- Markdown/text/HTML output formats
- robots.txt compliance
- Boilerplate removal (nav, footer, ads, scripts)
- Link extraction and same-domain filtering
- Glob pattern matching for URL inclusion/exclusion
- Pay-per-event pricing at $0.001/page