Deep Website Crawler (DEPRECATED)
DEPRECATED: use santamaria-automations/website-content-crawler instead. It has the same crawl behavior and richer output (clean, AI/RAG-ready Markdown instead of plain text).
Pricing: Pay per event
Developer: Ale (maintained by Community)
Deep Website Crawler
Crawl any website to configurable depth and extract the title and full text content of every page. Give it a list of start URLs — it follows links level by level and returns one record per page. No API keys or login required.
How It Works
For each start URL you provide, the crawler:
- Fetches the start page
- Extracts all internal links from that page
- Follows those links to the next depth level
- Repeats until the configured depth or page limit is reached
- Returns one record per crawled page with its title, text content, and crawl depth
Challenge pages (bot-protection walls) are skipped automatically so the run keeps going. Pages that return errors are logged and skipped.
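The level-by-level behavior described above can be sketched as a breadth-first traversal. This is a minimal illustration, not the actor's actual implementation: the `SITE` dictionary is a hypothetical in-memory stand-in for real HTTP fetching, and the `crawl` function and its parameter names are assumptions made for the example.

```python
from collections import deque

# Hypothetical in-memory site: each URL maps to (title, internal links).
# A real crawler would fetch and parse pages over HTTP instead.
SITE = {
    "https://example.com": ("Home", ["https://example.com/a", "https://example.com/b"]),
    "https://example.com/a": ("A", ["https://example.com/a/1", "https://example.com"]),
    "https://example.com/b": ("B", []),
    "https://example.com/a/1": ("A1", ["https://example.com/a/1/x"]),
    "https://example.com/a/1/x": ("A1X", []),
}


def crawl(start_url, max_depth=2, max_pages=100, fetch=SITE.get):
    """Breadth-first crawl: visit pages level by level up to the given limits."""
    records = []
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue and len(records) < max_pages:
        url, depth = queue.popleft()
        page = fetch(url)
        if page is None:
            # Stand-in for an error or challenge page: skip it and keep going.
            continue
        title, links = page
        records.append({
            "url": url,
            "title": title,
            "depth": depth,
            "start_url": start_url,
            "links_found": len(links),
        })
        # Only enqueue links while the next level is still within max_depth.
        if depth < max_depth:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return records


pages = crawl("https://example.com", max_depth=2)
for page in pages:
    print(page["depth"], page["url"])
```

With `max_depth=2`, the page three links away from the start URL is never queued, which mirrors how the depth setting bounds the crawl.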
Use with AI Agents (MCP)
Connect this actor to any MCP-compatible AI client — Claude Desktop, Claude.ai, Cursor, VS Code, LangChain, LlamaIndex, or custom agents.
Apify MCP server URL:
https://mcp.apify.com?tools=santamaria-automations/deep-website-crawler
Example prompt once connected:
"Use deep-website-crawler to crawl https://example.com to depth 2 and return all page titles and text as a table."
Clients that support dynamic tool discovery (Claude.ai, VS Code) will receive the full input schema automatically via add-actor.
Input Example
{
  "startUrls": [
    "https://acme-corp.com",
    "https://www.another-company.de/blog"
  ],
  "maxDepth": 2,
  "maxPagesPerCrawl": 100,
  "maxPagesPerDomain": 50
}
Both bare domains (acme-corp.com) and full URLs (https://acme-corp.com/about) are accepted.
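If you build the start URL list programmatically, you may want to normalize entries yourself before submitting them. The sketch below is one way to do that. The `normalize_start_url` helper is hypothetical, and defaulting bare domains to `https` is an assumption, since the actor's own normalization rules are not documented here.

```python
from urllib.parse import urlsplit


def normalize_start_url(value: str) -> str:
    """Return a full URL, adding a scheme to bare domains.

    Assumption: bare domains default to https.
    """
    value = value.strip()
    if "://" not in value:
        value = "https://" + value
    parts = urlsplit(value)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"not a crawlable URL: {value!r}")
    return value


start_urls = [normalize_start_url(u) for u in ("acme-corp.com", "https://acme-corp.com/about")]
print(start_urls)
```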
Output Example
[
  {
    "url": "https://acme-corp.com",
    "title": "Acme Corp - Industrial Solutions",
    "text": "Acme Corp is a global leader in industrial solutions. Since 1950 we have...",
    "depth": 0,
    "start_url": "https://acme-corp.com",
    "links_found": 14,
    "status_code": 200,
    "content_length": 3842,
    "scraped_at": "2026-04-29T10:00:00Z"
  },
  {
    "url": "https://acme-corp.com/about",
    "title": "About Us - Acme Corp",
    "text": "Founded in 1950, Acme Corp has grown from a small family workshop into...",
    "depth": 1,
    "start_url": "https://acme-corp.com",
    "links_found": 8,
    "status_code": 200,
    "content_length": 2190,
    "scraped_at": "2026-04-29T10:00:01Z"
  }
]
Pricing
You pay a flat fee per actor start plus a fee per page returned. You are only charged for pages you actually receive.
| Event | Price | Description |
|---|---|---|
| Actor start | $0.25 | Covers container startup |
| Page result | $0.0005 | Per page crawled and returned |
Example costs:
| Pages crawled | Cost |
|---|---|
| 0 pages | $0.25 |
| 100 pages | $0.30 |
| 1,000 pages | $0.75 |
| 10,000 pages | $5.25 |
No monthly fees. No minimum spend.
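The example costs follow from the two event prices: one actor start plus the per-page fee times the number of pages returned. The sketch below reproduces that arithmetic. The `estimate_cost` function and its `runs` parameter are assumptions for illustration. The Input Parameters table lists maxPagesPerCrawl as capped at 500, so larger totals presumably need several runs, each paying the start fee; the table rows above correspond to `runs=1`.

```python
from decimal import Decimal

ACTOR_START_FEE = Decimal("0.25")    # per actor start, from the pricing table
PAGE_RESULT_FEE = Decimal("0.0005")  # per page crawled and returned


def estimate_cost(pages: int, runs: int = 1) -> Decimal:
    """Estimated charge in USD for `pages` results spread over `runs` actor starts."""
    if pages < 0 or runs < 0:
        raise ValueError("pages and runs must be non-negative")
    return runs * ACTOR_START_FEE + pages * PAGE_RESULT_FEE


for pages in (0, 100, 1_000, 10_000):
    print(f"{pages:>6} pages: ${estimate_cost(pages):.2f}")
```

Decimal arithmetic is used so that small per-page fees add up exactly rather than accumulating floating-point error.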
Input Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| startUrls | string[] | required | URLs to start crawling from |
| maxDepth | integer | 2 | Link levels deep to follow (0–5) |
| maxPagesPerCrawl | integer | 100 | Max total pages across all start URLs (1–500) |
| maxPagesPerDomain | integer | 50 | Max pages per unique domain (1–250) |
| proxyConfiguration | object | Apify proxy | Proxy settings |
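When generating input programmatically, it can help to check values against the documented ranges before starting a run. The following is a minimal sketch under that assumption. The `build_input` helper and the `LIMITS` table are hypothetical names, the ranges and defaults are copied from the table above, and proxyConfiguration is left out because its structure is not described here.

```python
# (minimum, maximum, default) per integer parameter, taken from the table above.
LIMITS = {
    "maxDepth": (0, 5, 2),
    "maxPagesPerCrawl": (1, 500, 100),
    "maxPagesPerDomain": (1, 250, 50),
}


def build_input(start_urls, **overrides):
    """Build an input object, applying defaults and rejecting out-of-range values."""
    if not start_urls:
        raise ValueError("startUrls is required")
    unknown = set(overrides) - set(LIMITS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    run_input = {"startUrls": list(start_urls)}
    for name, (low, high, default) in LIMITS.items():
        value = overrides.get(name, default)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not low <= value <= high:
            raise ValueError(f"{name} must be an integer in [{low}, {high}], got {value!r}")
        run_input[name] = value
    return run_input


print(build_input(["https://acme-corp.com"], maxDepth=1))
```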
Output Fields
| Field | Type | Description |
|---|---|---|
| url | string | Canonical URL of the crawled page |
| title | string | HTML title tag content |
| text | string | Visible plain text (truncated at 10,000 characters) |
| depth | integer | Crawl depth (0 = start URL, 1 = one link away, etc.) |
| start_url | string | The start URL that initiated this crawl path |
| links_found | integer | Internal links discovered on this page |
| status_code | integer | HTTP status code |
| content_length | integer | Characters in extracted text (before truncation) |
| scraped_at | string | ISO 8601 UTC timestamp |
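Because content_length counts characters before truncation while text is cut at 10,000 characters, comparing the two shows which pages lost content. The sketch below illustrates this on hand-written records. The `flag_truncated` helper is hypothetical, and the third record is invented for the example rather than taken from a real run.

```python
TEXT_LIMIT = 10_000  # documented truncation length of the `text` field


def flag_truncated(records):
    """Return the URLs whose extracted text was longer than the truncation limit."""
    return [r["url"] for r in records if r["content_length"] > TEXT_LIMIT]


records = [
    {"url": "https://acme-corp.com", "depth": 0, "content_length": 3842},
    {"url": "https://acme-corp.com/about", "depth": 1, "content_length": 2190},
    # Hypothetical long page, added only to show a truncated case.
    {"url": "https://acme-corp.com/catalog", "depth": 1, "content_length": 18450},
]

print(flag_truncated(records))
```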
Tips
- Depth 2 covers most websites — homepage → section pages → detail pages is typically enough for site audits and content extraction
- Use maxPagesPerCrawl for budget control — set this lower than the theoretical maximum to cap spend on large sites
- Depth 0 is just the start page — useful when you have a precise list of URLs and only need content extraction without following links
- One record per page — each unique URL gets its own row, making it easy to filter, sort, or feed into downstream processing
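For the budget-control tip, one way to pick a maxPagesPerCrawl value is to work backwards from the amount you are willing to spend on a single run. This is a rough sketch using the published event prices and the documented 1–500 range. The `max_pages_for_budget` helper is a hypothetical name, not part of the actor.

```python
from decimal import Decimal

ACTOR_START_FEE = Decimal("0.25")    # per actor start, from the pricing table
PAGE_RESULT_FEE = Decimal("0.0005")  # per page crawled and returned
MAX_PAGES_PER_CRAWL = 500            # documented upper bound of maxPagesPerCrawl


def max_pages_for_budget(budget_usd: str) -> int:
    """Largest maxPagesPerCrawl value whose worst-case cost fits one run's budget."""
    remaining = Decimal(budget_usd) - ACTOR_START_FEE
    if remaining < PAGE_RESULT_FEE:
        raise ValueError("budget does not cover the start fee plus one page")
    # Floor division gives the number of whole pages the remainder pays for.
    pages = int(remaining // PAGE_RESULT_FEE)
    return min(pages, MAX_PAGES_PER_CRAWL)


print(max_pages_for_budget("0.30"))
```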
Related Actors
- Free Email Domain Scraper — extract email addresses from any domain
- Website Contact Extractor — extract full contact records (email + phone + social + address)
- SEO Metadata Extractor — extract meta title, description, canonical, and OG tags
Issues & Feature Requests
If something is not working or you're missing a feature, please open an issue and we'll look into it.