Deep Website Crawler (DEPRECATED)

Deprecated

Pricing

Pay per event

DEPRECATED — use santamaria-automations/website-content-crawler instead. Same crawl behavior, richer output (clean AI/RAG-ready Markdown vs plain text).

Developer: Ale (maintained by Community)


Actor stats: 0 bookmarks, 2 total users, 1 monthly active user. Last modified 7 days ago.

Deep Website Crawler

Crawl any website to a configurable depth and extract the title and visible text (up to 10,000 characters per page) of every page. Give it a list of start URLs: it follows internal links level by level and returns one record per page. No API keys or login required.

How It Works

For each start URL you provide, the crawler:

  1. Fetches the start page
  2. Extracts all internal links from that page
  3. Follows those links to the next depth level
  4. Repeats until the configured depth or page limit is reached
  5. Returns one record per crawled page with its title, text content, and crawl depth

Challenge pages (bot-protection walls) are skipped automatically so the run keeps going. Pages that return errors are logged and skipped.
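
The level-by-level traversal described above can be sketched as a breadth-first search. This is a minimal illustration under stated assumptions, not the actor's actual code: the `crawl` function, the `get_links` callback, and the in-memory `SITE` graph are all hypothetical stand-ins for real page fetches.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(start_url, get_links, max_depth=2, max_pages=100):
    """Breadth-first crawl returning one record per visited page.

    get_links(url) is a hypothetical callback that returns the links
    found on a page; a real crawler would fetch and parse the page here.
    """
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    records = []
    while queue and len(records) < max_pages:
        url, depth = queue.popleft()
        records.append({"url": url, "depth": depth})
        if depth == max_depth:
            continue  # do not follow links beyond the configured depth
        for link in get_links(url):
            # Keep internal links only, and visit each URL once.
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return records


# Toy link graph standing in for real HTTP fetches (assumed data).
SITE = {
    "https://example.com/": [
        "https://example.com/a",
        "https://example.com/b",
        "https://other.org/",
    ],
    "https://example.com/a": ["https://example.com/c"],
    "https://example.com/b": [],
    "https://example.com/c": ["https://example.com/d"],
}

pages = crawl("https://example.com/", lambda url: SITE.get(url, []), max_depth=2)
for page in pages:
    print(page["depth"], page["url"])
```

With `max_depth=2` the sketch visits the start page, its two internal links, and one page at depth 2. The external link and the page that would sit at depth 3 are never queued.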

Use with AI Agents (MCP)

Connect this actor to any MCP-compatible AI client — Claude Desktop, Claude.ai, Cursor, VS Code, LangChain, LlamaIndex, or custom agents.

Apify MCP server URL:

https://mcp.apify.com?tools=santamaria-automations/deep-website-crawler

Example prompt once connected:

"Use deep-website-crawler to crawl https://example.com to depth 2 and return all page titles and text as a table."

Clients that support dynamic tool discovery (Claude.ai, VS Code) will receive the full input schema automatically via add-actor.

Input Example

{
  "startUrls": [
    "https://acme-corp.com",
    "https://www.another-company.de/blog"
  ],
  "maxDepth": 2,
  "maxPagesPerCrawl": 100,
  "maxPagesPerDomain": 50
}

Both bare domains (acme-corp.com) and full URLs (https://acme-corp.com/about) are accepted.

Output Example

[
  {
    "url": "https://acme-corp.com",
    "title": "Acme Corp - Industrial Solutions",
    "text": "Acme Corp is a global leader in industrial solutions. Since 1950 we have...",
    "depth": 0,
    "start_url": "https://acme-corp.com",
    "links_found": 14,
    "status_code": 200,
    "content_length": 3842,
    "scraped_at": "2026-04-29T10:00:00Z"
  },
  {
    "url": "https://acme-corp.com/about",
    "title": "About Us - Acme Corp",
    "text": "Founded in 1950, Acme Corp has grown from a small family workshop into...",
    "depth": 1,
    "start_url": "https://acme-corp.com",
    "links_found": 8,
    "status_code": 200,
    "content_length": 2190,
    "scraped_at": "2026-04-29T10:00:01Z"
  }
]
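
Because each page is its own record, the output is easy to post-process. As a rough sketch, the snippet below groups page titles by crawl depth using a trimmed copy of the records above. The helper name `titles_by_depth` is hypothetical, and the choice to keep only pages with status code 200 is an assumption made for illustration.

```python
import json
from collections import defaultdict

# Trimmed copy of the example output (only the fields used below).
sample = json.loads("""
[
  {"url": "https://acme-corp.com",
   "title": "Acme Corp - Industrial Solutions",
   "depth": 0, "status_code": 200},
  {"url": "https://acme-corp.com/about",
   "title": "About Us - Acme Corp",
   "depth": 1, "status_code": 200}
]
""")


def titles_by_depth(records):
    """Map each crawl depth to the titles of successfully fetched pages."""
    grouped = defaultdict(list)
    for record in records:
        if record["status_code"] == 200:
            grouped[record["depth"]].append(record["title"])
    return dict(grouped)


by_depth = titles_by_depth(sample)
for depth in sorted(by_depth):
    print(depth, by_depth[depth])
```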

Pricing

You pay a flat fee each time the actor starts, plus a per-page fee. The per-page fee is charged only for pages you actually receive.

| Event | Price | Description |
| --- | --- | --- |
| Actor start | $0.25 | Covers container startup |
| Page result | $0.0005 | Per page crawled and returned |

Example costs:

| Pages crawled | Cost |
| --- | --- |
| 0 pages | $0.25 |
| 100 pages | $0.30 |
| 1,000 pages | $0.75 |
| 10,000 pages | $5.25 |

No monthly fees. No minimum spend.
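
The example costs follow from one start fee plus the per-page fee times the number of pages returned. A small sketch of that arithmetic is shown here; the function name `estimate_cost` is hypothetical, and `Decimal` is used only to avoid floating-point rounding in the illustration.

```python
from decimal import Decimal

ACTOR_START_FEE = Decimal("0.25")    # charged once per run
PAGE_RESULT_FEE = Decimal("0.0005")  # charged per page returned


def estimate_cost(pages: int) -> Decimal:
    """Estimated cost in USD of a single run that returns `pages` pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return ACTOR_START_FEE + PAGE_RESULT_FEE * pages


for pages in (0, 100, 1_000, 10_000):
    print(f"{pages} pages: ${estimate_cost(pages):.2f}")
```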

Input Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| startUrls | string[] | required | URLs to start crawling from |
| maxDepth | integer | 2 | Link levels deep to follow (0–5) |
| maxPagesPerCrawl | integer | 100 | Max total pages across all start URLs (1–500) |
| maxPagesPerDomain | integer | 50 | Max pages per unique domain (1–250) |
| proxyConfiguration | object | Apify proxy | Proxy settings |
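
When building the input programmatically, it can help to check the documented ranges before starting a run. The sketch below is one way to do that. The `build_input` helper and the `LIMITS` table are hypothetical and simply mirror the defaults and ranges listed above; `proxyConfiguration` is left out so the actor's default applies.

```python
# (minimum, maximum, default) for each integer parameter, as documented above.
LIMITS = {
    "maxDepth": (0, 5, 2),
    "maxPagesPerCrawl": (1, 500, 100),
    "maxPagesPerDomain": (1, 250, 50),
}


def build_input(start_urls, **overrides):
    """Return an input dict, filling defaults and rejecting out-of-range values."""
    if not start_urls:
        raise ValueError("startUrls is required")
    unknown = set(overrides) - set(LIMITS)
    if unknown:
        raise ValueError(f"this sketch only handles {sorted(LIMITS)}, got {sorted(unknown)}")
    run_input = {"startUrls": list(start_urls)}
    for name, (low, high, default) in LIMITS.items():
        value = overrides.get(name, default)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{name} must be an integer")
        if not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
        run_input[name] = value
    return run_input


run_input = build_input(["https://acme-corp.com"], maxDepth=1)
print(run_input)
```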

Output Fields

| Field | Type | Description |
| --- | --- | --- |
| url | string | Canonical URL of the crawled page |
| title | string | HTML title tag content |
| text | string | Visible plain text (truncated at 10,000 characters) |
| depth | integer | Crawl depth (0 = start URL, 1 = one link away, etc.) |
| start_url | string | The start URL that initiated this crawl path |
| links_found | integer | Internal links discovered on this page |
| status_code | integer | HTTP status code |
| content_length | integer | Characters in extracted text (before truncation) |
| scraped_at | string | ISO 8601 UTC timestamp |
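
Since `content_length` is measured before truncation while `text` is cut at 10,000 characters, comparing the two should reveal which pages lost content. The following is a sketch under that reading of the field descriptions; the helper names and the second record are made up for illustration.

```python
TEXT_LIMIT = 10_000  # documented truncation point for the `text` field


def is_truncated(record):
    """True if the page had more extracted text than the record carries."""
    return record["content_length"] > TEXT_LIMIT


def missing_characters(record):
    """Number of characters dropped by truncation (0 if none)."""
    return max(0, record["content_length"] - TEXT_LIMIT)


records = [
    {"url": "https://acme-corp.com", "content_length": 3842},
    # Hypothetical long page, not taken from the example output.
    {"url": "https://acme-corp.com/whitepaper", "content_length": 25_300},
]

for record in records:
    print(record["url"], is_truncated(record), missing_characters(record))
```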

Tips

  • Depth 2 covers most websites — homepage → section pages → detail pages is typically enough for site audits and content extraction
  • Use maxPagesPerCrawl for budget control — set this lower than the theoretical maximum to cap spend on large sites
  • Depth 0 is just the start page — useful when you have a precise list of URLs and only need content extraction without following links
  • One record per page — each unique URL gets its own row, making it easy to filter, sort, or feed into downstream processing

Issues & Feature Requests

If something is not working or you're missing a feature, please open an issue and we'll look into it.