MCP Nexus Universal AI Tool Bridge
Connect AI agents to real data. MCP Nexus runs tools that fetch, extract, summarize, classify, and crawl web content with caching, multi-LLM support, HMAC webhooks, circuit breakers, and full observability in a stateless, production-ready Apify Actor.
AI-powered web data bridge with smart caching, multi-LLM support, and production-grade reliability. Extract, transform, and analyze web content at scale on Apify platform.
Quick Start
Run on Apify Platform
- Configure your input parameters
- Click "Start" to run
- View results in the Dataset tab
30-Second Tutorial
Fetch and extract data from any webpage in three simple steps:
Step 1: Select Tool
Choose fetch_web from the tool dropdown
Step 2: Configure
{"mode": "single","tool": "fetch_web","params": {"url": "https://example.com"}}
Step 3: Run
Click Start and view extracted content in the dataset
One-Line API Call
curl "https://api.apify.com/v2/acts/SRHAma9FEsmuewetK/runs?token=YOUR_TOKEN" \-X POST -H "Content-Type: application/json" \-d '{"mode":"single","tool":"fetch_web","params":{"url":"https://example.com"}}'
Legal & Compliance Note
This actor respects robots.txt by default. Always review target site Terms of Service. Use proxies and rendering responsibly. You are responsible for compliance (GDPR/PII/ToS) in your jurisdiction.
What MCP Nexus Can Do
MCP Nexus provides 9 specialized tools for web data operations:
- fetch_web - Fetch and extract content from web pages
- extract - Extract specific data using CSS, XPath, or regex selectors
- summarize - Generate AI summaries of text content
- classify - Classify text into predefined categories using AI
- transform - Transform JSON data with mapping operations
- crawl_lite - Crawl multiple pages with depth and link following
- extract_structured - Extract structured data using AI and JSON schemas
- search_web - Parse sitemaps and RSS feeds for URL discovery
- diff_text - Compare two texts and calculate semantic differences
Table of Contents
- Chapter 1: Core Concepts
- Chapter 2: Getting Started
- Chapter 3: Tools Reference
- Chapter 4: Execution Modes
- Chapter 5: AI/LLM Integration
- Chapter 6: Performance & Optimization
- Chapter 7: Security & Compliance
- Chapter 8: Production Deployment
- Chapter 9: Development Guide
- Chapter 10: API & Integration
- Appendix A: Input Schema Reference
- Appendix B: Output Schema Reference
- Appendix C: Error Codes
- Appendix D: Troubleshooting
- Appendix E: FAQ
- Appendix F: Changelog
Chapter 1: Core Concepts
What is MCP Nexus
MCP Nexus is a universal AI tool bridge that connects AI agents, workflows, and applications to real-world web data. It provides a production-ready actor on the Apify platform that orchestrates nine specialized tools for web scraping, data extraction, AI-powered analysis, and content transformation.
Key Characteristics:
- Stateless: Each run is independent with no persistent state
- Observable: Full metrics and logging for debugging and monitoring
- Resilient: Built-in circuit breakers and retry logic
- Scalable: Runs on Apify's cloud infrastructure
- Compliant: Respects robots.txt and implements security best practices
Architecture Overview
MCP Nexus Actor
├─ Input Validation (Zod)
│  ├─ Single Mode / Batch Mode / DAG Mode
│  └─ Budget Tracking & Quota Management
├─ Tool Router
│  ├─ fetch_web, extract, summarize, classify, transform
│  └─ crawl_lite, extract_structured, search_web, diff_text
├─ Infrastructure Layer
│  ├─ HTTP Client (caching, ETags, Last-Modified)
│  ├─ Circuit Breakers (per-domain failure detection)
│  ├─ Deduplication (URL/content/hybrid fingerprinting)
│  ├─ LLM Client (OpenAI, Anthropic, Azure)
│  ├─ Browser (Playwright minimal/full rendering)
│  └─ Proxy Manager (Apify Proxy, custom rotation)
└─ Output & Storage
   ├─ Dataset (structured run reports)
   ├─ Key-Value Store (HTML, screenshots, text)
   └─ Webhook Delivery (HMAC-signed notifications)
How It Works
- Input Processing: Validates JSON input against schema, applies defaults
- Tool Selection: Routes to appropriate tool handler based on mode
- Execution: Runs tool with context (config, tracking, storage)
- Metric Collection: Records bytes, tokens, retries, cache hits
- Result Assembly: Builds structured report with metadata
- Output: Pushes to dataset, sends webhook if configured
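A rough sketch of this flow, with a hypothetical runTool dispatcher standing in for the real tool router (the actual orchestrator in src/main.ts also handles batch/DAG modes, budgets, and webhooks):

import { Actor } from 'apify';

// Hypothetical stand-in for the tool router described above.
const runTool = async (tool: string, params: Record<string, unknown>) => ({ tool, params });

await Actor.init();
const input: any = await Actor.getInput();                 // 1. input processing (validation omitted)
const started = Date.now();
const report: any = { ok: true, errors: [], usage: {} };
try {
  report.result = await runTool(input.tool, input.params); // 2-3. tool selection and execution
} catch (err) {
  report.ok = false;
  report.errors.push(String(err));                         // errors surface in the run report
}
report.usage.durationMs = Date.now() - started;            // 4. metric collection
await Actor.pushData(report);                              // 5-6. assemble report, push to dataset
await Actor.exit();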
Key Features
Performance:
- HTTP caching with ETag/Last-Modified support
- Request deduplication (URL, content, hybrid)
- Per-domain circuit breakers
- Browser rendering (none/minimal/full)
- Proxy rotation
AI/LLM:
- Multi-provider support (OpenAI, Anthropic, Azure)
- Cost tracking per request
- Token usage monitoring
- Structured JSON extraction
Observability:
- Per-tool execution metrics
- Cache hit/miss ratios
- Circuit breaker trip counts
- Correlation IDs for request tracking
- Detailed error messages
Security:
- HMAC webhook signatures
- Robots.txt enforcement
- Allow/deny list URL filtering
- Log redaction for PII
- Secret management via Apify
Chapter 2: Getting Started
Installation
Option 1: Use on Apify Console (Recommended)
- Open Actor
- Click "Try for free"
- Configure input via UI
- Click "Start"
Option 2: Deploy to Your Apify Account
- Visit the Actor page
- Click "Schedule" or "API" to integrate
- Use Apify API or SDK to run programmatically
Authentication
Apify API Token:
Get your token from Apify Console → Settings → Integrations
LLM API Keys:
Store as Apify secrets:
- Go to Apify Console → Settings → Secrets
- Add a secret: OPENAI_API_KEY=sk-...
- Reference it in the input: "apiKeySecret": "OPENAI_API_KEY"
Or set as environment variables:
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
Your First Run
Example 1: Fetch a Web Page
{"mode": "single","tool": "fetch_web","params": {"url": "https://example.com","stripBoilerplate": true}}
Example 2: Summarize Text
{"mode": "single","tool": "summarize","params": {"text": "Long article text here...","language": "en","style": "concise"},"llm": {"provider": "openai","model": "gpt-4o-mini","apiKeySecret": "OPENAI_API_KEY"}}
Example 3: Extract Data
{"mode": "single","tool": "extract","params": {"source": "url","input": "https://news.ycombinator.com","selectors": [{ "name": "titles", "css": ".titleline > a" }]}}
Understanding Results
All runs produce a structured RunReport:
{"correlationId": "abc-123","schemaVersion": 1,"ok": true,"mode": "single","toolsExecuted": 1,"usage": {"durationMs": 1234,"httpBytes": 45678,"llmTokens": 150,"retries": 0,"cacheHits": 0,"cacheMisses": 1,"circuitBreakerTrips": 0},"costEstimateUSD": 0.0002,"warnings": [],"errors": [],"timestamp": "2025-01-07T12:34:56.789Z","result": {"status": 200,"url": "https://example.com","contentText": "Extracted content here...","htmlSnippet": "<html>...","links": []}}
Key Fields:
- ok: Overall success indicator
- usage: Resource consumption metrics
- costEstimateUSD: Estimated LLM costs
- result: Tool output (single mode)
- results: Array of outputs (batch mode)
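For example, a caller using the apify-client package can branch on these fields after a run (a sketch; the actor ID and field names follow the examples in this guide):

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });
const run = await client.actor('USERNAME/mcp-nexus').call({
  mode: 'single',
  tool: 'fetch_web',
  params: { url: 'https://example.com' },
});

// Each run pushes one RunReport item to its default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
const report: any = items[0];

if (!report.ok) {
  console.error('Run failed:', report.errors);
} else {
  console.log('Cost estimate (USD):', report.costEstimateUSD);
  console.log('Fetched text:', report.result.contentText);
}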
Recommended Default Configuration
For optimal performance and cost savings, use these defaults:
{"cache": {"enabled": true,"ttlSec": 3600},"dedupe": {"enabled": true,"strategy": "url","ttlSec": 86400},"budgets": {"maxDurationSec": 60,"maxTotalBytes": 5242880,"maxTotalTokens": 20000},"security": {"redactLogs": true}}
Why these defaults:
- Caching (1 hour) provides immediate ROI by avoiding duplicate fetches
- URL deduplication (24 hours) prevents processing same pages multiple times
- Budget limits prevent runaway costs
- Log redaction protects sensitive data
Conversion-Optimized Examples
Example 1: Batch Mix (fetch + extract + summarize)
{"mode": "batch","concurrency": 2,"dag": true,"calls": [{"callId": "fetch","tool": "fetch_web","params": {"url": "https://example.com/article"}},{"callId": "extract","tool": "extract","params": {"source": "text","input": {"ref": "fetch.result.contentText"},"selectors": [{"name": "title", "regex": "^#\\s+(.+)$"}]},"dependsOn": ["fetch"]},{"callId": "summarize","tool": "summarize","params": {"text": {"ref": "fetch.result.contentText"}},"dependsOn": ["fetch"]}],"llm": {"provider": "openai","model": "gpt-4o-mini"}}
Example 2: Structured Extract with Schema
{"mode": "single","tool": "extract_structured","params": {"source": "url","input": "https://example.com/pricing","jsonSchema": {"type": "object","properties": {"plans": {"type": "array","items": {"type": "object","properties": {"name": {"type": "string"},"price": {"type": "number"}}}}}}},"llm": {"provider": "openai","model": "gpt-4o-mini"}}
Example 3: Crawl with Storage
{"mode": "single","tool": "crawl_lite","params": {"startUrl": "https://example.com","maxPages": 10,"maxDepth": 2},"store": {"html": true,"text": true}}
Chapter 3: Tools Reference
fetch_web
Purpose: Download and parse web pages with smart content extraction
When to Use:
- Fetching article content
- Downloading HTML for later processing
- Extracting clean text from pages
Parameters:
{
  url: string
  stripBoilerplate?: boolean
  headers?: Record<string, string>
  timeoutMs?: number
  maxBytes?: number
  respectRobotsTxt?: boolean
}
Complete Example:
{"mode": "single","tool": "fetch_web","params": {"url": "https://blog.example.com/article","stripBoilerplate": true},"cache": {"enabled": true,"ttlSec": 3600}}
Output:
{"status": 200,"url": "https://blog.example.com/article","contentText": "Clean article text...","htmlSnippet": "<html>...","links": [{ "href": "/about", "text": "About Us" }],"meta": {"finalUrl": "https://blog.example.com/article","contentType": "text/html","bytes": 25678,"language": "en","rendered": false}}
Advanced Usage:
Enable browser rendering for JavaScript-heavy sites:
{"mode": "single","tool": "fetch_web","params": {"url": "https://spa-example.com"},"render": "minimal"}
Store artifacts:
{"mode": "single","tool": "fetch_web","params": {"url": "https://example.com"},"store": {"html": true,"text": true,"screenshot": true}}
extract
Purpose: Parse and extract data from HTML/text using selectors and patterns
When to Use:
- Scraping structured data from web pages
- Extracting specific fields
- Pattern matching with regex
Parameters:
{
  source: 'url' | 'html' | 'text'
  input: string
  selectors?: Array<{
    name: string
    css?: string
    xpath?: string
    regex?: string
  }>
  patterns?: Array<{
    name: string
    regex: string
    group?: number
  }>
}
Complete Example:
{"mode": "single","tool": "extract","params": {"source": "url","input": "https://news.ycombinator.com","selectors": [{"name": "titles","css": ".titleline > a"},{"name": "scores","css": ".score"}],"patterns": [{"name": "points","regex": "(\\d+) points?","group": 1}]}}
Output:
{"fields": {"titles": ["Show HN: My New Project","Ask HN: How do you...","Tell HN: Something..."],"scores": ["123 points", "45 points", "67 points"]},"matches": {"points": ["123", "45", "67"]}}
Advanced Usage:
Extract from HTML string:
{"mode": "single","tool": "extract","params": {"source": "html","input": "<article><h1>Title</h1><p>Body</p></article>","selectors": [{ "name": "headline", "css": "h1" },{ "name": "body", "css": "p" }]}}
Use XPath for complex queries:
{"mode": "single","tool": "extract","params": {"source": "url","input": "https://example.com","selectors": [{"name": "metadata","xpath": "//meta[@property='og:title']/@content"}]}}
summarize
Purpose: AI-powered text summarization with language and style control
When to Use:
- Condensing long articles
- Creating executive summaries
- Generating TL;DR versions
Parameters:
{
  text: string
  language?: string
  style?: string
  maxTokens?: number
  model?: string
  apiKeySecret?: string
}
Complete Example:
{"mode": "single","tool": "summarize","params": {"text": "Long article about climate change spanning multiple paragraphs...","language": "en","style": "concise","maxTokens": 200},"llm": {"provider": "openai","model": "gpt-4o-mini","apiKeySecret": "OPENAI_API_KEY"}}
Output:
{"summary": "Climate change is accelerating due to human activities. Key impacts include rising temperatures, extreme weather, and ecosystem disruption. Immediate action is needed.","tokens": 150}
Advanced Usage:
Multi-language summarization:
{"mode": "single","tool": "summarize","params": {"text": "Article en français...","language": "fr","style": "detailed"},"llm": {"provider": "anthropic","model": "claude-3-5-sonnet-20241022"}}
Bullet-point summaries:
{"mode": "single","tool": "summarize","params": {"text": "Long technical document...","style": "bullet"}}
classify
Purpose: Categorize text into predefined labels using AI
When to Use:
- Support ticket routing
- Content moderation
- Sentiment analysis
- Topic classification
Parameters:
{
  text: string
  labels: string[]
  maxTokens?: number
  model?: string
  apiKeySecret?: string
}
Complete Example:
{"mode": "single","tool": "classify","params": {"text": "My account was charged twice for the same purchase. How do I get a refund?","labels": ["billing", "technical", "account", "general"]},"llm": {"provider": "openai","model": "gpt-4o-mini","apiKeySecret": "OPENAI_API_KEY"}}
Output:
{"label": "billing","confidence": 0.95,"tokens": 50}
Advanced Usage:
Sentiment classification:
{"mode": "single","tool": "classify","params": {"text": "This product exceeded my expectations!","labels": ["positive", "neutral", "negative"]}}
transform
Purpose: Transform and reshape JSON data with mapping rules
When to Use:
- Data normalization
- API response transformation
- Field mapping and renaming
Parameters:
{
  inputJson: any
  mapping: Array<{
    from?: string
    to: string
    op?: string
    value?: any
  }>
}
Complete Example:
{"mode": "single","tool": "transform","params": {"inputJson": {"user": {"firstName": "John","lastName": "Doe","tags": ["vip", "beta"],"created": "2025-01-07"}},"mapping": [{"from": "user.firstName","to": "customer.name"},{"from": "user.tags","to": "customer.segments","op": "join","value": ","},{"from": "user.created","to": "customer.joinDate","op": "dateParse"}]}}
Output:
{"customer": {"name": "John","segments": "vip,beta","joinDate": "2025-01-07T00:00:00.000Z"}}
Available Operations:
- copy: Copy value as-is (default)
- const: Set constant value
- join: Join array elements with delimiter
- split: Split string into array
- pick: Extract nested value by path
- concat: Concatenate values
- replace: Replace text patterns
- dateParse: Parse date strings
- numberParse: Parse numeric values
- lookup: Map values using dictionary
- pickByPath: Extract by dot notation path
crawl_lite
Purpose: Lightweight web crawler with configurable depth and pagination
When to Use:
- Crawling small to medium sites
- Following pagination
- Discovering internal links
Parameters:
{
  startUrl: string
  maxPages?: number
  maxDepth?: number
  sameOriginOnly?: boolean
  delayMs?: number
}
Complete Example:
{"mode": "single","tool": "crawl_lite","params": {"startUrl": "https://blog.example.com","maxPages": 10,"maxDepth": 2,"sameOriginOnly": true,"delayMs": 500},"dedupe": {"enabled": true,"strategy": "url"}}
Output:
{"pages": [{"url": "https://blog.example.com","status": 200,"bytes": 12345,"linksCount": 15,"cached": false},{"url": "https://blog.example.com/about","status": 200,"bytes": 8900,"linksCount": 5,"cached": false}]}
Advanced Usage:
Store crawled HTML:
{"mode": "single","tool": "crawl_lite","params": {"startUrl": "https://example.com","maxPages": 20},"store": {"html": true}}
extract_structured
Purpose: Extract data matching JSON schemas using AI
When to Use:
- Extracting complex structured data
- Schema-driven extraction
- Semi-structured content parsing
Parameters:
{
  source: 'text' | 'html' | 'url'
  input: string
  jsonSchema: object
  llm?: {
    provider?: string
    model?: string
    apiKeySecret?: string
    maxTokens?: number
  }
}
Complete Example:
{"mode": "single","tool": "extract_structured","params": {"source": "text","input": "John Doe works as a Senior Engineer at Acme Corp. His email is john@acme.com and phone is +1-555-0123. He joined in January 2020.","jsonSchema": {"type": "object","properties": {"name": { "type": "string" },"position": { "type": "string" },"company": { "type": "string" },"email": { "type": "string" },"phone": { "type": "string" },"joinDate": { "type": "string" }}}},"llm": {"provider": "openai","model": "gpt-4o","apiKeySecret": "OPENAI_API_KEY"}}
Output:
{"data": {"name": "John Doe","position": "Senior Engineer","company": "Acme Corp","email": "john@acme.com","phone": "+1-555-0123","joinDate": "January 2020"},"confidence": 0.9,"tokens": 320}
Advanced Usage:
Extract arrays:
{"mode": "single","tool": "extract_structured","params": {"source": "text","input": "We offer three plans: Basic ($9/mo), Pro ($29/mo), Enterprise ($99/mo)","jsonSchema": {"type": "object","properties": {"plans": {"type": "array","items": {"type": "object","properties": {"name": { "type": "string" },"price": { "type": "number" }}}}}}}}
search_web
Purpose: Find URLs via sitemaps, RSS feeds, or search APIs
When to Use:
- Discovering content URLs
- Sitemap parsing
- RSS feed aggregation
Parameters:
{
  query?: string
  sitemapUrl?: string
  rssUrl?: string
  maxResults?: number
}
Complete Example:
{"mode": "single","tool": "search_web","params": {"sitemapUrl": "https://example.com/sitemap.xml","maxResults": 50}}
Output:
{"urls": ["https://example.com/page1","https://example.com/page2","https://example.com/page3"],"count": 3,"source": "sitemap"}
Advanced Usage:
Parse RSS feeds:
{"mode": "single","tool": "search_web","params": {"rssUrl": "https://blog.example.com/feed","maxResults": 20}}
diff_text
Purpose: Compare text with semantic or character-level differences
When to Use:
- Content change detection
- Version comparison
- Update monitoring
Parameters:
{
  text1: string
  text2: string
  semantic?: boolean
}
Complete Example:
{"mode": "single","tool": "diff_text","params": {"text1": "The quick brown fox jumps.","text2": "The quick red fox leaps.","semantic": true}}
Output:
{"additions": ["red", "leaps"],"deletions": ["brown", "jumps"],"changeScore": 0.286}
Advanced Usage:
Character-level diff:
{"mode": "single","tool": "diff_text","params": {"text1": "hello","text2": "helo","semantic": false}}
Chapter 4: Execution Modes
Single Mode
Execute one tool at a time.
Example:
{"mode": "single","tool": "fetch_web","params": {"url": "https://example.com"}}
When to Use:
- Simple one-off operations
- Testing tools
- API integrations
Batch Mode
Execute multiple tools in parallel with configurable concurrency.
Example:
{"mode": "batch","concurrency": 3,"calls": [{"tool": "fetch_web","params": { "url": "https://example.com/page1" }},{"tool": "fetch_web","params": { "url": "https://example.com/page2" }},{"tool": "summarize","params": { "text": "Long text..." }}]}
When to Use:
- Processing multiple URLs
- Parallel data operations
- Bulk transformations
Output:
{"results": [{"tool": "fetch_web","ok": true,"output": { "status": 200, "contentText": "..." }},{"tool": "fetch_web","ok": true,"output": { "status": 200, "contentText": "..." }},{"tool": "summarize","ok": true,"output": { "summary": "...", "tokens": 150 }}]}
DAG Dependencies
Execute tools with dependencies using Directed Acyclic Graph resolution.
Example:
{"mode": "batch","dag": true,"calls": [{"callId": "fetch","tool": "fetch_web","params": { "url": "https://example.com" }},{"callId": "extract","tool": "extract","params": {"source": "html","input": { "ref": "fetch.htmlSnippet" },"selectors": [{ "name": "title", "css": "h1" }]},"dependsOn": ["fetch"]},{"callId": "summarize","tool": "summarize","params": {"text": { "ref": "fetch.contentText" }},"dependsOn": ["fetch"]}]}
When to Use:
- Multi-step workflows
- Chained transformations
- Complex data pipelines
Reference Syntax:
{ "ref": "callId" }- Reference entire result{ "ref": "callId.path.to.field" }- Reference nested field{ "ref": "callId.array.0" }- Reference array element
Performance Tips
Optimize Concurrency:
- HTTP-only: 5-10 concurrent
- With proxies: 2-5 concurrent
- Browser rendering: 1-2 concurrent
Use Caching:
{"cache": {"enabled": true,"ttlSec": 3600}}
Enable Deduplication:
{"dedupe": {"enabled": true,"strategy": "url"}}
Set Budgets:
{"budgets": {"maxDurationSec": 300,"maxTotalBytes": 52428800,"maxTotalTokens": 100000}}
Chapter 5: AI/LLM Integration
Supported Providers
OpenAI:
- Models: gpt-4o, gpt-4o-mini, gpt-4, gpt-3.5-turbo
- Best for: General purpose, structured extraction
- Cost: Approximately $0.15-$10 per 1M tokens (subject to change)
Anthropic (Claude):
- Models: claude-3-5-sonnet-20241022, claude-3-haiku-20240307
- Best for: Long-form content, complex reasoning
- Cost: Approximately $0.25-$15 per 1M tokens (subject to change)
Azure OpenAI:
- Models: Same as OpenAI, deployed to Azure
- Best for: Enterprise compliance, regional requirements
- Cost: Similar to OpenAI, billed through Azure (subject to change)
Model Selection
Configuration:
{"llm": {"provider": "openai","model": "gpt-4o-mini","apiKeySecret": "OPENAI_API_KEY","maxTokens": 4000}}
Choosing Models:
| Task | Recommended Model | Reason |
|---|---|---|
| Summarization | gpt-4o-mini | Fast, cheap, accurate |
| Classification | gpt-4o-mini | Low latency, cost-effective |
| Structured extraction | gpt-4o | Better schema adherence |
| Complex reasoning | claude-3-5-sonnet | Superior reasoning |
| Bulk operations | gpt-4o-mini | Cost optimization |
Cost Optimization
1. Use Cheaper Models:
{"llm": {"provider": "openai","model": "gpt-4o-mini"}}
2. Limit Token Usage:
{"llm": {"maxTokens": 500},"budgets": {"maxTotalTokens": 50000}}
3. Cache Results:
{"cache": {"enabled": true,"ttlSec": 86400}}
4. Monitor Costs:
Check costEstimateUSD in run reports:
{"costEstimateUSD": 0.0045,"usage": {"llmTokens": 3000,"llmCosts": {"openai": 0.0045,"anthropic": 0.0000,"azure": 0.0000,"total": 0.0045}}}
Automatic Cost Tracking
MCP Nexus automatically tracks LLM costs per provider with detailed breakdowns.
How It Works:
- Costs are calculated automatically for each LLM call
- Per-provider breakdown is maintained (OpenAI, Anthropic, Azure)
- Costs are displayed in logs during execution
- Final cost summary included in run report
Cost Tracking in Logs:
During execution, you'll see cost information for each LLM call:
[INFO] LLM cost: $0.0012 (openai, gpt-4o-mini, 450 tokens)
[INFO] LLM cost: $0.0035 (anthropic, claude-3-5-sonnet-20241022, 890 tokens)
At the end of the run, a summary is displayed:
[INFO] LLM Costs: OpenAI $0.0024, Anthropic $0.0035, Azure $0.0000, Total $0.0059
Cost Breakdown in Output:
The usage.llmCosts field provides a detailed breakdown:
{"usage": {"llmTokens": 1340,"llmCosts": {"openai": 0.0024,"anthropic": 0.0035,"azure": 0.0000,"total": 0.0059}},"costEstimateUSD": 0.0059}
Per-Tool Cost Tracking:
Costs are tracked individually for each tool that uses LLM:
- summarize: Full cost per summary generated
- classify: Cost per classification
- extract_structured: Cost per extraction
Multi-Provider Support:
If you use multiple LLM providers in a single run (e.g., OpenAI for classification and Anthropic for summarization), costs are tracked separately:
{"mode": "batch","calls": [{"tool": "classify","params": {"text": "...", "labels": ["..."]},"llm": {"provider": "openai", "model": "gpt-4o-mini"}},{"tool": "summarize","params": {"text": "..."},"llm": {"provider": "anthropic", "model": "claude-3-5-sonnet-20241022"}}]}
Result:
{"usage": {"llmCosts": {"openai": 0.0008,"anthropic": 0.0042,"total": 0.0050}}}
Benefits:
- Transparency: Know exactly what each LLM call costs
- Optimization: Identify expensive operations and optimize
- Budgeting: Track costs against allocated budgets
- Multi-Provider: Compare costs across different providers
Token Management
Token Limits by Model:
| Model | Input Limit | Output Limit |
|---|---|---|
| gpt-4o | 128K | 16K |
| gpt-4o-mini | 128K | 16K |
| claude-3-5-sonnet | 200K | 8K |
| claude-3-haiku | 200K | 4K |
Tracking Usage:
Every LLM tool returns token count:
{"summary": "...","tokens": 450}
Total tokens tracked in usage:
{"usage": {"llmTokens": 1250}}
Structured Extraction Details
Use extract_structured for complex data extraction:
{"mode": "single","tool": "extract_structured","params": {"source": "text","input": "Product: iPhone 15 Pro\nPrice: $999\nColor: Blue","jsonSchema": {"type": "object","properties": {"product": { "type": "string" },"price": { "type": "number" },"color": { "type": "string" }},"required": ["product", "price"]}},"llm": {"provider": "openai","model": "gpt-4o"}}
Tips:
- Use detailed schemas with descriptions
- Prefer gpt-4o over gpt-4o-mini for complex schemas
- Validate extracted data in your application
Chapter 6: Performance & Optimization
HTTP Caching
How It Works:
MCP Nexus implements intelligent HTTP caching with:
- ETag header support
- Last-Modified header support
- Configurable TTL
- Per-URL cache entries
Configuration:
{"cache": {"enabled": true,"ttlSec": 3600}}
Cache Metrics:
Monitor effectiveness:
{"usage": {"cacheHits": 15,"cacheMisses": 3}}
Aim for >70% hit rate for repeated workloads.
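Hit rate can be computed straight from these counters, for example:

// Cache hit rate from a run report's usage block.
const usage = { cacheHits: 15, cacheMisses: 3 };
const hitRate = usage.cacheHits / (usage.cacheHits + usage.cacheMisses);
console.log(`Cache hit rate: ${(hitRate * 100).toFixed(1)}%`); // "Cache hit rate: 83.3%"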
TTL Guidelines:
| Content Type | Recommended TTL |
|---|---|
| Static content | 86400 (24h) |
| News/blogs | 3600 (1h) |
| Product prices | 300 (5min) |
| Stock data | 60 (1min) |
| User content | 0 (disabled) |
Request Deduplication
Strategies:
- URL-based: Same URL = duplicate
- Content-based: Same content hash = duplicate
- Hybrid: URL + content hash
Configuration:
{"dedupe": {"enabled": true,"strategy": "hybrid","ttlSec": 86400}}
When to Use:
- Crawling workflows
- Batch processing
- RSS/sitemap parsing
When Not to Use:
- Real-time data fetching
- Dynamic content
Example:
{"mode": "single","tool": "crawl_lite","params": {"startUrl": "https://example.com","maxPages": 100},"dedupe": {"enabled": true,"strategy": "url"}}
Circuit Breakers
Purpose: Prevent cascading failures by detecting and isolating failing services.
How It Works:
- Track failures per domain
- Open circuit after N failures
- Half-open after cooldown period
- Close after successful requests
Default Behavior:
- Failure threshold: 3 failures
- Cooldown: 60-120 seconds (randomized)
- Success threshold: 2 successes to close
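The state machine roughly corresponds to the sketch below (illustrative only; the actor's circuitBreaker.ts is the authoritative implementation):

// Illustrative per-domain circuit breaker using the defaults listed above.
type BreakerState = { failures: number; successes: number; openedAt?: number };
const breakers = new Map<string, BreakerState>();

const canRequest = (domain: string): boolean => {
  const b = breakers.get(domain) ?? { failures: 0, successes: 0 };
  if (b.openedAt === undefined) return true;                  // closed: allow requests
  const cooldownMs = 60_000 + Math.random() * 60_000;         // 60-120 s, randomized
  return Date.now() - b.openedAt > cooldownMs;                // half-open after cooldown
};

const record = (domain: string, ok: boolean) => {
  const b = breakers.get(domain) ?? { failures: 0, successes: 0 };
  if (ok) {
    b.successes += 1;
    if (b.successes >= 2) { b.failures = 0; b.openedAt = undefined; }  // close
  } else {
    b.failures += 1;
    b.successes = 0;
    if (b.failures >= 3) b.openedAt = Date.now();                      // open
  }
  breakers.set(domain, b);
};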
Monitoring:
{"usage": {"circuitBreakerTrips": 2}}
High trip counts indicate:
- Target site issues
- Rate limiting
- Network problems
- Need for tuning
Best Practices:
- Monitor trip counts
- Investigate domains with frequent trips
- Adjust delays between requests
- Use proxies for problematic domains
Proxy Configuration
When to Use Proxies:
- Scraping rate-limited sites
- Avoiding IP blocks
- Geographic targeting
- High-volume scraping
Apify Proxy (Recommended):
{"proxy": {"useApifyProxy": true}}
Benefits:
- Residential and datacenter IPs
- Automatic rotation
- Geographic targeting
- Built-in retry logic
Cost: Approximately $0.50 per GB (subject to change)
Custom Proxies:
{"proxy": {"proxyUrls": ["http://user:pass@proxy1.example.com:8000","http://user:pass@proxy2.example.com:8000"]}}
User-Agent Rotation:
Automatic rotation through realistic browser User-Agents. No configuration needed.
Browser Rendering
Modes:
None (Default):
- HTTP-only fetching
- Fastest (100-500ms per page)
- No JavaScript execution
- Use for static content
Minimal:
{"render": "minimal"}
- Launches headless browser
- Waits 2-3 seconds for JS
- No screenshots
- Use for light JavaScript sites
Full:
{"render": "full"}
- Full browser rendering
- Waits for network idle
- Captures screenshots
- Use for complex SPAs
Performance Impact:
| Mode | Speed | Memory | CPU | Cost |
|---|---|---|---|---|
| None | 1x | 50MB | 1x | 1x |
| Minimal | 20x slower | 300MB | 5x | 5x |
| Full | 40x slower | 500MB | 10x | 10x |
When to Use:
- None: Static HTML, APIs, RSS feeds
- Minimal: E-commerce, news sites with JS
- Full: SPAs, React/Vue apps, complex UIs
Chapter 7: Security & Compliance
HMAC Webhook Verification
Overview:
All webhooks include HMAC-SHA256 signatures for verification.
Signature Format:
X-Signature: sha256=<hex-encoded-hmac>
X-Timestamp: <ISO-8601-timestamp>
X-Request-Id: <UUID-v4>
HMAC computed over: timestamp + "." + body
Node.js Verification:
const crypto = require('crypto');

function verifyWebhook(body, timestamp, signature, secret) {
  const payload = `${timestamp}.${JSON.stringify(body)}`;
  const expectedSignature = crypto
    .createHmac('sha256', secret)
    .update(payload)
    .digest('hex');
  const expected = Buffer.from(`sha256=${expectedSignature}`, 'utf8');
  const actual = Buffer.from(signature, 'utf8');
  if (expected.length !== actual.length) {
    return false;
  }
  return crypto.timingSafeEqual(expected, actual);
}

app.post('/webhook', (req, res) => {
  const secret = process.env.WEBHOOK_SECRET;
  const signature = req.headers['x-signature'];
  const timestamp = req.headers['x-timestamp'];
  if (!verifyWebhook(req.body, timestamp, signature, secret)) {
    return res.status(401).send('Invalid signature');
  }
  console.log('Webhook verified:', req.body);
  res.status(200).send('OK');
});
Python Verification:
import hmac
import hashlib

def verify_webhook(signature, timestamp, body, secret):
    expected = 'sha256=' + hmac.new(
        secret.encode('utf-8'),
        f'{timestamp}.{body}'.encode('utf-8'),
        hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(signature, expected)

@app.route('/webhook', methods=['POST'])
def webhook():
    signature = request.headers.get('X-Signature')
    timestamp = request.headers.get('X-Timestamp')
    body = request.get_data(as_text=True)
    secret = os.environ['WEBHOOK_SECRET']
    if not verify_webhook(signature, timestamp, body, secret):
        return 'Invalid signature', 401
    data = request.json
    print('Webhook verified:', data)
    return 'OK', 200
Replay Attack Prevention:
- Check timestamp (reject >5 minutes old)
- Store and check idempotency keys
- Use HTTPS only
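A simple guard for the first two checks, layered in front of the signature verification above (a sketch with an in-memory store; use a persistent store in production):

const seenRequestIds = new Set();                    // replace with a persistent store in production

const isFresh = (timestamp) =>
  Math.abs(Date.now() - new Date(timestamp).getTime()) < 5 * 60 * 1000;

app.post('/webhook', (req, res) => {
  const timestamp = req.headers['x-timestamp'];
  const requestId = req.headers['x-request-id'];
  if (!isFresh(timestamp)) return res.status(401).send('Stale timestamp');
  if (seenRequestIds.has(requestId)) return res.status(200).send('Duplicate ignored');
  seenRequestIds.add(requestId);
  // ...then verify the HMAC signature as shown above and process req.body
  res.status(200).send('OK');
});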
Robots.txt Respect
Default Behavior:
Respects robots.txt for all fetch_web and crawl_lite operations.
Features:
- Wildcard pattern support
- Crawl-delay extraction
- User-agent: * rules
Override Per Domain:
{"security": {"ignoreRobotsFor": ["example.com", "api.example.com"]}}
Legal Considerations:
- Respecting robots.txt is a best practice
- Check Terms of Service of target sites
- Public data ≠ permission to scrape at scale
- Some countries have specific web scraping laws
Domain Allow/Deny Lists
Allowlist (Whitelist):
Only process URLs matching patterns:
{"security": {"allowlist": ["^https://example\\.com/.*","^https://api\\.mysite\\.com/.*"]}}
Denylist (Blacklist):
Block specific patterns:
{"security": {"denylist": ["^https://example\\.com/admin/.*","^https://.*\\.gov/.*","^https://.*\\.mil/.*"]}}
SSRF Protection:
Block internal networks:
{"security": {"denylist": ["^https?://127\\.0\\.0\\.1/.*","^https?://localhost/.*","^https?://169\\.254\\..*","^https?://10\\..*","^https?://172\\.(1[6-9]|2[0-9]|3[0-1])\\..*","^https?://192\\.168\\..*"]}}
PII Redaction
Enable Log Redaction:
{"security": {"redactLogs": true}}
What Gets Redacted:
- Tool results in console logs
- result field in single mode
- results array in batch mode
What's NOT Redacted:
- Metadata (timing, tokens, errors)
- Dataset outputs
- Webhook payloads
- Key-value store artifacts
Secret Management
Using Apify Secrets:
- Go to Apify Console → Settings → Secrets
- Add a secret (e.g., OPENAI_API_KEY)
- Reference it in input:
{"llm": {"apiKeySecret": "OPENAI_API_KEY"}}
Environment Variables:
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export WEBHOOK_SECRET=your-secret
Best Practices:
- Never commit secrets to repositories
- Use different secrets for dev/staging/prod
- Rotate secrets quarterly
- Use minimal required permissions
- Monitor secret usage
- Delete unused secrets
Content Security
Safe HTML Parsing:
- Uses cheerio and jsdom safely
- No eval() or code execution
- Sandboxed DOM operations
- XSS-safe by design
PDF Parsing:
- Memory-limited parsing
- No code execution
- Timeout protection
XML Parsing:
- Entity expansion disabled
- DTD processing disabled
- XXE attack prevention
Chapter 8: Production Deployment
Rate Limits & Best Practices
Respecting Target Sites:
- Always respect robots.txt
- Use appropriate delays (300ms minimum)
- Implement exponential backoff for 429 responses
- Monitor circuit breaker trips
Recommended Settings:
{"budgets": {"maxDurationSec": 300,"maxCalls": 100,"maxPages": 50,"maxTotalBytes": 52428800,"maxTotalTokens": 100000}}
Rate Limiting Strategy:
- Per-domain circuit breakers (automatic)
- HTTP caching (reduce requests)
- Deduplication (avoid duplicates)
- Delays in crawl_lite (300-1000 ms)
Anti-Bot Strategies
When to Use Proxies:
- Sites with strict rate limits
- Many concurrent requests
- IP blocking issues
- Geographic targeting needed
User-Agent Rotation:
Automatic rotation through realistic browser User-Agents.
Additional Techniques:
- Random delays in crawl_lite
- Respect crawl-delay from robots.txt
- Use browser rendering for JS-heavy sites
- Limit batch concurrency (2-5)
Example:
{"mode": "single","tool": "fetch_web","params": {"url": "https://strict-site.com"},"proxy": {"useApifyProxy": true},"render": "minimal"}
When to Use Browser Rendering
Use "minimal" mode when:
- Site requires JavaScript but loads quickly
- Need basic interactivity
- Performance is a priority
Use "full" mode when:
- Complex JavaScript applications
- Need to wait for async content
- Screenshots required for verification
- SPAs (Single Page Applications)
Avoid browser rendering when:
- Static HTML is sufficient
- Performance is critical
- Costs need minimization
Cost Comparison:
| Mode | Pages/Hour | Cost Multiplier |
|---|---|---|
| HTTP-only | 3600 | 1x |
| Minimal | 180 | 20x |
| Full | 90 | 40x |
LLM Provider Limits
OpenAI:
| Model | TPM Limit (Free) | Approx. Cost per 1M Tokens |
|---|---|---|
| gpt-4o | 10,000 | ~$2.50 input, ~$10 output |
| gpt-4o-mini | 200,000 | ~$0.15 input, ~$0.60 output |
Anthropic:
| Model | TPM Limit | Approx. Cost per 1M Tokens |
|---|---|---|
| claude-3-5-sonnet | Varies | ~$3 input, ~$15 output |
| claude-3-haiku | Higher | ~$0.25 input, ~$1.25 output |
Optimization Tips:
- Use cheaper models for simple tasks
- Cache LLM results
- Limit maxTokens
- Use structured extraction sparingly
- Monitor costEstimateUSD
Circuit Breaker Tuning
Default Settings:
- Failure threshold: 3 failures
- Cooldown: 60-120 seconds
- Success threshold: 2 successes
Adjust For:
Aggressive (Critical Production):
- Lower failure threshold (2)
- Longer cooldown (180s)
Lenient (Flaky Sources):
- Higher failure threshold (5)
- Shorter cooldown (30s)
Monitoring:
{"usage": {"circuitBreakerTrips": 3}}
High trips indicate:
- Target site issues
- Rate limiting
- Network problems
- Need for adjustment
Cache TTL Guidelines
By Content Type:
| Type | TTL (seconds) | Rationale |
|---|---|---|
| Static content | 86400 | Changes rarely |
| News/blogs | 3600 | Updated hourly |
| Product prices | 300 | Frequent changes |
| Stock data | 60 | Real-time needs |
| User content | 0 | Always fresh |
Configuration:
{"cache": {"enabled": true,"ttlSec": 3600}}
Monitor Effectiveness:
{"usage": {"cacheHits": 85,"cacheMisses": 15}}
Aim for >70% hit rate for repeated workloads.
Cost Optimization Strategies
1. Tiered Approach:
Try HTTP → Try minimal browser → Use full rendering
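One way to apply the tiered approach from a client (a sketch using the apify-client package; the 500-character threshold is an arbitrary assumption):

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Escalate rendering only when the cheaper mode returns too little text.
const fetchTiered = async (url: string) => {
  for (const render of ['none', 'minimal', 'full']) {
    const run = await client.actor('USERNAME/mcp-nexus').call({
      mode: 'single',
      tool: 'fetch_web',
      params: { url },
      render,
    });
    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    const report: any = items[0];
    if (report?.ok && (report.result?.contentText?.length ?? 0) > 500) return report;
  }
  return null;                                       // even full rendering produced little content
};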
2. Batch Similar Operations:
Group by domain to leverage cache and circuit breakers:
{"mode": "batch","calls": [{"tool": "fetch_web", "params": {"url": "https://example.com/page1"}},{"tool": "fetch_web", "params": {"url": "https://example.com/page2"}},{"tool": "fetch_web", "params": {"url": "https://example.com/page3"}}]}
3. Enable Deduplication:
{"dedupe": {"enabled": true,"strategy": "url"}}
4. Minimize LLM Usage:
- Use extract instead of extract_structured when possible
- Cache LLM results
- Use smaller models (gpt-4o-mini)
- Set aggressive maxTokens limits
5. Optimize Concurrency:
| Scenario | Recommended Concurrency |
|---|---|
| HTTP-only | 5-10 |
| With proxies | 2-5 |
| Browser rendering | 1-2 |
6. Store Only What You Need:
{"store": {"html": false,"screenshot": false,"text": true}}
Chapter 9: Development Guide
Project Structure
mcp-nexus/
├── .actor/
│   ├── actor.json                    # Actor metadata and config
│   ├── input_schema.json             # Input validation schema
│   ├── dataset_schema.json           # Dataset view schema
│   └── key_value_store_schema.json   # KVS collection schema
├── src/
│   ├── main.ts                       # Entry point and orchestrator
│   ├── types.ts                      # TypeScript type definitions
│   ├── lib/
│   │   ├── validators.ts             # Input validation (Zod)
│   │   ├── http.ts                   # HTTP client with caching
│   │   ├── circuitBreaker.ts         # Circuit breaker logic
│   │   ├── deduplication.ts          # Duplicate detection
│   │   ├── llm.ts                    # LLM client wrapper
│   │   ├── browser.ts                # Playwright browser manager
│   │   ├── proxy.ts                  # Proxy and UA rotation
│   │   ├── sitemap.ts                # Sitemap/RSS parser
│   │   ├── diff.ts                   # Text diff utilities
│   │   ├── transform.ts              # JSON transformation
│   │   └── webhook.ts                # Webhook delivery
│   └── tools/
│       ├── fetchWeb.ts               # Web fetching tool
│       ├── extract.ts                # Data extraction tool
│       ├── summarize.ts              # AI summarization tool
│       ├── classify.ts               # AI classification tool
│       ├── transform.ts              # JSON transformation tool
│       ├── crawlLite.ts              # Web crawler tool
│       ├── extractStructured.ts      # Structured extraction tool
│       ├── searchWeb.ts              # URL discovery tool
│       └── diffText.ts               # Text comparison tool
├── storage/                          # Local dev storage
│   ├── datasets/
│   ├── key_value_stores/
│   └── request_queues/
├── Dockerfile                        # Container image definition
├── package.json                      # Dependencies
├── tsconfig.json                     # TypeScript config
└── README.md                         # This file
Understanding the Code
Key Components:
Main Orchestrator (src/main.ts):
- Entry point using Apify SDK
- Input validation and parsing
- Tool routing and execution
- Metric collection and reporting
- Webhook delivery
Tool Runtime Context:
Each tool receives a context object with:
- Configuration (cache, dedupe, render, etc.)
- Recording functions (HTTP bytes, tokens, retries)
- Key-value store access
- Circuit breaker state
- User agent
Tool Implementation Pattern:
export const runMyTool = async (
  params: MyToolParams,
  ctx: ToolRuntimeContext
) => {
  // Tool logic here
  return {
    // Tool output
  }
}
Validators (src/lib/validators.ts):
- Zod schemas for all tool parameters
- Input parsing and validation
- Default value resolution
- Type safety guarantees
Infrastructure Libraries:
- http.ts: Fetch with caching, robots.txt, PDF parsing
- circuitBreaker.ts: Per-domain failure tracking
- deduplication.ts: URL/content fingerprinting
- llm.ts: Multi-provider LLM client
- browser.ts: Playwright rendering
- proxy.ts: User-agent rotation
Testing
Local Testing:
Testing is handled through the Apify platform. Use the Apify Console to:
- Configure input
- Run locally or on cloud
- View results in Dataset tab
Test with specific inputs:
Use the Console UI to test different:
- Tool configurations
- Execution modes
- Cache settings
- Error scenarios
Debugging
Enable Verbose Logging:
Check console output for:
- Request/response details
- Cache hits/misses
- Circuit breaker state
- Token usage
Inspect Storage:
Local development stores data in storage/:
- datasets/default/ - Run reports
- key_value_stores/default/ - Artifacts
- key_value_stores/default/INPUT.json - Input
Check Metrics:
Every run includes detailed metrics:
{"usage": {"durationMs": 1234,"httpBytes": 45678,"llmTokens": 150,"retries": 0,"cacheHits": 5,"cacheMisses": 2,"circuitBreakerTrips": 0}}
Use Correlation IDs:
Track requests across systems:
{"correlationId": "my-request-123"}
Chapter 10: API & Integration
Apify API Usage
Run Actor:
curl "https://api.apify.com/v2/acts/USERNAME~mcp-nexus/runs?token=YOUR_TOKEN" \-X POST \-H 'content-type: application/json' \-d '{"mode": "single","tool": "fetch_web","params": {"url": "https://example.com"}}'
Get Run Status:
$curl "https://api.apify.com/v2/acts/USERNAME~mcp-nexus/runs/RUN_ID?token=YOUR_TOKEN"
Get Dataset Items:
$curl "https://api.apify.com/v2/datasets/DATASET_ID/items?token=YOUR_TOKEN"
Full documentation: https://docs.apify.com/api/v2
Webhook Setup
Configuration:
{"webhook": {"url": "https://api.example.com/webhook","secret": "your-webhook-secret","batching": true}}
Webhook Payload:
Receives complete RunReport:
{"correlationId": "abc-123","ok": true,"mode": "single","result": {...},"usage": {...}}
Headers:
Content-Type: application/json
X-Signature: sha256=<hmac>
X-Timestamp: <iso-timestamp>
X-Request-Id: <uuid>
Verification:
See HMAC Webhook Verification for code examples.
Webhook Batching
Overview:
Webhook batching groups simultaneous webhook updates in batch mode, reducing the number of webhook calls and improving efficiency.
How It Works:
- When multiple tool calls complete within a time window (500ms), their results are batched
- A single webhook is sent with all grouped results
- Only applies to batch mode execution
- Maintains order and correlation
Enable Batching:
{"mode": "batch","calls": [{"tool": "fetch_web", "params": {"url": "https://example.com/page1"}},{"tool": "fetch_web", "params": {"url": "https://example.com/page2"}},{"tool": "summarize", "params": {"text": "..."}}],"webhook": {"url": "https://api.example.com/webhook","secret": "your-secret","batching": true}}
Batched Webhook Payload:
When multiple updates are grouped, the webhook receives:
{"type": "batch","count": 3,"items": [{"tool": "fetch_web","result": {"status": 200,"contentText": "..."}},{"tool": "fetch_web","result": {"status": 200,"contentText": "..."}},{"tool": "summarize","result": {"summary": "...","tokens": 150}}]}
Single vs. Batch Payload:
If only one update is in the batch window, it sends the regular format:
{"correlationId": "abc-123","ok": true,"mode": "batch","results": [...]}
Logs:
During execution with batching enabled:
[INFO] Webhook batch: 3 updates grouped
[INFO] Sending batched webhook
Configuration Options:
| Field | Type | Default | Description |
|---|---|---|---|
| batching | boolean | true | Enable webhook batching for batch mode |
Disable Batching:
To send individual webhooks for each result:
{"webhook": {"url": "https://api.example.com/webhook","secret": "your-secret","batching": false}}
Benefits:
- Reduced Calls: Fewer webhook requests to your endpoint
- Efficiency: Lower network overhead and processing
- Grouping: Related results arrive together
- Cost Savings: Reduced webhook processing costs
Use Cases:
- High-volume batch processing: Process many tool calls efficiently
- API rate limits: Reduce webhook endpoint load
- Correlated updates: Group related results for easier processing
- Cost optimization: Minimize webhook infrastructure costs
Important Notes:
- Batching only applies to batch mode ("mode": "batch")
- Single mode always sends individual webhooks
- Batch window is 500ms (not configurable)
- Empty batches are not sent
- Default is enabled (batching: true)
Handling Batched Webhooks:
Your webhook endpoint should handle both regular and batched formats:
app.post('/webhook', (req, res) => {
  const payload = req.body;
  if (payload.type === 'batch') {
    console.log(`Received batch of ${payload.count} items`);
    payload.items.forEach(item => {
      console.log(`Tool: ${item.tool}`, item.result);
    });
  } else {
    console.log('Received single result');
    console.log(payload.result || payload.results);
  }
  res.status(200).send('OK');
});
n8n Integration
Step 1: HTTP Request Node
Configure HTTP Request node:
- Method: POST
- URL: https://api.apify.com/v2/acts/USERNAME~mcp-nexus/runs?token=YOUR_TOKEN
- Body: JSON
Step 2: Pass Input
{"mode": "single","tool": "fetch_web","params": {"url": "{{$json.url}}"}}
Step 3: Wait for Completion
Add Wait node or use webhooks for async notification.
Step 4: Process Results
Parse dataset output in subsequent nodes.
REST API Examples
Example 1: Fetch and Summarize
curl "https://api.apify.com/v2/acts/USERNAME~mcp-nexus/runs?token=TOKEN" \-H 'content-type: application/json' \-d '{"mode": "batch","dag": true,"calls": [{"callId": "fetch","tool": "fetch_web","params": {"url": "https://example.com/article"}},{"callId": "summarize","tool": "summarize","params": {"text": {"ref": "fetch.contentText"}},"dependsOn": ["fetch"]}]}'
Example 2: Crawl and Extract
curl "https://api.apify.com/v2/acts/USERNAME~mcp-nexus/runs?token=TOKEN" \-H 'content-type: application/json' \-d '{"mode": "single","tool": "crawl_lite","params": {"startUrl": "https://example.com","maxPages": 10},"store": {"html": true}}'
SDK Usage
JavaScript:
import { ApifyClient } from 'apify-client'

const client = new ApifyClient({ token: 'YOUR_TOKEN' })

const run = await client.actor('USERNAME/mcp-nexus').call({
  mode: 'single',
  tool: 'fetch_web',
  params: {
    url: 'https://example.com'
  }
})

const dataset = await client.dataset(run.defaultDatasetId).listItems()
console.log(dataset.items[0])
Python:
from apify_client import ApifyClient

client = ApifyClient('YOUR_TOKEN')

run = client.actor('USERNAME/mcp-nexus').call(run_input={
    'mode': 'single',
    'tool': 'fetch_web',
    'params': {'url': 'https://example.com'}
})

dataset = client.dataset(run['defaultDatasetId']).list_items()
print(dataset.items[0])
Appendices
Appendix A: Input Schema Reference
Top-Level Fields:
| Field | Type | Required | Description |
|---|---|---|---|
| mode | 'single' or 'batch' | Yes | Execution mode |
| correlationId | string | No | Tracking identifier |
| tool | ToolName | Conditional | Tool name (single mode) |
| params | object | Conditional | Tool parameters (single mode) |
| calls | array | Conditional | Tool calls (batch mode) |
| dag | boolean | No | Enable DAG execution |
| concurrency | number | No | Batch concurrency (default: 2) |
Configuration Objects:
llm:
{
  provider: 'openai' | 'anthropic' | 'azure'
  model: string
  apiKeySecret?: string
  maxTokens?: number
}
cache:
{
  enabled: boolean
  ttlSec: number
}
dedupe:
{
  enabled: boolean
  ttlSec: number
  strategy: 'url' | 'content' | 'hybrid'
}
render:
'none' | 'minimal' | 'full'
store:
{
  html: boolean
  screenshot: boolean
  text: boolean
}
proxy:
{
  useApifyProxy?: boolean
  proxyUrls?: string[]
}
security:
{
  allowlist?: string[]
  denylist?: string[]
  ignoreRobotsFor?: string[]
  redactLogs?: boolean
}
budgets:
{
  maxDurationSec?: number
  maxCalls?: number
  maxPages?: number
  maxTotalBytes?: number
  maxTotalTokens?: number
  maxLLMTokens?: number
  maxFetchBytes?: number
}
webhook:
{
  url?: string
  secret?: string
  batching?: boolean
}
Appendix B: Output Schema Reference
RunReport:
{
  correlationId: string
  schemaVersion: number
  ok: boolean
  mode: 'single' | 'batch'
  toolsExecuted: number
  usage: {
    durationMs: number
    httpBytes: number
    llmTokens: number
    retries: number
    cacheHits: number
    cacheMisses: number
    circuitBreakerTrips: number
    llmCosts: {
      openai: number
      anthropic: number
      azure: number
      total: number
    }
  }
  costEstimateUSD: number
  warnings: string[]
  errors: string[]
  timestamp: string
  result?: any
  results?: Array<{
    tool: string
    ok: boolean
    output?: any
    error?: string
  }>
  toolMetrics?: Record<string, {
    durationMs: number
    retries: number
    bytes: number
    tokens: number
  }>
}
Appendix C: Error Codes
Common Errors:
| Error | Cause | Solution |
|---|---|---|
| Unsupported tool | Invalid tool name | Check tool names in schema |
| LLM API key not found | Missing API key | Set apiKeySecret or env var |
| Max total bytes quota exceeded | Budget limit hit | Increase maxTotalBytes |
| Max total tokens quota exceeded | Token budget exceeded | Increase maxTotalTokens |
| Circuit breaker open | Domain failures | Wait for cooldown |
| Failed to execute | Tool execution error | Check tool parameters |
| Circular dependency detected | Invalid DAG | Fix dependsOn references |
| Reference to unknown call | Invalid ref | Check callId values |
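Most of these errors indicate a fix on the caller's side, but an open circuit breaker is transient; a caller can simply retry after the cooldown window (a sketch using the apify-client package):

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Retry only when the failure is a transient "Circuit breaker open" error.
const runWithRetry = async (input: any, attempts = 3) => {
  for (let i = 0; i < attempts; i++) {
    const run = await client.actor('USERNAME/mcp-nexus').call(input);
    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    const report: any = items[0];
    if (report.ok) return report;
    const transient = report.errors.some((e: string) => e.includes('Circuit breaker open'));
    if (!transient) return report;                    // non-transient errors need input changes
    await new Promise((r) => setTimeout(r, 120_000)); // wait out the 60-120 s cooldown
  }
  return null;
};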
Appendix D: Troubleshooting
Issue: Circuit Breaker Constantly Tripping
Symptoms: Many circuit breaker trips in usage
Solutions:
- Check if target site is up
- Increase delay between requests
- Use proxies
- Check if IP is blocked
Issue: High LLM Costs
Symptoms: High costEstimateUSD values
Solutions:
- Use cheaper models (gpt-4o-mini)
- Enable caching
- Reduce maxTokens
- Switch to rule-based extraction
Issue: Browser Rendering Timeouts
Symptoms: Errors with render: "full"
Solutions:
- Increase Actor timeout
- Use "minimal" instead
- Check if site loads locally
- Consider HTTP-only approach
Issue: Low Cache Hit Rate
Symptoms: High cache misses, low hits
Solutions:
- Increase cache TTL
- Check if URLs have unique parameters
- Enable deduplication
- Use canonical URLs
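If the URLs you submit carry volatile query parameters (session IDs, UTM tags), normalizing them before the run improves both cache and dedupe hit rates (an illustrative sketch; the parameter list is an assumption, adjust it for your sources):

// Strip common tracking parameters so repeated fetches hit the cache.
const canonicalize = (raw: string): string => {
  const url = new URL(raw);
  for (const key of [...url.searchParams.keys()]) {
    if (key.startsWith('utm_') || ['fbclid', 'gclid', 'sessionid'].includes(key)) {
      url.searchParams.delete(key);
    }
  }
  url.hash = '';
  return url.toString();
};

canonicalize('https://example.com/page?utm_source=x&id=42');
// => 'https://example.com/page?id=42'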
Issue: Webhooks Not Delivered
Symptoms: No webhook received
Solutions:
- Check webhook URL is accessible
- Verify HMAC secret
- Check for 429 responses
- Review idempotency logs
Appendix E: FAQ
Q: Can I run this without Apify?
No, MCP Nexus is designed as an Apify Actor and relies on the Apify platform infrastructure.
Q: How much does it cost?
Costs include:
- Apify compute units (approximately $0.25/hour, subject to change)
- LLM API calls (provider-dependent, subject to change)
- Apify Proxy (if used, approximately $0.50/GB, subject to change)
Q: Can I use my own LLM API keys?
Yes, store them as Apify secrets and reference via apiKeySecret.
Q: Is there a rate limit?
Limits depend on:
- Your Apify plan
- LLM provider limits
- Target site restrictions
Q: Can I scrape any website?
You should:
- Respect robots.txt
- Follow Terms of Service
- Comply with local laws
- Use responsibly
Q: How do I debug failed runs?
Check:
- Error messages in output
- Circuit breaker trips
- Budget violations
- Tool parameters
Q: What's the maximum execution time?
Default: 60 seconds (configurable via maxDurationSec)
Appendix F: Changelog
See CHANGELOG.md for complete version history.
Latest Version: 2.0.0
Major features:
- Multi-provider LLM support
- HTTP caching with ETags
- Circuit breakers
- Browser rendering
- DAG execution mode
- Structured extraction
- 9 specialized tools
License & Support
License: This actor is proprietary software available on the Apify platform.
Support:
- Issues & Questions: Contact via tuguidragos.com
- Feature Requests: Reach out via website
- Commercial Support: Available upon request
Built by Țugui Dragoș. Web: tuguidragos.com. Support development: Buy Me a Coffee.
Last Updated: 2025-11-11