# Site Researcher (`crawlerbros/site-researcher`) Actor

Extract structured intelligence from any website: title, meta description, Open Graph tags, JSON-LD structured data, headings, images, videos, tech-stack fingerprint. Walks the sitemap to discover pages.

- **URL**: https://apify.com/crawlerbros/site-researcher.md
- **Developed by:** [Crawler Bros](https://apify.com/crawlerbros) (community)
- **Categories:** Developer tools, SEO tools, Automation
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded, 14 bookmarks
- **User rating**: 5.00 out of 5 stars

## Pricing

from $1.00 / 1,000 results

This Actor is paid per event and usage. You are charged both the fixed price for specific events and for Apify platform usage.
This Actor supports Apify Store discounts, so the higher your subscription plan, the lower the price.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event
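
As a rough illustration of the per-event part of the bill, the sketch below multiplies a result count by the listed starting price. It assumes one billed result per page record and no plan discount, and it leaves out Apify platform usage, which is charged separately. The function name `estimate_event_cost` is hypothetical.

```python
def estimate_event_cost(results: int, price_per_1000: float = 1.00) -> float:
    """Estimate the pay-per-event charge in USD for a number of results.

    Assumes the listed starting price with no plan discount.
    Platform usage is billed separately and is not included here.
    """
    if results < 0:
        raise ValueError("results must be non-negative")
    return results * price_per_1000 / 1000


# A run capped at the maximum of 200 pages yields at most 200 records.
print(f"${estimate_event_cost(200):.2f}")  # prints $0.20
```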

## What's an Apify Actor?

Actors are software tools running on the Apify platform, built for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action that can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Site Researcher

Extract structured intelligence from any website. For each page, the Actor pulls the title, meta description, Open Graph tags, Twitter Cards, JSON-LD structured data, heading inventory, image and video URLs, and a tech-stack fingerprint. It discovers pages by walking the sitemap, by following internal links breadth-first (BFS), or both. It is HTTP-only: no proxy, no browser, no API key.

### What it does

You give it a start URL. The Actor:
1. Optionally fetches `/sitemap.xml` (and follows sitemap-index files) to discover up to `maxPages` pages.
2. Optionally follows internal `<a href>` links from each crawled page (BFS, same-host only) to fill the rest of the budget.
3. Per page, extracts:
   - **Title** (`<title>`).
   - **Description** (`<meta name="description">`).
   - **Canonical URL** (`<link rel="canonical">`).
   - **Open Graph tags** — `og:title`, `og:description`, `og:image`, `og:type`, etc.
   - **Twitter Cards** — `twitter:card`, `twitter:site`, etc.
   - **JSON-LD blocks** — every `application/ld+json` script, plus a flat `jsonLdTypes` array of @type values.
   - **Headings** — first 20 unique values per level (h1 / h2 / h3).
   - **Images** — every `<img src>` (and `data-src` lazy variants), absolutised, with alt text.
   - **Videos** — every `<video src>` and `<source>` inside `<video>`.
   - **Tech-stack** — lightweight Wappalyzer-style fingerprint (WordPress, Shopify, Next.js, React, Vue, Angular, Webflow, Cloudflare, GTM, Google Analytics, HubSpot, Intercom, Zendesk, Facebook Pixel, etc.).
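
The per-page extraction step can be pictured with a small standard-library sketch. This is not the Actor's source code: the class `MetaExtractor` and the function `extract_metadata` are hypothetical names, and the sketch covers only the title, description, canonical URL, Open Graph tags, and Twitter Cards.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects <title>, meta description, canonical URL, and og:*/twitter:* tags."""

    def __init__(self) -> None:
        super().__init__()
        self.record = {"ogTags": {}, "twitterTags": {}}
        self._title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = {name: (value or "") for name, value in attrs}
        if tag == "title" and not self._title_parts:
            # Only the first <title> counts; later ones (e.g. inside SVG) are ignored.
            self._in_title = True
        elif tag == "meta":
            # Open Graph tags use property=..., most other meta tags use name=...
            key = (attr.get("property") or attr.get("name") or "").lower()
            content = attr.get("content", "")
            if key == "description":
                self.record["description"] = content
            elif key.startswith("og:"):
                self.record["ogTags"][key[len("og:"):]] = content
            elif key.startswith("twitter:"):
                self.record["twitterTags"][key[len("twitter:"):]] = content
        elif tag == "link" and attr.get("rel", "").lower() == "canonical":
            self.record["canonical"] = attr.get("href", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)


def extract_metadata(html: str) -> dict:
    """Parse one HTML document and return a record without empty values."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    record = dict(parser.record, title="".join(parser._title_parts).strip())
    return {key: value for key, value in record.items() if value}
```

Calling `extract_metadata(html)` on a fetched page returns a dict shaped like the `title`, `description`, `canonical`, `ogTags`, and `twitterTags` fields described in this README.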

### Input

| Field | Type | Default | Description |
|---|---|---|---|
| `startUrl` | string (required) | `https://apify.com` | Root URL to research. |
| `crawlSitemap` | boolean | `true` | Parse `/sitemap.xml` to discover pages. |
| `followInternalLinks` | boolean | `true` | Follow internal `<a href>` links to extend the page set. |
| `maxPages` | integer | `20` (1–200) | Hard cap on pages researched. |
| `extractMedia` | boolean | `true` | Include image/video URLs in each page record. |
| `extractTechStack` | boolean | `true` | Run the Wappalyzer-style scan. |
| `userAgent` | string (optional) | (Chrome 131) | Override only if a server filters by UA. |

#### Example input

```json
{
  "startUrl": "https://apify.com",
  "crawlSitemap": true,
  "followInternalLinks": true,
  "maxPages": 30,
  "extractMedia": true,
  "extractTechStack": true
}
````
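
Before submitting a run, it can help to validate the input locally against the documented defaults and the 1–200 range for `maxPages`. The helper below is a minimal sketch of that check; `build_input`, `DEFAULTS`, and `ALLOWED_OVERRIDES` are hypothetical names, and the platform performs its own validation regardless.

```python
from urllib.parse import urlparse

# Defaults copied from the input table above.
DEFAULTS = {
    "crawlSitemap": True,
    "followInternalLinks": True,
    "maxPages": 20,
    "extractMedia": True,
    "extractTechStack": True,
}
ALLOWED_OVERRIDES = set(DEFAULTS) | {"userAgent"}


def build_input(start_url: str, **overrides) -> dict:
    """Return an Actor input dict, raising ValueError on obviously bad values."""
    parsed = urlparse(start_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"startUrl must be a valid http(s) URL: {start_url!r}")

    unknown = set(overrides) - ALLOWED_OVERRIDES
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")

    actor_input = {"startUrl": start_url, **DEFAULTS, **overrides}

    max_pages = actor_input["maxPages"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_pages, bool) or not isinstance(max_pages, int):
        raise ValueError("maxPages must be an integer")
    if not 1 <= max_pages <= 200:
        raise ValueError("maxPages must be between 1 and 200")
    return actor_input
```

For example, `build_input("https://apify.com", maxPages=30)` produces the same object as the example input above.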

### Output

One record per researched page. Empty fields are omitted.

```json
{
  "url": "https://apify.com/",
  "title": "Apify · The full-stack web-scraping & automation platform",
  "description": "Apify is the all-in-one web scraping…",
  "canonical": "https://apify.com/",
  "ogTags": {
    "title": "Apify · The full-stack web-scraping & automation platform",
    "description": "Apify is the all-in-one web scraping…",
    "image": "https://apify.com/img/og-image.jpg",
    "type": "website"
  },
  "twitterTags": {
    "card": "summary_large_image",
    "site": "@apify"
  },
  "headings": {
    "h1": ["Build, deploy & monetize web scrapers and AI agents"],
    "h2": ["Trusted by 60,000+ developers", "Why Apify?"]
  },
  "jsonLdTypes": ["Organization", "WebSite"],
  "jsonLd": [ /* full ld+json blocks */ ],
  "images": [
    {"url": "https://apify.com/img/hero.png", "alt": "Apify hero"}
  ],
  "imageCount": 24,
  "videos": [],
  "techStack": ["Cloudflare", "Google Tag Manager", "Next.js", "React"],
  "seoSummary": {
    "titleLength": 58,
    "metaDescriptionLength": 142,
    "h1": "Build, deploy & monetize web scrapers and AI agents",
    "h1Count": 1,
    "wordCount": 1247,
    "imageCount": 24,
    "imagesWithoutAlt": 2,
    "internalLinkCount": 38,
    "externalLinkCount": 12,
    "pageSizeBytes": 78213,
    "hasCanonical": true,
    "hasStructuredData": true,
    "hasOgTags": true,
    "hasTwitterTags": true
  },
  "seoScores": {
    "metaTags": 100,
    "headings": 100,
    "images": 92,
    "links": 100,
    "structuredData": 100,
    "socialMeta": 100,
    "contentQuality": 100,
    "performance": 100,
    "technicalSeo": 100,
    "overallScore": 99,
    "overallGrade": "A",
    "issues": [
      {"category": "images", "message": "2/24 images missing alt text"}
    ],
    "issuesSummary": "1 issue(s) detected"
  },
  "discoveredVia": "start-url",
  "scrapedAt": "2024-12-16T14:23:11+00:00"
}
```

#### Output fields

- **`url`** — page URL (absolute).
- **`title`** / **`description`** — `<title>` and `<meta name="description">`.
- **`canonical`** — `<link rel="canonical">` URL.
- **`ogTags`** — flat dict of `og:*` properties without the prefix.
- **`twitterTags`** — flat dict of `twitter:*` properties without the prefix.
- **`headings`** — `{h1: [...], h2: [...], h3: [...]}` with first 20 unique values per level.
- **`jsonLd`** — every parseable `application/ld+json` block (raw payloads).
- **`jsonLdTypes`** — sorted list of `@type` values across all blocks.
- **`images`** / **`imageCount`** — array of `{url, alt}` and total count.
- **`videos`** / **`videoCount`** — array of `{url, type?}` and total count.
- **`techStack`** — sorted list of detected stack tokens (CMS, frameworks, analytics, CDNs, web server).
- **`seoSummary`** — at-a-glance SEO metrics block: `titleLength`, `metaDescriptionLength`, `h1`, `h1Count`, `wordCount`, `imageCount`, `imagesWithoutAlt`, `internalLinkCount`, `externalLinkCount`, `pageSizeBytes`, plus presence flags `hasCanonical`, `hasStructuredData`, `hasOgTags`, `hasTwitterTags`.
- **`seoScores`** — 9 category scores (0–100) — `metaTags`, `headings`, `images`, `links`, `structuredData`, `socialMeta`, `contentQuality`, `performance`, `technicalSeo` — plus `overallScore` (0–100), `overallGrade` (A/B/C/D/F), and an `issues[]` array with concrete recommendations (e.g. `"Title length 18 chars (recommended 30-60)"`).
- **`discoveredVia`** — `"start-url"` (the starting page), `"sitemap"` (from `sitemap.xml`), or `"internal-link"` (BFS from another page).
- **`scrapedAt`** — ISO-8601 timestamp.
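
Once the records are downloaded, the `seoScores` block lends itself to a quick site-wide roll-up. The sketch below counts issues per category and lists pages whose `overallScore` falls under a threshold. The function name `summarize_issues` and the default threshold of 80 are illustrative choices, not part of the Actor.

```python
from collections import Counter


def summarize_issues(records, min_score=80):
    """Return (issue counts by category, pages scoring below min_score)."""
    by_category = Counter()
    weak_pages = []
    for record in records:
        # seoScores may be absent, since empty fields are omitted from the output.
        scores = record.get("seoScores") or {}
        for issue in scores.get("issues", []):
            by_category[issue.get("category", "unknown")] += 1
        overall = scores.get("overallScore")
        if overall is not None and overall < min_score:
            weak_pages.append((record["url"], overall))
    # Worst pages first.
    weak_pages.sort(key=lambda pair: pair[1])
    return by_category, weak_pages
```

The first return value behaves like a dict, so `by_category.most_common(3)` gives the three categories with the most issues.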

### Tech-stack signatures

The detector pattern-matches on HTML body and response headers. Coverage is intentionally narrow but high-confidence:

- **CMS / builders**: WordPress, Shopify, Squarespace, Wix, Webflow, Drupal, Ghost.
- **SPA frameworks**: Next.js, React, Vue (incl. Nuxt), Angular, Svelte.
- **Analytics / tag managers**: Google Tag Manager, Google Analytics, HubSpot, Facebook Pixel, Amplitude, Segment.
- **Customer support**: Intercom, Zendesk.
- **CDNs**: Cloudflare (incl. via `cf-ray` header), Amazon CloudFront, Fastly, Akamai.
- **Web servers**: nginx, Apache (via `Server` header).
- **Backends**: Express, ASP.NET, PHP (via `X-Powered-By`).
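
To make the idea of pattern-matching on body and headers concrete, here is a minimal sketch of such a detector. The header checks (`cf-ray`, `Server`, `X-Powered-By`) follow the list above. The body regexes in `BODY_PATTERNS` are illustrative assumptions and are not taken from the Actor, and `detect_tech_stack` is a hypothetical name.

```python
import re

# Illustrative body signatures only; the Actor's real rules are not published here.
BODY_PATTERNS = {
    "WordPress": re.compile(r"/wp-content/|/wp-includes/", re.I),
    "Shopify": re.compile(r"cdn\.shopify\.com", re.I),
    "Next.js": re.compile(r"__NEXT_DATA__|/_next/static/"),
    "Google Tag Manager": re.compile(r"googletagmanager\.com/gtm\.js", re.I),
}


def detect_tech_stack(html: str, headers: dict) -> list:
    """Return a sorted list of stack tokens found in the HTML body or headers."""
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    found = {name for name, pattern in BODY_PATTERNS.items() if pattern.search(html)}

    if "cf-ray" in lowered:
        found.add("Cloudflare")

    server = lowered.get("server", "").lower()
    if "nginx" in server:
        found.add("nginx")
    if "apache" in server:
        found.add("Apache")

    powered_by = lowered.get("x-powered-by", "").lower()
    for token, name in (("express", "Express"), ("asp.net", "ASP.NET"), ("php", "PHP")):
        if token in powered_by:
            found.add(name)

    return sorted(found)
```

Keeping each signature as a named pattern makes it easy to stay narrow: a token is reported only when a specific, recognisable marker is present.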

### Use cases

- **Competitor research** — quickly fingerprint a competitor's tech stack and content structure.
- **SEO audits** — verify every page has a title, meta description, canonical URL, and Open Graph image.
- **Sales enablement** — extract pages tagged with specific JSON-LD `@type` (e.g. `Product`, `Article`, `Event`).
- **Brand monitoring** — pull every image and video URL for asset auditing.
- **Lead enrichment** — combine site title, description, and tech stack into a single CRM-ready record.
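
The SEO-audit use case, for instance, reduces to a few dictionary lookups on the output records. The sketch below reports which of the four fields are missing per page. The function names `missing_seo_fields` and `audit` are hypothetical, and the check relies on the documented behaviour that empty fields are omitted.

```python
def missing_seo_fields(record: dict) -> list:
    """List which of title, description, canonical, and og:image are absent."""
    missing = [
        field for field in ("title", "description", "canonical")
        if not record.get(field)
    ]
    # ogTags keys are stored without the "og:" prefix.
    if not (record.get("ogTags") or {}).get("image"):
        missing.append("ogTags.image")
    return missing


def audit(records) -> dict:
    """Map each page URL to its missing fields, skipping pages that pass."""
    report = {}
    for record in records:
        missing = missing_seo_fields(record)
        if missing:
            report[record["url"]] = missing
    return report
```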

### FAQ

**Does it need a proxy?**
No. Public web pages are generally accessible from datacenter IPs. A few sites with aggressive WAFs may block the requests; those pages fall through with an empty `techStack` and missing `images` / `videos`.

**Does it work on JavaScript-rendered (SPA) pages?**
Partially. The Actor sees the server-rendered HTML, not the DOM produced after the page's JavaScript runs. For Next.js pages this is usually fine, because Next.js renders pages on the server. For purely client-side-rendered React/Vue apps, the meta tags are still visible, but content arrays such as headings and images may be sparse.

**How many pages does it crawl?**
Up to `maxPages` (default 20). The discovery order is: start URL → sitemap pages → internal links from researched pages.
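
The discovery order and the shared budget can be sketched as a single de-duplicating pass over the three sources. This simplification takes the sitemap and link URLs as ready-made lists, whereas a real crawl discovers links while pages are fetched. It also compares URLs as exact strings, and the name `plan_pages` is hypothetical.

```python
from itertools import chain


def plan_pages(start_url, sitemap_urls, link_urls, max_pages=20):
    """Return (url, discoveredVia) pairs in priority order, capped at max_pages."""
    candidates = chain(
        [(start_url, "start-url")],
        ((url, "sitemap") for url in sitemap_urls),
        ((url, "internal-link") for url in link_urls),
    )
    seen = set()
    plan = []
    for url, via in candidates:
        if len(plan) >= max_pages:
            break  # the budget is shared by all three sources
        if url in seen:
            continue  # keep the label of the first source that found the URL
        seen.add(url)
        plan.append((url, via))
    return plan
```

Because the start URL and sitemap entries are consumed first, internal links only fill whatever budget is left.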

**Does it download images / video binaries?**
No. It only collects URLs and metadata. Combine it with a downloader Actor to fetch the bytes.

**What if the site has no sitemap?**
The Actor falls back to internal-link BFS from the start URL. Set `crawlSitemap: false` to skip the probe entirely.

**Why is `jsonLd` sometimes missing?**
Many sites don't ship structured data. The output omits `jsonLd` and `jsonLdTypes` when zero parseable blocks are found.
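
The omission rule can be illustrated with a short sketch that parses raw `application/ld+json` script contents, skips anything that is not valid JSON, and collects `@type` values. Whether the Actor also descends into nested objects and `@graph` arrays, as this sketch does, is an assumption. `parse_json_ld` and `_collect_types` are hypothetical names.

```python
import json


def _collect_types(node, types: set) -> None:
    """Walk a decoded JSON-LD value and gather every string @type."""
    if isinstance(node, list):
        for item in node:
            _collect_types(item, types)
    elif isinstance(node, dict):
        declared = node.get("@type")
        if isinstance(declared, str):
            types.add(declared)
        elif isinstance(declared, list):
            types.update(t for t in declared if isinstance(t, str))
        for value in node.values():
            _collect_types(value, types)


def parse_json_ld(raw_blocks) -> dict:
    """Return jsonLd and jsonLdTypes, or an empty dict if nothing parses."""
    blocks, types = [], set()
    for raw in raw_blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # unparseable blocks are skipped
        blocks.append(data)
        _collect_types(data, types)
    if not blocks:
        return {}
    return {"jsonLd": blocks, "jsonLdTypes": sorted(types)}
```

Downstream code should therefore read these fields with `record.get("jsonLdTypes", [])` rather than indexing directly.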

# Actor input Schema

## `startUrl` (type: `string`):

Root URL to research. Must be a valid http(s) URL.

## `crawlSitemap` (type: `boolean`):

When enabled, the Actor first parses /sitemap.xml (and follows sitemap-index files) to discover pages to research.

## `followInternalLinks` (type: `boolean`):

When enabled, the Actor also follows internal `<a href>` links from the start URL to extend the page set (BFS, same-host only).

## `maxPages` (type: `integer`):

Hard cap on pages to research. Sitemap-discovered pages and link-discovered pages share this budget.

## `extractMedia` (type: `boolean`):

Include `<img>` and `<video>` URLs in each page record. Disable for a leaner output.

## `extractTechStack` (type: `boolean`):

Run a lightweight Wappalyzer-style scan (WordPress, Shopify, Next.js, React, Vue, Angular, Cloudflare, GTM, etc.).

## `userAgent` (type: `string`):

Override the default Chrome User-Agent.

## Actor input object example

```json
{
  "startUrl": "https://apify.com",
  "crawlSitemap": true,
  "followInternalLinks": true,
  "maxPages": 20,
  "extractMedia": true,
  "extractTechStack": true
}
```

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrl": "https://apify.com",
    "maxPages": 20
};

// Run the Actor and wait for it to finish
const run = await client.actor("crawlerbros/site-researcher").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "startUrl": "https://apify.com",
    "maxPages": 20,
}

# Run the Actor and wait for it to finish
run = client.actor("crawlerbros/site-researcher").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrl": "https://apify.com",
  "maxPages": 20
}' |
apify call crawlerbros/site-researcher --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=crawlerbros/site-researcher",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Site Researcher",
        "description": "Extract structured intelligence from any website: title, meta description, Open Graph tags, JSON-LD structured data, headings, images, videos, tech-stack fingerprint. Walks the sitemap to discover pages.",
        "version": "0.1",
        "x-build-id": "13Y5IyLj3TO5ChPWf"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/crawlerbros~site-researcher/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-crawlerbros-site-researcher",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/crawlerbros~site-researcher/runs": {
            "post": {
                "operationId": "runs-sync-crawlerbros-site-researcher",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/crawlerbros~site-researcher/run-sync": {
            "post": {
                "operationId": "run-sync-crawlerbros-site-researcher",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "startUrl"
                ],
                "properties": {
                    "startUrl": {
                        "title": "Start URL",
                        "type": "string",
                        "description": "Root URL to research. Must be a valid http(s) URL."
                    },
                    "crawlSitemap": {
                        "title": "Crawl sitemap",
                        "type": "boolean",
                        "description": "When enabled, the actor first parses /sitemap.xml (and follows sitemap-index files) to discover pages to research.",
                        "default": true
                    },
                    "followInternalLinks": {
                        "title": "Follow internal links",
                        "type": "boolean",
                        "description": "When enabled, the actor also follows internal `<a href>` links from the start URL to extend the page set (BFS, same-host only).",
                        "default": true
                    },
                    "maxPages": {
                        "title": "Maximum pages",
                        "minimum": 1,
                        "maximum": 200,
                        "type": "integer",
                        "description": "Hard cap on pages to research. Sitemap-discovered pages and link-discovered pages share this budget.",
                        "default": 20
                    },
                    "extractMedia": {
                        "title": "Extract media URLs",
                        "type": "boolean",
                        "description": "Include `<img>` and `<video>` URLs in each page record. Disable for a leaner output.",
                        "default": true
                    },
                    "extractTechStack": {
                        "title": "Detect tech stack",
                        "type": "boolean",
                        "description": "Run a lightweight Wappalyzer-style scan (WordPress, Shopify, Next.js, React, Vue, Angular, Cloudflare, GTM, etc.).",
                        "default": true
                    },
                    "userAgent": {
                        "title": "Custom User-Agent (optional)",
                        "type": "string",
                        "description": "Override the default Chrome User-Agent."
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
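
For projects that cannot use the client libraries, the `run-sync-get-dataset-items` path from the specification above can be called with the Python standard library alone. The sketch below is one way to do it. The helper names `build_request` and `run_sync` are hypothetical, and the code assumes the endpoint returns the dataset items as a JSON array, which is the API's default format.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/crawlerbros~site-researcher/run-sync-get-dataset-items"


def build_request(token: str, actor_input: dict) -> urllib.request.Request:
    """Build the POST request described by the OpenAPI spec, without sending it."""
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        url=f"{API_BASE}{ACTOR_PATH}?{query}",
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sync(token: str, actor_input: dict, timeout: float = 300.0) -> list:
    """Run the Actor, wait for it to finish, and return the dataset items."""
    request = build_request(token, actor_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A call such as `run_sync(token, {"startUrl": "https://apify.com", "maxPages": 5})` blocks until the run completes, so keep `maxPages` small or raise `timeout` for larger crawls. Read the token from an environment variable or secret store rather than hard-coding it.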
