# Common Crawl Scraper (`crawlerbros/common-crawl-scraper`) Actor

Query the Common Crawl URL Index for any domain or URL pattern. Discover a site's archived pages, historical URLs, capture dates, HTTP statuses and MIME types for SEO, domain intelligence and research. Also lists the available monthly crawls.

- **URL**: https://apify.com/crawlerbros/common-crawl-scraper.md
- **Developed by:** [Crawler Bros](https://apify.com/crawlerbros) (community)
- **Categories:** SEO tools, Developer tools, Automation
- **Stats:** 2 total users, 1 monthly users, 100.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

from $3.00 / 1,000 results

This Actor is paid per event and per usage: you are charged a fixed price for specific events plus the cost of Apify platform usage.
The Actor supports Apify Store discounts, so the price drops as your subscription plan goes up.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows (PowerShell)
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Common Crawl Scraper

Query the **Common Crawl URL Index** for any domain or URL pattern and get back every archived capture — the historical URLs a site exposed, when they were crawled, their HTTP status and MIME type, and the exact location of the raw page in Common Crawl's public archive. Perfect for SEO audits, domain intelligence, historical URL discovery and web-scale research. HTTP-only, no login, no proxy.

### What this Actor does

- **Two modes:** `urlCaptures` and `listCrawls`
- **Wildcard lookups:** `example.com`, `example.com/*`, `en.wikipedia.org/wiki/*`, `*.example.com`
- **125+ monthly crawls:** query the latest crawl or any historical one
- **Server-side filters:** date range; **client-side filters:** HTTP status, MIME type
- **WARC location** for every capture so you can fetch the raw archived page
- **Empty fields are omitted**, so every field that appears in a record has a value
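
The client-side filters in the list above can be pictured as a simple pass over the returned records. The sketch below is an illustration under stated assumptions, not the Actor's actual code: `keep_capture` is a hypothetical helper, the status is compared as text because the index may report it as a string, and checking both `mime` and `mimeDetected` is a guess that this README does not confirm.

```python
from typing import Any, Optional


def keep_capture(
    record: dict[str, Any],
    status_filter: Optional[int] = None,
    mime_filter: Optional[str] = None,
) -> bool:
    """Return True if a capture record passes the optional filters."""
    if status_filter is not None:
        # Compare as strings: the status may arrive as "200" or as 200.
        if str(record.get("status", "")) != str(status_filter):
            return False
    if mime_filter:
        needle = mime_filter.lower()
        # Assumption: a substring match in either MIME field is enough.
        mimes = [str(record.get(key, "")).lower() for key in ("mime", "mimeDetected")]
        if not any(needle in mime for mime in mimes):
            return False
    return True


# Made-up sample records, for illustration only.
captures = [
    {"url": "https://example.com/", "status": "200", "mime": "text/html"},
    {"url": "https://example.com/a.pdf", "status": 200, "mime": "application/pdf"},
    {"url": "https://example.com/old", "status": "404", "mime": "text/html"},
]
html_ok = [c for c in captures if keep_capture(c, 200, "text/html")]
print([c["url"] for c in html_ok])
```

Because missing fields are omitted from records, the helper reads them with `.get()` rather than indexing.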

### Modes

| Mode | What it does | Needs |
|---|---|---|
| `urlCaptures` | Look up all archived captures for a domain / URL pattern | `urlPattern` (+ optional `crawl`, filters) |
| `listCrawls` | List every available Common Crawl monthly crawl | – |

### Output — `urlCaptures` (one row per archived capture)

- `url` — the archived URL
- `urlKey` — Common Crawl's canonical (SURT) key
- `timestamp` — capture time, `YYYYMMDDHHMMSS`
- `captureDate` — the same time as ISO 8601
- `status` — HTTP status at capture time
- `mime` — declared MIME type
- `mimeDetected` — MIME type detected from content
- `digest` — content digest (dedupe identical pages)
- `length` — record byte length
- `offset` — byte offset within the WARC file
- `filename` — WARC file path in the archive
- `languages` — detected language codes
- `encoding` — character encoding
- `redirectUrl` — redirect target (for 3xx captures)
- `truncated` — truncation reason (when present)
- `crawlId` — which crawl this came from
- `warcUrl` — direct link to the WARC file on `data.commoncrawl.org`
- `recordType: "capture"`, `sourceUrl`, `scrapedAt`
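
To illustrate how `timestamp` and `captureDate` relate, here is a minimal sketch that converts the 14-digit form to ISO 8601. `capture_date` is a hypothetical helper, the UTC time zone is an assumption, and the exact string the Actor emits in `captureDate` may be formatted differently (for example with a `Z` suffix).

```python
from datetime import datetime, timezone


def capture_date(timestamp: str) -> str:
    """Convert a YYYYMMDDHHMMSS capture timestamp to an ISO 8601 string."""
    parsed = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    # Assumption: capture timestamps are expressed in UTC.
    return parsed.replace(tzinfo=timezone.utc).isoformat()


print(capture_date("20240301123045"))  # 2024-03-01T12:30:45+00:00
```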

### Output — `listCrawls` (one row per crawl)

- `crawlId` — e.g. `CC-MAIN-2024-10`
- `name` — human-readable name (e.g. `February/March 2024 Index`)
- `fromDate`, `toDate` — crawl time window
- `cdxApiUrl` — the crawl's index API endpoint
- `timegateUrl` — the crawl's timegate
- `recordType: "crawl"`, `sourceUrl`, `scrapedAt`

### Input

| Field | Type | Default | Description |
|---|---|---|---|
| `mode` | string | `urlCaptures` | `urlCaptures` / `listCrawls` |
| `urlPattern` | string | `en.wikipedia.org/wiki/*` | Domain or URL, `*` wildcards allowed |
| `crawl` | string | `latest` | `latest` or a crawl id like `CC-MAIN-2024-10` |
| `matchType` | string | `auto` | `auto` / `exact` / `prefix` / `host` / `domain` |
| `statusFilter` | int | – | Keep only this HTTP status (e.g. `200`) |
| `mimeFilter` | string | – | Keep only MIME types containing this text |
| `fromDate` | string | – | `YYYYMMDD` lower bound |
| `toDate` | string | – | `YYYYMMDD` upper bound |
| `maxItems` | int | `100` | Hard cap (1–5000) |

#### Example: all archived Wikipedia article URLs in the latest crawl

```json
{ "mode": "urlCaptures", "urlPattern": "en.wikipedia.org/wiki/*", "crawl": "latest", "maxItems": 500 }
````

#### Example: only successful HTML pages of a domain

```json
{ "mode": "urlCaptures", "urlPattern": "example.com/*", "statusFilter": 200, "mimeFilter": "text/html" }
```

#### Example: a whole domain including subdomains, in a specific crawl

```json
{ "mode": "urlCaptures", "urlPattern": "wikipedia.org", "matchType": "domain", "crawl": "CC-MAIN-2024-10" }
```

#### Example: list every available crawl

```json
{ "mode": "listCrawls" }
```

### Use cases

- **SEO & site audits** — discover every URL a domain has ever exposed to crawlers
- **Domain intelligence** — profile a competitor's URL structure and content types
- **Historical URL discovery** — recover old / removed pages for migration or research
- **Data engineering** — get WARC offsets to pull raw archived pages at scale
- **Security research** — enumerate a domain's historical footprint

### Data source

Data comes from the public [Common Crawl URL Index](https://index.commoncrawl.org/) (the CDX API) and the crawl archive at `data.commoncrawl.org`, both published openly by the Common Crawl Foundation. No account, API key, or proxy is required. Coverage spans well over a decade of monthly crawls; run `listCrawls` to see every crawl currently available and its date range.

### FAQ

**What is Common Crawl?**  A free, open repository of web crawl data covering billions of pages, updated roughly monthly. Its URL Index lets you look up which URLs were captured and where they live in the archive. See [commoncrawl.org](https://commoncrawl.org).

**What's a "crawl"?**  Each monthly snapshot is a crawl with an id like `CC-MAIN-2024-10`. Use `latest` for the newest, or run `listCrawls` to see all available ids and their date ranges.
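
If you collect crawl ids from `listCrawls` and want to order them yourself, the id can be split into its numeric parts. This is a sketch under assumptions: `crawl_sort_key` is a hypothetical helper, it only understands ids shaped like `CC-MAIN-2024-10` (a four-digit year followed by a two-digit number), and any id with a different shape is skipped.

```python
import re
from typing import Optional, Tuple

_CRAWL_ID_RE = re.compile(r"CC-MAIN-(\d{4})-(\d{2})")


def crawl_sort_key(crawl_id: str) -> Optional[Tuple[int, int]]:
    """Return (year, number) for ids like CC-MAIN-2024-10, else None."""
    match = _CRAWL_ID_RE.fullmatch(crawl_id)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Made-up list of ids, for illustration only.
ids = ["CC-MAIN-2023-50", "CC-MAIN-2024-10", "CC-MAIN-2024-05", "not-a-crawl"]
sortable = [i for i in ids if crawl_sort_key(i) is not None]
newest = max(sortable, key=crawl_sort_key)
print(newest)  # CC-MAIN-2024-10
```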

**How do wildcards work?**  `example.com` matches that host's pages; `example.com/*` matches everything under it; `*.example.com` matches all subdomains. You can also set `matchType` explicitly to `exact`, `prefix`, `host` or `domain`.
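
One way to picture how `auto` might map a pattern to a match type is shown below. This is only a guess based on the wildcard rules just described, not the Actor's implementation; `infer_match_type` is a hypothetical helper.

```python
def infer_match_type(pattern: str) -> str:
    """Guess a matchType from where the `*` wildcard sits in the pattern."""
    bare = pattern.split("://", 1)[-1]  # ignore an optional scheme
    if bare.startswith("*."):
        return "domain"  # *.example.com: the domain and its subdomains
    if bare.endswith("*"):
        return "prefix"  # example.com/*: everything under that path
    if "/" not in bare:
        return "host"    # example.com: that host's pages
    return "exact"       # a full URL without wildcards


for sample in ("*.example.com", "en.wikipedia.org/wiki/*", "example.com", "example.com/about"):
    print(sample, "->", infer_match_type(sample))
```

If the guess does not fit your case, setting `matchType` explicitly removes the ambiguity.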

**Why did I get no results for a big-name site?**  Some sites exclude crawlers via robots.txt, so they aren't in Common Crawl. Try a different domain or crawl.

**Can I fetch the actual page content?**  Each capture includes `warcUrl`, `offset` and `length` — enough to download the exact archived response from Common Crawl's public `data.commoncrawl.org` store.
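
A minimal sketch of that download, using only the Python standard library, follows. It assumes the record is stored as its own gzip member (as Common Crawl WARC files are laid out) and that `offset` and `length` may arrive as strings; `byte_range` and `fetch_record` are hypothetical helper names.

```python
import gzip
import urllib.request


def byte_range(offset: int, length: int) -> str:
    """Build the HTTP Range header value covering one WARC record."""
    # The end of an HTTP byte range is inclusive, hence the -1.
    return f"bytes={offset}-{offset + length - 1}"


def fetch_record(warc_url: str, offset, length) -> bytes:
    """Download one archived record and return it decompressed."""
    header = byte_range(int(offset), int(length))
    request = urllib.request.Request(warc_url, headers={"Range": header})
    with urllib.request.urlopen(request, timeout=60) as response:
        compressed = response.read()
    # Assumption: the requested slice is a complete gzip member.
    return gzip.decompress(compressed)


print(byte_range(1000, 250))  # bytes=1000-1249
```

The decompressed bytes contain the WARC headers, then the HTTP response headers, then the page body, so expect to split on blank lines before parsing the HTML.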

**What does `digest` do?**  It's a content hash — identical `digest` values across captures mean the page content didn't change, which is handy for change detection and deduplication.
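
As a sketch of that deduplication, the helper below keeps the earliest capture for each distinct `digest`. `first_capture_per_digest` is a hypothetical name and the sample records are made up; records without a `digest` are kept as-is since there is nothing to compare.

```python
from typing import Any, Iterable


def first_capture_per_digest(captures: Iterable[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep the earliest capture for every distinct content digest."""
    seen: set[str] = set()
    unique: list[dict[str, Any]] = []
    # 14-digit timestamps sort chronologically as plain strings.
    for record in sorted(captures, key=lambda r: r.get("timestamp", "")):
        digest = record.get("digest")
        if digest is not None and digest in seen:
            continue  # same content as an earlier capture
        if digest is not None:
            seen.add(digest)
        unique.append(record)
    return unique


sample = [
    {"timestamp": "20240302000000", "digest": "AAA"},
    {"timestamp": "20240301000000", "digest": "AAA"},
    {"timestamp": "20240303000000", "digest": "BBB"},
]
print([r["timestamp"] for r in first_capture_per_digest(sample)])
```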

**How far back does the data go?**  Common Crawl has crawls stretching back over a decade; `listCrawls` shows every one currently available.

# Actor input Schema

## `mode` (type: `string`):

What to fetch.

## `urlPattern` (type: `string`):

Domain or URL to look up. Use `*` wildcards, e.g. `example.com`, `example.com/*`, `en.wikipedia.org/wiki/*`, or `*.example.com`.

## `crawl` (type: `string`):

Which monthly crawl to query. Use `latest` for the newest, or a crawl id like `CC-MAIN-2024-10`. Run mode=listCrawls to discover valid ids.

## `matchType` (type: `string`):

How to interpret the pattern. `auto` lets the `*` wildcards decide.

## `statusFilter` (type: `integer`):

Only keep captures with this HTTP status (e.g. 200). Leave empty for all.

## `mimeFilter` (type: `string`):

Only keep captures whose MIME type contains this text (e.g. `text/html`, `pdf`, `json`).

## `fromDate` (type: `string`):

Only include captures on or after this date, e.g. `20240101`.

## `toDate` (type: `string`):

Only include captures on or before this date, e.g. `20241231`.

## `maxItems` (type: `integer`):

Hard cap on emitted records.

## Actor input object example

```json
{
  "mode": "urlCaptures",
  "urlPattern": "en.wikipedia.org/wiki/*",
  "crawl": "latest",
  "matchType": "auto",
  "maxItems": 15
}
```

# Actor output Schema

## `captures` (type: `string`):

Dataset containing all scraped Common Crawl records.

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "mode": "urlCaptures",
    "urlPattern": "en.wikipedia.org/wiki/*",
    "crawl": "latest",
    "matchType": "auto",
    "maxItems": 15
};

// Run the Actor and wait for it to finish
const run = await client.actor("crawlerbros/common-crawl-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "mode": "urlCaptures",
    "urlPattern": "en.wikipedia.org/wiki/*",
    "crawl": "latest",
    "matchType": "auto",
    "maxItems": 15,
}

# Run the Actor and wait for it to finish
run = client.actor("crawlerbros/common-crawl-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "mode": "urlCaptures",
  "urlPattern": "en.wikipedia.org/wiki/*",
  "crawl": "latest",
  "matchType": "auto",
  "maxItems": 15
}' |
apify call crawlerbros/common-crawl-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=crawlerbros/common-crawl-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Common Crawl Scraper",
        "description": "Query the Common Crawl URL Index for any domain or URL pattern. Discover a site's archived pages, historical URLs, capture dates, HTTP statuses and MIME types for SEO, domain intelligence and research. Also lists the available monthly crawls.",
        "version": "1.0",
        "x-build-id": "hWF0gztdINx8D1eke"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/crawlerbros~common-crawl-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-crawlerbros-common-crawl-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/crawlerbros~common-crawl-scraper/runs": {
            "post": {
                "operationId": "runs-sync-crawlerbros-common-crawl-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/crawlerbros~common-crawl-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-crawlerbros-common-crawl-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "mode"
                ],
                "properties": {
                    "mode": {
                        "title": "Mode",
                        "enum": [
                            "urlCaptures",
                            "listCrawls"
                        ],
                        "type": "string",
                        "description": "What to fetch.",
                        "default": "urlCaptures"
                    },
                    "urlPattern": {
                        "title": "URL / domain pattern",
                        "type": "string",
                        "description": "Domain or URL to look up. Use `*` wildcards, e.g. `example.com`, `example.com/*`, `en.wikipedia.org/wiki/*`, or `*.example.com`.",
                        "default": "en.wikipedia.org/wiki/*"
                    },
                    "crawl": {
                        "title": "Crawl",
                        "type": "string",
                        "description": "Which monthly crawl to query. Use `latest` for the newest, or a crawl id like `CC-MAIN-2024-10`. Run mode=listCrawls to discover valid ids.",
                        "default": "latest"
                    },
                    "matchType": {
                        "title": "Match type",
                        "enum": [
                            "auto",
                            "exact",
                            "prefix",
                            "host",
                            "domain"
                        ],
                        "type": "string",
                        "description": "How to interpret the pattern. `auto` lets the `*` wildcards decide.",
                        "default": "auto"
                    },
                    "statusFilter": {
                        "title": "HTTP status filter",
                        "minimum": 100,
                        "maximum": 599,
                        "type": "integer",
                        "description": "Only keep captures with this HTTP status (e.g. 200). Leave empty for all."
                    },
                    "mimeFilter": {
                        "title": "MIME filter",
                        "type": "string",
                        "description": "Only keep captures whose MIME type contains this text (e.g. `text/html`, `pdf`, `json`)."
                    },
                    "fromDate": {
                        "title": "From date (YYYYMMDD)",
                        "type": "string",
                        "description": "Only include captures on or after this date, e.g. `20240101`."
                    },
                    "toDate": {
                        "title": "To date (YYYYMMDD)",
                        "type": "string",
                        "description": "Only include captures on or before this date, e.g. `20241231`."
                    },
                    "maxItems": {
                        "title": "Max items",
                        "minimum": 1,
                        "maximum": 5000,
                        "type": "integer",
                        "description": "Hard cap on emitted records.",
                        "default": 100
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
