# Common Crawl URL Index Lookup Scraper (`parseforge/common-crawl-index-scraper`) Actor

Pull every web page Common Crawl captured for a domain or URL prefix. Get timestamps, MIME types, status codes, content digests, and WARC offsets to fetch original payloads. Filter by collection, MIME, and status. Export to JSON, CSV, or Excel for large-scale web research and content discovery.

- **URL**: https://apify.com/parseforge/common-crawl-index-scraper.md
- **Developed by:** [ParseForge](https://apify.com/parseforge) (community)
- **Categories:** Developer tools, Automation, Other
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

from $8.25 / 1,000 items

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.
This Actor supports Apify Store discounts, so the price drops as your subscription plan tier rises.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action that can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

![ParseForge Banner](https://github.com/ParseForge/apify-assets/blob/ad35ccc13ddd068b9d6cba33f323962e39aed5b2/banner.jpg?raw=true)

## 🌐 Common Crawl Index Scraper

> 🚀 **List every web page Common Crawl captured for a domain or URL prefix.** WARC offsets included so you can fetch the original payload from S3. No API key, no registration.

> 🕒 **Last updated:** 2026-05-01 · **📊 9 fields** per record · **🗂️ 250+ billion pages indexed** · **📅 monthly crawls since 2008** · **🆓 free public index**

The **Common Crawl Index Scraper** queries the public Common Crawl Index Server and returns every page Common Crawl captured for a given domain or URL prefix. Each record includes the captured URL, ISO timestamp, MIME type, HTTP status code, content digest, byte length, WARC filename, byte offset into that file, and the source collection name.

Common Crawl runs a fresh public web crawl every month and indexes the results in a sortable URL-keyed index. The dataset has powered widely-cited research, Wikipedia-grade reference work, and the training corpus for many large language models. This Actor handles collection selection, MIME and status filters, pagination, and timestamp formatting so you can focus on the data.

| 🎯 Target Audience | 💡 Primary Use Cases |
|---|---|
| ML engineers, web researchers, SEO analysts, data scientists, academics | Training-data discovery, large-scale crawl filtering, archive lookup, content audits |

---

### 📋 What the Common Crawl Index Scraper does

Five capabilities in a single run:

- 🌐 **Domain or prefix lookup.** Submit a URL or prefix and pull every Common Crawl capture in the chosen collection.
- 🗂️ **Collection selector.** Pick a specific monthly crawl like `CC-MAIN-2026-04` or default to the latest.
- 📐 **Match-type control.** Choose `exact`, `prefix`, `host`, or `domain`, the same match types a CDX query uses.
- 📄 **MIME and status filters.** Restrict to HTML, JSON, image, or any specific status code.
- 📦 **WARC offsets included.** Every row tells you which WARC file holds the original payload and at what byte offset.

Each row reports the URL, ISO timestamp, MIME type, HTTP status, digest, byte length, WARC filename, byte offset, and the parent collection identifier.

> 💡 **Why it matters:** Common Crawl is the largest free web corpus in existence and the foundation of many open AI training datasets. Knowing whether a domain is even in the corpus, and at what depth, is a basic question for ML pretraining work, copyright analysis, and large-scale research. Direct CDX queries against the index server are doable but slow and finicky; this Actor wraps that in a clean filter UI.

---

### 🎬 Full Demo

_🚧 Coming soon: a 3-minute walkthrough showing how to go from sign-up to a downloaded dataset._

---

### ⚙️ Input

<table>
<thead>
<tr><th>Input</th><th>Type</th><th>Default</th><th>Behavior</th></tr>
</thead>
<tbody>
<tr><td><code>maxItems</code></td><td>integer</td><td><code>10</code></td><td>Records to return. Free plan caps at 10, paid plan at 1,000,000.</td></tr>
<tr><td><code>url</code></td><td>string</td><td><code>"apify.com"</code></td><td>Domain or URL prefix to look up.</td></tr>
<tr><td><code>matchType</code></td><td>string</td><td><code>"domain"</code></td><td><code>exact</code>, <code>prefix</code>, <code>host</code>, or <code>domain</code>.</td></tr>
<tr><td><code>crawls</code></td><td>array</td><td>latest available</td><td>Crawl identifiers like <code>CC-MAIN-2026-04</code>. Leave empty to query the most recent crawl.</td></tr>
<tr><td><code>fromDate</code></td><td>string</td><td>empty</td><td>Earliest crawl timestamp (<code>YYYYMMDD</code>).</td></tr>
<tr><td><code>toDate</code></td><td>string</td><td>empty</td><td>Latest crawl timestamp (<code>YYYYMMDD</code>).</td></tr>
<tr><td><code>mimeType</code></td><td>string</td><td>empty</td><td>MIME type filter, e.g. <code>text/html</code>.</td></tr>
<tr><td><code>statusCode</code></td><td>string</td><td>empty</td><td>HTTP status code filter, e.g. <code>200</code>.</td></tr>
</tbody>
</table>

**Example: every HTML page captured under apify.com in the `CC-MAIN-2026-04` crawl.**

```json
{
    "maxItems": 500,
    "url": "apify.com",
    "matchType": "domain",
    "crawls": ["CC-MAIN-2026-04"],
    "mimeType": "text/html",
    "statusCode": "200"
}
````

**Example: every capture of a single competitor URL.**

```json
{
    "maxItems": 100,
    "url": "competitor.com/pricing",
    "matchType": "exact"
}
```

> ⚠️ **Good to Know:** Common Crawl publishes one full crawl per month and the corresponding index. The collection list is fetched at run time from `index.commoncrawl.org/collinfo.json`, so the most recent crawl is always available. WARC paths in the output are relative to the Common Crawl S3 bucket; download with the standard AWS S3 tooling.

***

### 📊 Output

Each row contains **9 fields**. Download as CSV, Excel, JSON, or XML.

#### 🧾 Schema

| Field | Type | Example |
|---|---|---|
| 🔗 `url` | string | `"https://apify.com/store"` |
| 📅 `timestamp` | ISO 8601 | `"2026-04-15T08:22:13Z"` |
| 📄 `mimeType` | string | `"text/html"` |
| ✅ `statusCode` | integer | `200` |
| 🔐 `digest` | string | `"AAB45HGJK..."` |
| 📦 `length` | integer | `8421` |
| 📂 `filename` | string | `"crawl-data/CC-MAIN-2026-04/segments/.../warc.gz"` |
| 📌 `offset` | integer | `142893551` |
| 🗂️ `collection` | string | `"CC-MAIN-2026-04"` |

#### 📦 Sample records

<details>
<summary><strong>📄 HTML capture of a product page</strong></summary>

```json
{
    "url": "https://apify.com/store",
    "timestamp": "2026-04-15T08:22:13Z",
    "mimeType": "text/html",
    "statusCode": 200,
    "digest": "AAB45HGJKLMNQ12RSXYZW8VU3T6P9LMK",
    "length": 8421,
    "filename": "crawl-data/CC-MAIN-2026-04/segments/1745881122/warc/CC-MAIN-2026-04-2-00012.warc.gz",
    "offset": 142893551,
    "collection": "CC-MAIN-2026-04"
}
```

</details>

<details>
<summary><strong>🔁 Redirect capture returning HTTP 301</strong></summary>

```json
{
    "url": "https://apify.com/legacy",
    "timestamp": "2024-12-08T14:33:08Z",
    "mimeType": "text/html",
    "statusCode": 301,
    "digest": "QQRST6789UVWXYZ12345ABCDEFGHIJKM",
    "length": 410,
    "filename": "crawl-data/CC-MAIN-2024-49/segments/1734556677/warc/CC-MAIN-2024-49-2-04321.warc.gz",
    "offset": 84200002,
    "collection": "CC-MAIN-2024-49"
}
```

</details>

<details>
<summary><strong>📦 Image asset capture</strong></summary>

```json
{
    "url": "https://cdn.apify.com/img/logo.png",
    "timestamp": "2026-03-20T22:11:04Z",
    "mimeType": "image/png",
    "statusCode": 200,
    "digest": "MMOPQR123ZYWX456BCDEFGHIJKL789NO",
    "length": 18742,
    "filename": "crawl-data/CC-MAIN-2026-13/segments/1742345566/warc/CC-MAIN-2026-13-2-09120.warc.gz",
    "offset": 21088445,
    "collection": "CC-MAIN-2026-13"
}
```

</details>
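
The `timestamp` values in the records above are ISO 8601, whereas raw CDX index lines normally carry 14-digit `YYYYMMDDhhmmss` timestamps in UTC. The following is a minimal sketch of that conversion in both directions. The function names `cdx_to_iso` and `iso_to_cdx` are hypothetical helpers for illustration, not part of the Actor.

```python
from datetime import datetime, timezone


def cdx_to_iso(cdx_timestamp: str) -> str:
    """Convert a 14-digit CDX timestamp (YYYYMMDDhhmmss, UTC) to ISO 8601."""
    parsed = datetime.strptime(cdx_timestamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")


def iso_to_cdx(iso_timestamp: str) -> str:
    """Convert an ISO 8601 timestamp ending in 'Z' back to the 14-digit CDX form."""
    parsed = datetime.strptime(iso_timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.strftime("%Y%m%d%H%M%S")


print(cdx_to_iso("20260415082213"))  # 2026-04-15T08:22:13Z
```

The reverse helper can be handy when you want to compare Actor output against lines pulled directly from the index server.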

***

### ✨ Why choose this Actor

| | Capability |
|---|---|
| 🆓 | **Free public source.** Reads the Common Crawl Index Server directly. |
| 🗂️ | **Every monthly crawl.** All collections from 2008 to today are queryable. |
| 📦 | **WARC offsets.** Each row tells you the exact byte range to fetch the original payload. |
| 📐 | **CDX-style match types.** Exact URL, prefix, host, or full domain. |
| 📄 | **MIME and status filters.** Slice the corpus by content type or HTTP status. |
| 🚀 | **Sub-30-second runs.** Typical 100-row pulls finish in 10 to 25 seconds. |
| 🛠️ | **Live collection list.** Latest crawl auto-detected at run time. |

> 📊 Common Crawl reports more than 250 billion pages indexed across all monthly crawls.

***

### 📈 How it compares to alternatives

| Approach | Cost | Coverage | Refresh | Filters | Setup |
|---|---|---|---|---|---|
| Direct CDX queries | Free | Full | Monthly | Manual | Engineer hours |
| Paid web index APIs | $$$ subscription | Partial | Daily | Built-in | Account setup |
| Self-hosted CC mirrors | Storage cost | Snapshot | Manual refresh | None | Infrastructure |
| **⭐ Common Crawl Index Scraper** *(this Actor)* | Pay-per-event | Full | Monthly | Match type, MIME, status, collection | None |

Same index server Common Crawl publishes, exposed as clean structured records.

***

### 🚀 How to use

1. 🆓 **Create a free Apify account.** [Sign up here](https://console.apify.com/sign-up?fpr=vmoqkp) and get $5 in free credit.
2. 🔍 **Open the Actor.** Search for "Common Crawl Index" in the Apify Store.
3. ⚙️ **Set your inputs.** Pick the URL or domain, match type, and any filters.
4. ▶️ **Click Start.** A 100-row run typically completes in 10 to 25 seconds.
5. 📥 **Download.** Export as CSV, Excel, JSON, or XML.

> ⏱️ Total time from sign-up to first dataset: under five minutes.

***

### 💼 Business use cases

<table>
<tr>
<td width="50%">

#### 🤖 ML & data science

- Check if a domain is in a training corpus
- Estimate copyright exposure for LLM datasets
- Build domain-specific subcorpora from CC
- Cross-reference scraped data with public capture records

</td>
<td width="50%">

#### 📈 SEO & competitive

- Map a competitor's full URL space
- Audit which pages CC sees vs Google
- Discover legacy paths through historical crawls
- Track structural changes month over month

</td>
</tr>
<tr>
<td width="50%">

#### 🛡️ Security & OSINT

- Map historical attack surface of a target domain
- Find leaked URLs that are no longer linked
- Track CDN and origin host changes
- Identify abandoned subdomains

</td>
<td width="50%">

#### 📰 Research & journalism

- Cite specific captures with stable WARC offsets
- Run reproducible studies on CC subsets
- Compare crawl coverage of different topic spaces
- Build longitudinal datasets month over month

</td>
</tr>
</table>

***

### 🌟 Beyond business use cases

Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.

<table>
<tr>
<td width="50%">

#### 🎓 Research and academia

- Empirical datasets for papers, thesis work, and coursework
- Longitudinal studies tracking changes across snapshots
- Reproducible research with cited, versioned data pulls
- Classroom exercises on data analysis and ethical scraping

</td>
<td width="50%">

#### 🎨 Personal and creative

- Side projects, portfolio demos, and indie app launches
- Data visualizations, dashboards, and infographics
- Content research for bloggers, YouTubers, and podcasters
- Hobbyist collections and personal trackers

</td>
</tr>
<tr>
<td width="50%">

#### 🤝 Non-profit and civic

- Transparency reporting and accountability projects
- Advocacy campaigns backed by public-interest data
- Community-run databases for local issues
- Investigative journalism on public records

</td>
<td width="50%">

#### 🧪 Experimentation

- Prototype AI and machine-learning pipelines with real data
- Validate product-market hypotheses before engineering spend
- Train small domain-specific models on niche corpora
- Test dashboard concepts with live input

</td>
</tr>
</table>

***

### 🔌 Automating Common Crawl Index Scraper

Run this Actor on a schedule, from your codebase, or inside another tool:

- **Node.js** SDK: see [Apify JavaScript client](https://docs.apify.com/api/client/js/) for programmatic runs.
- **Python** SDK: see [Apify Python client](https://docs.apify.com/api/client/python/) for the same flow in Python.
- **HTTP API**: see [Apify API docs](https://docs.apify.com/api/v2) for raw REST integration.

Schedule monthly runs from the Apify Console to track each new crawl. Pipe results into Google Sheets, S3, BigQuery, or your own webhook with the built-in [integrations](https://docs.apify.com/platform/integrations).

***

### ❓ Frequently Asked Questions

<details>
<summary><strong>🗂️ How do I pick a collection?</strong></summary>

Each Common Crawl collection is identified by an ID like `CC-MAIN-2026-04`. The Actor fetches the live collection list and defaults to the latest one. Pass explicit IDs in `crawls` if you want a historical snapshot.
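
If you want to resolve the latest collection yourself, a rough sketch follows. It assumes that `collinfo.json` is a JSON array whose entries each carry an `id` field, and that IDs with a zero-padded year and week number sort correctly as plain strings. `latest_collection_id` and `fetch_collections` are hypothetical helper names; how the Actor selects the default internally is not documented here.

```python
import json
from urllib.request import urlopen

COLLINFO_URL = "https://index.commoncrawl.org/collinfo.json"


def latest_collection_id(collections: list[dict]) -> str:
    """Return the newest crawl ID from collinfo.json-style entries.

    Assumption: every entry has an "id" such as "CC-MAIN-2024-30".
    """
    ids = [entry["id"] for entry in collections if entry.get("id", "").startswith("CC-MAIN-")]
    if not ids:
        raise ValueError("no CC-MAIN collections found")
    return max(ids)


def fetch_collections(url: str = COLLINFO_URL) -> list[dict]:
    """Download the live collection list (network call, not executed below)."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


# Offline sample shaped like the assumed collinfo.json entries.
sample = [{"id": "CC-MAIN-2024-30"}, {"id": "CC-MAIN-2024-26"}, {"id": "CC-MAIN-2023-50"}]
print(latest_collection_id(sample))  # CC-MAIN-2024-30
```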

</details>

<details>
<summary><strong>📦 Can I download the actual page content?</strong></summary>

Each row gives you the WARC filename and byte offset. Use any AWS S3 client to fetch a byte range from `s3://commoncrawl/{filename}`. Bulk WARC fetching is a separate workflow.
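
As a sketch of the byte-range step, the snippet below builds an HTTP `Range` header from a row's `offset` and `length` and fetches one record over HTTPS instead of an S3 client. It assumes that `https://data.commoncrawl.org/` serves the same paths as the S3 bucket, that `length` is the compressed size of the record, and that each record is its own gzip member. `byte_range_header` and `fetch_warc_record` are hypothetical helper names.

```python
import gzip
from urllib.request import Request, urlopen

# Assumption: public HTTPS endpoint mirroring the Common Crawl S3 bucket.
DATA_HOST = "https://data.commoncrawl.org/"


def byte_range_header(offset: int, length: int) -> str:
    """Build the Range value for one WARC record (the end byte is inclusive)."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    return f"bytes={offset}-{offset + length - 1}"


def fetch_warc_record(filename: str, offset: int, length: int) -> bytes:
    """Download and decompress a single record (network call, not executed below)."""
    request = Request(DATA_HOST + filename, headers={"Range": byte_range_header(offset, length)})
    with urlopen(request, timeout=60) as response:
        return gzip.decompress(response.read())


print(byte_range_header(142893551, 8421))  # bytes=142893551-142901971
```

The inclusive end byte is the usual off-by-one trap: a record of `length` bytes starting at `offset` ends at `offset + length - 1`.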

</details>

<details>
<summary><strong>🔍 What is the difference between match types?</strong></summary>

`exact` matches one URL only. `prefix` matches a URL plus everything beneath it. `host` matches one hostname. `domain` matches the host plus all subdomains.
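
To make the four match types concrete, here is a simplified local approximation. It compares hosts and paths only and ignores schemes and query strings; the real index server normalizes URLs into sort keys, so edge cases may differ. The `matches` function is a hypothetical illustration, not the Actor's or the index server's implementation.

```python
from urllib.parse import urlsplit


def _split(value: str) -> tuple[str, str]:
    """Return (host, path) for a URL given with or without a scheme."""
    parts = urlsplit(value if "://" in value else "//" + value)
    return (parts.hostname or "").lower(), parts.path or "/"


def matches(query: str, url: str, match_type: str) -> bool:
    """Approximate exact/prefix/host/domain matching on plain URLs."""
    q_host, q_path = _split(query)
    u_host, u_path = _split(url)
    if match_type == "exact":
        return (q_host, q_path) == (u_host, u_path)
    if match_type == "prefix":
        return q_host == u_host and u_path.startswith(q_path)
    if match_type == "host":
        return q_host == u_host
    if match_type == "domain":
        # The host itself or any subdomain of it.
        return u_host == q_host or u_host.endswith("." + q_host)
    raise ValueError(f"unknown match type: {match_type}")


print(matches("apify.com", "https://cdn.apify.com/img/logo.png", "domain"))  # True
print(matches("apify.com", "https://cdn.apify.com/img/logo.png", "host"))    # False
```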

</details>

<details>
<summary><strong>📅 How often does Common Crawl refresh?</strong></summary>

Roughly once per month. The newest collection is detected automatically at run time.

</details>

<details>
<summary><strong>📦 How many rows can I pull at once?</strong></summary>

Free plan caps at 10. Paid plans go up to 1,000,000. Very broad queries can return millions of rows; always set sensible filters.

</details>

<details>
<summary><strong>📄 Why does my run return zero rows?</strong></summary>

Common Crawl indexes a sample, not the entire web. Smaller sites may not be in any given monthly collection. Try a broader match type or a different collection.

</details>

<details>
<summary><strong>📅 Can I query historical collections?</strong></summary>

Yes. Pass any past collection identifier (e.g. `CC-MAIN-2020-50`). The Actor returns the captures indexed in that crawl.

</details>

<details>
<summary><strong>💼 Can I use this for commercial work?</strong></summary>

Yes. The Common Crawl dataset is published under terms that allow commercial use. Always cite Common Crawl as the source.

</details>

<details>
<summary><strong>💳 Do I need a paid Apify plan?</strong></summary>

The free plan returns up to 10 rows per run. Paid plans return up to 1,000,000.

</details>

<details>
<summary><strong>⚠️ What if a run fails or returns empty?</strong></summary>

Common Crawl's index server occasionally rate-limits very wide queries. Narrow the match type or retry. [Open a contact form](https://tally.so/r/BzdKgA) and include the run URL if the issue persists.

</details>

<details>
<summary><strong>🔁 How fresh is the data?</strong></summary>

Each run hits the live index server, which is updated when Common Crawl publishes a new monthly crawl.

</details>

<details>
<summary><strong>⚖️ Is this legal?</strong></summary>

Yes. Common Crawl publishes the index server for exactly this kind of programmatic access, and the dataset is released under terms that permit research and commercial use.

</details>

***

### 🔌 Integrate with any app

- [**Make**](https://apify.com/integrations/make) - drop run results into 1,800+ apps.
- [**Zapier**](https://apify.com/integrations/zapier) - trigger automations off completed runs.
- [**Slack**](https://apify.com/integrations/slack) - post run summaries to a channel.
- [**Google Sheets**](https://apify.com/integrations/google-sheets) - sync each run into a spreadsheet.
- [**Webhooks**](https://docs.apify.com/platform/integrations/webhooks) - notify your own services on run finish.
- [**Airbyte**](https://apify.com/integrations/airbyte) - load runs into Snowflake, BigQuery, or Postgres.

***

### 🔗 Recommended Actors

- [**🕰️ Wayback Machine CDX Scraper**](https://apify.com/parseforge/wayback-cdx-scraper) - the Internet Archive's complementary historical web index.
- [**🅱️ Bing Search Scraper**](https://apify.com/parseforge/bing-search-scraper) - check current rank for URLs you find in CC.
- [**🦆 DuckDuckGo Search Scraper**](https://apify.com/parseforge/duckduckgo-search-scraper) - alternative SERP signal alongside crawl coverage.
- [**📚 Wikipedia Pageviews Scraper**](https://apify.com/parseforge/wikipedia-pageviews-scraper) - cross-reference web mentions with public-interest spikes.
- [**🐙 GitHub Trending Repos Scraper**](https://apify.com/parseforge/github-trending-scraper) - capture the developer-attention layer.

> 💡 **Pro Tip:** browse the complete [ParseForge collection](https://apify.com/parseforge) for more pre-built scrapers and data tools.

***

**🆘 Need Help?** [**Open our contact form**](https://tally.so/r/BzdKgA) and we'll route the question to the right person.

***

> Common Crawl is a registered trademark of Common Crawl Foundation, a 501(c)(3) non-profit. This Actor is not affiliated with or endorsed by Common Crawl. It uses only the public Index Server endpoint and respects all published rate limits.

# Actor input Schema

## `maxItems` (type: `integer`):

Free users: Limited to 10 items (preview). Paid users: Optional, max 1,000,000.

## `url` (type: `string`):

Domain or URL to look up. Examples: `apify.com`, `example.com/blog`, `https://news.ycombinator.com`.

## `crawls` (type: `array`):

Specific Common Crawl crawl IDs (e.g. `CC-MAIN-2024-30`). Leave empty to query the most recent crawl automatically.

## `matchType` (type: `string`):

How to match the URL: `exact`, `prefix`, `host`, or `domain`.

## `fromDate` (type: `string`):

Earliest crawl timestamp. Leave empty for any.

## `toDate` (type: `string`):

Latest crawl timestamp. Leave empty for any.

## `statusCode` (type: `string`):

Restrict to a specific status code, e.g. `200`, `301`, `404`. Leave empty for all.

## `mimeType` (type: `string`):

Filter by MIME type, e.g. `text/html`, `application/pdf`, `image/jpeg`.

## Actor input object example

```json
{
  "maxItems": 10,
  "url": "apify.com",
  "matchType": "domain"
}
```
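
For the optional fields, a small builder can keep inputs consistent with the schema above. This is a sketch under assumptions: `build_input` is a hypothetical helper, the `YYYYMMDD` date format is taken from the field titles in the OpenAPI specification, and the status code is sent as a string because the schema types it that way.

```python
from datetime import date
from typing import Optional

MATCH_TYPES = {"exact", "prefix", "host", "domain"}


def build_input(
    url: str,
    match_type: str = "domain",
    max_items: int = 10,
    crawls: Optional[list[str]] = None,
    from_date: Optional[date] = None,
    to_date: Optional[date] = None,
    status_code: Optional[int] = None,
    mime_type: Optional[str] = None,
) -> dict:
    """Assemble an Actor input object using the schema's field names."""
    if match_type not in MATCH_TYPES:
        raise ValueError(f"matchType must be one of {sorted(MATCH_TYPES)}")
    if not 1 <= max_items <= 1_000_000:
        raise ValueError("maxItems must be between 1 and 1,000,000")
    run_input: dict = {"maxItems": max_items, "url": url, "matchType": match_type}
    if crawls:
        run_input["crawls"] = list(crawls)
    if from_date:
        run_input["fromDate"] = from_date.strftime("%Y%m%d")  # assumed YYYYMMDD
    if to_date:
        run_input["toDate"] = to_date.strftime("%Y%m%d")  # assumed YYYYMMDD
    if status_code is not None:
        run_input["statusCode"] = str(status_code)  # schema type is string
    if mime_type:
        run_input["mimeType"] = mime_type
    return run_input


print(build_input("apify.com", crawls=["CC-MAIN-2024-30"], status_code=200))
```

Omitting unset optional fields, rather than sending empty strings, matches the "leave empty" wording in the schema descriptions.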

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "maxItems": 10,
    "url": "apify.com",
    "matchType": "domain"
};

// Run the Actor and wait for it to finish
const run = await client.actor("parseforge/common-crawl-index-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "maxItems": 10,
    "url": "apify.com",
    "matchType": "domain",
}

# Run the Actor and wait for it to finish
run = client.actor("parseforge/common-crawl-index-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "maxItems": 10,
  "url": "apify.com",
  "matchType": "domain"
}' |
apify call parseforge/common-crawl-index-scraper --silent --output-dataset

```
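
If you cannot install a client library, the same run can be started over plain HTTP. The sketch below targets the `run-sync-get-dataset-items` endpoint listed in the OpenAPI specification and passes the token as a query parameter, as that specification describes. `build_run_request` and `run_actor` are hypothetical helper names, and the response is assumed to be a JSON array of dataset items.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/parseforge~common-crawl-index-scraper/run-sync-get-dataset-items"


def build_run_request(token: str, run_input: dict) -> Request:
    """Create the POST request for a synchronous run that returns dataset items."""
    url = f"{API_BASE}{ACTOR_PATH}?{urlencode({'token': token})}"
    body = json.dumps(run_input).encode("utf-8")
    return Request(url, data=body, headers={"Content-Type": "application/json"}, method="POST")


def run_actor(token: str, run_input: dict) -> list:
    """Send the request and parse the response (network call, not executed below)."""
    with urlopen(build_run_request(token, run_input), timeout=300) as response:
        return json.load(response)


request = build_run_request("<YOUR_API_TOKEN>", {"maxItems": 10, "url": "apify.com", "matchType": "domain"})
print(request.get_method(), request.full_url)
```

Keeping request construction separate from the network call makes the URL and payload easy to inspect before spending a paid run.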

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=parseforge/common-crawl-index-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Common Crawl URL Index Lookup Scraper",
        "description": "Pull every web page Common Crawl captured for a domain or URL prefix. Get timestamps, MIME types, status codes, content digests, and WARC offsets to fetch original payloads. Filter by collection, MIME, and status. Export to JSON, CSV, or Excel for large-scale web research and content discovery.",
        "version": "1.0",
        "x-build-id": "CtPMOAAZoeu0cpvar"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/parseforge~common-crawl-index-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-parseforge-common-crawl-index-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/parseforge~common-crawl-index-scraper/runs": {
            "post": {
                "operationId": "runs-sync-parseforge-common-crawl-index-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/parseforge~common-crawl-index-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-parseforge-common-crawl-index-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "maxItems": {
                        "title": "Max Items",
                        "minimum": 1,
                        "maximum": 1000000,
                        "type": "integer",
                        "description": "Free users: Limited to 10 items (preview). Paid users: Optional, max 1,000,000."
                    },
                    "url": {
                        "title": "URL or domain",
                        "type": "string",
                        "description": "Domain or URL to look up. Examples: `apify.com`, `example.com/blog`, `https://news.ycombinator.com`."
                    },
                    "crawls": {
                        "title": "Crawl IDs",
                        "type": "array",
                        "description": "Specific Common Crawl crawl IDs (e.g. `CC-MAIN-2024-30`). Leave empty to query the most recent crawl automatically.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "matchType": {
                        "title": "Match type",
                        "enum": [
                            "exact",
                            "prefix",
                            "host",
                            "domain"
                        ],
                        "type": "string",
                        "description": "How to match the URL: `exact`, `prefix`, `host`, or `domain`.",
                        "default": "domain"
                    },
                    "fromDate": {
                        "title": "From date (YYYYMMDD)",
                        "type": "string",
                        "description": "Earliest crawl timestamp. Leave empty for any."
                    },
                    "toDate": {
                        "title": "To date (YYYYMMDD)",
                        "type": "string",
                        "description": "Latest crawl timestamp. Leave empty for any."
                    },
                    "statusCode": {
                        "title": "HTTP status code",
                        "type": "string",
                        "description": "Restrict to a specific status code, e.g. `200`, `301`, `404`. Leave empty for all."
                    },
                    "mimeType": {
                        "title": "MIME type",
                        "type": "string",
                        "description": "Filter by MIME type, e.g. `text/html`, `application/pdf`, `image/jpeg`."
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "number",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "number",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
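
The `runsResponseSchema` above describes the JSON object returned when a run is started. As a rough illustration of how a consumer might read it, the sketch below pulls out the fields most integrations need: the run `id`, its `status`, the `defaultDatasetId` that holds the results, and the cost fields `usageTotalUsd` and `usageUsd`. The function name `summarize_run`, the `TERMINAL_STATUSES` set, and all sample values are assumptions made for this example; the schema itself only shows `READY` as an example status. No network call is made, the sample dictionary simply mirrors the schema's shape.

```python
from typing import Any

# Assumption: statuses after which a run no longer changes state.
# The schema above only lists "READY" as an example value.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}


def summarize_run(response: dict[str, Any]) -> dict[str, Any]:
    """Extract commonly used fields from a run response (hypothetical helper)."""
    run = response.get("data") or {}
    usage_usd = run.get("usageUsd") or {}
    # Keep only the usage entries that actually cost something.
    billed = {name: cost for name, cost in usage_usd.items() if cost}
    return {
        "run_id": run.get("id"),
        "status": run.get("status"),
        "finished": run.get("status") in TERMINAL_STATUSES,
        "dataset_id": run.get("defaultDatasetId"),
        "total_usd": run.get("usageTotalUsd", 0.0),
        "billed": billed,
    }


# Made-up sample shaped like runsResponseSchema; the IDs are placeholders.
sample = {
    "data": {
        "id": "run-id-placeholder",
        "status": "READY",
        "defaultDatasetId": "dataset-id-placeholder",
        "usageTotalUsd": 0.00005,
        "usageUsd": {
            "ACTOR_COMPUTE_UNITS": 0,
            "KEY_VALUE_STORE_WRITES": 0.00005,
        },
    }
}

summary = summarize_run(sample)
print(summary)
```

Missing keys fall back to `None` or an empty dictionary rather than raising, which is a deliberate choice here: a run that has just been created may not have every field populated yet.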