Common Crawl URL Index Lookup Scraper
Pricing
from $8.25 / 1,000 items
Pull every web page Common Crawl captured for a domain or URL prefix. Get timestamps, MIME types, status codes, content digests, and WARC offsets to fetch original payloads. Filter by collection, MIME, and status. Export to JSON, CSV, or Excel for large-scale web research and content discovery.
Developer: ParseForge
🌐 Common Crawl Index Scraper
🚀 List every web page Common Crawl captured for a domain or URL prefix. WARC offsets included so you can fetch the original payload from S3. No API key, no registration.
🕒 Last updated: 2026-05-01 · 📊 9 fields per record · 🗂️ 250+ billion pages indexed · 📅 monthly crawls since 2008 · 🆓 free public index
The Common Crawl Index Scraper queries the public Common Crawl Index Server and returns every page Common Crawl captured for a given domain or URL prefix. Each record includes the captured URL, ISO timestamp, MIME type, HTTP status code, content digest, byte length, WARC filename, byte offset into that file, and the source collection name.
Common Crawl runs a fresh public web crawl every month and indexes the results in a sortable URL-keyed index. The dataset has powered widely-cited research, Wikipedia-grade reference work, and the training corpus for many large language models. This Actor handles collection selection, MIME and status filters, pagination, and timestamp formatting so you can focus on the data.
| 🎯 Target Audience | 💡 Primary Use Cases |
|---|---|
| ML engineers, web researchers, SEO analysts, data scientists, academics | Training-data discovery, large-scale crawl filtering, archive lookup, content audits |
📋 What the Common Crawl Index Scraper does
Five filtering workflows in a single run:
- 🌐 Domain or prefix lookup. Submit a URL or prefix and pull every Common Crawl capture in the chosen collection.
- 🗂️ Collection selector. Pick a specific monthly crawl like `CC-MAIN-2026-04` or default to the latest.
- 📐 Match-type control. `exact`, `prefix`, `host`, or `domain`, as in a CDX query.
- 📄 MIME and status filters. Restrict to HTML, JSON, image, or any specific status code.
- 📦 WARC offsets included. Every row tells you which WARC file holds the original payload and at what byte offset.
Each row reports the URL, ISO timestamp, MIME type, HTTP status, digest, byte length, WARC filename, byte offset, and the parent collection identifier.
💡 Why it matters: Common Crawl is the largest free web corpus in existence and the foundation of many open AI training datasets. Knowing whether a domain is even in the corpus, and at what depth, is a basic question for ML pretraining work, copyright analysis, and large-scale research. Direct CDX queries against the index server are doable but slow and finicky; this Actor wraps that in a clean filter UI.
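For reference, here is a hedged sketch of the kind of direct CDX query this Actor wraps. The collection name is an example; the live list comes from `index.commoncrawl.org/collinfo.json`:

```python
import urllib.parse

CDX_ENDPOINT = "https://index.commoncrawl.org/{collection}-index"

def build_cdx_url(collection, url, match_type="domain", filters=()):
    """Build a CDX query URL of the kind this Actor issues internally."""
    params = {"url": url, "matchType": match_type, "output": "json"}
    query = urllib.parse.urlencode(params)
    # CDX filters repeat the `filter` key, e.g. filter=mime:text/html
    for f in filters:
        query += "&" + urllib.parse.urlencode({"filter": f})
    return CDX_ENDPOINT.format(collection=collection) + "?" + query

url = build_cdx_url("CC-MAIN-2026-04", "apify.com",
                    filters=["mime:text/html", "status:200"])
# GET this URL; each response line is one JSON object with urlkey,
# timestamp, url, mime, status, digest, length, offset, filename.
```

Pagination, retries, and rate-limit handling are the "slow and finicky" parts the Actor takes care of on top of this.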
🎬 Full Demo
🚧 Coming soon: a 3-minute walkthrough showing how to go from sign-up to a downloaded dataset.
⚙️ Input
| Input | Type | Default | Behavior |
|---|---|---|---|
| `maxItems` | integer | `10` | Records to return. Free plan caps at 10; paid plan at 1,000,000. |
| `urlOrDomain` | string | `"apify.com"` | Domain or URL prefix to look up. |
| `matchType` | string | `"domain"` | `exact`, `prefix`, `host`, or `domain`. |
| `collection` | string | latest available | Monthly crawl identifier, e.g. `CC-MAIN-2026-04`. |
| `mimeFilter` | string | empty | MIME type filter, e.g. `text/html`. |
| `statusFilter` | string | empty | HTTP status code filter, e.g. `200`. |
Example: every HTML page captured under apify.com in April 2026.
```json
{
  "maxItems": 500,
  "urlOrDomain": "apify.com",
  "matchType": "domain",
  "collection": "CC-MAIN-2026-04",
  "mimeFilter": "text/html",
  "statusFilter": "200"
}
```
Example: every capture of a single competitor URL.
```json
{
  "maxItems": 100,
  "urlOrDomain": "competitor.com/pricing",
  "matchType": "exact"
}
```
⚠️ Good to Know: Common Crawl publishes one full crawl per month along with the corresponding index. The collection list is fetched at run time from `index.commoncrawl.org/collinfo.json`, so the most recent crawl is always available. WARC paths in the output are relative to the Common Crawl S3 bucket; download them with standard AWS S3 tooling.
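The `filename`, `offset`, and `length` fields are enough to pull one original payload without downloading a whole WARC file. A minimal sketch using an HTTP range request against Common Crawl's public `data.commoncrawl.org` host (the record values shown are placeholders, not real output):

```python
import gzip
import urllib.request

DATA_HOST = "https://data.commoncrawl.org/"

def byte_range(offset, length):
    """HTTP Range header value for one record; ranges are inclusive."""
    return f"bytes={offset}-{offset + length - 1}"

def fetch_warc_record(filename, offset, length):
    """Range-request a single gzipped WARC record and decompress it."""
    req = urllib.request.Request(
        DATA_HOST + filename,
        headers={"Range": byte_range(offset, length)},
    )
    with urllib.request.urlopen(req) as resp:
        raw = resp.read()
    # Each WARC record is an independent gzip member, so the slice
    # decompresses on its own.
    return gzip.decompress(raw)

# Placeholder values; substitute the fields from a real output row:
# record = fetch_warc_record(
#     "crawl-data/CC-MAIN-2026-04/segments/.../warc.gz", 142893551, 8421)
```

The decompressed bytes contain the WARC headers, the HTTP response headers, and the original body.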
📊 Output
Each row contains 9 fields. Download as CSV, Excel, JSON, or XML.
🧾 Schema
| Field | Type | Example |
|---|---|---|
| 🔗 `url` | string | `"https://apify.com/store"` |
| 📅 `timestamp` | ISO 8601 | `"2026-04-15T08:22:13Z"` |
| 📄 `mimeType` | string | `"text/html"` |
| ✅ `statusCode` | integer | `200` |
| 🔐 `digest` | string | `"AAB45HGJK..."` |
| 📦 `length` | integer | `8421` |
| 📂 `filename` | string | `"crawl-data/CC-MAIN-2026-04/segments/.../warc.gz"` |
| 📌 `offset` | integer | `142893551` |
| 🗂️ `collection` | string | `"CC-MAIN-2026-04"` |
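Raw CDX records carry 14-digit `YYYYMMDDhhmmss` timestamps; the `timestamp` field above is that value rendered as ISO 8601. A sketch of the conversion, in case you mix Actor output with raw index data:

```python
from datetime import datetime, timezone

def cdx_to_iso(ts: str) -> str:
    """Convert a raw CDX timestamp like '20260415082213' to ISO 8601."""
    dt = datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")

cdx_to_iso("20260415082213")  # "2026-04-15T08:22:13Z"
```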
📦 Sample records
✨ Why choose this Actor
| Capability | |
|---|---|
| 🆓 | Free public source. Reads the Common Crawl Index Server directly. |
| 🗂️ | Every monthly crawl. All collections from 2008 to today are queryable. |
| 📦 | WARC offsets. Each row tells you the exact byte range to fetch the original payload. |
| 📐 | CDX-style match types. Exact URL, prefix, host, or full domain. |
| 📄 | MIME and status filters. Slice the corpus by content type or HTTP status. |
| 🚀 | Fast runs. A typical 100-row pull finishes in 10 to 25 seconds. |
| 🛠️ | Live collection list. Latest crawl auto-detected at run time. |
📊 Common Crawl reports more than 250 billion pages indexed across all monthly crawls.
📈 How it compares to alternatives
| Approach | Cost | Coverage | Refresh | Filters | Setup |
|---|---|---|---|---|---|
| Direct CDX queries | Free | Full | Monthly | Manual | Engineer hours |
| Paid web index APIs | $$$ subscription | Partial | Daily | Built-in | Account setup |
| Self-hosted CC mirrors | Storage cost | Snapshot | Manual refresh | None | Infrastructure |
| ⭐ Common Crawl Index Scraper (this Actor) | Pay-per-event | Full | Monthly | Match type, MIME, status, collection | None |
Same index server Common Crawl publishes, exposed as clean structured records.
🚀 How to use
- 🆓 Create a free Apify account. Sign up here and get $5 in free credit.
- 🔍 Open the Actor. Search for "Common Crawl Index" in the Apify Store.
- ⚙️ Set your inputs. Pick the URL or domain, match type, and any filters.
- ▶️ Click Start. A 100-row run typically completes in 10 to 25 seconds.
- 📥 Download. Export as CSV, Excel, JSON, or XML.
⏱️ Total time from sign-up to first dataset: under five minutes.
💼 Business use cases
🌟 Beyond business use cases
Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.
🔌 Automating Common Crawl Index Scraper
Run this Actor on a schedule, from your codebase, or inside another tool:
- Node.js SDK: see Apify JavaScript client for programmatic runs.
- Python SDK: see Apify Python client for the same flow in Python.
- HTTP API: see Apify API docs for raw REST integration.
Schedule monthly runs from the Apify Console to track each new crawl. Pipe results into Google Sheets, S3, BigQuery, or your own webhook with the built-in integrations.
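For raw REST integration, Apify's synchronous run endpoint starts a run and returns the dataset items in one call. A sketch building that endpoint URL with only the standard library (the Actor ID below is a placeholder; copy the real one from this Actor's API tab):

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"

def run_sync_url(actor_id: str, token: str) -> str:
    """run-sync-get-dataset-items: start a run, wait, return its items."""
    return f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items?token={token}"

# Placeholder Actor ID in the `username~actor-name` API form:
endpoint = run_sync_url("parseforge~common-crawl-index-scraper",
                        "<YOUR_APIFY_TOKEN>")

run_input = {"maxItems": 100, "urlOrDomain": "apify.com",
             "matchType": "domain"}
# Uncomment to run for real:
# req = urllib.request.Request(endpoint,
#                              data=json.dumps(run_input).encode(),
#                              headers={"Content-Type": "application/json"})
# items = json.load(urllib.request.urlopen(req))
```

The Node.js and Python SDK clients linked above wrap this same endpoint with retries and typed results.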
❓ Frequently Asked Questions
🔌 Integrate with any app
- Make - drop run results into 1,800+ apps.
- Zapier - trigger automations off completed runs.
- Slack - post run summaries to a channel.
- Google Sheets - sync each run into a spreadsheet.
- Webhooks - notify your own services on run finish.
- Airbyte - load runs into Snowflake, BigQuery, or Postgres.
🔗 Recommended Actors
- 🕰️ Wayback Machine CDX Scraper - the Internet Archive's complementary historical web index.
- 🅱️ Bing Search Scraper - check current rank for URLs you find in CC.
- 🦆 DuckDuckGo Search Scraper - alternative SERP signal alongside crawl coverage.
- 📚 Wikipedia Pageviews Scraper - cross-reference web mentions with public-interest spikes.
- 🐙 GitHub Trending Repos Scraper - capture the developer-attention layer.
💡 Pro Tip: browse the complete ParseForge collection for more pre-built scrapers and data tools.
🆘 Need Help? Open our contact form and we'll route the question to the right person.
Common Crawl is a registered trademark of Common Crawl Foundation, a 501(c)(3) non-profit. This Actor is not affiliated with or endorsed by Common Crawl. It uses only the public Index Server endpoint and respects all published rate limits.