Common Crawl URL Index Lookup Scraper
Pricing
from $8.25 / 1,000 items
Pull every web page Common Crawl captured for a domain or URL prefix. Get timestamps, MIME types, status codes, content digests, and WARC offsets to fetch original payloads. Filter by collection, MIME, and status. Export to JSON, CSV, or Excel for large-scale web research and content discovery.
Developer: ParseForge (maintained by Community)

🌐 Common Crawl Index Scraper
🚀 List every web page Common Crawl captured for a domain or URL prefix. WARC offsets included so you can fetch the original payload from S3. No API key, no registration.
🕒 Last updated: 2026-05-01 · 📊 9 fields per record · 🗂️ 250+ billion pages indexed · 📅 monthly crawls since 2008 · 🆓 free public index
The Common Crawl Index Scraper queries the public Common Crawl Index Server and returns every page Common Crawl captured for a given domain or URL prefix. Each record includes the captured URL, ISO timestamp, MIME type, HTTP status code, content digest, byte length, WARC filename, byte offset into that file, and the source collection name.
Common Crawl runs a fresh public web crawl every month and indexes the results in a sortable URL-keyed index. The dataset has powered widely-cited research, Wikipedia-grade reference work, and the training corpus for many large language models. This Actor handles collection selection, MIME and status filters, pagination, and timestamp formatting so you can focus on the data.
| 🎯 Target Audience | 💡 Primary Use Cases |
|---|---|
| ML engineers, web researchers, SEO analysts, data scientists, academics | Training-data discovery, large-scale crawl filtering, archive lookup, content audits |
📋 What the Common Crawl Index Scraper does
Five filtering workflows in a single run:
- 🌐 Domain or prefix lookup. Submit a URL or prefix and pull every Common Crawl capture in the chosen collection.
- 🗂️ Collection selector. Pick a specific monthly crawl like `CC-MAIN-2026-04` or default to the latest.
- 📐 Match-type control. `exact`, `prefix`, `host`, or `domain`, like a CDX query.
- 📄 MIME and status filters. Restrict to HTML, JSON, image, or any specific status code.
- 📦 WARC offsets included. Every row tells you which WARC file holds the original payload and at what byte offset.
Each row reports the URL, ISO timestamp, MIME type, HTTP status, digest, byte length, WARC filename, byte offset, and the parent collection identifier.
💡 Why it matters: Common Crawl is the largest free web corpus in existence and the foundation of many open AI training datasets. Knowing whether a domain is even in the corpus, and at what depth, is a basic question for ML pretraining work, copyright analysis, and large-scale research. Direct CDX queries against the index server are doable but slow and finicky; this Actor wraps that in a clean filter UI.
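For context, a direct CDX query against the public index server can be sketched roughly as below. This is a minimal illustration, not the Actor's actual implementation: the helper name `build_cdx_url` is hypothetical, and the `filter==field:value` exact-match syntax is assumed from the pywb-style CDX API that the index server exposes.

```python
from urllib.parse import urlencode

INDEX_HOST = "https://index.commoncrawl.org"


def build_cdx_url(collection, url, match_type="domain",
                  mime=None, status=None, limit=None):
    """Build a CDX query URL for one collection (hypothetical helper)."""
    params = [("url", url), ("matchType", match_type), ("output", "json")]
    # A leading "=" requests an exact match on the field (assumed CDX syntax).
    if mime:
        params.append(("filter", f"=mime:{mime}"))
    if status:
        params.append(("filter", f"=status:{status}"))
    if limit:
        params.append(("limit", str(limit)))
    return f"{INDEX_HOST}/{collection}-index?{urlencode(params)}"


if __name__ == "__main__":
    print(build_cdx_url("CC-MAIN-2026-04", "apify.com",
                        mime="text/html", status="200", limit=500))
```

The function only builds the URL; fetching it returns one JSON object per line, which still needs paging, parsing, and error handling.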
🎬 Full Demo
🚧 Coming soon: a 3-minute walkthrough showing how to go from sign-up to a downloaded dataset.
⚙️ Input
| Input | Type | Default | Behavior |
|---|---|---|---|
| `maxItems` | integer | 10 | Records to return. Free plan caps at 10, paid plan at 1,000,000. |
| `urlOrDomain` | string | "apify.com" | Domain or URL prefix to look up. |
| `matchType` | string | "domain" | `exact`, `prefix`, `host`, or `domain`. |
| `collection` | string | latest available | Monthly crawl identifier like `CC-MAIN-2026-04`. |
| `mimeFilter` | string | empty | MIME type filter, e.g. `text/html`. |
| `statusFilter` | string | empty | HTTP status code filter, e.g. `200`. |
Example: every HTML page captured under apify.com in the CC-MAIN-2026-04 collection.
{"maxItems": 500,"urlOrDomain": "apify.com","matchType": "domain","collection": "CC-MAIN-2026-04","mimeFilter": "text/html","statusFilter": "200"}
Example: every capture of a single competitor URL.
{"maxItems": 100,"urlOrDomain": "competitor.com/pricing","matchType": "exact"}
⚠️ Good to Know: Common Crawl publishes one full crawl per month and the corresponding index. The collection list is fetched at run time from `index.commoncrawl.org/collinfo.json`, so the most recent crawl is always available. WARC paths in the output are relative to the Common Crawl S3 bucket; download with the standard AWS S3 tooling.
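Picking the latest collection from `collinfo.json` might look like the sketch below. It assumes each entry carries an `id` key such as `CC-MAIN-2026-04` and that the newest `CC-MAIN-` id sorts last lexicographically; the function names are hypothetical and the Actor may select differently.

```python
import json
from urllib.request import urlopen

COLLINFO_URL = "https://index.commoncrawl.org/collinfo.json"


def fetch_collections(url=COLLINFO_URL):
    """Download and parse the live collection list."""
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


def latest_collection(collections):
    """Return the newest CC-MAIN id (assumes ids sort chronologically)."""
    ids = [c["id"] for c in collections if c["id"].startswith("CC-MAIN-")]
    if not ids:
        raise ValueError("no CC-MAIN collections found")
    return max(ids)


if __name__ == "__main__":
    print(latest_collection(fetch_collections()))
```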
📊 Output
Each row contains 9 fields. Download as CSV, Excel, JSON, or XML.
🧾 Schema
| Field | Type | Example |
|---|---|---|
| 🔗 `url` | string | "https://apify.com/store" |
| 📅 `timestamp` | ISO 8601 | "2026-04-15T08:22:13Z" |
| 📄 `mimeType` | string | "text/html" |
| ✅ `statusCode` | integer | 200 |
| 🔐 `digest` | string | "AAB45HGJK..." |
| 📦 `length` | integer | 8421 |
| 📂 `filename` | string | "crawl-data/CC-MAIN-2026-04/segments/.../warc.gz" |
| 📌 `offset` | integer | 142893551 |
| 🗂️ `collection` | string | "CC-MAIN-2026-04" |
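As an illustration of how a raw index line could map onto this schema, here is a minimal sketch. It assumes the index server returns 14-digit `YYYYMMDDhhmmss` timestamps and string-valued `status`, `length`, and `offset` fields named as in its JSON output; the function names and the exact mapping are assumptions, not the Actor's confirmed code.

```python
from datetime import datetime, timezone


def cdx_timestamp_to_iso(ts):
    """Convert a 14-digit CDX timestamp to an ISO 8601 UTC string."""
    dt = datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")


def to_record(raw, collection):
    """Map one raw CDX JSON object to the 9-field output row (assumed mapping)."""
    return {
        "url": raw["url"],
        "timestamp": cdx_timestamp_to_iso(raw["timestamp"]),
        "mimeType": raw.get("mime"),
        "statusCode": int(raw["status"]),
        "digest": raw.get("digest"),
        "length": int(raw["length"]),
        "filename": raw["filename"],
        "offset": int(raw["offset"]),
        "collection": collection,
    }
```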
✨ Why choose this Actor
| | Capability |
|---|---|
| 🆓 | Free public source. Reads the Common Crawl Index Server directly. |
| 🗂️ | Every monthly crawl. All collections from 2008 to today are queryable. |
| 📦 | WARC offsets. Each row tells you the exact byte range to fetch the original payload. |
| 📐 | CDX-style match types. Exact URL, prefix, host, or full domain. |
| 📄 | MIME and status filters. Slice the corpus by content type or HTTP status. |
| 🚀 | Sub-30-second runs. Typical 100-row pulls finish in 10 to 25 seconds. |
| 🛠️ | Live collection list. Latest crawl auto-detected at run time. |
📊 Common Crawl reports more than 250 billion pages indexed across all monthly crawls.
📈 How it compares to alternatives
| Approach | Cost | Coverage | Refresh | Filters | Setup |
|---|---|---|---|---|---|
| Direct CDX queries | Free | Full | Monthly | Manual | Engineer hours |
| Paid web index APIs | $$$ subscription | Partial | Daily | Built-in | Account setup |
| Self-hosted CC mirrors | Storage cost | Snapshot | Manual refresh | None | Infrastructure |
| ⭐ Common Crawl Index Scraper (this Actor) | Pay-per-event | Full | Monthly | Match type, MIME, status, collection | None |
Same index server Common Crawl publishes, exposed as clean structured records.
🚀 How to use
- 🆓 Create a free Apify account. Sign up here and get $5 in free credit.
- 🔍 Open the Actor. Search for "Common Crawl Index" in the Apify Store.
- ⚙️ Set your inputs. Pick the URL or domain, match type, and any filters.
- ▶️ Click Start. A 100-row run typically completes in 10 to 25 seconds.
- 📥 Download. Export as CSV, Excel, JSON, or XML.
⏱️ Total time from sign-up to first dataset: under five minutes.
🌟 Beyond business use cases
Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.
🔌 Automating Common Crawl Index Scraper
Run this Actor on a schedule, from your codebase, or inside another tool:
- Node.js SDK: see Apify JavaScript client for programmatic runs.
- Python SDK: see Apify Python client for the same flow in Python.
- HTTP API: see Apify API docs for raw REST integration.
Schedule monthly runs from the Apify Console to track each new crawl. Pipe results into Google Sheets, S3, BigQuery, or your own webhook with the built-in integrations.
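A programmatic run from Python could look roughly like this. It is a sketch under stated assumptions: the `apify-client` package is installed, its `call` method returns a run dictionary with `defaultDatasetId` (as in the 1.x client), and `ACTOR_ID` is a placeholder you must replace with the Actor's real ID from the Apify Store. `build_run_input` and `run_actor` are hypothetical helper names.

```python
MATCH_TYPES = {"exact", "prefix", "host", "domain"}
ACTOR_ID = "<username>/<actor-name>"  # placeholder, not the real Actor ID


def build_run_input(url_or_domain, match_type="domain", max_items=10,
                    collection=None, mime_filter=None, status_filter=None):
    """Assemble the Actor input using the field names from the Input table."""
    if match_type not in MATCH_TYPES:
        raise ValueError(f"unknown matchType: {match_type}")
    run_input = {"maxItems": max_items, "urlOrDomain": url_or_domain,
                 "matchType": match_type}
    # Optional fields are omitted so the Actor falls back to its defaults.
    if collection:
        run_input["collection"] = collection
    if mime_filter:
        run_input["mimeFilter"] = mime_filter
    if status_filter:
        run_input["statusFilter"] = status_filter
    return run_input


def run_actor(token, run_input, actor_id=ACTOR_ID):
    """Start the Actor, wait for it to finish, and return the dataset rows."""
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Keeping input assembly separate from the network call makes the input easy to validate or log before spending credits on a run.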
❓ Frequently Asked Questions
🗂️ How do I pick a collection?
Each Common Crawl collection is identified by a tag like CC-MAIN-2026-04 (crawl year followed by a week number). The Actor fetches the live collection list and defaults to the latest one. Pass an explicit value if you want a historical snapshot.
📦 Can I download the actual page content?
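Each row gives you the WARC filename, byte offset, and record length. Use any AWS S3 client to fetch the byte range from offset through offset + length - 1 of s3://commoncrawl/{filename}; each record is stored as its own gzip member, so the range decompresses independently. Bulk WARC fetching is a separate workflow.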
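A single-record fetch can be sketched with the standard library alone. This example uses Common Crawl's public HTTPS endpoint `https://data.commoncrawl.org/` rather than an S3 client, and assumes `length` is the compressed record size reported by the index; the function names are hypothetical.

```python
import gzip
from urllib.request import Request, urlopen

DATA_HOST = "https://data.commoncrawl.org"


def range_header(offset, length):
    """HTTP Range value covering exactly one record (end byte is inclusive)."""
    return f"bytes={offset}-{offset + length - 1}"


def fetch_warc_record(filename, offset, length):
    """Download one gzipped WARC record and return its decompressed bytes."""
    req = Request(f"{DATA_HOST}/{filename}",
                  headers={"Range": range_header(offset, length)})
    with urlopen(req, timeout=60) as resp:
        return gzip.decompress(resp.read())
```

With the schema's example values (`offset` 142893551, `length` 8421), the header is `bytes=142893551-142901971`.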
🔍 What is the difference between match types?
exact matches one URL only. prefix matches a URL plus everything beneath it. host matches one hostname. domain matches the host plus all subdomains.
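The four match types can be approximated locally as in the sketch below. This is a simplification for illustration only: it ignores scheme and query string, and the real index server compares canonicalized URL keys rather than raw strings. The `matches` function is hypothetical.

```python
from urllib.parse import urlsplit


def _host_and_path(url):
    """Split a URL (with or without scheme) into lowercase host and path."""
    parts = urlsplit(url if "://" in url else "//" + url)
    return parts.hostname or "", parts.path or "/"


def matches(captured_url, query, match_type):
    """Approximate whether a captured URL satisfies a query under a match type."""
    host, path = _host_and_path(captured_url)
    q_host, q_path = _host_and_path(query)
    if match_type == "exact":
        return host == q_host and path == q_path
    if match_type == "prefix":
        return host == q_host and path.startswith(q_path)
    if match_type == "host":
        return host == q_host
    if match_type == "domain":
        # The host itself, or any subdomain of it.
        return host == q_host or host.endswith("." + q_host)
    raise ValueError(f"unknown match type: {match_type}")
```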
📅 How often does Common Crawl refresh?
Roughly once per month. The newest collection is detected automatically at run time.
📦 How many rows can I pull at once?
Free plan caps at 10. Paid plans go up to 1,000,000. Very broad queries can return millions of rows; always set sensible filters.
📄 Why does my run return zero rows?
Common Crawl indexes a sample, not the entire web. Smaller sites may not be in any given monthly collection. Try a broader match type or a different collection.
📅 Can I query historical collections?
Yes. Pass any past collection identifier (e.g. CC-MAIN-2020-50). The Actor returns the snapshot from that crawl.
💼 Can I use this for commercial work?
Yes. The Common Crawl dataset is published under terms that allow commercial use. Always cite Common Crawl as the source.
💳 Do I need a paid Apify plan?
The free plan returns up to 10 rows per run. Paid plans return up to 1,000,000.
⚠️ What if a run fails or returns empty?
Common Crawl's index server occasionally rate-limits very wide queries. Narrow the match type or retry after a short wait. If the issue persists, open our contact form and include the run URL.
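If you query the index server directly, retrying with exponential backoff is one common way to cope with rate limiting. The sketch below is a generic pattern, not the Actor's confirmed behavior; `with_retries` is a hypothetical helper, and the delays are arbitrary starting points.

```python
import time


def with_retries(fetch, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(), retrying on network errors with doubling delays."""
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError:  # urllib's URLError and HTTPError both derive from OSError
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real waiting.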
🔁 How fresh is the data?
Each run hits the live index server, which is updated when Common Crawl publishes a new monthly crawl.
⚖️ Is this legal?
Yes. Common Crawl publishes the index server for exactly this kind of programmatic access, and the dataset is released under terms that permit research and commercial use.
🔌 Integrate with any app
- Make - drop run results into 1,800+ apps.
- Zapier - trigger automations off completed runs.
- Slack - post run summaries to a channel.
- Google Sheets - sync each run into a spreadsheet.
- Webhooks - notify your own services on run finish.
- Airbyte - load runs into Snowflake, BigQuery, or Postgres.
🔗 Recommended Actors
- 🕰️ Wayback Machine CDX Scraper - the Internet Archive's complementary historical web index.
- 🅱️ Bing Search Scraper - check current rank for URLs you find in CC.
- 🦆 DuckDuckGo Search Scraper - alternative SERP signal alongside crawl coverage.
- 📚 Wikipedia Pageviews Scraper - cross-reference web mentions with public-interest spikes.
- 🐙 GitHub Trending Repos Scraper - capture the developer-attention layer.
💡 Pro Tip: browse the complete ParseForge collection for more pre-built scrapers and data tools.
🆘 Need Help? Open our contact form and we'll route the question to the right person.
Common Crawl is a registered trademark of Common Crawl Foundation, a 501(c)(3) non-profit. This Actor is not affiliated with or endorsed by Common Crawl. It uses only the public Index Server endpoint and respects all published rate limits.


