
Common Crawl URL Index Lookup Scraper

Pricing

from $8.25 / 1,000 items

Go to Apify Store
Pull every web page Common Crawl captured for a domain or URL prefix. Get timestamps, MIME types, status codes, content digests, and WARC offsets to fetch original payloads. Filter by collection, MIME, and status. Export to JSON, CSV, or Excel for large-scale web research and content discovery.

Rating: 0.0 (0 reviews)

Developer: ParseForge (Maintained by Community)

Actor stats: 0 bookmarked · 2 total users · 1 monthly active user · last modified a day ago


🌐 Common Crawl Index Scraper

🚀 List every web page Common Crawl captured for a domain or URL prefix. WARC offsets included so you can fetch the original payload from S3. No API key, no registration.

🕒 Last updated: 2026-05-01 · 📊 9 fields per record · 🗂️ 250+ billion pages indexed · 📅 monthly crawls since 2008 · 🆓 free public index

The Common Crawl Index Scraper queries the public Common Crawl Index Server and returns every page Common Crawl captured for a given domain or URL prefix. Each record includes the captured URL, ISO timestamp, MIME type, HTTP status code, content digest, byte length, WARC filename, byte offset into that file, and the source collection name.

Common Crawl runs a fresh public web crawl every month and indexes the results in a sortable URL-keyed index. The dataset has powered widely-cited research, Wikipedia-grade reference work, and the training corpus for many large language models. This Actor handles collection selection, MIME and status filters, pagination, and timestamp formatting so you can focus on the data.
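Under the hood, each lookup is a plain CDX-style HTTP query against the public index server. A minimal sketch of how such a query URL is assembled (the endpoint pattern and parameter names follow the public CDX server conventions; the helper name is ours):

```python
from urllib.parse import urlencode

# Public Common Crawl CDX index server.
INDEX_HOST = "https://index.commoncrawl.org"

def build_cdx_url(collection, url, match_type="domain",
                  mime_filter=None, status_filter=None, limit=100):
    """Build a CDX query URL for one monthly collection."""
    params = {
        "url": url,
        "matchType": match_type,   # exact | prefix | host | domain
        "output": "json",          # one JSON object per result line
        "limit": limit,
    }
    query = urlencode(params)
    # CDX filters use the "field:value" form and may repeat.
    if mime_filter:
        query += "&" + urlencode({"filter": f"mime:{mime_filter}"})
    if status_filter:
        query += "&" + urlencode({"filter": f"status:{status_filter}"})
    return f"{INDEX_HOST}/{collection}-index?{query}"

print(build_cdx_url("CC-MAIN-2026-04", "apify.com",
                    mime_filter="text/html", status_filter="200"))
```

The Actor layers collection discovery, pagination, and result normalization on top of queries shaped like this.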

| 🎯 Target Audience | 💡 Primary Use Cases |
| --- | --- |
| ML engineers, web researchers, SEO analysts, data scientists, academics | Training-data discovery, large-scale crawl filtering, archive lookup, content audits |

📋 What the Common Crawl Index Scraper does

Five filtering workflows in a single run:

  • 🌐 Domain or prefix lookup. Submit a URL or prefix and pull every Common Crawl capture in the chosen collection.
  • 🗂️ Collection selector. Pick a specific monthly crawl like CC-MAIN-2026-04 or default to the latest.
  • 📐 Match-type control. Choose exact, prefix, host, or domain, just as in a CDX query.
  • 📄 MIME and status filters. Restrict to HTML, JSON, image, or any specific status code.
  • 📦 WARC offsets included. Every row tells you which WARC file holds the original payload and at what byte offset.

Each row reports the URL, ISO timestamp, MIME type, HTTP status, digest, byte length, WARC filename, byte offset, and the parent collection identifier.
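The raw CDX index stores timestamps as 14-digit `yyyymmddhhmmss` strings in UTC; the ISO timestamps in the output correspond to a conversion like this (the helper name is ours):

```python
from datetime import datetime, timezone

def cc_timestamp_to_iso(ts: str) -> str:
    """Convert a raw CDX timestamp (yyyymmddhhmmss, UTC) to ISO 8601."""
    dt = datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")

print(cc_timestamp_to_iso("20260415082213"))  # → 2026-04-15T08:22:13Z
```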

💡 Why it matters: Common Crawl is the largest free web corpus in existence and the foundation of many open AI training datasets. Knowing whether a domain is even in the corpus, and at what depth, is a basic question for ML pretraining work, copyright analysis, and large-scale research. Direct CDX queries against the index server are doable but slow and finicky; this Actor wraps that in a clean filter UI.


🎬 Full Demo

🚧 Coming soon: a 3-minute walkthrough showing how to go from sign-up to a downloaded dataset.


⚙️ Input

| Input | Type | Default | Behavior |
| --- | --- | --- | --- |
| maxItems | integer | 10 | Records to return. Free plan caps at 10, paid plan at 1,000,000. |
| urlOrDomain | string | "apify.com" | Domain or URL prefix to look up. |
| matchType | string | "domain" | One of exact, prefix, host, or domain. |
| collection | string | latest available | Monthly crawl identifier like CC-MAIN-2026-04. |
| mimeFilter | string | empty | MIME type filter, e.g. text/html. |
| statusFilter | string | empty | HTTP status code filter, e.g. 200. |

Example: every HTML page captured under apify.com in April 2026.

```json
{
  "maxItems": 500,
  "urlOrDomain": "apify.com",
  "matchType": "domain",
  "collection": "CC-MAIN-2026-04",
  "mimeFilter": "text/html",
  "statusFilter": "200"
}
```

Example: every capture of a single competitor URL.

```json
{
  "maxItems": 100,
  "urlOrDomain": "competitor.com/pricing",
  "matchType": "exact"
}
```

⚠️ Good to Know: Common Crawl publishes one full crawl per month and the corresponding index. The collection list is fetched at run time from index.commoncrawl.org/collinfo.json, so the most recent crawl is always available. WARC paths in the output are relative to the Common Crawl S3 bucket; download with the standard AWS S3 tooling.
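The filename, offset, and length fields map directly to an HTTP Range request against the public HTTPS mirror at data.commoncrawl.org; each WARC record is its own gzip member, so the fetched bytes decompress independently. A sketch (function names are ours):

```python
import gzip
import urllib.request

DATA_HOST = "https://data.commoncrawl.org"

def warc_range_header(offset: int, length: int) -> str:
    """HTTP Range header covering one WARC record (end byte is inclusive)."""
    return f"bytes={offset}-{offset + length - 1}"

def fetch_warc_record(filename: str, offset: int, length: int) -> bytes:
    """Fetch and decompress a single WARC record from the public mirror."""
    req = urllib.request.Request(
        f"{DATA_HOST}/{filename}",
        headers={"Range": warc_range_header(offset, length)},
    )
    with urllib.request.urlopen(req) as resp:
        # The server answers 206 Partial Content with exactly this record.
        return gzip.decompress(resp.read())
```

For example, a row with `offset` 142893551 and `length` 8421 yields the header `bytes=142893551-142901971`.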


📊 Output

Each row contains 9 fields. Download as CSV, Excel, JSON, or XML.

🧾 Schema

| Field | Type | Example |
| --- | --- | --- |
| 🔗 url | string | "https://apify.com/store" |
| 📅 timestamp | ISO 8601 | "2026-04-15T08:22:13Z" |
| 📄 mimeType | string | "text/html" |
| statusCode | integer | 200 |
| 🔐 digest | string | "AAB45HGJK..." |
| 📦 length | integer | 8421 |
| 📂 filename | string | "crawl-data/CC-MAIN-2026-04/segments/.../warc.gz" |
| 📌 offset | integer | 142893551 |
| 🗂️ collection | string | "CC-MAIN-2026-04" |

📦 Sample records

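A sample record, assembled from the example values in the schema above:

```json
{
  "url": "https://apify.com/store",
  "timestamp": "2026-04-15T08:22:13Z",
  "mimeType": "text/html",
  "statusCode": 200,
  "digest": "AAB45HGJK...",
  "length": 8421,
  "filename": "crawl-data/CC-MAIN-2026-04/segments/.../warc.gz",
  "offset": 142893551,
  "collection": "CC-MAIN-2026-04"
}
```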

✨ Why choose this Actor

  • 🆓 Free public source. Reads the Common Crawl Index Server directly.
  • 🗂️ Every monthly crawl. All collections from 2008 to today are queryable.
  • 📦 WARC offsets. Each row tells you the exact byte range to fetch the original payload.
  • 📐 CDX-style match types. Exact URL, prefix, host, or full domain.
  • 📄 MIME and status filters. Slice the corpus by content type or HTTP status.
  • 🚀 Sub-30-second runs. A typical 100-row pull finishes in under 30 seconds.
  • 🛠️ Live collection list. The latest crawl is auto-detected at run time.

📊 Common Crawl reports more than 250 billion pages indexed across all monthly crawls.


📈 How it compares to alternatives

| Approach | Cost | Coverage | Refresh | Filters | Setup |
| --- | --- | --- | --- | --- | --- |
| Direct CDX queries | Free | Full | Monthly | Manual | Engineer hours |
| Paid web index APIs | $$$ subscription | Partial | Daily | Built-in | Account setup |
| Self-hosted CC mirrors | Storage cost | Snapshot | Manual refresh | None | Infrastructure |
| ⭐ Common Crawl Index Scraper (this Actor) | Pay-per-event | Full | Monthly | Match type, MIME, status, collection | None |

Same index server Common Crawl publishes, exposed as clean structured records.


🚀 How to use

  1. 🆓 Create a free Apify account. Sign up here and get $5 in free credit.
  2. 🔍 Open the Actor. Search for "Common Crawl Index" in the Apify Store.
  3. ⚙️ Set your inputs. Pick the URL or domain, match type, and any filters.
  4. ▶️ Click Start. A 100-row run typically completes in 10 to 25 seconds.
  5. 📥 Download. Export as CSV, Excel, JSON, or XML.

⏱️ Total time from sign-up to first dataset: under five minutes.


💼 Business use cases

🤖 ML & data science

  • Check if a domain is in a training corpus
  • Estimate copyright exposure for LLM datasets
  • Build domain-specific subcorpora from CC
  • Cross-reference scraped data with public capture records

📈 SEO & competitive

  • Map a competitor's full URL space
  • Audit which pages CC sees vs Google
  • Discover legacy paths through historical crawls
  • Track structural changes month over month

🛡️ Security & OSINT

  • Map historical attack surface of a target domain
  • Find leaked URLs that are no longer linked
  • Track CDN and origin host changes
  • Identify abandoned subdomains

📰 Research & journalism

  • Cite specific captures with stable WARC offsets
  • Run reproducible studies on CC subsets
  • Compare crawl coverage of different topic spaces
  • Build longitudinal datasets month over month

🌟 Beyond business use cases

Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.

🎓 Research and academia

  • Empirical datasets for papers, thesis work, and coursework
  • Longitudinal studies tracking changes across snapshots
  • Reproducible research with cited, versioned data pulls
  • Classroom exercises on data analysis and ethical scraping

🎨 Personal and creative

  • Side projects, portfolio demos, and indie app launches
  • Data visualizations, dashboards, and infographics
  • Content research for bloggers, YouTubers, and podcasters
  • Hobbyist collections and personal trackers

🤝 Non-profit and civic

  • Transparency reporting and accountability projects
  • Advocacy campaigns backed by public-interest data
  • Community-run databases for local issues
  • Investigative journalism on public records

🧪 Experimentation

  • Prototype AI and machine-learning pipelines with real data
  • Validate product-market hypotheses before engineering spend
  • Train small domain-specific models on niche corpora
  • Test dashboard concepts with live input

🔌 Automating Common Crawl Index Scraper

Run this Actor on a schedule, from your codebase, or inside another tool:

Schedule monthly runs from the Apify Console to track each new crawl. Pipe results into Google Sheets, S3, BigQuery, or your own webhook with the built-in integrations.
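From code, a run is one API call. A sketch using the official apify-client package (the Actor ID and token below are placeholders — copy the real Actor ID from the Store page):

```python
# pip install apify-client
# from apify_client import ApifyClient

def build_run_input(url_or_domain: str, collection: str,
                    max_items: int = 1000) -> dict:
    """Assemble the Actor input for one monthly collection."""
    return {
        "maxItems": max_items,
        "urlOrDomain": url_or_domain,
        "matchType": "domain",
        "collection": collection,
        "mimeFilter": "text/html",
        "statusFilter": "200",
    }

# Sketch of the call itself (requires an Apify API token):
# client = ApifyClient("<APIFY_TOKEN>")
# run = client.actor("<ACTOR_ID>").call(
#     run_input=build_run_input("apify.com", "CC-MAIN-2026-04"))
# for item in client.dataset(run["defaultDatasetId"]).iterate_items():
#     print(item["url"], item["timestamp"])
```

Swapping the `collection` value each month gives a simple longitudinal pipeline.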


🔌 Integrate with any app

  • Make - drop run results into 1,800+ apps.
  • Zapier - trigger automations off completed runs.
  • Slack - post run summaries to a channel.
  • Google Sheets - sync each run into a spreadsheet.
  • Webhooks - notify your own services on run finish.
  • Airbyte - load runs into Snowflake, BigQuery, or Postgres.

💡 Pro Tip: browse the complete ParseForge collection for more pre-built scrapers and data tools.


🆘 Need Help? Open our contact form and we'll route the question to the right person.


Common Crawl is a registered trademark of Common Crawl Foundation, a 501(c)(3) non-profit. This Actor is not affiliated with or endorsed by Common Crawl. It uses only the public Index Server endpoint and respects all published rate limits.