
Common Crawl URL Index Lookup Scraper

Pricing

from $8.25 / 1,000 items

Go to Apify Store
Pull every web page Common Crawl captured for a domain or URL prefix. Get timestamps, MIME types, status codes, content digests, and WARC offsets to fetch original payloads. Filter by collection, MIME, and status. Export to JSON, CSV, or Excel for large-scale web research and content discovery.

Rating: 0.0 (0 reviews)

Developer: ParseForge (Maintained by Community)

Actor stats: 0 bookmarked · 2 total users · 1 monthly active user · last modified a day ago


🌐 Common Crawl Index Scraper

🚀 List every web page Common Crawl captured for a domain or URL prefix. WARC offsets included so you can fetch the original payload from S3. No API key, no registration.

🕒 Last updated: 2026-05-01 · 📊 9 fields per record · 🗂️ 250+ billion pages indexed · 📅 monthly crawls since 2008 · 🆓 free public index

The Common Crawl Index Scraper queries the public Common Crawl Index Server and returns every page Common Crawl captured for a given domain or URL prefix. Each record includes the captured URL, ISO timestamp, MIME type, HTTP status code, content digest, byte length, WARC filename, byte offset into that file, and the source collection name.

Common Crawl runs a fresh public web crawl every month and indexes the results in a sortable URL-keyed index. The dataset has powered widely-cited research, Wikipedia-grade reference work, and the training corpus for many large language models. This Actor handles collection selection, MIME and status filters, pagination, and timestamp formatting so you can focus on the data.
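Under the hood, each lookup is a plain CDX-style HTTP query against the public index server. A minimal sketch of how such a query URL is assembled (the endpoint pattern and parameter names follow the public CDX server conventions; the helper name is ours):

```python
from urllib.parse import urlencode

# Public Common Crawl CDX index server.
INDEX_HOST = "https://index.commoncrawl.org"

def build_cdx_url(collection, url, match_type="domain",
                  mime_filter=None, status_filter=None, limit=100):
    """Build a CDX query URL for one monthly collection."""
    params = {
        "url": url,
        "matchType": match_type,   # exact | prefix | host | domain
        "output": "json",          # one JSON object per result line
        "limit": limit,
    }
    query = urlencode(params)
    # CDX filters use the "field:value" form and may repeat.
    if mime_filter:
        query += "&" + urlencode({"filter": f"mime:{mime_filter}"})
    if status_filter:
        query += "&" + urlencode({"filter": f"status:{status_filter}"})
    return f"{INDEX_HOST}/{collection}-index?{query}"

print(build_cdx_url("CC-MAIN-2026-04", "apify.com",
                    mime_filter="text/html", status_filter="200"))
```

The Actor layers collection discovery, pagination, and result normalization on top of queries shaped like this.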

| 🎯 Target Audience | 💡 Primary Use Cases |
| --- | --- |
| ML engineers, web researchers, SEO analysts, data scientists, academics | Training-data discovery, large-scale crawl filtering, archive lookup, content audits |

📋 What the Common Crawl Index Scraper does

Five filtering workflows in a single run:

  • 🌐 Domain or prefix lookup. Submit a URL or prefix and pull every Common Crawl capture in the chosen collection.
  • 🗂️ Collection selector. Pick a specific monthly crawl like CC-MAIN-2026-04 or default to the latest.
  • 📐 Match-type control. Choose exact, prefix, host, or domain, just as in a CDX query.
  • 📄 MIME and status filters. Restrict to HTML, JSON, image, or any specific status code.
  • 📦 WARC offsets included. Every row tells you which WARC file holds the original payload and at what byte offset.

Each row reports the URL, ISO timestamp, MIME type, HTTP status, digest, byte length, WARC filename, byte offset, and the parent collection identifier.
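The raw CDX index stores timestamps as 14-digit `yyyymmddhhmmss` strings in UTC; the ISO timestamps in the output correspond to a conversion like this (the helper name is ours):

```python
from datetime import datetime, timezone

def cc_timestamp_to_iso(ts: str) -> str:
    """Convert a raw CDX timestamp (yyyymmddhhmmss, UTC) to ISO 8601."""
    dt = datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")

print(cc_timestamp_to_iso("20260415082213"))  # → 2026-04-15T08:22:13Z
```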

💡 Why it matters: Common Crawl is the largest free web corpus in existence and the foundation of many open AI training datasets. Knowing whether a domain is even in the corpus, and at what depth, is a basic question for ML pretraining work, copyright analysis, and large-scale research. Direct CDX queries against the index server are doable but slow and finicky; this Actor wraps that in a clean filter UI.


🎬 Full Demo

🚧 Coming soon: a 3-minute walkthrough showing how to go from sign-up to a downloaded dataset.


⚙️ Input

| Input | Type | Default | Behavior |
| --- | --- | --- | --- |
| maxItems | integer | 10 | Records to return. Free plan caps at 10, paid plan at 1,000,000. |
| urlOrDomain | string | "apify.com" | Domain or URL prefix to look up. |
| matchType | string | "domain" | One of exact, prefix, host, or domain. |
| collection | string | latest available | Monthly crawl identifier like CC-MAIN-2026-04. |
| mimeFilter | string | empty | MIME type filter, e.g. text/html. |
| statusFilter | string | empty | HTTP status code filter, e.g. 200. |

Example: every HTML page captured under apify.com in April 2026.

```json
{
  "maxItems": 500,
  "urlOrDomain": "apify.com",
  "matchType": "domain",
  "collection": "CC-MAIN-2026-04",
  "mimeFilter": "text/html",
  "statusFilter": "200"
}
```

Example: every capture of a single competitor URL.

```json
{
  "maxItems": 100,
  "urlOrDomain": "competitor.com/pricing",
  "matchType": "exact"
}
```

⚠️ Good to Know: Common Crawl publishes one full crawl per month and the corresponding index. The collection list is fetched at run time from index.commoncrawl.org/collinfo.json, so the most recent crawl is always available. WARC paths in the output are relative to the Common Crawl S3 bucket; download with the standard AWS S3 tooling.
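The filename, offset, and length fields map directly to an HTTP Range request against the public HTTPS mirror at data.commoncrawl.org; each WARC record is its own gzip member, so the fetched bytes decompress independently. A sketch (function names are ours):

```python
import gzip
import urllib.request

DATA_HOST = "https://data.commoncrawl.org"

def warc_range_header(offset: int, length: int) -> str:
    """HTTP Range header covering one WARC record (end byte is inclusive)."""
    return f"bytes={offset}-{offset + length - 1}"

def fetch_warc_record(filename: str, offset: int, length: int) -> bytes:
    """Fetch and decompress a single WARC record from the public mirror."""
    req = urllib.request.Request(
        f"{DATA_HOST}/{filename}",
        headers={"Range": warc_range_header(offset, length)},
    )
    with urllib.request.urlopen(req) as resp:
        # The server answers 206 Partial Content with exactly this record.
        return gzip.decompress(resp.read())
```

For example, a row with `offset` 142893551 and `length` 8421 yields the header `bytes=142893551-142901971`.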


📊 Output

Each row contains 9 fields. Download as CSV, Excel, JSON, or XML.

🧾 Schema

| Field | Type | Example |
| --- | --- | --- |
| 🔗 url | string | "https://apify.com/store" |
| 📅 timestamp | ISO 8601 | "2026-04-15T08:22:13Z" |
| 📄 mimeType | string | "text/html" |
| statusCode | integer | 200 |
| 🔐 digest | string | "AAB45HGJK..." |
| 📦 length | integer | 8421 |
| 📂 filename | string | "crawl-data/CC-MAIN-2026-04/segments/.../warc.gz" |
| 📌 offset | integer | 142893551 |
| 🗂️ collection | string | "CC-MAIN-2026-04" |

📦 Sample records

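A sample record, assembled from the example values in the schema above:

```json
{
  "url": "https://apify.com/store",
  "timestamp": "2026-04-15T08:22:13Z",
  "mimeType": "text/html",
  "statusCode": 200,
  "digest": "AAB45HGJK...",
  "length": 8421,
  "filename": "crawl-data/CC-MAIN-2026-04/segments/.../warc.gz",
  "offset": 142893551,
  "collection": "CC-MAIN-2026-04"
}
```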

✨ Why choose this Actor

  • 🆓 Free public source. Reads the Common Crawl Index Server directly.
  • 🗂️ Every monthly crawl. All collections from 2008 to today are queryable.
  • 📦 WARC offsets. Each row tells you the exact byte range to fetch the original payload.
  • 📐 CDX-style match types. Exact URL, prefix, host, or full domain.
  • 📄 MIME and status filters. Slice the corpus by content type or HTTP status.
  • 🚀 Sub-30-second runs. A typical 100-row pull finishes in under 30 seconds.
  • 🛠️ Live collection list. The latest crawl is auto-detected at run time.

📊 Common Crawl reports more than 250 billion pages indexed across all monthly crawls.


📈 How it compares to alternatives

| Approach | Cost | Coverage | Refresh | Filters | Setup |
| --- | --- | --- | --- | --- | --- |
| Direct CDX queries | Free | Full | Monthly | Manual | Engineer hours |
| Paid web index APIs | $$$ subscription | Partial | Daily | Built-in | Account setup |
| Self-hosted CC mirrors | Storage cost | Snapshot | Manual refresh | None | Infrastructure |
| ⭐ Common Crawl Index Scraper (this Actor) | Pay-per-event | Full | Monthly | Match type, MIME, status, collection | None |

Same index server Common Crawl publishes, exposed as clean structured records.


🚀 How to use

  1. 🆓 Create a free Apify account. Sign up here and get $5 in free credit.
  2. 🔍 Open the Actor. Search for "Common Crawl Index" in the Apify Store.
  3. ⚙️ Set your inputs. Pick the URL or domain, match type, and any filters.
  4. ▶️ Click Start. A 100-row run typically completes in 10 to 25 seconds.
  5. 📥 Download. Export as CSV, Excel, JSON, or XML.

⏱️ Total time from sign-up to first dataset: under five minutes.


💼 Business use cases

🤖 ML & data science

  • Check if a domain is in a training corpus
  • Estimate copyright exposure for LLM datasets
  • Build domain-specific subcorpora from CC
  • Cross-reference scraped data with public capture records

📈 SEO & competitive

  • Map a competitor's full URL space
  • Audit which pages CC sees vs Google
  • Discover legacy paths through historical crawls
  • Track structural changes month over month

🛡️ Security & OSINT

  • Map historical attack surface of a target domain
  • Find leaked URLs that are no longer linked
  • Track CDN and origin host changes
  • Identify abandoned subdomains

📰 Research & journalism

  • Cite specific captures with stable WARC offsets
  • Run reproducible studies on CC subsets
  • Compare crawl coverage of different topic spaces
  • Build longitudinal datasets month over month

🌟 Beyond business use cases

Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.

🎓 Research and academia

  • Empirical datasets for papers, thesis work, and coursework
  • Longitudinal studies tracking changes across snapshots
  • Reproducible research with cited, versioned data pulls
  • Classroom exercises on data analysis and ethical scraping

🎨 Personal and creative

  • Side projects, portfolio demos, and indie app launches
  • Data visualizations, dashboards, and infographics
  • Content research for bloggers, YouTubers, and podcasters
  • Hobbyist collections and personal trackers

🤝 Non-profit and civic

  • Transparency reporting and accountability projects
  • Advocacy campaigns backed by public-interest data
  • Community-run databases for local issues
  • Investigative journalism on public records

🧪 Experimentation

  • Prototype AI and machine-learning pipelines with real data
  • Validate product-market hypotheses before engineering spend
  • Train small domain-specific models on niche corpora
  • Test dashboard concepts with live input

🔌 Automating Common Crawl Index Scraper

Run this Actor on a schedule, from your codebase, or inside another tool:

Schedule monthly runs from the Apify Console to track each new crawl. Pipe results into Google Sheets, S3, BigQuery, or your own webhook with the built-in integrations.
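From code, a run is one API call. A sketch using the official apify-client package (the Actor ID and token below are placeholders — copy the real Actor ID from the Store page):

```python
# pip install apify-client
# from apify_client import ApifyClient

def build_run_input(url_or_domain: str, collection: str,
                    max_items: int = 1000) -> dict:
    """Assemble the Actor input for one monthly collection."""
    return {
        "maxItems": max_items,
        "urlOrDomain": url_or_domain,
        "matchType": "domain",
        "collection": collection,
        "mimeFilter": "text/html",
        "statusFilter": "200",
    }

# Sketch of the call itself (requires an Apify API token):
# client = ApifyClient("<APIFY_TOKEN>")
# run = client.actor("<ACTOR_ID>").call(
#     run_input=build_run_input("apify.com", "CC-MAIN-2026-04"))
# for item in client.dataset(run["defaultDatasetId"]).iterate_items():
#     print(item["url"], item["timestamp"])
```

Swapping the `collection` value each month gives a simple longitudinal pipeline.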


🔌 Integrate with any app

  • Make - drop run results into 1,800+ apps.
  • Zapier - trigger automations off completed runs.
  • Slack - post run summaries to a channel.
  • Google Sheets - sync each run into a spreadsheet.
  • Webhooks - notify your own services on run finish.
  • Airbyte - load runs into Snowflake, BigQuery, or Postgres.

💡 Pro Tip: browse the complete ParseForge collection for more pre-built scrapers and data tools.


🆘 Need Help? Open our contact form and we'll route the question to the right person.


Common Crawl is a registered trademark of Common Crawl Foundation, a 501(c)(3) non-profit. This Actor is not affiliated with or endorsed by Common Crawl. It uses only the public Index Server endpoint and respects all published rate limits.