Common Crawl Scraper

Pricing

from $3.00 / 1,000 results

Query the Common Crawl URL Index for any domain or URL pattern. Discover a site's archived pages, historical URLs, capture dates, HTTP statuses and MIME types for SEO, domain intelligence and research. Also lists the available monthly crawls.

Rating

0.0 (0)

Developer

Crawler Bros

Maintained by Community

Actor stats

Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 2 days ago

Query the Common Crawl URL Index for any domain or URL pattern and get back every archived capture — the historical URLs a site exposed, when they were crawled, their HTTP status and MIME type, and the exact location of the raw page in Common Crawl's public archive. Perfect for SEO audits, domain intelligence, historical URL discovery and web-scale research. HTTP-only, no login, no proxy.

What this actor does

  • Two modes: urlCaptures and listCrawls
  • Wildcard lookups: example.com, example.com/*, en.wikipedia.org/wiki/*, *.example.com
  • 125+ monthly crawls: query the latest crawl or any historical one
  • Server-side filters: date range; client-side filters: HTTP status, MIME type
  • WARC location for every capture so you can fetch the raw archived page
  • Empty fields are omitted, so every field that appears in a record is populated

Modes

| Mode | What it does | Needs |
|---|---|---|
| urlCaptures | Look up all archived captures for a domain / URL pattern | urlPattern (+ optional crawl, filters) |
| listCrawls | List every available Common Crawl monthly crawl | (none) |

Output — urlCaptures (one row per archived capture)

  • url — the archived URL
  • urlKey — Common Crawl's canonical (SURT) key
  • timestamp — capture time, YYYYMMDDHHMMSS
  • captureDate — the same time as ISO 8601
  • status — HTTP status at capture time
  • mime — declared MIME type
  • mimeDetected — MIME type detected from content
  • digest — content digest (dedupe identical pages)
  • length — record byte length
  • offset — byte offset within the WARC file
  • filename — WARC file path in the archive
  • languages — detected language codes
  • encoding — character encoding
  • redirectUrl — redirect target (for 3xx captures)
  • truncated — truncation reason (when present)
  • crawlId — which crawl this came from
  • warcUrl — direct link to the WARC file on data.commoncrawl.org
  • recordType: "capture", sourceUrl, scrapedAt
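
The relationship between timestamp and captureDate in the list above can be sketched in a few lines of Python. This is an illustration only: it assumes the 14-digit timestamp is in UTC, and the exact ISO 8601 rendering the actor emits (for example a trailing Z instead of +00:00) may differ. The helper name capture_date is hypothetical.

```python
from datetime import datetime, timezone


def capture_date(timestamp: str) -> str:
    """Convert a YYYYMMDDHHMMSS capture timestamp to ISO 8601.

    Assumption: the timestamp is expressed in UTC.
    """
    parsed = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc).isoformat()


print(capture_date("20240301123045"))  # 2024-03-01T12:30:45+00:00
```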

Output — listCrawls (one row per crawl)

  • crawlId — e.g. CC-MAIN-2024-10
  • name — human-readable name (e.g. February/March 2024 Index)
  • fromDate, toDate — crawl time window
  • cdxApiUrl — the crawl's index API endpoint
  • timegateUrl — the crawl's timegate
  • recordType: "crawl", sourceUrl, scrapedAt

Input

| Field | Type | Default | Description |
|---|---|---|---|
| mode | string | urlCaptures | urlCaptures / listCrawls |
| urlPattern | string | en.wikipedia.org/wiki/* | Domain or URL, * wildcards allowed |
| crawl | string | latest | latest or a crawl id like CC-MAIN-2024-10 |
| matchType | string | auto | auto / exact / prefix / host / domain |
| statusFilter | int | | Keep only this HTTP status (e.g. 200) |
| mimeFilter | string | | Keep only MIME types containing this text |
| fromDate | string | | YYYYMMDD lower bound |
| toDate | string | | YYYYMMDD upper bound |
| maxItems | int | 100 | Hard cap (1–5000) |
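
Before starting a run, it can help to check an input object against the constraints listed in the table above. The following is a minimal sketch, not the actor's own validation: the function name validate_input is hypothetical, and the checks cover only what the table states (allowed values for mode and matchType, the YYYYMMDD date format, and the 1–5000 range for maxItems).

```python
import re

MODES = {"urlCaptures", "listCrawls"}
MATCH_TYPES = {"auto", "exact", "prefix", "host", "domain"}
DATE_RE = re.compile(r"\d{8}")  # YYYYMMDD


def validate_input(run_input: dict) -> list[str]:
    """Return a list of problems; an empty list means no issue was found."""
    problems = []
    if run_input.get("mode", "urlCaptures") not in MODES:
        problems.append("mode must be urlCaptures or listCrawls")
    if run_input.get("matchType", "auto") not in MATCH_TYPES:
        problems.append("matchType must be auto, exact, prefix, host or domain")
    max_items = run_input.get("maxItems", 100)
    if not isinstance(max_items, int) or not 1 <= max_items <= 5000:
        problems.append("maxItems must be an integer between 1 and 5000")
    for key in ("fromDate", "toDate"):
        value = run_input.get(key)
        if value is not None and not DATE_RE.fullmatch(str(value)):
            problems.append(f"{key} must use the YYYYMMDD format")
    return problems


print(validate_input({"mode": "urlCaptures", "urlPattern": "example.com/*"}))
print(validate_input({"maxItems": 9000, "fromDate": "2024-01-01"}))
```

The first call prints an empty list; the second reports the out-of-range maxItems and the badly formatted fromDate.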

Example: all archived Wikipedia article URLs in the latest crawl

{ "mode": "urlCaptures", "urlPattern": "en.wikipedia.org/wiki/*", "crawl": "latest", "maxItems": 500 }

Example: only successful HTML pages of a domain

{ "mode": "urlCaptures", "urlPattern": "example.com/*", "statusFilter": 200, "mimeFilter": "text/html" }
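
For readers post-processing a dataset themselves, the effect of statusFilter and mimeFilter in the example above can be approximated as follows. This is a sketch under assumptions: it compares status as text because the output type is not stated, it checks the declared mime field rather than mimeDetected, and it matches case-insensitively. The function keep_capture and the sample records are hypothetical.

```python
def keep_capture(record: dict, status_filter=None, mime_filter=None) -> bool:
    """Return True if a capture record passes both optional filters."""
    # Compare as strings so that 200 and "200" are treated alike (assumption).
    if status_filter is not None and str(record.get("status")) != str(status_filter):
        return False
    # mimeFilter is described as a "contains" match on the MIME type.
    if mime_filter and mime_filter.lower() not in record.get("mime", "").lower():
        return False
    return True


# Hypothetical records, shaped like the urlCaptures output.
records = [
    {"url": "https://example.com/", "status": "200", "mime": "text/html"},
    {"url": "https://example.com/old", "status": "301", "mime": "text/html"},
    {"url": "https://example.com/logo.png", "status": 200, "mime": "image/png"},
]

html_ok = [r for r in records if keep_capture(r, 200, "text/html")]
print([r["url"] for r in html_ok])  # ['https://example.com/']
```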

Example: a whole domain including subdomains, in a specific crawl

{ "mode": "urlCaptures", "urlPattern": "wikipedia.org", "matchType": "domain", "crawl": "CC-MAIN-2024-10" }

Example: list every available crawl

{ "mode": "listCrawls" }

Use cases

  • SEO & site audits — discover every URL a domain has ever exposed to crawlers
  • Domain intelligence — profile a competitor's URL structure and content types
  • Historical URL discovery — recover old / removed pages for migration or research
  • Data engineering — get WARC offsets to pull raw archived pages at scale
  • Security research — enumerate a domain's historical footprint

Data source

Data comes from the public Common Crawl URL Index (the CDX API) and the crawl archive at data.commoncrawl.org, both published openly by the Common Crawl Foundation. No account, API key, or proxy is required. Coverage spans well over a decade of monthly crawls; run listCrawls to see every crawl currently available and its date range.
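
To show what a direct query against the index looks like, here is a minimal sketch that builds a CDX API request URL from a crawl's cdxApiUrl. It only constructs the URL and sends nothing. The endpoint shown is the form Common Crawl publishes for CC-MAIN-2024-10; the helper name cdx_query_url is hypothetical, and whether the actor issues requests in exactly this shape is an assumption.

```python
from urllib.parse import urlencode


def cdx_query_url(cdx_api_url: str, url_pattern: str, **params) -> str:
    """Build a CDX index query URL that asks for JSON output.

    Extra keyword arguments (for example limit) are passed through
    as query parameters.
    """
    query = {"url": url_pattern, "output": "json", **params}
    return f"{cdx_api_url}?{urlencode(query)}"


endpoint = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"
print(cdx_query_url(endpoint, "example.com/*", limit=5))
```

The pattern is percent-encoded by urlencode, so example.com/* appears as example.com%2F%2A in the query string.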

FAQ

What is Common Crawl? A free, open repository of web crawl data covering billions of pages, updated roughly monthly. Its URL Index lets you look up which URLs were captured and where they live in the archive. See commoncrawl.org.

What's a "crawl"? Each monthly snapshot is a crawl with an id like CC-MAIN-2024-10, where the suffix is the year and week number of the crawl. Use latest for the newest, or run listCrawls to see all available ids and their date ranges.

How do wildcards work? example.com matches that host's pages; example.com/* matches everything under it; *.example.com matches all subdomains. You can also set matchType explicitly to exact, prefix, host or domain.
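
The wildcard rules above suggest a simple mapping from a pattern to an explicit matchType. The sketch below is a guess at how matchType auto could resolve a pattern, based only on this FAQ answer; the actor's real logic is not documented here, and infer_match_type is a hypothetical name.

```python
def infer_match_type(url_pattern: str) -> tuple[str, str]:
    """Return (matchType, url) for a pattern, following the FAQ's description.

    Assumed mapping:
      *.example.com      -> domain (host plus all subdomains)
      example.com/path/* -> prefix (everything under the path)
      example.com        -> host   (that host's pages)
      example.com/page   -> exact  (a single URL)
    """
    if url_pattern.startswith("*."):
        return "domain", url_pattern[2:]
    if url_pattern.endswith("*"):
        return "prefix", url_pattern[:-1]
    if "/" not in url_pattern:
        return "host", url_pattern
    return "exact", url_pattern


for pattern in ("*.example.com", "en.wikipedia.org/wiki/*", "example.com"):
    print(pattern, "->", infer_match_type(pattern))
```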

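Why did I get no results for a big-name site? Some sites block Common Crawl's crawler (CCBot) in robots.txt, so their pages are missing from the archive. A site can also be absent from one crawl and present in another, so try a different crawl or a different domain.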

Can I fetch the actual page content? Each capture includes warcUrl, offset and length — enough to download the exact archived response from Common Crawl's public data.commoncrawl.org store.
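
One common way to use these three fields is an HTTP Range request followed by gzip decompression, since each record in a Common Crawl WARC file is stored as its own gzip member. The sketch below assumes that layout and uses only the standard library; byte_range and fetch_record are hypothetical helper names, and nothing is downloaded until fetch_record is called.

```python
import gzip
import urllib.request


def byte_range(offset, length) -> str:
    """Build the Range header value for one record (the end byte is inclusive)."""
    start = int(offset)
    end = start + int(length) - 1
    return f"bytes={start}-{end}"


def fetch_record(warc_url: str, offset, length) -> bytes:
    """Download one archived record and return it decompressed.

    The result contains the WARC headers, the HTTP response headers
    and the response body, in that order.
    """
    request = urllib.request.Request(
        warc_url, headers={"Range": byte_range(offset, length)}
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        compressed = response.read()
    return gzip.decompress(compressed)


print(byte_range(100, 50))  # bytes=100-149
```

Given a capture row from the dataset, fetch_record(row["warcUrl"], row["offset"], row["length"]) would return the raw record for that capture.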

What does digest do? It's a content hash — identical digest values across captures mean the page content didn't change, which is handy for change detection and deduplication.
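
A minimal deduplication pass over a list of captures might look like the sketch below. It keeps the earliest capture for each digest and retains records that have no digest at all, since empty fields are omitted from the output. The function name dedupe_by_digest and the sample records are hypothetical.

```python
def dedupe_by_digest(captures: list[dict]) -> list[dict]:
    """Keep the earliest capture for each distinct content digest."""
    seen = set()
    unique = []
    # 14-digit timestamps sort chronologically as plain strings.
    for capture in sorted(captures, key=lambda c: c.get("timestamp", "")):
        digest = capture.get("digest")
        if digest is not None and digest in seen:
            continue  # same content as an earlier capture
        if digest is not None:
            seen.add(digest)
        unique.append(capture)
    return unique


# Hypothetical captures of one URL; the digest changes once.
captures = [
    {"url": "https://example.com/", "timestamp": "20240301000000", "digest": "AAA"},
    {"url": "https://example.com/", "timestamp": "20240201000000", "digest": "AAA"},
    {"url": "https://example.com/", "timestamp": "20240315000000", "digest": "BBB"},
]

print([c["timestamp"] for c in dedupe_by_digest(captures)])
# ['20240201000000', '20240315000000']
```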

How far back does the data go? Common Crawl has crawls stretching back over a decade; listCrawls shows every one currently available.