Pricing

from $3.00 / 1,000 results

Common Crawl Scraper

Query the Common Crawl URL Index for any domain or URL pattern. Discover a site's archived pages, historical URLs, capture dates, HTTP statuses and MIME types for SEO, domain intelligence and research. Also lists the available monthly crawls.

Developer

Crawler Bros

Last modified: 22 days ago

What this actor does

Two modes: urlCaptures and listCrawls
Wildcard lookups: example.com, example.com/*, en.wikipedia.org/wiki/*, *.example.com
125+ monthly crawls: query the latest crawl or any historical one
Server-side filters: date range; client-side filters: HTTP status, MIME type
WARC location for every capture so you can fetch the raw archived page
Empty fields are omitted, so every field that appears in a record is populated

Modes

Mode	What it does	Needs
`urlCaptures`	Look up all archived captures for a domain / URL pattern	`urlPattern` (+ optional `crawl`, filters)
`listCrawls`	List every available Common Crawl monthly crawl	–

Output — `urlCaptures` (one row per archived capture)

url — the archived URL
urlKey — Common Crawl's canonical (SURT) key
timestamp — capture time, YYYYMMDDHHMMSS
captureDate — the same time as ISO 8601
status — HTTP status at capture time
mime — declared MIME type
mimeDetected — MIME type detected from content
digest — content digest (dedupe identical pages)
length — record byte length
offset — byte offset within the WARC file
filename — WARC file path in the archive
languages — detected language codes
encoding — character encoding
redirectUrl — redirect target (for 3xx captures)
truncated — truncation reason (when present)
crawlId — which crawl this came from
warcUrl — direct link to the WARC file on data.commoncrawl.org
recordType: "capture", sourceUrl, scrapedAt
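
The relationship between timestamp and captureDate can be sketched in a few lines of Python. This is a minimal illustration, not the actor's own code: the function name is hypothetical, the timestamp is assumed to be UTC, and the exact ISO 8601 variant the actor emits (for example a "Z" suffix versus "+00:00") is not stated in this page.

```python
from datetime import datetime, timezone


def capture_timestamp_to_iso(timestamp: str) -> str:
    """Convert a 14-digit YYYYMMDDHHMMSS capture timestamp to ISO 8601.

    Assumption: the timestamp is in UTC.
    """
    parsed = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc).isoformat()


print(capture_timestamp_to_iso("20240301123045"))
# prints 2024-03-01T12:30:45+00:00
```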

Output — `listCrawls` (one row per crawl)

crawlId — e.g. CC-MAIN-2024-10
name — human-readable name (e.g. February/March 2024 Index)
fromDate, toDate — crawl time window
cdxApiUrl — the crawl's index API endpoint
timegateUrl — the crawl's timegate
recordType: "crawl", sourceUrl, scrapedAt

Input

Field	Type	Default	Description
`mode`	string	`urlCaptures`	`urlCaptures` / `listCrawls`
`urlPattern`	string	`en.wikipedia.org/wiki/*`	Domain or URL, `*` wildcards allowed
`crawl`	string	`latest`	`latest` or a crawl id like `CC-MAIN-2024-10`
`matchType`	string	`auto`	`auto` / `exact` / `prefix` / `host` / `domain`
`statusFilter`	int	–	Keep only this HTTP status (e.g. `200`)
`mimeFilter`	string	–	Keep only MIME types containing this text
`fromDate`	string	–	`YYYYMMDD` lower bound
`toDate`	string	–	`YYYYMMDD` upper bound
`maxItems`	int	`100`	Hard cap (1–5000)
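
As a rough sketch of how the client-side filters in the table behave, the Python below applies an exact status match, a substring MIME match, and a hard item cap to a list of capture records. It is an illustration under assumptions, not the actor's implementation: the function name and sample records are hypothetical, and the MIME filter is assumed to run against the mime field rather than mimeDetected.

```python
from typing import Any, Dict, Iterable, Iterator, Optional


def filter_captures(
    captures: Iterable[Dict[str, Any]],
    status_filter: Optional[int] = None,
    mime_filter: Optional[str] = None,
    max_items: int = 100,
) -> Iterator[Dict[str, Any]]:
    """Yield captures that pass the filters, stopping at max_items."""
    kept = 0
    for capture in captures:
        if kept >= max_items:
            break
        # statusFilter: exact match. Index records may carry the status as
        # text, so both sides are compared as strings.
        if status_filter is not None and str(capture.get("status")) != str(status_filter):
            continue
        # mimeFilter: keep records whose MIME type contains the given text.
        # Assumption: the check uses the declared `mime` field.
        if mime_filter and mime_filter not in str(capture.get("mime", "")):
            continue
        kept += 1
        yield capture


# Hypothetical sample records for illustration only.
sample = [
    {"url": "https://example.com/", "status": "200", "mime": "text/html"},
    {"url": "https://example.com/old", "status": "301", "mime": "text/html"},
    {"url": "https://example.com/logo.png", "status": "200", "mime": "image/png"},
]
html_ok = list(filter_captures(sample, status_filter=200, mime_filter="text/html"))
print([c["url"] for c in html_ok])
# prints ['https://example.com/']
```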

Example: all archived Wikipedia article URLs in the latest crawl

{ "mode": "urlCaptures", "urlPattern": "en.wikipedia.org/wiki/*", "crawl": "latest", "maxItems": 500 }

Example: only successful HTML pages of a domain

{ "mode": "urlCaptures", "urlPattern": "example.com/*", "statusFilter": 200, "mimeFilter": "text/html" }

Example: a whole domain including subdomains, in a specific crawl

{ "mode": "urlCaptures", "urlPattern": "wikipedia.org", "matchType": "domain", "crawl": "CC-MAIN-2024-10" }

Example: list every available crawl

{ "mode": "listCrawls" }

Use cases

SEO & site audits — discover every URL a domain has ever exposed to crawlers
Domain intelligence — profile a competitor's URL structure and content types
Historical URL discovery — recover old / removed pages for migration or research
Data engineering — get WARC offsets to pull raw archived pages at scale
Security research — enumerate a domain's historical footprint

Data source

Data comes from the public Common Crawl URL Index (the CDX API) and the crawl archive at data.commoncrawl.org, both published openly by the Common Crawl Foundation. No account, API key, or proxy is required. Coverage spans well over a decade of monthly crawls; run listCrawls to see every crawl currently available and its date range.
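
For readers who want to see what a raw index lookup looks like, here is a minimal sketch that builds a query URL for a crawl's CDX endpoint. It assumes the endpoint follows the pywb-style CDX server API (url, output, from, to and limit query parameters) and that cdx_api_url is the value of the cdxApiUrl field returned by listCrawls; the function name and the example endpoint are illustrative, and the actor may build its requests differently.

```python
from typing import Optional
from urllib.parse import urlencode


def build_cdx_query(
    cdx_api_url: str,
    url_pattern: str,
    from_date: Optional[str] = None,
    to_date: Optional[str] = None,
    limit: Optional[int] = None,
) -> str:
    """Build a CDX index query URL that asks for JSON output."""
    params = {"url": url_pattern, "output": "json"}
    if from_date:
        params["from"] = from_date  # YYYYMMDD lower bound
    if to_date:
        params["to"] = to_date  # YYYYMMDD upper bound
    if limit is not None:
        params["limit"] = str(limit)
    # urlencode percent-encodes "/" and "*"; the server decodes them again.
    return f"{cdx_api_url}?{urlencode(params)}"


# Illustrative endpoint; in practice take it from a listCrawls row.
print(build_cdx_query(
    "https://index.commoncrawl.org/CC-MAIN-2024-10-index",
    "example.com/*",
    from_date="20240101",
    limit=5,
))
```

The function only constructs the URL and performs no network request, so it can be tried without touching the index servers.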

FAQ

What is Common Crawl? A free, open repository of web crawl data covering billions of pages, updated roughly monthly. Its URL Index lets you look up which URLs were captured and where they live in the archive. See commoncrawl.org.

What's a "crawl"? Each monthly snapshot is a crawl with an id like CC-MAIN-2024-10. Use latest for the newest, or run listCrawls to see all available ids and their date ranges.

How do wildcards work? example.com matches that host's pages; example.com/* matches everything under it; *.example.com matches all subdomains. You can also set matchType explicitly to exact, prefix, host or domain.

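Why did I get no results for a big-name site? Some sites block Common Crawl's crawler in robots.txt, so their pages are missing from the archive. Each crawl also covers only a sample of the web, so a site can appear in one crawl and not in another. Try a different crawl or a different domain.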

Can I fetch the actual page content? Each capture includes warcUrl, offset and length — enough to download the exact archived response from Common Crawl's public data.commoncrawl.org store.
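
A minimal sketch of that download in Python, using only the standard library: it requests just the record's bytes with an HTTP Range header and decompresses them. It assumes each record is stored as its own gzip member (the usual layout of Common Crawl WARC files) and that offset and length may arrive as strings or integers; the function names are hypothetical.

```python
import gzip
import urllib.request
from typing import Union


def byte_range_header(offset: Union[int, str], length: Union[int, str]) -> str:
    """Build the Range header value covering exactly one WARC record."""
    start = int(offset)
    end = start + int(length) - 1  # HTTP byte ranges are inclusive
    return f"bytes={start}-{end}"


def fetch_warc_record(warc_url: str, offset: Union[int, str], length: Union[int, str]) -> bytes:
    """Download one record from a WARC file and return it decompressed.

    The result holds the WARC headers, the HTTP response headers and the body.
    """
    request = urllib.request.Request(
        warc_url, headers={"Range": byte_range_header(offset, length)}
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        compressed = response.read()
    return gzip.decompress(compressed)


print(byte_range_header(1000, 500))
# prints bytes=1000-1499
```

Calling fetch_warc_record(capture["warcUrl"], capture["offset"], capture["length"]) on a capture row would then return the raw archived response; splitting it into WARC headers, HTTP headers and body is left to a WARC parsing library.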

What does digest do? It's a content hash — identical digest values across captures mean the page content didn't change, which is handy for change detection and deduplication.
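
One way to use digest for deduplication is sketched below: captures are sorted by timestamp and only the first capture of each distinct digest is kept. The function name and the sample digest values are hypothetical; records without a digest are kept as-is because nothing can be said about their content.

```python
from typing import Any, Dict, Iterable, List


def first_capture_per_digest(captures: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the earliest capture for each distinct content digest."""
    seen = set()
    unique = []
    # 14-digit timestamps sort correctly as plain strings.
    for capture in sorted(captures, key=lambda c: c["timestamp"]):
        digest = capture.get("digest")
        if digest is not None:
            if digest in seen:
                continue  # same content as an earlier capture
            seen.add(digest)
        unique.append(capture)
    return unique


# Hypothetical sample records for illustration only.
sample = [
    {"url": "https://example.com/", "timestamp": "20240215000000", "digest": "AAA"},
    {"url": "https://example.com/", "timestamp": "20240101000000", "digest": "AAA"},
    {"url": "https://example.com/", "timestamp": "20240301000000", "digest": "BBB"},
]
print([c["timestamp"] for c in first_capture_per_digest(sample)])
# prints ['20240101000000', '20240301000000']
```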

How far back does the data go? Common Crawl has crawls stretching back over a decade; listCrawls shows every one currently available.
