Common Crawl Scraper
Pricing
from $3.00 / 1,000 results
Query the Common Crawl URL Index for any domain or URL pattern. Discover a site's archived pages, historical URLs, capture dates, HTTP statuses and MIME types for SEO, domain intelligence and research. Also lists the available monthly crawls.
Developer
Crawler Bros
Query the Common Crawl URL Index for any domain or URL pattern and get back every archived capture — the historical URLs a site exposed, when they were crawled, their HTTP status and MIME type, and the exact location of the raw page in Common Crawl's public archive. Perfect for SEO audits, domain intelligence, historical URL discovery and web-scale research. HTTP-only, no login, no proxy.
What this actor does
- Two modes: `urlCaptures` and `listCrawls`
- Wildcard lookups: `example.com`, `example.com/*`, `en.wikipedia.org/wiki/*`, `*.example.com`
- 125+ monthly crawls: query the latest crawl or any historical one
- Server-side filters: date range; client-side filters: HTTP status, MIME type
- WARC location for every capture so you can fetch the raw archived page
- Empty fields are omitted — every field in a record is populated
Modes
| Mode | What it does | Needs |
|---|---|---|
| `urlCaptures` | Look up all archived captures for a domain / URL pattern | `urlPattern` (+ optional `crawl`, filters) |
| `listCrawls` | List every available Common Crawl monthly crawl | – |
Output — urlCaptures (one row per archived capture)
- `url`: the archived URL
- `urlKey`: Common Crawl's canonical (SURT) key
- `timestamp`: capture time, `YYYYMMDDHHMMSS`
- `captureDate`: the same time as ISO 8601
- `status`: HTTP status at capture time
- `mime`: declared MIME type
- `mimeDetected`: MIME type detected from content
- `digest`: content digest (dedupe identical pages)
- `length`: record byte length
- `offset`: byte offset within the WARC file
- `filename`: WARC file path in the archive
- `languages`: detected language codes
- `encoding`: character encoding
- `redirectUrl`: redirect target (for 3xx captures)
- `truncated`: truncation reason (when present)
- `crawlId`: which crawl this came from
- `warcUrl`: direct link to the WARC file on `data.commoncrawl.org`
- `recordType: "capture"`, `sourceUrl`, `scrapedAt`
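The relationship between `timestamp` and `captureDate` can be illustrated with a minimal sketch. It assumes the 14-digit timestamp is in UTC; the helper name `capture_date` is hypothetical, and the actor's exact ISO 8601 rendering (for example a `Z` suffix instead of `+00:00`) may differ.

```python
from datetime import datetime, timezone


def capture_date(timestamp: str) -> str:
    """Convert a 14-digit YYYYMMDDHHMMSS timestamp to ISO 8601 (UTC assumed)."""
    parsed = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc).isoformat()


print(capture_date("20240301123045"))
```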
Output — listCrawls (one row per crawl)
- `crawlId`: e.g. `CC-MAIN-2024-10`
- `name`: human-readable name (e.g. `February/March 2024 Index`)
- `fromDate`, `toDate`: crawl time window
- `cdxApiUrl`: the crawl's index API endpoint
- `timegateUrl`: the crawl's timegate
- `recordType: "crawl"`, `sourceUrl`, `scrapedAt`
Input
| Field | Type | Default | Description |
|---|---|---|---|
| `mode` | string | `urlCaptures` | `urlCaptures` / `listCrawls` |
| `urlPattern` | string | `en.wikipedia.org/wiki/*` | Domain or URL, `*` wildcards allowed |
| `crawl` | string | `latest` | `latest` or a crawl id like `CC-MAIN-2024-10` |
| `matchType` | string | `auto` | `auto` / `exact` / `prefix` / `host` / `domain` |
| `statusFilter` | int | – | Keep only this HTTP status (e.g. `200`) |
| `mimeFilter` | string | – | Keep only MIME types containing this text |
| `fromDate` | string | – | `YYYYMMDD` lower bound |
| `toDate` | string | – | `YYYYMMDD` upper bound |
| `maxItems` | int | `100` | Hard cap (1–5000) |
Example: all archived Wikipedia article URLs in the latest crawl
{ "mode": "urlCaptures", "urlPattern": "en.wikipedia.org/wiki/*", "crawl": "latest", "maxItems": 500 }
Example: only successful HTML pages of a domain
{ "mode": "urlCaptures", "urlPattern": "example.com/*", "statusFilter": 200, "mimeFilter": "text/html" }
Example: a whole domain including subdomains, in a specific crawl
{ "mode": "urlCaptures", "urlPattern": "wikipedia.org", "matchType": "domain", "crawl": "CC-MAIN-2024-10" }
Example: list every available crawl
{ "mode": "listCrawls" }
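The `statusFilter`, `mimeFilter` and `maxItems` semantics described in the input table can be sketched roughly as follows. This is an illustration, not the actor's code: the function names are hypothetical, and it assumes the MIME match is a case-insensitive substring test against the declared `mime` field and that `status` may arrive as either a string or a number.

```python
def keep_capture(capture, status_filter=None, mime_filter=None):
    """Return True if a capture record passes the optional client-side filters."""
    # Compare as strings so "200" and 200 are treated alike (assumption).
    if status_filter is not None and str(capture.get("status")) != str(status_filter):
        return False
    # Substring match on the declared MIME type (assumed case-insensitive).
    if mime_filter and mime_filter.lower() not in str(capture.get("mime", "")).lower():
        return False
    return True


def filter_captures(captures, status_filter=None, mime_filter=None, max_items=100):
    """Apply the filters, then enforce the hard cap on returned items."""
    kept = [c for c in captures if keep_capture(c, status_filter, mime_filter)]
    return kept[:max_items]


# Made-up records for illustration only.
sample = [
    {"url": "https://example.com/", "status": "200", "mime": "text/html"},
    {"url": "https://example.com/old", "status": "301", "mime": "text/html"},
    {"url": "https://example.com/logo.png", "status": "200", "mime": "image/png"},
]
print(filter_captures(sample, status_filter=200, mime_filter="text/html"))
```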
Use cases
- SEO & site audits — discover every URL a domain has ever exposed to crawlers
- Domain intelligence — profile a competitor's URL structure and content types
- Historical URL discovery — recover old / removed pages for migration or research
- Data engineering — get WARC offsets to pull raw archived pages at scale
- Security research — enumerate a domain's historical footprint
Data source
Data comes from the public Common Crawl URL Index (the CDX API) and the crawl archive at data.commoncrawl.org, both published openly by the Common Crawl Foundation. No account, API key, or proxy is required. Coverage spans well over a decade of monthly crawls; run listCrawls to see every crawl currently available and its date range.
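For readers who want to see what a query against the URL Index might look like, here is a minimal sketch that only builds the request URL. It assumes the endpoint returned in `cdxApiUrl` accepts the common CDX server query parameters (`url`, `output`, `matchType`, `from`, `to`, `limit`); the helper name `build_cdx_query` is hypothetical, and the actor may construct its requests differently.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_cdx_query(cdx_api_url, url_pattern, match_type=None,
                    from_date=None, to_date=None, limit=None):
    """Build a CDX index query URL; optional parameters are sent only when set."""
    params = {"url": url_pattern, "output": "json"}
    if match_type:
        params["matchType"] = match_type
    if from_date:
        params["from"] = from_date  # server-side date lower bound
    if to_date:
        params["to"] = to_date      # server-side date upper bound
    if limit:
        params["limit"] = str(limit)
    return f"{cdx_api_url}?{urlencode(params)}"


# Example endpoint for one crawl, used here as an assumed cdxApiUrl value.
query_url = build_cdx_query(
    "https://index.commoncrawl.org/CC-MAIN-2024-10-index",
    "example.com/*",
    from_date="20240101",
)
print(query_url)
print(parse_qs(urlsplit(query_url).query))
```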
FAQ
What is Common Crawl? A free, open repository of web crawl data covering billions of pages, updated roughly monthly. Its URL Index lets you look up which URLs were captured and where they live in the archive. See commoncrawl.org.
What's a "crawl"? Each monthly snapshot is a crawl with an id like CC-MAIN-2024-10. Use latest for the newest, or run listCrawls to see all available ids and their date ranges.
How do wildcards work? example.com matches that host's pages; example.com/* matches everything under it; *.example.com matches all subdomains. You can also set matchType explicitly to exact, prefix, host or domain.
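One way the `auto` match type could be derived from a pattern is sketched below. The rules are an assumption based on the wildcard description above, not the actor's documented logic, and `infer_match_type` is a hypothetical name.

```python
def infer_match_type(pattern: str) -> tuple[str, str]:
    """Guess (matchType, url) from a pattern, following the wildcard rules above."""
    if pattern.startswith("*."):
        # "*.example.com" -> the domain and all of its subdomains
        return "domain", pattern[2:]
    if pattern.endswith("*"):
        # "example.com/*" -> everything under that path prefix
        return "prefix", pattern[:-1]
    if "/" not in pattern:
        # bare host -> that host's pages
        return "host", pattern
    # a full URL with a path and no wildcard
    return "exact", pattern


for p in ["example.com", "example.com/*", "en.wikipedia.org/wiki/*", "*.example.com"]:
    print(p, "->", infer_match_type(p))
```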
Why did I get no results for a big-name site? Some sites exclude crawlers via robots.txt, so they aren't in Common Crawl. Try a different domain or crawl.
Can I fetch the actual page content? Each capture includes warcUrl, offset and length — enough to download the exact archived response from Common Crawl's public data.commoncrawl.org store.
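A minimal sketch of that download, assuming the server honours HTTP `Range` requests and that each WARC record is stored as its own gzip member (the usual layout for Common Crawl WARC files). The function names are hypothetical, and `fetch_record` performs a real network request when called.

```python
import gzip
import urllib.request


def byte_range(offset: int, length: int) -> str:
    """Build a Range header value covering exactly one record (end is inclusive)."""
    return f"bytes={offset}-{offset + length - 1}"


def fetch_record(warc_url: str, offset, length) -> bytes:
    """Download one record from a WARC file and return it decompressed."""
    # offset and length may arrive as strings, so cast before doing arithmetic.
    headers = {"Range": byte_range(int(offset), int(length))}
    request = urllib.request.Request(warc_url, headers=headers)
    with urllib.request.urlopen(request, timeout=60) as response:
        return gzip.decompress(response.read())


print(byte_range(1000, 250))
```

The decompressed bytes contain the WARC headers, followed by the HTTP response headers and the page body.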
What does digest do? It's a content hash — identical digest values across captures mean the page content didn't change, which is handy for change detection and deduplication.
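As a rough illustration of change detection with `digest`, the sketch below keeps only the captures where a URL's content differs from its previous capture. The function name and the sample records are made up for the example.

```python
def content_changes(captures):
    """Return captures whose digest differs from the previous capture of the same URL."""
    last_digest = {}
    changes = []
    # Fixed-width timestamps sort chronologically as plain strings.
    for capture in sorted(captures, key=lambda c: (c["url"], c["timestamp"])):
        if last_digest.get(capture["url"]) != capture["digest"]:
            changes.append(capture)
        last_digest[capture["url"]] = capture["digest"]
    return changes


# Made-up records for illustration only.
sample_captures = [
    {"url": "https://example.com/", "timestamp": "20230201000000", "digest": "AAA"},
    {"url": "https://example.com/", "timestamp": "20230101000000", "digest": "AAA"},
    {"url": "https://example.com/", "timestamp": "20230301000000", "digest": "BBB"},
]
print([c["timestamp"] for c in content_changes(sample_captures)])
```

The first capture of each URL is always reported, since there is nothing earlier to compare it with.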
How far back does the data go? Common Crawl has crawls stretching back over a decade; listCrawls shows every one currently available.