Pricing

from $2.00 / 1,000 urls

Ultimate URL Harvester — All Site URLs from 6 Archives

Get every known URL of any domain from Wayback Machine, Common Crawl, AlienVault OTX, URLScan, crt.sh and sitemap — merged & deduped. Block-proof, no API key.

Developer

Hitman studio

Last modified

22 days ago

Sources (all free, no key)

Source	What it adds
Wayback Machine	Every URL the Internet Archive has ever seen
Common Crawl	URLs from multiple monthly crawl indexes (configurable, 125 available)
AlienVault OTX	URLs observed in threat-intelligence data (paginated)
URLScan.io	Scanned page URLs
crt.sh	Subdomains from certificate transparency logs
CertSpotter	More subdomains from CT logs (overlaps with crt.sh, kept for redundancy)
sitemap.xml	The site's own declared URLs (follows nested sitemap-index)
Live crawl (optional)	Crawls the live site for the freshest pages not yet archived
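
To illustrate what a keyless archive lookup can look like, here is a minimal sketch that builds a query URL for the Wayback Machine's public CDX endpoint. It is not this actor's actual code: the function name `wayback_cdx_url` and the choice of parameters are assumptions made for illustration, and the sketch only constructs the URL without sending a request.

```python
from urllib.parse import urlencode

# Public Wayback Machine CDX endpoint (no API key needed).
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def wayback_cdx_url(domain: str, include_subdomains: bool = True) -> str:
    """Build a CDX query URL that lists archived URLs for a domain.

    Hypothetical helper for illustration; the actor may query differently.
    """
    params = {
        "url": domain,
        # "domain" also matches subdomains; "host" matches this host only.
        "matchType": "domain" if include_subdomains else "host",
        "output": "json",
        "fl": "original",      # return only the original URL column
        "collapse": "urlkey",  # collapse repeated captures of the same URL
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


print(wayback_cdx_url("example.com"))
```

Fetching that URL with any HTTP client should return a JSON array of rows, which can then be merged with the results from the other sources.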

Querying several Common Crawl indexes is the big multiplier: vercel.com, for example, goes from ~8k URLs (1 index) to ~48k deduplicated URLs (4 indexes), roughly six times what a single index returns.
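
The multi-index idea can be sketched as follows, assuming the public Common Crawl index server at index.commoncrawl.org. The helper names and the crawl ids used here are illustrative assumptions, not the actor's internals. One function builds a query URL per index, the other takes the union of whatever each index returned.

```python
from urllib.parse import urlencode


def common_crawl_urls(domain: str, crawl_ids: list[str]) -> list[str]:
    """Build one index query URL per crawl id (hypothetical helper)."""
    # A leading "*." asks the index for the domain and its subdomains.
    query = urlencode({"url": f"*.{domain}", "output": "json"})
    return [
        f"https://index.commoncrawl.org/{crawl_id}-index?{query}"
        for crawl_id in crawl_ids
    ]


def union_deduped(per_index: dict[str, list[str]]) -> set[str]:
    """Merge the URL lists of several indexes into one deduplicated set."""
    seen: set[str] = set()
    for urls in per_index.values():
        seen.update(urls)
    return seen


# Example crawl ids; the current list is published at
# https://index.commoncrawl.org/collinfo.json
queries = common_crawl_urls("example.com", ["CC-MAIN-2024-10", "CC-MAIN-2023-50"])

# Made-up results to show how overlapping indexes still add new URLs.
merged = union_deduped({
    "CC-MAIN-2024-10": ["https://example.com/", "https://example.com/pricing"],
    "CC-MAIN-2023-50": ["https://example.com/", "https://example.com/old-page"],
})
print(len(merged))  # 3 distinct URLs from 4 raw hits
```

Each extra index mostly repeats known URLs but also contributes pages that were only crawled in that month, which is why the deduplicated total keeps growing.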

Output

Each URL row lists the sources that found it, which indicates how widely a page is known:

{ "url": "https://example.com/pricing", "sources": ["commoncrawl","sitemap","wayback"], "extension": "", "domain": "example.com" }
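
A row of that shape could be assembled roughly like this. This is a sketch under assumptions: the function name `build_rows`, the alphabetical ordering of `sources`, and the way `extension` is derived from the URL path are guesses based on the sample row above, not documented behavior.

```python
import posixpath
from urllib.parse import urlparse


def build_rows(found: dict[str, list[str]], domain: str) -> list[dict]:
    """Merge per-source URL lists into one row per distinct URL."""
    sources_by_url: dict[str, set[str]] = {}
    for source, urls in found.items():
        for url in urls:
            sources_by_url.setdefault(url, set()).add(source)

    rows = []
    for url in sorted(sources_by_url):
        # File extension of the path without the dot; "" when there is none.
        ext = posixpath.splitext(urlparse(url).path)[1].lstrip(".").lower()
        rows.append({
            "url": url,
            "sources": sorted(sources_by_url[url]),
            "extension": ext,
            "domain": domain,
        })
    return rows


rows = build_rows(
    {
        "wayback": ["https://example.com/pricing"],
        "sitemap": ["https://example.com/pricing", "https://example.com/docs/guide.pdf"],
        "commoncrawl": ["https://example.com/pricing"],
    },
    domain="example.com",
)
for row in rows:
    print(row)
```

A URL reported by three sources ends up as a single row with three entries in `sources`, so the length of that list can serve as a rough popularity signal.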

Input

domain — e.g. example.com
sources — pick which archives (default: all 6)
includeSubdomains — also blog./shop./api. etc. (default true)
extensions — keep only e.g. ["pdf","js"]
keyword — keep only URLs containing this text
declutter — collapse near-duplicate URLs (default true)
fetchContent — also fetch each URL's saved HTML from Wayback (block-proof, slower)
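
The filtering inputs above can be pictured with a small sketch. The listing does not define what counts as a near-duplicate, so the rule used here (URLs that differ only in query string, fragment, scheme, or trailing slash) is an assumption, as are the function names `keep_url` and `declutter` and the case-sensitive keyword match.

```python
import posixpath
from urllib.parse import urlparse


def keep_url(url: str, extensions=None, keyword=None) -> bool:
    """Apply the extensions and keyword filters to a single URL."""
    if keyword and keyword not in url:
        return False
    if extensions:
        ext = posixpath.splitext(urlparse(url).path)[1].lstrip(".").lower()
        return ext in {e.lower() for e in extensions}
    return True


def declutter(urls: list[str]) -> list[str]:
    """Keep the first URL of each (host, path) group.

    Assumed near-duplicate rule: query string, fragment, scheme and
    trailing slash are ignored when comparing URLs.
    """
    seen = set()
    kept = []
    for url in urls:
        parts = urlparse(url)
        key = (parts.netloc.lower(), parts.path.rstrip("/") or "/")
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept


urls = [
    "https://example.com/app.js?v=1",
    "https://example.com/app.js?v=2",
    "https://example.com/docs/guide.pdf",
    "https://example.com/pricing/",
]
filtered = [u for u in urls if keep_url(u, extensions=["pdf", "js"])]
result = declutter(filtered)
print(result)
```

With `extensions` set to `["pdf","js"]`, the pricing page is dropped by the filter and the two versioned copies of `app.js` collapse into one entry.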

Use cases

SEO & content audits — find every indexed/forgotten page
Site migrations — make sure no URL is lost
Scraper prep — get the full URL list before scraping
Security recon — discover exposed paths, old endpoints, subdomains
Get data from blocked sites — pull pages from Common Crawl/Wayback instead of hitting Cloudflare

Legal

Uses only public archives and the site's own sitemap. Does not bypass logins, paywalls, or anti-bot protection. Collects only publicly available data.

gau - Get All URLs

rl1987/gau-wrapper

Fetch known URLs from the Wayback Machine, Common Crawl, AlienVault OTX, and URLScan for any domain. A wrapper around the gau OSINT tool for attack-surface and data-pipeline use.

R.L.

Wayback Machine Scraper

glassventures/wayback-machine-scraper

Scrape Wayback Machine archive snapshots for any URL or domain. Get archived URLs, timestamps, status codes, MIME types. Export to JSON, CSV, Excel.

Glass Ventures

Wayback Machine URL Extractor - Archived URLs

logiover/wayback-machine-url-extractor

Extract every archived URL of any domain from the Internet Archive's Wayback Machine (CDX API). Recover lost or old pages, build redirect maps and run OSINT, with date and status filters. No API key, export to CSV or JSON.

Logiover

Sitemap & URL Extractor — Get Every URL of a Website

dataquarry/sitemap-url-extractor

Get every URL of a website: parses sitemap.xml and sitemap-indexes (discovered via robots.txt or the default location), with a same-site crawl fallback when there's no sitemap. Returns each URL + lastmod. No API key.

Daniel Brenner

Subdomain Finder — Certificate Transparency (crt.sh)

bgfc97/crtsh-subdomain-finder

Discover subdomains of any domain from public Certificate Transparency logs (crt.sh). Attack-surface mapping and recon with first/last-seen dates. No key, no proxy.

Bruno

Wayback Machine Scraper — Archived Snapshots

hipersoft/wayback-machine-scraper

List every Internet Archive (Wayback Machine) snapshot of a URL or whole domain: timestamp, snapshot URL, status code, MIME type and content digest. Filter by date, status and dedupe. For SEO, OSINT and historical research. No key.

hiper soft

Wayback Machine Search

crawlerbros/wayback-machine-search

Query Internet Archive's Wayback Machine for historical snapshots of any URL or domain. Filter by date, HTTP status, MIME type, and deduplicate. Optionally fetch the archived page text. Free public CDX API, no authentication.

Crawler Bros

Sitemap URL Harvester

mahogany_songbird/sitemap-url-harvester

Collect URLs from XML sitemaps for SEO and crawling.

Britton Furness

crt.sh Certificate Transparency Scraper

parseforge/crtsh-certificate-transparency-scraper

Search the crt.sh certificate transparency logs for any domain you control and surface the hosts behind it. Each record carries the common name, every subject alternative name, the issuing authority, serial number, and validity window. Built for attack surface mapping and asset inventory.

ParseForge

SSL Certificate Transparency Scraper (crt.sh)

chrisp1211/certificate-search-scraper-max

Scrape SSL/TLS certificate transparency logs from crt.sh for any domain. Great for security audits and subdomain discovery. Returns common name, subdomains (SANs), issuer and validity. Pay per certificate; empty runs free.