Ultimate URL Harvester — All Site URLs from 6 Archives
Pricing
from $0.20 / 1,000 urls
Get every known URL of any domain from Wayback Machine, Common Crawl, AlienVault OTX, URLScan, crt.sh and sitemap — merged & deduped. Block-proof, no API key.
Developer
Hitman studio
Maintained by Community
Actor stats
Bookmarked: 0
Total users: 5
Monthly active users: 3
Last modified: 2 days ago
Get every known URL of any domain, pulled from 6 public archives at once, then merged and deduplicated into one clean list. Many URL scrapers rely on a single source; this one combines them, so the merged list is usually more complete than any single source provides.
Block-proof: apart from reading the site's own sitemap.xml and the optional live crawl, it never hits the target website; everything else comes from passive public archives. No API key or proxy needed.
Sources (all free, no key)
| Source | What it adds |
|---|---|
| Wayback Machine | Every URL the Internet Archive has ever seen |
| Common Crawl | URLs from multiple monthly crawl indexes (configurable, 125 available) |
| AlienVault OTX | Threat-intel observed URLs (paginated) |
| URLScan.io | Scanned page URLs |
| crt.sh | Subdomains from certificate transparency logs |
| CertSpotter | More subdomains from CT logs (redundant with crt.sh) |
| sitemap.xml | The site's own declared URLs (follows nested sitemap-index) |
| Live crawl (optional) | Crawls the live site for the freshest pages not yet archived |
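To illustrate how a certificate-transparency source can contribute subdomains, here is a minimal sketch that extracts hostnames from crt.sh-style JSON records. It assumes each record carries a `name_value` field with one or more newline-separated names, which is how crt.sh's JSON output is shaped at the time of writing. The function name `subdomains_from_crtsh` and the sample records are hypothetical and are not taken from this Actor's code.

```python
def subdomains_from_crtsh(records, domain):
    """Collect unique hostnames under `domain` from crt.sh-style records."""
    domain = domain.lower()
    suffix = "." + domain
    found = set()
    for record in records:
        # name_value may hold several names separated by newlines.
        for name in record.get("name_value", "").splitlines():
            name = name.strip().lower()
            # Wildcard certificates look like "*.example.com"; drop the "*.".
            if name.startswith("*."):
                name = name[2:]
            # Keep the domain itself and anything below it, nothing else.
            if name == domain or name.endswith(suffix):
                found.add(name)
    return sorted(found)


# Made-up sample shaped like a crt.sh JSON response.
sample = [
    {"name_value": "example.com\n*.example.com"},
    {"name_value": "blog.example.com"},
    {"name_value": "API.example.com\nshop.example.com"},
    {"name_value": "notexample.com"},
]

print(subdomains_from_crtsh(sample, "example.com"))
```

The suffix check uses a leading dot so that look-alike names such as `notexample.com` are not mistaken for subdomains.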
Querying several Common Crawl indexes is the big multiplier: for example, vercel.com goes from ~8k URLs (1 index) to ~48k deduplicated URLs (4 indexes), far more than a single index or a single source returns.
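The multiplier comes from each index being a separate crawl, so the union across indexes keeps growing. A rough sketch of that idea follows, assuming the public Common Crawl index server at `index.commoncrawl.org` and index IDs of the form `CC-MAIN-YYYY-WW`. The helper names `cc_query_url` and `merge_indexes`, the two index IDs, and the sample URL lists are illustrative assumptions, and the code only builds query URLs without sending any request.

```python
from urllib.parse import urlencode


def cc_query_url(index_id, domain, include_subdomains=True):
    """Build a query URL for one Common Crawl index (no request is sent)."""
    # "*.example.com" is a domain-level wildcard; "example.com/*" is a
    # prefix match on the bare host only.
    pattern = f"*.{domain}" if include_subdomains else f"{domain}/*"
    query = urlencode({"url": pattern, "output": "json"})
    return f"https://index.commoncrawl.org/{index_id}-index?{query}"


def merge_indexes(urls_by_index):
    """Union URL lists from several indexes, preserving first-seen order."""
    seen = {}
    for urls in urls_by_index.values():
        for url in urls:
            seen.setdefault(url, None)
    return list(seen)


# Made-up results for two example indexes; real responses are far larger.
urls_by_index = {
    "CC-MAIN-2024-10": ["https://example.com/", "https://example.com/pricing"],
    "CC-MAIN-2024-18": ["https://example.com/pricing", "https://example.com/blog"],
}

for index_id in urls_by_index:
    print(cc_query_url(index_id, "example.com"))
print(len(merge_indexes(urls_by_index)), "unique URLs")
```

In this toy data each index holds two URLs, but the merged list holds three because only one URL overlaps.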
Output
Each URL row shows which sources found it — so you know how well-known a page is:
{ "url": "https://example.com/pricing", "sources": ["commoncrawl","sitemap","wayback"], "extension": "", "domain": "example.com" }
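Rows of this shape could be assembled roughly as follows. This is a sketch under stated assumptions, not the Actor's actual implementation: `build_rows` is a hypothetical helper, the extension is taken from the URL path, and `domain` is simply the queried domain passed through.

```python
import posixpath
from urllib.parse import urlsplit


def build_rows(urls_by_source, domain):
    """Merge per-source URL lists into one row per unique URL."""
    sources_by_url = {}
    for source, urls in urls_by_source.items():
        for url in urls:
            sources_by_url.setdefault(url, set()).add(source)

    rows = []
    for url in sorted(sources_by_url):
        # The extension comes from the path only, so query strings are ignored.
        path = urlsplit(url).path
        extension = posixpath.splitext(path)[1].lstrip(".").lower()
        rows.append({
            "url": url,
            "sources": sorted(sources_by_url[url]),
            "extension": extension,
            "domain": domain,
        })
    return rows


# Made-up per-source results for illustration.
urls_by_source = {
    "wayback": ["https://example.com/pricing", "https://example.com/files/report.PDF"],
    "commoncrawl": ["https://example.com/pricing"],
    "sitemap": ["https://example.com/pricing"],
}

rows = build_rows(urls_by_source, "example.com")
# Pages seen by more sources sort first.
rows.sort(key=lambda row: len(row["sources"]), reverse=True)
for row in rows:
    print(row)
```

Sorting by the number of sources is one simple way to use the `sources` field as a rough signal of how widely a page has been observed.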
Input
- domain: e.g. example.com
- sources: pick which archives (default: all 6)
- includeSubdomains: also blog./shop./api. etc. (default true)
- extensions: keep only e.g. ["pdf","js"]
- keyword: keep only URLs containing this text
- declutter: collapse near-duplicate URLs (default true)
- fetchContent: also fetch each URL's saved HTML from Wayback (block-proof, slower)
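As a rough illustration, an input object using the fields above might look like the dictionary below, followed by a local approximation of the "keep only" semantics of `extensions` and `keyword`. The values are examples only, and the exact matching rules (case-insensitive here, with the extension read from the URL path) are assumptions rather than documented behavior. `apply_filters` is a hypothetical helper.

```python
import posixpath
from urllib.parse import urlsplit

# Example input; field names come from the list above, values are made up.
run_input = {
    "domain": "example.com",
    "sources": ["wayback", "commoncrawl", "sitemap"],
    "includeSubdomains": True,
    "extensions": ["pdf", "js"],
    "keyword": "docs",
    "declutter": True,
    "fetchContent": False,
}


def apply_filters(urls, extensions=None, keyword=None):
    """Keep URLs whose path extension and text match the given filters."""
    wanted = {ext.lower().lstrip(".") for ext in extensions or []}
    kept = []
    for url in urls:
        ext = posixpath.splitext(urlsplit(url).path)[1].lstrip(".").lower()
        if wanted and ext not in wanted:
            continue  # extension filter is set and this URL does not match
        if keyword and keyword.lower() not in url.lower():
            continue  # keyword filter is set and the text is absent
        kept.append(url)
    return kept


urls = [
    "https://example.com/docs/guide.pdf",
    "https://example.com/docs/index.html",
    "https://example.com/static/app.js?v=3",
    "https://example.com/DOCS/api.js",
]

print(apply_filters(urls, run_input["extensions"], run_input["keyword"]))
```

With both filters set, a URL has to pass each one, so `app.js?v=3` is dropped for lacking the keyword even though its extension matches.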
Use cases
- SEO & content audits — find every indexed/forgotten page
- Site migrations — make sure no URL is lost
- Scraper prep — get the full URL list before scraping
- Security recon — discover exposed paths, old endpoints, subdomains
- Get data from blocked sites — pull pages from Common Crawl/Wayback instead of hitting Cloudflare
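For the last use case, a harvested URL can be turned into a Wayback Machine snapshot address instead of a request to the live site. The sketch below builds such an address using Wayback's `id_` modifier, which asks for the archived bytes without the replay toolbar or link rewriting. The helper name `wayback_raw_url` and the default timestamp are illustrative assumptions. Wayback normally redirects to the capture closest to the requested timestamp, and whether a capture exists for a given URL is not guaranteed.

```python
def wayback_raw_url(url, timestamp="20240101000000"):
    """Build a Wayback Machine address for the raw archived copy of `url`.

    `timestamp` is YYYYMMDDhhmmss; the value here is only an example.
    """
    return f"https://web.archive.org/web/{timestamp}id_/{url}"


print(wayback_raw_url("https://example.com/pricing"))
```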
Legal
Uses only public archives and the site's own sitemap. Does not bypass logins, paywalls, or anti-bot protection. Collects only publicly available data.