Ultimate URL Harvester — All Site URLs from 6 Archives

Pricing

from $0.20 / 1,000 urls

Get every known URL of any domain from Wayback Machine, Common Crawl, AlienVault OTX, URLScan, crt.sh and sitemap — merged & deduped. Block-proof, no API key.

Developer

Hitman studio

Maintained by Community

Get every known URL of any domain, pulled from 6 public archives at once, then merged and deduplicated into one clean list. Many URL scrapers query a single source; this one combines several, so the result is more complete than any one archive can give on its own.

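Block-proof: URLs come from passive public archives rather than from crawling the target website, so the site cannot block the run. The only requests made to the site itself are for its sitemap.xml and, if enabled, the optional live crawl. No API key or proxy is needed.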

Sources (all free, no key)

| Source | What it adds |
| --- | --- |
| Wayback Machine | Every URL the Internet Archive has ever seen |
| Common Crawl | URLs from multiple monthly crawl indexes (configurable, 125 available) |
| AlienVault OTX | Threat-intel observed URLs (paginated) |
| URLScan.io | Scanned page URLs |
| crt.sh | Subdomains from certificate transparency logs |
| CertSpotter | More subdomains from CT logs (redundant with crt.sh) |
| sitemap.xml | The site's own declared URLs (follows nested sitemap-index) |
| Live crawl (optional) | Crawls the live site for the freshest pages not yet archived |
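
To make the "no API key" point concrete, here is a minimal sketch of how a key-free source such as the Wayback Machine can be queried through its public CDX endpoint. It is an illustration only, not the actor's actual code: the function names `wayback_cdx_url` and `parse_cdx_json` are hypothetical, and the choice of query parameters is an assumption about what a URL harvester would want.

```python
import json
from typing import Optional
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def wayback_cdx_url(domain: str, include_subdomains: bool = True,
                    limit: Optional[int] = None) -> str:
    """Build a Wayback CDX query URL (hypothetical helper, not the actor's code)."""
    params = {
        "url": domain,
        # "domain" matches the host and all subdomains; "host" matches only the host.
        "matchType": "domain" if include_subdomains else "host",
        "fl": "original",      # return only the original URL column
        "collapse": "urlkey",  # one row per distinct normalized URL
        "output": "json",
    }
    if limit is not None:
        params["limit"] = str(limit)
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_json(payload: str) -> list[str]:
    """Extract URLs from a CDX JSON response requested with fl=original."""
    rows = json.loads(payload)
    # The first row is a header naming the columns (here just "original").
    return [row[0] for row in rows[1:]]
```

Fetching the built URL with any HTTP client and passing the response body to `parse_cdx_json` would yield the archived URLs. No request goes to the target site.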

Querying several Common Crawl indexes is the big multiplier. For example, vercel.com goes from ~8k URLs with 1 index to ~48k deduplicated URLs with 4 indexes, far more than a single source returns.
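
The multi-index effect can be sketched as follows. Each Common Crawl index is queried separately and returns one JSON record per line; the union of the `url` fields across indexes is the deduplicated set. The helper names and the index ids in the usage note are illustrative assumptions, not the actor's internals.

```python
import json
from urllib.parse import urlencode


def cc_index_url(index_id: str, domain: str) -> str:
    """Build a query URL for one Common Crawl index (hypothetical helper)."""
    # "*.example.com" asks the index for the domain and its subdomains.
    query = urlencode({"url": f"*.{domain}", "output": "json"})
    return f"https://index.commoncrawl.org/{index_id}-index?{query}"


def urls_from_cc_lines(body: str) -> set[str]:
    """Collect the "url" field from a newline-delimited JSON response body."""
    urls = set()
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if "url" in record:
            urls.add(record["url"])
    return urls


def merge_indexes(bodies: dict[str, str]) -> set[str]:
    """Union the URLs found in several index responses, keyed by index id."""
    merged: set[str] = set()
    for body in bodies.values():
        merged |= urls_from_cc_lines(body)
    return merged
```

For instance, responses for two index ids such as `CC-MAIN-2024-10` and `CC-MAIN-2023-50` (used here only as examples) would be fetched from `cc_index_url(...)` and passed to `merge_indexes`. URLs seen in both indexes are counted once.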

Output

Each URL row shows which sources found it — so you know how well-known a page is:

{ "url": "https://example.com/pricing", "sources": ["commoncrawl","sitemap","wayback"], "extension": "", "domain": "example.com" }
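
A row like the one above could be produced by a merge step along these lines. This is a sketch under assumptions: `merge_sources` is a hypothetical name, the `domain` field is taken to be the queried domain, and `extension` is assumed to come from the URL path without the leading dot.

```python
from collections import defaultdict
from posixpath import splitext
from urllib.parse import urlsplit


def merge_sources(found: dict[str, list[str]], domain: str) -> list[dict]:
    """Merge per-source URL lists into one row per unique URL."""
    seen: dict[str, set[str]] = defaultdict(set)
    for source, urls in found.items():
        for url in urls:
            seen[url.strip()].add(source)

    rows = []
    for url in sorted(seen):
        # Extension is read from the path only, so query strings are ignored.
        ext = splitext(urlsplit(url).path)[1].lstrip(".").lower()
        rows.append({
            "url": url,
            "sources": sorted(seen[url]),  # alphabetical, as in the sample row
            "extension": ext,
            "domain": domain,              # assumption: the queried domain
        })
    return rows
```

The length of `sources` then serves as a rough signal of how widely a page has been observed.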

Input

  • domain — e.g. example.com
  • sources — pick which archives (default: all 6)
  • includeSubdomains — also blog./shop./api. etc. (default true)
  • extensions — keep only e.g. ["pdf","js"]
  • keyword — keep only URLs containing this text
  • declutter — collapse near-duplicate URLs (default true)
  • fetchContent — also fetch each URL's saved HTML from Wayback (block-proof, slower)
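
The filtering inputs above can be pictured with a short sketch. The listing does not say how `declutter` decides that two URLs are near-duplicates, so the rule used here (ignore the query string and treat numeric path segments as interchangeable) is purely an assumption, as are the function names and the case-sensitive keyword match.

```python
from posixpath import splitext
from urllib.parse import urlsplit


def declutter_key(url: str) -> str:
    """Hypothetical near-duplicate key: host + path, numeric segments masked, query dropped."""
    parts = urlsplit(url)
    segments = ["{n}" if seg.isdigit() else seg for seg in parts.path.split("/")]
    return parts.netloc.lower() + "/".join(segments)


def filter_urls(urls, extensions=None, keyword=None, declutter=True):
    """Apply extension, keyword and declutter filters, keeping input order."""
    wanted = {e.lower().lstrip(".") for e in extensions} if extensions else None
    kept, seen_keys = [], set()
    for url in urls:
        ext = splitext(urlsplit(url).path)[1].lstrip(".").lower()
        if wanted is not None and ext not in wanted:
            continue
        # Assumption: keyword is a plain, case-sensitive substring match.
        if keyword and keyword not in url:
            continue
        if declutter:
            key = declutter_key(url)
            if key in seen_keys:
                continue  # a near-duplicate of an already kept URL
            seen_keys.add(key)
        kept.append(url)
    return kept
```

With this rule, `/post/1` and `/post/2` collapse into a single entry, while `extensions=["pdf", "js"]` keeps only URLs whose path ends in `.pdf` or `.js`.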

Use cases

  • SEO & content audits — find every indexed/forgotten page
  • Site migrations — make sure no URL is lost
  • Scraper prep — get the full URL list before scraping
  • Security recon — discover exposed paths, old endpoints, subdomains
  • Get data from blocked sites — pull pages from Common Crawl/Wayback instead of hitting Cloudflare

Uses only public archives and the site's own sitemap. Does not bypass logins, paywalls, or anti-bot protection. Collects only publicly available data.