🏢 Company Data Aggregator — Crunchbase Free API Alternative

Pricing

$30.00 / 1,000 company profiles aggregated (per domain)

Bulk company profile lookup. Aggregates WHOIS, DNS, GitHub org, SSL certs, tech stack headers, robots/sitemap — zero auth, zero paid APIs. Replaces the Crunchbase Free API (killed 2023).


Rating

0.0 (0)

Developer

Stephan Corbeil

Maintained by Community

Actor stats

Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 18 hours ago


The honest, zero-auth company information API. Feed in a company name or domain, get back a unified profile built from eight public signals: WHOIS, DNS, GitHub, SSL certificates, tech-stack headers, robots/sitemap, favicon CDN, and npm. No scraping of Crunchbase. No scraping of LinkedIn. No "enrichment" black boxes that secretly resell your query logs.

Keywords: crunchbase free api alternative, company information api, company lookup api, company enrichment api, domain intelligence api, company profile api, free company api


Why this actor exists

On June 28, 2023, Crunchbase quietly deprecated their free Basic API tier. The community discovered it when support tickets went unanswered and the "Get API key" button disappeared. Today the cheapest Crunchbase Enterprise plan starts at $49/user/month for the web app and five figures annually for the API — and even then, the rate limits make bulk enrichment painful.

Clearbit was acquired by HubSpot in late 2023 and their free tier was folded into a HubSpot-gated funnel. Apollo's free plan caps at 60 credits/month. ZoomInfo doesn't publish pricing for a reason.

Meanwhile, most of what you actually want to know about a company is already public. A company's domain registration tells you roughly when they started. Their MX record tells you whether they're on Google Workspace or Microsoft 365. Their GitHub org reveals engineering footprint. Their SSL cert's SAN list maps out related subdomains and products. This actor stitches those signals together, labels each one honestly, and hands you a clean JSON profile per company.


What sources does this tap?

We are deliberately transparent about every signal and its limitations. If a vendor won't tell you where the data came from, assume they're scraping someone they shouldn't be.

1. Domain WHOIS (python-whois)

  • Provides: creation date, registrar, expiration date
  • Limitation: GDPR-redacted for .eu/.fr/.de/.uk domains — registrant name/email usually empty. Date + registrar still come through.
  • We infer: founded_year as a proxy for company founding year. Good signal for startups; less useful for rebrands.
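The normalization step is fiddly because python-whois returns creation_date as a single datetime, a list of datetimes (some registries report several events), or nothing at all. A minimal sketch of the founded_year inference — the helper name is illustrative, not the actor's actual code:

```python
from datetime import datetime

def founded_year_from_whois(creation_date):
    """Normalize python-whois's creation_date field to a year.

    The field may be a datetime, a list of datetimes, or None when
    the record is redacted or the TLD has no parsable WHOIS."""
    if isinstance(creation_date, list):
        # Earliest event is the best proxy for registration/founding.
        creation_date = min(
            (d for d in creation_date if isinstance(d, datetime)),
            default=None,
        )
    if isinstance(creation_date, datetime):
        return creation_date.year
    return None  # redacted or unparsable record
```

You would feed it `whois.whois(domain).creation_date` and store the result as founded_year.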

2. DNS records (dnspython)

  • Provides: MX (mail exchangers), NS (nameservers), A (IPs)
  • We infer:
    • email_provider from MX (Google Workspace vs Microsoft 365 vs Zoho vs Proton vs SES...)
    • dns_host from NS (Cloudflare vs Route 53 vs Azure DNS vs GoDaddy...)
  • Limitation: Some orgs use split-horizon DNS; enterprise MX may be a generic mail gateway that masks the real provider.
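The MX-to-provider inference boils down to a suffix table over mail-exchanger hostnames. A sketch, where the signature list is a partial, illustrative subset of what the actor might carry:

```python
def email_provider_from_mx(mx_hosts):
    """Map MX hostnames to a well-known email provider.

    mx_hosts: lowercase hostnames, e.g. from dnspython:
        [r.exchange.to_text().rstrip('.')
         for r in dns.resolver.resolve(domain, 'MX')]"""
    signatures = {
        "google.com": "Google Workspace",
        "googlemail.com": "Google Workspace",
        "outlook.com": "Microsoft 365",
        "zoho.com": "Zoho Mail",
        "protonmail.ch": "Proton Mail",
        "amazonaws.com": "Amazon SES",
    }
    for host in mx_hosts:
        for suffix, provider in signatures.items():
            if host.endswith(suffix):
                return provider
    return None  # unknown or self-hosted mail
```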

3. GitHub organization (api.github.com)

  • Provides: org login, display name, public repo count, followers, created_at, blog URL, location, total stars across first page of repos
  • Rate limit: 60 req/hour unauthenticated. Set the GITHUB_TOKEN environment variable to bump to 5000 req/hour.
  • Limitation: We guess the org slug from the domain SLD or company name. If the org uses a non-obvious handle ("meta-llama" vs "meta"), we may miss it. Override by providing a name that matches the GitHub login.
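The slug-guessing heuristic might look like this (a hypothetical helper; the actor's real ordering and dedup may differ). Each candidate is then tried against https://api.github.com/orgs/{slug}, adding an Authorization: Bearer header when GITHUB_TOKEN is set:

```python
def guess_github_slugs(domain, name=None):
    """Candidate GitHub org logins to try, most specific first.

    The name override exists precisely for cases like "meta-llama"
    vs "meta", where the bare SLD guess would miss."""
    candidates = []
    if name:
        candidates.append(name.lower().replace(" ", "-"))
    sld = domain.split(".")[0].lower() if domain else None
    if sld and sld not in candidates:
        candidates.append(sld)
    return candidates
```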

4. Favicon / logo (Google s2)

  • URL shape: https://www.google.com/s2/favicons?domain={domain}&sz=128
  • Why Google s2: free, CDN-backed, globally cached, size-parameterized, and used by Chrome itself — the most reliable free logo source in existence.
  • Limitation: Returns a default globe icon if the target site has no favicon. For vector-quality logos, use our companion company-logo-api actor.

5. SSL certificate (ssl stdlib)

  • Provides: issuer organization, validity window, Subject Alternative Names (SAN list)
  • Why SANs matter: a cert covering api.stripe.com, dashboard.stripe.com, docs.stripe.com tells you the company operates those three properties. This is the cleanest free method for mapping a company's attack surface short of paid services like SecurityTrails or censys.io.
  • Limitation: Cloudflare-terminated TLS shows Cloudflare as the issuer and flattens the SAN list for the edge cert. Services using per-product certs (common at scale) won't appear in the apex cert's SANs.
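With only the stdlib ssl and socket modules, grabbing the leaf cert and extracting its SAN list looks roughly like this (a sketch; the actor's actual timeout and error handling may differ):

```python
import socket
import ssl

def fetch_cert(domain, timeout=10):
    """Fetch the leaf certificate served on {domain}:443, stdlib only."""
    ctx = ssl.create_default_context()
    with socket.create_connection((domain, 443), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=domain) as tls:
            return tls.getpeercert()

def san_list(cert):
    """DNS names from the dict shape returned by SSLSocket.getpeercert()."""
    return sorted({name for kind, name in cert.get("subjectAltName", ())
                   if kind == "DNS"})
```

`san_list(fetch_cert("stripe.com"))` would surface the related properties described above.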

6. Open-source presence (GitHub + npm)

  • A simple HEAD to registry.npmjs.org/{slug} catches companies that publish an SDK under their name.
  • We combine it with the GitHub repo count into a has_open_source boolean. Cheap but effective "is this a dev-tools company?" heuristic.
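The npm probe plus the combining heuristic can be sketched like so. npm_package_exists needs the network; has_open_source is the pure combiner. Both names are chosen here for illustration:

```python
import urllib.error
import urllib.request

def npm_package_exists(slug, timeout=10):
    """HEAD registry.npmjs.org/{slug}: 200 means the package name is taken."""
    req = urllib.request.Request(
        f"https://registry.npmjs.org/{slug}", method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False   # 404: no such package
    except urllib.error.URLError:
        return None    # network failure: unknown, not absent

def has_open_source(npm_exists, github_public_repos):
    """Cheap 'is this a dev-tools company?' heuristic."""
    return bool(npm_exists) or (github_public_repos or 0) > 0
```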

7. Tech signals from headers

  • Provides: Server, X-Powered-By, plus CDN inference from cf-ray / x-amz-cf-id / x-vercel-id / x-served-by
  • Limitation: These headers lie constantly. nginx on the edge says nothing about the origin stack. Treat as hints, not gospel. For deep tech-stack profiling, chain into our company-tech-stack-detector actor.
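Header-based inference is a lookup over a handful of telltale header names. A sketch, where the CDN signature mapping is an illustrative subset (in particular, reading x-served-by as Fastly is an assumption, since Varnish setups also emit it):

```python
CDN_SIGNATURES = {
    "cf-ray": "Cloudflare",
    "x-amz-cf-id": "CloudFront",
    "x-vercel-id": "Vercel",
    "x-served-by": "Fastly",  # assumption: usually Fastly/Varnish
}

def tech_hints(headers):
    """Infer hints from HTTP response headers. Hints, not gospel."""
    h = {k.lower(): v for k, v in headers.items()}
    cdn = next((label for header, label in CDN_SIGNATURES.items()
                if header in h), None)
    return {"server": h.get("server"),
            "powered_by": h.get("x-powered-by"),
            "cdn": cdn}
```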

8. robots.txt + sitemap.xml

  • Provides: presence + byte size of each
  • Why: breadth of a company's web presence correlates with sitemap size. A 2KB sitemap is a five-page marketing site; a 2MB sitemap is a content-heavy SaaS with docs, blog, changelog, and programmatic SEO.
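A minimal probe plus a size-bucket classifier, following the 2 KB vs 2 MB rule of thumb above. The bucket thresholds are illustrative, not the actor's exact cutoffs:

```python
import urllib.error
import urllib.request

def body_size(url, timeout=10):
    """Byte size of the response body, or None if missing/unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return len(resp.read())
    except (urllib.error.HTTPError, urllib.error.URLError, OSError):
        return None

def sitemap_bucket(size_bytes):
    """Rough web-presence bucket from sitemap byte size."""
    if size_bytes is None:
        return "none"
    if size_bytes < 10_000:       # ~a five-page marketing site
        return "small"
    if size_bytes < 1_000_000:
        return "medium"
    return "large"                # docs + blog + programmatic SEO
```

Call `body_size(f"https://{domain}/robots.txt")` and the sitemap equivalent to fill both fields.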

What this actor will NOT give you (and why honesty matters)

  • Revenue estimates. Those require bank-account surveillance (Plaid partners) or credit-card transaction panels. Every "free" vendor claiming them is either modeling from job-posting counts (wildly inaccurate) or reselling someone else's licensed data (legally sketchy).
  • Accurate headcount. LinkedIn scraping is the only way, and LinkedIn sues scrapers (see hiQ Labs v. LinkedIn). We won't go there.
  • Funding rounds. Crunchbase and PitchBook license this data directly from filings and press releases. Public signals can hint at "they just hired 20 engineers" but not "they raised a $40M Series B last Tuesday."
  • Contact emails / phone numbers. These live in paid B2B waterfall providers (Apollo, ZoomInfo, Lusha). Scraping them violates every ToS we've read.
  • Intent signals. Those come from bidstream data and reverse-IP vendors. Paid surveillance. Hard no.

If a vendor sells you all of the above for $29/month, they are reselling scraped data and you are the liability shield.


Use cases

Lead enrichment (honest tier)

Enrich a CSV of trial signups with domain age, email provider, and CDN. Combine with your own behavioral data — don't rely on the enrichment to carry the signal. Works at scale because no rate-limited paid API sits in the critical path.

Investor / scout screening

Feed in a list of startups from Product Hunt, Hacker News "Who's Hiring", or YC batch pages. Filter on founded_year >= 2023 AND has_open_source = true AND github_org.total_stars > 100. Instant "legitimate technical team" filter.
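Applied to this actor's output records, that filter is a short list comprehension (field names follow the output schema; records with missing fields are treated as failing the filter):

```python
def technical_startups(profiles):
    """founded_year >= 2023 AND has_open_source AND total_stars > 100."""
    return [
        p for p in profiles
        if (p.get("founded_year") or 0) >= 2023
        and p.get("has_open_source")
        and ((p.get("github_org") or {}).get("total_stars") or 0) > 100
    ]
```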

Competitor monitoring

Nightly Apify schedule against your competitor set. Flag when SSL certs change issuers (migration signal), when DNS host changes (infra migration), when GitHub star growth spikes (launch signal), or when a new SAN appears on the cert (new product URL).

Due diligence sanity-check

Before a partnership or M&A conversation, you want five minutes of public-signal corroboration: does the domain actually exist, is the cert valid, is the GitHub org real, do they publish under the name they claim? This actor answers those in one API call per company.

Security attack-surface recon

The SAN list from SSL inspection is the single best free technique for enumerating a target organization's web properties. Chain it with dig axfr (where unlocked) and crt.sh for defensive reconnaissance. This technique is a staple for bug bounty hunters and blue teams.


Comparison

| Capability | Crunchbase Free | This Actor | Clearbit | Apollo | ZoomInfo |
|---|---|---|---|---|---|
| Status | 💀 Killed 2023 | ✅ Active, maintained | Folded into HubSpot | Freemium (60 credits/mo) | Enterprise only |
| Auth required | API key (when alive) | None | HubSpot account | Account | Sales call |
| Founded year / domain age | ✅ | ✅ (WHOIS proxy) | ✅ | ✅ | ✅ |
| Employee count | ✅ | ❌ (honest: needs LinkedIn) | ✅ | ✅ | ✅ |
| Revenue | ✅ | ❌ (honest: surveillance data) | ✅ | ✅ | ✅ |
| Funding rounds | ✅ (hero feature) | ❌ | Partial | Partial | ✅ |
| Tech stack | ❌ | ✅ (headers + npm + GitHub) | ✅ | Partial | ✅ |
| Email provider (MX) | ❌ | ✅ | ❌ | ❌ | ❌ |
| Related subdomains (SAN) | ❌ | ✅ Unique | ❌ | ❌ | ❌ |
| Open-source signal | ❌ | ✅ Unique | ❌ | ❌ | ❌ |
| Cost | N/A | Apify compute only | $$$ | $$ | $$$$$ |
| ToS risk | N/A | Zero (all public) | Zero | Zero | Zero |

Input schema

{
  "companies": [
    {"name": "Apify", "domain": "apify.com"},
    {"name": "Cloudflare", "domain": "cloudflare.com"},
    "stripe.com"
  ],
  "include_whois": true,
  "include_dns": true,
  "include_github": true,
  "include_ssl": true,
  "timeout_per_source_seconds": 10
}
  • companies — array. Each item is either {name, domain} or a bare string. Bare strings with a dot are treated as domains; otherwise as company names (we guess {slug}.com).
  • The include_* flags let you trim execution when you only need a subset.
  • timeout_per_source_seconds — per-source ceiling. Failures don't kill the whole lookup (asyncio.gather(return_exceptions=True)).
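The fan-out described above can be sketched with asyncio — a simplified model of the behavior, not the actor's actual source:

```python
import asyncio

async def aggregate(domain, sources, timeout=10):
    """Run independent source fetchers concurrently.

    A timeout or exception in one source becomes a null field plus a
    source_errors entry instead of killing the whole lookup."""
    async def guarded(name, fetcher):
        return name, await asyncio.wait_for(fetcher(domain), timeout)

    outcomes = await asyncio.gather(
        *(guarded(name, fetcher) for name, fetcher in sources.items()),
        return_exceptions=True,
    )
    profile, errors = {}, {}
    for name, outcome in zip(sources, outcomes):
        if isinstance(outcome, BaseException):
            profile[name] = None          # field stays nullable
            errors[name] = repr(outcome)  # recorded for selective retry
        else:
            profile[name] = outcome[1]
    return profile, errors
```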

Output schema

{
  "input": {"name": "Apify", "domain": "apify.com"},
  "resolved_domain": "apify.com",
  "company_name": "Apify",
  "logo_url": "https://www.google.com/s2/favicons?domain=apify.com&sz=128",
  "founded_year": 2015,
  "registrar": "GoDaddy.com, LLC",
  "email_provider": "Google Workspace",
  "dns_host": "Cloudflare",
  "github_org": {
    "name": "apify",
    "public_repos": 120,
    "total_stars": 8400,
    "url": "https://github.com/apify"
  },
  "ssl_issuer": "Google Trust Services",
  "ssl_validity_days": 65,
  "related_domains": ["api.apify.com", "docs.apify.com"],
  "tech_hints": {"server": "cloudflare", "cdn": "Cloudflare"},
  "has_open_source": true,
  "data_freshness": "2026-04-17T12:00:00+00:00"
}

Each field is nullable; if a source fails or a signal is absent, we return null and record the error in source_errors (one entry per failed source) so you can retry selectively.
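Selective retry then falls out naturally: rebuild the include_* input flags from source_errors and re-run only what failed. A sketch, assuming source_errors is keyed by source name:

```python
SOURCES = ("whois", "dns", "github", "ssl")

def retry_flags(record):
    """include_* input flags for a retry touching only failed sources."""
    failed = set(record.get("source_errors") or {})
    if not failed:
        return None  # nothing to retry
    return {f"include_{s}": s in failed for s in SOURCES}
```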


Related actors

  • company-tech-stack-detector — deeper tech fingerprinting (JS frameworks, analytics, CMS) via Wappalyzer-style DOM parsing. Pipe the resolved_domain from this actor in.
  • domain-whois-lookup — standalone WHOIS with deeper parsing (expiry alerts, status codes). Use for portfolio monitoring.
  • page-speed-analyzer — Lighthouse + Core Web Vitals for the resolved domain. Great for competitive performance dashboards.
  • company-logo-api — vector-quality logo extraction (Clearbit-style fallback chain: apple-touch-icon → og:image → favicon).

FAQ

Q: Why no revenue or headcount? A: Those require paid surveillance data vendors. Revenue estimates come from transaction-panel licensees; headcount comes from LinkedIn scraping (prohibited under LinkedIn's ToS and litigated heavily since hiQ v. LinkedIn). This actor is honest about what public signals can reveal — and everything we return is either owned by the company itself (their DNS, their cert, their GitHub) or published by a public authority.

Q: Can I use this as a drop-in Crunchbase replacement? A: For the fields it covers, yes — and for many use cases you were over-paying Crunchbase for signals you didn't need. For funding-round and revenue data specifically, you still need a paid provider (PitchBook, Crunchbase Pro, Dealroom).

Q: What happens when GitHub rate-limits me? A: Unauthenticated calls are capped at 60/hour per IP. For bulk runs, set GITHUB_TOKEN as an Apify actor environment variable (a fine-grained personal access token with read-only public access is enough). That lifts the cap to 5000/hour. The actor degrades gracefully when rate-limited — github_org returns null and the error is recorded in source_errors.

Q: Does this work for non-US domains? A: Yes. WHOIS + DNS + SSL are globally uniform. GitHub org guessing works for any Latin-script name. GDPR-redacted WHOIS will show registrar but empty registrant — we still extract the creation date.

Q: Will this trigger rate limits or IP bans? A: No. The only third-party HTTP calls are GitHub (well-documented), npm (public registry), and the target domain itself (one HEAD + two GETs to robots/sitemap). Everything else is DNS + TLS — protocol-level, not HTTP.

Q: Can I run this on 10,000 companies? A: Yes, with caveats. Bring a GITHUB_TOKEN. Set timeout_per_source_seconds to 15. Expect the bottleneck to be WHOIS (some TLDs throttle WHOIS aggressively). Budget ~2–4 seconds per company.

Q: How fresh is the data? A: Live. Every field is fetched at actor runtime. data_freshness records the exact ISO timestamp of the lookup. There is no caching layer — which also means no stale data, ever.


Built for teams who got burned when Crunchbase killed their free API. Run it on the Apify platform.

🔗 Sign up for Apify · Feedback welcome via the Issues tab.