YCombinator Companies Scraper | 5,900+ YC Startup Directory
Pricing
from $1.00 / 1,000 results
Scrape the Y Combinator startup directory (5,900+ funded companies) via the official Algolia API. Name, website, batch, status, team size, industry, tags, hiring flag, launched-at, logo. B2B sales prospecting, recruiter intel, VC analytics. HTTP-only, fast.
Developer
Haketa
Maintained by Community
YCombinator Companies Scraper — 5,900+ YC Startup Directory Extractor for Sales, Recruiting, VC Research & Competitive Intelligence
The fastest, most complete Y Combinator startup directory extractor on Apify. Pull every funded YC company since 2005 (name, website, batch, status, stage, team size, industry, tags, region, hiring flag, launched-at, logo) straight from the official Algolia search backend that powers ycombinator.com/companies. Zero browsers, zero anti-bot, ideal ICP data for B2B SaaS sales prospecting, recruiter intel, VC analytics, and competitive landscape mapping.
What This Actor Does
The YCombinator Companies Scraper is a production-grade Apify Actor that extracts the complete Y Combinator funded startup directory: every company backed by YC from the first Summer 2005 batch through every subsequent Winter, Summer, Spring, and Fall batch up to the latest cohort, plus the IK12 (Imagine K12) cohort. As of the current snapshot, that's 5,916 funded companies spanning early-stage seed bets to publicly traded YC alumni like Airbnb, Coinbase, DoorDash, and Reddit.
Under the hood, the actor talks directly to YC's official Algolia search backend — the very same 45BWZJ1SGC Algolia app and YCCompany_production index that power the search box, filters, and infinite scroll on ycombinator.com/companies. No headless browser. No HTML parsing. No anti-bot to dodge. Just polite, low-concurrency HTTP POST calls to the public Algolia REST endpoint with the ycdc_public tag filter that YC explicitly publishes for client-side use.
In a single run (typically 5 seconds for 50 records, ~5 minutes for the full 5,916-company catalog via batch fan-out), the actor returns richly normalized JSON records covering:
- Companies — every YC-funded startup (Active, Acquired, Public, Inactive)
- Stages — Seed, Early, Growth — useful for filtering to post-funding ICP
- Industries: B2B, Consumer, Fintech, Healthcare, Government, Education, Real Estate and Construction, Industrials and more
- Regions: United States of America, Europe, India, Asia, Latin America, Africa, Canada, Australia / New Zealand
- Batches: Winter 2024, Summer 2024, Spring 2024, Fall 2024, and every earlier batch back to Summer 2005, plus IK12
- Hiring signal: isHiring boolean for recruiter and job-board use cases
- Top company flag: YC's curated list of unicorns and best exits
Every record ships with the company's website, one-liner pitch, long description, logo URL, team size, launched-at timestamp, tags, sub-industry, former names, and a deep link back to the canonical ycombinator.com/companies/<slug> profile.
Why scrape Y Combinator yourself when this exists?
YC's directory looks innocently easy to scrape — it's just a public page. But teams that try the DIY route quickly hit a stack of headaches:
- The directory is fully JavaScript-rendered React: curl of the HTML returns an empty shell with zero company data
- A headless browser approach (Puppeteer, Playwright) means 5-10 minute runs for full coverage and high compute cost
- Without knowing the secured Algolia key, naive Algolia callers get 403 Forbidden; the key is base64-encoded and rotates implicitly via embedded validUntil
- Algolia caps a single secured-key query at 1,000 results, so you can't just ask for "all 5,916 companies in one call"
- The on-site infinite scroll uses Algolia's page pagination, which silently truncates past 1,000; most DIY scripts plateau at ~1,000 and never notice the missing 4,900 records
- Facet filters use a nested array-of-arrays syntax (facetFilters=[["batch:Winter 2024"],["status:Active"]]) that's poorly documented outside Algolia's own docs
- Field names in the Algolia response (one_liner, small_logo_thumb_url, all_locations, launched_at) need normalization to a sane camelCase schema before they're database-ingestible
- Timestamps come as Unix epochs that need ISO-8601 conversion
- YC tweaks the index periodically (adding top_company, regions, subindustry, splitting industries into an array), meaning your custom scraper breaks silently
- Zero retry / backoff on the naive call means transient Algolia 5xx errors kill your run
This actor solves all of that: it speaks the Algolia facetFilters dialect, fans out by batch to get past the 1,000-result cap, retries failed calls with backoff, normalizes every field, converts launched-at to ISO timestamps, and calls Actor.fail() on zero records so you never get a silent SUCCEEDED with an empty dataset.
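The retry behaviour mentioned above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actor's source: `call_with_retry` and its parameters are hypothetical names, the 2s/4s/6s waits plus jitter are taken from the engineering notes on this page, and "3-attempt" is read here as one initial call plus up to three retries.

```python
import random
import time


def call_with_retry(fn, waits=(2.0, 4.0, 6.0), jitter=0.5, sleep=time.sleep):
    """Call fn(); on failure wait (with jitter) and try again.

    Hypothetical helper: the wait schedule mirrors the 2s, 4s, 6s values
    quoted on this page; everything else is an assumption.
    """
    for wait in waits:
        try:
            return fn()
        except Exception:  # real code would catch only network/HTTP errors
            sleep(wait + random.uniform(0, jitter))
    # Final attempt: let any exception propagate to the caller.
    return fn()
```

Passing `sleep` as a parameter keeps the helper easy to test, because a test can record the requested waits instead of actually sleeping.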
Quick Start
One-Click Run
- Click "Try for free" on the Apify Store page
- Leave inputs empty to browse the first 500 YC companies, or type AI into the query box for AI-focused startups
- Hit Start. Your dataset is ready in under 10 seconds for a default run
- Download as JSON, CSV, Excel, or HTML directly from the Apify dataset view, or pipe to Google Sheets / a webhook
API Run (Python)
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Example 1: every YC-funded AI startup that's actively hiring
run = client.actor("haketa/ycombinator-companies-scraper").call(run_input={
    "query": "AI",
    "hiringOnly": True,
    "maxRecords": 500,
})

for company in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"{company['name']:<30} {company['batch']:<12} "
          f"team={company['teamSize']} {company['website']}")
API Run (Python — full catalog via batch fan-out)
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Pull the entire YC catalog by fanning out across recent batches
batches = [
    "Winter 2024", "Summer 2024", "Spring 2024", "Fall 2024",
    "Winter 2023", "Summer 2023", "Winter 2022", "Summer 2022",
    "Winter 2021", "Summer 2021", "Winter 2020", "Summer 2020",
    # ...add all batches back to IK12 for full 5,916-company coverage
]

run = client.actor("haketa/ycombinator-companies-scraper").call(run_input={
    "batches": batches,
    "maxRecords": 0,  # unlimited
    "hitsPerPage": 1000,
    "requestDelay": 300,
})

# Report how many records landed in the run's default dataset
dataset_info = client.dataset(run["defaultDatasetId"]).get()
print(f"Saved {dataset_info['itemCount']} records to dataset {run['defaultDatasetId']}")
API Run (Node.js / TypeScript)
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

const run = await client.actor('haketa/ycombinator-companies-scraper').call({
    industries: ['Fintech'],
    statuses: ['Active'],
    regions: ['United States of America'],
    stages: ['Early', 'Growth'],
    hiringOnly: true,
    maxRecords: 1000,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Got ${items.length} US fintech YC companies hiring at Early/Growth stage`);
items.slice(0, 5).forEach(c => console.log(`- ${c.name}: ${c.oneLiner}`));
API Run (cURL)
curl -X POST "https://api.apify.com/v2/acts/haketa~ycombinator-companies-scraper/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "developer tools",
    "batches": ["Winter 2024", "Summer 2024"],
    "hiringOnly": true,
    "maxRecords": 200
  }'
How It Works
YC's directory at ycombinator.com/companies is a React single-page app whose search, filters, and infinite scroll all call Algolia's hosted search REST API. The Algolia application is publicly identifiable in the browser network tab:
- Algolia Application ID: 45BWZJ1SGC
- Primary index: YCCompany_production
- Secondary index: YCCompany_By_Launch_Date_production
- Tag filter: ycdc_public (YC's own tag for client-exposed data)
- Endpoint: https://45bwzj1sgc-dsn.algolia.net/1/indexes/YCCompany_production/query
The actor POSTs a JSON body with a URL-encoded params string containing the free-text query, hitsPerPage, page, and a facetFilters array-of-arrays expressing AND-across-categories / OR-within-category logic. It uses the same browser-exposed secured API key that ycombinator.com hands out — a base64 blob that embeds analyticsTags=ycdc, restrictIndices=YCCompany_production,YCCompany_By_Launch_Date_production, and tagFilters=["ycdc_public"] so it can only ever return data YC has explicitly marked public.
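As a rough sketch of the request shape described above, the snippet below builds the JSON body with a URL-encoded params string. `build_query_body` is a hypothetical helper, not the actor's code; the secured API key and the `X-Algolia-Application-Id` / `X-Algolia-API-Key` headers that a real call would need are deliberately left out.

```python
import json
from urllib.parse import urlencode


def build_query_body(query, facet_filters, page=0, hits_per_page=1000):
    """Return the JSON-serializable body for an Algolia index query.

    facet_filters is the array-of-arrays form: AND across the outer list,
    OR within each inner list.
    """
    params = urlencode({
        "query": query,
        "hitsPerPage": hits_per_page,
        "page": page,
        # The nested list travels as compact JSON inside the params string.
        "facetFilters": json.dumps(facet_filters, separators=(",", ":")),
    })
    return {"params": params}


body = build_query_body(
    "AI",
    [["batch:Winter 2024"], ["status:Active", "status:Acquired"]],
)
print(json.dumps(body))
```

The example filter reads as "batch is Winter 2024 AND (status is Active OR status is Acquired)".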
Endpoint reference
| Source | Endpoint | Records | Cadence |
|---|---|---|---|
| Algolia primary | https://45bwzj1sgc-dsn.algolia.net/1/indexes/YCCompany_production/query | 5,916 companies (current snapshot) | Live — updated by YC continuously |
| Algolia secondary | https://45bwzj1sgc-dsn.algolia.net/1/indexes/YCCompany_By_Launch_Date_production/query | Same companies, sorted by launch date | Live |
| YC profile page | https://www.ycombinator.com/companies/<slug> | One per company | Live |
Engineering details
- HTTP-only via got-scraping: no Puppeteer, no Playwright, no Chromium. Each Algolia call is a single sub-second HTTPS POST.
- Algolia facet-filter dialect: nested array-of-arrays serialized as URL-encoded JSON, e.g. [["batch:Winter 2024"],["status:Active","status:Acquired"]].
- Batch fan-out for the 1,000-cap: Algolia caps a secured-key query at 1,000 hits. To exceed that, the actor lets you list every batch (Winter 2024, Summer 2024, ..., IK12) and runs one query per batch. Every batch is well under 1,000 companies, so 50+ batch queries together cover the full 5,916-company catalog.
- Pagination loop: each filter combination loops page=0..nbPages-1 collecting hits, deduplicating by Algolia id along the way.
- 3-attempt retry with backoff: failed Algolia calls are retried after 2s, 4s, and 6s waits plus jitter. Permanent failure logs an error and skips the batch.
- Actor.fail() on zero results: prevents the dreaded "SUCCEEDED with empty dataset" scenario; the run explicitly fails with a hint about case-sensitive batch/industry spellings.
- Polite delays: configurable requestDelay (default 300ms) between Algolia calls so the actor never hammers YC's infrastructure.
- Field normalization: Algolia's snake_case (one_liner, small_logo_thumb_url, all_locations, launched_at, team_size) is mapped to clean camelCase (oneLiner, logoUrl, location, launchedAt, teamSize).
- Timestamp conversion: the Unix launched_at epoch is converted to ISO-8601 launchedAt, and the raw value is kept as launchedAtUnix for time-series workflows.
- No proxy required: Algolia's public search endpoint has zero anti-bot. You may attach Apify Proxy via proxyConfiguration if you want, but it's pure overhead for this actor.
- Deterministic output: for a given state of the index, the same input produces the same set of records (Algolia returns hits in its default relevance order for the query).
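The batch fan-out, pagination loop, and id-based deduplication from the list above could look roughly like this. It is a sketch only: `collect_hits` and the `fetch_page(batch, page)` callable are hypothetical, and the response is assumed to carry the `hits` and `nbPages` keys that Algolia search responses use.

```python
def collect_hits(fetch_page, batches, max_records=0):
    """Run one paginated query per batch and merge the de-duplicated hits.

    fetch_page(batch, page) must return a dict with 'hits' (list of dicts
    that each have an 'id') and 'nbPages'. max_records=0 means unlimited.
    """
    seen_ids = set()
    results = []
    for batch in batches:
        page, nb_pages = 0, 1  # nb_pages is corrected after the first response
        while page < nb_pages:
            response = fetch_page(batch, page)
            nb_pages = response["nbPages"]
            for hit in response["hits"]:
                if hit["id"] in seen_ids:
                    continue  # the same company surfaced in an earlier query
                seen_ids.add(hit["id"])
                results.append(hit)
                if max_records and len(results) >= max_records:
                    return results
            page += 1
    return results
```

Injecting `fetch_page` keeps the loop independent of the HTTP layer, so the delay and retry policy can live in that callable.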
Input Parameters
{
    "query": "AI",
    "batches": ["Winter 2024", "Summer 2024"],
    "statuses": ["Active"],
    "industries": ["B2B"],
    "regions": ["United States of America"],
    "stages": ["Seed", "Early"],
    "hiringOnly": true,
    "topCompaniesOnly": false,
    "maxRecords": 500,
    "hitsPerPage": 1000,
    "requestDelay": 300
}
Parameter reference
| Parameter | Type | Default | Description |
|---|---|---|---|
| query | string | "" | Free-text search across name, one-liner, long description, industry, and tags. Empty = browse all. Examples: "AI", "developer tools", "fintech", "climate", "vertical SaaS". |
| batches | array<string> | [] | Filter by YC batch. Format: "Winter 2024", "Summer 2023", "Spring 2024", "Fall 2024", "IK12", etc. Each batch listed runs as a separate fan-out query, which is the recommended way to break the Algolia 1000-result cap and pull the full catalog. |
| statuses | array<string> | [] | Filter by company status. Values: "Active", "Acquired", "Public", "Inactive". Empty = all four. |
| industries | array<string> | [] | Filter by industry. Examples: "B2B", "Consumer", "Fintech", "Healthcare", "Government", "Real Estate and Construction", "Education", "Industrials". Empty = all. Case-sensitive: match YC's exact spelling. |
| regions | array<string> | [] | Filter by region. Examples: "United States of America", "Europe", "Asia", "India", "Latin America", "Africa", "Canada", "Australia / New Zealand". |
| stages | array<string> | [] | Filter by company stage. Values: "Seed", "Early", "Growth". Empty = all three. |
| hiringOnly | boolean | false | When true, only returns companies with isHiring: true. Killer filter for recruiters and job-board operators. |
| topCompaniesOnly | boolean | false | When true, only returns YC's curated Top Companies, the unicorns and best exits (think Airbnb, Stripe, Coinbase, DoorDash, Reddit, Twitch, Instacart). |
| maxRecords | integer | 500 | Hard cap on total records across all fan-out queries. 0 = unlimited (bounded by Algolia's 1000-per-query cap × number of filter combinations). Set to 0 when pulling the full 5,916-company catalog. |
| hitsPerPage | integer | 1000 | Algolia page size. 1000 is the maximum the secured key allows and keeps request count minimal. |
| requestDelay | integer | 300 | Milliseconds between Algolia calls. Algolia is sub-second fast but 200-500ms is the polite range. |
| proxyConfiguration | object | none | Optional Apify proxy. Almost never needed: Algolia's public search API has zero rate-limit on the ycdc_public tag. |
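To make the filter semantics in the parameter table concrete, here is one way the input could be translated into Algolia's facetFilters structure. This is a guess at the mapping, not the actor's implementation: `build_facet_filters` is hypothetical, and the facet attribute names other than `batch` and `status` (which appear in this page's examples) are assumptions.

```python
# Input key -> assumed Algolia facet attribute name.
FACET_BY_INPUT_KEY = (
    ("statuses", "status"),
    ("industries", "industries"),
    ("regions", "regions"),
    ("stages", "stage"),
)


def build_facet_filters(run_input, batch=None):
    """Return facetFilters as a list of OR-groups that are ANDed together."""
    groups = []
    if batch:
        # During fan-out each query is pinned to a single batch.
        groups.append([f"batch:{batch}"])
    for input_key, facet in FACET_BY_INPUT_KEY:
        values = run_input.get(input_key) or []
        if values:
            groups.append([f"{facet}:{value}" for value in values])
    if run_input.get("hiringOnly"):
        groups.append(["isHiring:true"])  # assumed facet name
    if run_input.get("topCompaniesOnly"):
        groups.append(["top_company:true"])  # assumed facet name
    return groups
```

An empty array in the input adds no group at all, which matches the "Empty = all" behaviour described in the table.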
Output Schema
Every record is a flat JSON object with the same field set, so downstream consumers (Postgres, Snowflake, Salesforce, HubSpot, Airtable) can ingest without per-category branching.
Core company fields
| Field | Type | Description |
|---|---|---|
| companyId | integer | Stable YC-assigned numeric ID. Use as the primary key in your warehouse. |
| name | string | Company name (e.g., "Airbyte", "Stripe", "&AI"). |
| slug | string | URL-safe handle (e.g., "airbyte", "stripe", "and-ai"). |
| ycProfileUrl | string | Canonical deep link: https://www.ycombinator.com/companies/<slug>. |
| website | string | The company's own homepage URL. |
| oneLiner | string | The pitch in a sentence (e.g., "Open-source data movement infrastructure"). |
| longDescription | string | Multi-sentence company description from the YC profile. |
| logoUrl | string | Thumbnail logo URL hosted on YC's CDN. |
Classification fields
| Field | Type | Description |
|---|---|---|
| batch | string | YC cohort (e.g., "Winter 2020", "Summer 2024", "IK12"). |
| status | string | "Active", "Acquired", "Public", or "Inactive". |
| stage | string | "Seed", "Early", or "Growth". |
| industry | string | Primary industry (e.g., "B2B", "Fintech", "Healthcare"). |
| subindustry | string | More granular vertical (e.g., "B2B -> Sales", "Fintech -> Banking and Exchange"). |
| industries | array<string> | Full multi-industry tag list. |
| tags | array<string> | Free-form descriptive tags (e.g., ["AI", "Sales", "B2B", "LegalTech"]). |
Operational fields
| Field | Type | Description |
|---|---|---|
| teamSize | integer | Reported headcount at scrape time. |
| location | string | Free-text location string (e.g., "San Francisco, CA, USA"). |
| regions | array<string> | Normalized region list (e.g., ["America / Canada", "United States of America", "Remote"]). |
| isHiring | boolean | true if the company is actively hiring on Work at a Startup. |
| topCompany | boolean | true if YC has curated this company on its "Top Companies" list. |
| nonprofit | boolean | true if registered as a nonprofit (YC funds a few each batch). |
| formerNames | array<string> | Previous names if the company rebranded. |
| launchedAt | string | ISO-8601 launch date (e.g., "2024-07-15T00:00:00.000Z"). |
| launchedAtUnix | integer | Same timestamp as Unix epoch seconds; convenient for time-series joins. |
Provenance fields
| Field | Type | Description |
|---|---|---|
| searchQuery | string | The query string that surfaced this record (echoed back for multi-query runs). |
| searchBatch | string | The batch filter that surfaced this record (for fan-out runs). |
| scrapedAt | string | ISO-8601 timestamp of when the actor pulled this record. |
Example: An AI B2B startup (verified live from query="AI")
{"companyId": 31984,"name": "&AI","slug": "and-ai","ycProfileUrl": "https://www.ycombinator.com/companies/and-ai","website": "https://www.and.ai","oneLiner": "AI for IP and patent law","longDescription": "&AI builds the AI copilot for IP attorneys and patent agents — drafting, prior art searches, office action responses, and portfolio analytics in one workspace.","logoUrl": "https://bookface-images.s3.amazonaws.com/small_logos/and-ai.png","location": "New York, NY, USA","regions": ["America / Canada", "United States of America"],"batch": "Summer 2024","status": "Active","stage": "Seed","teamSize": 13,"industry": "B2B","subindustry": "B2B -> LegalTech","industries": ["B2B", "B2B -> LegalTech"],"tags": ["AI", "Artificial Intelligence", "LegalTech", "B2B"],"topCompany": false,"isHiring": true,"nonprofit": false,"formerNames": null,"launchedAt": "2024-07-20T00:00:00.000Z","launchedAtUnix": 1721433600,"searchQuery": "AI","searchBatch": null,"scrapedAt": "2026-05-18T09:15:00.000Z"}
Example: A growth-stage YC alumnus (Airbyte)
{"companyId": 23892,"name": "Airbyte","slug": "airbyte","ycProfileUrl": "https://www.ycombinator.com/companies/airbyte","website": "https://airbyte.com","oneLiner": "Open-source data movement infrastructure","longDescription": "Airbyte is the leading open-source ELT platform with 300+ pre-built connectors. Used by thousands of data teams to centralize data into warehouses, lakes, and AI vector stores.","logoUrl": "https://bookface-images.s3.amazonaws.com/small_logos/airbyte.png","location": "San Francisco, CA, USA","regions": ["America / Canada", "United States of America", "Remote"],"batch": "Winter 2020","status": "Active","stage": "Growth","teamSize": 90,"industry": "B2B","subindustry": "B2B -> Engineering, Product and Design","industries": ["B2B", "B2B -> Engineering, Product and Design"],"tags": ["AI", "Data Engineering", "Open Source", "Developer Tools"],"topCompany": true,"isHiring": true,"nonprofit": false,"formerNames": null,"launchedAt": "2020-07-21T00:00:00.000Z","launchedAtUnix": 1595289600,"searchQuery": "AI","searchBatch": null,"scrapedAt": "2026-05-18T09:15:00.000Z"}
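The launchedAt / launchedAtUnix pair in these records follows from the field mapping and timestamp conversion described under Engineering details. A minimal sketch of that step, with `normalize_hit` and `FIELD_MAP` as hypothetical names and only the renames listed on this page included:

```python
from datetime import datetime, timezone

# snake_case Algolia field -> camelCase output field (renames quoted on this page)
FIELD_MAP = {
    "one_liner": "oneLiner",
    "small_logo_thumb_url": "logoUrl",
    "all_locations": "location",
    "team_size": "teamSize",
}


def normalize_hit(hit):
    """Rename fields and convert launched_at (epoch seconds) to ISO-8601."""
    record = {new: hit.get(old) for old, new in FIELD_MAP.items()}
    launched = hit.get("launched_at")
    if launched is None:
        record["launchedAt"] = None
    else:
        moment = datetime.fromtimestamp(launched, tz=timezone.utc)
        record["launchedAt"] = moment.strftime("%Y-%m-%dT%H:%M:%S.000Z")
    record["launchedAtUnix"] = launched
    return record


sample = normalize_hit({"one_liner": "Open-source data movement infrastructure",
                        "team_size": 90, "launched_at": 1595289600})
print(sample["launchedAt"])  # 2020-07-21T00:00:00.000Z
```

The sample epoch is the Airbyte value from the example record, so the printed string matches that record's launchedAt.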
Status, Stage & Industry Reference
Company statuses
| Status | Meaning |
|---|---|
| Active | Still operating independently and most likely raising or growing |
| Acquired | Bought by another company (great for M&A pattern research) |
| Public | IPO'd or listed via SPAC (Airbnb, Coinbase, DoorDash, Reddit, etc.) |
| Inactive | Shut down, wound up, or otherwise dormant |
Stages
| Stage | Typical Profile |
|---|---|
| Seed | Just out of YC, < 10 people, pre-Series A; primary recruiter and SDR target |
| Early | Series A / B, 10-100 people; prime ICP for dev tools, payroll, HR, observability SaaS |
| Growth | Series C+, 100+ people; enterprise SaaS, fintech, and consulting ICP |
Top YC industries
| Industry | Notes |
|---|---|
| B2B | The largest industry segment: SaaS, dev tools, sales, HR, security, finance ops |
| Consumer | DTC, social, gaming, marketplaces, creator economy |
| Fintech | Banking, payments, lending, crypto, insurance, wealth management |
| Healthcare | Diagnostics, telehealth, biotech, mental health, healthtech infrastructure |
| Education | K-12, higher ed, professional learning, EdTech infrastructure |
| Real Estate and Construction | PropTech, construction tech, vacation rentals, real estate fintech |
| Government | GovTech, defense, public-sector SaaS |
| Industrials | Hardware, manufacturing, supply chain, climate, space |
Tip: Use industries: ["B2B"] + stages: ["Early", "Growth"] + hiringOnly: true to get the canonical SaaS-sales prospecting list of post-funded, growing-headcount B2B YC companies.
Use Cases
B2B SaaS Sales Prospecting
Funded YC startups are the highest-converting cohort for dev-tools, payroll, HR, observability, security, payment, and infrastructure SaaS sales teams. They're flush with capital, growing headcount, and the founders are technically literate so the sales cycle is short.
- Build hyper-targeted ICP lists by combining industries: ["B2B"] + stages: ["Early", "Growth"] + teamSize > 20 (teamSize is an output field, not an input filter, so apply that threshold to the results downstream)
- Identify post-funding spikes by filtering on the most recent 4 batches (Winter 2024, Summer 2024, Spring 2024, Fall 2024); these are the companies with fresh capital and procurement budgets
- Enrich your CRM by appending YC batch year, stage, team size, and industry to existing Salesforce/HubSpot accounts
- Run trigger-based outbound: when a Seed-stage company in your ICP rolls over to Early, that's a buying-signal alert
- Route territory ownership by region (regions: ["United States of America"] vs regions: ["Europe"])
- Score account fit using YC batch as a proxy for company sophistication (a W24 company has different needs than an IK12 company)
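The parameter reference lists no team-size input, so a threshold such as teamSize > 20 has to be applied to the returned items. A small sketch, with `filter_by_team_size` as a hypothetical helper:

```python
def filter_by_team_size(items, min_team_size=20):
    """Keep records whose teamSize is strictly greater than min_team_size.

    Records with a missing or null teamSize are dropped.
    """
    return [c for c in items if (c.get("teamSize") or 0) > min_team_size]


companies = [
    {"name": "Alpha", "teamSize": 50},
    {"name": "Beta", "teamSize": 5},
    {"name": "Gamma", "teamSize": None},
]
print([c["name"] for c in filter_by_team_size(companies)])  # ['Alpha']
```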
Recruiter & Executive Search Intel
The YC alumni network is one of the most concentrated sources of "ex-founder", "early-engineer-at-unicorn", and "first-PM" talent. The isHiring flag is gold for recruiters.
- Pull every YC company hiring right now (hiringOnly: true plus a stage filter) and pitch retained search to the founder
- Build executive search target lists of late-stage YC alumni (stage: "Growth", topCompany: true) for VP / C-suite placements
- Source ex-YC engineers for your client roster by joining this dataset with LinkedIn (the YC company website often lists "About" / "Team")
- Visa-friendly employer mapping: combine with the H1B Visa Database to surface YC companies actively sponsoring H-1Bs
- Time recruiter outreach to the launched-at date, since a new launch often means hiring volume jumps
- Identify acqui-hire targets by filtering status: "Inactive" and recent batches; these founders need a soft landing
VC Analytics & Deal Flow
Whether you're a seed-stage VC tracking YC dealflow or a growth fund mapping competitor portfolios, this dataset is the foundation.
- Competitor portfolio mapping — "What did Sequoia / a16z / Founders Fund back from Winter 2024?" by joining YC names with public investor databases
- Theme-based pipeline building: query: "AI agents" returns every YC AI-agent startup; query: "vertical SaaS healthcare" returns the vertical SaaS healthcare cohort
- Batch-over-batch trend analysis: count AI startups in W22 vs W23 vs W24 to quantify the AI explosion
- Stage progression tracking: diff stage between monthly runs to spot companies graduating from Seed to Early (= recent fundraise = re-engage)
- Geographic dealflow: regions: ["India"] or regions: ["Latin America"] surfaces emerging-market YC dealflow
- Top Company anomaly detection: a topCompany: true company suddenly switching to status: "Inactive" is a data point worth investigating
Startup Research & Journalism
YC's batch composition is one of the best leading indicators of startup-ecosystem trends. Journalists, analysts, and researchers use this dataset to write data-driven stories.
- Quantify the AI explosion: count companies tagged "AI" per batch since W22; the curve goes vertical in W23-S24
- Track the fintech retreat of 2022: count Fintech-tagged companies per batch and chart it
- Cover the climate-tech rebound of 2024: query: "climate" per batch over time
- Build investor pitch-deck appendices with charts of YC team-size growth, batch-size evolution, geographic distribution
- Profile cohorts: pull all of W24, sort by launchedAt, write a 5,000-word "State of W24" feature
- Compare YC to Techstars / 500 by joining this dataset with sibling Apify scrapers
University Career Services
YC alumni companies hire aggressively from top CS programs. Career services teams build curated boards from the YC isHiring feed.
- Show students which YC startups are hiring: filter hiringOnly: true + region matching campus
- Cross-reference with visa data: combine with the H1B Visa Database for international student career boards
- Build alumni placement reports — "X% of our CS '24 grads went to YC-backed startups"
- Power on-campus recruiting pitches — invite hiring YC founders to do recruiting trips
- Career fairs — pull all SF Bay Area YC companies hiring to plan a Bay Area trek
Conference & Event Sales
SaaStr, TechCrunch Disrupt, MicroConf, the Stage Convention — every B2B SaaS conference needs to fill seats with funded-founder buyers. YC companies are their bread and butter.
- Build a SaaStr 2026 prospect list: stages: ["Early", "Growth"] + industries: ["B2B"]
- TechCrunch Disrupt early-bird list: stages: ["Seed"] + most recent 2 batches
- Sponsor outreach: topCompany: true companies are the dream sponsors with marketing budget
- Speaker sourcing: Growth-stage YC founders make excellent panel speakers
- Side-event invitee lists: every YC company in regions: ["United States of America"] for the SF event circuit
Marketing & Ad Targeting
LinkedIn and Facebook custom audiences become dramatically more valuable when you can build a "YC-alumni-founder" persona.
- LinkedIn custom audience seed: upload the company names and websites from this dataset (joined with founder LinkedIn URLs sourced elsewhere, since founder names are not part of the output) for ABM campaigns
- Founder-targeted Facebook custom audiences — match YC company websites to Facebook business accounts
- Lookalike modeling — train a lookalike on YC founders to find similar prospects outside YC
- Account-based marketing (ABM) for B2B SaaS — every YC company becomes a 1-row ABM target
- Industry-specific newsletters — sell ad spots to AI / Fintech / Healthcare advertisers and price by audience size in the dataset
Competitive Landscape Mapping & Strategy Decks
Product strategy teams pay consultants $50K+ for "competitive landscape" decks. This dataset lets you build them in an afternoon.
- "Every YC AI sales startup since 2020": query: "AI sales" + batches: <list>, for sales-tech market mapping
- "Every YC developer tools startup since 2005": query: "developer tools" for dev-tools market saturation analysis
- Industry concentration matrix: an industry x batch pivot reveals where YC is concentrating bets
- Product-strategy gap analysis: find an industry with few YC entrants, which is likely a green field
- Investor memo appendix: a statement of the form "Of the N AI infrastructure startups YC has funded since W22, only M are growth-stage" is a powerful slide
- Market sizing: total team size summed across an industry = a directional TAM proxy
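The industry x batch pivot and the summed team-size proxy from the list above can be computed directly on the output records. A minimal sketch using only the standard library; the function names are hypothetical and the field names come from the Output Schema:

```python
from collections import Counter, defaultdict


def industry_batch_pivot(items):
    """Count companies per (industry, batch) pair."""
    return Counter((c.get("industry"), c.get("batch")) for c in items)


def team_size_by_industry(items):
    """Sum teamSize per industry, treating missing values as 0."""
    totals = defaultdict(int)
    for company in items:
        totals[company.get("industry")] += company.get("teamSize") or 0
    return dict(totals)


records = [
    {"industry": "B2B", "batch": "Winter 2024", "teamSize": 10},
    {"industry": "B2B", "batch": "Winter 2024", "teamSize": None},
    {"industry": "Fintech", "batch": "Summer 2024", "teamSize": 5},
]
print(industry_batch_pivot(records)[("B2B", "Winter 2024")])  # 2
print(team_size_by_industry(records))  # {'B2B': 10, 'Fintech': 5}
```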
M&A / Sourcing & Acquisition Targets
status: "Active" + stage: "Early" + sluggish team-size growth = a candidate acqui-hire conversation. Top corporate development teams scout YC alumni systematically.
- Pre-Series-B acquisition targets: stage: "Early" + status: "Active" + small team size
- Defensive acquisitions: find every YC company in your direct vertical and triage threat level
- Acqui-hire scouting: status: "Inactive" companies whose founders are signal-rich talent
- Founder LinkedIn enrichment: join company names with LinkedIn to identify founders and cold-message them about strategic conversations
- Competitor's portfolio acquisition: when a competitor goes on a YC-buying spree, the dataset surfaces the pattern
Investor Research & LP Reporting
LPs and emerging fund managers use YC dealflow as a benchmark for their own portfolios.
- Sector exposure benchmarking — what % of YC's last 4 batches were AI vs your fund's exposure?
- Geographic dealflow benchmarking — YC has 12% India; your fund has 2% — is that an opportunity or risk?
- Vintage tracking — pull every YC batch, count Public + Acquired outcomes — compute YC's mortality and upside ratios by vintage
- LP letter charts — embed YC market data as the "context" appendix in quarterly LP updates
- Co-invest sourcing: identify YC Growth-stage companies for late-stage co-invest deals
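The vintage-tracking idea above (Public + Acquired outcomes per batch) reduces to a grouped count over the status field. A sketch under the assumption that "exit" means Public or Acquired and "mortality" means Inactive; `outcome_rates` is a hypothetical name:

```python
from collections import Counter, defaultdict


def outcome_rates(items):
    """Per batch, return the share of exited and inactive companies."""
    status_counts = defaultdict(Counter)
    for company in items:
        status_counts[company["batch"]][company["status"]] += 1

    rates = {}
    for batch, counts in status_counts.items():
        total = sum(counts.values())
        rates[batch] = {
            "companies": total,
            "exit_rate": (counts["Public"] + counts["Acquired"]) / total,
            "inactive_rate": counts["Inactive"] / total,
        }
    return rates


vintage = [
    {"batch": "Winter 2020", "status": "Active"},
    {"batch": "Winter 2020", "status": "Acquired"},
    {"batch": "Winter 2020", "status": "Public"},
    {"batch": "Winter 2020", "status": "Inactive"},
]
print(outcome_rates(vintage)["Winter 2020"])
```

These rates only mean something on a full-batch pull; a run capped by maxRecords would skew them.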
Sample Queries & Recipes
Recipe 1: Every AI YC startup actively hiring (recruiter goldmine)
{"query": "AI","hiringOnly": true,"statuses": ["Active"],"maxRecords": 1000}
Recipe 2: Full Winter 2024 batch — every company
{"batches": ["Winter 2024"],"maxRecords": 0}
Recipe 3: B2B SaaS ICP for sales prospecting
{"industries": ["B2B"],"stages": ["Early", "Growth"],"statuses": ["Active"],"regions": ["United States of America"],"hiringOnly": true,"maxRecords": 1000}
Recipe 4: YC's Top Companies list (Airbnb, Stripe, Coinbase, et al.)
{"topCompaniesOnly": true,"maxRecords": 0}
Recipe 5: Fintech YC alumni in India
{"industries": ["Fintech"],"regions": ["India"],"statuses": ["Active"]}
Recipe 6: Climate-tech surge across recent batches
{"query": "climate","batches": ["Winter 2024", "Summer 2024", "Spring 2024", "Fall 2024","Winter 2023", "Summer 2023"],"maxRecords": 0}
Recipe 7: Full 5,916-company catalog via batch fan-out
{"batches": ["Winter 2024", "Summer 2024", "Spring 2024", "Fall 2024","Winter 2023", "Summer 2023","Winter 2022", "Summer 2022","Winter 2021", "Summer 2021","Winter 2020", "Summer 2020","Winter 2019", "Summer 2019","Winter 2018", "Summer 2018","Winter 2017", "Summer 2017","Winter 2016", "Summer 2016","Winter 2015", "Summer 2015","Winter 2014", "Summer 2014","Winter 2013", "Summer 2013","Winter 2012", "Summer 2012","Winter 2011", "Summer 2011","Winter 2010", "Summer 2010","Winter 2009", "Summer 2009","Winter 2008", "Summer 2008","Winter 2007", "Summer 2007","Winter 2006", "Summer 2006","Summer 2005","IK12"],"maxRecords": 0,"hitsPerPage": 1000,"requestDelay": 300}
Integration Examples
Google Sheets (via Apify Integration)
- Set up an Apify schedule running this actor weekly at 7:00 AM Monday
- Add the "Export to Google Sheets" integration to the schedule
- Receive a fresh YC company directory in your Sheet every Monday morning
- Build pivot tables: batch x industry, stage x region, isHiring counts over time
Make.com / Zapier / n8n
Use the Apify connector on Make, Zapier, or n8n. Trigger downstream workflows on:
- New companies (this week's run minus last week's = newly-added YC companies)
- Stage transitions (Seed → Early = recent fundraise signal; fire a Slack alert)
- isHiring flips to true (new hiring season; push to your recruiter Slack)
- New launches (launchedAt is within the last 7 days; push to your Twitter scheduler)
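The first three triggers above boil down to a diff between two runs' datasets, keyed on companyId. A sketch of that comparison (`diff_snapshots` and the event tuple shapes are hypothetical; how the events reach Slack or another tool is left to the automation platform):

```python
def diff_snapshots(previous, current):
    """Compare two lists of records and return a list of change events."""
    previous_by_id = {c["companyId"]: c for c in previous}
    events = []
    for company in current:
        old = previous_by_id.get(company["companyId"])
        if old is None:
            events.append(("new_company", company["name"]))
            continue
        if old.get("stage") != company.get("stage"):
            events.append(("stage_change", company["name"],
                           old.get("stage"), company.get("stage")))
        if not old.get("isHiring") and company.get("isHiring"):
            events.append(("started_hiring", company["name"]))
    return events


last_week = [
    {"companyId": 1, "name": "Alpha", "stage": "Seed", "isHiring": False},
    {"companyId": 2, "name": "Beta", "stage": "Early", "isHiring": True},
]
this_week = [
    {"companyId": 1, "name": "Alpha", "stage": "Early", "isHiring": True},
    {"companyId": 2, "name": "Beta", "stage": "Early", "isHiring": True},
    {"companyId": 3, "name": "Gamma", "stage": "Seed", "isHiring": False},
]
for event in diff_snapshots(last_week, this_week):
    print(event)
```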
Power BI / Tableau / Looker
Connect Apify's REST API as a data source. Refresh on the Apify schedule. Build dashboards covering:
- YC batch size evolution over 18 years
- Industry distribution per batch (the AI surge visualized)
- Geographic dealflow heatmaps
- Top Companies progression — who graduated to topCompany in the last quarter?
Postgres / Snowflake / BigQuery
Use the Apify webhook integration to POST run results directly to a data warehouse ingestion endpoint after every scheduled run. Suggested schema:
CREATE TABLE yc_companies (
    company_id BIGINT PRIMARY KEY,
    name TEXT,
    slug TEXT,
    yc_profile_url TEXT,
    website TEXT,
    one_liner TEXT,
    long_description TEXT,
    logo_url TEXT,
    location TEXT,
    regions JSONB,
    batch TEXT,
    status TEXT,
    stage TEXT,
    team_size INTEGER,
    industry TEXT,
    subindustry TEXT,
    industries JSONB,
    tags JSONB,
    top_company BOOLEAN,
    is_hiring BOOLEAN,
    nonprofit BOOLEAN,
    former_names JSONB,
    launched_at TIMESTAMPTZ,
    launched_at_unix BIGINT,
    scraped_at TIMESTAMPTZ
);

CREATE INDEX idx_yc_batch ON yc_companies(batch);
CREATE INDEX idx_yc_industry ON yc_companies(industry);
CREATE INDEX idx_yc_is_hiring ON yc_companies(is_hiring) WHERE is_hiring = TRUE;
Salesforce / HubSpot CRM Enrichment
Trigger an Apify run weekly, then upsert against Account records keyed on website or companyId. Stage transitions can auto-create Tasks; new Top Company designations can trigger Opportunity stage changes.
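When upserting on website rather than companyId, the URL usually needs reducing to a bare domain first so that "https://www.example.com/" and "example.com" match the same Account. A minimal sketch of such a key function; `domain_key` is a hypothetical helper and the matching rule is an assumption about how your CRM stores domains:

```python
from urllib.parse import urlparse


def domain_key(website):
    """Reduce a website URL to a lowercase host without a leading 'www.'."""
    if not website:
        return None
    url = website.strip()
    if "//" not in url:
        url = "//" + url  # lets urlparse treat a bare domain as the host
    host = urlparse(url).hostname
    if not host:
        return None
    return host.removeprefix("www.")


print(domain_key("https://www.and.ai"))  # and.ai
print(domain_key("Airbyte.com/pricing"))  # airbyte.com
```

Where both keys are available, companyId is the safer match, since the schema describes it as stable while websites change on rebrands.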
Webhooks → Slack / Discord
Pipe the actor's defaultDataset through an Apify webhook into your Slack channel. Recruiters get a daily "Today's newly-hiring YC companies" post. Sales gets a weekly "New YC fintech ICP additions" digest.
Major Markets & Regional Coverage
YC's portfolio is global. Below is a rough distribution of YC companies by region with significance notes.
| Region | YC Presence | Significance |
|---|---|---|
| United States of America | ~3,800 companies | The core — San Francisco, NYC, LA, Boston, Seattle, Austin, Miami |
| Europe | ~600 companies | London, Berlin, Paris, Amsterdam, Stockholm, Madrid, Lisbon |
| India | ~500 companies | Bengaluru, Mumbai, Delhi NCR, Hyderabad — fast-growing YC region |
| Latin America | ~400 companies | São Paulo, Mexico City, Buenos Aires, Bogotá, Santiago |
| Canada | ~200 companies | Toronto, Vancouver, Montreal, Waterloo |
| Asia (ex-India) | ~250 companies | Singapore, Tokyo, Seoul, Jakarta, Manila |
| Africa | ~120 companies | Lagos, Nairobi, Cape Town, Cairo |
| Australia / New Zealand | ~100 companies | Sydney, Melbourne, Auckland |
| Middle East | ~60 companies | Dubai, Tel Aviv, Riyadh |
| Remote-first | grows every batch | Distributed teams, no HQ |
Tip: Combine the regions filter with the H1B Visa Database to surface US-based YC employers who actively sponsor H-1Bs: pure gold for international recruiter outreach.
Cost & Performance
| Metric | Value |
|---|---|
| Engine | Direct Algolia REST API (got-scraping HTTP) — no browser |
| Runtime (50 records, simple query) | ~5 seconds |
| Runtime (1,000 records, single filter combination) | ~10 seconds |
| Runtime (full ~5,916-company catalog via batch fan-out) | ~5 minutes |
| Cost per default run | ~0.001 Compute Units (typically less than $0.01) |
| Cost per full-catalog run | ~0.01 CU (typically less than $0.05) |
| Pricing model | Pay-per-event (transparent per-record pricing) |
| Data freshness | Live at runtime — YC's Algolia index is continuously refreshed |
| Auth required | None — uses YC's public ycdc_public Algolia key |
| Proxy required | None — Algolia public endpoint has no anti-bot |
| Concurrency | Safe to run multiple parallel filtered configurations |
| Memory footprint | 256 MB minimum, 1024 MB max — no scraping browser, low RAM |
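The table above quotes compute cost. For budgeting the per-record side of pay-per-event pricing, a rough estimate can be computed as below, assuming the store listing's "from $1.00 / 1,000 results" rate applies to every record. The actual event prices may differ, so treat the default as a placeholder.

```python
def estimate_result_cost_usd(records: int, price_per_1000: float = 1.00) -> float:
    """Estimate per-result charges in USD at a flat price per 1,000 records."""
    return round(records / 1000 * price_per_1000, 2)


full_catalog = estimate_result_cost_usd(5916)  # full directory
default_run = estimate_result_cost_usd(500)    # default 500-record run
```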
Compliance, Privacy & Legal Notes
- Public data only: every field returned by this actor is published by Y Combinator at ycombinator.com/companies under their public ycdc_public Algolia tag, which is the same tag YC uses to expose data to their own client-side search UI
- No PII / no personal data: the dataset describes companies, not individuals. Founder names are not in this dataset. (For founder-level enrichment, consume YC's company page separately.)
- No emails, no phone numbers: the actor does not return any contact information
- Respectful of YC's infrastructure: the actor uses low concurrency (1 in-flight Algolia call), configurable requestDelay (default 300ms), and 3-attempt exponential backoff. It is explicitly built to be a polite citizen.
- YC's robots.txt does not block /companies and the underlying Algolia endpoint is unauthenticated and intentionally public
- Algolia ToS: the ycdc_public secured key is issued by YC for client-side use; it self-restricts to tagFilters=["ycdc_public"] and restrictIndices=YCCompany_production,YCCompany_By_Launch_Date_production
- GDPR / CCPA: this dataset contains no EU or California resident personal data; company-level facts are not personal data under either regulation
- No commercial guarantees: fields, schemas, and Algolia keys are controlled by YC and may change without notice; the actor's normalization layer is built to handle most schema drift gracefully
Important: Use of this dataset for unsolicited bulk communications must comply with CAN-SPAM, TCPA, GDPR, CCPA, and the YC website ToS. The actor publisher is not responsible for downstream misuse.
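For readers building their own client, the polite-retry behavior described in the notes above (3 attempts, exponential backoff, 300ms base delay) can be sketched as follows. This is an illustration of the general technique, not the actor's source code; the doubling factor and the function name are assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying failures after delays of base_delay * 2**attempt."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

The sleep parameter is injectable so the schedule can be tested without waiting; in production the default time.sleep applies.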
Frequently Asked Questions
How fresh is the data?
YC updates its Algolia index continuously — new companies appear within hours of being announced. The actor hits Algolia live on every run, so the data is as fresh as YC publishes it.
How many companies will I get?
As of the current snapshot, 5,916 companies are in YC's directory. A default run with no filters returns 500 records (the per-run cap). To pull the full catalog, list every batch in the batches input (~50 batches since IK12). Each batch becomes its own query, so no single query runs into Algolia's 1,000-results-per-query cap.
Why does Algolia cap a single query at 1,000 results?
Algolia caps how deep a single query can paginate: the paginationLimitedTo index setting defaults to 1,000 hits, which also limits bulk extraction through a public key. The actor works around this by fanning out across batches: each batch holds far fewer than 1,000 companies, so each batch query returns the full batch.
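The fan-out idea can be sketched independently of Algolia. In the example below, search is a hypothetical callable that returns all hits for one batch (in practice, a wrapper around the Algolia query); results are merged and deduplicated on companyId.

```python
from typing import Callable, Iterable


def fan_out_by_batch(
    batches: Iterable[str],
    search: Callable[[str], list[dict]],
    cap: int = 1000,
) -> list[dict]:
    """Run one query per batch and merge the hits, deduplicating on companyId."""
    seen: set = set()
    merged: list[dict] = []
    for batch in batches:
        hits = search(batch)
        if len(hits) >= cap:
            # A batch at the cap may be truncated and needs a narrower filter.
            raise RuntimeError(f"{batch!r} hit the {cap}-result cap")
        for hit in hits:
            company_id = hit["companyId"]
            if company_id not in seen:
                seen.add(company_id)
                merged.append(hit)
    return merged
```

Raising when a batch reaches the cap is a deliberate choice in this sketch: a silently truncated batch would produce an incomplete catalog that looks complete.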
Does this scraper require login or API keys to YC?
No. The actor uses YC's own public Algolia key — the same one your browser uses when you visit ycombinator.com/companies. You only need an Apify account to run the actor.
Does this scrape ycombinator.com HTML?
No. The actor talks directly to the Algolia search REST API that powers the YC site. This is faster, more reliable, and respectful of YC's web servers (zero impact on ycombinator.com).
Does the actor return founder names or emails?
No. Founder information is not in YC's Algolia index, only company-level metadata. To enrich the records, combine them with sibling actors such as SEEK (job postings) or Levels.fyi (compensation data).
Are inactive / shut-down YC companies included?
Yes. Set statuses: ["Inactive"] to filter to wound-up companies, or leave statuses empty to get every status (Active, Acquired, Public, Inactive).
Can I filter by year of YC participation?
Yes — use the batches filter with cohort names like "Winter 2024", "Summer 2023", etc. To get all of 2024, list ["Winter 2024", "Summer 2024", "Spring 2024", "Fall 2024"].
What's the difference between industry, subindustry, and industries?
- industry is the top-level YC category (e.g., "B2B")
- subindustry is the more granular vertical with hierarchy syntax (e.g., "B2B -> Sales")
- industries is the array of all industry tags the company carries, often the most useful for filtering
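If you need the hierarchy levels as separate values (for example, to group by the second level), the "->" syntax can be split with a one-line helper. The function name is hypothetical, and the sketch assumes the separator is always "->" as in the example above.

```python
def split_subindustry(subindustry: str) -> list[str]:
    """Split a value like 'B2B -> Sales' into its levels: ['B2B', 'Sales']."""
    return [part.strip() for part in subindustry.split("->") if part.strip()]
```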
Can I get the YC Top Companies list?
Yes — set topCompaniesOnly: true. This returns YC's curated list of unicorns and best exits (Airbnb, Stripe, Coinbase, DoorDash, Reddit, Twitch, Instacart, Brex, Rappi, GitLab, Faire, et al.).
Does the actor deduplicate?
Yes. Within a single run the actor dedups by Algolia companyId, so even when multiple batch fan-out queries surface the same company (rare, but possible if companies span batches), only one record is saved.
Is proxy or residential IP required?
No. The Algolia public endpoint has no anti-bot or rate-limiting on the ycdc_public tag. You can attach Apify Proxy via proxyConfiguration if your network policy requires it, but it's pure overhead.
How do I pull the entire 5,916-company catalog?
Use Recipe 7 above: list every batch from the most recent cohort back through "Summer 2005", plus "IK12". The fan-out completes in ~5 minutes.
Can I run this on a schedule automatically?
Yes — Apify's built-in Scheduler lets you trigger this actor on any cron expression. Weekly or daily runs work well for change-detection workflows. Combine with webhooks for fully automated pipelines.
Does this actor work with the Apify Free Plan?
Yes — full functionality on the free tier. A default 500-record run costs a fraction of a Compute Unit. The full 5,916-company fan-out still fits within the free monthly CU budget.
What formats can I export the data in?
JSON, CSV, Excel (XLSX), HTML, XML, JSONLines, and RSS — directly from the Apify dataset view. The API also supports streaming for large datasets.
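For scripted exports, the Apify API serves dataset items in a chosen format through the format query parameter. The helper below only builds the URL; the dataset ID is a placeholder, and a private dataset additionally needs an API token, which this sketch leaves out.

```python
from urllib.parse import urlencode

SUPPORTED_FORMATS = {"json", "jsonl", "csv", "xlsx", "html", "xml", "rss"}


def dataset_items_url(dataset_id: str, fmt: str = "json") -> str:
    """Build the Apify API URL that returns a dataset's items in fmt."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    query = urlencode({"format": fmt})
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?{query}"
```

For example, dataset_items_url("YOUR_DATASET_ID", "csv") yields a link that can be handed to a spreadsheet import or a download job.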
What happens if Algolia returns zero results?
The actor explicitly calls Actor.fail() with a helpful error message ("No records. Try clearing all filters, or check that batch/industry/status spellings match YC's exactly (case-sensitive).") so you never get a silent SUCCEEDED with an empty dataset.
Is the data accurate?
The data is exactly what YC publishes on their own directory — same source, same recency. If a company's status or stage looks wrong, that's YC's directory; the actor does not modify or filter beyond what you request.
How do I report a bug or request a feature?
Open an issue on the Apify Store actor page or contact the developer directly through the Apify Console.
Related Apify Actors by Haketa
Whether you're enriching YC company data with hiring intel, comp data, federal funding, or B2B verification, these sibling actors are designed to compose:
- H1B Visa Database — US Visa Sponsorship Scraper — perfect complement: surface YC employers actively sponsoring H-1Bs for international recruiting
- Levels.fyi Scraper — tech compensation data for YC-backed startups — invaluable for VC, recruiter, and candidate research
- SEEK Scraper (Australia / NZ): job postings; pair with YC isHiring: true data for regional recruiter intel
- ProductHunt Launches & Makers Scraper: daily startup launches, makers, votes & reviews; VC/founder/recruiter intel
- BBB Business Scraper — Better Business Bureau ratings — verify post-IPO YC alumni reputation
- SAM.gov Federal Contractor Entity Scraper — federal funding peer dataset — see which YC alumni are also federal contractors
- TTB Alcohol Permittee Scraper — federal licensing peer — useful for YC consumer / alcohol vertical research
- Salary.com Scraper — salary benchmarks for YC alumni job postings
- Texas Pharmacy License Scraper — TSBP — healthcare licensing peer dataset for YC healthtech research
- California DCA Professional License Scraper — CA professional licensing — useful for YC regulated-industry research
- Ohio eLicense Scraper — Ohio professional licensing — sibling regulatory dataset
- Illinois IDFPR License Scraper — Illinois professional licensing — sibling regulatory dataset
Comparison vs. Alternatives
| Approach | Setup time | Coverage | Data freshness | Cost (5,916 records) | Schema normalization | Proxy needed |
|---|---|---|---|---|---|---|
| This actor | < 1 minute | 5,916+ companies | Live at runtime | < $0.05 compute (per-result event charges are billed separately) | Built-in | No |
| Manual ycombinator.com browsing | Hours/days | Limited by attention span | Live | Free | None | No |
| Headless browser scrape (Puppeteer) | 1-2 days dev | Full | Live | $1-5 per run (CU cost) | DIY | Optional |
| Custom Algolia client | 4-8 hours dev | Full (if you handle 1000-cap) | Live | Free + infra | DIY | No |
| Paid startup database (PitchBook, CB Insights) | Days to onboard | Vast (not just YC) | Daily | $1,000-50,000/year | Built-in | N/A |
| LinkedIn Sales Navigator | Hours | YC alumni only inferred | Live | $99-149/seat/mo | None | N/A |
Why Pay-Per-Event Pricing?
Most data scrapers charge either a flat monthly subscription (you pay even if you don't use it) or per Compute Unit (unpredictable). This actor uses pay-per-event pricing, which means:
- You only pay when the actor runs
- Charges scale with how much data you actually consume
- Transparent, line-item billing inside Apify
- No monthly minimums or annual commitments
- Free to evaluate — sample 50 records for pennies before committing to a full catalog pull
- Predictable cost-per-record — easy to forecast scrape budget for procurement
Changelog
| Version | Date | Notes |
|---|---|---|
| 1.0.0 | 2026-05 | Initial public release — direct Algolia API integration, batch fan-out for >1,000 results, 3-attempt retry with exponential backoff, Actor.fail() on zero records, full ISO-8601 timestamp normalization, pay-per-event pricing |
Keywords
Y Combinator scraper · YC companies database · YC startup directory scraper · Y Combinator API alternative · ycombinator.com/companies scraper · YC batch directory · startup directory scraper · funded startups scraper · B2B SaaS prospecting · VC portfolio scraper · YC alumni directory · Algolia startup search · YC Algolia API · YC Winter 2024 directory · YC Summer 2024 batch scraper · YC Top Companies list scraper · YC unicorn list · YC hiring scraper · YC isHiring filter · YC company API · funded startup lead generation · ICP list builder YC · YC fintech startups · YC AI startups · YC dev tools directory · YC healthcare startups · YC India directory · YC Latin America directory · YC growth-stage companies · YC seed-stage prospects · YC acquisition target list · YC competitor mapping · YC recruiter intelligence · YC executive search · YC ABM list · YC LinkedIn audience seed · YC investor research · startup database API · startup intelligence platform · founder outreach data · YC alumni hiring · YC batch trends · post-funding ICP scraper · YC company logo URLs · Apify YC actor · Haketa YC scraper
Support
- Bug reports: Use the Issues tab on the Apify Store page
- Feature requests: Same place — please describe the use case and the input combination you'd like to see supported
- Direct contact: Through the Apify developer profile haketa
If this actor saves you time or unlocks a new workflow, a 5-star rating on the Apify Store helps other sales, recruiting, VC, and research teams discover it. Thank you!