Y Combinator [Only $1💰] Jobs & Companies scraper

Pricing

from $1.00 / 1,000 results

💰 $1/1K. One actor for Y Combinator jobs (Work at a Startup) and companies (Startup Directory). Paste any YC URL (auto-routed) or use filters. Companies via Algolia: no proxy, clean schema. Optional founder enrichment: LinkedIn/Twitter URLs, company socials, open jobs. Full batch history to 2005.

Developer

Muhamed Didovic

Maintained by Community

Y Combinator Scraper

Y Combinator data, structured. Jobs, companies, founders, socials: one actor, no proxy needed for companies.

Scrape both surfaces of ycombinator.com from a single Apify actor: jobs (Work at a Startup) and companies (Startup Directory). Auto-routes any YC URL to the right scraper, or compose a query from filters.

How it works

How Y Combinator Scraper Works


✨ Why use this scraper?

  • Two surfaces, one actor. Jobs (Work at a Startup) and Companies (Startup Directory) in the same dataset, distinguishable by row shape (jobId vs slug).
  • Companies fetch via YC's public Algolia index: no HTML parsing, no proxy required, ~2 s for 100 rows. Clean, structured fields with no scraping artefacts (no leaked alt text, no concatenated values).
  • Rich output schemas. ~33 fields per job (salary parsed into min/max/currency, equity, founders with bios, JSON-LD datePosted). ~26 fields per company (batch, industries, regions, team size, stage, status, top-company / hiring / nonprofit flags).
  • Optional founder enrichment with proper name / title separation (e.g. Brian Chesky / Founder/CEO, not split-on-whitespace), plus LinkedIn + Twitter URLs per founder and company-level socials (linkedin, twitter, facebook, crunchbase, github).
  • Optional open-jobs enrichment per company, with cleanly separated title / salary / location / equity / experience fields.
  • Multi-keyword company discovery: pass several keywords; each runs as a separate Algolia search and results merge by company id with dedupe.
  • Full YC batch history back to Summer 2005.
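
The multi-keyword merge can be pictured as a union keyed on the company id. The following is a minimal sketch, not the actor's code: it assumes each keyword's results arrive as an array of hits carrying an `objectID`, and `CompanyHit` and `mergeHits` are hypothetical names introduced for illustration.

```typescript
// Hypothetical hit shape: only the fields needed for the dedupe are modelled.
interface CompanyHit {
  objectID: string;
  name: string;
}

// Union several per-keyword result lists, keeping the first occurrence of each
// company and stopping once `maxItems` rows have been collected.
function mergeHits(perKeyword: CompanyHit[][], maxItems: number): CompanyHit[] {
  const seen = new Set<string>();
  const merged: CompanyHit[] = [];
  for (const hits of perKeyword) {
    for (const hit of hits) {
      if (merged.length >= maxItems) return merged;
      if (seen.has(hit.objectID)) continue; // already returned by an earlier keyword
      seen.add(hit.objectID);
      merged.push(hit);
    }
  }
  return merged;
}

const devTools: CompanyHit[] = [
  { objectID: "1", name: "Acme" },
  { objectID: "2", name: "Beta" },
];
const observability: CompanyHit[] = [
  { objectID: "2", name: "Beta" },
  { objectID: "3", name: "Gamma" },
];
console.log(mergeHits([devTools, observability], 10).map((hit) => hit.name));
```

Because the cap is checked inside the loop, a small `maxItems` can be used up by the first keyword alone, so later keywords contribute nothing.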

Overview

Built for recruiters, sourcers, BD/sales teams, investors, and anyone doing market research on YC-backed startups. The actor produces a heterogeneous dataset: each row is either a job posting or a company profile. You can scrape jobs and companies in the same run (mix URLs of both kinds) and tell them apart in downstream tooling by the presence of jobId (jobs) vs slug (companies).

Companies-mode goes through YC's Algolia search API: fast, no proxy, no HTML parsing. Jobs-mode uses Crawlee + Cheerio against YC's server-rendered job listing/detail pages.


Supported inputs

Jobs URLs

| Pattern | What it does |
| --- | --- |
| /jobs | YC's curated jobs index (~20 jobs) |
| /jobs/role/{role} | All jobs in a role (software-engineer, designer, product-manager, operations, marketing, sales-manager, recruiting-hr, support, science) |
| /jobs/role/{role}/{location} | Role + location (san-francisco, new-york, los-angeles, seattle, austin, chicago, india, remote); location is applied locally because YC filters it client-side |
| /jobs/location/{location} | Location-only listing |
| /companies/{co}/jobs/{job} | A single job-detail page |

Companies URLs

| Pattern | What it does |
| --- | --- |
| /companies | All companies (paginated through Algolia) |
| /companies?batch=…&industry=…&query=…&isHiring=true&top_company=true&minEmployeeSize=10%2B&maxEmployeeSize=100 | Companies search with any combination of filters |
| /companies/{slug} | Single-company lookup (e.g. …/companies/airbnb) |

Filter form (when no URLs)

mode = jobs (default) or companies. Then the matching filter set:

  • 💼 Jobs: role, location.
  • 🏢 Companies: queries[], topCompany, isHiring, nonprofit, batch[], industries[], regions[], minEmployeeSize, maxEmployeeSize.

🎯 Use cases

| Team | Typical use |
| --- | --- |
| Recruiters / talent sourcing | Pull active YC job postings filtered by role + city, watch for new postings in monitoringMode |
| Investors / VCs | Track company batches, stages, and team sizes across every YC cohort back to 2005 |
| BD / sales | Build a target list of YC companies by industry, region, employee size; enrich with founder LinkedIns for outreach |
| Founder / market research | Discover companies by keyword across the full directory; find similar / competitive companies to map a market |
| Data engineering / pipelines | Schedule daily runs into a warehouse for YC startup intelligence; monitoringMode keeps datasets incremental |

  1. You provide YC URLs (Option A) or filters (Option B). Mix Jobs and Companies URLs freely; the actor routes each one.
  2. Jobs URLs hit a Crawlee CheerioCrawler. The listing's inlined jobPostings JSON is parsed; the location slug from the URL is applied as a local substring filter against each job's location string (YC's listings filter location client-side, so we compensate).
  3. Companies URLs hit YC's public Algolia index (YCCompany_production). All Algolia filters compose with AND across attributes / OR within attribute. Multi-keyword runs union the results and dedupe by objectID.
  4. Optional company enrichment. When scrapeFounderDetails or scrapeOpenJobs is on, the actor fetches /companies/{slug} and/or /companies/{slug}/jobs per row and merges the parsed founders/socials/openJobs into the company row. Concurrency = 5.
  5. Output to dataset. Job rows and company rows go into the same dataset; jobs are also exported as data.csv / data.json.
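
The routing in step 1 can be sketched as a classification on the URL path, following the patterns listed under "Supported inputs". This is an illustrative sketch only: `Route` and `classifyYcUrl` are hypothetical names, and the actor's real dispatch logic may differ.

```typescript
type Route = "jobs-listing" | "job-detail" | "companies-search" | "company-single" | "unknown";

// Classify a YC URL by its path segments; query parameters do not affect routing here.
function classifyYcUrl(raw: string): Route {
  const parts = new URL(raw).pathname.split("/").filter(Boolean);
  if (parts[0] === "jobs") return "jobs-listing";
  if (parts[0] === "companies") {
    if (parts.length === 1) return "companies-search"; // /companies or /companies?batch=...
    if (parts.length === 2) return "company-single"; // /companies/{slug}
    if (parts.length === 4 && parts[2] === "jobs") return "job-detail"; // /companies/{co}/jobs/{job}
  }
  return "unknown";
}

const startUrls = [
  "https://www.ycombinator.com/jobs/role/software-engineer/san-francisco",
  "https://www.ycombinator.com/companies?batch=Spring%202026&industry=Healthcare",
  "https://www.ycombinator.com/companies/airbnb",
];
for (const url of startUrls) {
  console.log(classifyYcUrl(url), url);
}
```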

Banner source is readme-stuff/how-it-works-yc-v1.svg; edit the SVG and re-rasterize when you change the copy. A 2× retina version is at readme-stuff/how-it-works-yc-v1@2x.png (this is the version hosted at the URL above). The hosted PNG is served from a GitHub Pages repo so it renders in both GitHub and the Apify Console.


Quick start

// Recent SF software-engineering jobs
{ "mode": "jobs", "role": "software-engineer", "location": "san-francisco", "maxItems": 50 }

// Top hiring B2B companies from the last two batches, 100-500 employees
{
  "mode": "companies", "batch": ["Spring 2026", "Winter 2026"], "industries": ["B2B"],
  "minEmployeeSize": "100+", "maxEmployeeSize": "500", "isHiring": true, "topCompany": true,
  "maxItems": 100
}

// Multi-keyword company discovery (results merged + deduped by company id)
{ "mode": "companies", "queries": ["dev tools", "observability", "feature flags"], "maxItems": 50 }

// Companies + founder/social/jobs enrichment (one extra HTTP request per company per toggle)
{
  "mode": "companies", "batch": ["Spring 2026"], "isHiring": true,
  "scrapeFounderDetails": true, "scrapeOpenJobs": true, "maxItems": 25
}

// Or skip the form and paste any YC URL; the actor auto-routes
{
  "startUrls": [
    "https://www.ycombinator.com/jobs/role/software-engineer/san-francisco",
    "https://www.ycombinator.com/companies?batch=Spring%202026&industry=Healthcare",
    "https://www.ycombinator.com/companies/airbnb"
  ],
  "maxItems": 50
}

Input configuration

The form has three top-level sections. Option A and Option B are alternatives, not steps; Run options apply regardless of which one you pick.

  • Option A | Search by URL 🔗 (Jobs or Companies, recommended): startUrls[]. If non-empty, Option B is ignored.
  • Option B | Configure with Filters 🎛️: used only when Option A is empty. Field titles inside the panel are prefixed by category:
    • Mode: "jobs" (default) or "companies".
    • 💼 Jobs · …: role, location.
    • 🏢 Companies · …: queries[], topCompany, isHiring, nonprofit, batch[], industries[], regions[], minEmployeeSize, maxEmployeeSize.
    • 🙌 Companies Enrich · …: scrapeFounderDetails, scrapeOpenJobs (each adds one HTTP request per company; concurrency 5).
  • Run options ⚙️ (applies to both): shared run-time settings:
    • maxItems: output cap (max records to return, default 100). Applies to both modes.
    • maxPages: pagination depth (max listing pages per URL, default 10). Jobs scraper only; Companies paginates Algolia automatically.
    • monitoringMode, maxConcurrency, minConcurrency, maxRequestRetries, proxy: Jobs scraper only. Companies hits Algolia directly and ignores them.

Multi-select filters (batch[], industries[], regions[]) use AND across attributes, OR within. The "All …" sentinel values (default) are stripped before the Algolia query, so leaving them selected means no filter.
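
One way to express "AND across attributes, OR within" is Algolia's nested `facetFilters` form, where each inner array is OR-ed and the inner arrays are AND-ed together. The sketch below is an assumption about how that composition could look: `CompanyFilters` and `buildFacetFilters` are hypothetical, the facet names simply reuse the input field names (the real index may name them differently), and the sentinel check assumes the default values start with "All ".

```typescript
// Hypothetical subset of the actor input.
interface CompanyFilters {
  batch?: string[];
  industries?: string[];
  regions?: string[];
}

// Build a nested facetFilters value: OR inside each group, AND between groups.
function buildFacetFilters(filters: CompanyFilters): string[][] {
  const attributes: (keyof CompanyFilters)[] = ["batch", "industries", "regions"];
  const groups: string[][] = [];
  for (const attribute of attributes) {
    // Assumption: sentinel values look like "All ..." and mean "no filter".
    const active = (filters[attribute] ?? []).filter((value) => !value.startsWith("All "));
    if (active.length > 0) {
      groups.push(active.map((value) => `${attribute}:${value}`));
    }
  }
  return groups;
}

console.log(
  JSON.stringify(
    buildFacetFilters({
      batch: ["Spring 2026", "Winter 2026"],
      industries: ["B2B"],
      regions: ["All regions"],
    }),
  ),
);
```

An attribute whose only value is a sentinel produces no group at all, which is what makes the default selection equivalent to no filter.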

Non-paying users are capped at 100 items and have monitoringMode disabled.


Note on YC's location filtering

YC's listing pages include a location segment in the URL but apply that filter client-side; the SSR'd jobPostings JSON is role-filtered only. This actor compensates by parsing the location slug out of the URL and applying a substring filter against each job's location string locally. Special cases: remote matches anything containing "Remote"; india matches the country (/\b(india|IN)\b/i). Cities use a case-insensitive substring match against the slug with - replaced by space.

A job listed as "San Francisco, CA, US / Remote (US)" matches both the san-francisco and remote slugs.
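
The matching rules above can be sketched as a small predicate. This is an illustration under the rules as described in this section, not the actor's source; `matchesLocation` is a hypothetical name, and the india pattern is copied verbatim from the text.

```typescript
// Returns true when a job's free-form location string matches a URL location slug.
function matchesLocation(slug: string, location: string): boolean {
  // Special case: "remote" matches any location string containing "Remote".
  if (slug === "remote") return location.includes("Remote");
  // Special case: "india" matches the country name or the code IN as a whole word.
  if (slug === "india") return /\b(india|IN)\b/i.test(location);
  // Cities: case-insensitive substring match on the slug with "-" replaced by a space.
  const needle = slug.replace(/-/g, " ").toLowerCase();
  return location.toLowerCase().includes(needle);
}

const sample = "San Francisco, CA, US / Remote (US)";
console.log(matchesLocation("san-francisco", sample)); // true
console.log(matchesLocation("remote", sample)); // true
console.log(matchesLocation("new-york", sample)); // false
```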


Output overview

Heterogeneous dataset: job rows and company rows have different shapes but share the same dataset. Distinguish by:

  • presence of jobId (jobs) vs slug (companies)
  • or the url path (/companies/{co}/jobs/{job} vs /companies/{slug}).
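
In downstream TypeScript code, the jobId-vs-slug distinction maps naturally onto a type guard. A minimal sketch, assuming rows have been loaded as plain objects; the `JobRow` and `CompanyRow` interfaces are trimmed-down, hypothetical versions of the real schemas.

```typescript
// Trimmed-down row shapes; the real rows carry many more fields.
interface JobRow {
  jobId: string;
  title: string;
  url: string;
}
interface CompanyRow {
  slug: string;
  name: string;
  url: string;
}
type Row = JobRow | CompanyRow;

// Job rows are identified by the presence of jobId.
function isJobRow(row: Row): row is JobRow {
  return "jobId" in row;
}

// Split a mixed dataset into its two row kinds.
function splitRows(rows: Row[]): { jobs: JobRow[]; companies: CompanyRow[] } {
  const jobs: JobRow[] = [];
  const companies: CompanyRow[] = [];
  for (const row of rows) {
    if (isJobRow(row)) jobs.push(row);
    else companies.push(row);
  }
  return { jobs, companies };
}

const { jobs, companies } = splitRows([
  { jobId: "gD334As-systems-engineer", title: "Systems Engineer", url: "https://www.ycombinator.com/companies/substack/jobs/gD334As-systems-engineer" },
  { slug: "airbnb", name: "Airbnb", url: "https://www.ycombinator.com/companies/airbnb" },
]);
console.log(jobs.length, companies.length); // 1 1
```

Checking for jobId rather than slug avoids ambiguity, since job rows carry a companySlug field but no top-level slug.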

Output samples

Job row (truncated for display)

Captured with startUrls=["https://www.ycombinator.com/jobs/role/software-engineer/san-francisco"], maxItems=1:

{
  "jobId": "gD334As-systems-engineer",
  "title": "Systems Engineer",
  "url": "https://www.ycombinator.com/companies/substack/jobs/gD334As-systems-engineer",
  "companyName": "Substack",
  "companySlug": "substack",
  "companyUrl": "https://www.ycombinator.com/companies/substack",
  "ycBatch": "W18",
  "jobType": "Full-time",
  "roleCategory": "Engineering",
  "roleSubcategory": "Devops",
  "salaryRange": "$185K - $225K",
  "salaryMin": 185000,
  "salaryMax": 225000,
  "salaryCurrency": "USD",
  "equity": "",
  "location": "San Francisco, CA, US / New York, NY, US",
  "postedAgo": "2 days",
  "experience": "6+ years",
  "visaSponsorship": "",
  "description": "Substack is building a new economic engine for culture …",
  "descriptionHtml": "<p>Substack is building …</p>",
  "companyDescription": "Start a newsletter. Build your community. …",
  "companyFounded": 2017,
  "companyTeamSize": 90,
  "companyStatus": "Active",
  "companyLocation": "San Francisco",
  "founders": [
    { "name": "Chris Best", "role": "Co-founder & CEO at Substack…" },
    { "name": "Hamish McKenzie", "role": "COO" },
    { "name": "Jairaj Sethi", "role": "👋🏽" }
  ],
  "datePosted": "2025-09-11T18:28:34Z",
  "companyWebsite": "https://substack.com",
  "scrapedAt": "2026-05-02T10:49:39.558Z"
}

Company row, with both enrichment toggles on (truncated)

Captured with startUrls=["https://www.ycombinator.com/companies/airbnb"], scrapeFounderDetails=true, scrapeOpenJobs=true:

{
"id": 271,
"slug": "airbnb",
"name": "Airbnb",
"url": "https://www.ycombinator.com/companies/airbnb",
"batch": "Winter 2009",
"industry": "Consumer",
"subindustry": "Consumer -> Travel, Leisure and Tourism",
"industries": ["Consumer", "Travel, Leisure and Tourism"],
"regions": ["United States of America", "America / Canada"],
"allLocations": "San Francisco, CA, USA",
"oneLiner": "Book accommodations around the world.",
"teamSize": 6132,
"status": "Public",
"stage": "Growth",
"topCompany": true,
"isHiring": false,
"nonprofit": false,
"launchedAt": "2012-01-17T09:00:56.000Z",
"website": "http://airbnb.com",
"tags": ["Marketplace", "Travel"],
"formerNames": [],
"appVideoPublic": false,
"demoDayVideoPublic": false,
"founders": [
{
"name": "Brian Chesky", "title": "Founder/CEO",
"linkedinUrl": "https://www.linkedin.com/in/brianchesky/",
"twitterUrl": "https://twitter.com/bchesky",
"isActive": true, "hasEmail": true
},
{ "name": "Nathan Blecharczyk", "title": "Founder/CTO", "linkedinUrl": "…", "twitterUrl": "…" },
{ "name": "Joe Gebbia", "title": "Founder/CPO", "linkedinUrl": "…", "twitterUrl": "…" }
],
"socials": {
"linkedin": "https://www.linkedin.com/company/airbnb/",
"twitter": "https://twitter.com/Airbnb",
"facebook": "https://www.facebook.com/airbnb/",
"crunchbase": "https://www.crunchbase.com/organization/airbnb"
},
"openJobs": [],
"scrapedAt": "2026-05-02T10:49:25.606Z"
}

Output fields

Jobs row (~33 fields)

| Field | Description |
| --- | --- |
| jobId | YC's job id, e.g. gD334As-systems-engineer. |
| title, url | Posting title and absolute URL (https://www.ycombinator.com/companies/{co}/jobs/{id}). |
| companyName, companySlug, companyUrl, companyTagline | Company display name, slug, profile URL, one-liner. |
| ycBatch | YC batch (e.g. S21, W18). |
| jobType, roleCategory, roleSubcategory | E.g. Full-time, Engineering, Devops. |
| salaryRange | Raw display string from the listing ("$185K - $225K"). |
| salaryMin, salaryMax, salaryCurrency | Parsed numeric range and currency (USD/GBP/EUR/INR). |
| equity | Equity range string from YC. |
| location | Job location string from YC. |
| postedAgo | Relative time from listing (e.g. "2 days"). |
| applyUrl | Direct apply link (the YC OAuth bridge to workatastartup.com). |
| experience | Min experience requirement. |
| visaSponsorship | "Will sponsor" if offered, else empty. |
| description, descriptionHtml | Full description as text and as HTML (preserves newlines). |
| companyDescription, companyFounded, companyTeamSize, companyStatus, companyLocation | Company metadata. |
| founders | Array of { name, role }; role falls back through founder_bio → title → "Founder". |
| datePosted | ISO datetime from the embedded JobPosting JSON-LD. |
| companyWebsite, companyLogo | External website and YC small-logo URL. |
| scrapedAt | ISO timestamp of when the row was produced. |
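
To illustrate how salaryRange could relate to salaryMin, salaryMax and salaryCurrency, here is a minimal parsing sketch. It is not the actor's parser: `parseSalaryRange` is a hypothetical helper, the symbol-to-currency mapping is an assumption based on the four currencies listed above, and it only handles the "symbol amount - symbol amount" shape seen in the sample.

```typescript
interface ParsedSalary {
  min: number;
  max: number;
  currency: string;
}

// Assumed mapping from display symbol to the currency codes listed in the table.
const CURRENCY_BY_SYMBOL: Record<string, string> = { "$": "USD", "£": "GBP", "€": "EUR", "₹": "INR" };

// Convert an amount plus optional K/M suffix (e.g. "185", "K") to a plain number.
function toNumber(amount: string, suffix: string): number {
  const upper = suffix.toUpperCase();
  const multiplier = upper === "K" ? 1000 : upper === "M" ? 1000000 : 1;
  return Math.round(parseFloat(amount) * multiplier);
}

// Parse strings such as "$185K - $225K"; returns null for anything else.
function parseSalaryRange(raw: string): ParsedSalary | null {
  const match = raw.trim().match(/^([$£€₹])\s*([\d.]+)([KM]?)\s*-\s*[$£€₹]?\s*([\d.]+)([KM]?)$/i);
  if (!match) return null;
  const [, symbol, low, lowSuffix, high, highSuffix] = match;
  return {
    min: toNumber(low, lowSuffix),
    max: toNumber(high, highSuffix),
    currency: CURRENCY_BY_SYMBOL[symbol],
  };
}

console.log(parseSalaryRange("$185K - $225K")); // { min: 185000, max: 225000, currency: 'USD' }
```

Returning null for unrecognised strings keeps the raw salaryRange usable even when the numeric fields cannot be derived.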

Companies row (~26 fields, plus enrichment fields)

| Field | Description |
| --- | --- |
| id, slug, name | YC company id, slug, display name. |
| url | Profile URL (https://www.ycombinator.com/companies/{slug}). |
| batch | E.g. "Winter 2009", "Spring 2026". |
| industry, subindustry, industries[] | Top-level industry, second-level subindustry, full industries array. |
| regions[] | HQ region tags. |
| allLocations | Free-form location string. |
| oneLiner, longDescription | Short tagline and full description. |
| teamSize | Headcount or null. |
| status | "Active" / "Acquired" / "Public" / "Inactive" / etc. |
| stage | YC's stage label: "Early" / "Growth" / "Public" / etc. |
| topCompany, isHiring, nonprofit | Booleans. |
| launchedAt | ISO datetime (converted from epoch). |
| website | External company website. |
| logo | Small-logo URL from YC's S3. |
| tags[] | Free-form YC tags. |
| formerNames[] | Past company names if YC has them. |
| appVideoPublic, demoDayVideoPublic | Whether YC's videos are publicly available. |
| scrapedAt | ISO timestamp. |

With scrapeFounderDetails: true:

| Field | Description |
| --- | --- |
| founders[] | { name, title, bio, linkedinUrl, twitterUrl, avatarUrl, isActive, hasEmail }. name and title are properly separated. |
| socials | { linkedin, twitter, facebook, crunchbase, github } (non-empty values only). |
| appVideoUrl, demoDayVideoUrl | Direct video URLs when YC has them and they're public. |

With scrapeOpenJobs: true:

| Field | Description |
| --- | --- |
| openJobs[] | { jobId, title, url, type, roleCategory, roleSubcategory, salaryRange, equity, location, experience, applyUrl, postedAgo, visaSponsorship }. |

Monitoring mode

When monitoringMode is enabled (jobs only), the actor only emits jobs whose numeric id has not been seen in previous runs by the same Apify user. Useful for:

  • Tracking new YC job postings as they appear
  • Building a historical archive without re-scraping
  • Keeping downstream notifications free of duplicates

The actor maintains a per-user Key-Value store keyed YC-JOBS-SEEN-{apifyUserId}. On each run with monitoringMode: true, every listing job is checked against this store; new ids are added (with a small stub: id, url, title) and only those new jobs are enqueued for detail scraping. Reset by deleting the corresponding KV store from the Apify console.
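
The dedupe step described above can be sketched with an in-memory map standing in for the per-user Key-Value store. Everything here is illustrative: `SeenStore`, `seenStoreName` and `filterNewJobs` are hypothetical names, and a real run would load and persist the store between runs rather than hold it in memory.

```typescript
// Stub recorded for each job that has already been seen.
interface ListingJob {
  id: string;
  url: string;
  title: string;
}

// In-memory stand-in for the Key-Value store.
type SeenStore = Map<string, ListingJob>;

// Store name as described above: YC-JOBS-SEEN-{apifyUserId}.
function seenStoreName(apifyUserId: string): string {
  return `YC-JOBS-SEEN-${apifyUserId}`;
}

// Return only jobs not yet in the store, recording a small stub for each new one.
function filterNewJobs(store: SeenStore, listing: ListingJob[]): ListingJob[] {
  const fresh: ListingJob[] = [];
  for (const job of listing) {
    if (store.has(job.id)) continue; // seen in a previous run
    store.set(job.id, { id: job.id, url: job.url, title: job.title });
    fresh.push(job);
  }
  return fresh;
}

const store: SeenStore = new Map();
const firstRun = filterNewJobs(store, [
  { id: "job-a", url: "https://example.com/a", title: "Engineer" },
  { id: "job-b", url: "https://example.com/b", title: "Designer" },
]);
const secondRun = filterNewJobs(store, [
  { id: "job-b", url: "https://example.com/b", title: "Designer" },
  { id: "job-c", url: "https://example.com/c", title: "Recruiter" },
]);
console.log(seenStoreName("user123"), firstRun.length, secondRun.length); // YC-JOBS-SEEN-user123 2 1
```

Only the jobs returned by such a filter would go on to detail scraping, which is why a monitored run gets cheaper as the store fills up.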

monitoringMode is currently jobs-only. Companies-mode dedup is planned but not implemented.


Local development

npm install
npm run start:dev # runs src/main.ts via tsx
npm run build # tsc to dist/
npm run lint # eslint src/

Local input is read from storage/key_value_stores/default/INPUT.json. For fully isolated runs set APIFY_LOCAL_STORAGE_DIR and CRAWLEE_STORAGE_DIR to a temp path.


Limitations / known gaps

  • Single-company URL uses Algolia full-text query + post-filter to exact slug match (slug isn't in YC's filterable attributes). Returns 1 row max regardless of maxItems.
  • Companies mode ignores monitoringMode; dedup is jobs-only for now.
  • Mixed start URLs (jobs + companies): maxItems is shared. Companies first, jobs gets the remainder.
  • Multi-queries budget is shared across keywords. If maxItems is small, later keywords may not fire.
  • Companies enrichment cost: scrapeFounderDetails and scrapeOpenJobs each add one HTTP request per company (concurrency = 5). For 100 companies, expect ~10-20 seconds extra per toggle.

❓ FAQ

Can I scrape both jobs and companies in the same run? Yes. Mix any /jobs/... and /companies... URLs in startUrls; each one auto-routes. Because maxItems is shared, the companies path consumes its share first and jobs get the remainder.

Does the Companies path need a proxy? No. It hits YC's public Algolia index directly, no rate-limit issues, no IP blocks. The proxy setting only applies to the Jobs scraper.

What does the Jobs location slug actually filter on? A case-insensitive substring of the slug (with - replaced by a space) against each job's location string. remote matches anything containing "Remote"; india matches the country code IN or the word India. YC applies that filter client-side, so we replicate it locally.

Can I get founders' LinkedIn URLs? Yes: set scrapeFounderDetails: true. Each founder row includes linkedinUrl and twitterUrl parsed from the inlined company JSON on /companies/{slug}.

Does scrapeOpenJobs overlap with running Jobs mode? Different surface. scrapeOpenJobs enriches a company row with that company's open postings. Jobs mode (or a /jobs/... URL) returns each job as its own row in the dataset.

What's monitoringMode and when should I use it? Jobs-only flag that dedupes against a per-user KV store of seen job ids. Use it for scheduled runs that should only emit new postings.

Can I use the actor with a single-company URL like /companies/airbnb? Yes. It returns exactly one row regardless of maxItems because slug isn't a filterable Algolia attribute; the actor full-text-searches the slug and post-filters to an exact match.

How do I control cost? Set maxItems to the smallest value that covers your use case. Companies-mode pages through Algolia 100 rows at a time; jobs-mode follows pagination up to maxPages. Disable both enrichment toggles if you don't need founders/socials/jobs, since each adds an HTTP request per company row.


Support

  • File an issue or a feature request via the Apify Console Issues tab on the actor page.
  • Custom integrations (different output shape, additional filters, scheduled feeds into a warehouse): open an issue describing the use case.
  • The repo source is open in src/: main.ts orchestrates dispatch, lib/ycScrape.ts handles jobs HTML parsing, lib/ycCompanies.ts is the Algolia client + enrichment.

License

ISC. See package.json.