Companies & Profiles LinkedIn scraper.
Get comprehensive profiles of individuals and companies based on your keywords and filters.
Unleash the power of data! 🌐🔍
status + reason on every output record (profiles and companies). Each item now carries status (FOUND / NOT_FOUND) and reason (DOES_NOT_EXIST, UNREACHABLE, or null when found), defaulted at emit time. Non-existent or unreachable handles are now catalogued as NOT_FOUND records instead of vanishing silently: a LinkedIn 404 → DOES_NOT_EXIST, a fetch that exhausts retries → UNREACHABLE, and a company whose Voyager response resolves no entity → DOES_NOT_EXIST. Both fields are documented in the output schema and shown as the first column in the Profiles/Companies table views.
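The mapping from lookup outcome to status / reason described above can be sketched as follows. This is a minimal illustration, not the actor's code: the LookupOutcome type and the toStatus helper are hypothetical names, and only the status and reason values come from the entry itself.

```typescript
type Status = "FOUND" | "NOT_FOUND";
type Reason = "DOES_NOT_EXIST" | "UNREACHABLE" | null;

// Hypothetical summary of how a lookup ended; not the actor's real type.
type LookupOutcome = "ok" | "http404" | "noEntityResolved" | "retriesExhausted";

function toStatus(outcome: LookupOutcome): { status: Status; reason: Reason } {
  switch (outcome) {
    case "ok":
      return { status: "FOUND", reason: null };
    case "http404": // LinkedIn answered 404
    case "noEntityResolved": // company response resolved no entity
      return { status: "NOT_FOUND", reason: "DOES_NOT_EXIST" };
    case "retriesExhausted": // fetch gave up after all retries
      return { status: "NOT_FOUND", reason: "UNREACHABLE" };
  }
}
```

A consumer can then filter the dataset on status === "FOUND" and inspect reason for everything else.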
Fixed
Profiles occasionally returned with only their identity block missing (no firstName/lastName/headline/location/profileId, but experience/education/skills present). The cause: a leased account/proxy sometimes returns a garbled/partial profile shell — not a 404, long enough to look real, but with the top-card identity block absent. The shell handler now detects this (no name and no nonIterableProfileId), retires that session and retries the shell on a fresh account (up to 3 rotations). The detection is deliberately conservative — it fires only when both the name and the profileId are missing, so a real profile that merely lacks a headline is never rotated away. If all rotations still return a partial shell, the actor keeps the partial data rather than dropping the profile. Added offline tests for the isPartialShell predicate (incl. real captured shells, which are never flagged).
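A conservative predicate of the kind described above might look like the sketch below. The isPartialShell name and the nonIterableProfileId field come from the entry; the ShellIdentity shape and the use of firstName / lastName as "the name" are assumptions made for illustration.

```typescript
// Assumed minimal view of a parsed profile shell (illustrative, not the real type).
interface ShellIdentity {
  firstName?: string | null;
  lastName?: string | null;
  headline?: string | null;
  nonIterableProfileId?: string | null;
}

// Flags a shell only when BOTH the name and the profile id are missing,
// so a real profile that merely lacks a headline is never rotated away.
function isPartialShell(shell: ShellIdentity): boolean {
  const hasName = Boolean(shell.firstName?.trim() || shell.lastName?.trim());
  const hasProfileId = Boolean(shell.nonIterableProfileId?.trim());
  return !hasName && !hasProfileId;
}
```

Requiring both signals to be absent trades a few missed partial shells for zero false rotations on sparse but genuine profiles.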
Fixed
Profile summary was duplicated. LinkedIn renders the About text twice (a truncated collapsed preview + the full expanded text); the parser joined both, doubling the summary verbatim. It now de-duplicates — dropping any paragraph that is a duplicate of, or a truncated prefix of, a longer one. Confirmed on a 15-profile live run: 0 duplication (e.g. a 537-char doubled summary is now the correct 268 chars).
Profile firstName/lastName missing on some profiles, with the name leaking into headline. Some profile shells carry no firstName/lastName JSON fields at all — the name only exists as the top-card's first rendered line. The top-card extractor now isolates that name line (so it no longer leaks into headline) and recovers the name from it when the JSON fields are absent. Live name coverage went 6/15 → 10/15 (every profile that returns a shell now gets its name). Added offline regression tests on real captured shells (incl. a zero-name-field shell).
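The summary de-duplication rule above (drop a paragraph that duplicates, or is a truncated prefix of, a longer one) can be sketched like this. The dedupeParagraphs name is hypothetical, and stripping a trailing ellipsis from the collapsed preview is an assumption about how the truncated copy looks.

```typescript
function dedupeParagraphs(paragraphs: string[]): string[] {
  const cleaned = paragraphs.map((p) => p.trim()).filter((p) => p.length > 0);
  const kept: string[] = [];
  cleaned.forEach((p, i) => {
    // Assumption: a collapsed preview may end with an ellipsis; compare without it.
    const stem = p.replace(/(…|\.\.\.)\s*$/, "").trimEnd();
    const shadowed = cleaned.some((q, j) => {
      if (j === i) return false;
      if (q === p) return j < i; // exact duplicate: keep only the first occurrence
      return q.length > stem.length && q.startsWith(stem); // truncated prefix of a longer one
    });
    if (!shadowed) kept.push(p);
  });
  return kept;
}
```

For example, a truncated preview followed by the full About text collapses to the full text only, and two identical copies collapse to one.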
[10.11.0] — 2026-06-05
Added
Output schema (.actor/dataset_schema.json + .actor/output_schema.json, wired via actor.json storages.dataset / output). Declares every field the actor can emit (profiles + companies, union, all nullable, nested objects) and gives the Console two dedicated table views: Profiles and Companies. Built from real run outputs, not just the TS interfaces.
README fully rewritten following the Apify Academy best-practice structure: v10 "rebuilt from scratch" header, documented input (all fields + example) and output (maximal profile + company JSON examples covering every emittable field/section), data tables, how-to, pricing, FAQ/legality, and the new banner.
urn (member URN) is now documented in the output schema and README — it was already emitted but undocumented.
Changed
Output field names harmonized between profiles and companies (breaking for the v2 output shape):
followerCount (company) → followersCount (now shared with profiles).
backgroundImageUrl (profile) → coverImageUrl (now shared with companies).
industryName (profile) → industry (now shared with companies).
Distinct-by-design fields kept as-is: profile profilePictureUrl vs company logoUrl, profile summary vs company description.
[10.10.1] — 2026-06-05
Fixed
get-companies returned the wrong company. The Voyager details parser picked the first company in the response's included[] (which also carries affiliated/similar companies) whenever the universalName match missed — and the match was case-sensitive, so a slug like Bebity (≠ stored bebity) silently fell through to an unrelated company (it returned Blanche.Agency for Bebity). It now selects the company the query actually resolved to, via the response's primary *elements URN (case-independent), with a case-insensitive universalName fallback.
A company NAME (not a URL) is now searched, not looked up as a slug. Under the legacy isUrl:true flag, a bare token in get-companies (e.g. Bebity) was treated as a direct universalName GraphQL lookup, which resolves to the wrong company. A markerless entry now routes to a proper company search (companies vertical filter); only an explicit /company/{slug} URL stays a direct lookup. Profile routing is unchanged.
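The company selection fix above can be illustrated with a small sketch. The IncludedCompany shape, the pickCompany name, and the sample URNs are assumptions; the real Voyager payload carries many more fields. The point is that the primary URN decides first, the universalName comparison ignores case, and there is no "first item in included[]" fallback.

```typescript
// Assumed minimal shape of a company entry in the response's included[] array.
interface IncludedCompany {
  entityUrn: string;
  universalName?: string;
}

function pickCompany(
  included: IncludedCompany[],
  primaryUrns: string[], // URNs the query actually resolved to
  slug: string, // slug as typed by the user, any casing
): IncludedCompany | undefined {
  const wanted = new Set(primaryUrns);
  const byUrn = included.find((c) => wanted.has(c.entityUrn));
  if (byUrn) return byUrn; // this path does not depend on the slug's casing at all
  const needle = slug.toLowerCase();
  // Fallback: case-insensitive universalName match; never default to included[0].
  return included.find((c) => c.universalName?.toLowerCase() === needle);
}
```

Returning undefined when nothing matches lets the caller treat the company as not found instead of emitting an unrelated one.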
[10.10.0] — 2026-06-05
Added
Experience role descriptions — each experience item now carries a description field (the role blurb / "…see more" text). It is detected structurally via the SDUI maxLineCountExpression marker, so it is never confused with the location.
input routing private log at startup — shows, per run, how each input entry was classified (searchCount / directCount plus the actual searchTerms / directSlugs, capped to 25). Makes "why was my name treated as a URL?" obvious from the logs.
Fixed
Profile sections cross-contaminated each other (intermittent, ~1 run in 4). When LinkedIn returned the Part1 card with every section inlined into the root flight row, the section scope matched that one row for all sections, so experience / education / certifications / volunteer each received the full concatenated blob and mislabelled it. Sections are now scoped by the node's observabilityIdentifier subtree, which is robust to both the inlined and split layouts.
location on experiences contained the role description note (and the note was lost when a real location existed). Notes now go to the new description field; location only ever holds a real location.
Company "media / CTA" taglines parsed as fake roles ("Learn more here", promo lines) — they broke multi-role grouping (sub-roles lost their company). Dropped via the *-media-button tracking marker. The same fix removes uploaded certification attachments ("…certificate.pdf") that were emitted as bogus certifications.
Multi-role groups with a standalone employment-type line ("Permanent", "Full-time") lost the company on each sub-role and put the type into location. The type is now read as employmentType (the secondary line before the date; location is the one after).
Blank "+N skills" affordance nodes split experiences — a whitespace-only node was treated as a job title, swallowing the next real role. Whitespace-only text is now ignored everywhere.
Education "Activities and societies" became a bogus school entry (or was silently dropped when long). It is now captured as the entry's activities field.
Honors & Projects emitted noise entries — the "Associated with {company/school}" association line and the "Other contributors" avatar affordance were parsed as separate entries (and the association line stole the real entry's description). Both are now skipped.
Skills empty-state placeholder ("Nothing to see for now" / "Skills that X adds will appear here.") was emitted as two skills on profiles with no skills. Now filtered.
A typed name submitted with the legacy isUrl:true flag was treated as a direct profile slug instead of a search (e.g. searching "Hugo Picquet" fetched a non-existent /in/Hugo picquet). Under isUrl:true, only entries that can actually be a slug (URL marker or a space-free token) are routed to a direct lookup; anything else is a search.
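The profile routing rule from the last item can be sketched as below. The routeProfileEntry name and the regular expression used to detect the /in/ marker are illustrative assumptions, not the actor's implementation.

```typescript
type Route = "direct" | "search";

function routeProfileEntry(entry: string, isUrl: boolean): Route {
  const value = entry.trim();
  // A /in/ marker (full URL or slug-path) always means a direct lookup.
  if (/\/in\/[^/\s]+/i.test(value)) return "direct";
  // Under the legacy isUrl flag, only a space-free token can be a slug.
  if (isUrl && value.length > 0 && !/\s/.test(value)) return "direct";
  // Anything else (for example a typed full name) is a search.
  return "search";
}
```

With this rule, "Hugo Picquet" submitted with isUrl:true becomes a search, while a single token such as satyanadella still resolves as a slug.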
[10.9.0] — 2026-06-04
Fixed
v1-shaped inputs were silently ignored, returning the schema default (keywords:["web dev"], limit:1) instead of the submitted targets. Root cause: the schema exposed its fields under v2__-prefixed names, so Apify injected the v2__* defaults for every field a v1-shaped input (action/keywords/limit) didn't provide. Those injected defaults flipped normalizeInput's isV2 discriminator to true (v2-wins precedence), so the engine read the injected v2__search:["web dev"] / v2__limit:1 defaults and discarded the real keywords/limit. Net effect: a run of 4 profile URLs with limit:4 searched "web dev" capped at 1 and emitted a single record.
Changed
Input fields renamed back to the shared v1/v2 names (v2__action→action, v2__search→keywords, v2__limit→limit, v2__location→location, v2__profileFields→profileFields, v2__proxyConfiguration→proxyConfiguration). There is now a single set of field names, so normalizeInput and the startup-summary config read them directly — the v2__ vs v1 discriminator (and the v2-wins precedence that caused the bug) is gone. Legacy isUrl:true is still honored as a direct-lookup hint; isName remains a no-op. Behavior is otherwise unchanged.
[10.8.0] — 2026-06-03
Changed
Simplified v2 input. Targeting collapses from four concepts (action + isUrl + isName + the implicit meaning of keywords) to two: v2__action (Profiles / Companies) and v2__search (one bulk list of search terms, URLs, or /in/·/company/ slug-paths). URLs are auto-detected; a bare term is a search; a /in/{slug} or /company/{slug} marker forces a direct lookup (returns one result, ignores the limit). Removed the confusing isUrl/isName toggles and the enrichWith* fields from the v2 UI. The location field becomes v2__location (same autocomplete behavior).
Added
normalizeInput backward-compat layer (src-v2/dtos/normalize-input.ts). All v2 fields are prefixed v2__; the presence of any v2__ key marks an input as v2. Legacy v1-shaped inputs (action/keywords/isUrl/isName) still work unchanged — they ride through the schema's additionalProperties: true and are mapped to the canonical model (isName is dropped, being a no-op in v2). Raw LinkedIn IDs are rejected with a clear message instead of issuing a dead request. The schema requires nothing (required: []) so v1-shaped inputs still pass platform validation. Covered by offline tests in src-v2/__tests__/run.ts.
[10.7.0] — 2026-06-03
Added
location filter now works (get-profiles). The previously-inert "🌍 Location" input is wired into people-search as the geoUrn facet. Each entry is resolved autocomplete-style: free text (e.g. Paris, United States, Greater London) is matched against LinkedIn's geo typeahead and the top match is used; a raw geo id (106383538), a urn:li:geo:… URN, or a URL carrying geoUrn=… is used directly (no lookup). Multiple entries widen the filter (matches ANY). Resolution runs once at startup via one leased account and is baked into every search request (geoUrn=["…"] + origin=FACETED_SEARCH), including pagination. Unmatched entries are skipped with a warning (the run continues; if none resolve it runs unfiltered) — the matched place names are surfaced on the public log. New src-v2/linkedin/geo.ts (URL builder + typeahead parser + raw-id detection) and src-v2/runtime/geo-resolver.ts, both with offline tests over a real captured typeahead fixture. Geo typeahead transport is the Voyager GraphQL voyagerSearchDashReusableTypeahead (same rot caveat as the company queryId; see docs/RE §1e). Profiles only — company search is unaffected.
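The distinction between entries that need a typeahead lookup and entries used directly can be sketched as follows. The classifyGeoEntry name and the GeoEntry type are hypothetical; the three direct forms (raw id, urn:li:geo:… URN, URL carrying geoUrn=…) are the ones listed above.

```typescript
type GeoEntry =
  | { kind: "ids"; geoIds: string[] } // usable directly, no lookup
  | { kind: "text"; query: string }; // must go through the geo typeahead

function classifyGeoEntry(raw: string): GeoEntry {
  const value = raw.trim();
  if (/^\d+$/.test(value)) return { kind: "ids", geoIds: [value] };

  const urnId = /^urn:li:geo:(\d+)$/i.exec(value)?.[1];
  if (urnId) return { kind: "ids", geoIds: [urnId] };

  const param = /[?&]geoUrn=([^&#]+)/.exec(value)?.[1];
  if (param) {
    let decoded = param;
    try {
      decoded = decodeURIComponent(param);
    } catch {
      // keep the raw value if it is not valid percent-encoding
    }
    const ids = decoded.match(/\d+/g);
    if (ids && ids.length > 0) return { kind: "ids", geoIds: ids };
  }
  return { kind: "text", query: value };
}
```

Free text such as Paris falls through to the text branch and would then be resolved against the typeahead, with the top match kept.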
[10.6.0] — 2026-06-03
Fixed
limit ("🔢 Limit of result per query") no longer caps directly-provided URLs, and is now per-query. It was implemented as a single global counter applied to every enqueued item — including profile/company URLs passed in directly. With the default limit=1, a run of N URLs (e.g. the big-one preset: 493 profile URLs) returned only 1 item. Now: direct URLs bypass the per-query cap (all N are scraped, still subject to the daily quota), and in search mode limit is a per-keyword budget — limit=5 over 2 search terms yields up to 10 rows, matching the input-schema description. Gate logic extracted to src-v2/runtime/limit-gate.ts with offline tests; it composes with the daily-quota cap from 10.5.0 (per-query limit AND global remaining-quota both apply).
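The per-query gate described above can be sketched as a small class. This is an illustration under assumptions: the LimitGate name and the item shape are hypothetical, and the daily-quota cap that composes with it is left out.

```typescript
type GateItem = { kind: "direct" } | { kind: "search"; query: string };

class LimitGate {
  private readonly limit: number;
  private readonly usedPerQuery = new Map<string, number>();

  constructor(limit: number) {
    this.limit = limit;
  }

  // Returns true when the item may be enqueued.
  allow(item: GateItem): boolean {
    if (item.kind === "direct") return true; // direct URLs bypass the per-query cap
    const used = this.usedPerQuery.get(item.query) ?? 0;
    if (used >= this.limit) return false; // this keyword's budget is spent
    this.usedPerQuery.set(item.query, used + 1);
    return true;
  }
}
```

With limit=5 and two search terms, the gate admits up to 10 search results in total, while any number of directly supplied URLs pass through.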
Added
profileFields multi-select (get-profiles) — choose which profile sections to scrape; each option maps to one LinkedIn request (About → above card; Experience/Education/Certifications/Volunteer → part1 card; Languages → part4 card; Skills/Honors/Projects/Organizations → their pagers). Fewer selections = fewer requests = faster, cheaper, lower ban risk. Empty/absent = scrape everything (backward compatible). Base identity (name, headline, location, picture, counts) always returned via the mandatory shell. Selection logic in src-v2/runtime/profile-fields.ts with offline tests.
Company output enriched (get-companies): CompanyData now also exposes phone ({ number, extension }), callToAction ({ text, type, url }), verified (page verification), active, pageType (COMPANY/SCHOOL/SHOWCASE), jobSearchUrl, and hashtags (associated #tags, leading # stripped). Field shapes validated against the decompiled Voyager Company / CallToAction / PhoneNumber / PageVerification models; covered by offline tests over the microsoft/stripe fixtures.
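The profileFields selection above maps each chosen section to one LinkedIn request, with several sections sharing a card. A sketch of that planning step is shown below; the option keys and request labels are hypothetical names, while the section-to-card grouping follows the list above.

```typescript
// Hypothetical option keys and request labels; the grouping follows the changelog entry.
const FIELD_TO_REQUEST: Record<string, string> = {
  about: "card:above",
  experience: "card:part1",
  education: "card:part1",
  certifications: "card:part1",
  volunteer: "card:part1",
  languages: "card:part4",
  skills: "pager:skills",
  honors: "pager:honors",
  projects: "pager:projects",
  organizations: "pager:organizations",
};

function planRequests(profileFields?: string[]): string[] {
  // Empty or absent selection means "scrape everything" (backward compatible).
  const selected =
    profileFields && profileFields.length > 0 ? profileFields : Object.keys(FIELD_TO_REQUEST);
  const requests = new Set<string>(["shell"]); // the shell is mandatory
  for (const field of selected) {
    const request = FIELD_TO_REQUEST[field];
    if (request) requests.add(request); // sections sharing a card cost one request
  }
  return Array.from(requests);
}
```

Selecting only Experience and Education therefore costs the shell plus a single part1 request.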
[10.5.0] — 2026-06-02
Added
Per-user daily result quota in v2 (src-v2/runtime/rate-limit.ts), ported from v1's Redis approach. Counts delivered dataset items (profiles and companies, one shared quota) per Apify userId over a 24h sliding window, in Redis. Config: REDIS_URL + DAILY_LIMIT (default 150000). Same key scheme as v1 (rate_limit:{userId}:count / :window_start), so a shared Redis makes v1 and v2 draw from the same budget.
The run's effective cap is min(input.limit, remaining); the request gate stops enqueuing at that cap, so the daily limit is never exceeded. Each emitted item increments the Redis counter.
Graceful degradation: no REDIS_URL (or a connection error) → the limiter is a no-op and the run proceeds unthrottled.
Optional RATE_LIMIT_PREFIX env to namespace the Redis keys (e.g. claudetest:), isolating from existing data on a shared Redis. Default "" keeps the v1-compatible rate_limit:{userId} scheme.
Client-facing quota line on the public log channel (Daily quota: X/Y used today — Z remaining); all other rate-limit internals (Redis connection, detailed counters) stay on the private/debug channel.
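The cap arithmetic and key scheme above can be sketched without Redis. The helper names are hypothetical, and prepending RATE_LIMIT_PREFIX directly to the key is an assumption about how the namespace is applied.

```typescript
// Effective cap for one run: min(input.limit, remaining daily quota).
function effectiveCap(inputLimit: number, dailyLimit: number, usedToday: number): number {
  const remaining = Math.max(0, dailyLimit - usedToday);
  return Math.min(inputLimit, remaining);
}

// Key scheme shared with v1; the prefix handling is an assumption.
function quotaKeys(userId: string, prefix = ""): { count: string; windowStart: string } {
  const base = `${prefix}rate_limit:${userId}`;
  return { count: `${base}:count`, windowStart: `${base}:window_start` };
}
```

For instance, a user with 149995 of 150000 results used and limit=10 gets an effective cap of 5, and a user already over the cap gets 0.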
Changed
Unlike v1 (which Actor.fail()s), hitting the daily cap now lets the run succeed: allowed results are delivered and the run finishes with a warning + an Apify status message (Daily limit … Resets in ~N min). Already-at-cap runs exit immediately without leasing accounts.
[10.4.0] — 2026-06-02
Added
action=get-companies in v2 — company search and company details (previously threw). Company search reuses the unified Crawlee queue + session affinity as a POST RSC stream (mirror of people-search, page-key d_flagship3_search_srp_companies); regex-extracts /company/{universalName} slugs. Company details are fetched from the Voyager GraphQL API (/voyager/api/graphql, voyagerOrganizationDashCompanies), which returns normalized JSON (application/vnd.linkedin.normalized+json+2.1) — not SDUI/RSC. One GET per company → one dataset item (no cards/pagers).
Offline tests over real captured fixtures (company-microsoft.voyager.json, company-stripe.voyager.json, company-search-consulting.rsc.txt): parser fields, slug extraction, and the Voyager header contract.
Docs
docs/RE/LINKEDIN_API_REVERSE_ENGINEERING.md §7 — full company reverse-engineering (search + details endpoints, headers, field mapping), captured live via CamouxAI.
docs/DECISIONS.md D12 — company details use Voyager GraphQL JSON (distinct transport/parser from profiles); the queryId is a versioned hash that rots on client bumps.
Request queue is now memory-only in public mode, instead of persisted-then-dropped. v10.3.0 wrote every queued request (search/shell/card/pager) to storage/request_queues/ during the run and deleted them in a finally block — so they were still visible mid-run and the cleanup was best-effort. The request queue is now backed by an in-memory storage client (@crawlee/memory-storage, persistStorage: false), so the engine's internals are never written to the run's storage at all. Scoped to the request queue only: the dataset output keeps the default storage client and persists as before. General-debug mode still uses the default persistent queue for the Apify crawler UI.
[10.3.0] — 2026-06-02
Added
General-debug mode (src-v2/): activate via the GENERAL_DEBUG=1 env var or a hidden input field. When on, all Crawlee classes (session pool, statistics) run with persistence + stats enabled for the Apify crawler UI; when off, none of it is exposed.
Double logger (logs.priv / logs.pub / logs.all): a friendly, safe public log stream by default; switches to a detailed private stream under general-debug. All levels (debug/info/warning/error, plus warn alias).
Startup summary: every run opens with the scraper version, log mode, general-debug state, action/limit/keyword counts, and a warning for any missing required env vars.
Changed
All src-v2 log call sites migrated onto the double logger; account leasing / usage telemetry / RSC internals now log only on the private channel.
Public mode now mutes Crawlee's own class loggers (crawler, session pool, request queue, statistics) by raising the shared root @apify/log level to ERROR; this also hides Crawlee's routine retry warnings (Reclaiming failed request …, which expose the crawler name + target URL + retry count). Our funny public logs ride a child logger kept at INFO, so the user-facing narrative still prints; a genuine fatal crawler error still surfaces. Private mode raises the root to DEBUG so everything shows.
Fixed
Per-second "Statistics" log spam in public mode. The previous suppression trick set statisticsOptions.logIntervalSecs to ~41 years; logIntervalSecs * 1000 overflowed setInterval's 32-bit signed range and silently clamped to ~1 ms, flooding the log dozens of times per second (and the line leaked engine internals). Suppression is now done by log level; the interval stays a sane value (60s public / 30s private).
Request queue left in storage in public mode. Session-pool and Statistics state were already gated out of the Key-Value store, but the request queue (every queued search/shell/card/pager request) is a separate store and still lingered in storage/request_queues/. main.ts now drops the request queue at shutdown when general-debug is off — the dataset output (a separate store) is untouched, and private mode keeps the queue for the Apify crawler UI.
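The statistics-interval overflow fixed above follows from how Node.js timers treat large delays: a delay above 2147483647 ms (about 24.8 days) is set to 1 ms. The sketch below shows the arithmetic only; the helper name is hypothetical.

```typescript
// Largest delay a Node.js timer accepts: 2^31 - 1 ms, about 24.8 days.
const MAX_TIMER_DELAY_MS = 2147483647;

// True when an interval given in seconds would be clamped to 1 ms by setInterval.
function exceedsTimerRange(intervalSecs: number): boolean {
  return intervalSecs * 1000 > MAX_TIMER_DELAY_MS;
}

// Roughly 41 years expressed in seconds, as in the old suppression trick.
const fortyOneYearsSecs = 41 * 365 * 24 * 60 * 60;
```

Here exceedsTimerRange(fortyOneYearsSecs) is true, while the 60 s and 30 s intervals now in use stay well inside the range.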
[10.2.1] — 2026-06-02
Fixed
Leased accounts are now released at shutdown (AccountRegistry.releaseAll() in main.ts's finally) — they were leased but never returned to Milkbox, holding leases open server-side.
Idempotent profile-shell seed: a re-queued shell no longer resets a live accumulator entry (which would dedupe its cards by uniqueKey and strand the profile so it never emits).
choosePagers filters out any unresolved pager (defensive under the project's loose TypeScript).
Added
postNavigationHook marks the account good (registry.markGood) on a <400 response, so account-health telemetry sees successes, not only errors/bans.
Docs
New docs/SESSION-AFFINITY.md (system + how to use/reuse the brick + gotchas); CLAUDE.md updated to the unified-queue flow (the old two-phase description was stale).
[10.2.0] — 2026-06-02
Added
Unified single-queue v2 pipeline with session affinity. Every LinkedIn call (search → profile shell → cards → pagers) is now a queued Crawlee Request instead of Phase-1 direct fetch + Phase-2 inline ctx.sendRequest. A profile's cards/pagers are pinned to the account that handled its shell and run concurrently on that one account/IP; each new profile takes a fresh session. Output unchanged: one dataset item per profile.
Generic crawlee-session-affinity brick (src-v2/crawlee/session-affinity/): soft request→session affinity for Crawlee 3.16 (pin via userData, graceful fallback when a session dies, forefront batch). SessionPool is fed by an AccountRegistry that adapts the Milkbox provider. Offline-tested (pnpm run test:affinity).
Two-wave profile accumulator (src-v2/linkedin/profile-accumulator.ts): per-profile completion barrier (cards → pagers) that emits one aggregated item; best-effort (a part that exhausts retries still advances the barrier). Offline-tested (pnpm run test:queue).
Changed
Search retries now ride Crawlee (maxRequestRetries + session retire) instead of the manual 8-account loop. Component/pager POSTs are navigated by the crawler (validated: impit POST returns RSC).
Profile handler split into profile-shell / component / pager handlers + a part-extractors registry; main.ts rebuilt on the affinity brick (maxConcurrency: 50, maxPoolSize: 10, maxRequestRetries: 8).
Fixed
Account ids are sanitized without hyphens so they are valid Apify proxy session tokens (/^[\w._~]+$/); Milkbox UUID ids no longer break proxy resolution.
Per-(profile, part) uniqueKey on card/pager requests prevents Crawlee from deduping one profile's parts against another's (their URLs are identical; the vanity lives in the POST body).
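Both fixes above come down to small string helpers. The sketch below is illustrative: the function names and the key format are hypothetical, and only the session-token pattern is taken from the entry.

```typescript
// Pattern for a valid proxy session token, as quoted in the entry above.
const PROXY_SESSION_TOKEN = /^[\w._~]+$/;

// Drops every character outside the allowed set, which removes the hyphens of a UUID.
function toProxySessionId(accountId: string): string {
  return accountId.replace(/[^\w._~]/g, "");
}

// Card/pager requests share one URL, so the key must carry the profile and the part.
function partUniqueKey(vanity: string, part: string): string {
  return `${vanity.toLowerCase()}::${part}`;
}
```

A UUID such as 123e4567-e89b-12d3-a456-426614174000 becomes 123e4567e89b12d3a456426614174000, and the skills request of two different profiles no longer collapses into one queue entry.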
[10.1.0] — 2026-06-02
Added
proxyConfiguration input (standard Apify proxy editor) — supports both Apify Proxy (groups/country) and custom proxy URLs. Search now routes through a proxy (linkedin/http.ts).
File-based proxy list (linkedin/proxies.ts): loads bird-proxies.txt (override via PROXY_LIST_FILE) and builds the fallback ProxyConfiguration from it — used when out of Apify Proxy credits. Takes precedence over the input proxy; the account's bound proxy still wins.
Generic lazy-section discovery (component-request.ts extractLazySectionRefs): reads com.linkedin.sdui.profile.card.ref{profileId}{Section} refs from card responses instead of hardcoding (generalizes the Skills-only special case).
Lazy sections via the detail pager (Lot D): section content is loaded with POST /flagship-web/rsc-action/actions/pagination?sduiid=…pagers.profile.details.{slug} (a proto.sdui.actions.requests.PaginationRequest, empty states, keyed on the viewee profileId) — the broken ref{profileId}Skills call (500) is removed. New SKILLS_PAGER + ENTITY_PAGERS (certifications/courses/publications/honors/projects/organizations/languages, recipes confirmed live) + buildPaginationBody/buildPaginationUrl/paginationPageKey (component-request.ts), extractPagerBuckets/extractSkillsFromPager (rsc-flight.ts), extractCertificationsFromPager (rsc-parser.ts).
skills wired + verified live (6–57 clean skills).
certifications wired (prefers the complete pager over the card inline) + verified live ({name, issuer, date}).
honors / projects / organizations / languages wired + verified live via a generic entity grouper (groupPagerEntities): new ProfileData fields honors ({title, description}), projects ({name, dateRange, description}), organizations ({name, role, dateRange, description}); languages now come from the pager too.
courses/publications: pager recipes ready in ENTITY_PAGERS, parsers still to add. volunteer/patents: pagerId TBD.
Offline test harness (pnpm run test:v2, src-v2/__tests__/): assertions over real captured RSC fixtures (request body shape, ref discovery, section parsing, top-card name/headline/location).
New profile fields profilePictureUrl + backgroundImageUrl (extractTopCardImages): owner's images scoped to the topCard (avoids recommendation/company avatars), largest rendition built from the SDUI image model rootUrl + suffixUrl (≈800×800 photo, 350×1400 cover).
followersCount / connectionsCount now extracted from the topCard rendered text (extractTopCardCounts) — these are not numeric JSON fields in the RSC. Handles both single-node ("255 followers") and split-node ("229" + "connections") forms, and "500+". Verified live: followers on creator profiles, connections on normal profiles.
Per-account browser fingerprint: user-agent, accept-language, and x-li-track on all LinkedIn requests are now driven by browser_session.fingerprint from Milkbox — UA, languages, timezone, and screen match the leased account's browser profile.
LINKEDIN_HTTP3 env toggle: enables HTTP/3 on impit for Phase-1 search requests. Forced off whenever a proxy is in use — impit cannot combine HTTP/3 with a proxy.
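Reading followersCount / connectionsCount from rendered text, as described in the list above, can be sketched like this. It is not the actor's extractTopCardCounts: the function name, the decision to report "500+" as 500, and the lack of support for abbreviated forms such as "1.2K" are all assumptions of this example.

```typescript
interface TopCardCounts {
  followersCount: number | null;
  connectionsCount: number | null;
}

function parseTopCardCounts(textNodes: string[]): TopCardCounts {
  // Joining nodes turns the split form ("229" + "connections") into "229 connections".
  const joined = textNodes
    .map((t) => t.trim())
    .filter((t) => t.length > 0)
    .join(" ");

  const read = (label: string): number | null => {
    const raw = new RegExp(`(\\d[\\d,.]*)\\+?\\s*${label}`, "i").exec(joined)?.[1];
    if (!raw) return null;
    const value = parseInt(raw.replace(/[,.]/g, ""), 10); // drop thousands separators
    return Number.isNaN(value) ? null : value;
  };

  return { followersCount: read("followers?"), connectionsCount: read("connections?") };
}
```

A label with no number in front of it yields null rather than a guessed value.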
Changed
pnpm start now runs v2 (src-v2/main.ts) as the classic Apify actor entry (also used by apify run). v1 stays reachable via pnpm run start:v1. Added build:v2.
Minimal real-web component body (buildComponentBody): drops the guessed replaceableSectionArgs/vieweeProfileId/shouldSetupReplaceableComponent/profileComponentState. Component cards no longer require nonIterableProfileId, so a missed shell extraction no longer zeroes the profile (it's now only a health signal, not a gate).
x-li-application-version bumped 0.2.5529 → 0.2.5782 (the old value was stale).
Phase-1 search now uses impit (ImpitHttpClient, Firefox fingerprint) instead of axios, matching Phase-2's TLS fingerprint.
Fixed
Top-card parsing (extractTopCardFields): headline/name/location were mis-assigned (name or company used as headline; comma-bearing headline classified as location; pronouns "She/Her" used as headline; comma-less locations dropped). Now parsed by stable document order — verified live on satyanadella / williamhgates / hanaelliott / ghislaindurand.
firstName/lastName now fall back to findStringFieldInStream when the tree walk misses the profile shape on large (activity-inlined) shells.
Top-card name with a nickname (e.g. "Zsófia Réka (Sophie) Tóth"): the exact full-name match failed and the headline became the name; now falls back to a starts-with-first + contains-last match to skip the name line.
Phase-2 requests (profile GET + component POSTs) now egress through the leased account's residential proxy via a crawler-level ProxyConfiguration(newUrlFunction) resolving each session's bound proxy. The previous per-request request.proxyUrl was a no-op that leaked the datacenter IP.
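The nickname fallback from the list above (exact full-name match first, then starts-with-first plus contains-last) can be sketched as a predicate for spotting the name line in the top card. The isNameLine name and the case-insensitive comparison are assumptions of this example.

```typescript
function isNameLine(line: string, firstName: string, lastName: string): boolean {
  const text = line.trim().toLowerCase();
  const first = firstName.trim().toLowerCase();
  const last = lastName.trim().toLowerCase();
  if (first.length === 0 || last.length === 0) return false;
  if (text === `${first} ${last}`) return true; // exact full-name match
  // Fallback for nicknames or middle names inserted between first and last name.
  return text.startsWith(first) && text.includes(last);
}
```

With this check, "Zsófia Réka (Sophie) Tóth" is recognised as the name line and skipped, so the next line can be read as the headline.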
Notes
Proxy precedence: a leased account's bound proxy wins (keeps the cookie ↔ IP pairing); the file/input proxy is the fallback for search and accounts without a bound proxy.
Known gaps: skills, certifications, honors, projects, organizations, languages now work via the section pager (Lot D). Remaining: courses, publications (recipe ready in ENTITY_PAGERS, parser to add) and volunteer/patents (pagerId still to capture). (The ref component and the detail-page GET are dead ends — 500 / "Something went wrong".) industryName is absent from the RSC (likely voyager/api JSON only); followersCount/connectionsCount are now read from the topCard text. languages name/proficiency pairing is fragile when a language has no proficiency level. See docs/DATA-SOURCING.md §6 and docs/DECISIONS.md D8.
bird-proxies.txt is committed with credentials in cleartext; it should be moved to a secret / KV store.
[10.0.0] — 2026-06-02
Added
v2 scraping engine (src-v2/) talking directly to LinkedIn's SDUI / RSC API — no dependency on @bebity/linkedin-scraper.
Milkbox-based account/cookie/proxy leasing behind an AccountProvider abstraction, with batched usage telemetry and auto-ban (LOGIN_REQUIRED) on auth errors.
Documentation under docs/ (ARCHITECTURE.md, DATA-SOURCING.md, RE/LINKEDIN_API_REVERSE_ENGINEERING.md).
Changed
Version set to 10.0.0 to mark the v2 rewrite (previously tracked the @bebity/linkedin-scraper package version, 7.x).
Notes
Input/output and published actor identity stay compatible with v1.
v2 currently supports profiles only (get-profiles); get-companies still throws.
enrichWithCompany / enrichWithContact are accepted in the input schema (v1 compat) but not yet implemented in v2.
v1 engine (src/, wrapping @bebity/linkedin-scraper) remains the deployed/shipping path.