H1B Visa Database Scraper | DOL Disclosure Salaries
Pricing
from $3.50 / 1,000 results
Scrape the US H1B Visa database (h1bdata.info) — public Department of Labor disclosure data. Per-employer, per-job, per-city, per-year salary records with submit/start dates. 8M+ approved cases since 2014. Critical for immigration attorneys, job seekers, recruiter intel.
Developer
Haketa
Maintained by Community
H1B Visa Database Scraper — DOL Disclosure Salaries, Sponsor History & Prevailing Wage Lookup
The fastest way to query the entire US H1B Visa public disclosure database. This Apify Actor scrapes h1bdata.info — a long-running independent aggregator that re-publishes the US Department of Labor's mandatory H1B / H1B1 / E3 disclosure files (Title 20 CFR §655.760). Every CERTIFIED petition since fiscal year 2014 is in here: employer, job title, base wage, work location, submit date, intended start date. 8M+ approved cases across thousands of sponsors and job titles, no auth, no captcha, no anti-bot, no proxy required.
What This Actor Does
The H1B Visa Database Scraper is a production-grade Apify Actor that turns the entire public US H1B disclosure dataset into structured, filterable JSON. It queries h1bdata.info — a community-maintained mirror of the US Department of Labor's (DOL) Office of Foreign Labor Certification public disclosure feed — and returns one row per CERTIFIED Labor Condition Application (LCA) / H1B petition.
Under federal law (Title 20 CFR §655.760), the DOL must publish every approved foreign-worker petition with the petitioning employer, the job title, the base wage offered, the city/state of work, the application submit date, and the intended start date. h1bdata.info ingests the DOL's quarterly disclosure files and exposes them via a fast, paginated, ungated search UI. This actor turns that UI into an API.
In a single search (e.g. employer=google, job=software engineer, year=2024) the actor will return 30,000+ rows for popular queries. Across the full back catalog (FY 2014 → present) the underlying dataset contains 8M+ certified petitions.
Each record returned includes:
- Employer: petitioning company exactly as filed with the DOL (e.g. `GOOGLE LLC`, `META PLATFORMS INC`, `JPMORGAN CHASE & CO`)
- Job title: the title on the LCA (e.g. `SOFTWARE ENGINEER`, `DATA SCIENTIST`, `INVESTMENT BANKING ANALYST`)
- Base salary: the annual wage offered to the foreign worker (USD, the wage the DOL approved)
- Work location: city + 2-letter state of the intended job site
- Submit date: when the LCA was filed with the DOL
- Start date: the proposed employment start date
- Year: DOL fiscal year of the petition
- Case status: always `CERTIFIED` (the DOL only publishes approved petitions; denied/withdrawn cases are not disclosed)
- Provenance: exact search URL the row came from + ISO scrape timestamp
The dataset powers immigration-attorney case research, visa-dependent job-seeker discovery, recruiter intelligence, comp benchmarking, investigative journalism, and labor-economics research — all from a source no closed API can match for breadth or cost.
Why scrape h1bdata.info yourself when this exists?
The H1B disclosure feed is public, but turning it into a usable dataset is non-trivial. Teams that try the DIY route run into these obstacles fast:
- The raw DOL files are quarterly Excel/CSV dumps with 100+ columns, inconsistent headers per quarter, and a 12-week publication lag
- h1bdata.info pages return up to ~18 MB of HTML per query (38k+ rows in a single response); naive `requests.get()` calls with a timeout shorter than about 60 seconds can abort before the full payload arrives
- The site's HTML table structure is unlabeled, so you have to parse column order positionally, not by header
- Salary strings arrive as `$112,000`-style text and must be normalized to numeric USD
- Dates arrive as `MM/DD/YYYY` (US format) and need ISO conversion for SQL/BI tools
- Location is a combined `CITY, ST` string that needs splitting for analytics
- Empty queries are rejected silently: at least one of employer/job/city is required
- Different searches return wildly different row counts (10 rows for a niche role, 30k+ for `Google + Software Engineer`), so your scraper must tolerate both extremes
- The DOL's own disclosure page (dol.gov/agencies/eta/foreign-labor/performance) cannot be queried; it only offers full-quarter downloads
- Building a per-employer search loop (1,000 employers × 5 job titles × 10 years = 50,000 queries) requires retry/backoff/dedup logic that's tedious to maintain
This actor solves every one of those: it generates the cross-product of your filters as separate tasks, retries each request up to 3 times with an increasing backoff and a 60-second timeout, normalizes salary/date/location fields, deduplicates rows, and emits clean JSON ready for SQL, Pandas, Sheets, or your BI tool.
Quick Start
One-Click Run
- Click "Try for free" on the Apify Store page
- Enter at least one employer (e.g. `google`), one job title (e.g. `software engineer`), and a year (e.g. `2024`)
- Hit Start; petitions stream into the dataset in seconds
- Download as JSON, CSV, Excel, JSONL, HTML, XML, or RSS directly from the Apify dataset view
API Run (Python)

    from apify_client import ApifyClient

    client = ApifyClient("YOUR_APIFY_TOKEN")

    run = client.actor("haketa/h1b-visa-database-scraper").call(run_input={
        "employers": ["google", "meta", "amazon", "microsoft", "apple"],
        "jobTitles": ["software engineer", "machine learning engineer", "data scientist"],
        "cities": [],           # nationwide
        "year": "2024",
        "minSalary": 150000,    # only show $150k+ offers
        "maxRecords": 5000,
    })

    for row in client.dataset(run["defaultDatasetId"]).iterate_items():
        print(
            f"{row['employer']:30s} | {row['jobTitle']:35s} | "
            f"${row['baseSalary']:>8,} | {row['city']}, {row['state']} | "
            f"{row['submitDate']}"
        )
API Run (Node.js / TypeScript)

    import { ApifyClient } from 'apify-client';

    const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

    const run = await client.actor('haketa/h1b-visa-database-scraper').call({
        employers: ['goldman sachs', 'morgan stanley', 'jpmorgan chase'],
        jobTitles: ['quantitative analyst', 'investment banking analyst'],
        cities: ['new york'],
        year: '2024',
        maxRecords: 1000,
    });

    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    console.log(`Pulled ${items.length} certified H1B petitions on Wall Street`);

    const avgWage =
        items.reduce((s, r) => s + (r.baseSalary || 0), 0) / items.length;
    console.log(`Average base wage: $${Math.round(avgWage).toLocaleString()}`);
API Run (cURL)

    curl -X POST "https://api.apify.com/v2/acts/haketa~h1b-visa-database-scraper/runs?token=YOUR_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "employers": ["openai"],
        "jobTitles": ["research engineer", "software engineer"],
        "year": "2024",
        "maxRecords": 500
      }'
API Run (paste a raw search URL)

    run = client.actor("haketa/h1b-visa-database-scraper").call(run_input={
        "startUrls": ["https://h1bdata.info/index.php?em=nvidia&job=ai+engineer&city=&year=2024"],
        "maxRecords": 1000,
    })
When startUrls is provided it overrides the structured filters — useful when you have a search you've already built interactively on h1bdata.info and want to replay it programmatically.
How It Works
The actor takes a Cartesian product of employers × jobTitles × cities (plus the chosen year) and turns each combination into a single GET request against h1bdata.info/index.php. Each response is parsed with cheerio, normalized, salary-band-filtered, deduplicated, and pushed into the Apify Dataset.
Source endpoint
| Endpoint | Method | Pagination | Notes |
|---|---|---|---|
| `https://h1bdata.info/index.php?em={emp}&job={job}&city={city}&year={year}` | GET | None: one request returns every match | Server may return 18MB+ HTML for popular queries |
The query string accepts any combination of the four parameters; at least one of em, job, or city is required (empty queries are rejected). year accepts a 4-digit DOL fiscal year (2014 → present) or All Years.
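The task generation described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the actor's own code (the actor itself runs on Node.js): `build_search_urls` is a hypothetical helper, and treating an empty list as "no filter" on that dimension is an assumption drawn from the parameter descriptions.

```python
from itertools import product
from urllib.parse import urlencode

BASE_URL = "https://h1bdata.info/index.php"


def build_search_urls(employers, job_titles, cities, year="All Years"):
    """Return one search URL per employer x job title x city combination."""
    # Assumption: an empty list means "no filter", so it contributes one blank value.
    dims = [values or [""] for values in (employers, job_titles, cities)]
    urls = []
    for em, job, city in product(*dims):
        if not (em or job or city):
            continue  # the site rejects queries with no employer, job, or city
        query = urlencode({"em": em, "job": job, "city": city, "year": year})
        urls.append(f"{BASE_URL}?{query}")
    return urls


urls = build_search_urls(["google", "meta"], ["software engineer"], [], "2024")
print(len(urls), urls[0])
```

Two employers, one job title and no city filter yield two requests; adding a third dimension multiplies the count, which is why large filter lists translate into long runs.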
Engine
- HTTP-only via `got-scraping`: no Playwright, no Puppeteer, no headless Chromium overhead
- Realistic browser headers auto-generated by `got-scraping`'s `headerGeneratorOptions` (Chrome 120+, desktop, US locale, Windows/macOS)
- 60-second per-request timeout, needed because popular queries return 18MB+ payloads
- 3-attempt retry with a backoff that grows with each attempt (`2s × attempt` + jitter), which survives intermittent network blips
- Response sanity check: rejects empty bodies and non-200 statuses
- Polite delay between requests (`requestDelay`, default 1000ms + 0-500ms jitter)
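The retry and sanity-check behavior listed above could look roughly like the following Python sketch. It is illustrative only: `fetch_with_retry` and the injected `fetch` callable are hypothetical names, and the 0 to 0.5 s jitter range is an assumption (the list above gives that range only for `requestDelay`). Note that `2s × attempt` grows linearly with the attempt number.

```python
import random
import time


def fetch_with_retry(fetch, url, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) up to `attempts` times, waiting base_delay x attempt + jitter.

    `fetch` is any callable returning (status_code, body). It stands in for a
    real HTTP client and is a placeholder, not part of the actor's API.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            status, body = fetch(url)
            # Sanity check: only a 200 response with a non-empty body succeeds.
            if status == 200 and body:
                return body
            last_error = RuntimeError(f"bad response: status={status}")
        except OSError as exc:  # network-level failure
            last_error = exc
        if attempt < attempts:
            # Assumed jitter of 0 to 0.5 seconds on top of 2s x attempt.
            sleep(base_delay * attempt + random.uniform(0, 0.5))
    raise RuntimeError(f"gave up on {url} after {attempts} attempts") from last_error
```

Passing `sleep` as a parameter keeps the function testable without real waiting; a production caller would leave the default.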
Parsing
- cheerio loads the HTML and walks `table#myTable tbody tr` (with a positional fallback for any other `table tbody tr` shape)
- Columns are parsed by position: `EMPLOYER | JOB TITLE | BASE SALARY | LOCATION | SUBMIT DATE | START DATE`
- Salary normalization: `$112,000` → `112000` (integer); the original display string is preserved as `baseSalaryDisplay`
- Date normalization: `MM/DD/YYYY` → `YYYY-MM-DD` (ISO-8601, sort-safe)
- Location split: `"MOUNTAIN VIEW, CA"` → `{city: "MOUNTAIN VIEW", state: "CA"}`; the raw string is preserved as `location`
- Year inference: derived from `submitDate` when present
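The normalization steps above can be illustrated with a short Python sketch. The helper names (`parse_salary`, `parse_us_date`, `split_location`, `normalize_row`) are hypothetical, and taking the calendar year of `submitDate` as `year` is an assumption about how the inference works.

```python
import re
from datetime import datetime

# Column order as listed above; cells are matched by position, not by header.
COLUMNS = ("employer", "jobTitle", "baseSalaryDisplay",
           "location", "submitDate", "startDate")


def parse_salary(text):
    """'$112,000' -> 112000; None when the cell holds no number."""
    cleaned = re.sub(r"[^\d.]", "", text or "")
    try:
        return int(float(cleaned))
    except ValueError:
        return None


def parse_us_date(text):
    """'03/04/2024' -> '2024-03-04'; None when the cell is not a US date."""
    try:
        return datetime.strptime((text or "").strip(), "%m/%d/%Y").date().isoformat()
    except ValueError:
        return None


def split_location(text):
    """'MOUNTAIN VIEW, CA' -> city, state, and the raw string."""
    raw = (text or "").strip()
    if not raw:
        return {"city": None, "state": None, "location": None}
    city, sep, state = raw.rpartition(",")
    if not sep:  # no comma: keep everything as the city
        return {"city": raw, "state": None, "location": raw}
    return {"city": city.strip(), "state": state.strip() or None, "location": raw}


def normalize_row(cells):
    """Turn one positional table row into a flat record."""
    raw = dict(zip(COLUMNS, (c.strip() for c in cells)))
    submit = parse_us_date(raw["submitDate"])
    return {
        "employer": raw["employer"],
        "jobTitle": raw["jobTitle"],
        "baseSalary": parse_salary(raw["baseSalaryDisplay"]),
        "baseSalaryDisplay": raw["baseSalaryDisplay"],
        **split_location(raw["location"]),
        "submitDate": submit,
        "startDate": parse_us_date(raw["startDate"]),
        # Assumption: year is the calendar year of the submit date.
        "year": int(submit[:4]) if submit else None,
    }


row = normalize_row(["GOOGLE LLC", "SOFTWARE ENGINEER", "$112,000",
                     "DURHAM, NC", "03/04/2024", "08/15/2024"])
print(row)
```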
Output pipeline
- Salary-band post-filter: `minSalary` / `maxSalary` applied before push
- Deduplication: `(employer | jobTitle | baseSalary | submitDate | city)` is hashed; duplicates within a run are dropped
- Hard cap: `maxRecords` stops the loop early (set `0` for unlimited)
- Empty-result safety net: if zero records are scraped after all tasks complete, the run is marked `FAILED` via `Actor.fail()` so monitoring/scheduling integrations can alert
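As a rough illustration of the first three pipeline stages, here is a Python sketch. `dedup_key` and `filter_rows` are hypothetical names, SHA-1 is an arbitrary choice of hash, and dropping rows with a missing salary whenever a salary bound is active is an assumption rather than documented behavior.

```python
import hashlib


def dedup_key(row):
    """Hash the (employer | jobTitle | baseSalary | submitDate | city) tuple."""
    parts = (row.get("employer"), row.get("jobTitle"), row.get("baseSalary"),
             row.get("submitDate"), row.get("city"))
    joined = "|".join("" if p is None else str(p) for p in parts)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def filter_rows(rows, min_salary=0, max_salary=0, max_records=0):
    """Apply the salary band, drop duplicates, and stop at max_records (0 = off)."""
    seen = set()
    kept = []
    bounded = bool(min_salary or max_salary)
    for row in rows:
        salary = row.get("baseSalary")
        if salary is None and bounded:
            continue  # assumption: unknown salaries cannot satisfy a band
        if salary is not None:
            if min_salary and salary < min_salary:
                continue
            if max_salary and salary > max_salary:
                continue
        key = dedup_key(row)
        if key in seen:
            continue
        seen.add(key)
        kept.append(row)
        if max_records and len(kept) >= max_records:
            break
    return kept
```

Because the key omits the start date, two petitions that differ only in start date collapse into one row under this scheme.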
Proxy
Proxy is disabled by default. h1bdata.info has no rate-limit, no CAPTCHA, and no IP-based throttle. You only need to enable Apify Proxy if you're running massive parallel jobs (e.g. 10,000+ employer queries in one run) and want to be courteous about IP diversity.
Input Parameters

    {
      "employers": ["google", "meta", "amazon"],
      "jobTitles": ["software engineer", "machine learning engineer"],
      "cities": ["mountain view", "menlo park", "seattle"],
      "year": "2024",
      "minSalary": 150000,
      "maxSalary": 0,
      "maxRecords": 5000,
      "requestDelay": 1000,
      "startUrls": [],
      "proxyConfiguration": { "useApifyProxy": false }
    }
Parameter reference
| Parameter | Type | Default | Description |
|---|---|---|---|
| `startUrls` | `array<string\|object>` | `[]` | Paste any h1bdata.info search URL directly (e.g. from your browser). When non-empty, overrides all other filter fields. |
| `employers` | `array<string>` | `["google"]` | Free-text employer names, partial match, case-insensitive. Examples: `"google"`, `"goldman sachs"`, `"infosys"`. Each entry runs as its own task. |
| `jobTitles` | `array<string>` | `["software engineer"]` | Free-text job-title queries, partial match. Each combines with each employer × city as a separate task. |
| `cities` | `array<string>` | `[]` | US-city filter, case-insensitive. Empty array = nationwide. Cross-products with employers and job titles. |
| `year` | `enum<string>` | `"All Years"` | DOL fiscal year. `"All Years"` for full 2014-present history or `"2026"` → `"2014"` for a single year. |
| `minSalary` | `integer` | `0` | Post-filter: drop rows where `baseSalary < minSalary`. `0` = no lower bound. |
| `maxSalary` | `integer` | `0` | Post-filter: drop rows where `baseSalary > maxSalary`. `0` = no upper bound. |
| `maxRecords` | `integer` | `500` | Hard cap across all tasks. `0` = unlimited. Popular queries can yield 30,000+ rows, so set generously. |
| `requestDelay` | `integer` | `1000` | Milliseconds between requests. h1bdata.info has no rate-limit but 500–2000 ms is polite. |
| `proxyConfiguration` | `object` | `{ useApifyProxy: false }` | Optional. h1bdata.info has no anti-bot, so proxy is not required. Only enable for very large multi-thousand-task runs. |
Tip: Provide an empty `employers` array + a non-empty `jobTitles` or `cities` to search across all employers for that role/city. Just remember the URL needs at least one of `em` / `job` / `city` to return results.
Output Schema
Every row in the dataset uses the same flat shape — easy to flatten into a relational table, Google Sheet, or DataFrame.
Core petition fields
| Field | Type | Description |
|---|---|---|
| `employer` | `string` | Petitioning employer as filed with DOL, exactly as published (often ALL CAPS, includes legal suffix, e.g. `GOOGLE LLC`, `META PLATFORMS INC`) |
| `jobTitle` | `string` | Job title on the LCA (e.g. `SOFTWARE ENGINEER`, `DATA SCIENTIST III`, `INVESTMENT BANKING ANALYST`) |
| `baseSalary` | `number\|null` | Annual base wage offered, in USD, normalized to an integer |
| `baseSalaryDisplay` | `string\|null` | Original salary string as shown on h1bdata.info (e.g. `$112,000`) |
| `city` | `string\|null` | City of the work location |
| `state` | `string\|null` | Two-letter US state code |
| `location` | `string\|null` | Raw combined location string (e.g. `MOUNTAIN VIEW, CA`) |
| `submitDate` | `string\|null` | LCA submit date in `YYYY-MM-DD` |
| `startDate` | `string\|null` | Intended employment start date in `YYYY-MM-DD` |
| `year` | `integer\|null` | DOL fiscal year inferred from `submitDate` |
| `caseStatus` | `string` | Always `"CERTIFIED"`: the DOL only publishes approved petitions |
Provenance / search-echo fields
| Field | Type | Description |
|---|---|---|
| `searchEmployer` | `string\|null` | Employer query that produced this row (or `null` if `startUrls` was used) |
| `searchJobTitle` | `string\|null` | Job-title query that produced this row |
| `searchCity` | `string\|null` | City query that produced this row |
| `searchYear` | `string` | Year query that produced this row (e.g. `"2024"`, `"All Years"`) |
| `sourceUrl` | `string` | Exact h1bdata.info URL the row was scraped from, fully reproducible |
| `scrapedAt` | `string` | ISO-8601 timestamp of extraction (UTC) |
Example: Google software engineer petition

    {
      "employer": "GOOGLE LLC",
      "jobTitle": "SOFTWARE ENGINEER",
      "baseSalary": 112000,
      "baseSalaryDisplay": "$112,000",
      "city": "DURHAM",
      "state": "NC",
      "location": "DURHAM, NC",
      "submitDate": "2024-03-04",
      "startDate": "2024-08-15",
      "year": 2024,
      "caseStatus": "CERTIFIED",
      "searchEmployer": "google",
      "searchJobTitle": "software engineer",
      "searchCity": null,
      "searchYear": "2024",
      "sourceUrl": "https://h1bdata.info/index.php?em=google&job=software+engineer&city=&year=2024",
      "scrapedAt": "2026-05-18T09:14:22.318Z"
    }
Example: Goldman Sachs investment-banking analyst petition

    {
      "employer": "GOLDMAN SACHS & CO. LLC",
      "jobTitle": "INVESTMENT BANKING ANALYST",
      "baseSalary": 110000,
      "baseSalaryDisplay": "$110,000",
      "city": "NEW YORK",
      "state": "NY",
      "location": "NEW YORK, NY",
      "submitDate": "2024-01-22",
      "startDate": "2024-07-08",
      "year": 2024,
      "caseStatus": "CERTIFIED",
      "searchEmployer": "goldman sachs",
      "searchJobTitle": "investment banking analyst",
      "searchCity": "new york",
      "searchYear": "2024",
      "sourceUrl": "https://h1bdata.info/index.php?em=goldman+sachs&job=investment+banking+analyst&city=new+york&year=2024",
      "scrapedAt": "2026-05-18T09:14:24.402Z"
    }
Case Status & Visa Type Reference
The DOL only publishes approved petitions in the public disclosure file, so every row this actor returns has caseStatus = "CERTIFIED". Denied, withdrawn, returned, or under-review cases are not disclosed and therefore cannot appear in this dataset.
Visa categories covered
| Visa | Description | In dataset? |
|---|---|---|
| H1B | Specialty occupation worker (Bachelor's+ degree role) | Yes — majority |
| H1B1 | Singapore / Chile free-trade-agreement specialty worker | Yes |
| E3 | Australian specialty occupation worker | Yes |
| H2A / H2B | Seasonal agricultural / non-agricultural workers | No (separate DOL feed) |
| L1A / L1B | Intra-company transferee | No (USCIS-only, not DOL) |
| O1 / O3 | Extraordinary ability | No (USCIS-only) |
| Green Card (PERM) | Permanent labor cert | No (separate DOL feed — see roadmap below) |
DOL fiscal-year coverage
| Year | Status |
|---|---|
| 2014 | Earliest year in h1bdata.info index |
| 2015 → 2023 | Full coverage |
| 2024 | Full coverage |
| 2025 | Rolling (DOL publishes quarterly) |
| 2026 | Partial — Q1 / Q2 typically available by mid-year |
Use Cases
Immigration Law & LCA Case Research
Immigration attorneys, paralegals, and corporate immigration teams use this dataset to:
- Pull employer petition history for prevailing-wage attack arguments and RFE responses
- Benchmark Level 1 vs Level 4 wage offers for a given SOC code and metro to support LCA filings
- Document an employer's H1B sponsorship pattern for I-140 / I-485 case files
- Track concurrent / amended petitions by tracing repeated submit dates for one employer + job
- Build evidence packets for Department of Labor audits and Wage & Hour investigations
- Cross-reference an employer's claimed wage vs. what they've previously filed with DOL
Visa-Dependent Job Seekers (Students, F1, OPT, H1B Holders)
International students on F1/OPT and current H1B holders use the dataset to:
- Identify visa-friendly employers by ranking who actually sponsors in their target city + role
- Set realistic salary expectations by looking up the median LCA wage for their job title at the company they're interviewing with
- Discover smaller sponsors beyond the headline FAANG names — most petitions are filed by mid-cap firms
- Time job changes around H1B transfer windows using the start-date column
- Avoid serial wage-suppressor employers by flagging companies whose LCA wages sit consistently below the BLS Level 1
- Negotiate offers with real-world bid data from the same employer in the same role and metro
Recruiter & Talent-Intel Teams
Internal recruiters, RPO firms, and exec-search teams use the dataset for:
- Competitor sponsorship analysis — who in your industry is bringing in foreign talent, at what scale, in which functions?
- Hot-title detection: track quarter-over-quarter growth in titles like `AI Engineer`, `ML Researcher`, `Prompt Engineer` to spot category shifts
- Sponsor-friendliness scorecards for candidate-facing materials
- Sourcing pools — every employer in the dataset has, by definition, hired internationally before and may have a current opening profile
- Pay-band benchmarking against direct competitors using actual DOL filings (not survey medians)
HR Total-Rewards & Compensation Benchmarking
Comp & ben teams blend H1B disclosure data into compensation studies because the LCA base wage is the offered base wage — not a self-reported survey response:
- Calibrate base salary structures against peer companies in your metro
- Build a free, real-world alternative to Radford, Mercer, and WTW surveys for tech / finance / pharma roles
- Quantify metro premiums — same role in Bay Area vs Austin vs Atlanta, sourced from the same employer
- Validate offer competitiveness in retention reviews
- Detect inadvertent wage compression between long-tenured employees and incoming H1B hires
- Inform pay-equity audits by comparing internal salaries to external LCA-filed wages for matching titles
Investigative Journalism & Wage-Suppression Reporting
Reporters and data desks use H1B disclosure as a primary source for:
- Wage-suppression investigations — flag employers whose LCA wages cluster at DOL Level 1 even for senior titles
- Body-shop / consultancy exposés — identify outsourcing firms filing thousands of low-wage petitions per year
- Geographic-arbitrage reporting — companies headquartering filings in low-prevailing-wage metros while work is performed elsewhere
- Tech-layoff coverage — track post-layoff sponsorship pivots
- Policy-impact stories — quantify the real on-the-ground effect of every USCIS rule change
Government, Think Tank & Labor-Economics Research
Academic economists, policy shops, and public-sector analysts use this data to:
- Estimate H1B labor-market effects at MSA granularity
- Inform STEM-workforce policy with empirical wage and headcount data
- Track industry shifts — banking → big tech → AI → fintech sponsor mix evolution
- Model wage-elasticity of H1B supply by SOC code
- Support immigration-policy testimony with real disclosure data
- Build replication datasets for peer-reviewed labor-economics papers
Tech-Employer Ranking & Industry-Trend Newsletters
Trade publications and data-newsletter operators use the dataset to:
- Publish "Top H1B Sponsors of {YEAR}" rankings by total petitions and median wage
- Run year-over-year comparisons for FAANG, MAANG, AI labs, fintech, big pharma, and consulting
- Build interactive dashboards for paid subscribers (e.g. company search, role search, metro search)
- Detect emerging hiring centers — petitions in Austin, Miami, Raleigh, Bellevue growing faster than NYC/SF
- Publish quarterly "AI Engineer wage tracker" style data drops
University Career Services & International Student Advising
College career centers and graduate-school career offices use the dataset to:
- Show students which employers in their field have historically sponsored international hires
- Benchmark offered salaries for graduating MS/PhD students by program and metro
- Build alumni-employer connections by surfacing alumni-heavy sponsors
- Justify program ROI with concrete post-graduation sponsorship outcomes
- Coach students on which employers are realistic sponsorship targets vs. long shots
Compliance, Audit & M&A Due Diligence
Corporate-development teams and external auditors use H1B disclosure as a verifiable third-party data point:
- Verify an acquisition target's sponsorship history during M&A due diligence — undisclosed H1B obligations are post-close liabilities
- Detect undisclosed offshore-staffing arrangements by cross-checking petition volume vs. headcount disclosures
- Validate prevailing-wage compliance when the target is a federal contractor
- Audit subcontractor labor practices in supply-chain due diligence
- Support post-acquisition integration planning by mapping H1B-dependent talent that needs visa transfers
Real Estate & Location Intelligence
Site-selection analysts and CRE teams treat H1B petition density as a leading indicator of high-income knowledge-worker housing demand:
- Forecast luxury-rental demand in metros with rising H1B petition counts
- Validate corporate-relocation rumors before HQ announcements (filings shift months ahead of press releases)
- Build neighborhood comps that account for international-hire population growth
- Inform retail / hospitality investment in emerging tech-talent corridors
Sample Queries & Recipes
Recipe 1: All Google software-engineer petitions for FY 2024 (the verified smoke test)

    {
      "employers": ["google"],
      "jobTitles": ["software engineer"],
      "year": "2024",
      "maxRecords": 1000
    }
This is the live smoke-test query — 30 records returned with 100% field coverage on the verification run, including rows like GOOGLE LLC | SOFTWARE ENGINEER | $112,000 | DURHAM, NC | 2024-03-04 | CERTIFIED.
Recipe 2: Top FAANG ML / AI hiring across 2024–2025 (`year` takes a single year or "All Years", so pull all years and filter on `year` downstream)

    {
      "employers": ["google", "meta", "amazon", "apple", "microsoft", "nvidia", "openai", "anthropic"],
      "jobTitles": ["machine learning engineer", "research engineer", "applied scientist", "ai engineer"],
      "year": "All Years",
      "minSalary": 200000,
      "maxRecords": 10000
    }
Recipe 3: Wall Street quant & banking analyst comp benchmark, NYC only

    {
      "employers": ["goldman sachs", "morgan stanley", "jpmorgan chase", "citi", "bank of america", "jane street", "citadel", "two sigma"],
      "jobTitles": ["quantitative analyst", "investment banking analyst", "software engineer"],
      "cities": ["new york"],
      "year": "2024"
    }
Recipe 4: Indian-IT body-shop volume tracker

    {
      "employers": ["infosys", "tata consultancy services", "wipro", "cognizant", "hcl", "tech mahindra", "capgemini"],
      "jobTitles": ["programmer analyst", "systems analyst", "consultant"],
      "year": "All Years",
      "maxRecords": 50000
    }
Recipe 5: Sponsor-friendliness scan for a specific role across every US metro

    {
      "employers": [],
      "jobTitles": ["data scientist"],
      "year": "2024",
      "minSalary": 120000,
      "maxRecords": 20000
    }
Recipe 6: Pharma & biotech R&D petitions

    {
      "employers": ["pfizer", "moderna", "merck", "genentech", "regeneron", "vertex", "eli lilly"],
      "jobTitles": ["scientist", "research associate", "bioinformatics scientist"],
      "year": "All Years"
    }
Recipe 7: Tiny test run of 10 rows to validate your pipeline before a big scrape

    {
      "employers": ["amazon"],
      "jobTitles": ["software development engineer"],
      "year": "2024",
      "maxRecords": 10
    }
Recipe 8: Direct URL replay (paste a search you built in your browser)

    {
      "startUrls": ["https://h1bdata.info/index.php?em=stripe&job=&city=san+francisco&year=2024"]
    }
Integration Examples
Google Sheets
Schedule the actor daily, attach Apify's "Save to Google Sheets" integration, and your team has a living view of (for example) every petition your competitors filed last quarter — refreshed without anyone touching a spreadsheet.
Make.com / Zapier / n8n
Trigger downstream workflows on each new run:
- New rows where `baseSalary > $250,000` → send to Slack `#comp-intel`
- New petitions from any competitor in your tracked list → create a HubSpot deal task
- New employer first-time-sponsor detected → send to your sales / recruiter pipeline
Power BI / Tableau / Looker / Mode
Pull Apify's run results into your BI tool of choice via the Apify REST API and build:
- Top-100 H1B sponsors by year league tables
- Median LCA wage by SOC + metro heat maps
- Year-over-year petition growth for any company
- Wage-band distribution by employer + role
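Before wiring up a BI tool, an aggregate like the median-wage view can be prototyped directly on the dataset rows. The sketch below is only illustrative: `median_wage_by_title_and_metro` is a hypothetical helper, and it groups by `jobTitle` rather than SOC code because the actor's output rows carry no SOC field.

```python
from collections import defaultdict
from statistics import median


def median_wage_by_title_and_metro(rows):
    """Median baseSalary per (jobTitle, city, state), skipping incomplete rows."""
    buckets = defaultdict(list)
    for row in rows:
        salary = row.get("baseSalary")
        if salary is None or not row.get("city") or not row.get("state"):
            continue
        buckets[(row.get("jobTitle"), row["city"], row["state"])].append(salary)
    return {key: median(wages) for key, wages in buckets.items()}


sample = [
    {"jobTitle": "DATA SCIENTIST", "city": "SEATTLE", "state": "WA", "baseSalary": 150000},
    {"jobTitle": "DATA SCIENTIST", "city": "SEATTLE", "state": "WA", "baseSalary": 170000},
    {"jobTitle": "DATA SCIENTIST", "city": "AUSTIN", "state": "TX", "baseSalary": 140000},
    {"jobTitle": "DATA SCIENTIST", "city": "AUSTIN", "state": "TX", "baseSalary": None},
]
medians = median_wage_by_title_and_metro(sample)
print(medians)
```

The sample rows are made up for the example; real input would come from the dataset items returned by a run.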
Postgres / Snowflake / BigQuery / Databricks
POST run results to your warehouse via Apify's webhook integration. Suggested schema:

    CREATE TABLE h1b_petitions (
        id           BIGSERIAL PRIMARY KEY,
        employer     TEXT NOT NULL,
        job_title    TEXT,
        base_salary  INTEGER,
        city         TEXT,
        state        CHAR(2),
        submit_date  DATE,
        start_date   DATE,
        fiscal_year  SMALLINT,
        case_status  TEXT DEFAULT 'CERTIFIED',
        source_url   TEXT,
        scraped_at   TIMESTAMPTZ,
        UNIQUE (employer, job_title, base_salary, submit_date, city)
    );

    CREATE INDEX idx_h1b_employer   ON h1b_petitions (employer);
    CREATE INDEX idx_h1b_city_state ON h1b_petitions (city, state);
    CREATE INDEX idx_h1b_year_title ON h1b_petitions (fiscal_year, job_title);
Salesforce / HubSpot CRM Enrichment
For staffing, recruiting, and corporate immigration firms: nightly-run the actor against your tracked-employer list, then upsert against Account records — H1B_Petitions_Last_12mo__c becomes a high-signal lead-scoring field.
Webhook → Slack / Discord / Email
Trigger a Make/Zapier webhook on Apify's ACTOR.RUN.SUCCEEDED event, parse the dataset, and post highlights:
"Stripe filed 47 new H1B petitions this quarter — 31 are SF, 12 are NYC, median base $182k. Top role: Software Engineer."
Major US Metros for H1B Activity
| Metro | State | Why it matters for H1B data |
|---|---|---|
| San Francisco / Bay Area | CA | Highest median LCA wage in the country; FAANG, AI labs, fintech |
| New York / NYC | NY | Wall Street, consulting (McKinsey/BCG/Bain), big-law support roles |
| Seattle / Bellevue | WA | Amazon, Microsoft: both among the largest H1B sponsors by volume |
| Austin | TX | Fastest-growing tech metro; Apple, Oracle, Tesla expansions |
| Boston / Cambridge | MA | Pharma + biotech (Moderna, Vertex), Big Tech Cambridge campuses |
| Chicago | IL | Trading firms (Citadel, Jump, IMC, DRW), consulting back offices |
| Atlanta | GA | Coca-Cola, Delta, Truist, fintech (NCR, Equifax) |
| Dallas / Plano | TX | JPMorgan, AT&T, Toyota, healthcare IT |
| Houston | TX | Energy majors (ExxonMobil, Chevron, Shell), healthcare |
| Washington DC / NoVa | DC / VA | Federal contractors, AWS GovCloud HQ, defense primes |
| Raleigh-Durham | NC | RTP corridor — IBM, Cisco, Apple expansion |
| Phoenix / Tempe | AZ | TSMC fab, semiconductor expansion, financial services |
| Miami | FL | Crypto, hedge funds relocating from NY/SF |
| Mountain View / Sunnyvale / Menlo Park | CA | Google, Meta, LinkedIn HQ campuses |
| San Jose / Santa Clara | CA | NVIDIA, Cisco, Adobe, Intel |
Cost & Performance
| Metric | Value |
|---|---|
| Engine | HTTP-only (got-scraping + cheerio) — no browser |
| Runtime, single small query (10 rows) | 2 – 5 seconds |
| Runtime, single popular query (30,000 rows / 18MB) | 10 – 30 seconds |
| Runtime, 50 employer × 5 job × 1 year cross-product | 1 – 5 minutes (with default 1s polite delay) |
| Cost per typical run | a few cents (pay-per-event) |
| Pricing model | Pay-per-event — actor start + per dataset item |
| Data freshness | Live at run time — h1bdata.info refreshes with each DOL disclosure release |
| Auth required | None |
| Proxy required | No (optional, disabled by default) |
| Concurrency | Safe to run many parallel filtered configurations |
| Memory footprint | 256 MB sufficient for most runs; 1024 MB for huge multi-thousand-task jobs |
| Retry | 3 attempts per request, backoff growing with each attempt (2s × attempt + jitter) |
| Timeout | 60-second per-request HTTP timeout |
| Failure mode | Actor.fail() if zero records scraped (alerts your monitoring) |
Compliance, Privacy & Legal Notes
- Public-record data only. Every field this actor returns is published by the US Department of Labor under the public-disclosure requirements of Title 20 CFR §655.760. The DOL publishes the data; h1bdata.info re-publishes it; this actor structures it. Nothing in the output is private, leaked, or non-public.
- No PII beyond what the DOL already published. Employer and job title are corporate identifiers, not personal. The disclosure file does not include the foreign worker's name, passport number, or contact info.
- No PHI. Pharma / biotech petitions are listed by employer + job title only; there is no patient data anywhere in the dataset.
- No SSNs, passport numbers, or visa-petition USCIS receipt numbers.
- Source attribution is preserved in every row (`sourceUrl`), which is useful for journalists and academics who need to cite primary sources.
- Respect h1bdata.info's terms of service and load profile. The default 1-second polite delay between requests exists for that reason; do not lower it unnecessarily.
- GDPR / CCPA are not implicated for the petitioning employer (corporate entity); the foreign worker is not personally identified in the public file.
- Permissible uses include: research, journalism, recruiting / sourcing intelligence, immigration-law case work, comp benchmarking, policy analysis, and competitive intelligence.
- Do not use this data for: harassment, doxxing, discriminatory employment decisions targeting visa status (which would violate 8 USC §1324b), or any deceptive marketing claim that misrepresents data freshness or origin.
Important: Disclosure data shows the wage offered on the LCA — not necessarily the wage actually paid, not signing bonuses, not RSUs, not deferred comp. Treat it as a floor / benchmark, not as a complete-compensation figure.
Frequently Asked Questions
How fresh is the data?
The actor scrapes h1bdata.info live at run time. h1bdata.info ingests new petitions every time the DOL publishes a new quarterly disclosure file (typically 8–12 weeks after the quarter closes). The DOL fiscal year runs from October through September, so FY 2024 Q4 petitions (July–September 2024) become visible around late 2024, FY 2025 Q1 petitions (October–December 2024) around March 2025, and so on. There is no faster public source for this data.
How many records exist in total?
The DOL has certified 8 million+ H1B / H1B1 / E3 petitions since fiscal year 2014. The exact number visible on any given day depends on which quarters h1bdata.info has ingested. A single popular query like Google + Software Engineer (all years) can return 30,000+ rows on its own.
Why is caseStatus always CERTIFIED?
The US Department of Labor only publishes approved petitions in the public disclosure file. Denied, withdrawn, returned-for-correction, and under-review cases are not disclosed and therefore cannot appear in this dataset.
Does this scraper require login, API key, or CAPTCHA solving?
No. h1bdata.info is fully public, has no login, no CAPTCHA, no anti-bot system. You only need an Apify account to run the actor.
Do I need to use a proxy?
No. Proxy is disabled by default. h1bdata.info has no rate-limit or IP-based throttle. The proxy option exists only for very large parallel runs where IP diversity is desirable for politeness.
Why does my run sometimes take 20–30 seconds for a single query?
Popular queries (a famous employer + common title across all years) return massive HTML payloads — sometimes 18MB+ with 30,000+ rows. The 60-second per-request timeout is calibrated for exactly this case. Smaller queries return in 2–5 seconds.
What happens if I supply an empty query?
h1bdata.info rejects empty queries (returns no results). The actor builds its task list from the cross-product of employers × jobTitles × cities and skips combinations where all three are empty. If zero tasks survive, the run fails fast with a clear message; if every task returns zero rows, Actor.fail() is called so your monitoring/scheduling integration can alert.
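The task-building rule above can be sketched in a few lines of Python. This is an illustration of the described behaviour, not the actor's actual source: the function name `build_tasks`, the task dictionary keys, and the treatment of an empty list as a single blank filter are all assumptions.

```python
from itertools import product


def build_tasks(employers, job_titles, cities):
    """Cross-product of the three filter lists, skipping all-empty combinations."""

    def normalise(values):
        # Assumption: a missing or empty list behaves like one blank filter.
        cleaned = [v.strip() for v in (values or [])]
        return cleaned or [""]

    tasks = []
    for employer, job, city in product(
        normalise(employers), normalise(job_titles), normalise(cities)
    ):
        if not (employer or job or city):
            continue  # a fully empty query returns no results, so skip it
        tasks.append({"employer": employer, "jobTitle": job, "city": city})
    return tasks


tasks = build_tasks(["google", "openai"], ["software engineer", ""], [])
if not tasks:
    # Mirrors the fail-fast behaviour described above.
    raise SystemExit("No non-empty employer/jobTitle/city combination to query")
```

Here two employers and two job titles (one blank) yield four tasks, while an input where every filter is blank yields none and would fail fast.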
Does the actor return denied / withdrawn / pending petitions?
No — see above. Only DOL-certified petitions are in the public file.
Does the dataset include the foreign worker's name?
No. The DOL public-disclosure file deliberately omits the foreign worker's personal identity. The published fields are: petitioning employer, job title, base wage offered, work location, submit date, and start date.
Does it include SOC codes, prevailing-wage level, or worksite ZIP?
Not in the h1bdata.info interface. h1bdata.info publishes the user-friendly subset of the DOL file. For the full raw file (with SOC code, prevailing-wage level, full address, agent attorney, etc.) download the quarterly DOL disclosure files directly from dol.gov/agencies/eta/foreign-labor/performance and join on employer + submit date.
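A minimal sketch of that join, using only the Python standard library. The actor-side field names (`employer`, `submitDate`) and the DOL-side column names (`EMPLOYER_NAME`, `RECEIVED_DATE`, `SOC_CODE`, `PW_WAGE_LEVEL`) are assumptions; DOL column names vary between fiscal years, so check the header of the file you download. Both dates are assumed to be ISO-formatted already, and the sample rows are made up for illustration.

```python
from collections import defaultdict


def join_key(employer, submit_date):
    # Normalise case and whitespace so "Example  Corp" and "EXAMPLE CORP" align.
    return (" ".join(employer.upper().split()), submit_date)


def enrich(actor_rows, dol_rows):
    """Attach candidate DOL rows to each actor row by (employer, submit date)."""
    index = defaultdict(list)
    for row in dol_rows:
        index[join_key(row["EMPLOYER_NAME"], row["RECEIVED_DATE"])].append(row)

    enriched = []
    for row in actor_rows:
        candidates = index.get(join_key(row["employer"], row["submitDate"]), [])
        enriched.append({**row, "dolMatches": candidates})
    return enriched


# Made-up sample rows, for illustration only.
actor_rows = [{"employer": "Example Corp", "submitDate": "2023-03-01", "baseSalary": 150000}]
dol_rows = [
    {"EMPLOYER_NAME": "EXAMPLE  CORP", "RECEIVED_DATE": "2023-03-01",
     "SOC_CODE": "15-1252", "PW_WAGE_LEVEL": "II"},
    {"EMPLOYER_NAME": "OTHER INC", "RECEIVED_DATE": "2023-03-01",
     "SOC_CODE": "13-2011", "PW_WAGE_LEVEL": "I"},
]
result = enrich(actor_rows, dol_rows)
```

Employer plus submit date is not a unique key: a large sponsor can file many petitions on the same day. The sketch therefore keeps every candidate in `dolMatches` rather than picking one; narrowing by job title or wage is a sensible next step.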
Can I get green-card / PERM disclosure data?
PERM (permanent labor certification) is a separate DOL disclosure feed, not on h1bdata.info. It is on the roadmap as a separate Apify actor — open a feature request if you need it sooner.
Can I filter to only H1B (excluding H1B1 / E3)?
Not directly — h1bdata.info does not expose visa subtype on the result row. The vast majority of petitions in the file are H1B; H1B1 (Singapore/Chile) and E3 (Australia) together are <2% of volume.
How do I get every petition from a specific employer across all years?
{ "employers": ["openai"], "jobTitles": [""], "year": "All Years", "maxRecords": 0 }
A blank job title combined with an employer will return every role that employer has ever sponsored. Set maxRecords: 0 for unlimited.
What about employer-name variations (e.g. "Google" vs "Google LLC" vs "Alphabet Inc")?
h1bdata.info does case-insensitive partial matching on the employer string. "google" will match GOOGLE LLC, GOOGLE INC., GOOGLE PAYMENT CORP, etc. For maximum recall, also try the parent name (alphabet) and any DBA / acquired-subsidiary names you know.
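To see which legal entities a set of search terms actually pulled in, it can help to count rows per exact employer string after a run. The sketch below approximates the case-insensitive partial matching locally; it is not the site's matching code, and the `employer` field name and sample rows are assumptions for illustration.

```python
from collections import Counter


def matches(search_term, employer_name):
    # Case-insensitive substring test, approximating the matching described above.
    return search_term.lower() in employer_name.lower()


def entity_counts(rows, search_terms):
    """Count rows per exact employer string that matches any search term."""
    return Counter(
        row["employer"]
        for row in rows
        if any(matches(term, row["employer"]) for term in search_terms)
    )


# Made-up sample rows, for illustration only.
rows = [
    {"employer": "GOOGLE LLC"},
    {"employer": "GOOGLE LLC"},
    {"employer": "GOOGLE INC."},
    {"employer": "GOOGLE PAYMENT CORP"},
    {"employer": "ALPHABET INC"},
    {"employer": "EXAMPLE CORP"},
]
counts = entity_counts(rows, ["google", "alphabet"])
```

Reviewing the resulting counts is a quick way to spot unrelated companies that merely share a substring with your search term before you aggregate salaries.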
Why are some salary cells null?
Very rarely a row on h1bdata.info has a missing or malformed salary cell (an edge case in older 2014–2015 data). The actor parses what it can and emits baseSalary: null rather than dropping the row, so you can decide downstream how to handle them.
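One way to handle those rows downstream is to exclude them from statistics while still reporting how many were skipped. The sketch below assumes each dataset item is a dictionary with a `baseSalary` key; the helper name `salary_summary` and the sample values are illustrative.

```python
from statistics import median


def salary_summary(rows):
    """Median base salary, ignoring rows where baseSalary is null."""
    salaries = [r["baseSalary"] for r in rows if r.get("baseSalary") is not None]
    return {
        "rows": len(rows),
        "missing": len(rows) - len(salaries),
        "median": median(salaries) if salaries else None,
    }


# Made-up sample rows, for illustration only.
rows = [
    {"baseSalary": 120000},
    {"baseSalary": None},
    {"baseSalary": 180000},
    {"baseSalary": 150000},
]
summary = salary_summary(rows)
```

Keeping the `missing` count alongside the median makes it obvious when a result rests on incomplete data.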
Pivot history — what was this actor before?
This actor was previously published as salary-com-scraper. A usefulness audit showed that the underlying Salary.com data largely duplicated freely available BLS Occupational Employment Statistics and Glassdoor content, so it was not a niche worth maintaining. The actor was pivoted to H1B disclosure data in May 2026 because the source data is unique, primary-source, federally mandated, and high-value for immigration law, recruiting, journalism, and policy research. The actor ID is unchanged; only the source target, schema, and engine differ.
Does this work on the Apify Free Plan?
Yes — full functionality on the free tier. A typical filtered run costs a few cents in compute units.
Can I schedule this to run daily / weekly / monthly?
Yes — Apify's built-in Scheduler lets you trigger this actor on any cron expression. Combine with webhook outputs for fully automated H1B-intel pipelines.
What formats can I export the data in?
JSON, JSONL (streaming), CSV, Excel (XLSX), HTML, XML, RSS — directly from the Apify dataset view, or via the Apify REST API for programmatic consumers.
Are there competing data sources?
The two main competing surfaces are MyVisaJobs and H1BGrader. Both ultimately source from the same DOL disclosure file; h1bdata.info is the longest-running, fastest-querying, and least-monetized of the three, which is why it was chosen as the scrape target.
How do I report a bug or request a feature?
Open an issue on the Apify Store page or contact the developer directly through the Apify Console profile.
Related Apify Actors by Haketa
If you're building a US labor-market, jobs, or federal-disclosure intelligence stack, these companion actors pair well with the H1B Visa Database Scraper:
- SEEK Scraper (Australia / NZ) — live job listings from APAC's largest job board
- Levels.fyi Scraper — self-reported tech compensation (base + equity + bonus) — the perfect complement to LCA wage data
- ProductHunt Launches & Makers Scraper — daily startup launches, makers, votes & reviews — VC/founder/recruiter intel
- TTB Alcohol Permittee Scraper — another federal public-disclosure dataset (Treasury / TTB) in the same legal family
- SAM.gov Federal Contractor Entity Scraper — every entity registered to do business with the US federal government
- Texas Pharmacy License Scraper — TSBP — state-licensed pharmacist / pharmacy directory
- Ohio eLicense Scraper — Ohio professional licenses
- Illinois IDFPR License Scraper — Illinois licensed professionals
- California DCA Professional License Scraper — California consumer-affairs licensees
- Colorado Professional License Scraper — Colorado DORA licenses
- BBB Business Scraper — Better Business Bureau company profiles
Comparison vs. Alternatives
| Approach | Setup time | Data freshness | Cost (10k rows) | Schema normalization | Filtering | Provenance |
|---|---|---|---|---|---|---|
| This actor | < 1 minute | Live at run | a few cents | Yes — built-in | Employer × Job × City × Year + salary band | Per-row source URL |
| Manual h1bdata.info browsing | Hours / days | Live | Free | None | UI only | None |
| DOL raw quarterly CSV download | 4–8 hours dev | 8–12 weeks lagged | Free + infra | DIY | DIY | Manual |
| MyVisaJobs paid subscription | Minutes | Live | $50–500+/mo | Yes | Limited UI | None |
| Custom Python + requests + BeautifulSoup | 1–2 days dev | Live | Free + infra | DIY | DIY | DIY |
| Hand-built per-employer cron + S3 + Athena | 1–2 weeks dev | Quarterly | $$$ | DIY | SQL | Manual |
Why Pay-Per-Event Pricing?
Most data products either lock you into a monthly seat license (you pay even when idle) or charge per Compute Unit (unpredictable bills). This actor uses Apify's pay-per-event model:
- You only pay when the actor actually runs
- Charges scale linearly with how many rows you actually consume
- Transparent line-item billing in the Apify console
- No monthly minimums, no annual contracts
- Free to evaluate: set maxRecords: 10 and validate the schema before scaling up
- Perfect for both one-off research projects and high-frequency production scrapers
Changelog
| Version | Date | Notes |
|---|---|---|
| 1.0.0 | 2026-05-18 | Initial public release of the H1B Visa Database Scraper — HTTP-only via got-scraping + cheerio, full h1bdata.info filter parity, salary-band post-filter, deduplication, Actor.fail() on empty results |
| (pre-1.0) | 2024–2026 | Same actor ID was previously published as salary-com-scraper; pivoted to H1B disclosure data because Salary.com largely duplicated freely available BLS/Glassdoor content, while the H1B niche is unique and high-value |
Keywords
H1B visa scraper · H1B salary database · H1B sponsor lookup · h1bdata.info scraper · US DOL H1B disclosure · H1B salary by employer · H1B salary by job title · H1B prevailing wage · H1B sponsorship history · immigration salary data · H1B visa API · LCA database scraper · Labor Condition Application data · DOL Office of Foreign Labor Certification scraper · H1B sponsor search · H1B sponsor history lookup · H1B base wage scraper · H1B job title salary · H1B petition data · H1B disclosure data extraction · H1B FAANG salaries · Google H1B salary · Meta H1B salary · Amazon H1B salary · Microsoft H1B salary · Apple H1B salary · NVIDIA H1B salary · OpenAI H1B salary · Infosys H1B petitions · TCS H1B petitions · Goldman Sachs H1B salary · JPMorgan H1B salary · H1B Bay Area salary · H1B NYC salary · H1B Seattle salary · H1B Austin salary · H1B Boston salary · immigration attorney data scraping · recruiter intel scraper · compensation benchmarking API · H1B prevailing wage compliance · H1B journalism data · H1B policy research dataset · H1B M&A due diligence · Apify H1B actor · H1B1 visa data · E3 visa data · Title 20 CFR §655.760 disclosure
Support
- Bug reports: Use the Issues tab on the Apify Store page
- Feature requests: Same place — please describe your use case so we can prioritize realistically
- Direct contact: Through the Apify developer profile (haketa)
- Roadmap requests welcome: PERM / green-card disclosure scraper, H2B seasonal-worker scraper, USCIS receipt-number enrichment, prevailing-wage Level 1–4 inference
If this actor saves you time on immigration research, recruiter intel, comp benchmarking, or policy reporting, a 5-star rating on the Apify Store helps other professionals discover it. Thank you.