Website Job Extractor (HTTP)
Scrape job listings directly from company websites, career pages, and ATS systems. Unlike job portals, this lets you identify hiring intent the moment it happens -- the freshest signal for B2B targeting. Jobs are AI-extracted with automatic ATS detection and anti-hallucination validation.

Pricing: from $4.00 / 1,000 jobs extracted

Developer: Alessandro Santamaria (Maintained by Community)

Website Job Extractor

Extract structured job listings from any company career page using AI. Give it a website URL -- it finds career pages, detects ATS systems, handles pagination, and returns clean, structured job data.

Why this actor?

  • Just provide a website URL -- the actor auto-discovers career pages from homepage navigation, subdomains (jobs.domain.ch, karriere.domain.ch), and embedded ATS portals
  • 19 ATS systems detected -- automatically follows links to Personio, Greenhouse, Softgarden, Lever, BambooHR, and 14 more
  • Anti-hallucination pipeline -- detects "no open positions" signals before calling the LLM, validates every extracted job, and assigns confidence scores
  • Works worldwide -- multilingual career page discovery and extraction in 7 languages (DE, EN, FR, ES, IT, PT, NL)
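The subdomain probing mentioned above can be pictured as generating a small list of candidate URLs per company. The prefix list below is illustrative (only `jobs` and `karriere` appear in these docs; the actor's actual probe list is internal):

```javascript
// Illustrative sketch of career-subdomain probing. The real prefix list
// is internal to the actor; these four are examples.
const CAREER_PREFIXES = ['jobs', 'karriere', 'careers', 'career'];

function careerSubdomainCandidates(domain) {
  return CAREER_PREFIXES.map((prefix) => `https://${prefix}.${domain}/`);
}

console.log(careerSubdomainCandidates('example.ch'));
```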

How it works

  1. Discover -- given a website URL, scans homepage navigation and probes common career subdomains to find career pages
  2. Detect -- identifies 19 ATS systems via URL patterns, iframes, links, and HTML signatures, then follows external portal links
  3. Gate -- checks for job-related keywords (m/w/d, Full-time, apply now) and screens for "no open positions" signals before invoking the LLM
  4. Extract -- sends cleaned HTML to an LLM (Gemini Flash, Groq, or OpenRouter) for structured extraction with anti-hallucination rules
  5. Validate -- post-extraction validation with confidence scoring, deduplication across pages, and optional keyword-based relevance filtering
  6. Paginate -- detects and follows next-page links up to 5 levels deep, staying within the per-company page budget

The actor runs HTTP-only (CheerioCrawler) with no browser overhead. Typical memory usage is around 128 MB, making runs fast and cost-efficient.
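The gating step (step 3) can be thought of as a cheap text pre-check that decides whether an LLM call is worth making. The keyword lists below are illustrative examples, not the actor's internal (larger, multilingual) configuration:

```javascript
// Illustrative pre-LLM gate: skip pages that announce "no open positions",
// and only forward pages containing job-related keywords. Both lists are
// examples; the actor's internal lists are larger and multilingual.
const JOB_SIGNALS = ['m/w/d', 'full-time', 'apply now'];
const EMPTY_SIGNALS = ['no open positions', 'keine offenen stellen'];

function shouldInvokeLlm(pageText) {
  const text = pageText.toLowerCase();
  if (EMPTY_SIGNALS.some((signal) => text.includes(signal))) return false;
  return JOB_SIGNALS.some((signal) => text.includes(signal));
}
```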

Input configuration

Minimal example

{
  "companies": [
    {
      "company_id": "example-ag",
      "company_name": "Example AG",
      "website_url": "https://example.ch"
    }
  ],
  "llmProvider": "gemini",
  "geminiApiKey": "YOUR_GEMINI_API_KEY"
}

With keyword filtering

Extract only jobs matching specific keywords. Each job receives a relevance score of high, medium, or low.

{
  "companies": [
    {
      "company_id": "example-ag",
      "company_name": "Example AG",
      "website_url": "https://example.ch"
    }
  ],
  "jobKeywords": ["Software Engineer", "DevOps", "Backend"],
  "llmProvider": "gemini",
  "geminiApiKey": "YOUR_GEMINI_API_KEY"
}
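The relevance score itself is assigned during extraction; as a rough mental model, title matches rank higher than description-only matches. The sketch below is an approximation for intuition, not the actor's actual scoring logic:

```javascript
// Approximate mental model of keyword relevance (NOT the actor's real
// implementation): keyword in title => high, in description => medium,
// otherwise low.
function approximateRelevance(job, keywords) {
  const title = (job.title || '').toLowerCase();
  const description = (job.description || '').toLowerCase();
  const lowered = keywords.map((keyword) => keyword.toLowerCase());
  if (lowered.some((keyword) => title.includes(keyword))) return 'high';
  if (lowered.some((keyword) => description.includes(keyword))) return 'medium';
  return 'low';
}
```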

With known career URLs

If you already know the career page URLs, provide them directly to skip discovery.

{
  "companies": [
    {
      "company_id": "example-ag",
      "company_name": "Example AG",
      "career_urls": [
        "https://example.ch/karriere",
        "https://example.personio.de/"
      ]
    }
  ],
  "llmProvider": "gemini",
  "geminiApiKey": "YOUR_GEMINI_API_KEY"
}

Full parameter reference

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| companies | array | required | Companies to extract jobs from. Each item needs company_id and company_name, plus at least one of website_url, career_urls, or career_page_url. |
| llmProvider | string | "gemini" | Primary AI provider: gemini, groq, or openrouter |
| fallbackProvider | string | -- | Second-level fallback provider (used if the primary fails) |
| fallback2Provider | string | -- | Third-level fallback provider |
| geminiApiKey | string | -- | API key for Google Gemini. Get one free at aistudio.google.com/apikey |
| llmApiKey | string | -- | API key for Groq or OpenRouter |
| openrouterApiKey | string | -- | API key for OpenRouter |
| outputLanguage | string | "auto" | Output language: en, de, fr, it, es, pt, nl, or auto (matches website language) |
| maxPagesPerCompany | integer | 5 | Max pages to process per company, including pagination (range: 1-20) |
| maxConcurrency | integer | 3 | Parallel HTTP requests (range: 1-10) |
| jobKeywords | string[] | -- | Filter jobs by keywords. When set, each job gets a relevance score. |
| webhookUrl | string | -- | URL to POST results to when extraction completes |
| proxyConfiguration | object | -- | Apify proxy settings. Datacenter proxies work for 95%+ of career pages. |

Company input fields

Each company object supports these fields:

| Field | Required | Description |
| --- | --- | --- |
| company_id | Yes | Your internal identifier (UUID or any string) |
| company_name | Yes | Company name (used as context for AI extraction) |
| website_url | No | Company homepage -- the actor discovers career pages from here |
| career_urls | No | Array of known career page URLs |
| career_page_url | No | Single career page URL (treated as highest priority) |
| website_domain | No | Domain for subdomain probing (e.g., example.ch triggers checks on jobs.example.ch, karriere.example.ch) |

At least one of website_url, career_urls, or career_page_url must be provided.
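A quick client-side check of these rules before submitting a run might look like this (a convenience sketch; the actor performs its own input validation):

```javascript
// Sketch of the documented input rules: company_id and company_name are
// required, plus at least one of the three URL fields.
function validateCompany(company) {
  const errors = [];
  if (!company.company_id) errors.push('company_id is required');
  if (!company.company_name) errors.push('company_name is required');
  const hasUrl = Boolean(
    company.website_url ||
    (company.career_urls && company.career_urls.length > 0) ||
    company.career_page_url,
  );
  if (!hasUrl) errors.push('provide website_url, career_urls, or career_page_url');
  return errors;
}
```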

Output format

Each extracted job is pushed as a separate item to the dataset.

{
  "company_id": "example-ag",
  "company_name": "Example AG",
  "title": "Senior Software Engineer (m/w/d)",
  "description": "Entwicklung von Web-Applikationen mit React und Node.js...",
  "location": "Zurich",
  "employment_type": "Full-time",
  "experience_level": "Senior",
  "salary_range": "100'000-130'000 CHF",
  "department": "Engineering",
  "requirements": ["5+ years experience", "React", "Node.js", "TypeScript"],
  "benefits": ["Remote work", "Training budget", "Flexible hours"],
  "source_url": "https://example.ch/karriere",
  "application_url": "https://example.ch/apply/senior-swe",
  "posted_at": "2026-03-01",
  "workplace_type": "hybrid",
  "confidence": 0.95,
  "relevance": "high",
  "ats_system": "personio",
  "extracted_at": "2026-03-08T10:30:00Z"
}

Output fields

| Field | Type | Description |
| --- | --- | --- |
| company_id | string | The company ID from your input |
| company_name | string | The company name from your input |
| title | string | Job title |
| description | string | Job description (may be summarized for long listings) |
| location | string | Work location |
| employment_type | string | Full-time, Part-time, Contract, Internship, etc. |
| experience_level | string | Junior, Mid, Senior, Lead, etc. |
| salary_range | string | Salary range if listed on the page |
| department | string | Department or team |
| requirements | string[] | Required skills and qualifications |
| benefits | string[] | Listed benefits and perks |
| source_url | string | URL of the career page where the job was found |
| application_url | string | Direct application link (if available) |
| posted_at | string | Posting date (if listed) |
| workplace_type | string | remote, hybrid, on-site |
| confidence | number | Extraction confidence score (0.0-1.0) |
| relevance | string | Keyword relevance: high, medium, or low (only when jobKeywords is set) |
| ats_system | string | Detected ATS system (e.g., personio, greenhouse) |
| extracted_at | string | ISO timestamp of extraction |
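Downstream, the confidence and relevance fields make convenient post-filters. For example, keeping only confidently extracted, non-low-relevance jobs (the 0.8 threshold here is an arbitrary illustration, not a recommendation from the actor):

```javascript
// Keep real job items (title set), drop low-confidence extractions, and,
// when keyword filtering was enabled, drop low-relevance matches.
// Items without a `relevance` field (no jobKeywords set) are kept.
function filterJobs(items, minConfidence = 0.8) {
  return items.filter((job) =>
    job.title !== null &&
    job.confidence >= minConfidence &&
    job.relevance !== 'low');
}
```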

Supported ATS systems

The actor detects and follows links to these Applicant Tracking Systems:

| ATS System | Domain |
| --- | --- |
| Personio | personio.de, personio.com |
| Softgarden | softgarden.io |
| Greenhouse | greenhouse.io |
| Lever | lever.co |
| BambooHR | bamboohr.com |
| JOIN | join.com |
| Recruitee | recruitee.com |
| d.vinci | dvinci-hr.com |
| Workwise | workwise.io |
| Factorial | factorial.ch, factorial.co |
| CVManager | cvmanager.ch |
| Onlyfy (by XING) | onlyfy.jobs |
| Jobbase | jobbase.io |
| Teamtailor | teamtailor.com |
| SmartRecruiters | smartrecruiters.com |
| Talentsoft | talent-soft.com |
| rexx systems | rexx-systems.com |
| Concludis | concludis.de |
| Coveto | coveto.de |

When an ATS portal is detected via iframe, link, or HTML pattern, the actor automatically navigates to the portal page and extracts jobs from there.
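URL-pattern detection, one of the signals mentioned above, can be sketched against a few of the supported domains (the actor also inspects iframes, links, and HTML signatures, which this sketch omits):

```javascript
// Illustrative hostname matching against a subset of the supported
// ATS domains from the table above.
const ATS_DOMAINS = {
  personio: ['personio.de', 'personio.com'],
  greenhouse: ['greenhouse.io'],
  lever: ['lever.co'],
  softgarden: ['softgarden.io'],
};

function detectAts(url) {
  const host = new URL(url).hostname;
  for (const [ats, domains] of Object.entries(ATS_DOMAINS)) {
    if (domains.some((d) => host === d || host.endsWith('.' + d))) return ats;
  }
  return null; // no known ATS domain in this URL
}
```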

LLM providers

The actor uses an LLM to parse HTML into structured job data. You can choose from three providers, and configure a fallback chain for resilience.

| Provider | Free Tier | Speed | Model | Best For |
| --- | --- | --- | --- | --- |
| Gemini (recommended) | 1M tokens/min | Fast | gemini-2.0-flash | Most users -- generous free tier |
| Groq | 30 requests/min | Very fast | llama-3.1-8b-instant | High-speed extraction |
| OpenRouter | Some free models | Varies | mistral-small-3.1-24b | Fallback / alternative models |

Recommended setup: Use Gemini as primary with Groq as fallback.

{
  "llmProvider": "gemini",
  "geminiApiKey": "YOUR_GEMINI_KEY",
  "fallbackProvider": "groq",
  "llmApiKey": "YOUR_GROQ_KEY"
}

Get a free Gemini API key at aistudio.google.com/apikey.
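Conceptually, the fallback chain just tries providers in order until one succeeds. A synchronous sketch (the real calls are asynchronous, and callProvider is a hypothetical stand-in for the actor's internal LLM calls):

```javascript
// Try each provider in order; return the first successful result.
// `callProvider` is a hypothetical stand-in for the actual LLM call,
// synchronous here only for brevity.
function extractWithFallback(html, providers, callProvider) {
  let lastError = new Error('no providers configured');
  for (const provider of providers) {
    try {
      return callProvider(provider, html);
    } catch (err) {
      lastError = err; // fall through to the next provider in the chain
    }
  }
  throw lastError;
}
```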

Pricing

This actor uses pay-per-result pricing on the Apify platform:

| Event | Price |
| --- | --- |
| Company processed | $0.01 per company |
| Job extracted | $0.004 per job |

Standard Apify platform costs (compute and proxy) apply on top.

Example: Extracting jobs from 100 companies that have a total of 500 open positions would cost $1.00 (companies) + $2.00 (jobs) = $3.00 plus platform fees.
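The arithmetic above generalizes to a one-line estimator (platform compute and proxy fees excluded):

```javascript
// Estimate actor cost from the two documented per-result prices.
// Platform compute/proxy costs are not included.
const PRICE_PER_COMPANY = 0.01;
const PRICE_PER_JOB = 0.004;

function estimateCostUsd(companyCount, jobCount) {
  return companyCount * PRICE_PER_COMPANY + jobCount * PRICE_PER_JOB;
}

console.log(estimateCostUsd(100, 500)); // ≈ 3.00, matching the example above
```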

Use cases

  • Recruitment analytics -- monitor hiring activity across companies in your market
  • Competitive intelligence -- track what roles competitors are hiring for
  • Lead generation -- identify companies that are actively hiring (buying signal)
  • Job aggregation -- build a niche job board by extracting listings from company career pages
  • Market research -- analyze hiring trends by region, industry, or role type

Combine with Google Maps Scraper

Chain this actor with the Google Maps Scraper for a complete pipeline:

Google Maps Scraper --> find companies with websites --> Website Job Extractor --> structured job listings

The Google Maps Scraper has a built-in enableJobExtraction option that automatically chains results to this actor.
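If you drive the chain yourself instead of using enableJobExtraction, mapping Maps results into this actor's companies input is straightforward. The place field names below (title, website) are assumptions about the Maps actor's output, not guaranteed by this document:

```javascript
// Map Google Maps Scraper results into this actor's `companies` input.
// `title` and `website` are assumed field names on the Maps results.
function toCompanies(places) {
  return places
    .filter((place) => place.website) // only companies with a website
    .map((place, index) => ({
      company_id: `maps-${index}`,
      company_name: place.title,
      website_url: place.website,
    }));
}
```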

JavaScript-rendered pages

This actor is HTTP-only and does not run a browser. Career pages built with JavaScript frameworks (React, Vue, Angular) may render no content in their initial HTML. When this happens, the actor automatically detects JS-rendering signals and flags affected companies in the output.

How detection works

After fetching each page, the actor inspects the raw HTML for framework markers, empty root elements, and noscript warnings. If JS-rendering indicators are found and 0 jobs are extracted for that company, a sentinel item is pushed to the dataset.

Detection signals

| Signal | Meaning |
| --- | --- |
| react_root_empty | <div id="root"> or <div id="__next"> with near-empty content |
| react_markers | data-reactroot, __NEXT_DATA__, or _next/static scripts |
| vue_markers | data-v- attributes or Vue.js script references |
| angular_markers | ng-version or ng-app attributes |
| low_text_ratio | Less than 200 chars of text in 5,000+ chars of HTML |
| noscript_warning | <noscript> block asking the user to enable JavaScript |
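Two of these signals are easy to reproduce as standalone heuristics. The sketches below mirror the documented thresholds but are not the actor's exact implementation:

```javascript
// Illustrative versions of two signals from the table above.
// Empty React/Next.js root element:
function hasEmptyReactRoot(html) {
  return /<div id="(root|__next)">\s*<\/div>/.test(html);
}

// Low text-to-HTML ratio: under 200 chars of text in 5000+ chars of HTML.
function hasLowTextRatio(html) {
  const text = html.replace(/<[^>]*>/g, '').replace(/\s+/g, ' ').trim();
  return html.length >= 5000 && text.length < 200;
}
```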

Sentinel output example

When a company's career page requires JS rendering, the dataset will contain:

{
  "company_id": "example-ag",
  "company_name": "Example AG",
  "title": null,
  "source_url": "https://example.com/karriere",
  "js_rendering_suspected": true,
  "js_indicators": ["react_root_empty", "react_markers", "noscript_warning"],
  "extracted_at": "2026-03-09T10:00:00Z"
}

Use the sentinel to build a two-stage pipeline:

Website Job Extractor (HTTP, fast, 128MB)
├─ Normal jobs → use directly
└─ Sentinel (js_rendering_suspected) → re-process with Playwright actor (coming soon)

Why separate actors? The HTTP actor uses a minimal Docker image (~128MB) with no browser. A Playwright actor needs Chrome (~1024MB) and costs ~10x more per compute unit. Keeping them separate lets you run the cheap HTTP pass on all companies and only pay for browser rendering on the ~28% that need it.

Filter sentinels in code

import { Actor } from 'apify';

const dataset = await Actor.openDataset();
const { items } = await dataset.getData();

// Real jobs have a title; sentinels have title: null and a flag.
const jobs = items.filter((item) => item.title !== null);
const sentinels = items.filter((item) => item.js_rendering_suspected === true);

// Re-run sentinels with a Playwright-based actor
for (const sentinel of sentinels) {
  console.log(`Needs browser: ${sentinel.company_name} (${sentinel.js_indicators.join(', ')})`);
}

Limitations

  • HTTP-only: This actor does not run a browser. JS-rendered career pages are automatically detected and flagged (see JavaScript-rendered pages above). ATS portals that serve server-rendered HTML (Personio, Greenhouse, Softgarden, BambooHR, Factorial, and most others) work reliably.
  • DACH coverage is highest: Approximately 72% of DACH (Germany, Austria, Switzerland) company career pages work with HTTP-only extraction. International coverage is around 40-50%, depending on how many companies use JS-rendered career pages.
  • LLM API key required: You need to provide at least one LLM API key. Gemini offers a generous free tier.
  • Page budget: The maxPagesPerCompany setting limits how many pages are processed per company (default 5). Companies with hundreds of job listings spread across many pagination pages may need a higher budget.
  • No login-protected pages: The actor cannot access career pages behind authentication.