Website Job Extractor (HTTP)
Scrape job listings directly from company websites, career pages, and ATS systems. Unlike job portals, these sources surface hiring intent the moment it happens -- the freshest signal for B2B targeting. AI-extracted with automatic ATS detection and anti-hallucination validation.
Pricing
from $4.00 / 1,000 jobs extracted
Developer

Alessandro Santamaria
Website Job Extractor
Extract structured job listings from any company career page using AI. Give it a website URL -- it finds career pages, detects ATS systems, handles pagination, and returns clean, structured job data.
Why this actor?
- Just provide a website URL -- the actor auto-discovers career pages from homepage navigation, subdomains (jobs.domain.ch, karriere.domain.ch), and embedded ATS portals
- 19 ATS systems detected -- automatically follows links to Personio, Greenhouse, Softgarden, Lever, BambooHR, and 14 more
- Anti-hallucination pipeline -- detects "no open positions" signals before calling the LLM, validates every extracted job, and assigns confidence scores
- Works worldwide -- multilingual career page discovery and extraction in 7 languages (DE, EN, FR, ES, IT, PT, NL)
How it works
- Discover -- given a website URL, scans homepage navigation and probes common career subdomains to find career pages
- Detect -- identifies 19 ATS systems via URL patterns, iframes, links, and HTML signatures, then follows external portal links
- Gate -- checks for job-related keywords (m/w/d, Full-time, apply now) and screens for "no open positions" signals before invoking the LLM
- Extract -- sends cleaned HTML to an LLM (Gemini Flash, Groq, or OpenRouter) for structured extraction with anti-hallucination rules
- Validate -- post-extraction validation with confidence scoring, deduplication across pages, and optional keyword-based relevance filtering
- Paginate -- detects and follows next-page links up to 5 levels deep, staying within the per-company page budget
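The Gate step can be sketched as a pair of cheap regex checks run before any LLM call; the patterns and function name below are illustrative, not the actor's internal code.

```javascript
// Hypothetical sketch of the "Gate" step: inexpensive regex checks decide
// whether a page is worth sending to the LLM at all.
const JOB_SIGNALS = /\b(m\/w\/d|full[- ]?time|apply now|jetzt bewerben|stellenangebote)\b/i;
const NO_JOBS_SIGNALS = /\b(no open positions|keine offenen stellen|currently not hiring)\b/i;

function shouldInvokeLlm(pageText) {
  // An explicit "nothing open" signal short-circuits everything
  if (NO_JOBS_SIGNALS.test(pageText)) return false;
  // Only pay for an LLM call when job-related markers are present
  return JOB_SIGNALS.test(pageText);
}
```

Gating this way is what keeps empty career pages from producing hallucinated jobs: pages without job markers never reach the extraction model.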
The actor runs HTTP-only (CheerioCrawler) with no browser overhead. Typical memory usage is around 128 MB, making runs fast and cost-efficient.
Input configuration
Minimal example
```json
{
  "companies": [
    {
      "company_id": "example-ag",
      "company_name": "Example AG",
      "website_url": "https://example.ch"
    }
  ],
  "llmProvider": "gemini",
  "geminiApiKey": "YOUR_GEMINI_API_KEY"
}
```
With keyword filtering
Only extract jobs matching specific keywords. Each job receives a relevance score of high, medium, or low.
```json
{
  "companies": [
    {
      "company_id": "example-ag",
      "company_name": "Example AG",
      "website_url": "https://example.ch"
    }
  ],
  "jobKeywords": ["Software Engineer", "DevOps", "Backend"],
  "llmProvider": "gemini",
  "geminiApiKey": "YOUR_GEMINI_API_KEY"
}
```
With known career URLs
If you already know the career page URLs, provide them directly to skip discovery.
```json
{
  "companies": [
    {
      "company_id": "example-ag",
      "company_name": "Example AG",
      "career_urls": [
        "https://example.ch/karriere",
        "https://example.personio.de/"
      ]
    }
  ],
  "llmProvider": "gemini",
  "geminiApiKey": "YOUR_GEMINI_API_KEY"
}
```
Full parameter reference
| Parameter | Type | Default | Description |
|---|---|---|---|
| companies | array | required | Companies to extract jobs from. Each item needs company_id and company_name, plus at least one of website_url, career_urls, or career_page_url. |
| llmProvider | string | "gemini" | Primary AI provider: gemini, groq, or openrouter |
| fallbackProvider | string | -- | Second-level fallback provider (used if primary fails) |
| fallback2Provider | string | -- | Third-level fallback provider |
| geminiApiKey | string | -- | API key for Google Gemini. Get one free at aistudio.google.com/apikey |
| llmApiKey | string | -- | API key for Groq or OpenRouter |
| openrouterApiKey | string | -- | API key for OpenRouter |
| outputLanguage | string | "auto" | Output language: en, de, fr, it, es, pt, nl, or auto (matches website language) |
| maxPagesPerCompany | integer | 5 | Max pages to process per company, including pagination (range: 1-20) |
| maxConcurrency | integer | 3 | Parallel HTTP requests (range: 1-10) |
| jobKeywords | string[] | -- | Filter jobs by keywords. When set, each job gets a relevance score. |
| webhookUrl | string | -- | URL to POST results to when extraction completes |
| proxyConfiguration | object | -- | Apify proxy settings. Datacenter proxies work for 95%+ of career pages. |
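Putting several of these parameters together, a fuller input might look like this (all values illustrative):

```json
{
  "companies": [
    {
      "company_id": "example-ag",
      "company_name": "Example AG",
      "website_url": "https://example.ch"
    }
  ],
  "llmProvider": "gemini",
  "geminiApiKey": "YOUR_GEMINI_API_KEY",
  "fallbackProvider": "groq",
  "llmApiKey": "YOUR_GROQ_API_KEY",
  "outputLanguage": "en",
  "maxPagesPerCompany": 10,
  "maxConcurrency": 5,
  "jobKeywords": ["Software Engineer", "DevOps"],
  "webhookUrl": "https://example.com/hooks/jobs-done"
}
```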
Company input fields
Each company object supports these fields:
| Field | Required | Description |
|---|---|---|
| company_id | Yes | Your internal identifier (UUID or any string) |
| company_name | Yes | Company name (used as context for AI extraction) |
| website_url | No | Company homepage -- the actor discovers career pages from here |
| career_urls | No | Array of known career page URLs |
| career_page_url | No | Single career page URL (treated as highest priority) |
| website_domain | No | Domain for subdomain probing (e.g., example.ch triggers checks on jobs.example.ch, karriere.example.ch) |
At least one of website_url, career_urls, or career_page_url should be provided.
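That rule can be enforced client-side before submitting a run; the helper below is illustrative, not part of the actor.

```javascript
// Validate a company object against the input rules above:
// company_id and company_name are required, plus at least one URL field.
function validateCompany(company) {
  const errors = [];
  if (!company.company_id) errors.push('company_id is required');
  if (!company.company_name) errors.push('company_name is required');
  if (!company.website_url && !(company.career_urls?.length) && !company.career_page_url) {
    errors.push('provide website_url, career_urls, or career_page_url');
  }
  return errors; // empty array means the company is valid
}
```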
Output format
Each extracted job is pushed as a separate item to the dataset.
```json
{
  "company_id": "example-ag",
  "company_name": "Example AG",
  "title": "Senior Software Engineer (m/w/d)",
  "description": "Entwicklung von Web-Applikationen mit React und Node.js...",
  "location": "Zurich",
  "employment_type": "Full-time",
  "experience_level": "Senior",
  "salary_range": "100'000-130'000 CHF",
  "department": "Engineering",
  "requirements": ["5+ years experience", "React", "Node.js", "TypeScript"],
  "benefits": ["Remote work", "Training budget", "Flexible hours"],
  "source_url": "https://example.ch/karriere",
  "application_url": "https://example.ch/apply/senior-swe",
  "posted_at": "2026-03-01",
  "workplace_type": "hybrid",
  "confidence": 0.95,
  "relevance": "high",
  "ats_system": "personio",
  "extracted_at": "2026-03-08T10:30:00Z"
}
```
Output fields
| Field | Type | Description |
|---|---|---|
| company_id | string | The company ID from your input |
| company_name | string | The company name from your input |
| title | string | Job title |
| description | string | Job description (may be summarized for long listings) |
| location | string | Work location |
| employment_type | string | Full-time, Part-time, Contract, Internship, etc. |
| experience_level | string | Junior, Mid, Senior, Lead, etc. |
| salary_range | string | Salary range if listed on the page |
| department | string | Department or team |
| requirements | string[] | Required skills and qualifications |
| benefits | string[] | Listed benefits and perks |
| source_url | string | URL of the career page where the job was found |
| application_url | string | Direct application link (if available) |
| posted_at | string | Posting date (if listed) |
| workplace_type | string | remote, hybrid, on-site |
| confidence | number | Extraction confidence score (0.0-1.0) |
| relevance | string | Keyword relevance: high, medium, or low (only when jobKeywords is set) |
| ats_system | string | Detected ATS system (e.g., personio, greenhouse) |
| extracted_at | string | ISO timestamp of extraction |
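When consuming the dataset downstream, a small filter over these fields keeps only confident, relevant jobs. The helper name and threshold below are a sketch, not part of the actor.

```javascript
// Select usable jobs from downloaded dataset items:
// drop sentinels (title === null), low-confidence extractions,
// and jobs whose keyword relevance is below the accepted levels.
function selectJobs(items, { minConfidence = 0.8, relevance = ['high', 'medium'] } = {}) {
  return items
    .filter(item => item.title !== null)                        // JS-rendering sentinels
    .filter(item => (item.confidence ?? 0) >= minConfidence)    // confidence gate
    .filter(item => !item.relevance || relevance.includes(item.relevance));
}
```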
Supported ATS systems
The actor detects and follows links to these Applicant Tracking Systems:
| ATS System | Domain |
|---|---|
| Personio | personio.de, personio.com |
| Softgarden | softgarden.io |
| Greenhouse | greenhouse.io |
| Lever | lever.co |
| BambooHR | bamboohr.com |
| JOIN | join.com |
| Recruitee | recruitee.com |
| d.vinci | dvinci-hr.com |
| Workwise | workwise.io |
| Factorial | factorial.ch, factorial.co |
| CVManager | cvmanager.ch |
| Onlyfy (by XING) | onlyfy.jobs |
| Jobbase | jobbase.io |
| Teamtailor | teamtailor.com |
| SmartRecruiters | smartrecruiters.com |
| Talentsoft | talent-soft.com |
| rexx systems | rexx-systems.com |
| Concludis | concludis.de |
| Coveto | coveto.de |
When an ATS portal is detected via iframe, link, or HTML pattern, the actor automatically navigates to the portal page and extracts jobs from there.
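Detection by domain can be illustrated with a small lookup over a subset of the table above; the mapping and function are a simplified sketch, not the actor's full signature-based logic.

```javascript
// Map a candidate link to an ATS system by matching its hostname
// against known ATS domains (subset of the supported list).
const ATS_DOMAINS = {
  'personio.de': 'personio',
  'personio.com': 'personio',
  'greenhouse.io': 'greenhouse',
  'lever.co': 'lever',
  'teamtailor.com': 'teamtailor',
};

function detectAts(url) {
  const host = new URL(url).hostname;
  for (const [domain, ats] of Object.entries(ATS_DOMAINS)) {
    // Match the bare domain or any subdomain (e.g. example.personio.de)
    if (host === domain || host.endsWith('.' + domain)) return ats;
  }
  return null; // no known ATS in this link
}
```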
LLM providers
The actor uses an LLM to parse HTML into structured job data. You can choose from three providers, and configure a fallback chain for resilience.
| Provider | Free Tier | Speed | Model | Best For |
|---|---|---|---|---|
| Gemini (recommended) | 1M tokens/min | Fast | gemini-2.0-flash | Most users -- generous free tier |
| Groq | 30 requests/min | Very fast | llama-3.1-8b-instant | High-speed extraction |
| OpenRouter | Some free models | Varies | mistral-small-3.1-24b | Fallback / alternative models |
Recommended setup: Use Gemini as primary with Groq as fallback.
```json
{
  "llmProvider": "gemini",
  "geminiApiKey": "YOUR_GEMINI_KEY",
  "fallbackProvider": "groq",
  "llmApiKey": "YOUR_GROQ_KEY"
}
```
Get a free Gemini API key at aistudio.google.com/apikey.
Pricing
This actor uses pay-per-result pricing on the Apify platform:
| Event | Price |
|---|---|
| Company processed | $0.01 per company |
| Job extracted | $0.004 per job |
Standard Apify platform costs (compute and proxy) apply on top.
Example: Extracting jobs from 100 companies that have a total of 500 open positions would cost $1.00 (companies) + $2.00 (jobs) = $3.00 plus platform fees.
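The per-event arithmetic is easy to wrap in a helper when budgeting runs; prices mirror the table above, and platform fees are excluded.

```javascript
// Estimate pay-per-result cost in USD for a planned run.
const COMPANY_PRICE = 0.01;  // $ per company processed
const JOB_PRICE = 0.004;     // $ per job extracted

function estimateCost(numCompanies, numJobs) {
  return numCompanies * COMPANY_PRICE + numJobs * JOB_PRICE;
}
```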
Use cases
- Recruitment analytics -- monitor hiring activity across companies in your market
- Competitive intelligence -- track what roles competitors are hiring for
- Lead generation -- identify companies that are actively hiring (buying signal)
- Job aggregation -- build a niche job board by extracting listings from company career pages
- Market research -- analyze hiring trends by region, industry, or role type
Combine with Google Maps Scraper
Chain this actor with the Google Maps Scraper for a complete pipeline:
Google Maps Scraper --> find companies with websites --> Website Job Extractor --> structured job listings
The Google Maps Scraper has a built-in enableJobExtraction option that automatically chains results to this actor.
JavaScript-rendered pages
This actor is HTTP-only and does not run a browser. Career pages built with JavaScript frameworks (React, Vue, Angular) may render no content in their initial HTML. When this happens, the actor automatically detects JS-rendering signals and flags affected companies in the output.
How detection works
After fetching each page, the actor inspects the raw HTML for framework markers, empty root elements, and noscript warnings. If JS-rendering indicators are found and 0 jobs are extracted for that company, a sentinel item is pushed to the dataset.
Detection signals
| Signal | Meaning |
|---|---|
| react_root_empty | <div id="root"> or <div id="__next"> with near-empty content |
| react_markers | data-reactroot, __NEXT_DATA__, or _next/static scripts |
| vue_markers | data-v- attributes or Vue.js script references |
| angular_markers | ng-version or ng-app attributes |
| low_text_ratio | Less than 200 chars of text in 5000+ chars of HTML |
| noscript_warning | <noscript> block asking user to enable JavaScript |
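As an illustration, the low_text_ratio signal amounts to comparing visible text against raw HTML size; the tag-stripping below is a rough sketch of the idea, not the actor's parser.

```javascript
// Flag pages whose HTML is large but whose visible text is tiny --
// a typical symptom of a client-side-rendered app shell.
function hasLowTextRatio(html) {
  const text = html
    .replace(/<script[\s\S]*?<\/script>/gi, '')  // drop script bodies
    .replace(/<style[\s\S]*?<\/style>/gi, '')    // drop stylesheets
    .replace(/<[^>]+>/g, ' ')                    // strip remaining tags
    .replace(/\s+/g, ' ')
    .trim();
  // Thresholds mirror the table: <200 chars of text in 5000+ chars of HTML
  return html.length >= 5000 && text.length < 200;
}
```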
Sentinel output example
When a company's career page requires JS rendering, the dataset will contain:
```json
{
  "company_id": "example-ag",
  "company_name": "Example AG",
  "title": null,
  "source_url": "https://example.com/karriere",
  "js_rendering_suspected": true,
  "js_indicators": ["react_root_empty", "react_markers", "noscript_warning"],
  "extracted_at": "2026-03-09T10:00:00Z"
}
```
Recommended pipeline
Use the sentinel to build a two-stage pipeline:
```
Website Job Extractor (HTTP, fast, 128MB)
├─ Normal jobs → use directly
└─ Sentinel (js_rendering_suspected) → re-process with Playwright actor (coming soon)
```
Why separate actors? The HTTP actor uses a minimal Docker image (~128MB) with no browser. A Playwright actor needs Chrome (~1024MB) and costs ~10x more per compute unit. Keeping them separate lets you run the cheap HTTP pass on all companies and only pay for browser rendering on the ~28% that need it.
Filter sentinels in code
```javascript
import { Actor } from 'apify';

await Actor.init();

const dataset = await Actor.openDataset();
const { items } = await dataset.getData();

const jobs = items.filter(item => item.title !== null);
const sentinels = items.filter(item => item.js_rendering_suspected === true);

// Re-run sentinels with a Playwright-based actor
for (const sentinel of sentinels) {
  console.log(`Needs browser: ${sentinel.company_name} (${sentinel.js_indicators.join(', ')})`);
}

await Actor.exit();
```
Limitations
- HTTP-only: This actor does not run a browser. JS-rendered career pages are automatically detected and flagged (see JavaScript-rendered pages above). ATS portals that serve server-rendered HTML (Personio, Greenhouse, Softgarden, BambooHR, Factorial, and most others) work reliably.
- DACH coverage is highest: Approximately 72% of DACH (Germany, Austria, Switzerland) company career pages work with HTTP-only extraction. International coverage is around 40-50%, depending on how many companies use JS-rendered career pages.
- LLM API key required: You need to provide at least one LLM API key. Gemini offers a generous free tier.
- Page budget: The maxPagesPerCompany setting limits how many pages are processed per company (default 5). Companies with hundreds of job listings spread across many pagination pages may need a higher budget.
- No login-protected pages: The actor cannot access career pages behind authentication.