Website Job Extractor (HTTP)
Scrape job listings directly from company websites, career pages, and ATS systems. Unlike job portals, this lets you identify hiring intent the moment it happens -- the freshest signal for B2B targeting. Jobs are AI-extracted with automatic ATS detection and anti-hallucination validation.

Pricing: from $4.00 / 1,000 jobs extracted

Developer: Alessandro Santamaria (Maintained by Community)

Website Job Extractor

Extract structured job listings from any company career page using AI. Give it a website URL -- it finds career pages, detects ATS systems, handles pagination, and returns clean, structured job data.

Why this actor?

  • Just provide a website URL -- the actor auto-discovers career pages from homepage navigation, subdomains (jobs.domain.ch, karriere.domain.ch), and embedded ATS portals
  • 19 ATS systems detected -- automatically follows links to Personio, Greenhouse, Softgarden, Lever, BambooHR, and 14 more
  • Anti-hallucination pipeline -- detects "no open positions" signals before calling the LLM, validates every extracted job, and assigns confidence scores
  • Works worldwide -- multilingual career page discovery and extraction in 7 languages (DE, EN, FR, ES, IT, PT, NL)
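The subdomain probing mentioned above can be pictured as generating a small list of candidate URLs per company. The prefix list below is illustrative (only `jobs` and `karriere` appear in these docs; the actor's actual probe list is internal):

```javascript
// Illustrative sketch of career-subdomain probing. The real prefix list
// is internal to the actor; these four are examples.
const CAREER_PREFIXES = ['jobs', 'karriere', 'careers', 'career'];

function careerSubdomainCandidates(domain) {
  return CAREER_PREFIXES.map((prefix) => `https://${prefix}.${domain}/`);
}

console.log(careerSubdomainCandidates('example.ch'));
```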

How it works

  1. Discover -- given a website URL, scans homepage navigation and probes common career subdomains to find career pages
  2. Detect -- identifies 19 ATS systems via URL patterns, iframes, links, and HTML signatures, then follows external portal links
  3. Gate -- checks for job-related keywords (m/w/d, Full-time, apply now) and screens for "no open positions" signals before invoking the LLM
  4. Extract -- sends cleaned HTML to an LLM (Gemini Flash, Groq, or OpenRouter) for structured extraction with anti-hallucination rules
  5. Validate -- post-extraction validation with confidence scoring, deduplication across pages, and optional keyword-based relevance filtering
  6. Paginate -- detects and follows next-page links up to 5 levels deep, staying within the per-company page budget

The actor runs HTTP-only (CheerioCrawler) with no browser overhead. Typical memory usage is around 128 MB, making runs fast and cost-efficient.
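The gating step (step 3) can be thought of as a cheap text pre-check that decides whether an LLM call is worth making. The keyword lists below are illustrative examples, not the actor's internal (larger, multilingual) configuration:

```javascript
// Illustrative pre-LLM gate: skip pages that announce "no open positions",
// and only forward pages containing job-related keywords. Both lists are
// examples; the actor's internal lists are larger and multilingual.
const JOB_SIGNALS = ['m/w/d', 'full-time', 'apply now'];
const EMPTY_SIGNALS = ['no open positions', 'keine offenen stellen'];

function shouldInvokeLlm(pageText) {
  const text = pageText.toLowerCase();
  if (EMPTY_SIGNALS.some((signal) => text.includes(signal))) return false;
  return JOB_SIGNALS.some((signal) => text.includes(signal));
}
```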

Input configuration

Minimal example

{
  "companies": [
    {
      "company_id": "example-ag",
      "company_name": "Example AG",
      "website_url": "https://example.ch"
    }
  ],
  "llmProvider": "gemini",
  "geminiApiKey": "YOUR_GEMINI_API_KEY"
}

With keyword filtering

Extract only jobs matching specific keywords. Each job receives a relevance score of high, medium, or low.

{
  "companies": [
    {
      "company_id": "example-ag",
      "company_name": "Example AG",
      "website_url": "https://example.ch"
    }
  ],
  "jobKeywords": ["Software Engineer", "DevOps", "Backend"],
  "llmProvider": "gemini",
  "geminiApiKey": "YOUR_GEMINI_API_KEY"
}
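The relevance score itself is assigned during extraction; as a rough mental model, title matches rank higher than description-only matches. The sketch below is an approximation for intuition, not the actor's actual scoring logic:

```javascript
// Approximate mental model of keyword relevance (NOT the actor's real
// implementation): keyword in title => high, in description => medium,
// otherwise low.
function approximateRelevance(job, keywords) {
  const title = (job.title || '').toLowerCase();
  const description = (job.description || '').toLowerCase();
  const lowered = keywords.map((keyword) => keyword.toLowerCase());
  if (lowered.some((keyword) => title.includes(keyword))) return 'high';
  if (lowered.some((keyword) => description.includes(keyword))) return 'medium';
  return 'low';
}
```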

With known career URLs

If you already know the career page URLs, provide them directly to skip discovery.

{
  "companies": [
    {
      "company_id": "example-ag",
      "company_name": "Example AG",
      "career_urls": [
        "https://example.ch/karriere",
        "https://example.personio.de/"
      ]
    }
  ],
  "llmProvider": "gemini",
  "geminiApiKey": "YOUR_GEMINI_API_KEY"
}

Full parameter reference

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| companies | array | required | Companies to extract jobs from. Each item needs company_id and company_name, plus at least one of website_url, career_urls, or career_page_url. |
| llmProvider | string | "gemini" | Primary AI provider: gemini, groq, or openrouter |
| fallbackProvider | string | -- | Second-level fallback provider (used if the primary fails) |
| fallback2Provider | string | -- | Third-level fallback provider |
| geminiApiKey | string | -- | API key for Google Gemini. Get one free at aistudio.google.com/apikey |
| llmApiKey | string | -- | API key for Groq or OpenRouter |
| openrouterApiKey | string | -- | API key for OpenRouter |
| outputLanguage | string | "auto" | Output language: en, de, fr, it, es, pt, nl, or auto (matches website language) |
| maxPagesPerCompany | integer | 5 | Max pages to process per company, including pagination (range: 1-20) |
| maxConcurrency | integer | 3 | Parallel HTTP requests (range: 1-10) |
| jobKeywords | string[] | -- | Filter jobs by keywords. When set, each job gets a relevance score. |
| webhookUrl | string | -- | URL to POST results to when extraction completes |
| proxyConfiguration | object | -- | Apify proxy settings. Datacenter proxies work for 95%+ of career pages. |

Company input fields

Each company object supports these fields:

| Field | Required | Description |
| --- | --- | --- |
| company_id | Yes | Your internal identifier (UUID or any string) |
| company_name | Yes | Company name (used as context for AI extraction) |
| website_url | No | Company homepage -- the actor discovers career pages from here |
| career_urls | No | Array of known career page URLs |
| career_page_url | No | Single career page URL (treated as highest priority) |
| website_domain | No | Domain for subdomain probing (e.g., example.ch triggers checks on jobs.example.ch, karriere.example.ch) |

At least one of website_url, career_urls, or career_page_url must be provided.
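A quick client-side check of these rules before submitting a run might look like this (a convenience sketch; the actor performs its own input validation):

```javascript
// Sketch of the documented input rules: company_id and company_name are
// required, plus at least one of the three URL fields.
function validateCompany(company) {
  const errors = [];
  if (!company.company_id) errors.push('company_id is required');
  if (!company.company_name) errors.push('company_name is required');
  const hasUrl = Boolean(
    company.website_url ||
    (company.career_urls && company.career_urls.length > 0) ||
    company.career_page_url,
  );
  if (!hasUrl) errors.push('provide website_url, career_urls, or career_page_url');
  return errors;
}
```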

Output format

Each extracted job is pushed as a separate item to the dataset.

{
  "company_id": "example-ag",
  "company_name": "Example AG",
  "title": "Senior Software Engineer (m/w/d)",
  "description": "Entwicklung von Web-Applikationen mit React und Node.js...",
  "location": "Zurich",
  "employment_type": "Full-time",
  "experience_level": "Senior",
  "salary_range": "100'000-130'000 CHF",
  "department": "Engineering",
  "requirements": ["5+ years experience", "React", "Node.js", "TypeScript"],
  "benefits": ["Remote work", "Training budget", "Flexible hours"],
  "source_url": "https://example.ch/karriere",
  "application_url": "https://example.ch/apply/senior-swe",
  "posted_at": "2026-03-01",
  "workplace_type": "hybrid",
  "confidence": 0.95,
  "relevance": "high",
  "ats_system": "personio",
  "extracted_at": "2026-03-08T10:30:00Z"
}

Output fields

| Field | Type | Description |
| --- | --- | --- |
| company_id | string | The company ID from your input |
| company_name | string | The company name from your input |
| title | string | Job title |
| description | string | Job description (may be summarized for long listings) |
| location | string | Work location |
| employment_type | string | Full-time, Part-time, Contract, Internship, etc. |
| experience_level | string | Junior, Mid, Senior, Lead, etc. |
| salary_range | string | Salary range if listed on the page |
| department | string | Department or team |
| requirements | string[] | Required skills and qualifications |
| benefits | string[] | Listed benefits and perks |
| source_url | string | URL of the career page where the job was found |
| application_url | string | Direct application link (if available) |
| posted_at | string | Posting date (if listed) |
| workplace_type | string | remote, hybrid, on-site |
| confidence | number | Extraction confidence score (0.0-1.0) |
| relevance | string | Keyword relevance: high, medium, or low (only when jobKeywords is set) |
| ats_system | string | Detected ATS system (e.g., personio, greenhouse) |
| extracted_at | string | ISO timestamp of extraction |
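Downstream, the confidence and relevance fields make convenient post-filters. For example, keeping only confidently extracted, non-low-relevance jobs (the 0.8 threshold here is an arbitrary illustration, not a recommendation from the actor):

```javascript
// Keep real job items (title set), drop low-confidence extractions, and,
// when keyword filtering was enabled, drop low-relevance matches.
// Items without a `relevance` field (no jobKeywords set) are kept.
function filterJobs(items, minConfidence = 0.8) {
  return items.filter((job) =>
    job.title !== null &&
    job.confidence >= minConfidence &&
    job.relevance !== 'low');
}
```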

Supported ATS systems

The actor detects and follows links to these Applicant Tracking Systems:

| ATS System | Domain |
| --- | --- |
| Personio | personio.de, personio.com |
| Softgarden | softgarden.io |
| Greenhouse | greenhouse.io |
| Lever | lever.co |
| BambooHR | bamboohr.com |
| JOIN | join.com |
| Recruitee | recruitee.com |
| d.vinci | dvinci-hr.com |
| Workwise | workwise.io |
| Factorial | factorial.ch, factorial.co |
| CVManager | cvmanager.ch |
| Onlyfy (by XING) | onlyfy.jobs |
| Jobbase | jobbase.io |
| Teamtailor | teamtailor.com |
| SmartRecruiters | smartrecruiters.com |
| Talentsoft | talent-soft.com |
| rexx systems | rexx-systems.com |
| Concludis | concludis.de |
| Coveto | coveto.de |

When an ATS portal is detected via iframe, link, or HTML pattern, the actor automatically navigates to the portal page and extracts jobs from there.
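URL-pattern detection, one of the signals mentioned above, can be sketched against a few of the supported domains (the actor also inspects iframes, links, and HTML signatures, which this sketch omits):

```javascript
// Illustrative hostname matching against a subset of the supported
// ATS domains from the table above.
const ATS_DOMAINS = {
  personio: ['personio.de', 'personio.com'],
  greenhouse: ['greenhouse.io'],
  lever: ['lever.co'],
  softgarden: ['softgarden.io'],
};

function detectAts(url) {
  const host = new URL(url).hostname;
  for (const [ats, domains] of Object.entries(ATS_DOMAINS)) {
    if (domains.some((d) => host === d || host.endsWith('.' + d))) return ats;
  }
  return null; // no known ATS domain in this URL
}
```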

LLM providers

The actor uses an LLM to parse HTML into structured job data. You can choose from three providers, and configure a fallback chain for resilience.

| Provider | Free Tier | Speed | Model | Best For |
| --- | --- | --- | --- | --- |
| Gemini (recommended) | 1M tokens/min | Fast | gemini-2.0-flash | Most users -- generous free tier |
| Groq | 30 requests/min | Very fast | llama-3.1-8b-instant | High-speed extraction |
| OpenRouter | Some free models | Varies | mistral-small-3.1-24b | Fallback / alternative models |

Recommended setup: Use Gemini as primary with Groq as fallback.

{
  "llmProvider": "gemini",
  "geminiApiKey": "YOUR_GEMINI_KEY",
  "fallbackProvider": "groq",
  "llmApiKey": "YOUR_GROQ_KEY"
}

Get a free Gemini API key at aistudio.google.com/apikey.
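Conceptually, the fallback chain just tries providers in order until one succeeds. A synchronous sketch (the real calls are asynchronous, and callProvider is a hypothetical stand-in for the actor's internal LLM calls):

```javascript
// Try each provider in order; return the first successful result.
// `callProvider` is a hypothetical stand-in for the actual LLM call,
// synchronous here only for brevity.
function extractWithFallback(html, providers, callProvider) {
  let lastError = new Error('no providers configured');
  for (const provider of providers) {
    try {
      return callProvider(provider, html);
    } catch (err) {
      lastError = err; // fall through to the next provider in the chain
    }
  }
  throw lastError;
}
```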

Pricing

This actor uses pay-per-result pricing on the Apify platform:

| Event | Price |
| --- | --- |
| Company processed | $0.01 per company |
| Job extracted | $0.004 per job |

Standard Apify platform costs (compute and proxy) apply on top.

Example: Extracting jobs from 100 companies that have a total of 500 open positions would cost $1.00 (companies) + $2.00 (jobs) = $3.00 plus platform fees.
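The arithmetic above generalizes to a one-line estimator (platform compute and proxy fees excluded):

```javascript
// Estimate actor cost from the two documented per-result prices.
// Platform compute/proxy costs are not included.
const PRICE_PER_COMPANY = 0.01;
const PRICE_PER_JOB = 0.004;

function estimateCostUsd(companyCount, jobCount) {
  return companyCount * PRICE_PER_COMPANY + jobCount * PRICE_PER_JOB;
}

console.log(estimateCostUsd(100, 500)); // ≈ 3.00, matching the example above
```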

Use cases

  • Recruitment analytics -- monitor hiring activity across companies in your market
  • Competitive intelligence -- track what roles competitors are hiring for
  • Lead generation -- identify companies that are actively hiring (buying signal)
  • Job aggregation -- build a niche job board by extracting listings from company career pages
  • Market research -- analyze hiring trends by region, industry, or role type

Combine with Google Maps Scraper

Chain this actor with the Google Maps Scraper for a complete pipeline:

Google Maps Scraper --> find companies with websites --> Website Job Extractor --> structured job listings

The Google Maps Scraper has a built-in enableJobExtraction option that automatically chains results to this actor.
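If you drive the chain yourself instead of using enableJobExtraction, mapping Maps results into this actor's companies input is straightforward. The place field names below (title, website) are assumptions about the Maps actor's output, not guaranteed by this document:

```javascript
// Map Google Maps Scraper results into this actor's `companies` input.
// `title` and `website` are assumed field names on the Maps results.
function toCompanies(places) {
  return places
    .filter((place) => place.website) // only companies with a website
    .map((place, index) => ({
      company_id: `maps-${index}`,
      company_name: place.title,
      website_url: place.website,
    }));
}
```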

JavaScript-rendered pages

This actor is HTTP-only and does not run a browser. Career pages built with JavaScript frameworks (React, Vue, Angular) may render no content in their initial HTML. When this happens, the actor automatically detects JS-rendering signals and flags affected companies in the output.

How detection works

After fetching each page, the actor inspects the raw HTML for framework markers, empty root elements, and noscript warnings. If JS-rendering indicators are found and 0 jobs are extracted for that company, a sentinel item is pushed to the dataset.

Detection signals

| Signal | Meaning |
| --- | --- |
| react_root_empty | <div id="root"> or <div id="__next"> with near-empty content |
| react_markers | data-reactroot, __NEXT_DATA__, or _next/static scripts |
| vue_markers | data-v- attributes or Vue.js script references |
| angular_markers | ng-version or ng-app attributes |
| low_text_ratio | Less than 200 chars of text in 5,000+ chars of HTML |
| noscript_warning | <noscript> block asking the user to enable JavaScript |
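Two of these signals are easy to reproduce as standalone heuristics. The sketches below mirror the documented thresholds but are not the actor's exact implementation:

```javascript
// Illustrative versions of two signals from the table above.
// Empty React/Next.js root element:
function hasEmptyReactRoot(html) {
  return /<div id="(root|__next)">\s*<\/div>/.test(html);
}

// Low text-to-HTML ratio: under 200 chars of text in 5000+ chars of HTML.
function hasLowTextRatio(html) {
  const text = html.replace(/<[^>]*>/g, '').replace(/\s+/g, ' ').trim();
  return html.length >= 5000 && text.length < 200;
}
```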

Sentinel output example

When a company's career page requires JS rendering, the dataset will contain:

{
  "company_id": "example-ag",
  "company_name": "Example AG",
  "title": null,
  "source_url": "https://example.com/karriere",
  "js_rendering_suspected": true,
  "js_indicators": ["react_root_empty", "react_markers", "noscript_warning"],
  "extracted_at": "2026-03-09T10:00:00Z"
}

Use the sentinel to build a two-stage pipeline:

Website Job Extractor (HTTP, fast, 128MB)
├─ Normal jobs → use directly
└─ Sentinel (js_rendering_suspected) → re-process with Playwright actor (coming soon)

Why separate actors? The HTTP actor uses a minimal Docker image (~128MB) with no browser. A Playwright actor needs Chrome (~1024MB) and costs ~10x more per compute unit. Keeping them separate lets you run the cheap HTTP pass on all companies and only pay for browser rendering on the ~28% that need it.

Filter sentinels in code

import { Actor } from 'apify';

const dataset = await Actor.openDataset();
const { items } = await dataset.getData();

// Real jobs have a title; sentinels have title: null and a flag.
const jobs = items.filter((item) => item.title !== null);
const sentinels = items.filter((item) => item.js_rendering_suspected === true);

// Re-run sentinels with a Playwright-based actor
for (const sentinel of sentinels) {
  console.log(`Needs browser: ${sentinel.company_name} (${sentinel.js_indicators.join(', ')})`);
}

Limitations

  • HTTP-only: This actor does not run a browser. JS-rendered career pages are automatically detected and flagged (see JavaScript-rendered pages above). ATS portals that serve server-rendered HTML (Personio, Greenhouse, Softgarden, BambooHR, Factorial, and most others) work reliably.
  • DACH coverage is highest: Approximately 72% of DACH (Germany, Austria, Switzerland) company career pages work with HTTP-only extraction. International coverage is around 40-50%, depending on how many companies use JS-rendered career pages.
  • LLM API key required: You need to provide at least one LLM API key. Gemini offers a generous free tier.
  • Page budget: The maxPagesPerCompany setting limits how many pages are processed per company (default 5). Companies with hundreds of job listings spread across many pagination pages may need a higher budget.
  • No login-protected pages: The actor cannot access career pages behind authentication.