SEPE Spain Job Scraper – Ofertas de Empleo ES
Pricing
from $1.00 / 1,000 results
Scrapes job offers from SEPE Spain (sepe.es) with stealth Camoufox, proxy rotation, skills filtering (regex + TF-IDF ML), deduplication, change-detection alerts, and Prometheus metrics export. Supports province/CCAA filtering with Vizcaya/Bizkaia focus.
Developer
David Cortes
SEPE Spain Job Scraper – Ofertas de Empleo ES Pro
An Apify Actor for scraping SEPE Spain job offers with stealth browsing, smart skills filtering, and Kubernetes-ready Prometheus metrics.
- Anti-bot measures: Camoufox (Firefox stealth) + Apify residential proxies + random delays + cookie handling
- Smart skills filter: regex + scikit-learn TF-IDF cosine similarity (catches "contenedores" → Docker, "orquestación" → Kubernetes)
- Change-detection alerts: compares every run against the previous one → new / changed / removed offers
- Deduplication: SHA-256 hash per offer, persisted across runs
- K8s-ready: Prometheus metrics exported to KV store (scrapeable by any Prometheus server)
- Province focus: Vizcaya / Bizkaia by default, all 52 Spanish provinces supported
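The deduplication feature listed above can be pictured with a short sketch. It assumes the hash covers a few stable fields of the output record; the field selection and the names `HASH_FIELDS`, `offer_hash`, and `filter_new` are hypothetical, and the Actor's `dedup.py` may hash different fields.

```python
import hashlib
import json

# Assumption: these output fields identify an offer; dedup.py may use others.
HASH_FIELDS = ("url", "titulo_oferta", "empresa", "provincia")


def offer_hash(offer: dict) -> str:
    """Return a SHA-256 hex digest over a canonical JSON form of the key fields."""
    key = {field: offer.get(field) for field in HASH_FIELDS}
    canonical = json.dumps(key, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def filter_new(offers: list, seen: set) -> list:
    """Keep offers whose hash is not yet in `seen`, recording each new hash."""
    fresh = []
    for offer in offers:
        digest = offer_hash(offer)
        if digest not in seen:
            seen.add(digest)
            fresh.append(offer)
    return fresh
```

Persisting `seen` between runs (for example in the Key-Value Store) is what would make the deduplication cross-run.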
Output Example
```json
{
  "url": "https://www.sepe.es/HomeSepe/Personas/encontrar-empleo/...",
  "titulo_oferta": "DevOps Engineer – Kubernetes / AWS",
  "empresa": "Tecnología Vasca S.L.",
  "provincia": "Vizcaya",
  "salario": "35.000 € - 50.000 €/año",
  "skills_requeridas": ["Kubernetes", "Docker", "AWS", "Terraform", "Linux", "CI/CD"],
  "fecha_publicacion": "2026-04-15",
  "enlace_aplicar": "https://www.sepe.es/HomeSepe/Personas/encontrar-empleo/.../solicitar"
}
```
Input Schema
| Field | Type | Default | Description |
|---|---|---|---|
| `start_urls` | array | SEPE national pages | Extra entry-point URLs |
| `provincias` | array | `["Vizcaya","Bizkaia"]` | Province/CCAA filter (52 provinces supported) |
| `skills` | array | `["Kubernetes","Docker","Python"]` | Skills to filter by (regex + ML) |
| `max_pages` | int | `50` | Max listing pages per entry point |
| `use_ml_skills` | bool | `true` | Enable TF-IDF ML skill matching |
| `use_proxy` | bool | `true` | Use Apify residential proxy |
| `proxy_groups` | array | `["RESIDENTIAL"]` | Proxy groups |
| `proxy_country` | string | `"ES"` | Proxy country (ES = Spanish IP) |
| `headless` | bool | `true` | Headless browser mode |
| `min_delay` | float | `2.0` | Min delay between requests (s) |
| `max_delay` | float | `5.0` | Max delay between requests (s) |
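As an illustration of how these defaults combine with user input, here is a minimal sketch that overlays a partial input on the defaults from the table. The function `resolve_input` and its validation rules are assumptions, not the Actor's actual code; `start_urls` is left out because its default value is not spelled out here.

```python
from typing import Optional

# Defaults copied from the input schema table (start_urls omitted: its
# default, "SEPE national pages", is not given as concrete URLs).
DEFAULT_INPUT = {
    "provincias": ["Vizcaya", "Bizkaia"],
    "skills": ["Kubernetes", "Docker", "Python"],
    "max_pages": 50,
    "use_ml_skills": True,
    "use_proxy": True,
    "proxy_groups": ["RESIDENTIAL"],
    "proxy_country": "ES",
    "headless": True,
    "min_delay": 2.0,
    "max_delay": 5.0,
}


def resolve_input(user_input: Optional[dict]) -> dict:
    """Overlay user-supplied fields on the defaults and sanity-check them."""
    merged = {**DEFAULT_INPUT, **(user_input or {})}
    # Hypothetical checks; the Actor may validate differently.
    if merged["min_delay"] > merged["max_delay"]:
        raise ValueError("min_delay must not exceed max_delay")
    if merged["max_pages"] < 1:
        raise ValueError("max_pages must be at least 1")
    return merged
```

With this sketch, `resolve_input({"max_pages": 5})` keeps every other default, so a small input file only needs the fields that differ.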
Quick Start
Run locally
```bash
# 1. Clone / enter the project
cd sepe-empleo-es-pro

# 2. Create virtual env and install dependencies
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt

# 3. Install Camoufox browser binaries
python -m camoufox fetch

# 4. Put test input in place
cp test_input.json storage/key_value_stores/default/INPUT.json

# 5. Run
python -m my_actor
```
Run on Apify
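```bash
# Login (one-time)
apify login

# Push and deploy
apify push

# Run on the platform with test input
# (apify run executes the Actor locally; apify call runs the deployed Actor)
apify call --input-file test_input.json
```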
Run via API
```bash
curl -X POST \
  "https://api.apify.com/v2/acts/YOUR_USERNAME~sepe-empleo-es-pro/runs" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -d '{
    "provincias": ["Vizcaya", "Bizkaia"],
    "skills": ["Kubernetes", "Python", "DevOps"],
    "max_pages": 50
  }'
```
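The same call can be made from Python. This is a minimal sketch using only the standard library; it mirrors the endpoint and headers of the curl command, and the helper name `build_run_request` is hypothetical. It only builds the request, so nothing is sent until you pass it to `urllib.request.urlopen`.

```python
import json
import urllib.request


def build_run_request(username: str, token: str, actor_input: dict) -> urllib.request.Request:
    """Build the POST request that starts a run (same endpoint as the curl call)."""
    url = f"https://api.apify.com/v2/acts/{username}~sepe-empleo-es-pro/runs"
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
```

To start a run, call `urllib.request.urlopen(build_run_request("YOUR_USERNAME", "YOUR_TOKEN", {"max_pages": 50}))` with a real username and token.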
Architecture
```
my_actor/
├── main.py            # Actor entry point, crawler setup, post-processing
├── routes.py          # Crawlee router: NAV / LIST / DETAIL handlers + XHR interception
├── extractors.py      # Multi-selector SEPE data extraction with JSON-LD + regex fallbacks
├── skills_matcher.py  # Regex + TF-IDF scikit-learn skills detection (30+ tech skills)
├── dedup.py           # SHA-256 offer deduplication, cross-run persistence
├── alerts.py          # Change-detection: new / changed / removed offers diff
├── metrics.py         # Prometheus metrics (counters, gauges, histograms)
└── config.py          # Province codes (52), SEPE URLs, CSS selectors, rate-limit settings
```
Anti-bot Stack
| Layer | Technology | Config |
|---|---|---|
| Browser fingerprint | Camoufox (Firefox stealth) | os=windows/macos, locale=es-ES, geoip=true |
| IP rotation | Apify Residential proxies | countryCode=ES (Spanish IPs) |
| Timing | Random delays | 2–5 s per request (configurable) |
| Detection evasion | Cookie auto-accept | Handles SEPE's cookie banner |
| CAPTCHA detection | Text/title heuristics | Auto-retry on fresh session |
| Header generation | Camoufox built-in | Realistic browser headers |
| Scrolling | JS scroll simulation | Triggers lazy-loaded content |
| XHR interception | Playwright response hook | Catches SEPE's JSON API calls |
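The timing layer in the table can be sketched in a few lines. This assumes the delay is drawn uniformly between `min_delay` and `max_delay`; the helper names are hypothetical and the Actor may use a different distribution.

```python
import random
import time
from typing import Optional


def next_delay(min_delay: float = 2.0, max_delay: float = 5.0,
               rng: Optional[random.Random] = None) -> float:
    """Pick a delay in seconds, uniformly between the two bounds."""
    if min_delay > max_delay:
        raise ValueError("min_delay must not exceed max_delay")
    source = rng if rng is not None else random
    return source.uniform(min_delay, max_delay)


def polite_sleep(min_delay: float = 2.0, max_delay: float = 5.0) -> float:
    """Sleep for a random delay before the next request and return its length."""
    delay = next_delay(min_delay, max_delay)
    time.sleep(delay)
    return delay
```

Passing a seeded `random.Random` to `next_delay` makes the sequence reproducible, which helps when testing crawl timing.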
Skills Matching Pipeline
```
Input text (job description)
          │
          ▼
┌───────────────────┐     ┌──────────────────────────┐
│ Regex matcher     │     │ TF-IDF cosine similarity │
│ (30+ skills,      │  +  │ (scikit-learn, threshold │
│ 50+ aliases)      │     │ 0.25, ngram 1-2)         │
└───────────────────┘     └──────────────────────────┘
          │                             │
          └──────────────┬──────────────┘
                         ▼
               Canonical skill names
          ["Kubernetes","Docker","Python"]
```
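The two-stage pipeline can be approximated as follows. This is a sketch under stated assumptions, not the Actor's `skills_matcher.py`: it swaps scikit-learn for a small pure-Python TF-IDF (unigrams only, smoothed IDF) so that it runs without dependencies, and the `ALIASES` and `PROFILES` tables are hypothetical stand-ins for the real taxonomy.

```python
import math
import re
from collections import Counter

# Hypothetical alias patterns; the Actor's real taxonomy is not shown here.
ALIASES = {
    "Kubernetes": [r"kubernetes", r"k8s"],
    "Docker": [r"docker"],
    "Python": [r"python"],
}
# Hypothetical reference text per skill, used only by the TF-IDF stage.
PROFILES = {
    "Kubernetes": "kubernetes k8s orquestación",
    "Docker": "docker contenedores imágenes",
    "Python": "python django flask",
}


def tokenize(text: str) -> list:
    return re.findall(r"\w+", text.lower())


def regex_match(text: str) -> set:
    """Stage 1: a skill matches if any alias occurs as a whole word."""
    return {
        skill
        for skill, patterns in ALIASES.items()
        if any(re.search(rf"\b{p}\b", text, re.IGNORECASE) for p in patterns)
    }


def tfidf_vectors(docs: list) -> list:
    """Unigram TF-IDF with smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    tokenized = [tokenize(doc) for doc in docs]
    n = len(tokenized)
    df = Counter(term for doc in tokenized for term in set(doc))
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    return [{term: tf * idf[term] for term, tf in Counter(doc).items()}
            for doc in tokenized]


def cosine(a: dict, b: dict) -> float:
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def match_skills(text: str, threshold: float = 0.25) -> list:
    """Stage 2 plus merge: union of regex hits and sufficiently similar profiles."""
    skills = list(PROFILES)
    *profile_vecs, query = tfidf_vectors([PROFILES[s] for s in skills] + [text])
    similar = {s for s, vec in zip(skills, profile_vecs)
               if cosine(query, vec) >= threshold}
    return sorted(regex_match(text) | similar)
```

With these toy profiles, `match_skills("contenedores imágenes")` returns `["Docker"]` even though the word "Docker" never appears, which is the kind of match the regex stage alone would miss. Longer texts dilute the similarity score, so the threshold and profile wording need tuning on real offers.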
Prometheus Metrics (K8s-ready)
Metrics are exported in standard Prometheus text format to the Actor's Key-Value Store under the key prometheus_metrics. Retrieve them via:
```bash
curl "https://api.apify.com/v2/key-value-stores/STORE_ID/records/prometheus_metrics" \
  -H "Authorization: Bearer YOUR_TOKEN"
```
Available metrics
| Metric | Type | Description |
|---|---|---|
| `sepe_offers_scraped_total` | Counter | Total offers stored |
| `sepe_offers_new_total` | Counter | New vs previous run |
| `sepe_offers_changed_total` | Counter | Changed offers |
| `sepe_offers_removed_total` | Counter | Removed offers |
| `sepe_requests_total{status}` | Counter | Requests by status (success/failed/retried) |
| `sepe_pages_skipped_duplicates_total` | Counter | Dedup skips |
| `sepe_skills_matched_total{skill}` | Counter | Matches per skill |
| `sepe_offers_in_dataset` | Gauge | Current dataset size |
| `sepe_dedup_ratio` | Gauge | Duplicate ratio (0–1) |
| `sepe_pages_crawled_total` | Gauge | Pages visited |
| `sepe_proxy_errors_total` | Gauge | Proxy/network errors |
| `sepe_scrape_duration_seconds` | Histogram | Total run duration |
| `sepe_page_load_duration_seconds` | Histogram | Per-page load time |
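To show what the stored record might look like, here is a minimal sketch that renders counters and gauges in the Prometheus text exposition format. The helper names are hypothetical, histograms are left out for brevity, and the Actor's `metrics.py` may well rely on a client library instead.

```python
def escape_label(value: str) -> str:
    """Escape backslash, double quote, and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name: str, value, labels=None) -> str:
    """Render one sample line, e.g. name{status="success"} 3."""
    if labels:
        label_text = ",".join(
            f'{key}="{escape_label(str(val))}"' for key, val in sorted(labels.items())
        )
        return f"{name}{{{label_text}}} {value}"
    return f"{name} {value}"


def render_metrics(metrics: list) -> str:
    """Render (name, type, [(labels, value), ...]) tuples as exposition text."""
    lines = []
    for name, kind, samples in metrics:
        lines.append(f"# TYPE {name} {kind}")
        for labels, value in samples:
            lines.append(format_sample(name, value, labels))
    return "\n".join(lines) + "\n"
```

The resulting string is what would be written under the `prometheus_metrics` key, one `# TYPE` line per metric followed by its samples.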
Kubernetes scraping example
```yaml
# prometheus-scrape-config.yaml
- job_name: sepe_scraper
  metrics_path: /v2/key-value-stores/STORE_ID/records/prometheus_metrics
  scheme: https
  bearer_token: YOUR_APIFY_TOKEN
  static_configs:
    - targets: [api.apify.com]
```
Change Detection Alerts
After each run an alerts_report is saved to the Key-Value Store and also pushed to the dataset as a record with _record_type: "alerts_summary":
```json
{
  "_record_type": "alerts_summary",
  "generated_at": "2026-04-15T10:30:00Z",
  "stats": {
    "new_count": 47,
    "changed_count": 12,
    "removed_count": 3,
    "total_current": 1024,
    "total_previous": 980
  },
  "sample_new_offers": [ ... ],
  "changed_offers": [ {"offer": {...}, "changed_fields": ["salario"]} ],
  "removed_offers": [ ... ]
}
```
Integrate with Zapier / Make / Slack via the Apify webhook → trigger on run completion.
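The new / changed / removed classification behind this report can be sketched as a diff of two snapshots. This assumes each snapshot maps a stable offer key (for example the URL) to the offer record; `diff_offers` is a hypothetical name, and the output keeps the full list of new offers rather than a sample.

```python
def diff_offers(previous: dict, current: dict) -> dict:
    """Compare two {offer_key: offer} snapshots and classify the differences."""
    new = [current[key] for key in current if key not in previous]
    removed = [previous[key] for key in previous if key not in current]

    changed = []
    for key in sorted(current.keys() & previous.keys()):
        old, cur = previous[key], current[key]
        # A field counts as changed if its value differs or it exists on one side only.
        fields = sorted(f for f in old.keys() | cur.keys() if old.get(f) != cur.get(f))
        if fields:
            changed.append({"offer": cur, "changed_fields": fields})

    return {
        "stats": {
            "new_count": len(new),
            "changed_count": len(changed),
            "removed_count": len(removed),
            "total_current": len(current),
            "total_previous": len(previous),
        },
        "new_offers": new,
        "changed_offers": changed,
        "removed_offers": removed,
    }
```

A webhook consumer could then alert only when `stats["new_count"]` is above zero.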
Province Codes Reference
All 52 Spanish provinces are supported. Examples:
| Input | Province | Code |
|---|---|---|
| `"Vizcaya"` or `"Bizkaia"` | Vizcaya / Bizkaia | `48` |
| `"Madrid"` | Madrid | `28` |
| `"Barcelona"` | Barcelona | `08` |
| `"Guipúzcoa"` or `"Gipuzkoa"` | Guipúzcoa | `20` |
| `"Valencia"` or `"València"` | Valencia | `46` |
| `"Sevilla"` | Sevilla | `41` |
Legal & Compliance
- Only public, freely accessible data is scraped
- Rate-limited to ≤ 1 request/second (configurable)
- Respects the rules in SEPE's robots.txt
- No login, no personal data, no GDPR-protected content
- Data comes from sepe.es, the website of SEPE (Servicio Público de Empleo Estatal), a Spanish public institution
Troubleshooting
| Problem | Solution |
|---|---|
| 0 offers returned | SEPE may have changed page structure; check Actor logs for CSS selector misses |
| CAPTCHA detected | Enable Apify Residential proxy (use_proxy: true, proxy_groups: ["RESIDENTIAL"]) |
| Slow runs | Increase max_concurrency in main.py or reduce max_delay |
| Missing skills | Add aliases to SKILLS_TAXONOMY in config.py |
| Stale dedup | Delete previous_offers_hashes and previous_offers_snapshot from KV store |
Deploy to Apify
```bash
apify login
apify push
```
