AmbitionBox Ultra-Fast Job Scraper
Production-grade job scraper for AmbitionBox using a Cheerio-first, Playwright-fallback architecture. Extracts job listings, enriches with job details and company data, then exports normalized, structured data to Apify Dataset.
Architecture Overview
Core Principles
- **Nuxt SSR JSON first**: extract `window.__NUXT__` from the HTML with a regex (no JavaScript execution)
- **CheerioCrawler primary**: fast, lightweight scraping for all phases
- **PlaywrightCrawler fallback**: used only when Cheerio fails to extract critical fields
- **Three-phase pipeline**: Listing → Job Detail → Company Overview
- **Deterministic URL construction**: `companyUrlName` from the Nuxt state is the single source of truth
Data Flow
```
Phase 1: Listing Extraction (CheerioCrawler)
  ↓ Extract window.__NUXT__.data[1].jobs
  ↓ Parse job listings + companyUrlName
  ↓ Store in KeyValueStore
  ↓
Phase 2: Job Detail Enrichment (CheerioCrawler)
  ↓ Extract description, rating, skills
  ↓ Resolve company URL from companyUrlName
  ↓ Update KeyValueStore
  ↓
Phase 3: Company Overview Enrichment (CheerioCrawler)
  ↓ Extract size, website, industry, description
  ↓ STRICT employee count validation
  ↓ Merge job + company data
  ↓ Calculate confidence score
  ↓
Export to Apify Dataset
```
Performance Targets
- Concurrency: 40 requests
- Throughput: 1200 requests/minute
- Timeouts: 20s handler, 30s navigation
- Retries: Max 1, on [429, 500, 502, 503]
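As a sketch (not the actor's actual configuration), the targets above map onto `CheerioCrawler` options like this; the option names exist in Crawlee's crawler API, while the retry-on-status behavior is handled by Crawlee's error handling rather than a dedicated option:

```javascript
// Sketch: performance targets expressed as CheerioCrawler options.
const crawlerOptions = {
  maxConcurrency: 40,            // up to 40 parallel requests
  maxRequestsPerMinute: 1200,    // global throughput cap
  requestHandlerTimeoutSecs: 20, // per-handler timeout
  navigationTimeoutSecs: 30,     // HTTP request (navigation) timeout
  maxRequestRetries: 1,          // at most one retry per request
};
```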
Project Structure
```
cherro-scrapper/
├── src/
│   └── main.js              # Main orchestration
├── routes/
│   ├── listing.js           # Phase 1: Listing extraction
│   ├── jobDetail.js         # Phase 2: Job detail enrichment
│   └── company.js           # Phase 3: Company overview enrichment
├── utils/
│   ├── nuxtParser.js        # Nuxt state extraction
│   ├── validators.js        # Data validation (strict rules)
│   ├── normalizers.js       # Data normalization
│   └── confidenceScore.js   # Quality scoring
├── .actor/
│   ├── actor.json           # Apify actor configuration
│   └── input_schema.json    # Input schema
├── package.json
├── Dockerfile
├── .env.example
└── README.md
```
Installation
Local Development
```bash
# Clone repository
cd cherro-scrapper

# Install dependencies
npm install

# Copy environment template
cp .env.example .env
# Edit .env with your configuration
# (Optional: add APIFY_TOKEN for local testing)

# Run scraper
npm start
```
Apify Deployment
```bash
# Install Apify CLI
npm install -g apify-cli

# Login to Apify
apify login

# Push to Apify
apify push

# Run on the Apify platform:
# navigate to https://console.apify.com/actors
```
Configuration
Input Parameters
Configure via Apify Console or INPUT.json:
```json
{
  "startUrls": [
    "https://www.ambitionbox.com/jobs",
    "https://www.ambitionbox.com/jobs?q=software+engineer"
  ],
  "maxConcurrency": 40,
  "maxRequestsPerMinute": 1200,
  "requestHandlerTimeoutSecs": 20
}
```
Environment Variables
See .env.example for local testing configuration.
Data Schema
Output Format
Each job record in the dataset contains:
```json
{
  "jobId": "12345",
  "title": "Senior Software Engineer",
  "companyName": "Example Corp",
  "companyUrlName": "example-corp",
  "location": "Bangalore",
  "postedDate": "2025-12-15",
  "salary": { "min": 1500000, "max": 2500000, "currency": "INR" },
  "experience": { "min": 3, "max": 5 },
  "description": "Job description text...",
  "skills": ["JavaScript", "React", "Node.js"],
  "companyRating": 4.2,
  "employeeCount": { "min": 201, "max": 500, "raw": "201-500" },
  "companyWebsite": "https://example.com",
  "industry": "Information Technology",
  "companyDescription": "Company description text...",
  "headquarters": "Bangalore, India",
  "confidenceScore": 87.5,
  "confidenceLevel": "GOOD",
  "scrapedAt": "2025-12-18T09:44:20.000Z",
  "sourceUrl": "https://www.ambitionbox.com/jobs"
}
```
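The nested `salary` object is produced by normalization in `utils/normalizers.js`; the exact parsing is not shown in this README, but one plausible rule (a hedged sketch — `normalizeSalary` is a hypothetical name, and only the "X-Y Lakhs" pattern is handled here) looks like this:

```javascript
// Hypothetical sketch: convert a raw salary string such as
// "15-25 Lakhs PA" into the { min, max, currency } shape shown above.
// One lakh = 100,000 INR.
function normalizeSalary(raw) {
  if (!raw) return null;
  const match = raw.match(/([\d.]+)\s*-\s*([\d.]+)\s*lakh/i);
  if (!match) return null; // unrecognized formats stay unnormalized
  return {
    min: Math.round(parseFloat(match[1]) * 100000),
    max: Math.round(parseFloat(match[2]) * 100000),
    currency: 'INR',
  };
}
```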
Confidence Scoring
Data quality score (0-100) based on field completeness:
- 90-100: EXCELLENT - All mandatory and most optional fields present
- 75-89: GOOD - All mandatory fields + some enrichment
- 60-74: FAIR - Mandatory fields present, limited enrichment
- 40-59: POOR - Some mandatory fields missing
- 0-39: VERY_POOR - Multiple mandatory fields missing
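The band boundaries above can be restated as a small mapping function (a sketch — the real field weighting lives in `utils/confidenceScore.js`; this only shows how a score maps to a level):

```javascript
// Sketch: map a 0-100 completeness score to the confidence bands above.
function confidenceLevel(score) {
  if (score >= 90) return 'EXCELLENT';
  if (score >= 75) return 'GOOD';
  if (score >= 60) return 'FAIR';
  if (score >= 40) return 'POOR';
  return 'VERY_POOR';
}
```

For example, the sample record's `confidenceScore` of 87.5 falls in the `GOOD` band.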
Critical Implementation Details
Employee Count Validation
STRICT RULES (implemented in utils/validators.js):
✅ ACCEPT:
- Ranges: `"201-500"`, `"1-10"`
- Lakh format: `"1 Lakh+"`, `"2 Lakhs"`
- Large numbers: `"10,000+"`, `"5000"`
- K values ≥ 100: `"100k"`, `"500k"`
❌ REJECT:
- Contains "follow": `"5.6k followers"`
- K values < 100: `"5.6k"`, `"10k"`, `"50k"`
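The accept/reject rules above can be sketched as a single predicate (an illustrative re-statement; the actual implementation lives in `utils/validators.js` and may differ in detail):

```javascript
// Sketch of the strict employee-count rules listed above.
function isValidEmployeeCount(raw) {
  if (!raw) return false;
  const value = raw.trim().toLowerCase();
  if (value.includes('follow')) return false;      // reject "5.6k followers"
  if (/^\d+\s*-\s*\d+$/.test(value)) return true;  // "201-500", "1-10"
  if (/lakh/.test(value)) return true;             // "1 lakh+", "2 lakhs"
  const kMatch = value.match(/^([\d.]+)k\+?$/);    // "100k", "5.6k"
  if (kMatch) return parseFloat(kMatch[1]) >= 100; // reject k values < 100
  if (/^[\d,]+\+?$/.test(value)) return true;      // "10,000+", "5000"
  return false;
}
```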
Company URL Resolution
Priority Order:
1. `companyUrlName` from the Nuxt state (single source of truth)
2. Extract from the job detail page anchor
3. Construct a slug from the company name (last resort)

Format: `https://www.ambitionbox.com/overview/{companyUrlName}-overview`
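The priority order can be sketched as a fallback chain (`resolveCompanyUrl` and `slugify` are hypothetical helper names, not necessarily the ones used in the codebase):

```javascript
// Turn a company name into a URL slug, e.g. "Example Corp" -> "example-corp".
function slugify(name) {
  return name
    .toLowerCase()
    .trim()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-|-$/g, '');
}

// Sketch of the three-step priority order above.
function resolveCompanyUrl({ companyUrlName, anchorSlug, companyName }) {
  // 1. Nuxt state slug (single source of truth)
  // 2. Slug scraped from the job detail page anchor
  // 3. Slug constructed from the company name (last resort)
  const slug = companyUrlName || anchorSlug || slugify(companyName || '');
  if (!slug) return null;
  return `https://www.ambitionbox.com/overview/${slug}-overview`;
}
```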
Nuxt State Extraction
Method: Regex-based extraction from HTML string
```javascript
// Extract window.__NUXT__ = {...}
// Note: a lazy ({.+?})\s*;? would stop at the first "}", truncating the
// nested object, so anchor the match on the closing </script> instead.
const nuxtRegex = /window\.__NUXT__\s*=\s*({.+?});?\s*<\/script>/s;
const match = html.match(nuxtRegex);
const nuxtState = JSON.parse(match[1]);

// Navigate to jobs
const jobs = nuxtState.data[1].jobs;
```
No JavaScript execution is required, so this works inside CheerioCrawler.
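In practice the extraction is worth wrapping defensively, since a failed parse is exactly the signal to fall back to Playwright. A sketch (`extractNuxtJobs` is a hypothetical name; the regex is anchored on the closing `</script>` so nested braces don't truncate the match):

```javascript
// Defensive Nuxt-state extraction: returns the jobs array, or null when
// the state cannot be located or parsed (the Playwright-fallback signal).
function extractNuxtJobs(html) {
  const match = html.match(/window\.__NUXT__\s*=\s*({.+?});?\s*<\/script>/s);
  if (!match) return null;
  try {
    const state = JSON.parse(match[1]);
    return state?.data?.[1]?.jobs ?? null;
  } catch {
    // Nuxt sometimes serializes state as a function expression rather than
    // plain JSON; JSON.parse fails and the browser fallback must evaluate it.
    return null;
  }
}
```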
Troubleshooting
Common Issues
Issue: No jobs found in Nuxt state
Solution:
- Check if AmbitionBox changed their Nuxt state structure
- Verify the `data[1].jobs` path is correct
- Enable debug logging to inspect the raw Nuxt state
Issue: Employee count always null
Solution:
- Check if validation rules are too strict
- Inspect raw employee count values in logs
- Adjust selectors in `routes/company.js`
Issue: Low confidence scores
Solution:
- Review field weights in `utils/confidenceScore.js`
- Check that selectors are extracting data correctly
- Verify company URLs are resolving properly
Debug Mode
Enable verbose logging:
```javascript
// In src/main.js, before constructing the crawler:
import { log, LogLevel } from 'crawlee';

log.setLevel(LogLevel.DEBUG);
```
Performance Optimization
Recommended Settings
For maximum throughput:
```json
{
  "maxConcurrency": 40,
  "maxRequestsPerMinute": 1200
}
```
For stability (avoid rate limiting):
```json
{
  "maxConcurrency": 20,
  "maxRequestsPerMinute": 600
}
```
Monitoring
Check Apify Console for:
- Request queue size
- Dataset item count
- Failed requests
- Retry histogram
Dependencies
```json
{
  "apify": "^3.1.10",
  "crawlee": "^3.7.0",
  "cheerio": "^1.0.0-rc.12"
}
```
All dependencies are official, published packages.
License
ISC
Support
For issues or questions:
- Check Apify logs for error messages
- Review this README for troubleshooting steps
- Inspect KeyValueStore for intermediate data
- Enable debug logging for detailed output
Built with: Node.js 18+, Crawlee, Apify, Cheerio
Architecture: Cheerio-first, Playwright-fallback
Performance: 40 concurrent requests, 1200 req/min throughput