Ambitionbox Job Scraper
Under maintenance
Production-grade job scraper for AmbitionBox using a **Cheerio-first, Playwright-fallback** architecture. Extracts job listings, enriches with job details and company data, then exports normalized, structured data to Apify Dataset.
Pricing: Pay per usage
Rating: 0.0 (0)
Developer: ai
Maintained by Community
Actor stats
- Bookmarked: 0
- Total users: 2
- Monthly active users: 1
- Last modified: 18 days ago
AmbitionBox Ultra-Fast Job Scraper
Production-grade job scraper for AmbitionBox using a Cheerio-first, Playwright-fallback architecture. Extracts job listings, enriches with job details and company data, then exports normalized, structured data to Apify Dataset.
Architecture Overview
Core Principles
- Nuxt SSR JSON First: Extract window.__NUXT__ from HTML using regex (NO JavaScript execution)
- CheerioCrawler Primary: Fast, lightweight scraping for all phases
- PlaywrightCrawler Fallback: ONLY when Cheerio fails to extract critical fields
- Three-Phase Pipeline: Listing → Job Detail → Company Overview
- Deterministic URL Construction: Use companyUrlName from Nuxt state as single source of truth
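The fallback rule above can be sketched as a small check that runs after the Cheerio pass. This is only an illustration: the README does not say which fields count as critical, so `CRITICAL_FIELDS` and both function names are hypothetical.

```javascript
// Hypothetical list of critical fields; the README does not enumerate them.
const CRITICAL_FIELDS = ['jobId', 'title', 'companyUrlName'];

// Returns the critical fields that are missing or blank in a Cheerio-extracted record.
function missingCriticalFields(record, fields = CRITICAL_FIELDS) {
  return fields.filter((field) => {
    const value = record?.[field];
    return value === undefined || value === null || String(value).trim() === '';
  });
}

// Playwright is only warranted when the Cheerio pass left gaps in critical fields.
function needsPlaywrightFallback(record) {
  return missingCriticalFields(record).length > 0;
}
```

A record for which `needsPlaywrightFallback` returns `true` would be re-queued for the PlaywrightCrawler; everything else stays on the faster Cheerio path.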
Data Flow
Phase 1: Listing Extraction (CheerioCrawler)
  ↓ Extract window.__NUXT__.data[1].jobs
  ↓ Parse job listings + companyUrlName
  ↓ Store in KeyValueStore
  ↓
Phase 2: Job Detail Enrichment (CheerioCrawler)
  ↓ Extract description, rating, skills
  ↓ Resolve company URL from companyUrlName
  ↓ Update KeyValueStore
  ↓
Phase 3: Company Overview Enrichment (CheerioCrawler)
  ↓ Extract size, website, industry, description
  ↓ STRICT employee count validation
  ↓ Merge job + company data
  ↓ Calculate confidence score
  ↓
Export to Apify Dataset
Performance Targets
- Concurrency: 40 requests
- Throughput: 1200 requests/minute
- Timeouts: 20s handler, 30s navigation
- Retries: Max 1, on [429, 500, 502, 503]
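As a rough sketch, the targets above map onto crawler options like the following. The option names are the ones Crawlee's CheerioCrawler accepts; `RETRYABLE_STATUS_CODES` and `shouldRetry` are hypothetical helpers that illustrate the retry rule, and the actual wiring in `src/main.js` may differ.

```javascript
// Option names follow Crawlee's CheerioCrawler; values come from the targets above.
const crawlerOptions = {
  maxConcurrency: 40,
  maxRequestsPerMinute: 1200,
  requestHandlerTimeoutSecs: 20,
  navigationTimeoutSecs: 30,
  maxRequestRetries: 1,
};

// Status codes the README lists as retryable.
const RETRYABLE_STATUS_CODES = new Set([429, 500, 502, 503]);

// Hypothetical helper: retry only listed status codes, and only while retries remain.
function shouldRetry(statusCode, retryCount, maxRetries = crawlerOptions.maxRequestRetries) {
  return RETRYABLE_STATUS_CODES.has(statusCode) && retryCount < maxRetries;
}
```

The options object could be spread into the crawler constructor alongside the request handler, for example `new CheerioCrawler({ ...crawlerOptions, requestHandler })`.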
Project Structure
cherro-scrapper/
├── src/
│   └── main.js                # Main orchestration
├── routes/
│   ├── listing.js             # Phase 1: Listing extraction
│   ├── jobDetail.js           # Phase 2: Job detail enrichment
│   └── company.js             # Phase 3: Company overview enrichment
├── utils/
│   ├── nuxtParser.js          # Nuxt state extraction
│   ├── validators.js          # Data validation (strict rules)
│   ├── normalizers.js         # Data normalization
│   └── confidenceScore.js     # Quality scoring
├── .actor/
│   ├── actor.json             # Apify actor configuration
│   └── input_schema.json      # Input schema
├── package.json
├── Dockerfile
├── .env.example
└── README.md
Installation
Local Development
    # Clone repository
    cd cherro-scrapper

    # Install dependencies
    npm install

    # Copy environment template
    cp .env.example .env

    # Edit .env with your configuration
    # (Optional: Add APIFY_TOKEN for local testing)

    # Run scraper
    npm start
Apify Deployment
    # Install Apify CLI
    npm install -g apify-cli

    # Login to Apify
    apify login

    # Push to Apify
    apify push

    # Run on Apify platform
    # Navigate to https://console.apify.com/actors
Configuration
Input Parameters
Configure via Apify Console or INPUT.json:
    {
      "startUrls": [
        "https://www.ambitionbox.com/jobs",
        "https://www.ambitionbox.com/jobs?q=software+engineer"
      ],
      "maxConcurrency": 40,
      "maxRequestsPerMinute": 1200,
      "requestHandlerTimeoutSecs": 20
    }
Environment Variables
See .env.example for local testing configuration.
Data Schema
Output Format
Each job record in the dataset contains:
    {
      "jobId": "12345",
      "title": "Senior Software Engineer",
      "companyName": "Example Corp",
      "companyUrlName": "example-corp",
      "location": "Bangalore",
      "postedDate": "2025-12-15",
      "salary": { "min": 1500000, "max": 2500000, "currency": "INR" },
      "experience": { "min": 3, "max": 5 },
      "description": "Job description text...",
      "skills": ["JavaScript", "React", "Node.js"],
      "companyRating": 4.2,
      "employeeCount": { "min": 201, "max": 500, "raw": "201-500" },
      "companyWebsite": "https://example.com",
      "industry": "Information Technology",
      "companyDescription": "Company description text...",
      "headquarters": "Bangalore, India",
      "confidenceScore": 87.5,
      "confidenceLevel": "GOOD",
      "scrapedAt": "2025-12-18T09:44:20.000Z",
      "sourceUrl": "https://www.ambitionbox.com/jobs"
    }
Confidence Scoring
Data quality score (0-100) based on field completeness:
- 90-100: EXCELLENT - All mandatory and most optional fields present
- 75-89: GOOD - All mandatory fields + some enrichment
- 60-74: FAIR - Mandatory fields present, limited enrichment
- 40-59: POOR - Some mandatory fields missing
- 0-39: VERY_POOR - Multiple mandatory fields missing
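The bands above can be expressed as a simple threshold lookup. This sketch covers only the score-to-level mapping; the field weights live in `utils/confidenceScore.js` and are not shown in this README. The function name is hypothetical, and treating each band's lower bound as an inclusive threshold (so 89.5 counts as GOOD) is an assumption.

```javascript
// Lower bound of each band, highest first, as listed in the README.
const CONFIDENCE_BANDS = [
  { min: 90, level: 'EXCELLENT' },
  { min: 75, level: 'GOOD' },
  { min: 60, level: 'FAIR' },
  { min: 40, level: 'POOR' },
  { min: 0, level: 'VERY_POOR' },
];

// Maps a 0-100 score to its level; fractional scores fall into the band
// whose lower bound they meet (assumption, not confirmed by the source).
function confidenceLevel(score) {
  if (!Number.isFinite(score) || score < 0 || score > 100) {
    throw new RangeError(`score must be between 0 and 100, got ${score}`);
  }
  return CONFIDENCE_BANDS.find((band) => score >= band.min).level;
}
```

With this mapping, the sample record's `confidenceScore` of 87.5 resolves to `GOOD`, matching its `confidenceLevel` field.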
Critical Implementation Details
Employee Count Validation
STRICT RULES (implemented in utils/validators.js):
✅ ACCEPT:
- Ranges: "201-500", "1-10"
- Lakh format: "1 Lakh+", "2 Lakhs"
- Large numbers: "10,000+", "5000"
- K values ≥ 100: "100k", "500k"
❌ REJECT:
- Contains "follow": "5.6k followers"
- K values < 100: "5.6k", "10k", "50k"
Company URL Resolution
Priority Order:
- companyUrlName from Nuxt state (SINGLE SOURCE OF TRUTH)
- Extract from job detail page anchor
- Construct slug from company name (LAST RESORT)
Format: https://www.ambitionbox.com/overview/{companyUrlName}-overview
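The priority order and URL format above could be combined roughly as follows. The function names, the `detailPageHref` parameter, and the slug rule (lowercase, non-alphanumeric runs collapsed to hyphens) are assumptions for illustration; AmbitionBox's real slugs may not follow that rule, which is why the slug is the last resort.

```javascript
const SITE_ORIGIN = 'https://www.ambitionbox.com';

// Last-resort slug: lowercase, with non-alphanumeric runs collapsed to hyphens (assumed rule).
function slugify(name) {
  return name
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '');
}

// Resolves the company overview URL using the README's priority order.
function resolveCompanyUrl({ companyUrlName, detailPageHref, companyName } = {}) {
  // 1. companyUrlName from Nuxt state (single source of truth)
  if (companyUrlName) return `${SITE_ORIGIN}/overview/${companyUrlName}-overview`;
  // 2. Anchor href found on the job detail page (may be relative)
  if (detailPageHref) return new URL(detailPageHref, SITE_ORIGIN).href;
  // 3. Slug constructed from the company name
  if (companyName) {
    const slug = slugify(companyName);
    if (slug) return `${SITE_ORIGIN}/overview/${slug}-overview`;
  }
  return null;
}
```

Returning `null` when nothing resolves lets Phase 3 skip company enrichment for that job instead of requesting a guessed URL.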
Nuxt State Extraction
Method: Regex-based extraction from HTML string
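    // Extract the window.__NUXT__ = {...} assignment from the raw HTML.
    // Match lazily up to the closing </script> tag so nested braces are not cut off.
    const nuxtRegex = /window\.__NUXT__\s*=\s*(\{[\s\S]*?\})\s*;?\s*<\/script>/;
    const match = html.match(nuxtRegex);
    if (!match) throw new Error('window.__NUXT__ not found in HTML');
    // JSON.parse only works when the payload is plain JSON, not a JS expression.
    const nuxtState = JSON.parse(match[1]);
    // Navigate to jobs
    const jobs = nuxtState?.data?.[1]?.jobs ?? [];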
NO JavaScript execution - works in CheerioCrawler.
Troubleshooting
Common Issues
Issue: No jobs found in Nuxt state
Solution:
- Check if AmbitionBox changed their Nuxt state structure
- Verify the data[1].jobs path is correct
- Enable debug logging to inspect raw Nuxt state
Issue: Employee count always null
Solution:
- Check if validation rules are too strict
- Inspect raw employee count values in logs
- Adjust selectors in routes/company.js
Issue: Low confidence scores
Solution:
- Review field weights in utils/confidenceScore.js
- Check if selectors are extracting data correctly
- Verify company URLs are resolving properly
Debug Mode
Enable verbose logging:
    // In src/main.js, add:
    import { log } from 'crawlee';

    // Raise the global log level before the crawler starts.
    log.setLevel(log.LEVELS.DEBUG);
Performance Optimization
Recommended Settings
For maximum throughput:
    {
      "maxConcurrency": 40,
      "maxRequestsPerMinute": 1200
    }
For stability (avoid rate limiting):
    {
      "maxConcurrency": 20,
      "maxRequestsPerMinute": 600
    }
Monitoring
Check Apify Console for:
- Request queue size
- Dataset item count
- Failed requests
- Retry histogram
Dependencies
    {
      "apify": "^3.1.10",
      "crawlee": "^3.7.0",
      "cheerio": "^1.0.0-rc.12"
    }
All dependencies are official packages published on npm.
License
ISC
Support
For issues or questions:
- Check Apify logs for error messages
- Review this README for troubleshooting steps
- Inspect KeyValueStore for intermediate data
- Enable debug logging for detailed output
Built with: Node.js 18+, Crawlee, Apify, Cheerio
Architecture: Cheerio-first, Playwright-fallback
Performance: 40 concurrent requests, 1200 req/min throughput