Ambitionbox Job Scraper

Production-grade job scraper for AmbitionBox using a Cheerio-first, Playwright-fallback architecture. Extracts job listings, enriches with job details and company data, then exports normalized, structured data to Apify Dataset.

Developer: ai (maintained by Community)

AmbitionBox Ultra-Fast Job Scraper

Architecture Overview

Core Principles

  • Nuxt SSR JSON First: Extract window.__NUXT__ from HTML using regex (NO JavaScript execution)
  • CheerioCrawler Primary: Fast, lightweight scraping for all phases
  • PlaywrightCrawler Fallback: ONLY when Cheerio fails to extract critical fields
  • Three-Phase Pipeline: Listing → Job Detail → Company Overview
  • Deterministic URL Construction: Use companyUrlName from Nuxt state as single source of truth
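
The fallback rule above can be sketched as a simple check on the record that the Cheerio pass produced. The field list and function names below are assumptions for illustration; this README does not state which fields the actor treats as critical.

```javascript
// Assumed set of critical fields; the actor's real list may differ.
const CRITICAL_FIELDS = ['jobId', 'title', 'companyName'];

// Returns the names of critical fields that are absent or empty.
function missingCriticalFields(record) {
  return CRITICAL_FIELDS.filter((field) => {
    const value = record?.[field];
    return value === undefined || value === null || value === '';
  });
}

// Only records with missing critical fields would be re-queued for Playwright.
function needsPlaywrightFallback(record) {
  return missingCriticalFields(record).length > 0;
}

console.log(needsPlaywrightFallback({ jobId: '12345', title: 'Senior Software Engineer' }));
```

Keeping the check in one small function makes the fallback condition easy to audit and keeps browser-based requests to the minimum.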

Data Flow

Phase 1: Listing Extraction (CheerioCrawler)
↓ Extract window.__NUXT__.data[1].jobs
↓ Parse job listings + companyUrlName
↓ Store in KeyValueStore
Phase 2: Job Detail Enrichment (CheerioCrawler)
↓ Extract description, rating, skills
↓ Resolve company URL from companyUrlName
↓ Update KeyValueStore
Phase 3: Company Overview Enrichment (CheerioCrawler)
↓ Extract size, website, industry, description
↓ STRICT employee count validation
↓ Merge job + company data
↓ Calculate confidence score
Export to Apify Dataset
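
As a rough illustration of how a record could accumulate fields across the three phases, the sketch below uses an in-memory Map as a stand-in for the KeyValueStore. Keying records by jobId and the helper names are assumptions, not the actor's confirmed implementation.

```javascript
// In-memory stand-in for the KeyValueStore shared between phases (assumption).
const store = new Map();

// Phase 1: save the listing fields under the job ID.
function saveListing(job) {
  store.set(job.jobId, { ...job });
}

// Phases 2 and 3: merge newly extracted fields into the stored record.
function enrich(jobId, fields) {
  const existing = store.get(jobId);
  if (!existing) return null; // the listing phase never saw this job
  const merged = { ...existing, ...fields };
  store.set(jobId, merged);
  return merged;
}

saveListing({ jobId: '12345', title: 'Senior Software Engineer', companyUrlName: 'example-corp' });
enrich('12345', { skills: ['JavaScript', 'React'], companyRating: 4.2 });
const record = enrich('12345', { industry: 'Information Technology' });
console.log(record);
```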

Performance Targets

  • Concurrency: 40 requests
  • Throughput: 1200 requests/minute
  • Timeouts: 20s handler, 30s navigation
  • Retries: Max 1, on [429, 500, 502, 503]
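
These targets imply a latency budget. As a rough check using Little's law (concurrency ≈ throughput × average latency), the sketch below derives the average request time at which 40 concurrent slots can sustain 1200 requests/minute. The helper name is illustrative and not part of the project.

```javascript
// Little's law: concurrency = throughput (req/s) * average latency (s).
// Hypothetical helper, not part of the actor's code base.
function latencyBudgetSecs(maxConcurrency, maxRequestsPerMinute) {
  const requestsPerSecond = maxRequestsPerMinute / 60;
  return maxConcurrency / requestsPerSecond;
}

const budget = latencyBudgetSecs(40, 1200);
console.log(`Average request time budget: ${budget}s`); // 40 / 20 = 2
```

If average responses take longer than about 2 seconds, the concurrency cap rather than the rate limit bounds the achievable throughput.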

Project Structure

cherro-scrapper/
├── src/
│   └── main.js              # Main orchestration
├── routes/
│   ├── listing.js           # Phase 1: Listing extraction
│   ├── jobDetail.js         # Phase 2: Job detail enrichment
│   └── company.js           # Phase 3: Company overview enrichment
├── utils/
│   ├── nuxtParser.js        # Nuxt state extraction
│   ├── validators.js        # Data validation (strict rules)
│   ├── normalizers.js       # Data normalization
│   └── confidenceScore.js   # Quality scoring
├── .actor/
│   ├── actor.json           # Apify actor configuration
│   └── input_schema.json    # Input schema
├── package.json
├── Dockerfile
├── .env.example
└── README.md

Installation

Local Development

# Enter the project directory (after cloning the repository)
cd cherro-scrapper
# Install dependencies
npm install
# Copy environment template
cp .env.example .env
# Edit .env with your configuration
# (Optional: Add APIFY_TOKEN for local testing)
# Run scraper
npm start

Apify Deployment

# Install Apify CLI
npm install -g apify-cli
# Login to Apify
apify login
# Push to Apify
apify push
# Run on Apify platform
# Navigate to https://console.apify.com/actors

Configuration

Input Parameters

Configure via Apify Console or INPUT.json:

{
  "startUrls": [
    "https://www.ambitionbox.com/jobs",
    "https://www.ambitionbox.com/jobs?q=software+engineer"
  ],
  "maxConcurrency": 40,
  "maxRequestsPerMinute": 1200,
  "requestHandlerTimeoutSecs": 20
}
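
A minimal sketch of how such an input might be merged with defaults before the crawler is configured. The defaults are taken from the performance targets in this README; the helper name and the rule that at least one start URL is required are assumptions.

```javascript
// Defaults taken from the performance targets in this README.
const DEFAULTS = {
  maxConcurrency: 40,
  maxRequestsPerMinute: 1200,
  requestHandlerTimeoutSecs: 20,
};

// Hypothetical helper: checks the start URLs and fills in missing values.
function resolveInput(input = {}) {
  const startUrls = Array.isArray(input.startUrls) ? input.startUrls : [];
  if (startUrls.length === 0) {
    throw new Error('Input must contain at least one start URL');
  }
  return { ...DEFAULTS, ...input, startUrls };
}

const resolved = resolveInput({
  startUrls: ['https://www.ambitionbox.com/jobs'],
  maxConcurrency: 20,
});
console.log(resolved);
```

The three numeric keys match option names accepted by Crawlee's CheerioCrawler, so the resolved values could be passed through to the crawler options unchanged.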

Environment Variables

See .env.example for local testing configuration.

Data Schema

Output Format

Each job record in the dataset contains:

{
  "jobId": "12345",
  "title": "Senior Software Engineer",
  "companyName": "Example Corp",
  "companyUrlName": "example-corp",
  "location": "Bangalore",
  "postedDate": "2025-12-15",
  "salary": {
    "min": 1500000,
    "max": 2500000,
    "currency": "INR"
  },
  "experience": {
    "min": 3,
    "max": 5
  },
  "description": "Job description text...",
  "skills": ["JavaScript", "React", "Node.js"],
  "companyRating": 4.2,
  "employeeCount": {
    "min": 201,
    "max": 500,
    "raw": "201-500"
  },
  "companyWebsite": "https://example.com",
  "industry": "Information Technology",
  "companyDescription": "Company description text...",
  "headquarters": "Bangalore, India",
  "confidenceScore": 87.5,
  "confidenceLevel": "GOOD",
  "scrapedAt": "2025-12-18T09:44:20.000Z",
  "sourceUrl": "https://www.ambitionbox.com/jobs"
}

Confidence Scoring

Data quality score (0-100) based on field completeness:

  • 90-100: EXCELLENT - All mandatory and most optional fields present
  • 75-89: GOOD - All mandatory fields + some enrichment
  • 60-74: FAIR - Mandatory fields present, limited enrichment
  • 40-59: POOR - Some mandatory fields missing
  • 0-39: VERY_POOR - Multiple mandatory fields missing
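
The bands above translate into a small lookup. This is a sketch only: the field weights live in utils/confidenceScore.js and are not shown here, and treating each lower bound as inclusive (so a fractional score such as 89.5 maps to GOOD) is an assumption.

```javascript
// Maps a 0-100 score to the level names listed above.
// Lower bounds are treated as inclusive (assumption for fractional scores).
function confidenceLevel(score) {
  if (score >= 90) return 'EXCELLENT';
  if (score >= 75) return 'GOOD';
  if (score >= 60) return 'FAIR';
  if (score >= 40) return 'POOR';
  return 'VERY_POOR';
}

console.log(confidenceLevel(87.5)); // GOOD, matching the sample record
```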

Critical Implementation Details

Employee Count Validation

STRICT RULES (implemented in utils/validators.js):

ACCEPT:

  • Ranges: "201-500", "1-10"
  • Lakh format: "1 Lakh+", "2 Lakhs"
  • Large numbers: "10,000+", "5000"
  • K values ≥ 100: "100k", "500k"

REJECT:

  • Contains "follow": "5.6k followers"
  • K values < 100: "5.6k", "10k", "50k"
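
The accept and reject rules above can be sketched as a single predicate. This is an illustrative version, not the code in utils/validators.js; the function name and the exact patterns are assumptions, and the sketch accepts any plain number rather than enforcing a minimum size.

```javascript
// Hypothetical validator mirroring the accept/reject rules listed above.
function isValidEmployeeCount(raw) {
  if (typeof raw !== 'string') return false;
  const text = raw.trim().toLowerCase();
  // Follower counts are not employee counts.
  if (text === '' || text.includes('follow')) return false;

  // K values: "100k" is accepted, "5.6k" or "50k" is rejected.
  const kMatch = text.match(/^(\d+(?:\.\d+)?)\s*k\+?$/);
  if (kMatch) return parseFloat(kMatch[1]) >= 100;

  // Lakh format: "1 Lakh+", "2 Lakhs".
  if (/^\d+(?:\.\d+)?\s*lakhs?\+?$/.test(text)) return true;

  // Ranges: "201-500", "1-10" (commas allowed inside each bound).
  if (/^\d[\d,]*\s*-\s*\d[\d,]*$/.test(text)) return true;

  // Plain numbers: "10,000+", "5000".
  return /^\d[\d,]*\+?$/.test(text);
}

for (const value of ['201-500', '1 Lakh+', '10,000+', '100k', '5.6k followers', '50k']) {
  console.log(value, isValidEmployeeCount(value));
}
```

Checking for "follow" first ensures that follower counts are rejected before any numeric pattern gets a chance to match.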

Company URL Resolution

Priority Order:

  1. companyUrlName from Nuxt state (SINGLE SOURCE OF TRUTH)
  2. Extract from job detail page anchor
  3. Construct slug from company name (LAST RESORT)

Format: https://www.ambitionbox.com/overview/{companyUrlName}-overview
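
A minimal sketch of the priority order and URL format above. The function names, the parameter names, and the slug rule used for the last resort are assumptions; only the priority order and the overview URL format come from this README.

```javascript
const SITE_ORIGIN = 'https://www.ambitionbox.com';
const OVERVIEW_BASE = `${SITE_ORIGIN}/overview/`;

// Last-resort slug (assumed rule): lowercase, non-alphanumerics become hyphens.
function slugify(name) {
  return name.toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/^-+|-+$/g, '');
}

// Applies the three sources in priority order and returns null if all are missing.
function resolveCompanyUrl({ companyUrlName, anchorHref, companyName } = {}) {
  if (companyUrlName) return `${OVERVIEW_BASE}${companyUrlName}-overview`;
  if (anchorHref) return new URL(anchorHref, SITE_ORIGIN).href;
  if (companyName) return `${OVERVIEW_BASE}${slugify(companyName)}-overview`;
  return null;
}

console.log(resolveCompanyUrl({ companyUrlName: 'example-corp', companyName: 'Example Corp' }));
```

Because the slug built from the display name may not match the site's own URL name, records resolved through the last branch are worth flagging for review.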

Nuxt State Extraction

Method: Regex-based extraction from HTML string

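// Extract window.__NUXT__ = {...} (assumes a JSON object literal that ends its script tag)
const nuxtRegex = /window\.__NUXT__\s*=\s*(\{[\s\S]*?\})\s*;?\s*<\/script>/;
const match = html.match(nuxtRegex);
if (!match) throw new Error('window.__NUXT__ not found in HTML');
const nuxtState = JSON.parse(match[1]);
// Navigate to jobs (optional chaining guards against a changed structure)
const jobs = nuxtState.data?.[1]?.jobs ?? [];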

No JavaScript is executed, so this works in CheerioCrawler.

Troubleshooting

Common Issues

Issue: No jobs found in Nuxt state

Solution:

  • Check if AmbitionBox changed their Nuxt state structure
  • Verify data[1].jobs path is correct
  • Enable debug logging to inspect raw Nuxt state
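
When checking whether the data[1].jobs path has moved, a small scan over the parsed state can help. The helper below is a hypothetical debugging aid, not part of the actor; it assumes the jobs array still sits directly on one of the entries of nuxtState.data.

```javascript
// Hypothetical debugging aid: find which entry of nuxtState.data holds `jobs`.
function findJobs(nuxtState) {
  const entries = Array.isArray(nuxtState?.data) ? nuxtState.data : [];
  for (const [index, entry] of entries.entries()) {
    if (entry && Array.isArray(entry.jobs)) {
      return { index, jobs: entry.jobs };
    }
  }
  return { index: -1, jobs: [] }; // no entry exposes a jobs array
}

const sample = { data: [{}, { jobs: [{ jobId: '12345' }] }] };
console.log(findJobs(sample).index);
```

An index other than 1 indicates that the hard-coded path needs updating; an index of -1 suggests the structure changed more substantially.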

Issue: Employee count always null

Solution:

  • Check if validation rules are too strict
  • Inspect raw employee count values in logs
  • Adjust selectors in routes/company.js

Issue: Low confidence scores

Solution:

  • Review field weights in utils/confidenceScore.js
  • Check if selectors are extracting data correctly
  • Verify company URLs are resolving properly

Debug Mode

Enable verbose logging:

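// In src/main.js, add:
import { log } from 'crawlee';
// Raise the global Crawlee log level so debug messages are printed
log.setLevel(log.LEVELS.DEBUG);
const crawler = new CheerioCrawler({
  // ... other config
});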

Performance Optimization

For maximum throughput:

{
  "maxConcurrency": 40,
  "maxRequestsPerMinute": 1200
}

For stability (avoid rate limiting):

{
  "maxConcurrency": 20,
  "maxRequestsPerMinute": 600
}

Monitoring

Check Apify Console for:

  • Request queue size
  • Dataset item count
  • Failed requests
  • Retry histogram

Dependencies

{
  "apify": "^3.1.10",
  "crawlee": "^3.7.0",
  "cheerio": "^1.0.0-rc.12"
}

All dependencies are official packages published on npm.

License

ISC

Support

For issues or questions:

  1. Check Apify logs for error messages
  2. Review this README for troubleshooting steps
  3. Inspect KeyValueStore for intermediate data
  4. Enable debug logging for detailed output

Built with: Node.js 18+, Crawlee, Apify, Cheerio

Architecture: Cheerio-first, Playwright-fallback

Performance: 40 concurrent requests, 1200 req/min throughput