Ambitionbox Job Scrapper

Status: Under maintenance
Pricing: from $0.80 / actor start
Developer: ai (Maintained by Community)
Rating: 0.0 (0 ratings)
Actor stats: 0 bookmarks · 2 total users · 1 monthly active user
Last modified: 2 days ago
AmbitionBox Ultra-Fast Job Scraper

Production-grade job scraper for AmbitionBox using a Cheerio-first, Playwright-fallback architecture. It extracts job listings, enriches them with job details and company data, then exports normalized, structured data to an Apify Dataset.

Architecture Overview

Core Principles

  • Nuxt SSR JSON First: Extract window.__NUXT__ from HTML using regex (NO JavaScript execution)
  • CheerioCrawler Primary: Fast, lightweight scraping for all phases
  • PlaywrightCrawler Fallback: ONLY when Cheerio fails to extract critical fields
  • Three-Phase Pipeline: Listing → Job Detail → Company Overview
  • Deterministic URL Construction: Use companyUrlName from Nuxt state as single source of truth

Data Flow

Phase 1: Listing Extraction (CheerioCrawler)
↓ Extract window.__NUXT__.data[1].jobs
↓ Parse job listings + companyUrlName
↓ Store in KeyValueStore
Phase 2: Job Detail Enrichment (CheerioCrawler)
↓ Extract description, rating, skills
↓ Resolve company URL from companyUrlName
↓ Update KeyValueStore
Phase 3: Company Overview Enrichment (CheerioCrawler)
↓ Extract size, website, industry, description
↓ STRICT employee count validation
↓ Merge job + company data
↓ Calculate confidence score
Export to Apify Dataset
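The Phase 3 merge step can be sketched as a pure function. Field names follow the output schema below; the shape of the `company` object is a hypothetical intermediate, not a documented structure:

```javascript
// Merge a Phase-1/2 job record with Phase-3 company data into one dataset item.
// Missing company fields fall back to null rather than being omitted.
function mergeJobWithCompany(job, company) {
    return {
        ...job,
        employeeCount: company?.employeeCount ?? null,
        companyWebsite: company?.website ?? null,
        industry: company?.industry ?? null,
        companyDescription: company?.description ?? null,
        headquarters: company?.headquarters ?? null,
        scrapedAt: new Date().toISOString(),
    };
}
```

Keeping the merge free of I/O makes it trivial to unit-test independently of the crawler.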

Performance Targets

  • Concurrency: 40 requests
  • Throughput: 1200 requests/minute
  • Timeouts: 20s handler, 30s navigation
  • Retries: Max 1, on [429, 500, 502, 503]
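These targets map directly onto Crawlee crawler options. A minimal sketch (the `router` request handler is assumed to be assembled from the handlers in `routes/`):

```javascript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    maxConcurrency: 40,             // concurrency target
    maxRequestsPerMinute: 1200,     // throughput target
    requestHandlerTimeoutSecs: 20,  // handler timeout
    navigationTimeoutSecs: 30,      // navigation timeout
    maxRequestRetries: 1,           // max 1 retry
    requestHandler: router,         // phase handlers from routes/ (assumed)
});
```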

Project Structure

cherro-scrapper/
├── src/
│   └── main.js              # Main orchestration
├── routes/
│   ├── listing.js           # Phase 1: Listing extraction
│   ├── jobDetail.js         # Phase 2: Job detail enrichment
│   └── company.js           # Phase 3: Company overview enrichment
├── utils/
│   ├── nuxtParser.js        # Nuxt state extraction
│   ├── validators.js        # Data validation (strict rules)
│   ├── normalizers.js       # Data normalization
│   └── confidenceScore.js   # Quality scoring
├── .actor/
│   ├── actor.json           # Apify actor configuration
│   └── input_schema.json    # Input schema
├── package.json
├── Dockerfile
├── .env.example
└── README.md

Installation

Local Development

# Clone repository
cd cherro-scrapper
# Install dependencies
npm install
# Copy environment template
cp .env.example .env
# Edit .env with your configuration
# (Optional: Add APIFY_TOKEN for local testing)
# Run scraper
npm start

Apify Deployment

# Install Apify CLI
npm install -g apify-cli
# Login to Apify
apify login
# Push to Apify
apify push
# Run on Apify platform
# Navigate to https://console.apify.com/actors

Configuration

Input Parameters

Configure via Apify Console or INPUT.json:

{
  "startUrls": [
    "https://www.ambitionbox.com/jobs",
    "https://www.ambitionbox.com/jobs?q=software+engineer"
  ],
  "maxConcurrency": 40,
  "maxRequestsPerMinute": 1200,
  "requestHandlerTimeoutSecs": 20
}
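An actor should tolerate partial input, so each field needs a default. One way to apply the defaults above, as a hypothetical pure helper:

```javascript
// Fill in defaults for any input fields the caller omitted.
// Input spread last so explicit values always win over defaults.
function withDefaults(input) {
    return {
        startUrls: ['https://www.ambitionbox.com/jobs'],
        maxConcurrency: 40,
        maxRequestsPerMinute: 1200,
        requestHandlerTimeoutSecs: 20,
        ...(input ?? {}),
    };
}
```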

Environment Variables

See .env.example for local testing configuration.

Data Schema

Output Format

Each job record in the dataset contains:

{
  "jobId": "12345",
  "title": "Senior Software Engineer",
  "companyName": "Example Corp",
  "companyUrlName": "example-corp",
  "location": "Bangalore",
  "postedDate": "2025-12-15",
  "salary": {
    "min": 1500000,
    "max": 2500000,
    "currency": "INR"
  },
  "experience": {
    "min": 3,
    "max": 5
  },
  "description": "Job description text...",
  "skills": ["JavaScript", "React", "Node.js"],
  "companyRating": 4.2,
  "employeeCount": {
    "min": 201,
    "max": 500,
    "raw": "201-500"
  },
  "companyWebsite": "https://example.com",
  "industry": "Information Technology",
  "companyDescription": "Company description text...",
  "headquarters": "Bangalore, India",
  "confidenceScore": 87.5,
  "confidenceLevel": "GOOD",
  "scrapedAt": "2025-12-18T09:44:20.000Z",
  "sourceUrl": "https://www.ambitionbox.com/jobs"
}
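Structured fields like `experience` come from normalizing raw page text. A sketch of such a normalizer, assuming raw strings shaped like "3-5 Yrs" (the exact source format is an assumption, not documented above):

```javascript
// Parse an experience range string like "3-5 Yrs" into { min, max }.
// Returns null when no numeric range is present.
function parseExperience(raw) {
    const match = (raw ?? '').match(/(\d+)\s*-\s*(\d+)/);
    if (!match) return null;
    return { min: Number(match[1]), max: Number(match[2]) };
}
```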

Confidence Scoring

Data quality score (0-100) based on field completeness:

  • 90-100: EXCELLENT - All mandatory and most optional fields present
  • 75-89: GOOD - All mandatory fields + some enrichment
  • 60-74: FAIR - Mandatory fields present, limited enrichment
  • 40-59: POOR - Some mandatory fields missing
  • 0-39: VERY_POOR - Multiple mandatory fields missing
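The score-to-level mapping above reduces to a simple threshold function, sketched here as a hypothetical helper in the spirit of utils/confidenceScore.js:

```javascript
// Map a 0-100 completeness score to its confidence level band.
function confidenceLevel(score) {
    if (score >= 90) return 'EXCELLENT';
    if (score >= 75) return 'GOOD';
    if (score >= 60) return 'FAIR';
    if (score >= 40) return 'POOR';
    return 'VERY_POOR';
}
```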

Critical Implementation Details

Employee Count Validation

STRICT RULES (implemented in utils/validators.js):

ACCEPT:

  • Ranges: "201-500", "1-10"
  • Lakh format: "1 Lakh+", "2 Lakhs"
  • Large numbers: "10,000+", "5000"
  • K values ≥ 100: "100k", "500k"

REJECT:

  • Contains "follow": "5.6k followers"
  • K values < 100: "5.6k", "10k", "50k"

Company URL Resolution

Priority Order:

  1. companyUrlName from Nuxt state (SINGLE SOURCE OF TRUTH)
  2. Extract from job detail page anchor
  3. Construct slug from company name (LAST RESORT)

Format: https://www.ambitionbox.com/overview/{companyUrlName}-overview
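The priority order above (minus the job-detail-page anchor, which needs a live DOM) can be sketched as a hypothetical helper:

```javascript
// Build the company overview URL. Prefers companyUrlName from the Nuxt
// state; falls back to a slug derived from the company name (last resort).
function companyOverviewUrl(companyUrlName, companyName) {
    const slug = companyUrlName
        || (companyName || '')
            .toLowerCase()
            .trim()
            .replace(/[^a-z0-9]+/g, '-')  // collapse non-alphanumerics to hyphens
            .replace(/^-|-$/g, '');       // trim stray edge hyphens
    if (!slug) return null;
    return `https://www.ambitionbox.com/overview/${slug}-overview`;
}
```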

Nuxt State Extraction

Method: Regex-based extraction from HTML string

// Extract window.__NUXT__ = {...}, matching up to the closing </script>
// so nested braces inside the state object don't truncate the match
const nuxtRegex = /window\.__NUXT__\s*=\s*({[\s\S]+?})\s*;?\s*<\/script>/;
const match = html.match(nuxtRegex);
if (!match) throw new Error('window.__NUXT__ not found in HTML');
const nuxtState = JSON.parse(match[1]);
// Navigate to jobs
const jobs = nuxtState.data[1].jobs;

NO JavaScript execution - works in CheerioCrawler.
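A self-contained demo of the approach on a sample HTML string (the sample payload is illustrative; this assumes the state is JSON-serializable, as the scraper does):

```javascript
// Sample SSR page with an inlined Nuxt state (hypothetical payload).
const html = '<html><body><script>window.__NUXT__ = '
    + '{"data":[null,{"jobs":[{"jobId":"12345","title":"Senior Software Engineer"}]}]};'
    + '</script></body></html>';

// Extract the jobs array from window.__NUXT__ without executing any JS.
function extractNuxtJobs(html) {
    const match = html.match(/window\.__NUXT__\s*=\s*({[\s\S]+?})\s*;?\s*<\/script>/);
    if (!match) return null;
    const nuxtState = JSON.parse(match[1]);
    return nuxtState?.data?.[1]?.jobs ?? null;
}

const jobs = extractNuxtJobs(html);
```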

Troubleshooting

Common Issues

Issue: No jobs found in Nuxt state

Solution:

  • Check if AmbitionBox changed their Nuxt state structure
  • Verify data[1].jobs path is correct
  • Enable debug logging to inspect raw Nuxt state

Issue: Employee count always null

Solution:

  • Check if validation rules are too strict
  • Inspect raw employee count values in logs
  • Adjust selectors in routes/company.js

Issue: Low confidence scores

Solution:

  • Review field weights in utils/confidenceScore.js
  • Check if selectors are extracting data correctly
  • Verify company URLs are resolving properly

Debug Mode

Enable verbose logging:

// In src/main.js, add:
import { log, LogLevel } from 'crawlee';

log.setLevel(LogLevel.DEBUG);

Performance Optimization

For maximum throughput:

{
  "maxConcurrency": 40,
  "maxRequestsPerMinute": 1200
}

For stability (avoid rate limiting):

{
  "maxConcurrency": 20,
  "maxRequestsPerMinute": 600
}

Monitoring

Check Apify Console for:

  • Request queue size
  • Dataset item count
  • Failed requests
  • Retry histogram

Dependencies

{
  "apify": "^3.1.10",
  "crawlee": "^3.7.0",
  "cheerio": "^1.0.0-rc.12"
}

All dependencies are official, published packages.

License

ISC

Support

For issues or questions:

  1. Check Apify logs for error messages
  2. Review this README for troubleshooting steps
  3. Inspect KeyValueStore for intermediate data
  4. Enable debug logging for detailed output

Built with: Node.js 18+, Crawlee, Apify, Cheerio

Architecture: Cheerio-first, Playwright-fallback

Performance: 40 concurrent requests, 1200 req/min throughput