# Ambitionbox Job Scraper (`yodeling_elevator/ambitionbox-job-scraper`) Actor

Production-grade job scraper for AmbitionBox using a **Cheerio-first, Playwright-fallback** architecture. Extracts job listings, enriches with job details and company data, then exports normalized, structured data to Apify Dataset.

- **URL**: https://apify.com/yodeling_elevator/ambitionbox-job-scraper.md
- **Developed by:** [ai](https://apify.com/yodeling_elevator) (community)
- **Categories:** Jobs, Automation, Developer tools
- **Stats:** 2 total users, 1 monthly user, 0.0% runs succeeded, bookmark count unavailable
- **User rating**: No ratings yet

## Pricing

Pay per usage

This Actor is paid per platform usage. The Actor is free to use, and you only pay for the Apify platform usage, which gets cheaper the higher subscription plan you have.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows (PowerShell)
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## AmbitionBox Ultra-Fast Job Scraper

Production-grade job scraper for AmbitionBox using a **Cheerio-first, Playwright-fallback** architecture. Extracts job listings, enriches with job details and company data, then exports normalized, structured data to Apify Dataset.

### Architecture Overview

#### Core Principles

- **Nuxt SSR JSON First**: Extract `window.__NUXT__` from HTML using regex (NO JavaScript execution)
- **CheerioCrawler Primary**: Fast, lightweight scraping for all phases
- **PlaywrightCrawler Fallback**: ONLY when Cheerio fails to extract critical fields
- **Three-Phase Pipeline**: Listing → Job Detail → Company Overview
- **Deterministic URL Construction**: Use `companyUrlName` from Nuxt state as single source of truth

#### Data Flow

```
Phase 1: Listing Extraction (CheerioCrawler)
↓ Extract window.__NUXT__.data[1].jobs
↓ Parse job listings + companyUrlName
↓ Store in KeyValueStore
↓
Phase 2: Job Detail Enrichment (CheerioCrawler)
↓ Extract description, rating, skills
↓ Resolve company URL from companyUrlName
↓ Update KeyValueStore
↓
Phase 3: Company Overview Enrichment (CheerioCrawler)
↓ Extract size, website, industry, description
↓ STRICT employee count validation
↓ Merge job + company data
↓ Calculate confidence score
↓
Export to Apify Dataset
```
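The hand-off between phases can be sketched as follows. This is a minimal illustration, not the Actor's code: a plain `Map` stands in for the KeyValueStore, and `saveListing` and `enrich` are hypothetical helper names. It assumes that later phases add fields to the stored record and never overwrite existing data with empty values.

```javascript
// In-memory stand-in for the KeyValueStore used between phases (assumption).
const store = new Map();

// Phase 1: keep the listing fields, keyed by jobId.
function saveListing(job) {
  store.set(job.jobId, { ...job });
}

// Phases 2 and 3: merge newly extracted fields into the stored record.
// Null or undefined values never overwrite data from an earlier phase.
function enrich(jobId, fields) {
  const current = store.get(jobId);
  if (!current) return null; // unknown job: nothing to enrich
  for (const [key, value] of Object.entries(fields)) {
    if (value !== null && value !== undefined) current[key] = value;
  }
  return current;
}

saveListing({ jobId: '12345', title: 'Senior Software Engineer', companyUrlName: 'example-corp' });
enrich('12345', { description: 'Job description text...', companyRating: 4.2 });
const record = enrich('12345', { industry: 'Information Technology', companyWebsite: null });
console.log(record);
```

Keying every phase on `jobId` means a failed enrichment step still leaves a usable, partially filled record to export.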

#### Performance Targets

- **Concurrency**: 40 requests
- **Throughput**: 1200 requests/minute
- **Timeouts**: 20s handler, 30s navigation
- **Retries**: Max 1, on [429, 500, 502, 503]
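As a rough sketch, these targets could be collected into one options object. The option names below match Crawlee's `CheerioCrawler` options, but how the Actor actually wires them up is an assumption, and `buildCrawlerOptions` and `isRetryable` are hypothetical helpers.

```javascript
// Status codes that should trigger the single retry (from the targets above).
const RETRYABLE_STATUS_CODES = [429, 500, 502, 503];

// Build crawler options from Actor input, falling back to the documented targets.
function buildCrawlerOptions(input = {}) {
  return {
    maxConcurrency: input.maxConcurrency ?? 40,
    maxRequestsPerMinute: input.maxRequestsPerMinute ?? 1200,
    requestHandlerTimeoutSecs: input.requestHandlerTimeoutSecs ?? 20,
    navigationTimeoutSecs: 30,
    maxRequestRetries: 1,
  };
}

// Hypothetical helper: decide whether a response status warrants a retry.
function isRetryable(statusCode) {
  return RETRYABLE_STATUS_CODES.includes(statusCode);
}

console.log(buildCrawlerOptions({ maxConcurrency: 20 }));
console.log(isRetryable(429), isRetryable(404));
```

The returned object can be spread into a crawler constructor, for example `new CheerioCrawler({ ...buildCrawlerOptions(input), requestHandler })`.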

### Project Structure

```
cherro-scrapper/
├── src/
│   └── main.js              # Main orchestration
├── routes/
│   ├── listing.js           # Phase 1: Listing extraction
│   ├── jobDetail.js         # Phase 2: Job detail enrichment
│   └── company.js           # Phase 3: Company overview enrichment
├── utils/
│   ├── nuxtParser.js        # Nuxt state extraction
│   ├── validators.js        # Data validation (strict rules)
│   ├── normalizers.js       # Data normalization
│   └── confidenceScore.js   # Quality scoring
├── .actor/
│   ├── actor.json           # Apify actor configuration
│   └── input_schema.json    # Input schema
├── package.json
├── Dockerfile
├── .env.example
└── README.md
```

### Installation

#### Local Development

```bash
# Enter the cloned repository
cd cherro-scrapper

# Install dependencies
npm install

# Copy environment template
cp .env.example .env

# Edit .env with your configuration
# (Optional: add APIFY_TOKEN for local testing)

# Run scraper
npm start
```

#### Apify Deployment

```bash
# Install Apify CLI
npm install -g apify-cli

# Login to Apify
apify login

# Push to Apify
apify push

# Run on Apify platform
# Navigate to https://console.apify.com/actors
```

### Configuration

#### Input Parameters

Configure via Apify Console or `INPUT.json`:

```json
{
  "startUrls": [
    "https://www.ambitionbox.com/jobs",
    "https://www.ambitionbox.com/jobs?q=software+engineer"
  ],
  "maxConcurrency": 40,
  "maxRequestsPerMinute": 1200,
  "requestHandlerTimeoutSecs": 20
}
```

#### Environment Variables

See `.env.example` for local testing configuration.

### Data Schema

#### Output Format

Each job record in the dataset contains:

```json
{
  "jobId": "12345",
  "title": "Senior Software Engineer",
  "companyName": "Example Corp",
  "companyUrlName": "example-corp",
  "location": "Bangalore",
  "postedDate": "2025-12-15",
  "salary": {
    "min": 1500000,
    "max": 2500000,
    "currency": "INR"
  },
  "experience": {
    "min": 3,
    "max": 5
  },
  "description": "Job description text...",
  "skills": ["JavaScript", "React", "Node.js"],
  "companyRating": 4.2,
  "employeeCount": {
    "min": 201,
    "max": 500,
    "raw": "201-500"
  },
  "companyWebsite": "https://example.com",
  "industry": "Information Technology",
  "companyDescription": "Company description text...",
  "headquarters": "Bangalore, India",
  "confidenceScore": 87.5,
  "confidenceLevel": "GOOD",
  "scrapedAt": "2025-12-18T09:44:20.000Z",
  "sourceUrl": "https://www.ambitionbox.com/jobs"
}
```

#### Confidence Scoring

Data quality score (0-100) based on field completeness:

- **90-100**: EXCELLENT - All mandatory and most optional fields present
- **75-89**: GOOD - All mandatory fields + some enrichment
- **60-74**: FAIR - Mandatory fields present, limited enrichment
- **40-59**: POOR - Some mandatory fields missing
- **0-39**: VERY\_POOR - Multiple mandatory fields missing
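The mapping from score to level can be sketched like this. The band boundaries come from the list above; the function name `toConfidenceLevel` is hypothetical, and treating fractional scores such as 89.5 as part of the lower band is an assumption. How the score itself is weighted lives in `utils/confidenceScore.js` and is not reproduced here.

```javascript
// Lower bound of each band, highest first (bands taken from the list above).
const CONFIDENCE_BANDS = [
  [90, 'EXCELLENT'],
  [75, 'GOOD'],
  [60, 'FAIR'],
  [40, 'POOR'],
  [0, 'VERY_POOR'],
];

// Map a 0-100 score to its level label.
function toConfidenceLevel(score) {
  if (!Number.isFinite(score) || score < 0 || score > 100) {
    throw new RangeError(`score must be between 0 and 100, got ${score}`);
  }
  return CONFIDENCE_BANDS.find(([min]) => score >= min)[1];
}

console.log(toConfidenceLevel(87.5)); // GOOD
```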

### Critical Implementation Details

#### Employee Count Validation

**STRICT RULES** (implemented in `utils/validators.js`):

✅ **ACCEPT**:

- Ranges: `"201-500"`, `"1-10"`
- Lakh format: `"1 Lakh+"`, `"2 Lakhs"`
- Large numbers: `"10,000+"`, `"5000"`
- K values ≥ 100: `"100k"`, `"500k"`

❌ **REJECT**:

- Contains "follow": `"5.6k followers"`
- K values < 100: `"5.6k"`, `"10k"`, `"50k"`
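A minimal sketch of these rules, assuming the validator returns the `{ min, max, raw }` shape shown in the output format and `null` for rejected values. The function name `parseEmployeeCount` is hypothetical, and using `max: null` for open-ended values such as `"10,000+"` is an assumption, not confirmed behavior of `utils/validators.js`.

```javascript
// Strip thousands separators and parse as an integer.
const toInt = (text) => Number.parseInt(text.replace(/,/g, ''), 10);

function parseEmployeeCount(raw) {
  if (typeof raw !== 'string') return null;
  const text = raw.trim();
  // Follower counts are not employee counts.
  if (/follow/i.test(text)) return null;

  // Ranges such as "201-500".
  const range = text.match(/^(\d[\d,]*)\s*-\s*(\d[\d,]*)$/);
  if (range) return { min: toInt(range[1]), max: toInt(range[2]), raw: text };

  // Lakh format such as "1 Lakh+" (1 lakh = 100,000).
  const lakh = text.match(/^(\d+(?:\.\d+)?)\s*lakhs?\s*(\+)?$/i);
  if (lakh) {
    const value = Math.round(Number(lakh[1]) * 100000);
    return { min: value, max: lakh[2] ? null : value, raw: text };
  }

  // K values: only 100k and above are accepted.
  const kilo = text.match(/^(\d+(?:\.\d+)?)\s*k\s*(\+)?$/i);
  if (kilo) {
    const thousands = Number(kilo[1]);
    if (thousands < 100) return null;
    const value = Math.round(thousands * 1000);
    return { min: value, max: kilo[2] ? null : value, raw: text };
  }

  // Plain numbers such as "10,000+" or "5000".
  const plain = text.match(/^(\d[\d,]*)\s*(\+)?$/);
  if (plain) {
    const value = toInt(plain[1]);
    return { min: value, max: plain[2] ? null : value, raw: text };
  }
  return null; // anything else is rejected
}

console.log(parseEmployeeCount('201-500'));
console.log(parseEmployeeCount('5.6k followers'));
```

Checking for "follow" before any numeric pattern keeps social follower counts from ever reaching the number parsers.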

#### Company URL Resolution

**Priority Order**:

1. **companyUrlName from Nuxt state** (SINGLE SOURCE OF TRUTH)
2. Extract from job detail page anchor
3. Construct slug from company name (LAST RESORT)

Format: `https://www.ambitionbox.com/overview/{companyUrlName}-overview`
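The priority order can be sketched as a single resolver. The helper names (`slugify`, `slugFromHref`, `resolveCompanyUrl`) are hypothetical, and the sketch assumes the job detail anchor links to a path of the form `/overview/{slug}-overview`. The slug rules in `slugify` are a guess at a reasonable last resort, not AmbitionBox's own slug algorithm.

```javascript
const BASE_URL = 'https://www.ambitionbox.com/overview';

// Last-resort slug: lowercase, non-alphanumerics collapsed to single hyphens.
function slugify(name) {
  return name.toLowerCase().trim().replace(/[^a-z0-9]+/g, '-').replace(/^-+|-+$/g, '');
}

// Pull the slug out of an overview link such as "/overview/example-corp-overview".
function slugFromHref(href) {
  const match = /\/overview\/([a-z0-9-]+)-overview/i.exec(href ?? '');
  return match ? match[1] : null;
}

// Apply the three sources in priority order; return null if none is usable.
function resolveCompanyUrl({ companyUrlName, anchorHref, companyName } = {}) {
  const slug = companyUrlName                       // 1. Nuxt state
    || slugFromHref(anchorHref)                     // 2. job detail page anchor
    || (companyName ? slugify(companyName) : null); // 3. last resort
  return slug ? `${BASE_URL}/${slug}-overview` : null;
}

console.log(resolveCompanyUrl({ companyUrlName: 'example-corp', companyName: 'Other Name' }));
console.log(resolveCompanyUrl({ companyName: 'Example Corp' }));
```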

#### Nuxt State Extraction

**Method**: Regex-based extraction from HTML string

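```javascript
// Extract the object literal assigned to window.__NUXT__ (no JS execution).
// Anchoring on </script> keeps the lazy match from stopping at the first "}".
const nuxtRegex = /window\.__NUXT__\s*=\s*(\{.+?\})\s*;?\s*<\/script>/s;
const match = html.match(nuxtRegex);
if (!match) {
  throw new Error('window.__NUXT__ not found in HTML');
}
// JSON.parse only works when the state is serialized as plain JSON
const nuxtState = JSON.parse(match[1]);

// Navigate to jobs (the path may change if the site restructures its state)
const jobs = nuxtState.data?.[1]?.jobs ?? [];
```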

**NO JavaScript execution** - works in CheerioCrawler.

### Troubleshooting

#### Common Issues

**Issue**: No jobs found in Nuxt state

**Solution**:

- Check if AmbitionBox changed their Nuxt state structure
- Verify `data[1].jobs` path is correct
- Enable debug logging to inspect raw Nuxt state

**Issue**: Employee count always null

**Solution**:

- Check if validation rules are too strict
- Inspect raw employee count values in logs
- Adjust selectors in `routes/company.js`

**Issue**: Low confidence scores

**Solution**:

- Review field weights in `utils/confidenceScore.js`
- Check if selectors are extracting data correctly
- Verify company URLs are resolving properly

#### Debug Mode

Enable verbose logging:

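```javascript
// In src/main.js, before creating the crawler:
import { CheerioCrawler, log } from 'crawlee';

// Raise the global Crawlee log level to DEBUG
log.setLevel(log.LEVELS.DEBUG);

const crawler = new CheerioCrawler({
  // ... other config
});
```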

### Performance Optimization

#### Recommended Settings

**For maximum throughput**:

```json
{
  "maxConcurrency": 40,
  "maxRequestsPerMinute": 1200
}
```

**For stability** (avoid rate limiting):

```json
{
  "maxConcurrency": 20,
  "maxRequestsPerMinute": 600
}
```

#### Monitoring

Check Apify Console for:

- Request queue size
- Dataset item count
- Failed requests
- Retry histogram

### Dependencies

```json
{
  "apify": "^3.1.10",
  "crawlee": "^3.7.0",
  "cheerio": "^1.0.0-rc.12"
}
```

All dependencies are official packages published on npm.

### License

ISC

### Support

For issues or questions:

1. Check Apify logs for error messages
2. Review this README for troubleshooting steps
3. Inspect KeyValueStore for intermediate data
4. Enable debug logging for detailed output

***

**Built with**: Node.js 18+, Crawlee, Apify, Cheerio

**Architecture**: Cheerio-first, Playwright-fallback

**Performance**: 40 concurrent requests, 1200 req/min throughput

# Actor input Schema

## `targetJobCount` (type: `integer`):

Total number of jobs to collect across all categories.

## `headless` (type: `boolean`):

Run browser in headless mode (no visible UI). Set to false for debugging.

## Actor input object example

```json
{
  "targetJobCount": 200,
  "headless": true
}
```
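For callers who want to check input before starting a run, the constraints can be mirrored in a small helper. This is a sketch only: `normalizeInput` is a hypothetical name, and the limits (1 to 1000, default 200, `headless` defaulting to `true`) are taken from the OpenAPI specification later in this document rather than from the Actor's source.

```javascript
// Apply the documented defaults and reject values outside the schema limits.
function normalizeInput(input = {}) {
  const { targetJobCount = 200, headless = true } = input;
  if (!Number.isInteger(targetJobCount) || targetJobCount < 1 || targetJobCount > 1000) {
    throw new RangeError('targetJobCount must be an integer between 1 and 1000');
  }
  if (typeof headless !== 'boolean') {
    throw new TypeError('headless must be a boolean');
  }
  return { targetJobCount, headless };
}

console.log(normalizeInput({ targetJobCount: 50 }));
```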

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    targetJobCount: 200,
    headless: true,
};

// Run the Actor and wait for it to finish
const run = await client.actor("yodeling_elevator/ambitionbox-job-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {"targetJobCount": 200, "headless": True}

# Run the Actor and wait for it to finish
run = client.actor("yodeling_elevator/ambitionbox-job-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{"targetJobCount": 200, "headless": true}' |
apify call yodeling_elevator/ambitionbox-job-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=yodeling_elevator/ambitionbox-job-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Ambitionbox Job Scraper",
        "description": "Production-grade job scraper for AmbitionBox using a **Cheerio-first, Playwright-fallback** architecture. Extracts job listings, enriches with job details and company data, then exports normalized, structured data to Apify Dataset.",
        "version": "1.0",
        "x-build-id": "UYQ72G0peN9Jq7gzs"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/yodeling_elevator~ambitionbox-job-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-yodeling_elevator-ambitionbox-job-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/yodeling_elevator~ambitionbox-job-scraper/runs": {
            "post": {
                "operationId": "runs-sync-yodeling_elevator-ambitionbox-job-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/yodeling_elevator~ambitionbox-job-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-yodeling_elevator-ambitionbox-job-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "targetJobCount"
                ],
                "properties": {
                    "targetJobCount": {
                        "title": "Target Job Count",
                        "minimum": 1,
                        "maximum": 1000,
                        "type": "integer",
                        "description": "Total number of jobs to collect across all categories.",
                        "default": 200
                    },
                    "headless": {
                        "title": "Headless Mode",
                        "type": "boolean",
                        "description": "Run browser in headless mode (no visible UI). Set to false for debugging.",
                        "default": true
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
