Jobs Scrapper

Pricing

$20.00/month + usage

Powerful AmbitionBox Job Scraper that extracts detailed job listings by role and location. Includes responsibilities, skills, qualifications, company insights, and Naukri integration for technical details. Fast, structured, and proxy-supported for large-scale data collection.

Developer

ai-scraper-labs

Maintained by Community

AmbitionBox Job Scraper

✨ Converted to Node.js + Playwright + Apify Actor
This project has been migrated from Python/Scrapy to Node.js/Playwright while preserving 100% of the original scraping logic.
See README_CONVERSION.md for conversion details.

An Apify Actor that scrapes job listings from AmbitionBox using Playwright browser automation, with optional detailed information extraction from Naukri job pages.

Features

  • AmbitionBox Job Scraping - Extracts comprehensive job listings including:

    • Job title, company, location, salary, and experience requirements
    • Detailed job descriptions and responsibilities
    • Required skills and qualifications
    • Employment type and application links
  • Naukri Detail Extraction - Optionally fetches detailed job information from linked Naukri pages:

    • Key responsibilities (structured list)
    • Required skills and technologies
    • Educational qualifications and experience requirements
    • Detailed job descriptions
  • Company Information - Extracts comprehensive company details:

    • Company overview and summary
    • Founding year and employee count
    • Company website and headquarters
    • Work policies (WFH, hybrid, etc.)
    • Complete benefits and perks list
  • Apify Platform Native - Built as a first-class Apify Actor:

    • Automatic request scheduling with AutoscaledPool
    • Built-in retry logic and error handling
    • Cloud-persisted request queue
    • Integrated dataset storage
  • Playwright Integration - Uses browser rendering to bypass anti-bot detection:

    • Handles JavaScript-rendered content
    • Bypasses AmbitionBox's anti-scraping measures
    • Chromium headless browser
    • Ensures reliable data extraction
  • Concurrency Support - Parallel processing for faster scraping:

    • Configurable concurrency (1-10 workers)
    • Request queue management
    • Automatic rate limiting
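The bounded-concurrency idea behind the worker setting can be illustrated with a short sketch. This is not the Actor's code (the Actor relies on the platform's AutoscaledPool); `run_bounded` and `worker` are hypothetical names, and only the 1-10 range comes from the list above.

```python
import asyncio


async def run_bounded(tasks, worker, concurrency=4):
    """Run worker(task) for every task with at most `concurrency` in flight."""
    # The 1-10 range mirrors the configurable worker range listed above.
    if not 1 <= concurrency <= 10:
        raise ValueError("concurrency must be between 1 and 10")
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(task):
        # Each task waits here until one of the `concurrency` slots is free.
        async with semaphore:
            return await worker(task)

    # gather() returns results in the same order as the input tasks.
    return await asyncio.gather(*(guarded(task) for task in tasks))
```

A semaphore keeps the sketch small; an autoscaled pool additionally adjusts the limit based on system load, which this example does not attempt.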

Input Parameters

Configure the scraper through the Actor input:

  • role (required, string) - Job role or title to search for

    • Example: "software engineer", "python developer", "data scientist"
  • location (optional, string) - Location to search jobs in

    • Example: "bangalore", "mumbai", "delhi"
    • Use "all" or "worldwide" for all locations
    • Note: AmbitionBox primarily lists jobs in India
    • If a specific location returns no results, the Actor automatically falls back to searching all locations
  • maxPages (optional, integer, default: 2) - Maximum number of listing pages to scrape

    • Range: 1-50
  • maxJobs (optional, integer, default: 20) - Maximum number of jobs to scrape

    • Set to 0 for unlimited
  • includeNaukriDetails (optional, boolean, default: true) - Whether to fetch detailed information from Naukri

    • true: Comprehensive data (slower)
    • false: Basic data only (3-4x faster)
  • proxyConfiguration (optional, object) - Proxy settings for the scraper

    • Default: Apify proxy with the RESIDENTIAL group
    • Recommended: residential proxies, which give the best success rates with AmbitionBox
    • Datacenter proxies may experience higher timeout rates
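As a minimal sketch of how the parameters above fit together, the helper below builds an input object and rejects out-of-range values before a run is started. The function name `build_input` is hypothetical; the defaults and ranges are taken from the list above, and whether the Actor itself validates input this way is an assumption.

```python
import json


def build_input(role, location=None, max_pages=2, max_jobs=20,
                include_naukri_details=True):
    """Return an Actor input dict, raising ValueError on invalid values."""
    if not isinstance(role, str) or not role.strip():
        raise ValueError("role is required and must be a non-empty string")
    if not 1 <= max_pages <= 50:
        raise ValueError("maxPages must be between 1 and 50")
    if max_jobs < 0:
        raise ValueError("maxJobs must be 0 (unlimited) or a positive integer")

    actor_input = {
        "role": role.strip(),
        "maxPages": max_pages,
        "maxJobs": max_jobs,
        "includeNaukriDetails": include_naukri_details,
    }
    # location is optional, so it is only included when provided.
    if location:
        actor_input["location"] = location
    return actor_input


if __name__ == "__main__":
    print(json.dumps(build_input("data scientist", location="mumbai",
                                 max_pages=1, max_jobs=10), indent=2))
```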

Output Format

The Actor stores data in the default Apify dataset with this structure:

{
  "title": "Senior Software Engineer",
  "company": "Tech Company Pvt Ltd",
  "location": "Bangalore, Karnataka",
  "exp_level": "3-6 years",
  "salary_range": "₹10-18 LPA",
  "url": "https://www.ambitionbox.com/jobs/...",
  "apply_url": "https://www.naukri.com/job-listings-...",
  "about_this_role": "Full job description text...",
  "key_responsibility": [
    "Design and develop scalable backend systems.",
    "Collaborate with cross-functional teams to define features.",
    "Ensure code quality through reviews and testing."
  ],
  "required_skills": [
    "Python",
    "Django",
    "AWS",
    "SQL",
    "Docker"
  ],
  "required_qualifications": [
    "Bachelor's degree in Computer Science or related field.",
    "3+ years of experience in backend development."
  ],
  "benefits_perks": [
    "Health Insurance",
    "Work From Home",
    "Flexible Hours",
    "Learning & Development",
    "Paid Time Off"
  ],
  "company_info": {
    "name": "Tech Company Pvt Ltd",
    "Founded in": "2015",
    "Global Employee Count": "500-1000",
    "Website": "https://techcompany.com",
    "company_summary": "Leading technology company specializing in...",
    "work_policy": "Hybrid: 3 days WFO, Remote: 2 days WFH"
  },
  "job_type": "Full-time"
}
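Downstream code often needs numbers rather than strings such as "3-6 years". The sketch below post-processes one dataset item with the field names shown above. `parse_range` and `summarize` are hypothetical helpers, and the assumption that experience and salary always appear as a "low-high" pair may not hold for every listing, so unparseable values come back as None.

```python
import re


def parse_range(text):
    """Extract the first 'low-high' numeric pair from text, or None if absent."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)", text or "")
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def summarize(item):
    """Reduce a scraped job item to a flat summary row."""
    return {
        "title": item.get("title"),
        "company": item.get("company"),
        # Assumed unit: years, as in the sample "3-6 years".
        "experience_years": parse_range(item.get("exp_level")),
        # Assumed unit: lakhs per annum, as in the sample "₹10-18 LPA".
        "salary_lpa": parse_range(item.get("salary_range")),
        "skills": ", ".join(item.get("required_skills") or []),
    }
```

Rows produced by `summarize` are flat, which makes them straightforward to write to CSV or load into a dataframe.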

How It Works

  1. URL Construction - Builds AmbitionBox search URL from role and location parameters
  2. Listing Extraction - Scrapes job listing pages using Playwright browser automation
  3. Detail Parsing - For each job, extracts comprehensive information from detail pages
  4. Naukri Integration - If enabled, follows "Apply on Naukri" links for additional details
  5. Company Data - Fetches company overview and benefits from dedicated pages
  6. Data Storage - Stores all structured data in Apify dataset
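The listing walk in steps 1 and 2, together with the maxPages and maxJobs limits and the location fallback described under Input Parameters, can be sketched as follows. This is an illustration of the control flow only: `collect_jobs` is a hypothetical name, and `fetch_page` is a stand-in callable that returns the list of jobs on one listing page; the real Actor's internals may differ.

```python
def collect_jobs(fetch_page, role, location, max_pages=2, max_jobs=20):
    """Collect jobs page by page until maxPages or maxJobs is reached."""
    jobs = []
    for page in range(1, max_pages + 1):
        listings = fetch_page(role, location, page)
        # Fallback: an empty first page for a specific location retries
        # the search across all locations.
        if page == 1 and not listings and location != "all":
            location = "all"
            listings = fetch_page(role, location, page)
        if not listings:
            break  # no more results
        for job in listings:
            jobs.append(job)
            # max_jobs == 0 means unlimited, so the check is skipped.
            if max_jobs and len(jobs) >= max_jobs:
                return jobs
    return jobs
```

Steps 3 to 5 (detail parsing, Naukri details, company data) would run per collected job and are left out of the sketch.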

Technologies Used

Why Playwright?

AmbitionBox employs sophisticated anti-bot detection that blocks standard HTTP requests, even when using proxies. Playwright integration provides:

  • Real Browser Rendering - Executes JavaScript and renders pages like a real user
  • Anti-Bot Bypass - Realistic browser fingerprinting and behavior
  • Reliable Extraction - Ensures all dynamic content is loaded
  • Scrapy Integration - Maintains all Scrapy benefits (pipelines, items, middlewares)

Advantages Over HTTP-Only Scraping

  • Reliability - 100% success rate vs 0% with HTTP requests
  • JavaScript Support - Handles dynamic content loading
  • Anti-Detection - Bypasses sophisticated bot detection
  • Future-Proof - Works even as sites add more JavaScript

Local Development

Prerequisites

  • Python 3.9+
  • Apify CLI

Installation

# Install Apify CLI
brew install apify-cli      # macOS
# or
npm install -g apify-cli    # Node.js

# Pull the Actor
apify pull

# Install dependencies
pip install -r requirements.txt

Running Locally

# Run with default input
apify run
# Or create/edit .actor/INPUT.json with your parameters

Example INPUT.json

{
  "role": "python developer",
  "location": "bangalore",
  "maxPages": 3,
  "maxJobs": 50,
  "includeNaukriDetails": true,
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}

Performance Tips

  • Start Small - Test with maxPages: 1 and maxJobs: 10 first
  • Adjust Concurrency - Modify CONCURRENT_REQUESTS in spider settings for faster/slower scraping
  • Skip Naukri - Set includeNaukriDetails: false for basic info only (much faster)
  • Use Proxies - Enable Apify proxy to avoid rate limiting

Scrapy Settings

The spider uses these custom settings for optimal performance:

custom_settings = {
    'CONCURRENT_REQUESTS': 8,        # Parallel requests
    'DOWNLOAD_DELAY': 2,             # Delay between requests (seconds)
    'ROBOTSTXT_OBEY': True,          # Respect robots.txt
    'USER_AGENT': 'Mozilla/5.0...',  # Custom user agent
}

You can modify these in src/spiders/ambitionbox.py if needed.

Troubleshooting

No jobs found

  • The website structure may have changed
  • Check if the search URL is correct
  • Try increasing DOWNLOAD_DELAY if pages load slowly

Incomplete data

  • Enable includeNaukriDetails for comprehensive extraction
  • Check if company pages are accessible
  • Review logs for specific errors

Rate limiting

  • Increase DOWNLOAD_DELAY in settings
  • Reduce CONCURRENT_REQUESTS
  • Ensure proxy configuration is enabled

Proxy timeouts

  • Switch to residential proxies in proxy configuration (highly recommended)
  • Residential proxies have much better success rates than datacenter proxies
  • Update proxyConfiguration in the input to include: "apifyProxyGroups": ["RESIDENTIAL"]
  • Note: Residential proxies consume more proxy credits but significantly improve reliability
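As a small sketch of the change described above, the helper below takes an existing input object and returns a copy with the residential group enabled. The name `with_residential_proxy` is hypothetical; the `useApifyProxy` and `apifyProxyGroups` keys are the ones used in this README.

```python
import json


def with_residential_proxy(actor_input):
    """Return a copy of the input with the RESIDENTIAL proxy group enabled."""
    updated = dict(actor_input)
    # Copy the nested dict so the caller's input is left untouched.
    proxy = dict(updated.get("proxyConfiguration") or {})
    proxy["useApifyProxy"] = True
    proxy["apifyProxyGroups"] = ["RESIDENTIAL"]
    updated["proxyConfiguration"] = proxy
    return updated


if __name__ == "__main__":
    base = {"role": "python developer",
            "proxyConfiguration": {"useApifyProxy": True}}
    print(json.dumps(with_residential_proxy(base), indent=2))
```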

Architecture

src/
├── spiders/
│   ├── __init__.py
│   ├── title.py          # Original title spider
│   └── ambitionbox.py    # AmbitionBox job scraper
├── items.py              # Item definitions
├── pipelines.py          # Data processing pipelines
├── middlewares.py        # Request/response middlewares
├── settings.py           # Scrapy settings
├── main.py               # Actor entry point
└── __main__.py           # Execution wrapper

License

Apache 2.0.