Jobs Scrapper
Pricing
$20.00/month + usage
Powerful AmbitionBox Job Scraper that extracts detailed job listings by role and location. Includes responsibilities, skills, qualifications, company insights, and Naukri integration for technical details. Fast, structured, and proxy-supported for large-scale data collection.
Developer
ai-scraper-labs
AmbitionBox Job Scraper
✨ Converted to Node.js + Playwright + Apify Actor
This project has been migrated from Python/Scrapy to Node.js/Playwright while preserving 100% of the original scraping logic.
See README_CONVERSION.md for conversion details.
An Apify Actor that scrapes job listings from AmbitionBox using Playwright browser automation, with optional detailed information extraction from Naukri job pages.
Features
- AmbitionBox Job Scraping - Extracts comprehensive job listings including:
  - Job title, company, location, salary, and experience requirements
  - Detailed job descriptions and responsibilities
  - Required skills and qualifications
  - Employment type and application links
- Naukri Detail Extraction - Optionally fetches detailed job information from linked Naukri pages:
  - Key responsibilities (structured list)
  - Required skills and technologies
  - Educational qualifications and experience requirements
  - Detailed job descriptions
- Company Information - Extracts comprehensive company details:
  - Company overview and summary
  - Founding year and employee count
  - Company website and headquarters
  - Work policies (WFH, hybrid, etc.)
  - Complete benefits and perks list
- Apify Platform Native - Built as a first-class Apify Actor:
  - Automatic request scheduling with AutoscaledPool
  - Built-in retry logic and error handling
  - Cloud-persisted request queue
  - Integrated dataset storage
- Playwright Integration - Uses browser rendering to bypass anti-bot detection:
  - Handles JavaScript-rendered content
  - Bypasses AmbitionBox's anti-scraping measures
  - Chromium headless browser
  - Ensures reliable data extraction
- Concurrency Support - Parallel processing for faster scraping:
  - Configurable concurrency (1-10 workers)
  - Request queue management
  - Automatic rate limiting
Input Parameters
Configure the scraper through the Actor input:
- role (required, string) - Job role or title to search for
  - Example: "software engineer", "python developer", "data scientist"
- location (optional, string) - Location to search jobs in
  - Example: "bangalore", "mumbai", "delhi"
  - Use "all" or "worldwide" for all locations
  - Note: AmbitionBox primarily lists jobs in India
  - If a specific location returns no results, automatically falls back to all locations
- maxPages (optional, integer, default: 2) - Maximum number of listing pages to scrape
  - Range: 1-50
- maxJobs (optional, integer, default: 20) - Maximum number of jobs to scrape
  - Set to 0 for unlimited
- includeNaukriDetails (optional, boolean, default: true) - Whether to fetch detailed information from Naukri
  - true: Comprehensive data (slower)
  - false: Basic data only (3-4x faster)
- proxyConfiguration (optional, object) - Proxy settings for the scraper
  - Default: Uses Apify proxy with RESIDENTIAL group (recommended)
  - Recommended: Use residential proxies for best success rates with AmbitionBox
  - Datacenter proxies may experience higher timeout rates
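The defaults and ranges listed above can be expressed as a small validation step. The following is a minimal sketch, not the Actor's actual code: `normalize_input` is a hypothetical helper, and treating an omitted location the same as "all" is an assumption.

```python
from typing import Any


def normalize_input(raw: dict[str, Any]) -> dict[str, Any]:
    """Apply the documented defaults and ranges to a raw Actor input dict.

    Hypothetical helper for illustration; the Actor's own validation may differ.
    """
    role = str(raw.get("role") or "").strip()
    if not role:
        raise ValueError("'role' is required")

    # "all" and "worldwide" mean no location filter.
    # Assumption: an omitted location behaves the same way.
    location = str(raw.get("location") or "all").strip().lower()
    if location in {"all", "worldwide"}:
        location = None

    max_pages = int(raw.get("maxPages", 2))
    if not 1 <= max_pages <= 50:
        raise ValueError("'maxPages' must be between 1 and 50")

    max_jobs = int(raw.get("maxJobs", 20))
    if max_jobs < 0:
        raise ValueError("'maxJobs' must be 0 (unlimited) or a positive integer")

    return {
        "role": role,
        "location": location,
        "maxPages": max_pages,
        "maxJobs": max_jobs,  # 0 means unlimited
        "includeNaukriDetails": bool(raw.get("includeNaukriDetails", True)),
    }
```

Calling `normalize_input({"role": "python developer"})` would fill in `maxPages: 2`, `maxJobs: 20`, and `includeNaukriDetails: True` from the defaults documented here.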
Output Format
The Actor stores data in the default Apify dataset with this structure:
{
  "title": "Senior Software Engineer",
  "company": "Tech Company Pvt Ltd",
  "location": "Bangalore, Karnataka",
  "exp_level": "3-6 years",
  "salary_range": "₹10-18 LPA",
  "url": "https://www.ambitionbox.com/jobs/...",
  "apply_url": "https://www.naukri.com/job-listings-...",
  "about_this_role": "Full job description text...",
  "key_responsibility": [
    "Design and develop scalable backend systems.",
    "Collaborate with cross-functional teams to define features.",
    "Ensure code quality through reviews and testing."
  ],
  "required_skills": ["Python", "Django", "AWS", "SQL", "Docker"],
  "required_qualifications": [
    "Bachelor's degree in Computer Science or related field.",
    "3+ years of experience in backend development."
  ],
  "benefits_perks": [
    "Health Insurance",
    "Work From Home",
    "Flexible Hours",
    "Learning & Development",
    "Paid Time Off"
  ],
  "company_info": {
    "name": "Tech Company Pvt Ltd",
    "Founded in": "2015",
    "Global Employee Count": "500-1000",
    "Website": "https://techcompany.com",
    "company_summary": "Leading technology company specializing in...",
    "work_policy": "Hybrid: 3 days WFO, Remote: 2 days WFH"
  },
  "job_type": "Full-time"
}
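Because each record nests lists and a `company_info` object, it often helps to flatten it before exporting to CSV or a spreadsheet. The sketch below assumes the field names shown in the sample record; `flatten_job` is a hypothetical helper and not part of the Actor. Note that the `company_info` keys mix styles ("Founded in", "Website", "work_policy"), so they are looked up exactly as written.

```python
def flatten_job(record: dict) -> dict:
    """Reduce one dataset record to a flat, CSV-friendly row.

    Missing fields become empty strings so partial records do not raise.
    """
    company = record.get("company_info") or {}
    return {
        "title": record.get("title", ""),
        "company": record.get("company", ""),
        "location": record.get("location", ""),
        "salary_range": record.get("salary_range", ""),
        # Lists are joined into a single cell.
        "skills": ", ".join(record.get("required_skills") or []),
        "founded": company.get("Founded in", ""),
        "company_website": company.get("Website", ""),
    }
```

Records scraped with `includeNaukriDetails` set to false may lack some of these fields, which is why every lookup has a fallback.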
How It Works
- URL Construction - Builds AmbitionBox search URL from role and location parameters
- Listing Extraction - Scrapes job listing pages using Scrapy's efficient crawling
- Detail Parsing - For each job, extracts comprehensive information from detail pages
- Naukri Integration - If enabled, follows "Apply on Naukri" links for additional details
- Company Data - Fetches company overview and benefits from dedicated pages
- Data Storage - Stores all structured data in Apify dataset
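The interaction between the page limit, the job limit, and the location fallback can be sketched as a generator. This is an illustration under stated assumptions, not the Actor's implementation: `fetch_page` is a hypothetical callable standing in for the real listing scraper, and `None` is used here to mean "all locations".

```python
from typing import Callable, Iterator, Optional

# Hypothetical signature: fetch_page(location, page_number) -> list of job dicts.
FetchPage = Callable[[Optional[str], int], list]


def iter_jobs(
    fetch_page: FetchPage,
    location: Optional[str],
    max_pages: int,
    max_jobs: int,
) -> Iterator[dict]:
    """Yield jobs page by page, honoring maxPages and maxJobs (0 = unlimited)."""
    first_page = fetch_page(location, 1)
    if not first_page and location is not None:
        # The specific location returned nothing: fall back to all locations.
        location = None
        first_page = fetch_page(location, 1)

    emitted = 0
    for page_number in range(1, max_pages + 1):
        jobs = first_page if page_number == 1 else fetch_page(location, page_number)
        if not jobs:
            return  # no more listings
        for job in jobs:
            yield job
            emitted += 1
            if max_jobs and emitted >= max_jobs:
                return
```

Writing it as a generator means the scraper stops requesting further pages as soon as the job limit is reached.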
Technologies Used
- Scrapy - Fast, high-level web scraping framework
- Scrapy-Playwright - Browser automation integration for Scrapy
- Playwright - Modern browser automation library
- Apify SDK for Python - Actor framework and data storage
- BeautifulSoup4 - HTML parsing (for complex extractions)
- Regular Expressions - Advanced text extraction and cleaning
Why Playwright?
AmbitionBox employs sophisticated anti-bot detection that blocks standard HTTP requests, even when using proxies. Playwright integration provides:
✅ Real Browser Rendering - Executes JavaScript and renders pages like a real user
✅ Anti-Bot Bypass - Realistic browser fingerprinting and behavior
✅ Reliable Extraction - Ensures all dynamic content is loaded
✅ Scrapy Integration - Maintains all Scrapy benefits (pipelines, items, middlewares)
Advantages Over HTTP-Only Scraping
- Reliability - 100% success rate vs 0% with HTTP requests
- JavaScript Support - Handles dynamic content loading
- Anti-Detection - Bypasses sophisticated bot detection
- Future-Proof - Works even as sites add more JavaScript
Local Development
Prerequisites
- Python 3.9+
- Apify CLI
Installation
# Install Apify CLI
brew install apify-cli  # macOS
# or
npm -g install apify-cli  # Node.js
# Pull the Actor
apify pull
# Install dependencies
pip install -r requirements.txt
Running Locally
# Run with default input
apify run
# Or create/edit .actor/INPUT.json with your parameters
Example INPUT.json
{
  "role": "python developer",
  "location": "bangalore",
  "maxPages": 3,
  "maxJobs": 50,
  "includeNaukriDetails": true,
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}
Performance Tips
- Start Small - Test with maxPages: 1 and maxJobs: 10 first
- Adjust Concurrency - Modify CONCURRENT_REQUESTS in spider settings for faster/slower scraping
- Skip Naukri - Set includeNaukriDetails: false for basic info only (much faster)
- Use Proxies - Enable Apify proxy to avoid rate limiting
Scrapy Settings
The spider uses these custom settings for optimal performance:
custom_settings = {
    'CONCURRENT_REQUESTS': 8,        # Parallel requests
    'DOWNLOAD_DELAY': 2,             # Delay between requests (seconds)
    'ROBOTSTXT_OBEY': True,          # Respect robots.txt
    'USER_AGENT': 'Mozilla/5.0...',  # Custom user agent
}
You can modify these in src/spiders/ambitionbox.py if needed.
Troubleshooting
No jobs found
- The website structure may have changed
- Check if the search URL is correct
- Try increasing DOWNLOAD_DELAY if pages load slowly
Incomplete data
- Enable includeNaukriDetails for comprehensive extraction
- Check if company pages are accessible
- Review logs for specific errors
Rate limiting
- Increase DOWNLOAD_DELAY in settings
- Reduce CONCURRENT_REQUESTS
- Ensure proxy configuration is enabled
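As a rough illustration of the first two adjustments, the sketch below derives a slower profile from the values documented under Scrapy Settings. `throttled` is a hypothetical helper and the factor of 2 is an arbitrary starting point, not a recommendation from the Actor's authors.

```python
# Values documented in the Scrapy Settings section of this README.
DEFAULT_SETTINGS = {
    "CONCURRENT_REQUESTS": 8,
    "DOWNLOAD_DELAY": 2,
    "ROBOTSTXT_OBEY": True,
}


def throttled(settings: dict, factor: int = 2) -> dict:
    """Return a copy with fewer parallel requests and a longer delay."""
    slower = dict(settings)
    # Never drop below one concurrent request.
    slower["CONCURRENT_REQUESTS"] = max(1, settings["CONCURRENT_REQUESTS"] // factor)
    slower["DOWNLOAD_DELAY"] = settings["DOWNLOAD_DELAY"] * factor
    return slower
```

The resulting dict could be copied into the spider's `custom_settings`; applying the helper repeatedly backs off further if rate limiting persists.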
Proxy timeouts
- Switch to residential proxies in proxy configuration (highly recommended)
- Residential proxies have much better success rates than datacenter proxies
- Update input to include: "apifyProxyGroups": ["RESIDENTIAL"]
- Note: Residential proxies consume more proxy credits but significantly improve reliability
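To show where the `apifyProxyGroups` key belongs, the sketch below adds it to the `proxyConfiguration` object of an existing input. `with_residential_proxy` is a hypothetical helper written for this example; the input fields are the ones documented under Input Parameters.

```python
import json


def with_residential_proxy(actor_input: dict) -> dict:
    """Return a copy of the input that requests the RESIDENTIAL proxy group."""
    updated = dict(actor_input)
    proxy = dict(updated.get("proxyConfiguration") or {})
    proxy["useApifyProxy"] = True
    proxy["apifyProxyGroups"] = ["RESIDENTIAL"]
    updated["proxyConfiguration"] = proxy
    return updated


example_input = {
    "role": "python developer",
    "location": "bangalore",
    "proxyConfiguration": {"useApifyProxy": True},
}

# Print JSON that can be pasted into the Actor input.
print(json.dumps(with_residential_proxy(example_input), indent=2))
```

The helper copies the input rather than mutating it, so the original configuration stays available for comparison runs with datacenter proxies.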
Architecture
src/
├── spiders/
│   ├── __init__.py
│   ├── title.py          # Original title spider
│   └── ambitionbox.py    # AmbitionBox job scraper
├── items.py              # Item definitions
├── pipelines.py          # Data processing pipelines
├── middlewares.py        # Request/response middlewares
├── settings.py           # Scrapy settings
├── main.py               # Actor entry point
└── __main__.py           # Execution wrapper
License
Apache 2.0.