Cmg Werknl Search Service
Pricing
Pay per usage
Deprecated. Automates searching on werk.nl
Last modified
4 months ago
CMG Werk.nl Search Service Documentation
1. Overview
The CMG Werk.nl Search Service is a specialized web scraping application designed to automate the process of searching for and extracting candidate profiles from Werk.nl, the Dutch government's employment website. The service uses browser automation to navigate the Werk.nl platform, perform searches based on specific criteria, extract candidate profile data, and store this information in a Firebase database.
2. System Architecture
The application follows a modular architecture with several key components:
2.1. Core Components
- Main Module (src/main.py): Entry point for the application that orchestrates the workflow.
- SeleniumAgentWerkNL: Core browser automation component that handles navigation, authentication, searching, and data extraction.
- FirebaseManager: Manages all interactions with the Firebase database for data storage and retrieval.
- TelegramAuthHandler: Handles two-factor authentication through Telegram bot integration.
2.2. Data Flow
- The process begins with input parameters including search queries, maximum profiles to extract, and user credentials.
- The application authenticates with Werk.nl using stored cookies or login credentials.
- If 2FA is required, the Telegram bot integration facilitates code entry.
- Once authenticated, the application performs searches based on the provided criteria.
- Candidate profiles matching the search criteria are extracted and enriched with additional data.
- The extracted profiles are stored in Firebase for later use.
3. Detailed Component Documentation
3.1. SeleniumAgentWerkNL
This is the primary component responsible for browser automation and interaction with the Werk.nl website.
Key Features:
- Browser Initialization: Sets up a Selenium/undetected-chromedriver instance with proxy support and anti-detection features.
- Authentication: Handles login to Werk.nl with support for 2FA via Telegram.
- Cookie Management: Stores and retrieves browser cookies to maintain authentication sessions.
- Profile Search: Executes searches based on specified criteria.
- Data Extraction: Scrapes candidate profile data from search results.
- Profile Enrichment: Gathers detailed information about candidates.
Implementation Details:
- Uses undetected-chromedriver to bypass bot detection.
- Implements realistic human-like interactions (random delays, mouse movements).
- Handles various authentication flows including 2FA.
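The random delays mentioned above could look roughly like the following. This is a minimal sketch, not the service's actual code: the function name `human_delay`, its default bounds, and the injectable `sleep` parameter are hypothetical and chosen only for illustration.

```python
import random
import time
from typing import Callable


def human_delay(
    min_seconds: float = 0.8,
    max_seconds: float = 2.5,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Pause for a random duration to avoid a fixed, machine-like rhythm.

    The default bounds are illustrative assumptions, not values taken
    from the service. Returns the applied delay so callers can log it.
    """
    if min_seconds < 0 or max_seconds < min_seconds:
        raise ValueError("require 0 <= min_seconds <= max_seconds")
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; a browser agent would call `human_delay()` between clicks and keystrokes.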
3.2. FirebaseManager
Responsible for all database operations with the Firebase backend.
Key Features:
- User Management: Stores and retrieves user data and session information.
- Job Management: Retrieves job details and search queries.
- Candidate Storage: Saves extracted candidate profiles.
- Cookie Management: Securely stores and retrieves authentication cookies with encryption.
Implementation Details:
- Uses Firebase Admin SDK for server-side operations.
- Implements encryption for sensitive data like cookies.
- Provides batch operations for efficient data storage.
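To illustrate the batch operations mentioned above, here is a small sketch of how records might be grouped before being written. The helper name `chunked` and the idea of a fixed batch size are assumptions; the source does not say how FirebaseManager batches its writes.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `batch_size` items.

    Hypothetical helper: each yielded list would be written to the
    database in a single batch instead of one call per record.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch
```

For example, `chunked(profiles, 100)` would let a caller commit candidate profiles 100 at a time; the right batch size depends on the limits of the database in use.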
3.3. TelegramAuthHandler
Manages the two-factor authentication process using a Telegram bot.
Key Features:
- 2FA Request: Sends requests for 2FA codes to users through Telegram.
- Code Retrieval: Receives authentication codes from users via a streaming connection.
- Notification: Sends status updates and messages to users.
Implementation Details:
- Uses async HTTP client (httpx) for API communication.
- Implements retry mechanisms with exponential backoff for reliability.
- Provides streaming connection for real-time code receipt.
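The retry behaviour described above can be sketched with the standard library alone. The function name `retry_with_backoff`, its parameters, and the default values are assumptions; in the service, `operation` would presumably wrap an httpx request, but that wiring is not shown in the source.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_with_backoff(
    operation: Callable[[], Awaitable[T]],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
) -> T:
    """Run `operation`, retrying on failure with exponentially growing waits."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The wait doubles each attempt (base, 2*base, 4*base, ...),
            # capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # A little random jitter keeps parallel clients from
            # retrying in lockstep.
            await asyncio.sleep(delay + random.uniform(0, delay / 10))
    raise RuntimeError("max_attempts must be at least 1")
```

A real implementation would usually catch only transient errors (timeouts, connection resets) rather than every `Exception`.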
3.4. Data Models
The application uses Pydantic models for structured data handling:
Key Models:
- ProfileData: Represents a candidate's complete profile information.
- Experience: Describes a candidate's work experience.
- Education: Represents educational background information.
- CVProfile: Captures the core CV details for a candidate.
- CandidateMatch: Stores matching scores between candidates and job opportunities.
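As a rough illustration of how such models nest, the sketch below defines three of them with Pydantic. Only the class names come from this documentation; every field name and type is hypothetical, since the real schemas are not listed here.

```python
from typing import List, Optional

from pydantic import BaseModel, Field


class Experience(BaseModel):
    # All fields below are illustrative assumptions.
    job_title: str
    employer: Optional[str] = None
    start_year: Optional[int] = None
    end_year: Optional[int] = None  # None could mean "current position"


class Education(BaseModel):
    degree: str
    institution: Optional[str] = None


class ProfileData(BaseModel):
    profile_id: str
    location: Optional[str] = None
    experiences: List[Experience] = Field(default_factory=list)
    education: List[Education] = Field(default_factory=list)
```

Nested dictionaries scraped from a page are validated and converted on construction, for example `ProfileData(profile_id="abc", experiences=[{"job_title": "Welder"}])`.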
4. Authentication Flow
The system uses a multi-step authentication process:
- Cookie-based Authentication:
  - The system first attempts to authenticate using stored cookies retrieved from Firebase.
  - Cookies are encrypted in storage and decrypted for use.
- Credential-based Login:
  - If cookies are invalid or expired, the system falls back to username/password authentication.
- Two-Factor Authentication: When 2FA is required, the system:
  a. Sends a notification to the user via Telegram.
  b. Requests a 2FA code from the Telegram bot API.
  c. Waits for the user to provide the code through Telegram.
  d. Enters the received code on the Werk.nl login form.
  e. Notifies the user upon successful authentication.
- Session Persistence:
  - After successful authentication, cookies are captured and stored in Firebase for future use.
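The fallback order described in this section can be summarised in code. The sketch below is an assumption-laden outline rather than the service's implementation: `authenticate` and all of its callable parameters are hypothetical stand-ins for the Firebase, Selenium, and Telegram components.

```python
from typing import Callable, Dict, Optional

Cookies = Dict[str, str]


def authenticate(
    load_cookies: Callable[[], Optional[Cookies]],
    cookies_valid: Callable[[Cookies], bool],
    login: Callable[[], bool],
    request_2fa_code: Callable[[], str],
    submit_2fa_code: Callable[[str], bool],
    export_cookies: Callable[[], Cookies],
    save_cookies: Callable[[Cookies], None],
) -> str:
    """Return which path succeeded: "cookies" or "credentials"."""
    # Step 1: try the stored session first.
    cookies = load_cookies()
    if cookies is not None and cookies_valid(cookies):
        return "cookies"

    # Step 2: fall back to username/password. In this sketch, login()
    # returns True when the site then asks for a 2FA code.
    needs_2fa = login()

    # Step 3: obtain the code from the user and enter it.
    if needs_2fa:
        code = request_2fa_code()
        if not submit_2fa_code(code):
            raise PermissionError("2FA code was rejected")

    # Step 4: persist the fresh session for the next run.
    save_cookies(export_cookies())
    return "credentials"
```

Injecting each step as a callable keeps the ordering logic separate from browser and network code, which makes the flow easy to test with fakes.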
 
5. Search and Extraction Process
- Query Preparation:
  - The system retrieves search queries from Firebase or from direct input.
  - Queries are associated with specific job IDs and include search parameters like keywords, locations, etc.
- Search Execution:
  - The SeleniumAgentWerkNL navigates to the search interface on Werk.nl.
  - Search parameters are entered according to the provided queries.
  - Results are paginated and processed systematically.
- Profile Extraction:
  - Basic profile information is extracted from search results.
  - For each profile of interest, detailed information is gathered through profile page visits.
  - Data is structured according to the defined Pydantic models.
- Profile Filtering and Ranking:
  - Extracted profiles are filtered based on relevance criteria.
  - Profiles are ranked according to match quality for the specific job.
  - A maximum number of profiles (as specified in the input) are selected.
- Data Storage:
  - Selected profiles are stored in Firebase.
  - Association with the original job ID is maintained.
  - Additional metadata like extraction time and match scores are recorded.
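The filtering, ranking, and selection step can be sketched as follows. The `ScoredProfile` type, the 0.0 to 1.0 score range, and the `min_score` threshold are assumptions made for illustration; the source does not describe how match quality is computed.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ScoredProfile:
    profile_id: str
    match_score: float  # hypothetical relevance score between 0.0 and 1.0


def select_profiles(
    profiles: Iterable[ScoredProfile],
    min_score: float,
    max_profiles: int,
) -> List[ScoredProfile]:
    """Filter by relevance, rank best-first, and keep at most max_profiles."""
    if max_profiles < 0:
        raise ValueError("max_profiles must not be negative")
    relevant = [p for p in profiles if p.match_score >= min_score]
    ranked = sorted(relevant, key=lambda p: p.match_score, reverse=True)
    return ranked[:max_profiles]
```

Because Python's sort is stable, profiles with equal scores keep their original extraction order.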
 
6. Integration Points
6.1. Firebase Integration
- Used for persistent storage of user data, job information, and candidate profiles.
- Serves as the central data repository for the application ecosystem.
6.2. Telegram Bot Integration
- Facilitates two-factor authentication process.
- Provides real-time communication with users for authentication requirements.
6.3. Apify Platform Integration
- The service is designed to run as an Apify Actor.
- Leverages Apify's proxy infrastructure for reliable connections from Dutch IP addresses.
- Uses Apify's logging and storage capabilities.
7. Error Handling and Resilience
- Retry Mechanisms: Critical operations implement retry logic with exponential backoff.
- Exception Handling: Comprehensive error catching and logging for debugging.
- Status Tracking: Operation status is recorded in Firebase for monitoring.
- Proxy Management: Dynamic proxy rotation for avoiding IP blocks.
8. Security Considerations
- Credential Protection: User credentials are not hardcoded but supplied via environment variables.
- Cookie Encryption: Authentication cookies are encrypted before storage.
- Access Control: Firebase security rules control access to sensitive data.
- 2FA Support: Two-factor authentication provides an additional security layer.
9. Deployment
The service is designed to be deployed as an Apify Actor, which provides:
- Scheduled runs
- Webhook integration
- Input parameter configuration
- Results storage and retrieval
The deployment requires configuration of:
- Firebase credentials
- Telegram bot API credentials
- Werk.nl login credentials
- Proxy settings
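A deployment could collect these settings from environment variables at startup. The sketch below assumes hypothetical variable names (`FIREBASE_CREDENTIALS_PATH`, `TELEGRAM_BOT_TOKEN`, `WERKNL_USERNAME`, `WERKNL_PASSWORD`, `PROXY_URL`); the names actually used by the service are not given in this documentation.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Hypothetical variable names, chosen for illustration only.
REQUIRED_VARS = (
    "FIREBASE_CREDENTIALS_PATH",
    "TELEGRAM_BOT_TOKEN",
    "WERKNL_USERNAME",
    "WERKNL_PASSWORD",
)


@dataclass(frozen=True)
class ServiceConfig:
    firebase_credentials_path: str
    telegram_bot_token: str
    werknl_username: str
    werknl_password: str
    proxy_url: Optional[str] = None


def load_config(env: Optional[Mapping[str, str]] = None) -> ServiceConfig:
    """Build the config from environment variables, failing fast if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return ServiceConfig(
        firebase_credentials_path=env["FIREBASE_CREDENTIALS_PATH"],
        telegram_bot_token=env["TELEGRAM_BOT_TOKEN"],
        werknl_username=env["WERKNL_USERNAME"],
        werknl_password=env["WERKNL_PASSWORD"],
        proxy_url=env.get("PROXY_URL") or None,  # optional setting
    )
```

Reporting every missing variable in one error message saves a round of trial and error when configuring a new deployment.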
10. Limitations and Considerations
- Rate Limiting: The service includes delays to avoid triggering rate limits on Werk.nl.
- Session Management: Browser sessions need to be managed carefully to maintain authentication.
- Site Changes: As with any scraper, changes to the Werk.nl website structure may require updates.
- Legal Considerations: Usage should comply with Werk.nl's terms of service and applicable regulations.
11. ChromeDriver Version Management
The application implements a robust solution for handling ChromeDriver version mismatches with Chrome browser installations, which is a common issue with browser automation tools.
11.1. Automatic Version Management
The system uses a multi-layered approach to ensure compatibility:
- Auto-Download Compatible Versions:
  - The application uses the auto-download-undetected-chromedriver package to dynamically fetch the ChromeDriver version matching the installed Chrome browser.
  - This happens automatically during browser initialization, with the compatible driver being stored in a cache directory.
- Fallback to Manual Version Detection:
  - If automatic download fails, the system falls back to detecting the installed Chrome version using webdriver-manager.
  - The main version number is extracted and passed to undetected-chromedriver to ensure compatibility.
- Default Behavior Fallback:
  - As a last resort, if both automatic methods fail, the system will use the default behavior of undetected-chromedriver.
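The layered fallback can be outlined without any browser dependencies. In the sketch below, `parse_major_version` and `first_successful` are hypothetical helper names; in the real service the strategies would call auto-download-undetected-chromedriver, webdriver-manager, and undetected-chromedriver, and the extracted number would presumably be passed as the `version_main` argument of `undetected_chromedriver.Chrome`.

```python
import re
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def parse_major_version(version: str) -> int:
    """Extract the main version number, e.g. 120 from '120.0.6099.109'."""
    match = re.search(r"(\d+)\.\d+", version)
    if match is None:
        raise ValueError(f"no version number found in {version!r}")
    return int(match.group(1))


def first_successful(
    strategies: Iterable[Tuple[str, Callable[[], T]]],
) -> Tuple[str, T]:
    """Try each named strategy in order and return the first that succeeds."""
    errors: List[str] = []
    for name, strategy in strategies:
        try:
            return name, strategy()
        except Exception as exc:
            # Remember why this layer failed, then try the next one.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all strategies failed: " + "; ".join(errors))
```

Returning the name of the strategy that worked makes it easy to log which layer of the fallback was actually used on a given run.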
11.2. Troubleshooting ChromeDriver Issues
If you encounter ChromeDriver version mismatch errors, you can use the provided utility script or follow the manual steps:
Using the Utility Script
The project includes an interactive utility script that automates the troubleshooting process. The script can:
- Update all required dependencies
- Clear ChromeDriver caches
- Download the correct ChromeDriver version for your Chrome installation
It's recommended to run all steps when experiencing ChromeDriver issues.
Manual Troubleshooting Steps
If you prefer to resolve issues manually:
- Clear the Cache:
  - Delete cached ChromeDriver executables at:
    - Windows: %APPDATA%\undetected_chromedriver
    - Linux/macOS: ~/.undetected_chromedriver
    - Project cache: cache/undetected_chromedriver (in the project directory)
- Update Dependencies:
  - Ensure you have the latest versions of the required packages:
    $ pip install undetected-chromedriver auto-download-undetected-chromedriver webdriver-manager --upgrade
  - Or with Poetry:
    $ poetry update undetected-chromedriver auto-download-undetected-chromedriver webdriver-manager
- Check Chrome Installation:
  - Verify that Chrome is properly installed and accessible from the system PATH.
  - On some systems, you may need to specify the Chrome binary location explicitly.
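The cache-clearing step can also be scripted. The following is a sketch, not the project's utility script: the function names are hypothetical, and the directories are the ones listed in the manual steps.

```python
import os
import shutil
from pathlib import Path
from typing import Iterable, List


def default_cache_dirs(project_root: Path) -> List[Path]:
    """Return the cache locations listed in the manual steps."""
    dirs = [
        Path.home() / ".undetected_chromedriver",        # Linux/macOS
        project_root / "cache" / "undetected_chromedriver",  # project cache
    ]
    appdata = os.environ.get("APPDATA")  # only set on Windows
    if appdata:
        dirs.append(Path(appdata) / "undetected_chromedriver")
    return dirs


def clear_caches(dirs: Iterable[Path]) -> List[Path]:
    """Delete each existing cache directory and return those removed."""
    removed: List[Path] = []
    for directory in dirs:
        if directory.is_dir():
            shutil.rmtree(directory)
            removed.append(directory)
    return removed
```

Calling `clear_caches(default_cache_dirs(Path.cwd()))` from the project directory removes whichever of these caches exist; a fresh driver is then fetched the next time the browser is initialized.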
 
12. Future Enhancements
Potential areas for enhancement include:
- Improved profile matching algorithms
- More sophisticated filtering options
- Extended data extraction capabilities
- Additional authentication methods
- Performance optimizations for large-scale scraping
This documentation provides a comprehensive overview of the CMG Werk.nl Search Service, its architecture, functionality, and implementation details.