An Actor that automatically locates and scrapes key contact details from German website imprint pages (Impressum). It extracts information such as company name, address, phone numbers, emails, and decision-makers (Entscheider, Entscheidungsträger).
Discount Tier Pricing System: Implemented comprehensive pricing manager with support for four discount tiers (FREE, BRONZE, SILVER, GOLD).
Automatically detects user's discount tier from APIFY_ACTOR_PRICING_TIER environment variable.
Dynamic pricing adjustments based on tier with detailed debug logging.
New PricingManager class in src/utilities/pricing_manager.py.
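As a rough illustration of the tier detection described above, the following minimal sketch reads APIFY_ACTOR_PRICING_TIER and applies a per-tier multiplier. The names DiscountTier, TIER_MULTIPLIERS, detect_tier, and adjusted_price are hypothetical, the multiplier values are invented for the example, and falling back to FREE for unknown values is an assumption; the actual PricingManager may differ.

```python
import logging
import os
from enum import Enum
from typing import Mapping, Optional

logger = logging.getLogger("pricing_sketch")


class DiscountTier(str, Enum):
    FREE = "FREE"
    BRONZE = "BRONZE"
    SILVER = "SILVER"
    GOLD = "GOLD"


# Hypothetical multipliers, for illustration only.
TIER_MULTIPLIERS = {
    DiscountTier.FREE: 1.0,
    DiscountTier.BRONZE: 0.9,
    DiscountTier.SILVER: 0.8,
    DiscountTier.GOLD: 0.7,
}


def detect_tier(environ: Optional[Mapping[str, str]] = None) -> DiscountTier:
    """Read the tier from APIFY_ACTOR_PRICING_TIER; assume FREE if unset or unknown."""
    env = os.environ if environ is None else environ
    raw = env.get("APIFY_ACTOR_PRICING_TIER", "").strip().upper()
    try:
        tier = DiscountTier(raw)
    except ValueError:
        tier = DiscountTier.FREE
    logger.debug("Detected discount tier %s (raw value %r)", tier.value, raw)
    return tier


def adjusted_price(base_price: float, tier: DiscountTier) -> float:
    """Scale a base price by the (hypothetical) multiplier of the given tier."""
    price = round(base_price * TIER_MULTIPLIERS[tier], 6)
    logger.debug("Base price %s adjusted to %s for tier %s", base_price, price, tier.value)
    return price
```

Passing a plain dictionary instead of os.environ keeps the detection easy to exercise without touching the real environment.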
Changed
Pricing Integration: Replaced charging_manager with pricing_manager throughout the codebase for discount tier support.
Removed
Automatic Billing Events: Removed manual charging for events now handled automatically by Apify:
actor-start: Now charged automatically by Apify platform.
successful-result: Now charged automatically by Apify platform.
Removed unused charge_event import from result_handler.py.
[v0.6.0—beta] — 2025-11-15
Changed
Headless Browser Configuration: Replaced the binary usePlaywright toggle with a three-mode headlessBrowser option offering granular control over fetching strategy:
headlessBrowserOn: Always use browser (most reliable for JavaScript-heavy sites)
headlessBrowserAuto: Automatic mode with HTTP first, browser fallback (default, recommended)
headlessBrowserOff: HTTP only, no browser (fastest, but may fail on dynamic sites)
Backward Compatibility: The deprecated usePlaywright=true setting now acts as an override, forcing headlessBrowserOn mode when explicitly set.
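The interplay between the three modes and the deprecated override can be sketched as follows. FetchMode and resolve_fetch_mode are hypothetical names, and falling back to the automatic mode for missing or unrecognized values is an assumption based on it being the documented default.

```python
from enum import Enum
from typing import Any, Mapping


class FetchMode(str, Enum):
    ON = "headlessBrowserOn"
    AUTO = "headlessBrowserAuto"
    OFF = "headlessBrowserOff"


def resolve_fetch_mode(actor_input: Mapping[str, Any]) -> FetchMode:
    """Pick the fetching strategy from the Actor input."""
    # Deprecated override: an explicit usePlaywright=true forces browser mode.
    if actor_input.get("usePlaywright") is True:
        return FetchMode.ON
    raw = actor_input.get("headlessBrowser", FetchMode.AUTO.value)
    try:
        return FetchMode(raw)
    except ValueError:
        # Assumption: unknown values fall back to the default automatic mode.
        return FetchMode.AUTO
```

For example, an input of {"usePlaywright": True, "headlessBrowser": "headlessBrowserOff"} resolves to headlessBrowserOn in this sketch, while the same input with usePlaywright omitted resolves to headlessBrowserOff.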
Removed
Optional Error Output: Removed the optional error output, which pushed URLs that failed data extraction into the dataset together with their error message.
[v0.5.5—beta] — 2025-11-08
Changed
NER API Integration Update: Migrated to new model-specific endpoint /extract-names/german for improved accuracy.
Base URL Configuration: The NER_API_URL environment variable now expects the base URL only (e.g., https://ner-api.domain.net); the endpoint path is appended automatically.
API Response Format: Updated to support new response structure with persons and raw_entities fields.
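A minimal sketch of how the full endpoint URL might be assembled from the base URL. The function name build_ner_url is hypothetical; only the NER_API_URL variable and the /extract-names/german path come from the notes above.

```python
import os
from typing import Optional

NER_ENDPOINT_PATH = "/extract-names/german"


def build_ner_url(base_url: Optional[str] = None) -> str:
    """Join the configured base URL and the model-specific endpoint path."""
    if base_url is None:
        base_url = os.environ.get("NER_API_URL", "")
    if not base_url:
        raise ValueError("NER_API_URL is not configured")
    # Normalize the base so the joined URL never contains a double slash.
    return base_url.rstrip("/") + NER_ENDPOINT_PATH
```

With the example base URL above, this yields https://ner-api.domain.net/extract-names/german whether or not the configured value ends in a slash.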
Added
Created .env.example template file for environment configuration.
Added ENV_SETUP.md with comprehensive documentation for NER API setup, URL construction, and troubleshooting.
Fixed
Improved confidence score extraction from raw_entities field with fallback to default value (0.8) when scores are missing.
Enhanced URL handling to automatically strip trailing slashes from base URLs.
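The confidence fallback can be pictured roughly as below. The response shape is an assumption: this sketch treats raw_entities as a list of objects with text and score keys, which the notes above do not confirm, and confidence_for is a hypothetical helper name.

```python
from typing import Any, Mapping

DEFAULT_CONFIDENCE = 0.8  # default used when the API returns no score


def confidence_for(person: str, response: Mapping[str, Any]) -> float:
    """Look up a person's score in raw_entities, else return the default."""
    # Assumed shape: {"persons": [...], "raw_entities": [{"text": ..., "score": ...}]}
    for entity in response.get("raw_entities") or []:
        score = entity.get("score")
        if entity.get("text") == person and isinstance(score, (int, float)):
            return float(score)
    return DEFAULT_CONFIDENCE
```

Returning the default instead of raising keeps every extracted person in the output even when the API omits scores.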
[v0.5.4—beta] — 2025-09-12
Added
Added cost limit checking to automatically stop processing when the user-configured 'Maximum cost per run' is reached.
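One way to picture the cost limit check is a loop that stops before the next URL would exceed the budget. This is a simplified sketch: process_until_budget, cost_per_url, and the flat per-URL cost are hypothetical, and the Actor's real accounting is likely more involved.

```python
from typing import Callable, Iterable, List


def process_until_budget(
    urls: Iterable[str],
    cost_per_url: float,
    max_cost: float,
    handle: Callable[[str], None],
) -> List[str]:
    """Process URLs in order and stop once the next one would exceed max_cost."""
    spent = 0.0
    processed: List[str] = []
    for url in urls:
        # Small tolerance so float rounding does not cut off the final affordable URL.
        if spent + cost_per_url > max_cost + 1e-9:
            break
        handle(url)
        spent += cost_per_url
        processed.append(url)
    return processed
```

Checking the budget before each URL, not after, means the run never charges past the configured maximum.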
[v0.5.3—beta] — 2025-09-06
Changed
Asynchronous Handling: The main extract method now creates and runs all extraction tasks concurrently using asyncio.gather.
Small improvements in the company name extraction.
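The concurrent pattern mentioned above for the extract method can be sketched with asyncio.gather as follows. The two extractor coroutines and their regular expressions are simplified stand-ins, not the Actor's actual extraction logic.

```python
import asyncio
import re
from typing import Dict, List


async def extract_emails(text: str) -> List[str]:
    await asyncio.sleep(0)  # yield control, standing in for real async work
    return re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)


async def extract_phones(text: str) -> List[str]:
    await asyncio.sleep(0)
    return re.findall(r"\+?\d[\d /-]{5,}\d", text)


async def extract(text: str) -> Dict[str, List[str]]:
    """Run all extraction tasks concurrently and collect their results."""
    emails, phones = await asyncio.gather(
        extract_emails(text),
        extract_phones(text),
    )
    return {"emails": emails, "phones": phones}
```

asyncio.gather returns results in the order the coroutines were passed, so the unpacking stays stable regardless of which task finishes first.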
[v0.5.2—beta] — 2025-09-01
Changed
The phone number and email output is now limited to 10 results.
[v0.5.1—beta] — 2025-08-30
Changed
Added additional keywords for decision maker extraction.
[v0.5.0—beta] — 2025-08-28
Added
Migration support: when the server is migrated on Apify's side, the Actor now persists state across runs using Actor.set_value() and Actor.get_value().
The time at which scraping of a website finished (scraped_at) is now included in the metadata output.
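The migration support above follows a save-and-resume pattern, sketched here. To keep the example runnable without the Apify SDK, InMemoryStore is a stand-in that only mimics the get_value and set_value calls; STATE_KEY, the state layout, and the run function are hypothetical.

```python
import asyncio
from typing import Any, Dict, List, Optional

STATE_KEY = "STATE"  # hypothetical key name


class InMemoryStore:
    """Stand-in for the key-value store used via Actor.get_value / Actor.set_value."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    async def get_value(self, key: str) -> Optional[Any]:
        return self._data.get(key)

    async def set_value(self, key: str, value: Any) -> None:
        self._data[key] = value


async def run(store: InMemoryStore, urls: List[str]) -> List[str]:
    """Process URLs, skipping those recorded as done before a restart."""
    state = await store.get_value(STATE_KEY) or {"done": []}
    newly_processed: List[str] = []
    for url in urls:
        if url in state["done"]:
            continue  # already handled before the migration
        # ... scraping of the URL would happen here ...
        state["done"].append(url)
        newly_processed.append(url)
        await store.set_value(STATE_KEY, state)  # persist after every URL
    return newly_processed
```

Persisting after each URL trades a few extra writes for losing at most one URL of progress when the server is migrated.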
Changed
Slightly improved the decision maker extraction for better accuracy.
Moved imprint_url output from the metaData to the standard output.
Fixed
Bug in company name extraction that occasionally returned incorrect values.
[v0.4.0—beta] — 2025-08-27
This is a major update, marking the transition from alpha to the first beta release! The actor has been completely rewritten from the ground up to be more powerful, reliable, and flexible.
Added
Dual Fetching Technology: The actor can now use a fast HTTP-based method for simple sites and automatically fall back to a powerful headless browser (Playwright) for modern, JavaScript-heavy websites. This dramatically increases the success rate of finding and scraping imprint pages.
Selective Data Extraction: You now have full control over what data you want. A new input field fieldsToExtract allows you to choose the exact information you need (e.g., only company name and email).
Enhanced Configuration: New input options like metaData and errorOutput have been added to give you more insight and control over the scraping process.
Proxy Support: A proxy server provided by Apify can now be set in the input configuration.
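The effect of the fieldsToExtract option can be pictured as a simple filter over the extracted record. The field names in the usage note and the select_fields helper are hypothetical, and treating an empty selection as "return everything" is an assumption.

```python
from typing import Any, Dict, Iterable, Optional


def select_fields(
    record: Dict[str, Any],
    fields_to_extract: Optional[Iterable[str]],
) -> Dict[str, Any]:
    """Keep only the requested fields of an extracted record."""
    wanted = set(fields_to_extract or [])
    if not wanted:
        # Assumption: no selection means all fields are returned.
        return dict(record)
    return {key: value for key, value in record.items() if key in wanted}
```

For instance, selecting only a company name field and an email field from a full record drops the address and phone entries from the result.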
Changed
Reliability Overhaul: The entire codebase has been refactored. This results in better stability and significantly more accurate data extraction.
Smarter Scraping Logic: The algorithms for identifying and parsing data have been completely reworked, leading to higher quality results across a wider variety of websites.
ML-Powered Decision Maker Extraction: The logic for identifying decision-makers has been upgraded from simple keyword matching to a sophisticated NER (Named Entity Recognition) model, resulting in much higher accuracy.
Redesigned Input: The actor's input configuration has been updated to be more intuitive and powerful, replacing the previous simple toggles with more granular controls.
Improved Output Structure: The output JSON is now more cleanly structured and provides additional context, such as confidence scores for certain data points.
[v0.3.0—alpha] — 2025-07-17
Added:
Handelsregister number and court extraction from imprint pages.
Graceful shutdown handling with signal handlers (SIGINT, SIGTERM).
Health check system for monitoring actor responsiveness.
Semaphore-based concurrency control to limit simultaneous requests.
Enhanced HTTP client timeout configuration.
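Semaphore-based concurrency control, as listed above, typically looks like the following minimal sketch. fetch_all, the fetch callback, and the default limit of 5 are hypothetical; the Actor's actual limit is not stated here.

```python
import asyncio
from typing import Awaitable, Callable, List


async def fetch_all(
    urls: List[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrency: int = 5,
) -> List[str]:
    """Fetch all URLs while allowing at most max_concurrency requests at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str) -> str:
        # Each task waits here until one of the slots is free.
        async with semaphore:
            return await fetch(url)

    return await asyncio.gather(*(bounded(url) for url in urls))
```

All tasks are created up front, but only max_concurrency of them are inside fetch at any moment; the rest wait on the semaphore.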
Fixed:
Critical bug where actor would hang indefinitely when URL processing timeout was reached.
Changed:
Enhanced logging for better debugging and monitoring.
[v0.2.3—alpha] — 2025-06-24
Added:
Timeout to automatically skip URLs that take too long to process.
Added URL validation to filter out malformed URLs.
Error logs for unsuccessfully processed URLs can now be included in the output.
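URL validation and the per-URL timeout can be sketched together as below. is_valid_url and process_with_timeout are hypothetical names, the validation rule (http or https scheme plus a host) is an assumption, and the error record layout is invented for the example.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict
from urllib.parse import urlparse


def is_valid_url(url: str) -> bool:
    """Accept only http(s) URLs that contain a host."""
    try:
        parsed = urlparse(url.strip())
    except ValueError:
        return False  # urlparse rejects some malformed inputs, e.g. broken IPv6 hosts
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


async def process_with_timeout(
    url: str,
    process: Callable[[str], Awaitable[Dict[str, Any]]],
    timeout_s: float,
) -> Dict[str, Any]:
    """Run process(url), returning an error record if it takes too long."""
    try:
        return await asyncio.wait_for(process(url), timeout=timeout_s)
    except asyncio.TimeoutError:
        return {"url": url, "error": f"timed out after {timeout_s}s"}
```

asyncio.wait_for cancels the pending task when the timeout expires, so a slow URL is skipped instead of blocking the rest of the run.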
[v0.2.2—alpha] — 2025-05-02
Changed:
Extracted Python directory for looking up German postal codes and cities.
Emails are now sorted by an algorithm that determines their relevance.
[v0.2.1—alpha] — 2025-05-02
Changed:
Improvements to the extraction of addresses and emails.
Fixed:
During email extraction, the script did not properly filter Unicode-encoded characters.
[v0.2.0—alpha] — 2025-04-24
Added:
Search for social media links.
Changed:
Improved performance of the decision maker extraction.
[v0.1.1—alpha] — 2025-04-17
Changed:
Default settings: Decision Makers Search is now enabled (true) in the default input settings.
Removed:
Removed the max_dept input option, since end users do not need to change it for the Actor to function.
Fixed:
Decision maker search functionality is now working properly.
[v0.1.0—alpha] — 2025-04-14
Added:
Initial release of the German Imprint Scraper.
Extracts Company Name, Address, Phone, Email from Imprint pages.