An Actor that automatically locates and scrapes key contact details from German website imprint pages (Impressum). It extracts information such as company name, address, phone numbers, emails, and decision-maker details.
Asynchronous Handling: The main extract method now creates and runs all extraction tasks concurrently using asyncio.gather.
Small improvements in the company name extraction.
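The concurrent extraction pattern mentioned above can be sketched roughly as follows. This is a minimal illustration, not the Actor's actual code: the extractor functions `extract_company_name` and `extract_emails` and their logic are hypothetical placeholders. Only the use of `asyncio.gather` comes from the entry itself.

```python
import asyncio


async def extract_company_name(text: str) -> str:
    # Hypothetical extractor: returns the first non-empty line.
    await asyncio.sleep(0)  # stands in for real asynchronous work
    return next((line.strip() for line in text.splitlines() if line.strip()), "")


async def extract_emails(text: str) -> list[str]:
    # Hypothetical extractor: keeps whitespace-separated tokens containing "@".
    await asyncio.sleep(0)
    return [token for token in text.split() if "@" in token]


async def extract(text: str) -> dict:
    # Create one coroutine per field and run them concurrently.
    # gather returns the results in the order the awaitables were passed in.
    name, emails = await asyncio.gather(
        extract_company_name(text),
        extract_emails(text),
    )
    return {"company_name": name, "emails": emails}


result = asyncio.run(extract("Muster GmbH\nKontakt: info@example.com"))
```

Because `asyncio.gather` preserves argument order, the results can be unpacked directly into named fields.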
[v0.5.2-beta] - 2025-09-01
Changed
The phone number and email output is now limited to 10 results.
[v0.5.1-beta] - 2025-08-30
Changed
Added additional keywords for decision maker extraction.
[v0.5.0-beta] - 2025-08-28
Added
Migration support: when the server is migrated on Apify's side, the Actor now persists state across runs using Actor.set_value() and Actor.get_value().
The time at which scraping of a website finished (scraped_at) is now included in the metadata output.
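The idea behind the migration support can be sketched as below. This is an assumption-laden illustration: `InMemoryStore` is a stand-in so the example runs without the Apify SDK, and the `STATE` key, the state layout, and the `stop_after` parameter are hypothetical. In the Actor, the corresponding calls would go to Actor.set_value() and Actor.get_value().

```python
import asyncio
from typing import Optional


class InMemoryStore:
    """Stand-in for the Actor's key-value store (assumption, not the Apify SDK)."""

    def __init__(self) -> None:
        self._data: dict = {}

    async def set_value(self, key: str, value: object) -> None:
        self._data[key] = value

    async def get_value(self, key: str, default_value: object = None) -> object:
        return self._data.get(key, default_value)


STATE_KEY = "STATE"  # hypothetical key name


async def run(store, urls: list[str], stop_after: Optional[int] = None) -> list[str]:
    # Restore progress saved by an earlier, interrupted run, if any.
    state = await store.get_value(STATE_KEY) or {"done": []}
    done = list(state["done"])
    for url in urls:
        if url in done:
            continue  # already handled before the interruption
        if stop_after is not None and len(done) >= stop_after:
            break  # simulates the run being cut short by a migration
        done.append(url)
        # Persist after every URL so little work is lost on interruption.
        await store.set_value(STATE_KEY, {"done": list(done)})
    return done


store = InMemoryStore()
urls = ["https://a.example", "https://b.example", "https://c.example"]
first = asyncio.run(run(store, urls, stop_after=2))
second = asyncio.run(run(store, urls))
```

The second call picks up where the first one stopped and only processes the remaining URL.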
Changed
Slightly improved the decision maker extraction for better accuracy.
Moved imprint_url output from the metaData to the standard output.
Fixed
Bug in company name extraction that occasionally returned incorrect values.
[v0.4.0-beta] - 2025-08-27
This is a major update, marking the transition from alpha to the first beta release! The actor has been completely rewritten from the ground up to be more powerful, reliable, and flexible.
Added
Dual Fetching Technology: The actor can now use a fast HTTP-based method for simple sites and automatically fall back to a powerful headless browser (Playwright) for modern, JavaScript-heavy websites. This dramatically increases the success rate of finding and scraping imprint pages.
Selective Data Extraction: You now have full control over what data you want. A new input field fieldsToExtract allows you to choose the exact information you need (e.g., only company name and email).
Enhanced Configuration: New input options like metaData and errorOutput have been added to give you more insight and control over the scraping process.
Proxy Support: A proxy server provided by Apify can now be set in the input configuration.
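The dual fetching approach described above can be sketched as a simple fallback chain. This is only an illustration under stated assumptions: the `looks_rendered` heuristic, its 50-character threshold, and the stub fetchers are hypothetical. The real Actor presumably plugs an HTTP client and Playwright in where the stubs are.

```python
from typing import Callable, Optional

Fetcher = Callable[[str], Optional[str]]


def looks_rendered(html: Optional[str]) -> bool:
    # Hypothetical heuristic: an empty or very short body suggests that
    # the page needs JavaScript to render its content.
    return html is not None and len(html.strip()) >= 50


def fetch_with_fallback(url: str, fast: Fetcher, browser: Fetcher) -> tuple:
    """Try the cheap fetcher first; use the browser fetcher only if that fails."""
    try:
        html = fast(url)
    except Exception:
        html = None  # network or HTTP error: fall through to the browser
    if looks_rendered(html):
        return html, "http"
    return browser(url), "browser"


def fake_http(url: str) -> str:
    # Stub: the static site returns full markup, any other site an empty body.
    if "static" in url:
        return "<html><body>" + "Impressum " * 10 + "</body></html>"
    return ""


def fake_browser(url: str) -> str:
    # Stub for a headless browser that returns the rendered page.
    return "<html><body>" + "Rendered Impressum " * 10 + "</body></html>"


static_html, static_method = fetch_with_fallback("https://static.example", fake_http, fake_browser)
spa_html, spa_method = fetch_with_fallback("https://spa.example", fake_http, fake_browser)
```

Passing the fetchers in as parameters keeps the fallback logic independent of any particular HTTP or browser library.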
Changed
Reliability Overhaul: The entire codebase has been refactored. This results in better stability and significantly more accurate data extraction.
Smarter Scraping Logic: The algorithms for identifying and parsing data have been completely reworked, leading to higher quality results across a wider variety of websites.
ML-Powered Decision Maker Extraction: The logic for identifying decision-makers has been upgraded from simple keyword matching to a sophisticated NER (Named Entity Recognition) model, resulting in much higher accuracy.
Redesigned Input: The actor's input configuration has been updated to be more intuitive and powerful, replacing the previous simple toggles with more granular controls.
Improved Output Structure: The output JSON is now more cleanly structured and provides additional context, such as confidence scores for certain data points.
[v0.3.0-alpha] - 2025-07-17
Added:
Handelsregister number and court extraction from imprint pages.
Graceful shutdown handling with signal handlers (SIGINT, SIGTERM).
Health check system for monitoring actor responsiveness.
Semaphore-based concurrency control to limit simultaneous requests.
Enhanced HTTP client timeout configuration.
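Semaphore-based concurrency control of the kind listed above might look like the following sketch. The `process_all` function, the limit of 3, and the simulated request are assumptions for illustration. The `peak` counter exists only to make the limit observable.

```python
import asyncio


async def process_all(urls: list[str], limit: int) -> tuple:
    semaphore = asyncio.Semaphore(limit)
    active = 0
    peak = 0

    async def process(url: str) -> str:
        nonlocal active, peak
        async with semaphore:  # at most `limit` coroutines are inside this block
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stands in for the HTTP request
            active -= 1
            return f"done:{url}"

    results = await asyncio.gather(*(process(u) for u in urls))
    return list(results), peak


urls = [f"https://site{i}.example" for i in range(10)]
results, peak = asyncio.run(process_all(urls, limit=3))
```

All ten tasks are scheduled at once, but the semaphore ensures that no more than three requests are in flight at the same time.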
Fixed:
Critical bug where actor would hang indefinitely when URL processing timeout was reached.
Changed:
Enhanced logging for better debugging and monitoring.
[v0.2.3-alpha] - 2025-06-24
Added:
Timeout to automatically skip URLs that take too long to process.
URL validation to filter out malformed URLs.
Error logs for unsuccessfully processed URLs can now be included in the output.
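The three additions above (timeout, URL validation, and error output) fit together roughly as in this sketch. It is illustrative only: the validation rule, the `scrape` placeholder, the timeout value, and the shape of the error entries are assumptions rather than the Actor's actual behavior.

```python
import asyncio
from urllib.parse import urlparse


def is_valid_url(url: str) -> bool:
    # Accept only absolute http(s) URLs that have a host part.
    try:
        parsed = urlparse(url.strip())
    except ValueError:
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


async def scrape(url: str) -> str:
    # Hypothetical scraper: the "slow" host simulates a page that takes too long.
    await asyncio.sleep(5 if "slow" in url else 0)
    return f"scraped {url}"


async def run(urls: list[str], timeout: float) -> dict:
    output = {"results": [], "errors": []}
    for url in urls:
        if not is_valid_url(url):
            output["errors"].append({"url": url, "error": "invalid URL"})
            continue
        try:
            # wait_for cancels the scrape and raises once the timeout elapses.
            output["results"].append(await asyncio.wait_for(scrape(url), timeout))
        except asyncio.TimeoutError:
            output["errors"].append({"url": url, "error": "timeout"})
    return output


output = asyncio.run(
    run(["https://ok.example", "not a url", "https://slow.example"], timeout=0.05)
)
```

Malformed and slow URLs end up in the `errors` list instead of stopping the run.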
[v0.2.2-alpha] - 2025-05-02
Changed:
Extracted Python directory for looking up German postal codes and cities.
Emails are now sorted by an algorithm that determines their relevance.
[v0.2.1-alpha] - 2025-05-02
Changed:
Improvements to the extraction of addresses and emails.
Fixed:
During email extraction, the script did not properly filter Unicode-encoded characters.
[v0.2.0-alpha] - 2025-04-24
Added:
Search for social media links.
Changed:
Improved performance of the decision maker extraction.
[v0.1.1-alpha] - 2025-04-17
Changed:
Default settings: Decision Makers Search is now enabled (true) by default in the input settings.
Removed:
Removed the max_dept input option, since end users do not need to change it for the Actor to work.
Fixed:
Decision maker search functionality is now working properly.
[v0.1.0-alpha] - 2025-04-14
Added:
Initial release of the German Imprint Scraper.
Extracts Company Name, Address, Phone, Email from Imprint pages.