Industry: Retail data collection and analytics
Service: Providing highly accurate and accessible data to some of the biggest names in European retail. Data at Daltix is carefully curated, going from raw scraped online data to fully enriched market insights that combine proprietary product matching algorithms with offline field data. Customers rely on Daltix’s data and platform for authoritative insights into their business, enabling them to confidently make daily decisions affecting millions of retail customers across Europe.
Customers: Colruyt, Unilever, Intergamma, GfK, etc.
Geographic scope: Benelux, expanding to Germany and France
Challenge: Slow, complicated, and tedious data-handling processes in Python, plus a growing number of anti-scraping measures
Solution: The Apify SDK and its Puppeteer integration.
Daltix is a Belgium-based company that extracts data from 250+ e-commerce websites daily. They combine scraped online data with offline insights and purchased analytics to form an integrated data stream. Daltix’s retail customers range from DIY and FMCG to food retail, and they rely on Daltix’s commercial business insights and granular analytics to support their business decisions.
The Daltix team started scraping with a custom-made Python framework built on top of Scrapy. This carried them through to 2019, by which point they had scaled from 250K to 2 million scraped resources a day. But that growth exposed a different kind of issue: a number of structural limits in Scrapy and their original framework that were severely impacting their product and engineering roadmaps.
Moreover, as they scraped at this new scale, Daltix began facing more and more blocking measures from the websites they were trying to access. With a Python-heavy stack, integrating with headless browsers also became tedious, and Puppeteer support in Python left much to be desired.
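To make the solution concrete, here is a minimal sketch of a Puppeteer-based crawler built with the Apify SDK. It assumes the SDK's v2-style API; the target URL, the `.price` selector, and the session-pool option are illustrative placeholders rather than Daltix's actual configuration.

```typescript
import Apify from 'apify';

Apify.main(async () => {
    // Queue the product pages to visit (placeholder URL, not a real Daltix target).
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://www.example-retailer.com/products' });

    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        // A session pool rotates browser sessions, which helps against blocking;
        // enabling it here is an assumption, not Daltix's documented setup.
        useSessionPool: true,
        handlePageFunction: async ({ request, page }) => {
            // Extract a couple of fields from the rendered page.
            const title = await page.title();
            const price = await page
                .$eval('.price', (el) => el.textContent?.trim())
                .catch(() => null); // hypothetical selector; tolerate pages without it
            await Apify.pushData({ url: request.url, title, price });
        },
    });

    await crawler.run();
});
```

The same crawler code can run locally or on the Apify platform, where request queues and datasets are managed for you.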
Daltix team at full strength ✊
Over time, the Daltix team observed all sorts of performance metrics improving drastically:
- Thanks to a sharp reduction in the resources needed to run their scrapers, Daltix achieved a ~90% saving in Amazon EC2 costs and a 60% reduction in the time taken to collect retail data.
- This allowed them to save over 9000 EC2 hours/month, leading to almost 100,000 hours saved in total so far.
- They boosted their scraping from 2 million to 5 million resources per day, now generating roughly 9 TB of data per month.
- After almost tripling the number of resources collected daily, they expect these numbers to double again over the course of the next 12 months.
- Despite scaling up their web scraping activity, they now need ~30% less engineering input from the team to manage the processes.
- Maintenance has been drastically reduced, with consistently higher-quality datasets and less hassle from anti-scraping measures.
You can also check out a technical breakdown of the advantages gained by the Daltix team.
The higher quality and coverage of data now allow Daltix to combine even more sources and create better datasets for their customers. On the technical side, Daltix runs their scrapers on top of AWS, has adopted Snowflake as their data warehouse, and is planning a move to Playwright. In terms of scale, the company has been growing gradually, country by country, and plans to expand further into Europe over the coming years. Overall, the team sees the transition to Apify as an important contributor to their success and a strong investment in the future.
CTO of Daltix