From Scrapy to Apify: how a retail data agency saved 90% on web scraping costs
Case study
This success story could be called “Why and how you should move your scrapers from Python to JavaScript”, “How to reduce your Amazon EC2 costs by 92% and save your team 4 hours of work a day”, or “How to migrate your scrapers using Apify SDK”. But whatever the best title would be, this success story is really about taking calculated risks, having a team that’s willing to relearn and reinvent, and staying adaptive in a fast-paced market. Let’s see how a company obsessed with providing quality data to retail enterprises managed to make the big move from Scrapy to Apify, from Python to JavaScript - all within one year - and what the payoffs of that decision have been.
Outline
Name: Daltix
Industry: Retail data collection and analytics
Service: Providing highly accurate and accessible data to some of the biggest names in European retail. Data at Daltix is remarkably curated, going from raw scraped online data to fully enriched market insights that combine proprietary product-matching algorithms with offline field data. Customers rely on Daltix’s data and platform for authoritative insights into their business, enabling them to confidently make daily decisions affecting millions of retail customers across Europe.
Customers: Colruyt, Unilever, Intergamma, GfK, etc.
Geographic scope: Benelux, expanding to Germany and France
Challenge: Slow, complicated, and tedious processes when handling data in Python, plus more and more anti-scraping measures
Solution: Apify SDK with its built-in Puppeteer integration.
Challenge
Daltix is a Belgium-based company that extracts data from 250+ e-commerce websites daily. They combine scraped online data with offline insights and purchased analytics to form an integrated data stream. Daltix’s retail customers range from DIY and FMCG to food retail; they rely on Daltix’s commercial business insights and granular analytics to support their business decisions.
The Daltix team started scraping with a custom-made Python framework built on top of Scrapy. This carried them through to 2019, by which point they had scaled up from 250K to 2 million scraped resources a day. But this growth exposed a different kind of issue: a number of structural limits in Scrapy and their original framework that were severely impacting their product and engineering roadmaps.
Moreover, as they scraped at this new level, Daltix began facing more and more blocking measures from the websites they were trying to access. With their Python-heavy stack, integrating with headless browsers also became tedious, and Puppeteer support left much to be desired.
As a result, they were expending huge engineering effort on ongoing maintenance, overspending on inefficient machine usage, and beginning to encounter sites that they simply could not scrape. For this whole array of reasons, Daltix decided last year to switch to Apify SDK. It was expected to be a challenging project: migrating 70 scrapers covering 250+ websites - practically all of their existing scrapers - onto the new platform, not to mention retraining the team and getting them used to new JavaScript workflows. Combine that with the small original team of three, the learning curve, other priorities, and ongoing maintenance, and you get a project that took a year to accomplish. But the payoffs turned out to be more than worth it.
Why did Daltix choose Apify SDK?
It was Daltix’s Lead Engineer on Data Collection, Charlie Orford, who came across the Apify platform and presented it to the team. The pro-Apify argument was simple and rather elegant: scraping the web using JavaScript - the language the web is built on - felt like a more intuitive solution. Writing scrapers in Python and then later integrating bits of JavaScript into them had started to seem counterproductive to the Daltix team. By contrast, the Apify codebase was modern, clean, and easy to follow. Crucially, its design lent itself to very easy customization and extension, making it an attractive foundation to build their renewed framework on top of.
Another major pro-Apify argument was the seamless integration with Puppeteer. Last but not least, once Daltix took into account the documentation, the tutorials, and clear metrics for evaluating the success of the migration, completing the transition became just a matter of time. The tutorials were especially helpful to the less technical, non-core teams with no prior JavaScript experience, as were Apify’s fast and thorough responses to GitHub issues. After a year of transitioning from Python - taking courses and applying previous knowledge - their migration to Apify is now complete and producing tangible benefits.
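To give a concrete sense of what the team was adopting, here’s a minimal sketch of a scraper built with Apify SDK (v1) and its built-in Puppeteer integration. The target URL and extracted fields are illustrative placeholders, not Daltix’s actual code:

```javascript
const Apify = require('apify');

Apify.main(async () => {
    // Queue up the pages to crawl; this URL is purely illustrative.
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://example.com/products' });

    // PuppeteerCrawler manages a pool of headless Chrome browsers and
    // handles concurrency, autoscaling, and retries out of the box.
    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        // Called once per request with a live Puppeteer page.
        handlePageFunction: async ({ request, page }) => {
            const title = await page.title();
            // Store the result in the default dataset.
            await Apify.pushData({ url: request.url, title });
        },
    });

    await crawler.run();
});
```

Much of the plumbing visible here - browser lifecycle, request queueing, scaling, retries - is exactly the kind of work a custom Scrapy-based framework would otherwise have to provide itself.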
Daltix team at full strength ✊
5 key benefits of switching from Scrapy to Apify
Over time, the Daltix team observed all sorts of performance metrics improving drastically:
- Thanks to a sharp reduction in the resources needed to run their scrapers, Daltix achieved a ~90% saving on Amazon EC2 costs and a 60% reduction in the time taken to collect retail data.
- This allowed them to save over 9,000 EC2 hours/month, adding up to almost 100,000 hours saved in total so far.
- They boosted their scraping throughput from 2 million to 5 million resources per day, now generating roughly 9 TB of data/month.
- After almost tripling the number of resources collected daily, they expect these numbers to double again over the course of the next 12 months.
- Despite scaling up their web scraping activity, they now need ~30% less engineering input from the team to manage the processes.
- Maintenance effort has dropped drastically, datasets are consistently higher quality, and anti-scraping measures cause far less hassle.
- At the end of the day, Daltix considers the move to JavaScript/Node and the adoption of Apify a “significant strategic win” for their business.
You can also check out a technical breakdown of the advantages gained by the Daltix team.
What’s the future?
The higher quality and coverage of their data now allows Daltix to combine even more sources and create better datasets for their customers. On the technical side, Daltix runs their scrapers on top of AWS, has adopted Snowflake as their data warehouse, and plans to move to Playwright. In terms of scale, the company has been growing gradually, country by country, and plans to expand further into Europe over the coming years. Overall, the team sees the transition to Apify as an important contributor to their success and a strong investment in the future.
“The combination of JavaScript, Node and Apify gave us all the pieces we needed to address every one of our existing challenges, futureproof our platform, and resume scaling up our web collection activities.” - Charlie Orford, Lead Engineer on Data Collection
Even if you’re not ready to switch from Python to JavaScript just yet, you can still use our newly released Python client, together with our Beautiful Soup and Pandas tutorials. And if you need more ideas on how web scraping data can be used in retail and e-commerce, head over to our industry pages - it’s all there.
“It was a year-long project for us to switch from Scrapy to Apify, as we had to train the team in JavaScript as well as migrate all of our existing scrapers. While the switch was challenging for our small team, it was also a big success and we are very happy with Apify.”
Simon Esprit
CTO of Daltix