How Zalando built its data lake on Amazon S3
Founded in 2008, Zalando is Europe’s leading online platform for fashion and lifestyle with over 32 million active customers. I am a lead data engineer at Zalando and a steady contributor to the company’s cloud journey. In this blog post, I cover how Amazon Simple Storage Service (Amazon S3) became a cornerstone of the data infrastructure of our company. First, I discuss Zalando’s business need in gaining data insights, and how its historical technology stack provided disparate information. Then, I cover how we decided to migrate to AWS, specifically using Amazon S3 to build a data lake. Finally, I talk about how our Amazon S3 usage has evolved over time, from providing employee access to data, to optimizing our storage costs with various Amazon S3 storage classes.
Hopefully you can learn from the experiences I share and how we established best practices that help us run a data driven company at a multi-petabyte scale.
Zalando’s technology backstory
In 2015, Zalando was a fashion retailer with its IT environment running as a large on-premises monolith. Major parts of the infrastructure, be it on the transactional or the analytical side, were directly integrated and dependent on each other. At the same time as the complexity of systems increased, so did the number of teams that needed to add their bits and pieces. Then the decision was made to move the company from being “just” an online retailer to becoming a platform for fashion. Setting that goal meant preparing for scale. After considering our business needs for today and the future, it was a logical decision for the company to move to the cloud. We evaluated multiple cloud providers, and AWS was chosen as the cloud provider of choice due to its durability, availability, and scalability. We also considered the expansive ecosystem of services that AWS offers that we could leverage in the future. Zalando’s monolithic infrastructure was broken down into microservices, leading to teams with end-to-end responsibilities for their part of the tech landscape in their isolated development and operations space.
Changing our tech infrastructure had a direct impact on the data landscape of the company. Central databases accessed by many components became decentralized backends and communication was done via REST APIs. Our central data warehouse, with direct connections to the transactional data stores, had to cope with a decentralized data production without direct reachability. To overcome these challenges, a central team was put together with the goal of building a data lake at Zalando (Figure 1). There were two initial incentives to start a data lake: having a central data archive in this new, distributed environment and creating a distributed compute engine for the company.
Figure 1: Zalando’s data lake architecture
After lifting the relational databases’ size restrictions, it turned out that the company was already producing much more potentially valuable data. We needed a data storage option that could cope with the growing amounts of data, and whatever option we chose needed to be scalable, reliable, and affordable. Looking at the AWS service portfolio, Amazon S3 was the clear choice to act as the base layer of the new central data lake of the company.
Our first priority in getting the freshly setup and empty data lake started was the integration of the major data sources of the company. By that time, a central event bus was established whose major purpose was service to service communication among distributed microservices. The second major purpose was added by introducing an archiver component to save a copy of all published messages in the data lake. Major business processes were already communicating via the event bus, which made its content highly valuable for analytical processing. The ingestion pipeline is built based on serverless components to fulfill basic data preparation requirements like reformatting and repartitioning. This pipeline is described in more detail in this blog post.
Here is a 5 minute This Is My Architecture video on how we built a serverless event data processing pipeline and pushed the data into our data lake:
While the transition to the cloud was in progress, there were still plenty of valuable datasets produced and stored in the original data warehouse in our data center. One such dataset, for example, was the central sales logic of the company. A second central pipeline took care of making data warehouse (DWH) datasets available in the data lake, too. The third piece to the puzzle was the web tracking data, which was significant in raw data size and incredibly valuable when combined with the already present datasets. Those three pipelines ensured a steady feeding of the company’s initial data lake.
S3 service options in detail
Constantly growing our data lake on Amazon S3 resulted in various situations that facilitated usage of a variety of features provided by S3. For the remainder of this blog post, I cover the features used at Zalando, including their advantages and applicable use cases.
Data sharing and data access
If nobody is using it, there is no point in storing large amounts of data. For that reason, the first and most important challenge for us to solve was data sharing. At Zalando, many teams have their own AWS accounts that immediately put up requirements for cross account data sharing. The easiest ways to do so are bucket policies. Bucket policies allow you to attach a definition directly to your bucket that specifies principals (like roles) that are entitled some actions that they are allowed to perform. An example of such an action is GetObject on a particular resource, like a specific prefix in your bucket. This is very convenient to get started and works very well for low numbers of connections to other accounts. After some time, more and more people wanted access to the data. Not only did the management of the growing bucket policy become a bit cumbersome, eventually we even hit the upper limit of its size.
As there was a need to change to a more isolated and more scalable approach, we started using IAM roles. IAM roles work very similar to bucket policies for this particular use case. You add an inline policy, which looks very similar to that part of the previous bucket policy by specifying actions on certain resources. The only difference is the principal, which is solved by a trust relationship with the target account. In the role itself, you specify the account number you are trusting, which on their side enables them to assume the created role and execute their access through it.
Backing up and recovering data
Once you have a big data lake shared among many for various production uses, you start to think of backing up your data and ensuring recovery in case of incidents. The easiest way to do so is to turn on versioning for your production buckets. Versioning enables storage of previous versions in case you must access an older version or recover deleted versions, even if they are not visible via standard S3 API calls. Essentially each prefix becomes a stack of objects, of which only the latest one will be accessible by default.
Versioning is convenient in case you introduce a bug to your data pipeline and you must roll back the output. More importantly, versioning also works for data deletions. Instead of actually deleting your objects, it puts a delete marker on top of that object’s stack that makes it invisible, yet still accessible if asked for. In 2017 this feature was a life saver for us. Everybody knows the stories of new joiners that accidentally drop the production database. In a huge incident, the full history of our web tracking data was deleted, terabytes of our objects gone. Thanks to having versioning enabled we were able to restore all of them within a single day by simply removing their delete markers.
If you are looking for additional ways of backing up your data, consider S3 Cross-Region Replication (CRR). We started using CRR early, which allows you to store copies of your data in a physically different AWS Region. A convenient way to implement this is replicating your production buckets to a different Region, and combine that with immediate archiving of the target bucket’s contents to Amazon S3 Glacier.
Optimizing storage costs
At some point Amazon S3, even though it has a great pricing model, becomes an objective for cost optimization. What is negligible for gigabytes and terabytes of data becomes a significant part of your cloud bill at multi-petabyte scale.
To address this you must understand what data you actually have. Now, our organization spans across nearly 400 teams with over 8000 S3 buckets in total. It is impossible to understand your storage just from human input. S3 Inventory is a feature that gives you basic information about every object in your configured buckets. Simple things like size of the objects and last_modified timestamps can tell you how old your data is and how much of an impact it has on your cost distribution. If you have versioning enabled this is the way to understand how many objects are not the latest version and if you should trigger a cleanup for those.
The full potential of the S3 inventory is unleashed when you combine it with S3 Access Logs. While at this point you are aware of the impact of your stored data on your costs, S3 access logs let you differentiate between hot and cold data. What are the high value datasets that are accessed all the time, and what are the datasets that are significant in size, but nobody ever reads? The latter are easy candidates for cleanup operations and impactful cost savings.
Amazon S3 comes with different storage classes that address different requirements and access patterns. S3 Standard is the default storage class. It gives you the highest availability and the lowest cost for data retrieval. However, there are cheaper options when it comes to pure storage costs that have trade-offs on the other two characteristics. S3 Standard-Infrequent Access (S3 Standard-IA) is the best option for objects that are not touched for stretches of time but should remain available (for infrequent lookup of historic data). It is around 40% cheaper on storage, while the cost for access requests roughly doubles. S3 Glacier and S3 Glacier Deep Archive are even more cost-effective storage classes, trading for lower retrieval speeds. They are perfect for large amounts of less accessed data.
While the storage classes are great to reduce your costs when you exactly know your data and your use cases, handling them manually is quite the effort. S3 Intelligent-Tiering allows for automatic transition of objects between storage classes based on predefined criteria. We are saving 37% annually in storage costs by using Amazon S3 Intelligent-Tiering to automatically move objects that have not been touched within 30 days to S3 Standard-IA. We then move them back to S3 Standard when they get accessed.
We are saving 37% annually in storage costs by using Amazon S3 Intelligent-Tiering to automatically move objects that have not been touched within 30 days to S3 Standard-IA.
Many datasets lose value over time. While S3 Intelligent-Tiering can save you some money on data storage, after some time you might want to perform clean up with well-chosen retention times. Lifecycle policies are a great way to set up such retention times, but allow for even more functionality. For example, you can define storage class transitions or tagging of objects for later special usage. Among other use cases, we use lifecycle policies as an easy way to clean up datasets that can only be stored for a certain amount of time by law.
Zalando’s journey to the cloud has come a long way. Building a data lake on Amazon S3 has allowed employees across the organization to act on data they previously wouldn’t have had access to. By now there are multiple data pipelines, both centrally managed and operated by our internal users, which constantly increase our ingestion volume into Amazon S3. We were able to reduce our storage cost and save up to 40% annually by using various Amazon S3 services. For example, we used S3 Intelligent-Tiering and lifecycle policies to set up retention times to delete unused data. In return, our internal users are able to extract and analyze information from the data we store and ultimately improve our customers’ experience as they shop through our website and mobile applications.
I would like to close out with advice for those of you who are currently starting your Amazon S3 data lake. When building a data lake on Amazon S3, there are plenty of options on how to establish your data layout, between choosing different buckets, as well as prefixes within them. In the end, it is not only important to choose the naming and distribution plan, but equally important that you adhere to that plan. Once you arrive at a scale of thousands of buckets, plenty of effort must be put into cleanup and organization of data storage to guarantee global manageability.
Since Zalando started setting up its initial Amazon S3-based data lake, both the Zalando data infrastructure and S3 itself have changed a lot. We’ve had the chance to use many of the already available features over the last 5 years. Fast forward to early 2020, Zalando fully embraces decentralization of storage, with teams using their own buckets to store and share datasets through a centralized governance and data management approach. With an inflow of terabytes a day, and a total storage volume of 15 PB and growing, we are constantly evaluating new cloud storage features and innovating on techniques around data management.
The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.