Build a data lake on Amazon S3: Recent customer case studies
Amazon Simple Storage Service (Amazon S3) is the largest and most performant object storage service for structured and unstructured data, and the storage service of choice for building a data lake. With Amazon S3, you can cost-effectively build and scale a data lake of any size in a secure environment where data is protected by 99.999999999% (11 9s) of durability.
With a data lake built on Amazon S3, you can use native AWS services to run big data analytics, artificial intelligence (AI), machine learning (ML), high-performance computing (HPC), and media data processing applications to gain insights from your unstructured data sets. Using Amazon FSx for Lustre, you can launch file systems for HPC and ML applications, and process large media workloads directly from your data lake. You also have the flexibility to use your preferred analytics, AI, ML, and HPC applications from the AWS Partner Network (APN). Because Amazon S3 supports a wide range of features, IT managers, storage administrators, and data scientists are empowered to enforce access policies, manage objects at scale, and audit activities across their S3 data lakes.
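To make the access-control point concrete, here is a minimal sketch of enforcing a read-only access policy on a data lake bucket with boto3. The bucket name, prefix, and role ARN are hypothetical placeholders, not values from any of the customers discussed here:

```python
import json

# Hypothetical names for illustration only
BUCKET = "example-data-lake"
ANALYTICS_ROLE_ARN = "arn:aws:iam::123456789012:role/AnalyticsRole"


def build_read_only_policy(bucket: str, role_arn: str) -> dict:
    """Construct a bucket policy granting an analytics role read-only
    access to objects under the curated/ prefix of the data lake."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowAnalyticsReadOnly",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/curated/*",
                ],
            }
        ],
    }


def apply_policy(bucket: str, role_arn: str) -> None:
    """Attach the policy to the bucket (requires AWS credentials)."""
    import boto3  # deferred so the sketch runs without AWS configured

    s3 = boto3.client("s3")
    s3.put_bucket_policy(
        Bucket=bucket,
        Policy=json.dumps(build_read_only_policy(bucket, role_arn)),
    )
```

The same pattern extends to auditing (S3 server access logging, AWS CloudTrail data events) and bulk management (S3 Batch Operations), all driven from the same bucket-level APIs.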
Amazon S3 hosts more than 10,000 data lakes. In this post, we showcase recent case studies from customers across industries and use cases who have built data lakes on Amazon S3 to gain value from their data.
Siemens Cyber Defense Center (CDC) uses Amazon SageMaker to label and prepare data, choose and train machine-learning algorithms, make predictions, and act. The solution also uses AWS Glue, a fully managed extract, transform, and load (ETL) service, and AWS Lambda, a serverless service that runs code in response to events.
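The Lambda piece of an event-driven pipeline like this can be sketched as a handler that reacts to new log objects landing in S3. This is a generic illustration of the S3 event notification shape, not Siemens' actual code; the downstream analysis step is only indicated in comments:

```python
import json
import urllib.parse


def handler(event, context):
    """Hypothetical Lambda handler: for each S3 object-created event,
    extract the bucket and key so a downstream threat-analysis step
    can fetch and score the newly arrived log file."""
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        processed.append({"bucket": bucket, "key": key})
        # In a real pipeline: fetch the object, run detection logic,
        # and forward findings to the SIEM.
    return {"statusCode": 200, "body": json.dumps(processed)}
```

Because Lambda scales with the event rate and bills per invocation, a small team can keep a high-throughput pipeline like this running without managing servers.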
With a data lake based on Amazon Simple Storage Service (Amazon S3) capable of collecting 6 TB of log data per day, security staff can perform forensic analysis on years’ worth of data without compromising the performance or availability of the Siemens security incident and event management (SIEM) solution. The serverless AWS cyber threat-analytics platform handles 60,000 potentially critical events per second but is developed and managed by a team of fewer than a dozen people.
To address its need for new data insights and less complex data collection, Georgia-Pacific sought to move to an advanced analytics approach enabled by an operations data lake. The company uses Amazon Kinesis to stream real-time data from manufacturing equipment to a central data lake based on Amazon Simple Storage Service (Amazon S3), allowing it to efficiently ingest and analyze structured and unstructured data at scale.
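A streaming-ingest pattern like this can be sketched with boto3's Kinesis client. The stream name and sensor fields below are hypothetical, chosen only to illustrate the producer side of equipment telemetry flowing toward an S3 data lake:

```python
import json

STREAM_NAME = "example-equipment-telemetry"  # hypothetical stream name


def encode_reading(machine_id: str, metric: str, value: float) -> bytes:
    """Serialize one sensor reading as a Kinesis record payload."""
    return json.dumps(
        {"machine_id": machine_id, "metric": metric, "value": value}
    ).encode("utf-8")


def send_reading(machine_id: str, metric: str, value: float) -> None:
    """Put one reading onto the stream (requires AWS credentials)."""
    import boto3  # deferred so the sketch runs without AWS configured

    kinesis = boto3.client("kinesis")
    kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=encode_reading(machine_id, metric, value),
        # Partitioning by machine keeps each machine's readings ordered
        PartitionKey=machine_id,
    )
```

On the consumer side, a service such as Amazon Kinesis Data Firehose can batch these records and deliver them to S3, where they become queryable alongside the rest of the data lake.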
Sysco used Amazon S3 and Amazon S3 Glacier to reduce storage costs by 40 percent, increase agility and security, and free up time to focus on creating new business applications. Sysco is a global foodservice distribution company that sells, markets, and distributes to restaurants, healthcare and educational facilities, lodging establishments, and other customers in more than 90 countries. AWS enabled Sysco to consolidate its data into a single data lake built on Amazon S3 and Amazon S3 Glacier, allowing the company to run analytics on its data and gain business insights.
Fanatics uses Amazon Simple Storage Service (Amazon S3) to provide secure, durable, and highly scalable storage for its analytical data. Using the Amazon S3 web service interface, the Fanatics data science team can easily store and quickly retrieve any amount of data. Taking advantage of its new AWS data lake solution, Fanatics is now able to analyze the huge volumes of data from its transactional, e-commerce, and back-office systems, and make this data available to its data scientists immediately for analytics.
With its Image Manager service, IDEXX stores more than two million images per week using the scalable storage and fast performance of Amazon Simple Storage Service (Amazon S3). “Image Manager provides unlimited storage and ensures images are backed up securely in AWS,” says Jeff Dixon, chief software engineering officer at IDEXX. “Additionally, our Web Picture Archiving and Communication System application runs on the AWS Cloud to support real-time collaboration among radiologists and other specialists anywhere in the world. With more than 250 million pets, 6 billion invoice items, and 600 million prescriptions on file, we have the opportunity to discover tremendous medical insights,” says Dixon. “AWS enabled us to iterate on our data-lake architecture very quickly to discover the best solution for our needs.”
Schoology relied on Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3) to support its LMS and Assessment platforms. “Amazon EC2 and Amazon S3 are the building blocks of AWS, which was perfect for us because we needed to quickly migrate our platform over several months in the summer before school started in the fall,” says Sam Marx, senior director of product operations at Schoology. “Then, once we did that and were up and running on AWS, we started looking at how to optimize the environment.” As a result, Schoology began using AWS Auto Scaling to automatically observe applications and adjust compute capacity for scaling up and down. In addition, Schoology uses Amazon Redshift for data collection and aggregation for analytics solutions. Schoology also leverages AWS Lambda for many services and sees serverless as the evolution in delivering value to customers with even less infrastructure overhead and maintenance.
Traffic congestion and air pollution are serious issues in India, particularly in megalopolises such as Bengaluru. Yulu’s mission is to address such challenges by providing sustainable and hassle-free micro mobility solutions for commuters traveling short distances.
Yulu improved service efficiency by 30–35% using its prediction model and AWS data lake. Yulu spent the first six months of operations collecting data to understand usage patterns. It then began constructing its prediction model using Amazon EMR for deeper analysis. “Amazon EMR gives us a seamless integration to move our data from our transaction system to Yulu Cloud – our data lake, which runs on Amazon Simple Storage Service (Amazon S3),” says Naveen Dachuri, co-founder and CTO of Yulu Bikes. “We can now proactively manage our vehicles, so they are always in great condition and act quickly on vehicles that move outside our operational zone to bring them back to high demand areas.”
Build a data lake on Amazon S3
These case studies are just a sampling of the many data lakes built on Amazon S3, and they illustrate common data lake use cases. Amazon S3 is the preferred centralized repository for structured and unstructured data and the storage service of choice for data lakes. Learn more about building a data lake on Amazon S3.