AWS Storage Blog

AWS re:Invent recap: Break down data silos with a data lake on Amazon S3

UPDATE 9/8/2021: Amazon Elasticsearch Service has been renamed to Amazon OpenSearch Service. See details.

When you have datasets in different places controlled by different groups, you are dealing with data silos, which inherently obscure data. In contrast, a data lake can serve as your central repository of data regardless of source or format. At re:Invent 2020-2021, we had the opportunity to present a session on building a data lake on Amazon S3, and how that can help break down data silos. This approach lets you quickly and easily democratize value extraction from all of your data, empowering key decision makers.

In our session, we discussed the benefits of deploying a data lake on Amazon S3. We covered typical use cases highlighting the solutions our customers have been building. We concluded with a view of what your own journey to an AWS data lake might look like. In this blog post, we provide you with some of the key takeaways from our re:Invent session. If you missed the live session, remember that you can always view re:Invent sessions on demand.

Challenges with storing, managing, and analyzing data

Customers have told us about several challenges they’ve had to deal with in the past when attempting to build and operate a data lake. Here are the common concerns we’ve been hearing about:

  • Customers want a data lake as the “single source of truth” or authoritative copy of all their data – whether it is structured, semi-structured, or unstructured. Their storage needs can be very elastic at times, so they never want to worry about running out of capacity or bandwidth. They also want a choice of storage tiers at different price points, so they can optimize their storage costs.
  • Data is growing exponentially. It’s no longer just about finding the capacity to hold this data. It’s also about managing complexity by automating all the housekeeping required to secure this data, perform routine operations on it, and optimize storage cost.
  • There is a need to democratize the data by making it available to many different kinds of personas. When it comes to analytics, it’s no longer just a small group of data analysts doing business intelligence who are looking at this data. We have data scientists, application developers and people in a variety of other roles within the organization. Every one of them is trying to make data-driven decisions. This makes it critical to adopt the right kind of security and access controls. There is a need to provide the appropriate level of access to each persona.
  • Customers are looking for shorter time-to-value for their data. Many of them are no longer talking about nightly “Extract-Transform-Load” (ETL) jobs. The analytics cycles are getting shorter. Real-time and near-real-time analytics use cases are becoming common, and customers need the analytics platforms to support those use cases.
  • Organizations also want choices when it comes to data formats – for example, they want to take advantage of open-source columnar formats like Apache Parquet or ORC. They also want a choice of analytics engines to perform different kinds of processing on their data.

Data silos limit the value of your data

Data silos limit the value of your data - diagram

Your data is only as valuable as the insights you can derive from it. Getting good insights from your data sometimes requires you to look at all of it. For example, let’s look at a typical ecommerce use case. You’re trying to get insights from your end users’ online buying behavior. So you need access to purchase history. Your CRM usually handles your purchase history, which is stored within your data warehouse, along with transactional records from many sources (OLTP, ERP, etc.). You also need access to clickstream data, so you can see how your end user is navigating your website, for example, where they’re clicking. This is something you can get from your webserver logs, which are typically stored with your unstructured data. In many environments, all unstructured data regardless of source (weblogs, sensor data, social media feeds etc.) is stored in large HDFS file systems. It is accessible only from your Hadoop clusters.

There are two different owners for these two different datasets. You now have the challenge of combining these two data sources and running queries across them. This is what will let you find out whether a new webpage in your site has contributed to your revenues. With traditional data silos, this is sometimes accomplished with lots of data movement, copying across silos, and much patience. It’s difficult to obtain timely, actionable insights in this siloed environment. This is a key reason why our customers have been gravitating to data lakes: they are looking for ways to break down their data silos, and view their data holistically.

Amazon S3 is the ideal storage platform for your data lake

Now let’s examine how using Amazon S3 for your data lake platform can help address the challenges we’ve discussed.

Durability, availability, and scalability

Amazon S3 offers you 11 9s (99.999999999%) of durability and 4 9s (99.99%) of availability out of the box, and that’s with your data stored in one Region. S3 uses a three Availability Zone architecture that gives you a level of resilience that’s not matched by any other provider. With Cross-Region Replication (CRR), you can choose to create a copy in a different Region. This can help if you have a compliance requirement or you need low-latency access to your data from different geographic locations. In terms of scalability, there’s no limit to how much data you can store in S3, and there’s no limit to the number of objects in your S3 buckets. Your throughput scales automatically as you add more objects to your S3 buckets.
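
To put 11 9s in perspective, here is a back-of-the-envelope sketch that converts a durability figure into an expected object-loss rate. It assumes the durability figure is an annual, per-object probability and that losses are independent. The object count of 10 million is purely illustrative, and the function name is our own.

```python
from fractions import Fraction


def expected_annual_losses(object_count: int, nines: int) -> Fraction:
    """Expected number of objects lost per year at the given durability."""
    # "N nines" of durability is treated as an annual loss probability
    # of 10**-N per object (an assumption of this sketch).
    loss_probability = Fraction(1, 10 ** nines)
    return object_count * loss_probability


# Illustrative assumption: a data lake holding 10 million objects.
losses_per_year = expected_annual_losses(10_000_000, nines=11)
years_per_lost_object = 1 / losses_per_year

print(f"Expected losses per year: {float(losses_per_year):.4f}")
print(f"Average years per single lost object: {int(years_per_lost_object):,}")
```

Under these assumptions, a 10-million-object data lake would expect to lose a single object roughly once every 10,000 years.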

Security, compliance, and audit capabilities

When it comes to security, Amazon S3 offers you several ways to lock down your data. For access control, you can use IAM policies, bucket policies, and/or ACLs. In S3, your buckets and objects are always private by default. If you must expose your data to other AWS accounts or the public, you will need to explicitly add policies or ACLs to do that.

With S3 Block Public Access, you have a choice of overriding all bucket policies or ACLs that grant public access. You can choose to block access either at the individual bucket level, or for all buckets in an AWS account.
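
As a minimal sketch of what this looks like in practice, the snippet below builds the four Block Public Access settings as a plain dictionary. The helper function and the bucket name in the comment are hypothetical. The commented-out call assumes you have boto3 installed and credentials configured.

```python
import json


def block_public_access_config(block_all: bool = True) -> dict:
    """Build the four Block Public Access settings as one configuration."""
    return {
        "BlockPublicAcls": block_all,        # reject new public ACLs
        "IgnorePublicAcls": block_all,       # ignore public ACLs already present
        "BlockPublicPolicy": block_all,      # reject new public bucket policies
        "RestrictPublicBuckets": block_all,  # restrict buckets that have public policies
    }


config = block_public_access_config()
print(json.dumps(config, indent=2))

# With boto3 available, the same dictionary could be applied to one bucket
# (the bucket name is a placeholder):
#   boto3.client("s3").put_public_access_block(
#       Bucket="example-data-lake-bucket",
#       PublicAccessBlockConfiguration=config,
#   )
```

For the account-wide option, the same configuration shape is accepted by the S3 Control API, which takes an account ID instead of a bucket name.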

With Access Analyzer for Amazon S3, you can easily evaluate your bucket access policies and identify any buckets with either public access, or access from other AWS accounts. Historically, Amazon S3 has always provided audit capabilities. For instance, you can use AWS CloudTrail to log every API call to S3, S3 Server Access Logging for details on access requests, and S3 Inventory to audit and report on the replication status and encryption status for every object.

More ways to optimize costs, easiest to automate

Automation becomes critical as your data lake grows to petabyte or even exabyte scale. Features like S3 Intelligent-Tiering give you automatic cost savings by moving objects within a single Amazon S3 storage class across four different access tiers when access patterns change. With the recent addition of the Archive Access tier and Deep Archive Access tier, which are opt-in, you can now save up to 95% on your storage costs for rarely accessed objects. You get those savings automatically, without any manual effort in analyzing or reviewing access patterns. Storage classes themselves can help you optimize your storage costs depending on your availability needs, storage use case, and access patterns.
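
Because the two archive tiers are opt-in, you enable them with a per-bucket configuration. The sketch below builds such a configuration as a dictionary. The function name, configuration ID, and bucket name are hypothetical. The minimum day counts reflect our understanding of the documented limits and are worth verifying before you rely on them.

```python
import json

# Minimum consecutive days without access before each opt-in tier can apply
# (assumed from the S3 documentation; verify before relying on them).
MIN_DAYS = {"ARCHIVE_ACCESS": 90, "DEEP_ARCHIVE_ACCESS": 180}


def archive_tiering_config(config_id: str, archive_days: int, deep_archive_days: int) -> dict:
    """Build an Intelligent-Tiering configuration that opts in to both archive tiers."""
    requested = {"ARCHIVE_ACCESS": archive_days, "DEEP_ARCHIVE_ACCESS": deep_archive_days}
    for tier, days in requested.items():
        if days < MIN_DAYS[tier]:
            raise ValueError(f"{tier} requires at least {MIN_DAYS[tier]} days, got {days}")
    return {
        "Id": config_id,
        "Status": "Enabled",
        "Tierings": [{"Days": days, "AccessTier": tier} for tier, days in requested.items()],
    }


tiering = archive_tiering_config("archive-cold-data", archive_days=90, deep_archive_days=180)
print(json.dumps(tiering, indent=2))

# With boto3 available, this could be applied as follows (placeholder bucket name):
#   boto3.client("s3").put_bucket_intelligent_tiering_configuration(
#       Bucket="example-data-lake-bucket",
#       Id=tiering["Id"],
#       IntelligentTieringConfiguration=tiering,
#   )
```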

For ease of automation, you can also take advantage of S3 Batch Operations. When you must apply an operation to millions or even billions of objects, you can do this with a single API request using S3 Batch Operations.

The most object-level controls

With Amazon S3, you can add up to 10 tags on each object. With both object prefixes and tags, you can configure object-level access controls, object-level storage tiering, and object-level replication rules.
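
To illustrate how a prefix and a tag combine into an object-level tiering rule, here is a hedged sketch that builds a lifecycle configuration as a dictionary. The rule ID, prefix, tag, day count, and helper names are all illustrative assumptions. The structure follows the S3 lifecycle configuration format as we understand it.

```python
import json

MAX_TAGS_PER_OBJECT = 10  # per-object tag limit mentioned above


def to_tag_set(tags: dict) -> list:
    """Convert a plain dict into the Key/Value list that S3 expects."""
    if len(tags) > MAX_TAGS_PER_OBJECT:
        raise ValueError(f"at most {MAX_TAGS_PER_OBJECT} tags per object, got {len(tags)}")
    return [{"Key": key, "Value": value} for key, value in tags.items()]


def tiering_rule(rule_id: str, prefix: str, tags: dict, days: int, storage_class: str) -> dict:
    """Build one lifecycle rule that applies only to objects matching prefix AND tags."""
    return {
        "ID": rule_id,
        "Status": "Enabled",
        "Filter": {"And": {"Prefix": prefix, "Tags": to_tag_set(tags)}},
        "Transitions": [{"Days": days, "StorageClass": storage_class}],
    }


lifecycle = {
    "Rules": [
        tiering_rule(
            rule_id="archive-raw-clickstream",  # hypothetical rule name
            prefix="raw/clickstream/",          # hypothetical prefix
            tags={"retention": "long"},         # hypothetical tag
            days=90,
            storage_class="GLACIER",
        )
    ]
}
print(json.dumps(lifecycle, indent=2))

# With boto3 available (placeholder bucket name):
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-data-lake-bucket", LifecycleConfiguration=lifecycle)
```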

Easy to manage access at scale

If you’re challenged by large numbers of user personas requiring access to the data lake, S3 Access Points may be what you need. S3 Access Points makes it easy to manage access to shared data at scale in S3. Access Points are simply named network endpoints that are attached to your bucket. You can customize the access policy for each Access Point. This gives you a clean way to isolate your end users and applications by Access Point name.
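
As a rough sketch of what a per-persona policy might look like, the function below builds an Access Point policy that lets one IAM role read objects under a single prefix. The Region, account ID, Access Point name, role name, and prefix are all placeholders, and the helper function is our own.

```python
import json


def access_point_read_policy(region: str, account_id: str, access_point: str,
                             role_name: str, prefix: str) -> dict:
    """Build an Access Point policy granting one role read access under a prefix."""
    access_point_arn = f"arn:aws:s3:{region}:{account_id}:accesspoint/{access_point}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:role/{role_name}"},
                "Action": "s3:GetObject",
                # Objects behind an Access Point are addressed as <arn>/object/<key>.
                "Resource": f"{access_point_arn}/object/{prefix}*",
            }
        ],
    }


# All values below are placeholders for illustration only.
policy = access_point_read_policy(
    region="us-west-2",
    account_id="123456789012",
    access_point="analysts",
    role_name="DataAnalyst",
    prefix="curated/",
)
print(json.dumps(policy, indent=2))
```

Creating one Access Point per persona, each with its own policy of this shape, keeps every group's permissions in a separate document instead of growing a single bucket policy.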

Analytics-optimized storage

With Amazon S3, you can use columnar formats – such as Apache Parquet and ORC – so your analytics can run a lot faster. For managed services like Athena, use of columnar formats helps reduce your processing cost, since your queries must scan less data. You also have compression options like Snappy with Parquet, reducing your capacity requirement, and reducing your storage cost.

With S3 Select, you can use SQL expressions to read only the specific portion of an object that your application needs. Your analytics engines like Hive and Presto can take advantage of S3 Select. Our customers have seen performance improvements of as much as 400% with this approach.
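
The sketch below shows the general shape of an S3 Select request, assuming S3 Select is available in your account and the object is a gzip-compressed CSV file with a header row. The bucket, key, column names, and helper function are hypothetical. The request is only built here, not sent.

```python
import json


def select_request(bucket: str, key: str, expression: str) -> dict:
    """Build the keyword arguments for an S3 Select call on a gzipped CSV object."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        # Treat the first row as column names so the SQL can reference them.
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}, "CompressionType": "GZIP"},
        "OutputSerialization": {"JSON": {}},
    }


# Hypothetical web log object and columns: return only the failing requests.
request = select_request(
    bucket="example-data-lake-bucket",
    key="raw/weblogs/2020-12-01.csv.gz",
    expression="SELECT s.user_id, s.page FROM S3Object s WHERE s.status = '500'",
)
print(json.dumps(request, indent=2))

# With boto3 available, the request could be sent as:
#   response = boto3.client("s3").select_object_content(**request)
```

Only the rows and columns that match the expression are returned to the caller, which is where the reduction in data transfer comes from.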

The pace of innovation at AWS

Every AWS service is constantly adding new features and making existing features more powerful and usable; Amazon S3 is no exception.

We just announced Amazon S3 Storage Lens, which provides organization-wide analytics and insights about your S3 usage and activity. Storage Lens can also give you recommendations to optimize your storage. As your data lake grows, tools like Storage Lens can help reduce your operational overhead and reduce costs.

We also just announced Amazon S3 Strong Consistency. Effective immediately, all S3 GET, PUT, and LIST operations, in addition to operations that change object tags, ACLs, or metadata, are now strongly consistent. What you write is what you will read, and the results of a LIST will be an accurate reflection of what’s in the bucket. This applies to all existing and new S3 objects, works in all Regions, and is available to you at no extra charge! There’s no impact on performance, you can update an object hundreds of times per second if you’d like, and there are no global dependencies. This improvement is great for data lakes, but other types of applications also benefit. With strong consistency, migration of on-premises workloads and storage to AWS should now be easier than ever before.

The Amazon S3 ecosystem

The Amazon S3 ecosystem, including Analytics, Machine Learning, Business Intelligence, and analytics-optimized storage.

But here’s perhaps the most compelling benefit of having your data lake on Amazon S3: you get to take advantage of the vast ecosystem of other services and frameworks that integrate with S3. You gain access to both AWS-native services and APN Partner offerings that can help move your data into the data lake. Moreover, along with AWS, the partner ecosystem can enable you to transfer, transform, catalog, analyze, and derive value from your data. All of the AWS-native services in this ecosystem are either serverless, or provide for managed compute. So you don’t have to manage any servers yourself.

A simple way to understand this: the Amazon S3 API has become the de facto lingua franca of the data storage world. The applications that matter to you speak that language. Third-party data that may be of value to your business is readily accessible in S3 too, via services like AWS Data Exchange. Building your data lake on Amazon S3 empowers you to leverage everything within this ecosystem, and achieve the shortest time-to-value from your data.

Use cases with an Amazon S3 data lake

While the Amazon S3 ecosystem is huge, we have observed some well-defined patterns of adoption when it comes to building and operating a data lake. These are the typical use cases that our customers have been focusing on:

  • Serverless analytics: Approaches such as using Amazon Athena to run SQL queries against data in Amazon S3, with Amazon QuickSight for visualizations of analysis results.
  • Streaming data: Real-time streaming analytics with managed services such as Amazon Kinesis Data Analytics, paired with operational analytics using Amazon Elasticsearch Service or batch analytics using Amazon EMR after the data is persisted to Amazon S3.
  • Hadoop/Spark enablement: Amazon EMR for agile, data-in-place Spark analytics, enabling the use of Amazon EC2 Spot pricing for additional cost savings.
  • Data warehouse modernization: Migrating to an AWS managed data warehouse with Amazon Redshift, with Redshift Spectrum to perform queries spanning the data warehouse and unstructured data in Amazon S3.

Many of our customers have been using the AWS Lake Formation service to bring up their data lakes easily and quickly. Lake Formation is designed to streamline the entire deployment process. Using Lake Formation, you can collect the data from all your sources, catalog it, and then clean and classify it using machine learning algorithms. Once the data is properly classified, Lake Formation can automate the setup of your access controls. You can use it to apply the correct access controls to the data, in addition to all of your analytics services. This lets you provide fine-grained access control across the hundreds or even thousands of personas in your organization.

Key takeaways and conclusion

To summarize, AWS customers have been adopting Amazon S3 for their data lakes for good reasons:

  • Amazon S3 offers the features that are required to serve as the unified repository for all your data. It can support formats based on open standards, and this helps you eliminate data silos.
  • Amazon S3 integrates easily with many AWS and third-party services for analytics, BI, ML, and visualization, so you are able to get insights from your data quickly.
  • With Amazon S3, you can grow your storage footprint and your data processing footprint in a completely decoupled fashion, over time, and only when you need it. That’s what you need for a cost-optimized data lake platform.

We hope this blog post has made you curious to explore our validated approaches to building and operating a data lake. To begin your journey, here are some next steps we recommend:

If you have any comments or questions about our re:Invent session or this blog post, please don’t hesitate to reply in the comments section – thank you!

Ganesh Sundaresan

Ganesh Sundaresan is a Senior Solutions Architect and leader of the storage technical field community within Amazon Web Services. For over three decades, he has been working with enterprises globally to help address their data storage challenges. Outside of work, Ganesh likes to spend time exploring the local wilderness with his family.

Joe Muniz

Joe Muniz is a Principal Storage Specialist for Amazon Web Services. His passion is providing customers with the help they need to get started faster, build high-performing applications, and design systems that can take full advantage of the transformational capabilities of AWS.