Data Lake Storage on AWS

The most secure, durable, and scalable storage to build your data lake

Amazon Simple Storage Service (S3) is the largest and most performant object storage service for structured and unstructured data and the storage service of choice to build a data lake. With Amazon S3, you can cost-effectively build and scale a data lake of any size in a secure environment where data is protected by 99.999999999% (11 9s) of durability.

With a data lake built on Amazon S3, you can use native AWS services to run big data analytics, artificial intelligence (AI), machine learning (ML), high-performance computing (HPC) and media data processing applications to gain insights from your unstructured data sets. Using Amazon FSx for Lustre, you can launch file systems for HPC and ML applications, and process large media workloads directly from your data lake. You also have the flexibility to use your preferred analytics, AI, ML, and HPC applications from the Amazon Partner Network (APN). Because Amazon S3 supports a wide range of features, IT managers, storage administrators, and data scientists are empowered to enforce access policies, manage objects at scale and audit activities across their S3 data lakes.

Amazon S3 hosts tens of thousands of data lakes for household brands such as Netflix, Airbnb, Sysco, Expedia, GE, and FINRA, which use them to scale securely with their needs and to discover business insights every minute.

Why build a data lake on Amazon S3?

Amazon S3 is designed for 99.999999999% (11 9s) of data durability. At that level of durability, if you store 10,000,000 objects in Amazon S3, you can expect to lose a single object once every 10,000 years on average. The service automatically creates and stores copies of all uploaded S3 objects across multiple systems. This means your data is available when needed and protected against failures, errors, and threats.
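
The arithmetic behind that figure can be checked with a short Python sketch. It assumes that "11 9s" is read as an average annual loss rate of 1 in 100 billion per object; the function names are illustrative only and are not part of any AWS SDK.

```python
from fractions import Fraction

# Assumption: "11 9s" of durability is interpreted as an average annual
# loss rate of 1e-11 per stored object.
ANNUAL_DURABILITY = Fraction(99_999_999_999, 100_000_000_000)  # 99.999999999%
ANNUAL_LOSS_RATE = 1 - ANNUAL_DURABILITY  # 1 in 100 billion per object per year


def expected_losses_per_year(object_count: int) -> Fraction:
    """Expected number of objects lost in one year."""
    return object_count * ANNUAL_LOSS_RATE


def years_per_lost_object(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_losses_per_year(object_count)


if __name__ == "__main__":
    objects = 10_000_000
    print(float(expected_losses_per_year(objects)))  # 0.0001
    print(int(years_per_lost_object(objects)))  # 10000
```

Exact fractions are used instead of floats so that the result is not distorted by rounding at the eleventh decimal place.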

Data lake storage infrastructure
Security by design
Protect data with an infrastructure designed for the most data-sensitive organizations

Scalability on demand
Instantly scale up storage capacity, without lengthy resource procurement cycles

Durable against the failure of an entire AWS Availability Zone
Automatically store copies of data across a minimum of three Availability Zones (AZs). To provide fault tolerance, Availability Zones are separated by several miles, but by no more than a hundred, to keep latency low.

AWS services for analytics, HPC, AI, ML, and media data processing
Use AWS native services to run applications on your data lake

Integrations with third-party service providers
Bring preferred analytics platforms to your S3 data lake from the APN.

Wide range of data management features
Operate at the object level while managing at scale: configure access, reduce costs, and audit data across an S3 data lake.

Solving big-data challenges with data lakes

Organizations of all sizes, in all industries, are using data lakes to transform data from a cost that must be managed into a valuable business asset. Data lakes are foundational for making sense of data at an organizational level. They remove data silos, which makes it easier to analyze diverse datasets and incorporate machine learning while keeping data secure.

In his article, “How Amazon is solving big-data challenges with data lakes,” Dr. Werner Vogels, AWS CTO, explains, “A major reason companies choose to create data lakes is to break down data silos. Having pockets of data in different places, controlled by different groups, inherently obscures data.”

Amazon S3 allows you to migrate, store, manage, and secure all structured and unstructured data at unlimited scale, breaking down data silos.

Key components of a data lake

Moving data to the cloud

AWS provides a portfolio of data transfer services to provide the right solution for any data migration project. The level of connectivity is a major factor in data migration, and AWS has offerings that can address your hybrid cloud storage, online data transfer, and offline data transfer needs.

Hybrid cloud storage

AWS Storage Gateway is a hybrid cloud storage service that lets you seamlessly connect and extend your on-premises applications to AWS Storage. Customers use Storage Gateway to replace tape libraries with cloud storage, provide cloud storage-backed file shares, or create a low-latency cache to access data in AWS for on-premises applications. Using AWS Direct Connect, you can establish private connectivity between AWS and your data center, office, or colocation environment, which can reduce your network costs, increase throughput, and provide a more consistent network experience than public internet connections.

Online data transfer

AWS DataSync makes it easy and efficient to transfer hundreds of terabytes and millions of files into Amazon S3, Amazon EFS, or Amazon FSx for Windows File Server, up to 10x faster than open-source tools. DataSync automatically handles or eliminates many manual tasks, including scripting copy jobs, scheduling and monitoring transfers, validating data, and optimizing network utilization. Amazon S3 Transfer Acceleration enables fast transfers of files over long distances between your client and your Amazon S3 bucket. Amazon Kinesis and AWS IoT Core make it simple and secure to capture and load streaming data from IoT devices to Amazon S3.

Offline data transfer

The AWS Snow Family is purpose-built for use in edge locations where network capacity is constrained or nonexistent and provides storage and computing capabilities in harsh environments. The AWS Snowball service uses ruggedized, portable storage and edge computing devices for data collection, processing, and migration. Customers can ship the physical Snowball device for offline data migration to AWS. AWS Snowmobile is an exabyte-scale data transfer service used to move massive volumes of data to the cloud, including video libraries, image repositories, or even a complete data center migration.

Use AWS services across your data lake

S3 data lake customers have access to numerous AWS analytics applications, AI/ML services, and high-performance file systems. This means you can run many workloads across your data lake without additional data processing or transfers to other stores. You can also bring your preferred third-party analytics and machine learning tools to your S3 data lake.

Build a data lake in days instead of months with AWS Lake Formation

AWS Lake Formation lets you create a secure data lake in days instead of months. Setup is as simple as defining where data resides and which data access and security policies to apply. Lake Formation then collects data from different sources and moves it into a new data lake in Amazon S3. The service cleans, catalogs, and classifies data using machine learning algorithms and enables you to define access control policies. Users can then access a centralized catalog of data that lists available data sets and their usage terms.

Run AWS analytics applications with no data movement

Once data resides in an S3 data lake, you can use any of the following purpose-built analytics services for a range of use cases, from analyzing petabyte-scale data sets to querying the metadata of a single object. With an S3 data lake, you can do this without resource- and time-intensive extract, transform, and load (ETL) jobs. You can also bring your preferred analytics platforms to your S3 data lake.

Amazon Athena

Quickly query datasets in your S3 data lake with simple SQL expressions and get results in seconds. Athena is ideal for ad-hoc querying and doesn’t require cluster management, but it can also handle complex analyses, such as large joins, window functions, and arrays.
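
As a rough illustration of what such a query might look like, the Python sketch below assembles request parameters in the shape accepted by the boto3 Athena client's start_query_execution call. The table name (orders), its columns, the database name, and the results bucket are all hypothetical, and the sketch only builds the parameters; it does not contact AWS.

```python
import textwrap


def build_athena_request(database: str, output_location: str) -> dict:
    """Build keyword arguments for an Athena query that uses a window function.

    Assumption: a table named `orders` with columns region, customer_id,
    order_total, and order_date is registered in the given database.
    """
    query = textwrap.dedent(
        """\
        SELECT region,
               customer_id,
               order_total,
               RANK() OVER (PARTITION BY region ORDER BY order_total DESC) AS region_rank
        FROM orders
        WHERE order_date >= DATE '2024-01-01'
        """
    ).strip()
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to an S3 location that you choose.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


if __name__ == "__main__":
    request = build_athena_request("sales_lake", "s3://example-athena-results/")
    print(request["QueryString"])
```

With boto3 installed and credentials configured, these parameters could be passed as boto3.client("athena").start_query_execution(**request).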

Amazon EMR

Analyze S3 data with your choice of open source distributed frameworks, like Spark and Hadoop. Spin up and scale an EMR cluster in minutes—without node provisioning, cluster setup and tuning, and Hadoop setup—and run multiple clusters in parallel over the same data set.

AWS Glue

Simplify ETL jobs across your S3 data lake to make your data searchable and queryable. With a few clicks in the AWS console, register your data sources and then AWS Glue will crawl them to construct a data catalog using metadata (for table definitions and schemas).

Amazon Redshift Spectrum

Run fast, complex SQL queries across exabytes of S3 data without moving it into Redshift. You can run multiple clusters in parallel across the same data sets. Existing Redshift customers can use this feature to extend analytics to their unstructured data in Amazon S3.

Launch AI and Machine Learning jobs with your data stored in S3

You can quickly launch AWS AI services such as Amazon Comprehend, Amazon Forecast, Amazon Personalize, and Amazon Rekognition to discover insights from your unstructured datasets, get accurate forecasts, create recommendation engines, and analyze images and videos stored in S3. You can also use Amazon SageMaker to build, train, and deploy ML models quickly with your datasets stored in S3.

Query data in place quickly with S3 Select

S3 Select enables applications to offload the heavy lifting of filtering and accessing data inside objects to S3. With S3 Select, you can use a SQL expression to retrieve only the subset of an object's data that you need, without moving the object to another data store. By reducing the volume of data that has to be loaded and processed by your applications, S3 Select can improve the performance of most applications that frequently access data from S3 by up to 400% and reduce querying costs as much as 80%.

You can use S3 Select with Spark, Hive and Presto in Amazon EMR, Amazon Athena, Amazon Redshift, as well as APN partners.
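
To make the idea concrete, the Python sketch below pairs an S3 Select style SQL expression with a purely local emulation of the same filter. The request parameters follow the shape of the boto3 S3 client's select_object_content call; the bucket, key, column names, and sample rows are hypothetical, and the emulation runs on an in-memory CSV string rather than on S3.

```python
import csv
import io

# Hypothetical expression: keep two columns for rows warmer than 30 degrees.
EXPRESSION = (
    "SELECT s.device_id, s.temperature FROM S3Object s "
    "WHERE CAST(s.temperature AS FLOAT) > 30"
)


def build_select_request(bucket: str, key: str) -> dict:
    """Build keyword arguments shaped for select_object_content on a CSV object."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": EXPRESSION,
        # The first CSV line is a header, so columns can be referenced by name.
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
        "OutputSerialization": {"CSV": {}},
    }


def emulate_filter(csv_text: str) -> str:
    """Apply the same projection and predicate locally, for illustration only."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        if float(row["temperature"]) > 30:
            writer.writerow([row["device_id"], row["temperature"]])
    return out.getvalue()


SAMPLE = (
    "device_id,site,temperature\n"
    "a1,berlin,21.5\n"
    "b2,austin,34.0\n"
    "c3,osaka,29.9\n"
    "d4,cairo,38.2\n"
)

if __name__ == "__main__":
    filtered = emulate_filter(SAMPLE)
    print(filtered, end="")
    print(f"{len(filtered)} of {len(SAMPLE)} characters returned")
```

The local function only mimics the effect: the application receives the two matching rows and two columns instead of the whole object.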

Connect data to file systems for high-performance workloads

Amazon FSx for Lustre provides a high-performance file system that works natively with your S3 data lake and is optimized for fast processing of workloads such as machine learning, high performance computing (HPC), video processing, financial modeling, and electronic design automation (EDA). In minutes you can launch a file system that provides sub-millisecond access latency to your S3 data and allows you to read and write data at speeds of up to hundreds of gigabytes per second (GBps) of throughput and millions of IO per second (IOPS). When linked to an S3 bucket, an FSx for Lustre file system transparently presents S3 objects as files and allows you to write results back to S3.

Cost-effectively manage your data lake with S3 features

With a wide range of features, Amazon S3 is the ideal service to build (or re-platform) and manage a data lake of any size and purpose. It is the only cloud storage service that lets you: manage data at the object, bucket, and account levels; make changes across tens to billions of objects with a few clicks; configure granular data access policies; save costs by storing objects across numerous storage classes; and audit all activities across your S3 resources.

Manage data at every level across your data lake

Amazon S3 lets you manage data with object-level granularity, as well as at the bucket and account levels. You can append metadata tags to an object and use them to organize data in ways that work for your business. You can also organize objects by prefixes and buckets. With these capabilities, you can quickly target one object or a group of objects to replicate across Regions, restrict access, or transition to cheaper storage classes, among other tasks.

Take action on billions of objects with just a few clicks

With S3 Batch Operations, you can take action across billions of objects with a single API request or a few clicks in the S3 Management Console, and audit the progress of your requests. Modify object properties and metadata, copy objects between buckets, replace tag sets, configure access controls, restore archives from S3 Glacier, and invoke AWS Lambda functions – in minutes instead of months.
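
A Batch Operations job needs a list of the objects to act on. One way to supply it is a CSV manifest with one bucket,key pair per row and URL-encoded object keys. The Python sketch below generates such a manifest under that assumption; the bucket and key names are hypothetical, and the exact manifest requirements should be confirmed against the S3 Batch Operations documentation.

```python
import csv
import io
from urllib.parse import quote


def build_manifest(bucket: str, keys: list[str]) -> str:
    """Return CSV manifest text with one 'bucket,key' row per object.

    Assumption: keys are URL-encoded, with '/' left intact so that
    prefixes stay readable.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for key in keys:
        writer.writerow([bucket, quote(key, safe="/")])
    return buffer.getvalue()


if __name__ == "__main__":
    manifest = build_manifest(
        "example-data-lake",
        ["raw/2024/report 1.csv", "raw/2024/summary.json"],
    )
    print(manifest, end="")
```

The resulting text would be uploaded to S3 and referenced when the job is created, along with the operation to perform on each listed object.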

Configure finely tuned access policies for sensitive data

Use bucket policies, object tags, and access control lists (ACLs) to restrict access to specific buckets and objects. You can also use AWS Identity and Access Management (IAM) to define user access within an AWS account. Organizations that need to block all public access to their data can configure S3 Block Public Access to enforce a “no public access” policy for a specific bucket or an entire AWS account.
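
As a minimal sketch of what these controls can look like, the Python below builds a bucket policy document and a Block Public Access configuration as plain dictionaries. The policy uses the standard IAM JSON policy elements; the bucket name, prefix, account ID, and role name are hypothetical, and the statements are examples rather than a recommended baseline.

```python
import json


def build_bucket_policy(bucket: str, prefix: str, role_arn: str) -> dict:
    """Allow one role to read a prefix and deny any request not sent over TLS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowRoleReadPrefix",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
        ],
    }


# The four S3 Block Public Access settings, all enabled.
BLOCK_PUBLIC_ACCESS = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}

if __name__ == "__main__":
    policy = build_bucket_policy(
        "example-data-lake",
        "curated/",
        "arn:aws:iam::111122223333:role/AnalystRole",  # hypothetical role
    )
    print(json.dumps(policy, indent=2))
```

The policy would be serialized with json.dumps before being attached to a bucket, and the Block Public Access dictionary matches the configuration shape used when enabling those settings on a bucket or account.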

Cost-effectively store objects across the S3 Storage Classes

All S3 customers can store data across 6 distinct storage classes that are designed to accommodate different access requirements at corresponding costs. Use S3 Storage Class Analysis to learn the access patterns to your data. Then, configure lifecycle policies to transfer less frequently accessed objects to cheaper classes or archive them in S3 Glacier or S3 Glacier Deep Archive for maximum savings.
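
A lifecycle policy of this kind might be expressed as follows. This Python sketch builds a configuration in the shape accepted by the boto3 S3 client's put_bucket_lifecycle_configuration call, plus a small helper that reports which class an object of a given age would occupy. The rule ID, prefix, and day thresholds are illustrative assumptions, and the helper assumes objects start in the STANDARD class.

```python
def build_lifecycle_configuration(prefix: str) -> dict:
    """Tier objects under a prefix to cheaper classes as they age."""
    return {
        "Rules": [
            {
                "ID": "tier-down-cold-data",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    }


def storage_class_at(age_days: int, rule: dict) -> str:
    """Return the class an object of this age would be in under the rule.

    Assumption: the object was uploaded to STANDARD and only this rule applies.
    """
    current = "STANDARD"
    for transition in sorted(rule["Transitions"], key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current


if __name__ == "__main__":
    rule = build_lifecycle_configuration("logs/")["Rules"][0]
    for age in (10, 45, 120, 400):
        print(age, storage_class_at(age, rule))
```

In practice, the thresholds would come from the access patterns that S3 Storage Class Analysis reports for your data.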

Audit all access requests to S3 resources and other activities

With S3 reporting tools, you can quickly discover who is requesting access to what data and from where, audit object metadata (such as storage class, retention date, business unit, and encryption status), monitor usage and costs, and learn access patterns, among other activities related to your S3 resources. With these insights, you can optimize your data lake and the applications that rely on it, and reduce costs.
