Data Lake Storage on AWS
Amazon Simple Storage Service (S3) is the largest and most performant object storage service for structured and unstructured data and the storage service of choice to build a data lake. With Amazon S3, you can cost-effectively build and scale a data lake of any size in a secure environment where data is protected by 99.999999999% (11 9s) of durability.
With a data lake built on Amazon S3, you can use native AWS services to run big data analytics, artificial intelligence (AI), machine learning (ML), high-performance computing (HPC), and media data processing applications to gain insights from your unstructured data sets. Using Amazon FSx for Lustre, you can launch file systems for HPC and ML applications, and process large media workloads directly from your data lake. You also have the flexibility to use your preferred analytics, AI, ML, and HPC applications from the AWS Partner Network (APN). Because Amazon S3 supports a wide range of features, IT managers, storage administrators, and data scientists can enforce access policies, manage objects at scale, and audit activities across their S3 data lakes.
Amazon S3 hosts tens of thousands of data lakes for household brands such as Netflix, Airbnb, Sysco, Expedia, GE, and FINRA, who are using them to securely scale with their needs and to discover business insights every minute.
Why build a data lake on Amazon S3?
Amazon S3 is designed for 99.999999999% (11 9s) of data durability. At that level of durability, if you store 10,000,000 objects in Amazon S3, you can on average expect to lose a single object once every 10,000 years. The service automatically creates and stores copies of all uploaded S3 objects across multiple systems. This means your data is available when needed and protected against failures, errors, and threats.
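The arithmetic behind the 10,000-year figure can be checked directly. The sketch below assumes that 11 9s of durability can be read as an average annual loss probability of 1 in 10^11 per object, and that objects are lost independently. Both are simplifying assumptions made for illustration, and the function names are hypothetical.

```python
from fractions import Fraction

# Assumption: "11 9s" is read as an average annual loss probability of
# 1 in 10**11 per object, with objects lost independently of each other.
ANNUAL_LOSS_PROBABILITY = Fraction(1, 10**11)


def expected_losses_per_year(object_count):
    """Expected number of objects lost in one year (exact fraction)."""
    return object_count * ANNUAL_LOSS_PROBABILITY


def years_per_lost_object(object_count):
    """Average number of years between single-object losses."""
    return 1 / expected_losses_per_year(object_count)


if __name__ == "__main__":
    # 10,000,000 objects * 1/10**11 = 1/10,000 objects per year,
    # which is one object every 10,000 years on average.
    print(years_per_lost_object(10_000_000))  # prints 10000
```

Exact fractions are used instead of floats so that the result is not affected by rounding when subtracting values very close to 1.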
Scalability on demand
Instantly scale up storage capacity, without lengthy resource procurement cycles.
Durable against the failure of an entire AWS Availability Zone
Automatically store copies of data across a minimum of three Availability Zones (AZs). To provide fault tolerance, Availability Zones are separated by several miles—but no more than a hundred to ensure low latencies.
AWS services for analytics, HPC, AI, ML, and media data processing
Use AWS native services to run applications on your data lake.
Integrations with third-party service providers
Bring preferred analytics platforms to your S3 data lake from the APN.
Wide range of data management features
Comprehensive flexibility to operate at an object level while managing at scale, configure access, enable cost efficiencies, and audit data across an S3 data lake.
Solving big-data challenges with data lakes
Organizations of all sizes, in all industries, are using data lakes to transform data from a cost that must be managed into a valuable business asset. Data lakes are foundational for making sense of data at an organizational level. They remove data silos, which makes it easier to analyze diverse datasets and incorporate machine learning while keeping data secure.
In his article, “How Amazon is solving big-data challenges with data lakes,” Dr. Werner Vogels, AWS CTO, explains, “A major reason companies choose to create data lakes is to break down data silos. Having pockets of data in different places, controlled by different groups, inherently obscures data.”
Amazon S3 allows you to migrate, store, manage, and secure all structured and unstructured data at unlimited scale, breaking down data silos.
Moving data to the cloud
AWS provides a portfolio of data transfer services to provide the right solution for any data migration project. The level of connectivity is a major factor in data migration, and AWS has offerings that can address your hybrid cloud storage, online data transfer, and offline data transfer needs.
Hybrid cloud storage
AWS Storage Gateway is a hybrid cloud storage service that lets you seamlessly connect and extend your on-premises applications to AWS Storage. Customers use Storage Gateway to replace tape libraries with cloud storage, provide cloud storage-backed file shares, or create a low-latency cache to access data in AWS for on-premises applications. Using AWS Direct Connect, you can establish private connectivity between AWS and your data center, office, or colocation environment, which can reduce your network costs, increase throughput, and provide a more consistent network experience than public internet connections.
Online data transfer
AWS DataSync makes it easy and efficient to transfer hundreds of terabytes and millions of files into Amazon S3, Amazon EFS, or Amazon FSx for Windows File Server, up to 10x faster than open-source tools. DataSync automatically handles or eliminates many manual tasks, including scripting copy jobs, scheduling and monitoring transfers, validating data, and optimizing network utilization. Amazon S3 Transfer Acceleration enables fast transfers of files over long distances between your client and your Amazon S3 bucket. Amazon Kinesis and AWS IoT Core make it simple and secure to capture and load streaming data from IoT devices to Amazon S3.
Offline data transfer
The AWS Snow Family is purpose-built for use in edge locations where network capacity is constrained or nonexistent and provides storage and computing capabilities in harsh environments. The AWS Snowball service uses ruggedized, portable storage and edge computing devices for data collection, processing, and migration. Customers can ship the physical Snowball device for offline data migration to AWS. AWS Snowmobile is an exabyte-scale data transfer service used to move massive volumes of data to the cloud, including video libraries, image repositories, or even a complete data center migration.
Use AWS services across your data lake
S3 data lake customers have access to numerous AWS analytics applications, AI/ML services and high-performance file systems. This means you can run numerous workloads across your data lake, without additional data processing or transfers to other stores. You can also bring your preferred third-party analytics and machine learning tools to your S3 data lake.
Build a data lake in days instead of months with AWS Lake Formation
AWS Lake Formation lets you create a secure data lake in days instead of months and is as simple as defining where data resides and what data access and security policies to apply. Lake Formation then collects data from different sources and moves it into a new data lake in Amazon S3. The service cleans, catalogs, and classifies data using machine learning algorithms and enables you to define access control policies. Users can then access a centralized catalog of data which lists available data sets and their usage terms.
Run AWS analytics applications with no data movement
Once data resides in an S3 data lake, you can use any of the following purpose-built analytics services for a range of use cases, from analyzing petabyte-scale data sets to querying the metadata of a single object. With an S3 data lake these can be done without resource- and time-intensive extract, transform, and load (ETL) jobs. You can also bring your preferred analytics platforms to your S3 data lake.
Amazon Athena
Quickly query datasets in your S3 data lake with simple SQL expressions and get results in seconds. Athena is ideal for ad-hoc querying and doesn’t require cluster management, but it can also handle complex analyses, such as large joins, window functions, and arrays.
Amazon EMR
Analyze S3 data with your choice of open source distributed frameworks, like Spark and Hadoop. Spin up and scale an EMR cluster in minutes, without node provisioning, cluster setup and tuning, or Hadoop setup, and run multiple clusters in parallel over the same data set.
AWS Glue
Simplify ETL jobs across your S3 data lake to make your data searchable and queryable. With a few clicks in the AWS console, register your data sources, and AWS Glue will then crawl them to construct a data catalog from their metadata (table definitions and schemas).
Amazon Redshift Spectrum
Run fast, complex queries using SQL expressions across exabytes of S3 data without moving the data into Redshift. You can run multiple clusters in parallel across the same data sets. Existing Redshift customers can use this feature to extend analytics to their unstructured data in Amazon S3.
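As an illustration of the kind of query Athena can run, the sketch below builds the arguments for Athena's StartQueryExecution API with a window function in the SQL. The table `orders`, its columns, the database `sales_lake`, and the results bucket are hypothetical names chosen for the example. The snippet only builds the request; the commented boto3 call shows how it might be sent, assuming boto3 is installed and AWS credentials are configured.

```python
def build_athena_query_request(sql, database, output_location):
    """Return keyword arguments for the Athena StartQueryExecution API."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# Hypothetical table and columns: rank each customer's orders by value.
RANKED_ORDERS_SQL = """
SELECT customer_id,
       order_total,
       RANK() OVER (PARTITION BY customer_id ORDER BY order_total DESC) AS order_rank
FROM orders
""".strip()

request = build_athena_query_request(
    RANKED_ORDERS_SQL,
    database="sales_lake",                         # hypothetical database
    output_location="s3://example-athena-results/",  # hypothetical bucket
)

# With boto3 installed and credentials configured, this could be sent as:
#   import boto3
#   boto3.client("athena").start_query_execution(**request)
```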
Launch AI and Machine Learning jobs with your data stored in S3
You can quickly launch AWS AI services such as Amazon Comprehend, Amazon Forecast, Amazon Personalize, and Amazon Rekognition to discover insights from your unstructured datasets, get accurate forecasts, create recommendation engines, and analyze images and videos stored in S3. You can also use Amazon SageMaker to build, train, and deploy ML models quickly with your datasets stored in S3.
Query data in place quickly with S3 Select
S3 Select enables applications to offload the heavy lifting of filtering and accessing data inside objects to S3. With S3 Select, you can query the contents of an object without moving the object to another data store. By reducing the volume of data that has to be loaded and processed by your applications, S3 Select can improve the performance of most applications that frequently access data from S3 by up to 400% and reduce querying costs by as much as 80%.
You can use S3 Select with Spark, Hive and Presto in Amazon EMR, Amazon Athena, Amazon Redshift, as well as APN partners.
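A minimal sketch of an S3 Select request is shown below. It builds the arguments for the SelectObjectContent API for a CSV object with a header row, so that only matching rows and columns are returned. The bucket, key, and column names are hypothetical, and the boto3 call is shown only as a comment, assuming boto3 is installed and credentials are configured.

```python
def build_select_request(bucket, key, expression):
    """Return keyword arguments for the S3 SelectObjectContent API.

    Assumes the object is a CSV file whose first row holds column names,
    and asks for the matching records back as JSON.
    """
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
        "OutputSerialization": {"JSON": {}},
    }


# Hypothetical object and columns: return only the cities for one country.
request = build_select_request(
    bucket="example-data-lake",
    key="customers/2020/customers.csv",
    expression="SELECT s.city FROM S3Object s WHERE s.country = 'DE'",
)

# With boto3 installed and credentials configured, this could be sent as:
#   import boto3
#   response = boto3.client("s3").select_object_content(**request)
```

Filtering inside S3 means the application receives only the selected rows rather than the whole object, which is where the performance and cost savings described above come from.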
Connect data to file systems for high-performance workloads
Amazon FSx for Lustre provides a high-performance file system that works natively with your S3 data lake and is optimized for fast processing of workloads such as machine learning, high-performance computing (HPC), video processing, financial modeling, and electronic design automation (EDA). In minutes you can launch a file system that provides sub-millisecond access latency to your S3 data and allows you to read and write data at up to hundreds of gigabytes per second (GBps) of throughput and millions of I/O operations per second (IOPS). When linked to an S3 bucket, an FSx for Lustre file system transparently presents S3 objects as files and allows you to write results back to S3.
Cost-effectively manage your data lake with S3 features
With a wide range of features, Amazon S3 is the ideal service to build (or re-platform) and manage a data lake of any size and purpose. It is the only cloud storage service that lets you: manage data at the object, bucket, and account levels; make changes across tens to billions of objects with a few clicks; configure granular data access policies; save costs by storing objects across numerous storage classes; and audit all activities across your S3 resources.
Manage data at every level across your data lake
Amazon S3 lets you manage data with object-level granularity, as well as at the bucket and account levels. You can append metadata tags to an object and use them to organize data in ways that work for your business. You can also organize objects by prefixes and buckets. With these capabilities, you can quickly select a single object or a group of objects to replicate across Regions, restrict access to, or transition to cheaper storage classes, among other tasks.
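As a rough sketch of object tagging, the snippet below converts a plain dictionary of tags into the tag-set structure that the S3 PutObjectTagging API expects. The helper names, bucket, key, and tag values are hypothetical. The check for at most 10 tags reflects the per-object tag limit in S3.

```python
MAX_TAGS_PER_OBJECT = 10  # S3 allows up to 10 tags on a single object


def to_tag_set(tags):
    """Convert {"name": "value"} pairs into an S3 TagSet list."""
    if len(tags) > MAX_TAGS_PER_OBJECT:
        raise ValueError("an S3 object can carry at most 10 tags")
    return [{"Key": name, "Value": value} for name, value in sorted(tags.items())]


def build_tagging_request(bucket, key, tags):
    """Return keyword arguments for the S3 PutObjectTagging API."""
    return {"Bucket": bucket, "Key": key, "Tagging": {"TagSet": to_tag_set(tags)}}


# Hypothetical tags that group objects by business unit and sensitivity.
request = build_tagging_request(
    bucket="example-data-lake",
    key="finance/2020/ledger.parquet",
    tags={"business-unit": "finance", "classification": "confidential"},
)

# With boto3 installed and credentials configured, this could be sent as:
#   import boto3
#   boto3.client("s3").put_object_tagging(**request)
```

Tags applied this way can then be referenced by lifecycle rules, replication rules, and access policies to act on a group of objects at once.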
Take action on billions of objects with just a few clicks
With S3 Batch Operations, you can take action across billions of objects with a single API request or a few clicks in the S3 Management Console, and audit the progress of your requests. Modify object properties and metadata, copy objects between buckets, replace tag sets, configure access controls, restore archives from S3 Glacier, and invoke AWS Lambda functions – in minutes instead of months.
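An S3 Batch Operations job is driven by a manifest that lists the objects to act on. The sketch below writes a CSV manifest with one `bucket,key` row per object and URL-encodes each key, which is what the CSV manifest format requires. The function name, bucket, and keys are hypothetical, and the example leaves out the optional version ID column.

```python
import csv
import io
from urllib.parse import quote


def build_manifest(bucket, keys):
    """Return CSV manifest text with one "bucket,key" row per object.

    Object keys are URL-encoded; "/" is left as-is so prefixes stay readable.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for key in keys:
        writer.writerow([bucket, quote(key, safe="/")])
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical bucket and keys.
    print(build_manifest("example-data-lake", ["reports/q1 summary.csv", "logs/a.json"]))
```

The resulting file would be uploaded to S3 and referenced when creating the job, together with the operation to perform on each listed object.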
Configure finely-tuned access policies to sensitive data
Use bucket policies, object tags, and access control lists (ACLs) to restrict access to specific buckets and objects. You can also use AWS Identity and Access Management (IAM) to define user access within an AWS account. Organizations that need to block all public access requests to their data can configure S3 Block Public Access to enforce a “no public access” policy for a specific bucket or an entire AWS account.
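A "no public access" policy corresponds to turning on all four S3 Block Public Access settings. The sketch below builds that configuration for a single bucket and for a whole account. The helper names, bucket name, and account ID are hypothetical, and the boto3 calls appear only as comments, assuming boto3 is installed and credentials are configured.

```python
def block_all_public_access():
    """Return a Block Public Access configuration with every setting enabled."""
    return {
        "BlockPublicAcls": True,        # reject new public ACLs
        "IgnorePublicAcls": True,       # ignore existing public ACLs
        "BlockPublicPolicy": True,      # reject new public bucket policies
        "RestrictPublicBuckets": True,  # restrict access to buckets with public policies
    }


def build_bucket_request(bucket):
    """Keyword arguments for the bucket-level PutPublicAccessBlock API."""
    return {"Bucket": bucket, "PublicAccessBlockConfiguration": block_all_public_access()}


def build_account_request(account_id):
    """Keyword arguments for the account-level PutPublicAccessBlock API."""
    return {"AccountId": account_id, "PublicAccessBlockConfiguration": block_all_public_access()}


bucket_request = build_bucket_request("example-data-lake")  # hypothetical bucket
account_request = build_account_request("111122223333")     # placeholder account ID

# With boto3 installed and credentials configured, these could be sent as:
#   import boto3
#   boto3.client("s3").put_public_access_block(**bucket_request)
#   boto3.client("s3control").put_public_access_block(**account_request)
```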
Cost-effectively store objects across the S3 Storage Classes
All S3 customers can store data across 6 distinct storage classes that are designed to accommodate different access requirements at corresponding costs. Use S3 Storage Class Analysis to learn the access patterns to your data. Then, configure lifecycle policies to transfer less frequently accessed objects to cheaper classes or archive them in S3 Glacier or S3 Glacier Deep Archive for maximum savings.
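A lifecycle policy of the kind described above can be sketched as a single rule that moves objects under a prefix to progressively cheaper storage classes as they age. The function name, rule ID, prefix, and day thresholds are hypothetical values chosen for illustration. Suitable thresholds depend on the access patterns that Storage Class Analysis reports.

```python
def build_lifecycle_configuration(prefix, ia_after_days, glacier_after_days, deep_archive_after_days):
    """Return a lifecycle configuration with one tiering rule for a prefix."""
    if not 0 < ia_after_days < glacier_after_days < deep_archive_after_days:
        raise ValueError("transition days must be positive and strictly increasing")
    return {
        "Rules": [
            {
                "ID": "tier-down-" + prefix.strip("/").replace("/", "-"),
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                    {"Days": deep_archive_after_days, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    }


# Hypothetical policy: infrequent access after 30 days, S3 Glacier after 90,
# and S3 Glacier Deep Archive after 365.
config = build_lifecycle_configuration("logs/", 30, 90, 365)

# With boto3 installed and credentials configured, this could be applied as:
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-data-lake", LifecycleConfiguration=config)
```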
Audit all access requests to S3 resources and other activities
With S3 reporting tools, you can quickly discover who is requesting access to what data and from where, audit object metadata (such as storage class, retention date, business unit, and encryption status), monitor usage and costs, and learn access patterns, among other activities related to your S3 resources. With these insights, you can optimize your data lake and the applications that rely on it, and reduce costs.