Amazon Web Services (AWS) delivers an integrated suite of services that provide everything needed to quickly and easily build and manage a data lake for analytics. AWS-powered data lakes can handle the scale, agility, and flexibility required to combine different types of data and analytics approaches to gain deeper insights, in ways that traditional data silos and data warehouses cannot. AWS gives customers the widest array of analytics and machine learning services, for easy access to all relevant data, without compromising on security or governance.
There are more organizations with data lakes and analytics on AWS than anywhere else. Customers like NASDAQ, Zillow, Yelp, iRobot, and FINRA trust AWS to run their business-critical analytics workloads.
Data Lakes and Analytics on AWS
To build your data lakes and analytics solution, AWS provides the most comprehensive set of services to move, store, and analyze your data.
Import your data from on-premises sources, or ingest it in real time.
Store any type of data securely, from gigabytes to exabytes.
Analyze your data with the broadest selection of analytics services.
Predict future outcomes, and prescribe actions for rapid response.
The first step to building data lakes on AWS is to move data to the cloud. The physical limitations of bandwidth and transfer speeds can make moving data slow, disruptive, and expensive. To make data transfer easy and flexible, AWS provides the widest range of options for moving data to the cloud.
To build ETL jobs and ML Transforms for your data lake, learn about AWS Lake Formation.
On-premises data movement
AWS provides multiple ways to move data from your datacenter to AWS. To establish a dedicated network connection between your network and AWS, you can use AWS Direct Connect. To move petabytes to exabytes of data to AWS using physical appliances, you can use AWS Snowball and AWS Snowmobile. To have your on-premises applications store data directly into AWS, you can use AWS Storage Gateway.
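To see why physical appliances like Snowball and Snowmobile matter at petabyte scale, here is a back-of-the-envelope estimate of network transfer time; the link speed and utilization figures are illustrative assumptions, not AWS specifications:

```python
# Rough estimate: why petabyte-scale transfers often favor physical
# appliances over the network. Utilization below 100% reflects protocol
# overhead and shared links (an assumption for illustration).
def transfer_days(data_tb: float, link_gbps: float, utilization: float = 0.8) -> float:
    """Days needed to move data_tb terabytes over a link_gbps line."""
    bits = data_tb * 1e12 * 8                      # terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 86400

# Moving 1 PB (1,000 TB) over a dedicated 1 Gbps connection:
print(round(transfer_days(1000, 1.0)))  # roughly 116 days
```

At that rate, a truck full of disks genuinely is the faster network.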
Real-time data movement
AWS provides multiple ways to ingest real-time data generated from new sources such as websites, mobile apps, and internet-connected devices. To make it simple to capture and load streaming data or IoT device data, you can use Amazon Kinesis Data Firehose, Amazon Kinesis Video Streams, and AWS IoT Core.
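As a rough sketch of the ingestion side, an application might batch events for Kinesis Data Firehose as newline-delimited JSON, a common shape for records delivered to Amazon S3. The stream name and event fields below are hypothetical:

```python
import json

# Hypothetical clickstream events; field names are illustrative only.
events = [
    {"user_id": "u-123", "page": "/home", "ts": 1700000000},
    {"user_id": "u-456", "page": "/cart", "ts": 1700000005},
]

# Newline-delimited JSON so downstream tools can split records in S3.
records = [{"Data": (json.dumps(e) + "\n").encode("utf-8")} for e in events]

# With boto3 these would be sent with something like:
#   boto3.client("firehose").put_record_batch(
#       DeliveryStreamName="clickstream-to-s3", Records=records)
print(len(records), records[0]["Data"].decode().strip())
```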
Once data is ready for the cloud, AWS makes it easy to store data in any format, securely, and at massive scale with Amazon Simple Storage Service (Amazon S3) and Amazon Simple Storage Service Glacier (Amazon S3 Glacier). To make it easy for end users to discover the relevant data to use in their analysis, AWS Glue automatically creates a single catalog that is searchable and queryable by users.
To build a secure data lake faster, learn more about AWS Lake Formation.
Amazon S3 is secure, highly scalable, durable object storage with millisecond latency for data access. Amazon S3 is built to store any type of data from anywhere – websites and mobile apps, corporate applications, and data from IoT sensors or devices. It is built to store and retrieve any amount of data, with unmatched availability, and is designed from the ground up to deliver 99.999999999% (11 nines) of durability. Amazon S3 Select retrieves only the subset of data you need from within an object, improving query performance by as much as 400%. Amazon S3 provides comprehensive security and compliance capabilities that meet even the most stringent regulatory requirements.
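To make S3 Select concrete, the expression it accepts is ordinary SQL scoped to a single object; the fields below assume a hypothetical CSV log object with a header row:

```
-- Hypothetical S3 Select expression: only the matching rows leave S3,
-- not the whole object, which is where the speedup comes from.
SELECT s."timestamp", s."message"
FROM S3Object s
WHERE s."level" = 'ERROR'
```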
Backup and Archive
Amazon S3 Glacier
Amazon S3 Glacier is secure, durable, and extremely low-cost storage for long-term backup and archive that can access data in minutes; like Amazon S3 Select, Amazon S3 Glacier Select reads and retrieves only the data needed. It is designed to deliver 99.999999999% (11 nines) of durability, and provides comprehensive security and compliance capabilities that can help meet even the most stringent regulatory requirements. Customers can store data for as little as $0.004 per gigabyte per month, a significant savings compared to on-premises solutions.
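Using the $0.004 per GB-month figure above, a rough annual cost estimate looks like this; it assumes 1 TB = 1,024 GB and ignores retrieval and request charges:

```python
# Illustrative archive cost arithmetic using the published storage rate.
GLACIER_USD_PER_GB_MONTH = 0.004

def annual_archive_cost_usd(data_tb: float) -> float:
    gb = data_tb * 1024  # assumption: 1 TB = 1,024 GB
    return gb * GLACIER_USD_PER_GB_MONTH * 12

# Archiving 100 TB for a full year:
print(annual_archive_cost_usd(100))  # about $4,900 per year
```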
AWS Glue is a fully managed service that provides a data catalog to make data in the data lake discoverable, and has the ability to do extract, transform, and load (ETL) to prepare data for analysis. The data catalog is automatically created as a persistent metadata store for all data assets, making all of the data searchable and queryable in a single view.
AWS provides the broadest, and most cost-effective set of analytic services that run on the data lake. Each analytic service is purpose-built for a wide range of analytics use cases such as interactive analysis, big data processing using Apache Spark and Hadoop, data warehousing, real-time analytics, operational analytics, dashboards, and visualizations.
To manage secure, self-service access to data in a data lake for analytics services, learn more about AWS Lake Formation.
For interactive analysis, Amazon Athena makes it easy to analyze data directly in Amazon S3 and Amazon S3 Glacier using standard SQL queries. Athena is serverless, so there is no infrastructure to set up or manage. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL; most results are delivered within seconds, and you pay only for the queries you run.
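A minimal sketch of that workflow follows; the bucket, table name, and columns are hypothetical:

```
-- Define a schema over Parquet files already sitting in S3
CREATE EXTERNAL TABLE clickstream (
  user_id string,
  page    string,
  ts      timestamp
)
STORED AS PARQUET
LOCATION 's3://my-data-lake/clickstream/';

-- Then query with standard SQL, paying per data scanned
SELECT page, COUNT(*) AS views
FROM clickstream
GROUP BY page
ORDER BY views DESC
LIMIT 10;
```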
Big Data Processing
For big data processing using the Spark and Hadoop frameworks, Amazon EMR provides a managed service that makes it easy, fast, and cost-effective to process vast amounts of data. Amazon EMR supports 19 different open-source projects, including Hadoop, Spark, HBase, and Presto, with managed EMR Notebooks for data engineering, data science development, and collaboration. Each project is updated in EMR within 30 days of a version release, ensuring you have the latest from the community with no extra effort.
For data warehousing, Amazon Redshift provides the ability to run complex, analytic queries against petabytes of structured data, and includes Redshift Spectrum, which runs SQL queries directly against exabytes of structured or unstructured data in Amazon S3 without any data loading or movement. Amazon Redshift costs less than a tenth of traditional solutions. Start small for just $0.25 per hour, and scale out to petabytes of data for $1,000 per terabyte per year.
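A sketch of the Redshift Spectrum pattern described above, joining S3-resident data with a local Redshift table; the database name, IAM role ARN, and tables are hypothetical:

```
-- Map an external schema to the Glue Data Catalog so Redshift can see
-- tables whose data lives in S3.
CREATE EXTERNAL SCHEMA lake
FROM DATA CATALOG DATABASE 'datalake_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/spectrum-role';

SELECT c.region, SUM(o.amount) AS revenue
FROM lake.orders o                           -- read from S3 by Spectrum
JOIN customers c ON c.id = o.customer_id     -- local Redshift table
GROUP BY c.region;
```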
For real-time analytics, Amazon Kinesis makes it easy to collect, process, and analyze streaming data such as IoT telemetry, application logs, and website clickstreams. This enables you to process and analyze data as it arrives in your data lake, and to respond in real time instead of waiting until all of your data has been collected before processing can begin.
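Under the hood, Kinesis data streams route each record to a shard by MD5-hashing its partition key into a 128-bit hash-key space. A simplified sketch, assuming shards split that space evenly:

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard index, Kinesis-style:
    MD5 of the key, read as a 128-bit integer, matched against
    evenly split hash-key ranges (a simplifying assumption)."""
    key_hash = int.from_bytes(hashlib.md5(partition_key.encode()).digest(), "big")
    range_size = 2**128 // num_shards
    return min(key_hash // range_size, num_shards - 1)

# Records with the same partition key always land on the same shard,
# which is what preserves per-key ordering:
print(shard_for_key("device-42", 4) == shard_for_key("device-42", 4))  # True
```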
Amazon Elasticsearch Service
For operational analytics such as application monitoring, log analytics and clickstream analytics, Amazon Elasticsearch Service allows you to search, explore, filter, aggregate, and visualize your data in near real-time. Amazon Elasticsearch Service delivers Elasticsearch’s easy-to-use APIs and real-time analytics capabilities alongside the availability, scalability, and security that production workloads require.
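As a hedged illustration of the log-analytics use case, an Elasticsearch query might bucket error events over time; the index name, fields, and interval are hypothetical, and the exact parameters vary by Elasticsearch version:

```
POST /app-logs/_search
{
  "size": 0,
  "query": { "match": { "level": "ERROR" } },
  "aggs": {
    "errors_per_minute": {
      "date_histogram": { "field": "@timestamp", "interval": "1m" }
    }
  }
}
```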
Dashboards and Visualizations
For dashboards and visualizations, Amazon QuickSight provides a fast, cloud-powered business analytics service that makes it easy to build stunning visualizations and rich dashboards that can be accessed from any browser or mobile device.
For predictive analytics use cases, AWS provides a broad set of machine learning services, and tools that run on your data lake on AWS. Our services come from the knowledge and capability we’ve built up at Amazon, where ML has powered Amazon.com’s recommendation engines, supply chain, forecasting, fulfillment centers, and capacity planning.
Frameworks and Interfaces
For expert machine learning practitioners and data scientists, AWS provides AWS Deep Learning AMIs that make it easy to build deep learning models, and build clusters with ML and DL optimized GPU instances. AWS supports all the major machine learning frameworks, including Apache MXNet, TensorFlow, and Caffe2 so that you can bring or develop any model you choose. These capabilities provide unmatched power, speed, and efficiency that deep learning and machine learning workloads require.
For developers who want to go deeper with ML, Amazon SageMaker is a platform service that makes the entire process of building, training, and deploying ML models easy by providing everything you need to connect to your training data, select and optimize the best algorithm and framework, and deploy your model on auto-scaling clusters of Amazon EC2 instances. SageMaker also includes hosted Jupyter notebooks that make it easy to explore and visualize your training data stored in Amazon S3.
For developers who want to plug pre-built AI functionality into their apps, AWS provides solution-oriented APIs for computer vision and natural language processing. These application services let developers add intelligence to their applications without developing and training their own models.
Data lakes & analytics built on AWS
Why data lakes and analytics on AWS?
Flexibility and choice
AWS offers the broadest set of analytic tools and engines that analyze data using open formats and open standards. You can store your data in the standards-based data format of your choice, such as CSV, ORC, Grok, Avro, or Parquet, and you have the flexibility to analyze the data in a variety of ways, such as data warehousing, interactive SQL queries, real-time analytics, and big data processing. The breadth of analytics services that you can use with your data in AWS ensures that your existing and future analytics use cases will be met.
Unmatched scalability and availability
Amazon S3 is built to store and retrieve any amount of data, with unmatched availability, and built from the ground up to deliver 99.999999999% (11 nines) of durability. It is the only storage offering that can store your data in multiple data centers across three availability zones within a single AWS Region for unmatched resilience to single data center issues, and the only storage offering that seamlessly replicates data between any regions.
Amazon S3 is the only cloud storage platform that allows you to apply access, log, and audit policies at the account and object level. Amazon S3 provides automatic server-side encryption, encryption with keys managed by the AWS Key Management Service (KMS), and encryption with keys that you manage. Amazon S3 encrypts data in transit when replicating across regions, and lets you use separate accounts for source and destination regions to protect against malicious insider deletions. To proactively detect the early stages of an attack, Amazon Macie, an ML-powered security service, monitors data access activity for anomalies and generates detailed alerts when it detects a risk of unauthorized access or inadvertent data leaks.
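One common pattern for the bucket-level policies mentioned above is rejecting any upload that does not request server-side encryption; this is a sketch, and the bucket name is hypothetical:

```
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyUnencryptedUploads",
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:PutObject",
    "Resource": "arn:aws:s3:::my-data-lake/*",
    "Condition": {
      "StringNotEquals": { "s3:x-amz-server-side-encryption": "aws:kms" }
    }
  }]
}
```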
Data lakes built on AWS are the most cost-effective. Data that is infrequently used can be moved to Amazon S3 Glacier which provides long-term backup and archive at very low costs. Amazon S3 management capabilities can analyze object access patterns to move infrequently used data to Amazon S3 Glacier on-demand or automatically with lifecycle policies. You can begin querying the data with Amazon Athena for as little as $0.005/GB queried. Other analytics and machine learning services are priced with a pay-as-you-go approach for the resources you consume.
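A lifecycle policy of the kind described above might look like the following; the prefix and 90-day transition window are illustrative choices, not recommendations:

```
{
  "Rules": [{
    "ID": "archive-old-raw-data",
    "Filter": { "Prefix": "raw/" },
    "Status": "Enabled",
    "Transitions": [
      { "Days": 90, "StorageClass": "GLACIER" }
    ]
  }]
}
```

Once attached to the bucket, objects under the prefix move to Amazon S3 Glacier automatically as they age, with no application changes.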
AWS analytic services like Amazon Redshift and Amazon Athena were built for fast interactive query performance to support large numbers of concurrent interactive queries. When running AWS' broad portfolio of analytics and machine learning services with Amazon S3 Select, only the needed subsets of data within objects are returned, leading to queries that run up to 400% faster at a dramatically lower cost. Amazon S3 Glacier Select provides a similar capability, allowing you to retrieve archived data faster and to extend your analytics over your data lake to include archival storage.
The largest partner network
The AWS Partner Network (APN) has twice as many partner integrations as anyone else, with tens of thousands of partners, including consulting and independent software vendors, from all across the globe. This makes it easy to work and integrate with many of the same tools you use and love today. Data Lake Quick Starts, developed by AWS solution architects and partners, help you build, test, and deploy Data Lake solutions based on AWS best practices for security and high availability, in a few simple steps.