Amazon Web Services (AWS) delivers an integrated suite of services that provide everything needed to quickly and easily build and manage a data lake for analytics. AWS-powered data lakes can handle the scale, agility, and flexibility required to combine different types of data and analytics approaches to gain deeper insights, in ways that traditional data silos and data warehouses cannot. AWS gives customers the widest array of analytics and machine learning services, for easy access to all relevant data, without compromising on security or governance.
There are more organizations with data lakes and analytics on AWS than anywhere else. Customers like NASDAQ, Zillow, Yelp, iRobot, and FINRA trust AWS to run their business-critical analytics workloads.
Data Lakes and Analytics on AWS
To build your data lakes and analytics solution, AWS provides the most comprehensive set of services to move, store, and analyze your data.
Import your data from on-premises sources, or ingest it in real time.
Store any type of data securely, from gigabytes to exabytes.
Analyze your data with the broadest selection of analytics services.
Predict future outcomes, and prescribe actions for rapid response.
The first step to building data lakes on AWS is to move data to the cloud. The physical limitations of bandwidth and transfer speeds can make moving data slow, disruptive, and expensive. To make data transfer easy and flexible, AWS provides the widest range of options for moving data to the cloud.
To build ETL jobs and ML Transforms for your data lake, learn about AWS Lake Formation.
On-premises data movement
AWS provides multiple ways to move data from your datacenter to AWS. To establish a dedicated network connection between your network and AWS, you can use AWS Direct Connect. To move petabytes to exabytes of data to AWS using physical appliances, you can use AWS Snowball and AWS Snowmobile. To have your on-premises applications store data directly into AWS, you can use AWS Storage Gateway.
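To see why physical appliances like Snowball and Snowmobile matter at petabyte scale, here is a back-of-the-envelope estimate of network transfer time; the link speed and utilization figures are illustrative assumptions, not AWS specifications:

```python
# Rough estimate: why petabyte-scale transfers often favor physical
# appliances over the network. Utilization below 100% reflects protocol
# overhead and shared links (an assumption for illustration).
def transfer_days(data_tb: float, link_gbps: float, utilization: float = 0.8) -> float:
    """Days needed to move data_tb terabytes over a link_gbps line."""
    bits = data_tb * 1e12 * 8                      # terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 86400

# Moving 1 PB (1,000 TB) over a dedicated 1 Gbps connection:
print(round(transfer_days(1000, 1.0)))  # roughly 116 days
```

At that rate, a truck full of disks genuinely is the faster network.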
Real-time data movement
AWS provides multiple ways to ingest real-time data generated from new sources such as websites, mobile apps, and internet-connected devices. To make it simple to capture and load streaming data or IoT device data, you can use Amazon Kinesis Data Firehose, Amazon Kinesis Video Streams, and AWS IoT Core.
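As a rough sketch of the ingestion side, an application might batch events for Kinesis Data Firehose as newline-delimited JSON, a common shape for records delivered to Amazon S3. The stream name and event fields below are hypothetical:

```python
import json

# Hypothetical clickstream events; field names are illustrative only.
events = [
    {"user_id": "u-123", "page": "/home", "ts": 1700000000},
    {"user_id": "u-456", "page": "/cart", "ts": 1700000005},
]

# Newline-delimited JSON so downstream tools can split records in S3.
records = [{"Data": (json.dumps(e) + "\n").encode("utf-8")} for e in events]

# With boto3 these would be sent with something like:
#   boto3.client("firehose").put_record_batch(
#       DeliveryStreamName="clickstream-to-s3", Records=records)
print(len(records), records[0]["Data"].decode().strip())
```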
Once data is ready for the cloud, AWS makes it easy to store data in any format, securely, and at massive scale with Amazon Simple Storage Service (Amazon S3) and Amazon Simple Storage Service Glacier (Amazon S3 Glacier). To make it easy for end users to discover the relevant data to use in their analysis, AWS Glue automatically creates a single catalog that is searchable and queryable by users.
To build a secure data lake faster, learn more about AWS Lake Formation.
Amazon S3 is secure, highly scalable, durable object storage with millisecond latency for data access. Amazon S3 is built to store any type of data from anywhere – websites and mobile apps, corporate applications, and data from IoT sensors or devices. It is built to store and retrieve any amount of data, with unmatched availability, and is designed from the ground up to deliver 99.999999999% (11 nines) of durability. Amazon S3 Select retrieves only the subset of data you need from within an object, improving query performance by as much as 400%. Amazon S3 provides comprehensive security and compliance capabilities that meet even the most stringent regulatory requirements.
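To make S3 Select concrete, the expression it accepts is ordinary SQL scoped to a single object; the fields below assume a hypothetical CSV log object with a header row:

```
-- Hypothetical S3 Select expression: only the matching rows leave S3,
-- not the whole object, which is where the speedup comes from.
SELECT s."timestamp", s."message"
FROM S3Object s
WHERE s."level" = 'ERROR'
```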
Backup and Archive
Amazon S3 Glacier
Amazon S3 Glacier is secure, durable, and extremely low-cost storage for long-term backup and archive that can access data in minutes; like Amazon S3 Select, Amazon S3 Glacier Select reads and retrieves only the data needed. It is designed to deliver 99.999999999% (11 nines) of durability, and provides comprehensive security and compliance capabilities that can help meet even the most stringent regulatory requirements. Customers can store data for as little as $0.004 per gigabyte per month, a significant savings compared to on-premises solutions.
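Using the $0.004 per GB-month figure above, a rough annual cost estimate looks like this; it assumes 1 TB = 1,024 GB and ignores retrieval and request charges:

```python
# Illustrative archive cost arithmetic using the published storage rate.
GLACIER_USD_PER_GB_MONTH = 0.004

def annual_archive_cost_usd(data_tb: float) -> float:
    gb = data_tb * 1024  # assumption: 1 TB = 1,024 GB
    return gb * GLACIER_USD_PER_GB_MONTH * 12

# Archiving 100 TB for a full year:
print(annual_archive_cost_usd(100))  # about $4,900 per year
```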
AWS Glue is a fully managed service that provides a data catalog to make data in the data lake discoverable, and has the ability to do extract, transform, and load (ETL) to prepare data for analysis. The data catalog is automatically created as a persistent metadata store for all data assets, making all of the data searchable and queryable in a single view.
AWS provides the broadest, and most cost-effective set of analytic services that run on the data lake. Each analytic service is purpose-built for a wide range of analytics use cases such as interactive analysis, big data processing using Apache Spark and Hadoop, data warehousing, real-time analytics, operational analytics, dashboards, and visualizations.
To manage secure, self-service access to data in a data lake for analytics services, learn more about AWS Lake Formation.
For interactive analysis, Amazon Athena makes it easy to analyze data directly in Amazon S3 and Amazon S3 Glacier using standard SQL queries. Athena is serverless, so there is no infrastructure to set up or manage. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL; most results are delivered within seconds, and you pay only for the queries you run.
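A minimal sketch of that workflow follows; the bucket, table name, and columns are hypothetical:

```
-- Define a schema over Parquet files already sitting in S3
CREATE EXTERNAL TABLE clickstream (
  user_id string,
  page    string,
  ts      timestamp
)
STORED AS PARQUET
LOCATION 's3://my-data-lake/clickstream/';

-- Then query with standard SQL, paying per data scanned
SELECT page, COUNT(*) AS views
FROM clickstream
GROUP BY page
ORDER BY views DESC
LIMIT 10;
```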
Big Data Processing
For big data processing using the Spark and Hadoop frameworks, Amazon EMR provides a managed service that makes it easy, fast, and cost-effective to process vast amounts of data. Amazon EMR supports 19 different open-source projects, including Hadoop, Spark, HBase, and Presto, with managed EMR Notebooks for data engineering, data science development, and collaboration. Each project is updated in EMR within 30 days of a version release, ensuring you have the latest from the community with no extra effort.
For data warehousing, Amazon Redshift provides the ability to run complex, analytic queries against petabytes of structured data, and includes Redshift Spectrum, which runs SQL queries directly against exabytes of structured or unstructured data in Amazon S3 without any data loading or movement. Amazon Redshift costs less than a tenth of traditional solutions. Start small for just $0.25 per hour, and scale out to petabytes of data for $1,000 per terabyte per year.
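A sketch of the Redshift Spectrum pattern described above, joining S3-resident data with a local Redshift table; the database name, IAM role ARN, and tables are hypothetical:

```
-- Map an external schema to the Glue Data Catalog so Redshift can see
-- tables whose data lives in S3.
CREATE EXTERNAL SCHEMA lake
FROM DATA CATALOG DATABASE 'datalake_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/spectrum-role';

SELECT c.region, SUM(o.amount) AS revenue
FROM lake.orders o                           -- read from S3 by Spectrum
JOIN customers c ON c.id = o.customer_id     -- local Redshift table
GROUP BY c.region;
```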
For real-time analytics, Amazon Kinesis makes it easy to collect, process, and analyze streaming data such as IoT telemetry, application logs, and website clickstreams. This enables you to process and analyze data as it arrives in your data lake, and to respond in real time instead of waiting until all of your data has been collected before processing can begin.
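Under the hood, Kinesis data streams route each record to a shard by MD5-hashing its partition key into a 128-bit hash-key space. A simplified sketch, assuming shards split that space evenly:

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard index, Kinesis-style:
    MD5 of the key, read as a 128-bit integer, matched against
    evenly split hash-key ranges (a simplifying assumption)."""
    key_hash = int.from_bytes(hashlib.md5(partition_key.encode()).digest(), "big")
    range_size = 2**128 // num_shards
    return min(key_hash // range_size, num_shards - 1)

# Records with the same partition key always land on the same shard,
# which is what preserves per-key ordering:
print(shard_for_key("device-42", 4) == shard_for_key("device-42", 4))  # True
```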
Amazon Elasticsearch Service
For operational analytics such as application monitoring, log analytics and clickstream analytics, Amazon Elasticsearch Service allows you to search, explore, filter, aggregate, and visualize your data in near real-time. Amazon Elasticsearch Service delivers Elasticsearch’s easy-to-use APIs and real-time analytics capabilities alongside the availability, scalability, and security that production workloads require.
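As a hedged illustration of the log-analytics use case, an Elasticsearch query might bucket error events over time; the index name, fields, and interval are hypothetical, and the exact parameters vary by Elasticsearch version:

```
POST /app-logs/_search
{
  "size": 0,
  "query": { "match": { "level": "ERROR" } },
  "aggs": {
    "errors_per_minute": {
      "date_histogram": { "field": "@timestamp", "interval": "1m" }
    }
  }
}
```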
Dashboards and Visualizations
For dashboards and visualizations, Amazon QuickSight provides a fast, cloud-powered business analytics service that makes it easy to build stunning visualizations and rich dashboards that can be accessed from any browser or mobile device.
For predictive analytics use cases, AWS provides a broad set of machine learning services, and tools that run on your data lake on AWS. Our services come from the knowledge and capability we’ve built up at Amazon, where ML has powered Amazon.com’s recommendation engines, supply chain, forecasting, fulfillment centers, and capacity planning.
Frameworks and Interfaces
For expert machine learning practitioners and data scientists, AWS provides AWS Deep Learning AMIs that make it easy to build deep learning models, and build clusters with ML and DL optimized GPU instances. AWS supports all the major machine learning frameworks, including Apache MXNet, TensorFlow, and Caffe2 so that you can bring or develop any model you choose. These capabilities provide unmatched power, speed, and efficiency that deep learning and machine learning workloads require.
For developers who want to go deeper with ML, Amazon SageMaker is a platform service that makes the entire process of building, training, and deploying ML models easy by providing everything you need to connect to your training data, select and optimize the best algorithm and framework, and deploy your model on auto-scaling clusters of Amazon EC2 instances. SageMaker also includes hosted Jupyter notebooks that make it easy to explore and visualize your training data stored in Amazon S3.
For developers who want to plug pre-built AI functionality into their apps, AWS provides solution-oriented APIs for computer vision and natural language processing. These application services let developers add intelligence to their applications without developing and training their own models.
Data lakes & analytics built on AWS
Why data lakes and analytics on AWS?
Flexibility and choice
AWS offers the broadest set of analytic tools and engines that analyze data using open formats and open standards. You can store your data in the standards-based data format of your choice, such as CSV, ORC, Grok, Avro, or Parquet, and you have the flexibility to analyze the data in a variety of ways, such as data warehousing, interactive SQL queries, real-time analytics, and big data processing. The breadth of analytics services that you can use with your data in AWS ensures that your existing and future analytics use cases will be met.
Unmatched scalability and availability
Amazon S3 is built to store and retrieve any amount of data, with unmatched availability, and built from the ground up to deliver 99.999999999% (11 nines) of durability. It is the only storage offering that can store your data in multiple data centers across three availability zones within a single AWS Region for unmatched resilience to single data center issues, and the only storage offering that seamlessly replicates data between any regions.
Amazon S3 is the only cloud storage platform that allows you to apply access, log, and audit policies at the account and object level. Amazon S3 provides automatic server-side encryption, encryption with keys managed by the AWS Key Management Service (KMS), and encryption with keys that you manage. Amazon S3 encrypts data in transit when replicating across regions, and lets you use separate accounts for source and destination regions to protect against malicious insider deletions. To proactively detect the early stages of an attack, Amazon Macie, an ML-powered security service, monitors data access activity for anomalies and generates detailed alerts when it detects a risk of unauthorized access or inadvertent data leaks.
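One common pattern for the bucket-level policies mentioned above is rejecting any upload that does not request server-side encryption; this is a sketch, and the bucket name is hypothetical:

```
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyUnencryptedUploads",
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:PutObject",
    "Resource": "arn:aws:s3:::my-data-lake/*",
    "Condition": {
      "StringNotEquals": { "s3:x-amz-server-side-encryption": "aws:kms" }
    }
  }]
}
```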
Data lakes built on AWS are the most cost-effective. Data that is infrequently used can be moved to Amazon S3 Glacier which provides long-term backup and archive at very low costs. Amazon S3 management capabilities can analyze object access patterns to move infrequently used data to Amazon S3 Glacier on-demand or automatically with lifecycle policies. You can begin querying the data with Amazon Athena for as little as $0.005/GB queried. Other analytics and machine learning services are priced with a pay-as-you-go approach for the resources you consume.
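A lifecycle policy of the kind described above might look like the following; the prefix and 90-day transition window are illustrative choices, not recommendations:

```
{
  "Rules": [{
    "ID": "archive-old-raw-data",
    "Filter": { "Prefix": "raw/" },
    "Status": "Enabled",
    "Transitions": [
      { "Days": 90, "StorageClass": "GLACIER" }
    ]
  }]
}
```

Once attached to the bucket, objects under the prefix move to Amazon S3 Glacier automatically as they age, with no application changes.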
AWS analytic services like Amazon Redshift and Amazon Athena were built for fast interactive query performance to support large numbers of concurrent interactive queries. When running AWS' broad portfolio of analytics and machine learning services with Amazon S3 Select, only the needed subsets of data within objects are returned, leading to queries that run up to 400% faster at a dramatically lower cost. Amazon S3 Glacier Select provides a similar capability, allowing you to retrieve archived data faster and to extend your analytics over your data lake to include archival storage.
The largest partner network
The AWS Partner Network (APN) has twice as many partner integrations as anyone else, with tens of thousands of partners, including consulting and independent software vendors, from all across the globe. This makes it easy to work and integrate with many of the same tools you use and love today. Data Lake Quick Starts, developed by AWS solution architects and partners, help you build, test, and deploy Data Lake solutions based on AWS best practices for security and high availability, in a few simple steps.