Data Lakes and Analytics on AWS

The most comprehensive, secure, scalable, and cost-effective portfolio of services to build your data lake and analytics solutions

 

The size and complexity of the data that needs to be analyzed today mean that the technology and approaches that worked in the past no longer work. To get the most value from your data, AWS provides the most comprehensive, secure, scalable, and cost-effective portfolio of services that enables you to build your data lake in the cloud and analyze all of your data, including data from IoT devices, with a variety of analytical approaches, including machine learning.

More organizations run their data lakes and analytics on AWS than anywhere else, with customers like NASDAQ, Zillow, Yelp, iRobot, and FINRA trusting AWS to run their business-critical analytics workloads.


To build your data lake and analytics solution, AWS provides the most comprehensive set of services to move, store, and analyze your data.

Figure: simplified AWS data lake diagram

Data Movement

Import your data from on-premises systems and in real time.

Data Lake

Store any type of data securely, from gigabytes to exabytes.

Analytics

Analyze your data with a broad selection of analytic tools and engines.

Machine Learning

Forecast future outcomes, and prescribe actions.

Data Movement

The first step to building data lakes on AWS is to move data to the cloud. The physical limits of bandwidth and transfer speeds make it difficult to move large volumes of data without major disruption, high costs, and long transfer times. To make data transfer easy and flexible, AWS provides the widest range of options to transfer data to the cloud.

On-premises data movement

AWS provides multiple ways to move data from your data center to AWS. To establish a dedicated network connection between your network and AWS, you can use AWS Direct Connect. To move petabytes to exabytes of data to AWS using physical appliances, you can use AWS Snowball and AWS Snowmobile. To have your on-premises applications store data directly in AWS, you can use AWS Storage Gateway.
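To see why bandwidth limits can make physical appliances attractive, the following rough sketch estimates how long a network transfer would take. The function name, the decimal units, and the 80% sustained link utilization are illustrative assumptions, not AWS figures.

```python
def transfer_days(data_terabytes: float, link_mbps: float, utilization: float = 0.8) -> float:
    """Estimate the days needed to move data over a network link.

    Assumes decimal units (1 TB = 10**12 bytes, 1 Mbps = 10**6 bits/s) and a
    hypothetical sustained utilization of the link (80% by default).
    """
    if link_mbps <= 0 or not 0 < utilization <= 1:
        raise ValueError("link speed must be positive and utilization in (0, 1]")
    total_bits = data_terabytes * 10**12 * 8
    effective_bits_per_second = link_mbps * 10**6 * utilization
    seconds = total_bits / effective_bits_per_second
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    # 1 PB (1,000 TB) over a 1 Gbps (1,000 Mbps) link
    print(f"{transfer_days(1000, 1000):.1f} days")
```

Under these assumptions, moving 1 PB over a 1 Gbps link takes about 116 days, which illustrates why a dedicated connection or a shipped appliance may be the more practical choice at that scale.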

Real-time data movement

AWS provides multiple ways to ingest real-time data generated from new sources such as websites, mobile apps, and internet-connected devices. To make it simple to capture and load streaming data or IoT device data, you can use Amazon Kinesis Data Firehose, Amazon Kinesis Video Streams, and AWS IoT Core.  
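As a minimal sketch of streaming ingestion, the code below prepares one event for Amazon Kinesis Data Firehose using the `put_record` call from boto3 (the AWS SDK for Python). The stream name and event fields are hypothetical, and `send_event` only works where boto3 is installed and AWS credentials are configured.

```python
import json


def build_firehose_record(stream_name: str, event: dict) -> dict:
    """Build keyword arguments for Firehose's PutRecord call.

    A newline is appended so that each JSON document lands on its own line
    in the delivery destination.
    """
    payload = json.dumps(event, separators=(",", ":")) + "\n"
    return {
        "DeliveryStreamName": stream_name,
        "Record": {"Data": payload.encode("utf-8")},
    }


def send_event(stream_name: str, event: dict) -> str:
    """Send one event; requires boto3 and configured AWS credentials."""
    import boto3  # imported lazily so the builder above works without it

    client = boto3.client("firehose")
    response = client.put_record(**build_firehose_record(stream_name, event))
    return response["RecordId"]
```

Separating the request builder from the network call keeps the payload format easy to check locally before any data is sent.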

Data Lake

Once data is ready for the cloud, AWS makes it easy to store data in any format, securely, and at massive scale with Amazon S3 and Amazon Glacier. To make it easy for end users to discover the relevant data to use in their analysis, AWS Glue automatically creates a single catalog that users can search and query.

Object Storage

Amazon S3

Amazon S3 is secure, highly scalable, durable object storage with millisecond latency for data access. S3 is built to store any type of data from anywhere: websites and mobile apps, corporate applications, and data from IoT sensors or devices. It is built to store and retrieve any amount of data with unmatched availability, and it is designed from the ground up to deliver 99.999999999% (11 nines) of durability. S3 provides comprehensive security and compliance capabilities that meet even the most stringent regulatory requirements.
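To put the 11 nines figure in perspective, here is a small worked calculation. It assumes the durability figure can be read as an independent, per-object probability of surviving one year, which is a simplification made for illustration.

```python
from fractions import Fraction

# Exact arithmetic avoids floating-point noise at eleven decimal places.
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)  # 99.999999999%


def expected_years_per_lost_object(object_count: int) -> Fraction:
    """Average years between single-object losses under the stated assumption."""
    annual_loss_probability = 1 - DURABILITY
    expected_losses_per_year = object_count * annual_loss_probability
    return 1 / expected_losses_per_year


if __name__ == "__main__":
    print(expected_years_per_lost_object(10_000_000))
```

With 10,000,000 stored objects, this works out to an average of one lost object every 10,000 years.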

Backup and Archive

Amazon Glacier

Amazon Glacier is secure, durable, and extremely low-cost storage for long-term backup and archive, with data retrievable in minutes. It is designed to deliver 99.999999999% (11 nines) of durability, and provides comprehensive security and compliance capabilities that can help meet even the most stringent regulatory requirements. Customers can store data for as little as $0.004 per gigabyte per month, a significant savings compared to on-premises solutions.
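The quoted price translates into a monthly bill as follows. This is a back-of-the-envelope sketch that uses only the $0.004 per gigabyte figure above, assumes 1 TB = 1,024 GB, and ignores retrieval and request charges; actual pricing varies by Region and over time.

```python
from decimal import Decimal

PRICE_PER_GB_MONTH = Decimal("0.004")  # figure quoted in the text above


def monthly_archive_cost(terabytes: int, gb_per_tb: int = 1024) -> Decimal:
    """Storage-only monthly cost in US dollars for the given archive size."""
    return terabytes * gb_per_tb * PRICE_PER_GB_MONTH


if __name__ == "__main__":
    print(f"${monthly_archive_cost(100):.2f} per month for 100 TB")
```

At that rate, archiving 100 TB costs roughly $409.60 per month for storage alone.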

Data Catalog

AWS Glue

AWS Glue is a fully managed service that provides a data catalog to make data in the data lake discoverable, and can run extract, transform, and load (ETL) jobs to prepare data for analysis. The data catalog is automatically created as a persistent metadata store for all data assets, making all of the data searchable and queryable in a single view.
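One way the catalog can be populated is with a Glue crawler that scans an S3 location. The sketch below uses boto3's `create_crawler` and `start_crawler` calls; the crawler name, IAM role ARN, database name, and S3 path are hypothetical placeholders, and the second function needs boto3 plus AWS credentials.

```python
def build_crawler_request(name: str, role_arn: str, database: str, s3_path: str) -> dict:
    """Build keyword arguments for Glue's CreateCrawler call."""
    if not s3_path.startswith("s3://"):
        raise ValueError("s3_path must look like s3://bucket/prefix/")
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(request: dict) -> None:
    """Create the crawler and run it once; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without it

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])
```

After a crawler run completes, the tables it discovers appear in the named catalog database and can be queried by the analytics services described below.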

Analytics

AWS provides the broadest and most cost-effective set of analytic services that run on the data lake. The services are purpose-built for a wide range of analytics use cases, such as interactive analysis, big data processing using Hadoop and Spark, data warehousing, real-time analytics, operational analytics, dashboards, and visualizations.

Interactive Analytics

Amazon Athena

For interactive analysis, Amazon Athena makes it easy to analyze data directly in S3 and Glacier using standard SQL queries. Athena is serverless, so there is no infrastructure to set up or manage, and you pay only for the queries you run. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL. Most results are delivered within seconds.
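The point-and-query flow described above might look like this when driven from code. The sketch relies on boto3's `start_query_execution` call; the database, table, and results bucket are hypothetical, and `run_query` needs boto3 and AWS credentials.

```python
def build_athena_query(sql: str, database: str, output_location: str) -> dict:
    """Build keyword arguments for Athena's StartQueryExecution call."""
    if not output_location.startswith("s3://"):
        raise ValueError("output_location must be an s3:// URI")
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(sql: str, database: str, output_location: str) -> str:
    """Submit the query and return its execution ID; requires boto3 and credentials."""
    import boto3  # imported lazily so the builder above works without it

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        **build_athena_query(sql, database, output_location)
    )
    return response["QueryExecutionId"]
```

Queries run asynchronously, so a caller would typically poll the execution status with the returned ID before reading results from the output location.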

Big Data Processing

Amazon EMR

For big data processing using the Hadoop and Spark frameworks, Amazon EMR provides a managed service that makes it easy, fast, and cost-effective to process vast amounts of data. Amazon EMR supports 19 different open-source projects, including Hadoop, Spark, HBase, and Presto. Each project is updated in EMR within 30 days of a version release, so you have the latest releases from the community.

Data Warehousing

Amazon Redshift

For data warehousing, Amazon Redshift provides the ability to run complex analytic queries against petabytes of structured data, and includes Redshift Spectrum, which runs SQL queries directly against exabytes of structured or unstructured data in S3 without unnecessary data movement. Amazon Redshift costs less than a tenth as much as traditional solutions. Start small for just $0.25 per hour, and scale out to petabytes of data for $1,000 per terabyte per year.

Real-Time Analytics

Amazon Kinesis

For real-time analytics, Amazon Kinesis makes it easy to collect, process, and analyze streaming data such as IoT telemetry data, application logs, and website clickstreams. This enables you to process and analyze data as it arrives in your data lake, and respond in real time instead of waiting until all your data is collected before processing can begin.

Operational Analytics

Amazon Elasticsearch Service

For operational analytics such as application monitoring, log analytics and clickstream analytics, Amazon Elasticsearch Service allows you to search, explore, filter, aggregate, and visualize your data in near real-time. Amazon Elasticsearch Service delivers Elasticsearch’s easy-to-use APIs and real-time analytics capabilities alongside the availability, scalability, and security that production workloads require.

 

Dashboards and Visualizations

Amazon QuickSight

For dashboards and visualizations, Amazon QuickSight provides a fast, cloud-powered business analytics service that makes it easy to build stunning visualizations and rich dashboards that can be accessed from any browser or mobile device.

 

Machine Learning

For predictive analytics use cases, AWS provides a broad set of machine learning services and tools that run on your data lake on AWS. Our services come from the knowledge and capability we’ve built up at Amazon, where ML has powered Amazon.com’s recommendation engines, supply chain, forecasting, fulfillment centers, and capacity planning.

 

Application Services

For developers who want to plug pre-built AI functionality into their apps, AWS provides solution-oriented APIs for computer vision and natural language processing.

Amazon Rekognition

For computer vision, Amazon Rekognition allows your developers to easily build intelligent video and image analysis into their applications.

Amazon Transcribe

Amazon Transcribe is an automatic speech recognition (ASR) service that makes it easy for your developers to add speech-to-text capability to their applications.

Amazon Translate

Amazon Translate is a neural machine translation service that delivers fast, high-quality, and affordable language translation.  

Amazon Polly

Amazon Polly allows your developers to easily turn text into lifelike speech across a large number of voices and languages.

Amazon Comprehend

Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to find insights and relationships in text. It can identify the language of the text; extract key phrases, places, people, brands, or events; understand how positive or negative the text is; and automatically organize a collection of text files by topic.
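As an illustration of the sentiment capability described above, the sketch below prepares a request for boto3's `detect_sentiment` call. The sample text is made up, English is assumed as the default language code, and `detect_sentiment` here needs boto3 and AWS credentials to run.

```python
def build_sentiment_request(text: str, language_code: str = "en") -> dict:
    """Build keyword arguments for Comprehend's DetectSentiment call."""
    if not text.strip():
        raise ValueError("text must not be empty")
    return {"Text": text, "LanguageCode": language_code}


def detect_sentiment(text: str, language_code: str = "en") -> str:
    """Return the overall sentiment label; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without it

    comprehend = boto3.client("comprehend")
    response = comprehend.detect_sentiment(
        **build_sentiment_request(text, language_code)
    )
    return response["Sentiment"]  # POSITIVE, NEGATIVE, NEUTRAL, or MIXED
```

The same pattern extends to the other text analyses mentioned above, each of which has its own API call.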

Amazon Lex

Amazon Lex uses the automatic speech recognition and natural language understanding technology that fuels Amazon Alexa to allow your developers to quickly build intelligent conversational applications.

Frameworks and Interfaces

AWS Deep Learning AMIs

For expert machine learning practitioners and data scientists, AWS provides AWS Deep Learning AMIs that make it easy to build deep learning models and to build clusters with ML- and DL-optimized GPU instances. AWS supports all the major machine learning frameworks, including TensorFlow, Caffe2, and Apache MXNet, so that you can bring or develop any model you choose. These capabilities provide the power, speed, and efficiency that deep learning and machine learning workloads require.

Platform Services

Amazon SageMaker

For developers who want to go deep with ML, Amazon SageMaker is a platform service that makes the entire process of building, training, and deploying ML models easy by providing everything you need to connect to your training data, select and optimize the best algorithm and framework, and deploy your model on auto-scaling clusters of Amazon EC2. SageMaker also includes hosted Jupyter notebooks that make it easy to explore and visualize your training data stored in Amazon S3.

More data lakes & analytics built on AWS than anywhere else

Why data lakes and analytics on AWS?

Flexibility and choice

AWS offers the broadest set of analytic tools and engines that analyze data using open formats and open standards. You can store your data in the standards-based data format of your choice, such as CSV, ORC, Grok, Avro, and Parquet, and you have the flexibility to analyze the data in a variety of ways, such as data warehousing, interactive SQL queries, real-time analytics, and big data processing. The breadth of analytics services that you can use with your data in AWS ensures that your needs will be met for your existing and future analytics use cases.

Unmatched scalability and availability

Amazon S3 is built to store and retrieve any amount of data, with unmatched availability, and built from the ground up to deliver 99.999999999% (11 nines) of durability. It is the only storage offering that can store your data in multiple data centers across three availability zones within a single AWS Region for unmatched resilience to single data center issues, and the only storage offering that seamlessly replicates data between any regions.

Highly secure

S3 is the only cloud storage platform that allows you to apply access, log, and audit policies at the account and object level. S3 provides automatic server-side encryption, encryption with keys managed by the AWS Key Management Service (KMS), and encryption with keys that you manage. S3 encrypts data in transit when replicating across regions, and lets you use separate accounts for source and destination regions to protect against malicious insider deletions. To proactively detect early stages of an attack, Amazon Macie, an ML-powered security service, monitors data access activity for anomalies and generates detailed alerts when it detects risk of unauthorized access or inadvertent data leaks.
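The server-side encryption options above can be selected per upload. The following sketch builds arguments for boto3's `put_object` call, choosing KMS-managed keys when a key ID is supplied and S3-managed keys otherwise; the bucket, object key, and KMS key alias are hypothetical.

```python
from typing import Optional


def build_encrypted_put(
    bucket: str, key: str, body: bytes, kms_key_id: Optional[str] = None
) -> dict:
    """Build keyword arguments for S3's PutObject call with server-side encryption."""
    request = {"Bucket": bucket, "Key": key, "Body": body}
    if kms_key_id:
        # Encrypt with a key managed in AWS KMS.
        request["ServerSideEncryption"] = "aws:kms"
        request["SSEKMSKeyId"] = kms_key_id
    else:
        # Encrypt with keys managed by S3.
        request["ServerSideEncryption"] = "AES256"
    return request


def upload(request: dict) -> None:
    """Upload the object; requires boto3 and configured AWS credentials."""
    import boto3  # imported lazily so the builder above works without it

    boto3.client("s3").put_object(**request)
```

Encryption with keys that you manage yourself uses a different set of request parameters and is not covered by this sketch.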

Cost-effective

Data lakes built on AWS are the most cost-effective. Data that is infrequently used can be moved to Amazon Glacier, which provides long-term backup and archive at very low cost. Amazon S3 management capabilities can analyze object access patterns to move infrequently used data to Glacier on demand or automatically with lifecycle policies. You can begin querying the data with Amazon Athena for as little as $0.005/GB queried. Other analytics and machine learning services are priced with a pay-as-you-go approach for the resources you consume.
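A lifecycle policy of the kind mentioned above could be expressed as follows. The sketch builds the configuration accepted by boto3's `put_bucket_lifecycle_configuration` call; the rule ID, prefix, and 90-day threshold are illustrative assumptions rather than recommended values.

```python
def build_archive_rule(rule_id: str, prefix: str, days_until_archive: int) -> dict:
    """Build a lifecycle configuration that moves objects under a prefix to Glacier."""
    if days_until_archive < 0:
        raise ValueError("days_until_archive must not be negative")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": days_until_archive, "StorageClass": "GLACIER"}
                ],
            }
        ]
    }


def apply_lifecycle(bucket: str, configuration: dict) -> None:
    """Attach the configuration to a bucket; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without it

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=configuration
    )
```

For example, `build_archive_rule("archive-logs", "logs/", 90)` describes a rule that transitions objects under `logs/` to Glacier 90 days after creation.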

Fast performance

AWS analytic services like Amazon Redshift and Amazon Athena were built for fast interactive query performance and support large numbers of concurrent interactive queries. When AWS analytics and machine learning services use Amazon S3 Select, only the subsets of data that are needed within objects are returned, leading to queries that run up to 400% faster at a dramatically lower cost. Glacier Select provides a similar capability, allowing you to retrieve archived data faster and to extend your analytical capability over your data lake to include archival storage.
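To make the idea of returning only a subset of an object concrete, here is a sketch built on boto3's `select_object_content` call. It assumes a CSV object with a header row; the bucket, key, and column names are hypothetical, and `select_rows` needs boto3 and AWS credentials.

```python
def build_select_request(bucket: str, key: str, sql: str) -> dict:
    """Build keyword arguments for S3's SelectObjectContent call on a CSV object."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": sql,
        # Treat the first line as column names so the SQL can refer to them.
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
        "OutputSerialization": {"CSV": {}},
    }


def select_rows(bucket: str, key: str, sql: str) -> str:
    """Return the matching rows as CSV text; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without it

    s3 = boto3.client("s3")
    response = s3.select_object_content(**build_select_request(bucket, key, sql))
    chunks = []
    for event in response["Payload"]:  # results arrive as a stream of events
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    return b"".join(chunks).decode("utf-8")
```

A query such as `SELECT s.device_id FROM S3Object s WHERE s.status = 'error'` would then transfer only the matching rows instead of the whole object.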

 

The largest partner network

The AWS Partner Network (APN) has twice as many partner integrations as anyone else, with tens of thousands of partners, including consulting partners and independent software vendors, from all across the globe. This makes it easy to work and integrate with many of the same tools you use and love today. Data Lake Quick Starts, developed by AWS solution architects and partners, help you build, test, and deploy data lake solutions based on AWS best practices for security and high availability, in a few simple steps.

 

Get started with AWS


Sign up for an AWS account

Instantly get access to the AWS Free Tier.
Learn more: What is a data lake?

Learn more about data lakes on AWS

Read more about deploying data lakes on AWS here.
Watch a session on Architecting a Data Lake here and big data architectural patterns here.
Watch customer sessions on how they have built a Data Lake including FINRA, Amazon.com, Rovio, and Sysco Foods
 

Start building with AWS

Upload your data to Amazon S3, catalog your data with AWS Glue, and begin querying it with Amazon Athena. Run data warehousing queries with Amazon Redshift Spectrum, Hadoop and Spark with Amazon EMR, and machine learning with Amazon SageMaker.
 
Have a POC and want to talk to someone? Contact us or deploy through our AWS Quick Starts.
 
