AWS delivers an integrated suite of services that provide everything needed to quickly and easily build and manage a data lake for analytics. AWS-powered data lakes can handle the scale, agility, and flexibility required to combine different types of data and analytics approaches to gain deeper insights, in ways that traditional data silos and data warehouses cannot. AWS gives customers the widest array of analytics and machine learning services, for easy access to all relevant data, without compromising on security or governance.
There are more organizations with data lakes and analytics on AWS than anywhere else. Customers like NASDAQ, Zillow, Yelp, iRobot, and FINRA trust AWS to run their business critical analytics workloads.
Data Lakes and Analytics on AWS
To build your data lakes and analytics solution, AWS provides the most comprehensive set of services to move, store, and analyze your data.
Import your data from on-premises sources and in real time.
Store any type of data securely, from gigabytes to exabytes.
Analyze your data with the broadest selection of analytics services.
Predict future outcomes, and prescribe actions for rapid response.
Why data lakes and analytics on AWS?
Easiest to build data lakes
Build a secure data lake in days instead of months. Our experience working with tens of thousands of customers to build productive data lakes has allowed us to make every aspect of analyzing data in the cloud easier. For example, AWS Lake Formation automates the manual steps required to build a data lake and provides a single security mechanism across all of your data, so you spend less time on the undifferentiated heavy lifting required to build a data lake and more time exploring your data to get answers to your most important questions.
Best performance at the lowest cost
AWS is the fastest and most cost-effective place to store and analyze data. For example, Amazon S3 provides five storage classes and automatic data lifecycle management so you pay only for the storage your data needs based on how that data is used. Amazon Redshift is 3x faster than any other cloud data warehouse and gets faster every year. Amazon EMR is the fastest place in the cloud to run Apache Spark and Apache Hive workloads. EMR’s deep integration with the rest of AWS makes it easy to take advantage of cost-saving features, such as Amazon EC2 Spot Instances, to reduce costs by up to 90%.
Most comprehensive and open
Having all your data locked in a single siloed analytics service doesn’t work anymore. Modern analytics requires a collection of different tools and approaches, including SQL, R, Scala, Jupyter, and Python, to get to the right insights and answers. AWS provides a mature and comprehensive set of analytics services that run against the open data lake so you can use the right tool for the right job without needing to move or transform data for each different analytics approach. All of our services support accessing data stored in a single object store (S3) with open APIs, in open formats (e.g., Apache Parquet, Apache ORC, Apache Avro), and using both proprietary engines (Redshift for data warehousing) and open engines (e.g., Spark, Hive).
Keeping your data secure and complying with relevant regulations is essential. AWS provides a comprehensive set of tools that go beyond standard security functionality like encryption and access control to include proactive monitoring and unified management of security policies. For example, Amazon Macie helps monitor your data lake to ensure you are not accidentally exposing credentials or personally identifiable information (PII), Amazon Inspector helps to enforce best practices and identify configuration issues that could be exploited, and AWS Lake Formation allows you to consistently control access to data in your data lake across all analytics services.
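As a minimal sketch of Lake Formation’s consistent access control, the helper below builds the request for granting a principal SELECT on one catalog table. The role ARN, database, and table names are illustrative placeholders, and the actual call requires AWS credentials and an existing Lake Formation data lake.

```python
def build_grant(principal_arn: str, database: str, table: str, permissions: list) -> dict:
    """Build the kwargs for lakeformation.grant_permissions."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": permissions,
    }

def grant_table_access(principal_arn: str, database: str, table: str):
    # Imported lazily so the sketch can be read without the AWS SDK installed;
    # a real invocation needs credentials with Lake Formation admin rights.
    import boto3
    client = boto3.client("lakeformation")
    return client.grant_permissions(**build_grant(principal_arn, database, table, ["SELECT"]))
```

Because the grant is expressed once in Lake Formation, it applies whether the data is later read through Athena, Redshift Spectrum, or EMR.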
The first step to building data lakes on AWS is to move data to the cloud. The physical limitations of bandwidth and transfer speeds restrict the ability to move data without major disruption, high costs, and time. To make data transfer easy and flexible, AWS provides the widest range of options to transfer data to the cloud.
To build ETL jobs and ML Transforms for your data lake, learn about AWS Lake Formation.
On-premises data movement
AWS provides multiple ways to move data from your datacenter to AWS. To establish a dedicated network connection between your network and AWS, you can use AWS Direct Connect. To move petabytes to exabytes of data to AWS using physical appliances, you can use AWS Snowball and AWS Snowmobile. To have your on-premises applications store data directly into AWS, you can use AWS Storage Gateway.
Real-time data movement
AWS provides multiple ways to ingest real-time data generated from new sources such as websites, mobile apps, and internet-connected devices. To make it simple to capture and load streaming data or IoT device data, you can use Amazon Kinesis Data Firehose, Amazon Kinesis Video Streams, and AWS IoT Core.
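A minimal sketch of streaming ingest with Kinesis Data Firehose: the helper serializes an event as newline-delimited JSON (so objects delivered to S3 stay line-delimited for downstream tools), and the sender wraps the `put_record` call. The delivery stream name is a placeholder, and the call itself requires AWS credentials.

```python
import json

def build_firehose_record(event: dict) -> dict:
    # Firehose delivers raw bytes; the trailing newline keeps the resulting
    # S3 objects line-delimited for tools such as Athena and Glue.
    return {"Data": (json.dumps(event) + "\n").encode("utf-8")}

def put_event(stream_name: str, event: dict):
    import boto3  # lazy import; a real call needs AWS credentials
    firehose = boto3.client("firehose")
    return firehose.put_record(
        DeliveryStreamName=stream_name,  # placeholder stream name
        Record=build_firehose_record(event),
    )
```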
Once data is ready for the cloud, AWS makes it easy to store data in any format, securely, and at massive scale with Amazon S3 and Amazon Glacier. To make it easy for end users to discover the relevant data to use in their analysis, AWS Glue automatically creates a single catalog that is searchable and queryable by users.
To build a secure data lake faster, learn more about AWS Lake Formation.
Amazon S3 is secure, highly scalable, durable object storage with millisecond latency for data access. S3 is built to store any type of data from anywhere: websites and mobile apps, corporate applications, and data from IoT sensors or devices. It is built to store and retrieve any amount of data, with unmatched availability, and built from the ground up to deliver 99.999999999% (11 nines) of durability. S3 Select retrieves only the subset of an object’s data that a query needs, improving query performance by up to 400%. S3 provides comprehensive security and compliance capabilities that meet even the most stringent regulatory requirements.
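As a sketch of S3 Select, the helper below builds the request for `select_object_content`, which runs a SQL expression server-side so only matching rows leave S3. The bucket, key, and CSV layout are illustrative assumptions, and the streaming call requires AWS credentials.

```python
def build_select_request(bucket: str, key: str, sql: str) -> dict:
    """Kwargs for s3.select_object_content over a CSV object with a header row."""
    return {
        "Bucket": bucket,          # placeholder bucket name
        "Key": key,                # placeholder object key
        "Expression": sql,
        "ExpressionType": "SQL",
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}, "CompressionType": "NONE"},
        "OutputSerialization": {"JSON": {}},
    }

def select_rows(bucket: str, key: str, sql: str):
    import boto3  # lazy import; a real call needs AWS credentials
    s3 = boto3.client("s3")
    resp = s3.select_object_content(**build_select_request(bucket, key, sql))
    for event in resp["Payload"]:  # the response is an event stream
        if "Records" in event:
            yield event["Records"]["Payload"].decode("utf-8")
```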
Backup and Archive
Amazon Glacier is secure, durable, and extremely low-cost storage for long-term backup and archive that can access data in minutes; like S3 Select, Glacier Select reads and retrieves only the data needed. It is designed to deliver 99.999999999% (11 nines) durability, and provides comprehensive security and compliance capabilities that can help meet even the most stringent regulatory requirements. Customers can store data for as little as $0.004 per gigabyte per month, a significant savings compared to on-premises solutions.
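One common way to move cold data into Glacier is an S3 lifecycle rule rather than a direct upload. The sketch below builds such a rule; the bucket name, prefix, and 90-day threshold are illustrative assumptions, and applying the rule requires AWS credentials.

```python
def build_archive_rule(prefix: str, days: int = 90) -> dict:
    """An S3 lifecycle configuration that transitions objects under a
    prefix to the GLACIER storage class after `days` days."""
    return {
        "Rules": [{
            "ID": f"archive-{prefix.strip('/')}",
            "Filter": {"Prefix": prefix},
            "Status": "Enabled",
            "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
        }]
    }

def apply_archive_rule(bucket: str, prefix: str):
    import boto3  # lazy import; a real call needs AWS credentials
    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket,  # placeholder bucket name
        LifecycleConfiguration=build_archive_rule(prefix),
    )
```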
AWS Glue is a fully managed service that provides a data catalog to make data in the data lake discoverable, and has the ability to do extract, transform, and load (ETL) to prepare data for analysis. The data catalog is automatically created as a persistent metadata store for all data assets, making all of the data searchable and queryable in a single view.
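A sketch of searching that catalog: the filter is pure Python over table entries in the shape `glue.get_tables` returns, and the paginated lookup wraps the actual API. The database and keyword are placeholders, and the lookup requires AWS credentials and an existing Glue Data Catalog.

```python
def find_tables(keyword: str, tables: list) -> list:
    """Filter catalog table entries (shaped like glue.get_tables output) by name."""
    return [t["Name"] for t in tables if keyword.lower() in t["Name"].lower()]

def search_catalog(database: str, keyword: str) -> list:
    import boto3  # lazy import; a real call needs AWS credentials
    glue = boto3.client("glue")
    names = []
    for page in glue.get_paginator("get_tables").paginate(DatabaseName=database):
        names += find_tables(keyword, page["TableList"])
    return names
```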
AWS provides the broadest and most cost-effective set of analytics services that run on the data lake. Each analytics service is purpose-built for a wide range of analytics use cases such as interactive analysis, big data processing using Apache Spark and Hadoop, data warehousing, real-time analytics, operational analytics, dashboards, and visualizations.
To manage secure, self-service access to data in a data lake for analytics services, learn more about AWS Lake Formation.
Interactive Analysis
For interactive analysis, Amazon Athena makes it easy to analyze data directly in S3 and Glacier using standard SQL queries. Athena is serverless, so there is no infrastructure to set up or manage. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL; most results are delivered within seconds, and you pay only for the queries you run.
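As a minimal sketch of the Athena flow, the helper below builds the `start_query_execution` request and the runner polls for completion. The database name and results bucket are illustrative placeholders, and running the query requires AWS credentials.

```python
import time

def build_query(sql: str, database: str, output_s3: str) -> dict:
    """Kwargs for athena.start_query_execution."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},  # placeholder bucket
    }

def run_query(sql: str, database: str, output_s3: str):
    import boto3  # lazy import; a real call needs AWS credentials
    athena = boto3.client("athena")
    qid = athena.start_query_execution(**build_query(sql, database, output_s3))["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in {"SUCCEEDED", "FAILED", "CANCELLED"}:
            return qid, state
        time.sleep(1)  # poll until the query finishes
```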
Big Data Processing
For big data processing using the Spark and Hadoop frameworks, Amazon EMR provides a managed service that makes it easy, fast, and cost-effective to process vast amounts of data. Amazon EMR supports 19 different open-source projects including Hadoop, Spark, HBase, and Presto, with managed EMR Notebooks for data engineering, data science development, and collaboration. Each project is updated in EMR within 30 days of a version release, ensuring you have the latest from the community with no extra effort.
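A sketch of submitting Spark work to EMR: the helper builds a step that runs a PySpark script via `spark-submit`, and the second function adds it to a cluster. The cluster ID and the S3 path of the script are placeholders, and submitting a step requires AWS credentials and a running cluster.

```python
def build_spark_step(name: str, script_s3: str, args=()) -> dict:
    """An EMR step that submits a PySpark script through command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3, *args],
        },
    }

def add_step(cluster_id: str, step: dict):
    import boto3  # lazy import; a real call needs AWS credentials
    return boto3.client("emr").add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
```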
Data Warehousing
For data warehousing, Amazon Redshift provides the ability to run complex, analytic queries against petabytes of structured data, and includes Redshift Spectrum, which runs SQL queries directly against exabytes of structured or unstructured data in S3 without unnecessary data movement. Amazon Redshift is less than a tenth of the cost of traditional solutions. Start small for just $0.25 per hour, and scale out to petabytes of data for $1,000 per terabyte per year.
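As a sketch of how Spectrum queries S3 data in place, the helper below generates the DDL for an external table over Parquet files. The schema, column names, and S3 location are illustrative placeholders; you would run the statement in Redshift after creating an external schema that points at your Glue catalog.

```python
def spectrum_ddl(schema: str, table: str, s3_path: str) -> str:
    """DDL for an external (Redshift Spectrum) table over Parquet files in S3.
    Column names and the S3 location are illustrative placeholders."""
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n"
        "    event_id   VARCHAR(64),\n"
        "    event_time TIMESTAMP,\n"
        "    payload    VARCHAR(4096)\n"
        ")\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{s3_path}';"
    )
```

Once the external table exists, ordinary SQL can join it against local Redshift tables with no data loading step.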
Real-Time Analytics
For real-time analytics, Amazon Kinesis makes it easy to collect, process, and analyze streaming data such as IoT telemetry data, application logs, and website clickstreams. This enables you to process and analyze data as it arrives in your data lake, and respond in real time instead of having to wait until all your data is collected before the processing can begin.
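A sketch of publishing to a Kinesis data stream: the helper builds the `put_record` payload, using a field from the event as the partition key so related records land on the same shard. The stream name and the `device_id` field are illustrative assumptions, and publishing requires AWS credentials.

```python
import json

def build_stream_record(event: dict, partition_key_field: str) -> dict:
    """Kwargs fragment for kinesis.put_record; the partition key
    determines which shard receives the record."""
    return {
        "Data": json.dumps(event).encode("utf-8"),
        "PartitionKey": str(event[partition_key_field]),
    }

def publish(stream: str, event: dict, key_field: str = "device_id"):
    import boto3  # lazy import; a real call needs AWS credentials
    rec = build_stream_record(event, key_field)
    return boto3.client("kinesis").put_record(StreamName=stream, **rec)
```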
Amazon Elasticsearch Service
For operational analytics such as application monitoring, log analytics and clickstream analytics, Amazon Elasticsearch Service allows you to search, explore, filter, aggregate, and visualize your data in near real-time. Amazon Elasticsearch Service delivers Elasticsearch’s easy-to-use APIs and real-time analytics capabilities alongside the availability, scalability, and security that production workloads require.
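As a sketch of a typical operational-analytics query, the helper below builds an Elasticsearch search body that filters recent ERROR log lines and aggregates them by service. The field names (`level`, `service`, `@timestamp`) and the index pattern mentioned in the comment are illustrative assumptions; you would POST the body to your domain's `_search` endpoint.

```python
def build_error_search(since: str, size: int = 20) -> dict:
    """An Elasticsearch query body for recent ERROR log lines, bucketed by service.
    Field names are illustrative; POST this to e.g. /logs-*/_search."""
    return {
        "size": size,
        "query": {"bool": {"filter": [
            {"term": {"level": "ERROR"}},
            {"range": {"@timestamp": {"gte": since}}},
        ]}},
        "aggs": {"by_service": {"terms": {"field": "service.keyword"}}},
    }
```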
Dashboards and Visualizations
For dashboards and visualizations, Amazon QuickSight provides a fast, cloud-powered business analytics service that makes it easy to build stunning visualizations and rich dashboards that can be accessed from any browser or mobile device.
For predictive analytics use cases, AWS provides a broad set of machine learning services and tools that run on your data lake on AWS. Our services come from the knowledge and capability we’ve built up at Amazon, where ML has powered Amazon.com’s recommendation engines, supply chain, forecasting, fulfillment centers, and capacity planning.
Frameworks and Interfaces
For expert machine learning practitioners and data scientists, AWS provides AWS Deep Learning AMIs that make it easy to build deep learning models and to build clusters with ML- and DL-optimized GPU instances. AWS supports all the major machine learning frameworks, including Apache MXNet, TensorFlow, and Caffe2, so that you can bring or develop any model you choose. These capabilities provide the power, speed, and efficiency that deep learning and machine learning workloads require.
For developers who want to go deeper with ML, Amazon SageMaker is a platform service that makes the entire process of building, training, and deploying ML models easy by providing everything you need to connect to your training data, select and optimize the best algorithm and framework, and deploy your model on auto-scaling clusters of Amazon EC2. SageMaker also includes hosted Jupyter notebooks that make it easy to explore and visualize your training data stored in Amazon S3.
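As a minimal sketch of kicking off SageMaker training against data in S3, the helper below builds the request for `create_training_job`. The job name, training image URI, IAM role ARN, instance type, and S3 paths are all illustrative placeholders; starting the job requires AWS credentials.

```python
def build_training_job(name: str, image: str, role_arn: str,
                       train_s3: str, output_s3: str) -> dict:
    """Kwargs for sagemaker.create_training_job; URIs and the ARN are placeholders."""
    return {
        "TrainingJobName": name,
        "AlgorithmSpecification": {"TrainingImage": image, "TrainingInputMode": "File"},
        "RoleArn": role_arn,
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3},
        "ResourceConfig": {"InstanceType": "ml.m5.xlarge",  # illustrative choice
                           "InstanceCount": 1, "VolumeSizeInGB": 10},
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }

def start_training(job: dict):
    import boto3  # lazy import; a real call needs AWS credentials
    return boto3.client("sagemaker").create_training_job(**job)
```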
For developers who want to plug pre-built AI functionality into their apps, AWS provides solution-oriented APIs for computer vision and natural language processing. These application services let developers add intelligence to their applications without developing and training their own models.
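As a sketch of one such API, the code below wraps Amazon Comprehend's sentiment detection. The summarizer is pure Python over the response shape `detect_sentiment` returns; the actual call requires AWS credentials, and the language code shown is just a default.

```python
def summarize_sentiment(resp: dict) -> str:
    """Reduce a comprehend.detect_sentiment response to e.g. 'POSITIVE (0.98)'."""
    label = resp["Sentiment"]                       # e.g. "POSITIVE"
    score = resp["SentimentScore"][label.title()]   # matching confidence score
    return f"{label} ({score:.2f})"

def detect_sentiment(text: str, language: str = "en") -> str:
    import boto3  # lazy import; a real call needs AWS credentials
    resp = boto3.client("comprehend").detect_sentiment(Text=text, LanguageCode=language)
    return summarize_sentiment(resp)
```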