What does this AWS Solutions Implementation do?
Many Amazon Web Services (AWS) customers require a data storage and analytics solution that offers more agility and flexibility than traditional data management systems. A data lake is a new and increasingly popular way to store and analyze data because it allows companies to manage multiple data types from a wide variety of sources, and store this data, structured and unstructured, in a centralized repository.
The AWS Cloud provides many of the building blocks required to help customers implement a secure, flexible, and cost-effective data lake. These include AWS managed services that help ingest, store, find, process, and analyze both structured and unstructured data. To support our customers as they build data lakes, AWS offers the data lake solution, which is an automated reference implementation that deploys a highly available, cost-effective data lake architecture on the AWS Cloud along with a user-friendly console for searching and requesting datasets.
Version 2.2 of the solution uses the most up-to-date Node.js runtime. Version 2.1 uses the Node.js 8.10 runtime, which reaches end-of-life on December 31, 2019. To upgrade to version 2.2, you must deploy the solution as a new stack. For more information, see the deployment guide.
AWS Solutions Implementation overview
AWS offers a data lake solution that automatically configures the core AWS services necessary to tag, search, share, transform, analyze, and govern specific subsets of data within a company or with external users. The solution deploys a console that users can access to search and browse the datasets available for their business needs. The solution also includes a federated template that allows you to launch a version of the solution that is ready to integrate with Microsoft Active Directory.
The diagram below presents the data lake architecture you can deploy in minutes using the solution's implementation guide and accompanying AWS CloudFormation template.

Data Lake on AWS solution architecture
The AWS CloudFormation template configures the solution's core AWS services, which include a suite of AWS Lambda microservices (functions), Amazon Elasticsearch Service for robust search capabilities, Amazon Cognito for user authentication, AWS Glue for data transformation, and Amazon Athena for analysis.
The solution leverages the security, durability, and scalability of Amazon S3 to manage a persistent catalog of organizational datasets, and Amazon DynamoDB to manage corresponding metadata. Once a dataset is cataloged, its attributes and descriptive tags are available to search on. Users can search and browse available datasets in the solution console, and create a list of data they require access to.
The solution keeps track of the datasets a user selects and generates a manifest file with secure access links to the desired content when the user checks out.
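The catalog, search, and checkout flow described above can be illustrated with a minimal in-memory sketch. It is not the solution's actual code: the `Dataset` class, the `search` and `build_manifest` functions, the manifest layout, and the `placeholder_link` helper are all hypothetical names chosen for illustration, and the real solution stores its catalog in Amazon S3 and Amazon DynamoDB rather than in a Python list.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Dataset:
    """One hypothetical catalog entry: object location plus searchable tags."""
    dataset_id: str
    bucket: str
    key: str
    tags: Dict[str, str] = field(default_factory=dict)


def search(catalog: List[Dataset], term: str) -> List[Dataset]:
    """Return datasets whose ID, tag keys, or tag values contain the term (case-insensitive)."""
    needle = term.lower()
    hits = []
    for ds in catalog:
        haystack = [ds.dataset_id, *ds.tags.keys(), *ds.tags.values()]
        if any(needle in text.lower() for text in haystack):
            hits.append(ds)
    return hits


def build_manifest(cart: List[Dataset], make_link: Callable[[str, str], str]) -> str:
    """Serialize the selected datasets as a JSON manifest with one access link each."""
    entries = [
        {"dataset_id": ds.dataset_id, "url": make_link(ds.bucket, ds.key)}
        for ds in cart
    ]
    return json.dumps({"entries": entries}, indent=2)


def placeholder_link(bucket: str, key: str) -> str:
    # Stand-in only (assumption): a real deployment would return a secure,
    # time-limited link here instead of this fixed placeholder string.
    return f"https://{bucket}.example.invalid/{key}?signature=PLACEHOLDER"


# Example catalog with made-up dataset names, bucket, and tags.
catalog = [
    Dataset("sales-2019", "demo-bucket", "sales/2019.csv", {"team": "finance", "format": "csv"}),
    Dataset("clickstream", "demo-bucket", "web/clicks.json", {"team": "marketing", "format": "json"}),
]

cart = search(catalog, "finance")                    # user searches and selects results
manifest = build_manifest(cart, placeholder_link)    # user checks out
print(manifest)
```

The link generator is passed in as a function so that the signing mechanism stays replaceable. This page does not say how the solution creates its secure links; one plausible option, stated here only as an assumption, is an Amazon S3 presigned URL such as the one returned by boto3's `generate_presigned_url`.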
Data Lake on AWS
Version 2.2
Last updated: 12/2019
Author: AWS
Estimated deployment time: 30 min
Features
Data lake reference implementation
Data access flexibility
Federation sign in
Managed storage layer
Command line interface
User interface
