AWS Solutions Library

Data Lake on AWS

Many Amazon Web Services (AWS) customers require a data storage and analytics solution that offers more agility and flexibility than traditional data management systems. A data lake is an increasingly popular way to store and analyze data because it lets companies manage multiple data types from a wide variety of sources and store that data, both structured and unstructured, in a centralized repository.

The AWS Cloud provides many of the building blocks required to help customers implement a secure, flexible, and cost-effective data lake. These include AWS managed services that help ingest, store, find, process, and analyze both structured and unstructured data. To support our customers as they build data lakes, AWS offers Data Lake on AWS, which deploys a highly available, cost-effective data lake architecture on the AWS Cloud along with a user-friendly console for searching and requesting datasets.

Overview

Data Lake on AWS automatically configures the core AWS services needed to tag, search, share, transform, analyze, and govern specific subsets of data across a company or with external users. The Guidance deploys a console that users can access to search and browse available datasets for their business needs. It also includes a federated template that allows you to launch a version of the solution that is ready to integrate with Microsoft Active Directory.

The diagram below presents the data lake architecture you can build using the example code on GitHub.

[Figure: Data Lake on AWS architecture diagram]

The code configures a suite of AWS Lambda microservices (functions), Amazon OpenSearch Service for robust search capabilities, Amazon Cognito for user authentication, AWS Glue for data transformation, and Amazon Athena for analysis.

Data Lake on AWS uses the security, durability, and scalability of Amazon S3 to manage a persistent catalog of organizational datasets, and Amazon DynamoDB to manage the corresponding metadata. Once a dataset is cataloged, its attributes and descriptive tags become searchable. Users can search and browse available datasets in the console and build a list of the data they need access to. The solution keeps track of the datasets a user selects and, when the user checks out, generates a manifest file with secure access links to the requested content.
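The search, selection, and checkout flow described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the solution's actual implementation: the `Package` and `Catalog` classes, the `build_manifest` function, and the `sign` callback are hypothetical names, and a plain dictionary stands in for the Amazon S3 catalog and the Amazon DynamoDB metadata.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Package:
    """Hypothetical dataset record: descriptive tags plus the object keys it owns."""
    package_id: str
    name: str
    tags: dict[str, str]
    object_keys: list[str]


class Catalog:
    """In-memory stand-in for the persistent catalog and its metadata store."""

    def __init__(self) -> None:
        self._packages: dict[str, Package] = {}

    def register(self, package: Package) -> None:
        self._packages[package.package_id] = package

    def get(self, package_id: str) -> Package:
        return self._packages[package_id]

    def search(self, term: str) -> list[Package]:
        """Case-insensitive substring match on the name and on tag values."""
        needle = term.lower()
        return [
            p for p in self._packages.values()
            if needle in p.name.lower()
            or any(needle in value.lower() for value in p.tags.values())
        ]


def build_manifest(catalog: Catalog, cart: list[str],
                   sign: Callable[[str], str]) -> str:
    """Turn the selected package IDs into a JSON manifest of access links.

    `sign` is an assumed callback that maps an object key to a secure link.
    """
    entries = [
        {"package": catalog.get(pid).name, "key": key, "url": sign(key)}
        for pid in cart
        for key in catalog.get(pid).object_keys
    ]
    return json.dumps({"entries": entries}, indent=2)


if __name__ == "__main__":
    catalog = Catalog()
    catalog.register(Package("p1", "Sales 2022", {"team": "finance"},
                             ["sales/2022-q1.csv", "sales/2022-q2.csv"]))
    cart = [p.package_id for p in catalog.search("finance")]
    # A placeholder signer; a real one would return a time-limited link.
    print(build_manifest(catalog, cart, lambda key: "signed://" + key))
```

Passing the signer in as a callback keeps the catalog logic independent of how access links are actually produced.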


Data Lake on AWS

Version 2.2
Last updated: 04/2023
Author: AWS

Example code on GitHub

Additional resources

Resources & FAQ »
Contact us »

Features

Data access flexibility

Leverage pre-signed Amazon S3 URLs, or use an appropriate AWS Identity and Access Management (IAM) role for controlled yet direct access to datasets in Amazon S3.

Managed storage layer

Secure and manage the storage and retrieval of data in a managed Amazon S3 bucket, and use a solution-specific AWS Key Management Service (KMS) key to encrypt data at rest.
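As a rough illustration of what encrypting with a KMS key looks like at upload time, the sketch below assembles the request parameters for an S3 `PutObject` call that asks for server-side encryption under a specific key. `ServerSideEncryption` and `SSEKMSKeyId` are standard `PutObject` parameters; the helper name `encrypted_put_params`, the bucket name, and the key alias are hypothetical, and the solution may configure encryption differently (for example, through a bucket-level default).

```python
def encrypted_put_params(bucket: str, key: str, body: bytes,
                         kms_key_id: str) -> dict:
    """Build PutObject parameters that request SSE-KMS with the given key."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ServerSideEncryption": "aws:kms",  # encrypt at rest with AWS KMS
        "SSEKMSKeyId": kms_key_id,          # which KMS key to use
    }


if __name__ == "__main__":
    params = encrypted_put_params(
        "example-data-lake-bucket",      # hypothetical bucket name
        "packages/p1/sales.csv",
        b"region,amount\nwest,100\n",
        "alias/example-data-lake-key",   # hypothetical key alias
    )
    # With boto3 installed and credentials configured, the upload would be:
    #   boto3.client("s3").put_object(**params)
    print(sorted(params))
```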

Federation sign-in

As an option, you can allow users to sign in through a SAML identity provider (IdP) such as Microsoft Active Directory Federation Services (AD FS).

Command line interface

Use the provided CLI or API to easily automate data lake activities or integrate this Guidance into existing data automation for dataset ingress, egress, and analysis.

User interface

Data Lake on AWS provides a web-based console UI hosted on Amazon S3 and delivered by Amazon CloudFront. Use the console to manage data lake users and policies, add or remove data packages, search data packages, and create manifests of datasets for additional analysis.
