Many Amazon Web Services (AWS) customers require a data storage and analytics solution that offers more agility and flexibility than traditional data management systems. A data lake is a new and increasingly popular way to store and analyze data because it allows companies to manage multiple data types from a wide variety of sources, and store this data, structured and unstructured, in a centralized repository.

The AWS Cloud provides many of the building blocks required to help customers implement a secure, flexible, and cost-effective data lake. These include AWS managed services that help ingest, store, find, process, and analyze both structured and unstructured data. To support our customers as they build data lakes, AWS offers the data lake solution, which is an automated reference implementation that deploys a highly available, cost-effective data lake architecture on the AWS Cloud along with a user-friendly console for searching and requesting datasets.

This webpage provides high-level best practices and guidance for building data lakes on AWS and introduces the data lake solution.

Many companies leverage a data lake to complement, rather than replace, existing data warehouses. A data lake can be used as a source for both structured and unstructured data, which can be easily converted into a well-defined schema for ingestion into a data warehouse, or analyzed ad hoc to quickly explore unknown datasets and discover new insights. With this in mind, consider the following best practices when building a data lake solution:

  • Configure your data lake to be flexible and scalable so that you can collect and store all types of data as your company grows. Include design components that support data encryption, search, analysis, and querying.
  • Implement granular access-control policies and data security mechanisms to protect all data stored in the data lake.
  • Provide mechanisms that enable users to quickly and easily search and retrieve relevant data, and perform new types of data analysis.
  • Leverage managed services for multiple methods of data ingestion and analysis. For example, use Amazon Kinesis, AWS Snowball, or AWS Direct Connect to transfer large amounts of data. Then use powerful services such as Amazon EMR, AWS Data Pipeline, and Amazon Elasticsearch Service to process that data for meaningful analysis.
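For the streaming-ingestion path mentioned above, records sent to Amazon Kinesis in bulk must respect the PutRecords API limits of at most 500 records and roughly 5 MB of payload per call. The sketch below batches records locally under those limits; the stream name and event contents are placeholders, and the actual network call is shown only as a comment because it requires AWS credentials and a provisioned stream.

```python
# Batch records to fit Amazon Kinesis PutRecords limits:
# at most 500 records and ~5 MB of payload per call.
MAX_RECORDS_PER_CALL = 500
MAX_BYTES_PER_CALL = 5 * 1024 * 1024

def batch_for_kinesis(records):
    """Yield lists of records, each list safe for one PutRecords call.

    `records` is an iterable of (partition_key, data_bytes) tuples.
    """
    batch, batch_bytes = [], 0
    for key, data in records:
        size = len(data) + len(key.encode("utf-8"))
        if batch and (len(batch) >= MAX_RECORDS_PER_CALL
                      or batch_bytes + size > MAX_BYTES_PER_CALL):
            yield batch
            batch, batch_bytes = [], 0
        batch.append({"PartitionKey": key, "Data": data})
        batch_bytes += size
    if batch:
        yield batch

# The actual call (requires AWS credentials and a stream) would be:
#   import boto3
#   kinesis = boto3.client("kinesis")
#   for batch in batch_for_kinesis(events):
#       kinesis.put_records(StreamName="my-ingest-stream", Records=batch)
```

Batching client-side keeps you under the service quota and reduces per-request overhead compared to calling PutRecord once per event.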

AWS offers a data lake solution that automatically configures the core AWS services necessary to easily tag, search, share, transform, analyze, and govern specific subsets of data across a company or with other external users. The solution deploys a console that users can access to search and browse available datasets for their business needs. The solution also includes a federated template that allows you to launch a version of the solution that is ready to integrate with Microsoft Active Directory.

The diagram below presents the data lake architecture you can deploy in minutes using the solution's implementation guide and accompanying AWS CloudFormation template.

  1. The AWS CloudFormation template configures the solution's core AWS services, which include a suite of AWS Lambda microservices (functions), Amazon Elasticsearch Service for robust search capabilities, Amazon Cognito for user authentication, AWS Glue for data transformation, and Amazon Athena for analysis.
  2. The solution leverages the security, durability, and scalability of Amazon S3 to manage a persistent catalog of organizational datasets, and Amazon DynamoDB to manage corresponding metadata.
  3. Once a dataset is cataloged, its attributes and descriptive tags are available to search on. Users can search and browse available datasets in the solution console, and create a list of data they require access to.
  4. The solution keeps track of the datasets a user selects and generates a manifest file with secure access links to the desired content when the user checks out.
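The check-out step above can be pictured as assembling a manifest file for the selected datasets. The schema below is an illustrative assumption, not the solution's actual format; in the deployed solution each entry carries a secure (presigned) HTTPS link rather than a plain S3 URI.

```python
import json
import time

def build_manifest(cart, expires_in=3600):
    """Assemble a manifest for the datasets a user checked out.

    `cart` is a list of (bucket, key) pairs the user selected.
    This schema is hypothetical, for illustration only; the real
    solution emits secure access links with a limited lifetime.
    """
    now = int(time.time())
    return {
        "generated_at": now,
        "expires_at": now + expires_in,
        "entries": [
            {"bucket": b, "key": k, "location": f"s3://{b}/{k}"}
            for b, k in cart
        ],
    }

manifest = build_manifest([("datalake-bucket", "sales/2017/q1.csv")])
print(json.dumps(manifest, indent=2))
```

Time-limited links mean a manifest can be shared without granting standing permissions on the underlying bucket.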

What you'll accomplish:

Configure a data lake to create a flexible, scalable, and cost-effective centralized data repository that complements and expands upon your existing data warehouse.

Deploy a user-friendly console that allows you to easily share your data for business purposes, both within and outside your company.

Secure access to your data using Amazon S3 encryption, access keys, and Amazon Cognito.

Automatically catalog your data using scripts that take advantage of the data lake solution Command Line Interface (CLI) or RESTful API.
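A cataloging script along those lines might build a registration request like the one sketched below. The field names and the required-tag policy are hypothetical, chosen only to illustrate how governance rules could enforce mandatory tags at registration time; consult the solution's API documentation for the real schema.

```python
import json

def registration_payload(name, bucket, key, tags):
    """Build a dataset-registration request body (hypothetical schema).

    Governance policies can require specific tags when datasets are
    registered, so required tags are validated up front. The required
    set here ("owner", "classification") is an example, not the
    solution's actual policy.
    """
    required = {"owner", "classification"}
    missing = required - set(tags)
    if missing:
        raise ValueError(f"missing required tags: {sorted(missing)}")
    return json.dumps({
        "name": name,
        "location": {"bucket": bucket, "key": key},
        "tags": tags,
    })

# A script would POST this body to the solution's RESTful API,
# or pass the same values as arguments to the data lake CLI.
```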

Leverage a deep suite of big data analytics services using self-generated manifest files to drive new insights that help you create business value.

Transform and analyze your searchable metadata using AWS Glue and Amazon Athena to crawl your data sources, identify data formats, and suggest schemas and transformations.

Enable federated sign-in, allowing users to sign in through a SAML identity provider (IdP) such as Microsoft Active Directory Federation Services (AD FS).

What you'll need before starting:

An AWS account: You will need an AWS account to begin provisioning resources. Sign up for AWS.

Skill level: This solution is intended for IT Infrastructure Architects, Administrators, and DevOps professionals who have practical experience architecting on the AWS Cloud.

Q: What can the data lake solution manage on my behalf?

The solution manages a persistent catalog of organizational datasets in Amazon S3 and business-relevant tags associated with each dataset. It allows companies to create simple governance policies that require specific tags when datasets are stored in the data lake solution.

Q: What type of datasets does the data lake solution support?

You can register existing or new datasets of any file type or size because the solution leverages the flexibility of Amazon S3.

Q: How do I upload my data to the data lake?

You can upload data files from the data lake solution console, or directly to an Amazon S3 bucket and then register them in the data lake.

Q: Can I use the data lake if I have existing data in Amazon S3?

Yes. You can register datasets with descriptive tags of your choice that point to existing objects in Amazon S3.

Q: How can I monitor the data lake?

The data lake logs API calls, latency, and error rates to Amazon CloudWatch in your AWS account. Additionally, you can turn on audit logging for your data lake deployment to monitor all user activity for compliance tracking.
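As a sketch of pulling those metrics back out, the helper below builds a parameter set for CloudWatch's GetMetricStatistics call covering a recent time window. The namespace and metric names are placeholders for whatever your deployment emits; the call itself requires AWS credentials, so only the parameters are constructed here.

```python
from datetime import datetime, timedelta, timezone

def metric_window(namespace, metric, hours=1):
    """Parameters for CloudWatch's GetMetricStatistics call.

    Covers the last `hours` hours at 5-minute resolution. The
    namespace and metric name are placeholders for illustration.
    """
    end = datetime.now(timezone.utc)
    return {
        "Namespace": namespace,
        "MetricName": metric,
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 300,                    # seconds per data point
        "Statistics": ["Sum", "Average"],
    }

# With credentials configured, you would run:
#   boto3.client("cloudwatch").get_metric_statistics(
#       **metric_window("DataLake", "4XXError", hours=2))
```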

Q: How quickly are solution logs available?

Logs, alarms, error rates, and other metrics are stored in Amazon CloudWatch and are available in near real time.

Q: How do I add and manage users in the data lake solution?

After the data lake solution is deployed, you can invite users to self-register to start using the data lake. You can continue to manage users, groups, and permissions to the data lake in the Administration section of the solution console.

Q: How is data transmitted to the data lake?

You have several options to add data to the data lake solution: use the data lake console or data lake CLI to upload files, or link to existing content in Amazon S3.

Q: Can I deploy the data lake solution in any AWS Region?

You can deploy the solution’s AWS CloudFormation template only in AWS Regions where Amazon Cognito, Amazon Athena, and AWS Glue are available. However, once deployed, you can invite users from around the globe to access the solution.

In addition to service-availability requirements, we recommend you deploy the data lake in the AWS Region where your data is stored for better performance and user interactivity.

Need more resources to get started with AWS? Visit the Getting Started Resource Center to find tutorials, projects and videos to get started with AWS.
