AWS Quick Starts — Customer Ready Solutions

Data Lake Foundation on AWS

Using AWS services, including Amazon Redshift, Amazon Kinesis, AWS Glue, and Amazon SageMaker

This Quick Start deploys a data lake foundation that integrates Amazon Web Services (AWS) services such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon Kinesis, Amazon Athena, AWS Glue, Amazon Elasticsearch Service (Amazon ES), Amazon SageMaker, and Amazon QuickSight.

The data lake foundation uses these AWS services to provide capabilities such as data submission, ingest processing, dataset management, data transformation and analysis, building and deploying machine learning tools, search, publishing, and visualization. Once this foundation is in place, you can augment the data lake with independent software vendor (ISV) and software as a service (SaaS) tools.

The deployment also includes an optional wizard and a sample dataset that is loaded into Amazon Redshift and Kinesis streams to demonstrate data lake capabilities.

This reference architecture is automated by AWS CloudFormation templates that you can customize to meet your specific requirements.

See also: If this architecture doesn't meet your specific requirements, see the other data lake deployments in the Quick Start catalog.


This Quick Start was developed by 47Lining in partnership with AWS. 47Lining is an APN Partner.

  •  What you'll build
    The Quick Start architecture for the data lake includes the following infrastructure:

    • A virtual private cloud (VPC) that spans two Availability Zones and includes two public and two private subnets.*
    • An internet gateway to allow access to the internet.*
    • In the public subnets, managed NAT gateways to allow outbound internet access for resources in the private subnets.*
    • In the public subnets, Linux bastion hosts in an Auto Scaling group to allow inbound Secure Shell (SSH) access to EC2 instances in public and private subnets.*
    • In a private subnet, a web application instance that hosts an optional wizard, which guides you through the data lake architecture and functionality.
    • IAM roles to provide permissions to access AWS resources; for example, to permit Amazon Redshift and Amazon Athena to read and write curated datasets.
    • In the private subnets, Amazon Redshift for data aggregation, analysis, transformation, and creation of new curated and published datasets. If you launch the Quick Start with the optional wizard and sample data, Amazon Redshift is launched in a public subnet instead.
    • An Amazon SageMaker instance, which you can access by using AWS authentication. This instance is created only if you deploy the optional wizard and upload sample data.
    • Integration with other Amazon services such as Amazon S3, Amazon Athena, AWS Glue, AWS Lambda, Amazon ES with Kibana, Amazon Kinesis, and Amazon QuickSight.

    *  The template that deploys the Quick Start into an existing VPC skips the tasks marked by asterisks and prompts you for your existing VPC configuration.
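As an illustration of the network layout described above (one VPC, two Availability Zones, one public and one private subnet in each), the following minimal sketch carves a VPC address range into four subnets. The CIDR block `10.0.0.0/16`, the `/20` subnet size, the Availability Zone names, and the function `plan_subnets` are all assumptions made for this example; they are not the values the Quick Start templates use.

```python
import ipaddress


def plan_subnets(vpc_cidr: str, azs: list[str], new_prefix: int = 20) -> dict[str, dict[str, str]]:
    """Assign one public and one private subnet CIDR to each Availability Zone.

    All inputs are hypothetical; check the deployment guide for the real defaults.
    """
    network = ipaddress.ip_network(vpc_cidr)
    # subnets() yields consecutive, non-overlapping blocks of the requested size.
    blocks = network.subnets(new_prefix=new_prefix)
    plan = {}
    for az in azs:
        plan[az] = {"public": str(next(blocks)), "private": str(next(blocks))}
    return plan


# Example values only (assumed, not taken from the Quick Start).
plan = plan_subnets("10.0.0.0/16", ["us-east-1a", "us-east-1b"])
for az, subnets in plan.items():
    print(az, subnets)
```

When you deploy into an existing VPC, you supply your own subnet IDs instead, so a layout like this matters only for the new-VPC case.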

  •  How to deploy
    To build your data lake environment on AWS, follow the instructions in the deployment guide. The deployment process includes these steps:

    1. If you don't already have an AWS account, sign up at https://aws.amazon.com.
    2. Launch the Quick Start. The deployment takes about 50 minutes. You can choose from two options: deploy into a new VPC, or deploy into an existing VPC.
    3. Test the deployment by checking the resources created by the Quick Start.
    4. If you've included the wizard and the sample dataset in your deployment, use the wizard to explore data lake features.
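One possible way to carry out the "check the resources" step programmatically is to summarize the status of each resource in the CloudFormation stack. The sketch below assumes the response shape of the boto3 `describe_stack_resources` call (a `StackResources` list whose items carry `LogicalResourceId` and `ResourceStatus`). The stack name, the sample logical IDs, and both helper functions are hypothetical, and the sketch does not descend into nested stacks.

```python
from collections import Counter


def summarize_resources(stack_resources: list[dict]) -> tuple[Counter, list[str]]:
    """Count resources by status and list those that are not CREATE_COMPLETE."""
    counts = Counter(r["ResourceStatus"] for r in stack_resources)
    not_ready = sorted(
        r["LogicalResourceId"]
        for r in stack_resources
        if r["ResourceStatus"] != "CREATE_COMPLETE"
    )
    return counts, not_ready


def fetch_stack_resources(stack_name: str) -> list[dict]:
    """Fetch resources for a stack; requires boto3 and AWS credentials."""
    import boto3  # imported here so the summary logic runs without AWS access

    client = boto3.client("cloudformation")
    return client.describe_stack_resources(StackName=stack_name)["StackResources"]


# Hand-written sample data (hypothetical logical IDs), standing in for a live call
# such as fetch_stack_resources("my-data-lake-stack").
sample = [
    {"LogicalResourceId": "VPCStack", "ResourceStatus": "CREATE_COMPLETE"},
    {"LogicalResourceId": "RedshiftStack", "ResourceStatus": "CREATE_IN_PROGRESS"},
    {"LogicalResourceId": "KinesisStack", "ResourceStatus": "CREATE_COMPLETE"},
]
counts, not_ready = summarize_resources(sample)
print(dict(counts))
print("Not ready:", not_ready)
```

For a fresh deployment, an empty "not ready" list suggests the stack finished creating; the deployment guide remains the authority on what to verify.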

    The Quick Start includes parameters that you can customize. For example, you can configure your network or customize the Amazon Redshift, Kinesis, and Elasticsearch settings. You can also extend the sample dataset or use your own dataset.
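To show how customized parameters might be passed when launching the templates from code rather than the console, here is a minimal sketch using the boto3 `create_stack` call. The parameter names (`RedshiftNodeType`, `KinesisShardCount`), the stack name, and the helper functions are assumptions for illustration; the actual parameter names and template URL come from the deployment guide.

```python
def to_cfn_parameters(settings: dict) -> list[dict[str, str]]:
    """Convert a plain dict into the Parameters list that CloudFormation expects."""
    return [
        {"ParameterKey": key, "ParameterValue": str(value)}
        for key, value in sorted(settings.items())
    ]


def launch_stack(stack_name: str, template_url: str, settings: dict) -> str:
    """Create the stack and return its ID; requires boto3 and AWS credentials."""
    import boto3  # imported here so the conversion helper runs without AWS access

    client = boto3.client("cloudformation")
    response = client.create_stack(
        StackName=stack_name,
        TemplateURL=template_url,
        Parameters=to_cfn_parameters(settings),
        # The templates create IAM roles, so an IAM capability is assumed to be
        # needed; some templates need CAPABILITY_NAMED_IAM instead.
        Capabilities=["CAPABILITY_IAM"],
    )
    return response["StackId"]


# Hypothetical parameter names and values, for illustration only.
parameters = to_cfn_parameters({"RedshiftNodeType": "dc2.large", "KinesisShardCount": 2})
print(parameters)
```

Keeping the conversion separate from the API call makes it possible to review the exact parameter list before anything is launched.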

  •  Cost and licenses
    You are responsible for the cost of the AWS services used while running this Quick Start reference deployment. There is no additional cost for using the Quick Start.

    The AWS CloudFormation templates for this Quick Start include configuration parameters that you can customize. Some of these settings, such as instance type, affect the cost of deployment. For cost estimates, see the pricing pages for each AWS service you use.
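To make the effect of a setting such as instance type concrete, the following sketch compares the running cost of two configurations. The hourly rates, option names, node count, and running time are placeholder numbers chosen only to keep the arithmetic easy to follow; they are not AWS prices, so substitute current figures from the pricing pages.

```python
def estimate_cost(hourly_rate: float, instance_count: int, hours: float) -> float:
    """Estimate cost as rate x number of instances x hours running."""
    return hourly_rate * instance_count * hours


# Placeholder rates in USD per instance-hour (NOT real AWS prices).
hypothetical_rates = {"small-node": 0.25, "large-node": 1.00}

for name, rate in hypothetical_rates.items():
    cost = estimate_cost(rate, instance_count=2, hours=100)
    print(f"{name}: {cost:.2f} USD for 2 nodes over 100 hours")
```

The same formula applies per service; a full estimate would sum it across every billable resource the deployment creates.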

    Because this Quick Start uses AWS-native solution components, there are no costs or license requirements beyond AWS infrastructure costs. This Quick Start also deploys Kibana, which is an open-source tool that’s included with Amazon ES.

  •  Resources
    This Quick Start reference deployment is related to a solution featured in Solution Space that includes a solution brief, optional consulting offers crafted by AWS Competency Partners, and AWS co-investment in proof-of-concept (PoC) projects. To learn more about these resources, visit Solution Space.