AWS Quick Starts — Customer Ready Solutions

Data Lake Foundation on AWS

Using AWS services, including Amazon Redshift, Amazon Kinesis, AWS Glue, and Amazon SageMaker

This Quick Start deploys a data lake foundation that integrates Amazon Web Services (AWS) services such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon Kinesis, Amazon Athena, AWS Glue, Amazon Elasticsearch Service (Amazon ES), Amazon SageMaker, and Amazon QuickSight.

The data lake foundation uses these AWS services to provide capabilities such as data submission, ingest processing, dataset management, data transformation and analysis, building and deploying machine learning tools, search, publishing, and visualization. Once this foundation is in place, you can augment the data lake with independent software vendor (ISV) and software as a service (SaaS) tools.

This reference architecture is automated by AWS CloudFormation templates that you can customize to meet your specific requirements.

See also: If this architecture doesn't meet your specific requirements, see the other data lake deployments in the Quick Start catalog.


This Quick Start was developed by 47Lining in partnership with AWS. 47Lining is an APN Partner.

  •  What you'll build
    The Quick Start architecture for the data lake includes the following infrastructure:

    • A virtual private cloud (VPC) that spans two Availability Zones and includes two public and two private subnets.*
    • An internet gateway to allow access to the internet.*
    • In the public subnets, managed NAT gateways to allow outbound internet access for resources in the private subnets.*
    • In the public subnets, Linux bastion hosts in an Auto Scaling group to allow inbound Secure Shell (SSH) access to EC2 instances in public and private subnets.*
    • AWS Identity and Access Management (IAM) roles to provide permissions to access AWS resources; for example, to permit Amazon Redshift and Amazon Athena to read and write curated datasets.
    • In the private subnets, Amazon Redshift for data aggregation, analysis, transformation, and creation of new curated and published datasets.
    • An Amazon SageMaker instance, which you can access by using AWS authentication.
    • Integration with other Amazon services such as Amazon S3, Amazon Athena, AWS Glue, AWS Lambda, Amazon ES with Kibana, Amazon Kinesis, and Amazon QuickSight.

    *  The template that deploys the Quick Start into an existing VPC skips the tasks marked by asterisks and prompts you for your existing VPC configuration.
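
As a rough illustration of the network layout described above (two Availability Zones, each with one public and one private subnet), the following Python sketch carves a VPC address range into equal blocks. The CIDR range `10.0.0.0/16` and the zone names are assumptions chosen for the example; they are not the values the Quick Start templates use.

```python
import ipaddress


def plan_subnets(vpc_cidr, zones):
    """Split a VPC CIDR into one public and one private subnet per zone.

    Illustrative only: the real Quick Start templates define their own ranges.
    """
    vpc = ipaddress.ip_network(vpc_cidr)
    # Two subnets (public + private) are needed for every zone.
    needed = 2 * len(zones)
    # Number of extra prefix bits required to get at least `needed` blocks.
    extra_bits = (needed - 1).bit_length()
    blocks = list(vpc.subnets(prefixlen_diff=extra_bits))

    plan = {}
    for index, zone in enumerate(zones):
        plan[zone] = {
            "public": str(blocks[2 * index]),
            "private": str(blocks[2 * index + 1]),
        }
    return plan


# Assumed example values, not taken from the Quick Start.
layout = plan_subnets("10.0.0.0/16", ["us-east-1a", "us-east-1b"])
for zone, subnets in layout.items():
    print(zone, subnets)
```

With these assumed inputs, each zone receives two non-overlapping /18 blocks. When you deploy into an existing VPC, you would supply your own subnet ranges instead.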

  •  How to deploy
    To build your data lake environment on AWS, follow the instructions in the deployment guide. The deployment process includes these steps:

    1. If you don't already have an AWS account, sign up at https://aws.amazon.com.
    2. Launch the Quick Start. The deployment takes about 50 minutes. You can choose from two options: deploy into a new VPC, or deploy into an existing VPC.
    3. Test the deployment by checking the resources created by the Quick Start.

    The Quick Start includes parameters that you can customize. For example, you can configure your network or customize the Amazon Redshift, Amazon Kinesis, and Amazon ES settings.
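
To make the idea of customizable parameters concrete, here is a minimal Python sketch that turns a dictionary of overrides into the `ParameterKey`/`ParameterValue` list that AWS CloudFormation accepts as stack parameters. The parameter names below are hypothetical placeholders; check the deployment guide for the names the templates actually define.

```python
import json


def to_cfn_parameters(overrides):
    """Convert {name: value} overrides into CloudFormation's parameter list.

    CloudFormation expects every value as a string, so values are converted.
    """
    return [
        {"ParameterKey": key, "ParameterValue": str(value)}
        for key, value in sorted(overrides.items())
    ]


# Hypothetical parameter names, used only for illustration.
overrides = {
    "ExampleRedshiftNodeCount": 2,
    "ExampleAvailabilityZones": "us-east-1a,us-east-1b",
}

parameters = to_cfn_parameters(overrides)
print(json.dumps(parameters, indent=2))
```

The resulting JSON could be saved to a file and supplied when you create the stack, or the same values could be entered by hand in the AWS CloudFormation console.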

  •  Cost and licenses
    You are responsible for the cost of the AWS services used while running this Quick Start reference deployment. There is no additional cost for using the Quick Start.

    The AWS CloudFormation templates for this Quick Start include configuration parameters that you can customize. Some of these settings, such as instance type, will affect the cost of deployment. See the pricing pages for each AWS service you will be using for cost estimates.
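
The effect of a setting such as instance type on cost can be sketched with simple arithmetic: hourly rate times instance count times hours of use. The rates in this Python example are made-up placeholders, not AWS prices; substitute figures from the relevant pricing pages.

```python
def estimate_cost(hourly_rate_usd, instance_count, hours):
    """Return a rough cost estimate: rate x instances x hours."""
    return hourly_rate_usd * instance_count * hours


# Placeholder hourly rates for two hypothetical instance sizes.
# These are NOT real AWS prices.
PLACEHOLDER_RATES = {"small-node": 0.25, "large-node": 1.00}

# Same cluster size and running time, different instance type.
small_cost = estimate_cost(PLACEHOLDER_RATES["small-node"], instance_count=2, hours=10)
large_cost = estimate_cost(PLACEHOLDER_RATES["large-node"], instance_count=2, hours=10)

print(f"small-node estimate: ${small_cost:.2f}")
print(f"large-node estimate: ${large_cost:.2f}")
```

A real estimate would also need to account for storage, data transfer, and per-request charges, which this sketch leaves out.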

    Because this Quick Start uses AWS-native solution components, there are no costs or license requirements beyond AWS infrastructure costs. This Quick Start also deploys Kibana, which is an open-source tool that’s included with Amazon ES.

  •  Resources
    This Quick Start reference deployment is related to a solution featured in Solution Space that includes a solution brief, optional consulting offers crafted by AWS Competency Partners, and AWS co-investment in proof-of-concept (PoC) projects. To learn more about these resources, visit Solution Space.

  •  Demo
    This demo was created by 47Lining and solutions architects at AWS for evaluation or proof-of-concept (PoC) purposes on the AWS Cloud. For production-ready deployments, use the Data Lake Foundation on AWS Quick Start.

    This demo deploys a simplified Quick Start data lake foundation architecture into your AWS account with sample data. After the demo is up and running, you can use the demo walkthrough guide for a tour of product features. The demo helps you explore foundational data lake capabilities such as search, transforms, queries, analytics, and visualization by using AWS services.

    To deploy:

    1. If you don't already have an AWS account, sign up at https://aws.amazon.com.
    2. Sign up to launch the demo. After you answer a few questions and submit the sign-up form, the AWS CloudFormation console will launch.
    3. In the console, provide the requested information to launch the demo.
     
    Estimated time: 50 minutes for deployment, 20 minutes for walkthrough
     
    Cost: You are responsible for the cost of the AWS services used while running this demo. There is no additional cost.