AWS Quick Starts — Customer Ready Solutions

Hybrid Data Lake on AWS

With WANdisco Fusion, Amazon S3, and Amazon Athena

This Quick Start deploys a hybrid cloud environment that integrates on-premises Hadoop clusters with a data lake on the Amazon Web Services (AWS) Cloud. The deployment includes WANdisco Fusion, Amazon Simple Storage Service (Amazon S3), and Amazon Athena, and supports cloud migration and burst-out processing scenarios.

The Quick Start provides the option to deploy a Docker container that stands in for your on-premises Hadoop cluster for demonstration purposes, so you can gain hands-on experience with the hybrid data lake architecture. WANdisco Fusion continuously replicates data from the Docker container to Amazon S3, ensuring strong consistency between data residing on premises and data in the cloud. You can then use Amazon Athena to view and analyze the replicated data.

See also: If this architecture doesn't meet your specific requirements, see the other data lake deployments in the Quick Start catalog.

This Quick Start was developed by Sturdy and WANdisco in collaboration with AWS. Sturdy and WANdisco are AWS Competency Partners.

  •  What you'll build
  •  How to deploy
  •  Cost and licenses
  •  Resources
  •  What you'll build
    The Quick Start architecture for the hybrid data lake includes the following:

    • A virtual private cloud (VPC) that spans two Availability Zones and includes two public subnets.*
    • An internet gateway to provide access to the internet.*
    • In the public subnets, WANdisco Fusion server instances in an Auto Scaling group, functioning as a single clustered service.
    • (Optional) An on-premises WANdisco server deployed in a Docker container, to demonstrate the synchronization from HDFS to the S3 bucket in the cloud. The Quick Start uses a sample open dataset consisting of publicly available NYC taxi data.
    • (Optional) Amazon Athena to query and analyze the data from the local WANdisco Fusion server, which is synchronized with Amazon S3.
    • (Optional) An S3 bucket to store the content that is being synchronized by WANdisco Fusion and the analysis information processed by Athena.

    * The template that deploys the Quick Start into an existing VPC skips the components marked by asterisks.
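
The component list above can be summarized as a small data structure. The following Python sketch is illustrative only: the `Component` class, the short component labels, and the `components_for` helper are hypothetical and are not part of the Quick Start templates. It shows which components a deployment would include, assuming that the existing-VPC template skips exactly the items marked with asterisks.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Component:
    name: str
    optional: bool = False    # marked "(Optional)" in the list above
    vpc_scoped: bool = False  # marked with an asterisk in the list above


# Labels are shortened paraphrases of the architecture list, not template resource names.
COMPONENTS = [
    Component("VPC spanning two Availability Zones with two public subnets", vpc_scoped=True),
    Component("Internet gateway", vpc_scoped=True),
    Component("WANdisco Fusion servers in an Auto Scaling group"),
    Component("On-premises WANdisco server in a Docker container", optional=True),
    Component("Amazon Athena", optional=True),
    Component("S3 bucket", optional=True),
]


def components_for(existing_vpc: bool, include_optional: bool = True) -> List[str]:
    """Return the names of the components a deployment would include."""
    selected = []
    for component in COMPONENTS:
        if existing_vpc and component.vpc_scoped:
            continue  # the existing-VPC template skips asterisked items
        if component.optional and not include_optional:
            continue
        selected.append(component.name)
    return selected


if __name__ == "__main__":
    for name in components_for(existing_vpc=True):
        print(name)
```

Calling `components_for(existing_vpc=True, include_optional=False)` leaves only the WANdisco Fusion server group, which is the one component that every variant of the deployment includes.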

  •  How to deploy
    You can build your data lake environment on AWS in about 15 minutes, by following a few simple steps:

    1. If you don't already have an AWS account, sign up at https://aws.amazon.com.
    2. Subscribe to the Amazon Machine Image (AMI) for WANdisco Fusion in AWS Marketplace.
    3. Launch the Quick Start. You can deploy into a new VPC or into an existing VPC.
    4. (Optional) Deploy an on-premises WANdisco server in a Docker container, and set up replication to see the synchronization ability from HDFS to Amazon S3.

    The Quick Start includes parameters that you can customize. For example, you can configure your network or customize the WANdisco Fusion and Amazon Athena settings.
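
As a rough illustration of how customized parameters reach AWS CloudFormation, the sketch below builds a parameter list in the shape that the boto3 `create_stack` call accepts for its `Parameters` argument (a list of `ParameterKey`/`ParameterValue` string pairs). The parameter names and values (`InstanceType`, `DeployAthena`, `LicenseKey`, `m4.large`) and the helper `build_stack_parameters` are hypothetical placeholders; check the Quick Start template for the actual parameter names.

```python
from typing import Dict, List, Optional, Union

ParamValue = Optional[Union[str, int, bool]]


def build_stack_parameters(settings: Dict[str, ParamValue]) -> List[Dict[str, str]]:
    """Convert a settings dict into CloudFormation-style parameter entries.

    Settings whose value is None are omitted so that the template default applies.
    """
    parameters = []
    for key in sorted(settings):
        value = settings[key]
        if value is None:
            continue
        if isinstance(value, bool):
            # CloudFormation parameter values are strings.
            value = "true" if value else "false"
        parameters.append({"ParameterKey": key, "ParameterValue": str(value)})
    return parameters


# Hypothetical parameter names, for illustration only.
example = build_stack_parameters(
    {
        "InstanceType": "m4.large",
        "DeployAthena": True,
        "LicenseKey": None,  # omitted from the result
    }
)

if __name__ == "__main__":
    for entry in example:
        print(entry["ParameterKey"], "=", entry["ParameterValue"])
```

The resulting list could be passed as `Parameters=example` to a boto3 CloudFormation client's `create_stack` call together with the template location, which is not shown here because it depends on the launch option you choose.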

  •  Cost and licenses
    You are responsible for the cost of the AWS services used while running this Quick Start reference deployment. There is no additional cost for using the Quick Start.

    The AWS CloudFormation templates for this Quick Start include configuration parameters that you can customize. Some of these settings, such as instance type, will affect the cost of deployment. See the pricing pages for each AWS service you will be using for cost estimates.

    The Quick Start requires a subscription to the WANdisco Fusion AMI in AWS Marketplace. The WANdisco Fusion software is provided under the Bring Your Own License (BYOL) model. If you don't provide a license, the Quick Start configures the application with a trial key. To continue using WANdisco Fusion beyond the 14-day trial period, you must purchase a license by contacting WANdisco at http://www.wandisco.com/contact.

  •  Resources
    This Quick Start reference deployment is related to a solution featured in Solution Space that includes a solution brief, optional consulting offers crafted by AWS Competency Partners, and AWS co-investment in proof-of-concept (PoC) projects. To learn more about these resources, visit Solution Space.