AWS Quick Starts — Customer Ready Solutions

Data Lake with Talend Big Data Platform

Using Talend Big Data Platform, AWS services, and Cognizant best practices

This Quick Start builds a data lake environment on the Amazon Web Services (AWS) Cloud by deploying Talend Big Data Platform components and AWS services such as Amazon EMR, Amazon Redshift, Amazon Simple Storage Service (Amazon S3), and Amazon Relational Database Service (Amazon RDS).

The Quick Start also provides an optional sample dataset and Talend jobs developed by Cognizant Technology Solutions to illustrate big data practices for integrating Apache Spark, Apache Hadoop, Amazon EMR, Amazon Redshift, and Amazon S3 technologies into the data lake implementation.

The Quick Start is for users who are evaluating big data in the cloud or looking to accelerate their big data initiative through the adoption of best practices for big data integration.

You can choose to build a new virtual private cloud (VPC) infrastructure that’s configured for security, scalability, and high availability, or use your existing VPC infrastructure for the data lake.


This Quick Start was developed by Cognizant Technology Solutions and Talend Inc. in partnership with AWS. Cognizant and Talend are AWS Competency Partners.

  •  What you'll build
    The Quick Start architecture for the data lake includes the following:

    • A VPC that spans two Availability Zones. Each Availability Zone contains two subnets: a public subnet to allow connecting over the internet and a private subnet for Talend job servers, Amazon Redshift, Amazon RDS, and Amazon EMR. (The private subnet in the second Availability Zone contains only the job servers.)*
    • An internet gateway to allow access to the internet. This gateway is used by the bastion hosts to send and receive traffic.*
    • In the public subnets, managed network address translation (NAT) gateways to allow outbound internet access for resources in the private subnets.*
    • In one or both public subnets, Linux bastion hosts to allow inbound Secure Shell (SSH) access to the resources in the private subnets. You can choose the number of bastion hosts when you launch the Quick Start.*
    • In the public subnet in the first Availability Zone:
      • Talend public servers that host the Talend Administration Center (TAC) for administering Talend jobs via the browser.
      • A Talend Studio remote desktop instance available through an X2Go client for users who do not want to run Talend Studio on their laptops.
      • A Nexus artifact repository and Git servers for binary and source configuration management.
      • A Talend log server using Amazon Elasticsearch Service (Amazon ES), Logstash, and Kibana.
    • In the private subnet in the first Availability Zone:
      • An Amazon RDS MySQL DB instance to host Talend metadata.
      • An Amazon EMR cluster with Pig, Hive, and Spark that integrates closely with the Talend Big Data Platform and provides Hadoop capability in the data lake.
      • An Amazon Redshift cluster for use as a data warehouse or data mart.
    • In the private subnets, Talend job server instances in an Auto Scaling group, running Talend jobs scheduled by the TAC. Auto Scaling launches or terminates EC2 instances automatically in response to demand on the Talend job servers. You can configure the desired and maximum number of instances during deployment.
    • In the public subnets, Talend distant run job server instances in an Auto Scaling group, running Talend jobs on behalf of Talend Studio users. You can run Talend jobs locally in Talend Studio or on these servers. This Auto Scaling group likewise launches or terminates EC2 instances automatically in response to demand, and you can set the desired and maximum number of instances during deployment.
    • Amazon S3 to ingest data for the data lake.

     

    *  The template that deploys the Quick Start into an existing VPC skips the tasks marked by asterisks and prompts you for your existing VPC configuration.
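
The subnet layout described above (two Availability Zones, each with one public and one private subnet) can be sketched in a few lines of Python. This is only an illustration of the layout, not the Quick Start's own template logic: the address range `10.0.0.0/16`, the equal-sized split, and the function name `plan_subnets` are assumptions made for the example.

```python
import ipaddress

# Hypothetical VPC address range; the actual deployment prompts for its own values.
VPC_CIDR = "10.0.0.0/16"
AZ_COUNT = 2  # the architecture spans two Availability Zones


def plan_subnets(vpc_cidr: str = VPC_CIDR, az_count: int = AZ_COUNT) -> dict:
    """Split a VPC range into one public and one private subnet per AZ."""
    vpc = ipaddress.ip_network(vpc_cidr)
    # We need 2 * az_count equal blocks; compute how many extra prefix bits that takes.
    extra_bits = (2 * az_count - 1).bit_length()
    blocks = list(vpc.subnets(prefixlen_diff=extra_bits))
    plan = {}
    for az in range(az_count):
        plan[f"az{az + 1}"] = {
            "public": str(blocks[2 * az]),       # bastion hosts, TAC, NAT gateway
            "private": str(blocks[2 * az + 1]),  # job servers, EMR, Redshift, RDS
        }
    return plan


if __name__ == "__main__":
    for az, subnets in plan_subnets().items():
        print(az, subnets)
```

Equal-sized blocks keep the sketch simple; a real deployment might size the private subnets larger, since they hold the Amazon EMR cluster and the Auto Scaling job servers.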

  •  How to deploy
    You can build your data lake environment on AWS in about an hour by following a few simple steps:

    1. If you don't already have an AWS account, sign up at https://aws.amazon.com.
    2. Upload your Talend Big Data Platform license to a private S3 bucket. You can sign up for a 30-day free trial license on the Talend website.
    3. Launch the Quick Start. You can choose from two options: deploy into a new VPC, or deploy into your existing VPC.
    4. Test the deployment by opening the Talend Administration Center (TAC) and checking the servers deployed by the Quick Start. You can also run the optional Talend jobs to test end-to-end data integration by following the steps in the user guide provided by Talend and Cognizant.

    The Quick Start includes parameters that you can customize. For example, you can configure your network or customize TAC, Amazon Redshift, Nexus, and Git server settings.
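
As a rough illustration of customizing parameters at launch, the sketch below turns a plain dictionary of settings into the `ParameterKey`/`ParameterValue` list format that AWS CloudFormation accepts for stack parameters, and checks that a desired instance count does not exceed its maximum. Every parameter name and value here (`KeyPairName`, `LicenseBucket`, `JobServerDesiredCapacity`, `JobServerMaxCapacity`) is a hypothetical placeholder; the real names are defined by the Quick Start templates.

```python
import json


def validate_capacity(desired: int, maximum: int) -> None:
    """Reject Auto Scaling sizes where desired exceeds maximum or is negative."""
    if desired < 0 or maximum < 0:
        raise ValueError("instance counts must be non-negative")
    if desired > maximum:
        raise ValueError(f"desired ({desired}) exceeds maximum ({maximum})")


def build_parameters(settings: dict) -> list:
    """Convert a dict into CloudFormation's ParameterKey/ParameterValue list."""
    return [
        {"ParameterKey": key, "ParameterValue": str(value)}
        for key, value in sorted(settings.items())
    ]


# Hypothetical settings; the names are illustrative, not the template's own.
example_settings = {
    "KeyPairName": "my-key-pair",
    "LicenseBucket": "my-private-license-bucket",
    "JobServerDesiredCapacity": 2,
    "JobServerMaxCapacity": 4,
}

if __name__ == "__main__":
    validate_capacity(
        example_settings["JobServerDesiredCapacity"],
        example_settings["JobServerMaxCapacity"],
    )
    print(json.dumps(build_parameters(example_settings), indent=2))
```

A list in this shape could be passed as the `Parameters` argument when creating a stack through the AWS SDK or saved as JSON for the AWS CLI; launching from the console, as the steps above describe, needs none of this.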

  •  Cost and licenses
    You are responsible for the cost of the AWS services used while running this Quick Start reference deployment. There is no additional cost for using the Quick Start.

    The AWS CloudFormation templates for this Quick Start include configuration parameters that you can customize. Some of these settings, such as instance type, will affect the cost of deployment. See the pricing pages for each AWS service you will be using for cost estimates.
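
To see how instance choices feed into cost, a back-of-the-envelope estimate can be sketched as instance count times hourly rate times hours. The rates below are placeholders, not real AWS prices, and the resource names and the helper `estimate_monthly_cost` are hypothetical; substitute figures from the pricing pages for your Region and instance types.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8,760 hours / 12)


def estimate_monthly_cost(resources: list, hours: float = HOURS_PER_MONTH) -> float:
    """Sum count * hourly rate * hours across all listed resources."""
    return sum(r["count"] * r["hourly_rate_usd"] * hours for r in resources)


# PLACEHOLDER rates for illustration only; look up actual prices before relying on this.
example_resources = [
    {"name": "job-server", "count": 2, "hourly_rate_usd": 0.10},
    {"name": "emr-node", "count": 3, "hourly_rate_usd": 0.25},
]

if __name__ == "__main__":
    print(f"Estimated monthly cost: ${estimate_monthly_cost(example_resources):,.2f}")
```

The sketch ignores storage, data transfer, and Auto Scaling changes in instance count, all of which affect the actual bill.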

    You will need to provide your own Talend Big Data Platform license. To request a 30-day free trial license, please fill out the registration form on the Talend website. You’ll receive a unique license key from Talend, which you’ll use during the Quick Start deployment process.

    The code for all Talend jobs included in the Quick Start is released under the Apache License.

  •  Get assistance
    AWS big data competency partners offer consulting services to help you quickly discover value from this data lake solution. To learn more about these partners and their consulting offers, and to request more information or support, see the Data Lake on AWS with Talend webpage on the Solution Space website.