Deploy on AWS into a new VPC

or deploy into your existing VPC
(deployment requires a Talend license; see guide)

This reference architecture is automated by AWS CloudFormation templates that you can customize to meet your specific requirements. Sign up for a free trial license from Talend, and follow the deployment guide for step-by-step instructions.

If you need assistance: For help setting up your data lake, see the Jumpstart consulting offer from Cognizant for this Quick Start.

See also: If this architecture doesn't meet your specific requirements, see the other data lake deployments in the Quick Start catalog.


This Quick Start builds a data lake environment on the Amazon Web Services (AWS) Cloud by deploying Talend Big Data Platform components and AWS services such as Amazon EMR, Amazon Redshift, Amazon Simple Storage Service (Amazon S3), and Amazon Relational Database Service (Amazon RDS).

The Quick Start also provides an optional sample dataset and Talend jobs developed by Cognizant Technology Solutions to illustrate big data practices for integrating Apache Spark, Apache Hadoop, Amazon EMR, Amazon Redshift, and Amazon S3 technologies into the data lake implementation.

The Quick Start is for users who are evaluating big data in the cloud or looking to accelerate their big data initiative through the adoption of best practices for big data integration. 

You can choose to build a new virtual private cloud (VPC) infrastructure that’s configured for security, scalability, and high availability, or use your existing VPC infrastructure for the data lake.

  • What you'll build

    The Quick Start architecture for the data lake includes the following:
    • A VPC that spans two Availability Zones. Each Availability Zone contains two subnets: a public subnet to allow connecting over the internet and a private subnet for Talend job servers, Amazon Redshift, Amazon RDS, and Amazon EMR. (The private subnet in the second Availability Zone contains only the job servers.)*
    • An internet gateway to allow access to the internet. This gateway is used by the bastion hosts to send and receive traffic.*
    • In the public subnets, managed network address translation (NAT) gateways to allow outbound internet access for resources in the private subnets.*
    • In one or both public subnets, Linux bastion hosts to allow inbound Secure Shell (SSH) access to the resources in the private subnets. You can choose the number of bastion hosts when you launch the Quick Start.*
    • In the public subnet in the first Availability Zone:
      • Talend public servers that host the Talend Administration Center (TAC) for administering Talend jobs via the browser.
      • A Talend Studio remote desktop instance available through an X2Go client for users who do not want to run Talend Studio on their laptops.
      • A Nexus artifact repository and Git servers for binary and source configuration management.
      • A Talend log server using Amazon Elasticsearch Service (Amazon ES), Logstash, and Kibana.
    • In the private subnet in the first Availability Zone:
      • An Amazon RDS MySQL DB instance to host Talend metadata.
      • An Amazon EMR cluster with Pig, Hive, and Spark that integrates closely with the Talend Big Data Platform and provides Hadoop capability in the data lake.
      • An Amazon Redshift cluster for use as a data warehouse or data mart.
    • In the private subnets, Talend job server instances in an Auto Scaling group, running Talend jobs scheduled by the TAC. Auto Scaling launches or terminates EC2 instances automatically in response to demand on the Talend job servers. You can configure the desired and maximum number of instances during deployment.
    • In the public subnets, Talend distant run job server instances in an Auto Scaling group, running Talend jobs on behalf of Talend Studio users. You can run Talend jobs locally on Talend Studio or on these servers. Auto Scaling launches or terminates EC2 instances automatically in response to demand on these job servers. You can set the desired and maximum number of instances during deployment.
    • Amazon S3 to ingest data for the data lake.

    * You can choose to create a new VPC for the data lake deployment or use your existing VPC on AWS. The template that deploys the Quick Start into an existing VPC skips the components marked by asterisks.
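The split between the two templates can be pictured with a small Python sketch. It is only an illustration of the list above, not code taken from the Quick Start: the `Component` class, the `new_vpc_only` flag, and the `deployed_components` helper are hypothetical names, and the component labels are shortened from the bullets.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Component:
    name: str
    new_vpc_only: bool  # True for items marked with an asterisk in the list


# Hypothetical inventory that mirrors the architecture list.
COMPONENTS = [
    Component("VPC spanning two Availability Zones", True),
    Component("Internet gateway", True),
    Component("Managed NAT gateways", True),
    Component("Linux bastion hosts", True),
    Component("Talend Administration Center (TAC) servers", False),
    Component("Talend Studio remote desktop instance", False),
    Component("Nexus artifact repository and Git servers", False),
    Component("Talend log server", False),
    Component("Amazon RDS MySQL DB instance", False),
    Component("Amazon EMR cluster", False),
    Component("Amazon Redshift cluster", False),
    Component("Talend job servers (Auto Scaling group)", False),
    Component("Talend distant run job servers (Auto Scaling group)", False),
    Component("Amazon S3", False),
]


def deployed_components(use_existing_vpc: bool) -> List[str]:
    """Return the components a template would create.

    The existing-VPC template skips every asterisked (network) component.
    """
    return [
        c.name
        for c in COMPONENTS
        if not (use_existing_vpc and c.new_vpc_only)
    ]
```

Calling `deployed_components(True)` leaves out the VPC, internet gateway, NAT gateways, and bastion hosts, which must already exist in your own network.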

    For details, see the Quick Start deployment guide.
  • Deployment details

    You can build your data lake environment on AWS in about an hour by following a few simple steps:

    1. Sign up for an AWS account, if you don't already have one.
    2. Upload your Talend Big Data Platform license to a private S3 bucket. You can sign up for a 30-day free trial license on the Talend website.
    3. Launch the Quick Start into a new VPC if you want to build a new AWS infrastructure, or into an existing VPC if you already have your AWS environment set up.
    4. Test the deployment by opening the Talend Administration Center (TAC) and checking the servers deployed by the Quick Start. You can also run the optional Talend jobs to test end-to-end data integration by following the steps in the user guide provided by Talend and Cognizant.

    The Quick Start includes parameters that you can customize. For example, you can configure your network or customize TAC, Amazon Redshift, Nexus, and Git server settings. 
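If you prefer to script steps 2 and 3 rather than use the console, the flow might look roughly like the Python sketch below. It assumes the clients passed in are boto3 clients (`boto3.client("s3")` and `boto3.client("cloudformation")`), whose `upload_file` and `create_stack` calls are used here. The helper names are hypothetical, any parameter keys you pass are placeholders, and `CAPABILITY_IAM` is an assumption; the real parameter names, template URL, and required capabilities are in the deployment guide.

```python
from typing import Any, Dict, List


def upload_license(s3_client: Any, license_path: str, bucket: str, key: str) -> str:
    """Step 2: copy the Talend license file to a private S3 bucket."""
    s3_client.upload_file(license_path, bucket, key)
    return f"s3://{bucket}/{key}"


def to_cfn_parameters(settings: Dict[str, str]) -> List[Dict[str, str]]:
    """Convert a plain dict into the Parameters list that CloudFormation expects."""
    return [
        {"ParameterKey": name, "ParameterValue": value}
        for name, value in sorted(settings.items())
    ]


def launch_quick_start(
    cfn_client: Any,
    stack_name: str,
    template_url: str,
    settings: Dict[str, str],
) -> str:
    """Step 3: create the stack from the new-VPC or existing-VPC template."""
    response = cfn_client.create_stack(
        StackName=stack_name,
        TemplateURL=template_url,  # taken from the deployment guide, not invented here
        Parameters=to_cfn_parameters(settings),
        Capabilities=["CAPABILITY_IAM"],  # assumption: the templates create IAM resources
    )
    return response["StackId"]
```

Passing the clients in as arguments keeps the helpers easy to try out with stand-in objects before running them against a real AWS account.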

    For complete details, see the Quick Start deployment guide.

  • Cost and licenses

    You are responsible for the cost of the AWS services used while running this Quick Start reference deployment. There is no additional cost for using the Quick Start. 

    The AWS CloudFormation templates for this Quick Start include configuration parameters that you can customize. Some of these settings, such as instance type, will affect the cost of deployment. For cost estimates, see the pricing pages for each AWS service you will be using.
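To get a feel for how instance type and instance count drive the bill, a back-of-the-envelope calculation like the one below can help. It is a rough sketch only: the function name is hypothetical, the hourly rate must come from the current AWS pricing pages (none is supplied here), and it ignores storage, data transfer, and any service that is not billed per instance-hour.

```python
def estimate_instance_cost(hourly_rate: float, instance_count: int, hours: float) -> float:
    """Multiply an hourly rate by the number of instances and hours of use.

    hourly_rate is whatever the pricing page lists for the chosen instance type.
    """
    if hourly_rate < 0 or instance_count < 0 or hours < 0:
        raise ValueError("rate, count, and hours must be non-negative")
    return hourly_rate * instance_count * hours
```

Running it once with the desired instance count and once with the maximum count of an Auto Scaling group gives a lower and upper bound for that group over the chosen period.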

    You will need to provide your own Talend Big Data Platform license. To request a 30-day free trial license, please fill out the registration form on the Talend website. You’ll receive a unique license key from Talend, which you’ll use during the Quick Start deployment process.

    The code for all Talend jobs included in the Quick Start is released under the Apache License.

AWS competency partners offer consulting services to help you quickly discover value from this data lake solution. Follow these links to find out more about these partners and their consulting offers, and to request more information or support.