Deploy on AWS into a new VPC

or deploy into your existing VPC
(deployment requires an Informatica license; see guide)


You can choose to build a new virtual private cloud (VPC) infrastructure that’s configured for security, scalability, and high availability, or use your existing VPC infrastructure for the data lake.

This reference architecture is automated by AWS CloudFormation templates that you can customize to meet your specific requirements. For detailed information about the architecture and step-by-step instructions, see the deployment guide.

If this architecture doesn't meet your specific requirements, see the other data lake deployments in the Quick Start catalog.

If you need assistance
To get assistance setting up your data lake, see the consulting offers by APN consulting partners.


This Quick Start builds a data lake environment on the Amazon Web Services (AWS) Cloud by deploying the Informatica Data Lake Management solution and AWS services such as Amazon EMR, Amazon Redshift, Amazon Simple Storage Service (Amazon S3), and Amazon Relational Database Service (Amazon RDS).

A data lake uses a single, Hadoop-based data repository that helps you manage data supply and demand. Informatica’s solution on AWS integrates, organizes, administers, governs, and secures large volumes of both structured and unstructured data. The solution delivers actionable, fit-for-purpose, reliable, and secure information for business insights.

The Quick Start configures the AWS infrastructure, deploys the Informatica Data Lake Management components, and automatically embeds Hadoop clusters in the virtual private cloud (VPC) for metadata storage and processing. It assigns the connection to the Amazon EMR cluster for the Hadoop Distributed File System (HDFS) and Hive. It also sets up connections to enable scanning of Amazon S3 and Amazon Redshift environments as part of the data lake.

  • What you'll build

    If you choose to deploy the Quick Start in a new VPC, it sets up the following AWS infrastructure for the data lake:

    • A VPC configured with public and private subnets, which spans two Availability Zones.
    • An internet gateway to allow access to the internet.
    • In the public subnets, managed network address translation (NAT) gateways configured with an Elastic IP address for outbound internet connectivity.
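
    The network layout above can be sketched as a small subnet plan. This is a minimal illustration rather than the Quick Start's actual template logic: the VPC CIDR, the /20 subnet size, and the Availability Zone names are assumptions chosen for the example.

```python
import ipaddress


def plan_subnets(vpc_cidr, availability_zones, new_prefix=20):
    """Carve one public and one private subnet per Availability Zone out of a VPC CIDR."""
    vpc = ipaddress.ip_network(vpc_cidr)
    # subnets() yields consecutive, non-overlapping blocks of the requested size.
    subnets = vpc.subnets(new_prefix=new_prefix)
    plan = []
    for tier in ("public", "private"):
        for az in availability_zones:
            plan.append({"tier": tier, "az": az, "cidr": str(next(subnets))})
    return plan


# Hypothetical values: the CIDR and zone names are placeholders, not Quick Start defaults.
plan = plan_subnets("10.0.0.0/16", ["us-east-1a", "us-east-1b"])
for entry in plan:
    print(entry)
```

    In this sketch the two public subnets would hold the NAT gateways, and the private subnets would route outbound traffic through them.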


    The Quick Start also installs and configures the following Informatica components: 

    • Informatica domain, which is the fundamental administrative unit of the Informatica platform.
    • Model Repository Service, which manages the Model repository, a relational database that stores all the metadata for projects created using Informatica client tools. The Informatica domain and the Informatica Model Repository databases are hosted on Amazon RDS for Oracle; Amazon RDS handles management tasks such as backups, patch management, and replication.
    • Data Integration Service, which manages requests to submit big data integration, big data quality, and profiling jobs to the Hadoop cluster for processing.
    • Content Management Service, which manages reference data. It provides reference data information to the Data Integration Service and Informatica Developer.
    • Analyst Service, which runs the Analyst tool in the Informatica domain. The Analyst Service manages the connections between the service components and the users who log in to the Analyst tool.
    • Profiling, which helps you find the content, quality, and structure of data sources of an application, schema, or enterprise.
    • Business Glossary, which consists of online glossaries of business terms and policies that define important concepts within an organization.
    • Catalog Service, which runs Enterprise Data Catalog and manages connections between service components and external applications.
    • An embedded Hadoop cluster that uses Hortonworks, running HDFS, HBase, YARN, and Solr.
    • Informatica Cluster Service, which runs and manages all Hadoop services, Apache Ambari server, and Apache Ambari agents on the embedded Hadoop cluster.
    • Metadata and Catalog, which include the metadata persistence store, search index, and graph database in an embedded Hadoop cluster.


    For details, see the Quick Start deployment guide.

  • Deployment details

    You can build your data lake environment on AWS by following these steps:

    1. Sign up for an AWS account, if you don't already have one, at https://aws.amazon.com.
    2. Upload your Informatica license to an S3 bucket. To sign up for a demo license, contact Informatica.
    3. Launch the Quick Start into a new VPC, if you want to build a new AWS infrastructure.
      —or—
      Launch the Quick Start into an existing VPC, if you already have your AWS environment set up.

      Each deployment takes about two hours.
    4. Monitor the creation of the cluster instance and Informatica domain.
    5. Use the Quick Start output links to download and install Informatica Developer for your data integration tasks.


    The Quick Start includes parameters that you can customize. For example, you can configure your network or customize Amazon EMR, Amazon Redshift, Amazon RDS, and Informatica software settings. 
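
    As a rough illustration of parameter customization, the sketch below turns a plain dictionary of overrides into the ParameterKey/ParameterValue list that AWS CloudFormation accepts. The parameter names and values shown are hypothetical; the real names are defined in the Quick Start templates and documented in the deployment guide.

```python
def build_stack_parameters(overrides):
    """Convert a dict of overrides into CloudFormation's parameter list format."""
    return [
        {"ParameterKey": key, "ParameterValue": str(value)}
        for key, value in sorted(overrides.items())
    ]


# Hypothetical parameter names and values, used only for illustration.
overrides = {
    "KeyPairName": "my-key-pair",
    "LicenseBucket": "my-license-bucket",
    "LicenseKey": "licenses/informatica.key",
}
parameters = build_stack_parameters(overrides)
print(parameters)
```

    A list in this shape can be passed as the Parameters argument when creating a stack, for example through the AWS SDK for Python (boto3) create_stack call.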

    For complete details, see the Quick Start deployment guide.

  • Cost and licenses

    You are responsible for the cost of the AWS services used while running this Quick Start reference deployment. There is no additional cost for using the Quick Start. 

    The AWS CloudFormation templates for this Quick Start include configuration parameters that you can customize. Some of these settings, such as instance type, will affect the cost of deployment. See the pricing pages for each AWS service you will be using for cost estimates.
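
    To see how settings such as instance type and node count feed into cost, the arithmetic can be sketched as follows. The hourly rates and counts are made-up placeholders, not AWS prices; substitute figures from the pricing pages for the services and Region you use.

```python
def estimate_cost(resources, hours):
    """Sum (hourly rate x instance count) over all resources, then multiply by hours."""
    hourly_total = sum(rate * count for rate, count in resources.values())
    return round(hourly_total * hours, 2)


# Placeholder (hourly_rate_usd, instance_count) pairs. These are NOT real prices.
resources = {
    "emr_nodes": (0.25, 4),
    "redshift_nodes": (0.5, 2),
    "rds_instance": (0.5, 1),
}
estimate = estimate_cost(resources, hours=2)
print(f"Estimated cost for 2 hours: ${estimate:.2f}")
```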

    This Quick Start requires a license to deploy the Informatica Data Lake Management solution. To sign up for a demo license, contact Informatica.

AWS competency partners offer consulting services to help you quickly discover value from this data lake solution. Follow these links to find out more about these partners and their consulting offers, and to request more information or support.

NGData
Hitachi