AWS Quick Starts — Customer Ready Solutions

Informatica Data Lake Management on AWS

Build a data lake environment with Informatica technologies and AWS services

This Quick Start builds a data lake environment on the Amazon Web Services (AWS) Cloud by deploying the Informatica Data Lake Management solution and AWS services such as Amazon EMR, Amazon Redshift, Amazon Simple Storage Service (Amazon S3), and Amazon Relational Database Service (Amazon RDS).

A data lake uses a single, Hadoop-based data repository that helps you manage data supply and demand. Informatica’s solution on AWS integrates, organizes, administers, governs, and secures large volumes of both structured and unstructured data. The solution delivers actionable, fit-for-purpose, reliable, and secure information for business insights.

The Quick Start configures the AWS infrastructure, deploys the Informatica Data Lake Management components, and automatically embeds Hadoop clusters in the virtual private cloud (VPC) for metadata storage and processing. It assigns the connection to the Amazon EMR cluster for the Hadoop Distributed File System (HDFS) and Hive. It also sets up connections to enable scanning of Amazon S3 and Amazon Redshift environments as part of the data lake.


This Quick Start was developed by Informatica in collaboration with AWS. Informatica is an AWS Competency Partner.

  •  What you'll build
  • If you choose to deploy the Quick Start in a new VPC, it sets up the following AWS infrastructure for the data lake:

    • A VPC configured with public and private subnets, which spans two Availability Zones.
    • An internet gateway to allow access to the internet.
    • In the public subnets, managed network address translation (NAT) gateways configured with an Elastic IP address for outbound internet connectivity.
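
To make the network layout above concrete, here is a minimal Python sketch that carves one public and one private subnet per Availability Zone out of a VPC address range. The CIDR block `10.0.0.0/16`, the `/20` subnet size, the Availability Zone names, and the function name `plan_subnets` are illustrative assumptions, not values taken from the Quick Start templates.

```python
import ipaddress


def plan_subnets(vpc_cidr, availability_zones, new_prefix=20):
    """Return one public and one private subnet per Availability Zone.

    All values are illustrative; the Quick Start templates define
    their own address ranges.
    """
    network = ipaddress.ip_network(vpc_cidr)
    # subnets() yields consecutive, equal-sized, non-overlapping blocks.
    blocks = network.subnets(new_prefix=new_prefix)
    plan = []
    for tier in ("public", "private"):
        for az in availability_zones:
            plan.append({"tier": tier, "az": az, "cidr": str(next(blocks))})
    return plan


if __name__ == "__main__":
    # Hypothetical VPC range and Availability Zone names.
    for subnet in plan_subnets("10.0.0.0/16", ["us-east-1a", "us-east-1b"]):
        print(subnet["tier"], subnet["az"], subnet["cidr"])
```

In a layout like this, the NAT gateways would sit in the two public subnets, and instances in the private subnets would route outbound traffic through them.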

    The Quick Start also installs and configures the following Informatica components:

    • Informatica domain, which is the fundamental administrative unit of the Informatica platform.
    • Model Repository Service, which is a relational database that stores all the metadata for projects created using Informatica client tools. The Informatica domain and the Informatica Model Repository databases are hosted on Amazon RDS using Oracle. Amazon RDS handles management tasks such as backups, patch management, and replication.
    • Data Integration Service, which manages requests to submit big data integration, big data quality, and profiling jobs to the Hadoop cluster for processing.
    • Content Management Service, which manages reference data. It provides reference data information to the Data Integration Service and Informatica Developer.
    • Analyst Service, which runs the Analyst tool in the Informatica domain. The Analyst Service manages the connections between the service components and the users who log in to the Analyst tool.
    • Profiling, which helps you find the content, quality, and structure of data sources of an application, schema, or enterprise.
    • Business Glossary, which consists of online glossaries of business terms and policies that define important concepts within an organization.
    • Catalog Service, which runs Enterprise Data Catalog and manages connections between service components and external applications.
    • An embedded Hadoop cluster that uses Hortonworks, running HDFS, HBase, YARN, and Solr.
    • Informatica Cluster Service, which runs and manages all Hadoop services, Apache Ambari server, and Apache Ambari agents on the embedded Hadoop cluster.
    • Metadata and Catalog, which include the metadata persistence store, search index, and graph database in an embedded Hadoop cluster.
  •  How to deploy
  • You can build your data lake environment on AWS by following these steps:

    1. If you don't already have an AWS account, sign up at https://aws.amazon.com.
    2. Upload your Informatica license to an S3 bucket. To sign up for a demo license, contact Informatica.
    3. Launch the Quick Start. Each deployment takes about two hours. You can choose from two options: deploy into a new VPC or deploy into an existing VPC.
    4. Monitor the creation of the cluster instance and Informatica domain.
    5. Use the Quick Start output links to download and install Informatica Developer for your data integration tasks.

    The Quick Start includes parameters that you can customize. For example, you can configure your network or customize Amazon EMR, Amazon Redshift, Amazon RDS, and Informatica software settings.
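
For readers who prefer to script the steps above rather than use the console, the following Python sketch shows one possible shape. It assumes the third-party `boto3` library and configured AWS credentials. The template URL is a placeholder, the parameter names in the usage note are hypothetical, and the `CAPABILITY_IAM` capability is an assumption; take the real values from the Quick Start's CloudFormation template.

```python
from urllib.parse import urlparse

# Placeholder: replace with the template URL published for the Quick Start.
TEMPLATE_URL = "https://example-bucket.s3.amazonaws.com/path/to/template.json"


def build_stack_request(stack_name, template_url, parameters):
    """Build keyword arguments for CloudFormation's create_stack call."""
    if urlparse(template_url).scheme != "https":
        raise ValueError("template_url must be an HTTPS URL")
    return {
        "StackName": stack_name,
        "TemplateURL": template_url,
        # CloudFormation expects a list of key/value pairs with string values.
        "Parameters": [
            {"ParameterKey": key, "ParameterValue": str(value)}
            for key, value in sorted(parameters.items())
        ],
        # Assumption: the template creates IAM resources.
        "Capabilities": ["CAPABILITY_IAM"],
    }


def deploy(request, license_path, bucket, key):
    """Upload the license file, launch the stack, and wait for completion."""
    import boto3  # third-party; imported here so the helper above has no dependency

    # Step 2: upload the Informatica license to an S3 bucket.
    boto3.client("s3").upload_file(license_path, bucket, key)

    # Step 3: launch the stack.
    cfn = boto3.client("cloudformation")
    cfn.create_stack(**request)

    # Step 4: a deployment takes about two hours, so poll once a minute
    # for up to three hours instead of relying on the waiter's defaults.
    cfn.get_waiter("stack_create_complete").wait(
        StackName=request["StackName"],
        WaiterConfig={"Delay": 60, "MaxAttempts": 180},
    )

    # Step 5: the stack outputs carry the links mentioned in the steps.
    stack = cfn.describe_stacks(StackName=request["StackName"])["Stacks"][0]
    return stack.get("Outputs", [])
```

A call might look like `deploy(build_stack_request("informatica-data-lake", TEMPLATE_URL, {"KeyPairName": "my-key"}), "license.key", "my-license-bucket", "license.key")`, where every name and value is hypothetical.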

  •  Cost and licenses
  • You are responsible for the cost of the AWS services used while running this Quick Start reference deployment. There is no additional cost for using the Quick Start.

    The AWS CloudFormation templates for this Quick Start include configuration parameters that you can customize. Some of these settings, such as instance type, affect the cost of deployment. For cost estimates, see the pricing pages for each AWS service you will be using.

    This Quick Start requires a license to deploy the Informatica Data Lake Management solution. To sign up for a demo license, contact Informatica.

  •  Get assistance
  • AWS big data competency partners offer consulting services to help you quickly discover value from this data lake solution. Take advantage of the consulting offers from NGDATA, Hitachi, and Cognizant. To learn more about these partners and their consulting offers, and to request more information or support, see the Informatica Data Lake Management on AWS webpage on the Solution Space website.