Unify On-Premises and Cloud-Hosted Data Assets Using Informatica Enterprise Data Catalog
By Jobin George, Sr. Partner Solutions Architect at AWS
By Deepak Ram, Director, Strategic Solutions at Informatica
By Louis-Noël Trapadoux, Principal Product Manager at Informatica
Systems are growing more complex, cloud applications are growing in adoption, and cloud data lakes are being increasingly deployed. At the same time, organizations need to implement data cataloging solutions to provide data governance, data analytics, or pure metadata management.
These cataloging solutions should be able to quickly start providing insights so they can drive adoption. This adoption would, in turn, provide additional insight into the data assets that can be found in other parts of an enterprise’s information system.
Unfortunately, data assets are increasingly distributed between the cloud and on-premises. So, there is a critical need for a data catalog that can acquire metadata from both on-premises applications and cloud services, and unify them.
Informatica Enterprise Data Catalog (EDC) scans and catalogs an enterprise’s data assets, whether hosted on the cloud or stored on-premises. Informatica is an AWS Advanced Technology Partner with the AWS Data & Analytics Competency.
In this post, we’ll explain how to use Informatica EDC with AWS Glue to scan and catalog all your enterprise data assets, regardless of where they are.
Informatica Enterprise Data Catalog (EDC) uses an AI-powered, machine learning (ML)-based discovery engine to scan and catalog data assets across the enterprise, both in the cloud and on-premises, as shown in Figures 1 and 2 below.
Informatica EDC enables both business and IT users to easily discover and understand relevant and trusted data with powerful semantic search, end-to-end data lineage, and automatic domain discovery.
It also provides integrated data quality and profiling statistics, holistic relationship views, intelligent recommendations, and an integrated business glossary.
Populating Enterprise Data Catalog with AWS Glue Metadata
Let’s look at an example in which users are seeking new data sets to consume for data analysis.
Understanding what datasets are available in the cloud or on-premises systems is key to exploiting data for analysis. One of the critical parts of a data catalog implementation is the need for a rapid ingestion of the metadata to build relevant content and drive quick user adoption.
Figure 1 – Informatica EDC solution architecture.
AWS customers use AWS Glue to crawl technical metadata for the datasets available in the AWS ecosystem. This supports authoring and maintaining the jobs that load the data lake and build new datasets.
At the same time, the AWS Glue Data Catalog can be used as a single source of metadata in AWS environments. This helps accelerate the documentation of services such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon DynamoDB, and Amazon Relational Database Service (Amazon RDS) databases in one pass.
The AWS Glue Data Catalog can then be used to build an initial snapshot of the cloud deployment landscape, or to enrich the enterprise catalog with information that cannot be accessed easily.
Figure 2 – Informatica EDC reference architecture.
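To make the kind of metadata held in the AWS Glue Data Catalog concrete, here is a minimal Python sketch that walks the catalog with the boto3 Glue paginators and yields one summary per table. It is only an illustration of what a scanner can read from AWS Glue, not how Informatica EDC is implemented. The function names, the summary field names, and the region passed to the client are assumptions chosen for this example.

```python
from typing import Any, Dict, Iterator


def iter_glue_tables(glue_client: Any) -> Iterator[Dict[str, Any]]:
    """Yield one summary dict per table found in the Glue Data Catalog."""
    db_pages = glue_client.get_paginator("get_databases").paginate()
    for db_page in db_pages:
        for database in db_page["DatabaseList"]:
            table_pages = glue_client.get_paginator("get_tables").paginate(
                DatabaseName=database["Name"]
            )
            for table_page in table_pages:
                for table in table_page["TableList"]:
                    descriptor = table.get("StorageDescriptor", {})
                    yield {
                        "database": database["Name"],
                        "table": table["Name"],
                        # Where the underlying data lives (for example an S3 path).
                        "location": descriptor.get("Location", ""),
                        "columns": [
                            (col["Name"], col.get("Type", ""))
                            for col in descriptor.get("Columns", [])
                        ],
                    }


def make_glue_client(region_name: str) -> Any:
    """Create a real Glue client; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the sketch loads without boto3

    return boto3.client("glue", region_name=region_name)
```

With credentials configured, `for summary in iter_glue_tables(make_glue_client("us-east-1")): print(summary)` would print each table with its location and column list.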
Steps to Synchronize AWS Glue Catalog and EDC
To use Informatica EDC for an initial population of metadata from the AWS ecosystem, you must first create a resource. This resource points to the AWS Glue service in a specific region. You can either provide an access key and secret key or use the role-based authentication of the EDC server to access the service.
Figure 3 – Creating a resource in Informatica EDC.
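The EDC resource itself is configured in the EDC interface, but the same two authentication choices exist when calling AWS Glue from code. The sketch below, written against boto3, builds session arguments for either explicit keys or the default credential chain, which picks up an attached IAM role when one is available. The helper names are hypothetical and the key values in the usage note are placeholders.

```python
from typing import Any, Dict, Optional


def glue_session_kwargs(
    region: str,
    access_key: Optional[str] = None,
    secret_key: Optional[str] = None,
) -> Dict[str, str]:
    """Build boto3.Session arguments for key-based or role-based access."""
    if (access_key is None) != (secret_key is None):
        raise ValueError("Provide both access_key and secret_key, or neither.")
    kwargs = {"region_name": region}
    if access_key is not None:
        # Explicit credentials: the access key and secret key option.
        kwargs["aws_access_key_id"] = access_key
        kwargs["aws_secret_access_key"] = secret_key
    # With no keys, boto3 falls back to its default credential chain,
    # which includes the IAM role attached to the host.
    return kwargs


def make_glue_client(**session_kwargs: str) -> Any:
    """Create a Glue client from the session arguments; needs boto3."""
    import boto3  # imported lazily so the sketch loads without boto3

    return boto3.Session(**session_kwargs).client("glue")
```

For example, `make_glue_client(**glue_session_kwargs("eu-west-1"))` relies on the role of the host, while passing `access_key` and `secret_key` uses explicit credentials.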
To properly document data from the AWS environment, the catalog must be able to surface the details of the objects referenced in AWS Glue. Each of those objects is a tabular representation of an underlying source object.
When Amazon S3 is the source, the catalog should be able to represent the file or the directory the AWS Glue table points to. If Amazon Redshift is the source, the catalog should be able to represent the source object in a table, view, or other Redshift element.
To do this, we can request the AWS Glue resource to create reference resources. This means that for any systems that AWS Glue references, a new virtual resource representing the source system will be created.
In the same way, because AWS Glue objects become available in Amazon Athena, we can automate the creation of a reference resource for Athena.
Figure 4 – Automating the creation of a reference resource.
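The idea behind reference resources can be sketched in a few lines of Python: given a table definition as returned by AWS Glue, work out which source system it points to. This is not EDC's logic. The function name is hypothetical, and the location prefixes and `classification` values checked here are assumptions based on typical crawler output, so treat them as a starting point.

```python
from typing import Any, Dict


def infer_source_system(table: Dict[str, Any]) -> str:
    """Guess the source system behind a Glue table definition."""
    location = table.get("StorageDescriptor", {}).get("Location", "")
    classification = table.get("Parameters", {}).get("classification", "").lower()

    # Assumption: file-based tables carry an S3 URI as their location.
    if location.startswith(("s3://", "s3a://", "s3n://")):
        return "Amazon S3"
    # Assumption: DynamoDB tables carry a table ARN or a matching classification.
    if location.startswith("arn:aws:dynamodb:") or classification == "dynamodb":
        return "Amazon DynamoDB"
    if classification == "redshift":
        return "Amazon Redshift"
    # Assumption: other JDBC sources are labeled with their engine name.
    if classification in {"mysql", "postgresql", "oracle", "sqlserver"}:
        return "Amazon RDS or other JDBC database"
    return "unknown"
```

A catalog could then create one virtual resource per distinct value returned, which mirrors the behavior described above.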
Once the resource is executed, the enterprise data catalog starts showing the object definitions coming from AWS Glue. Without having to scan Amazon S3, Amazon Redshift, Amazon DynamoDB, or Amazon RDS, users of the catalog also start to see objects coming from those systems.
For example, here is a table in AWS Glue:
Figure 5 – Table in AWS Glue showing object definitions.
The object in Informatica EDC offers the same level of detail. It also enables users to curate the ownership of a dataset, certify an asset, collaborate by entering reviews, or ask a question.
Figure 6 – The object in Informatica EDC.
This next level of information shows the dependent objects in the AWS ecosystem:
- Amazon S3 object along with the information about the location.
- Table available in Amazon Athena, which can be a source for the reporting tool.
Viewing End-to-End Lineage Across Different AWS Services
The end-to-end lineage across the different services in AWS is displayed like this:
Figure 7 – End-to-end lineage graph.
On the right side is a report, obtained by extracting metadata from whichever reporting tool connects to Amazon Athena, such as Tableau or MicroStrategy.
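The chain in the lineage graph can be thought of as a small directed graph running from the Amazon S3 object to the report. The Python sketch below models that idea and lists everything downstream of a given asset. The asset names are hypothetical, and this is a simplified illustration rather than EDC's lineage model.

```python
from typing import Dict, List

# Hypothetical assets; each edge points from an upstream asset to a downstream one.
LINEAGE: Dict[str, List[str]] = {
    "s3://example-bucket/orders/": ["glue:sales.orders"],
    "glue:sales.orders": ["athena:sales.orders"],
    "athena:sales.orders": ["report:Orders Dashboard"],
    "report:Orders Dashboard": [],
}


def downstream(asset: str, edges: Dict[str, List[str]]) -> List[str]:
    """Return every asset reachable from `asset`, in discovery order."""
    seen: List[str] = []
    stack = [asset]
    while stack:
        current = stack.pop()
        for child in edges.get(current, []):
            if child not in seen:
                seen.append(child)
                stack.append(child)
    return seen
```

Calling `downstream("s3://example-bucket/orders/", LINEAGE)` returns the AWS Glue table, the Athena table, and the report, which is the impact path a user reads from left to right in the lineage view.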
By taking those few steps with Informatica EDC, we were able to gather metadata that describes precisely how the data is stored and consumed within the AWS environment. By simply extracting the metadata available in the AWS Glue Data Catalog, we were able to accelerate the documentation of the different services within the AWS ecosystem.
Additional Insights by Directly Connecting to the AWS Services
To provide additional insight for users who are looking for new datasets to use, we need to harvest metadata from the sources of AWS Glue (Amazon S3, Amazon Redshift, and Amazon RDS). To do this, we create resources that will directly connect to those services.
In addition to metadata harvesting, we can add data profiling to provide more insight on the content of the data available in the data asset, data patterns, uniqueness and presence of values, value frequency distribution, and business meaning discovery.
Here’s an example of an Amazon Athena table that includes data profiling results:
Figure 8 – Data profiling results.
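To give a feel for the statistics listed above, here is a minimal Python sketch that profiles a single column: presence of values, uniqueness, and value frequency distribution. It is a simplified stand-in for a profiling engine, not EDC's implementation. The function name and the output field names are assumptions made for this example.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Optional


def profile_column(values: Iterable[Optional[Any]], top_n: int = 3) -> Dict[str, Any]:
    """Compute basic profiling statistics for one column of values."""
    all_values = list(values)
    present = [v for v in all_values if v is not None]
    counts = Counter(present)
    total = len(all_values)
    return {
        "row_count": total,
        # Presence of values: share of rows that are null.
        "null_ratio": (total - len(present)) / total if total else 0.0,
        "distinct_count": len(counts),
        # Uniqueness: distinct values divided by non-null values.
        "unique_ratio": len(counts) / len(present) if present else 0.0,
        # Value frequency distribution, most common first.
        "top_values": counts.most_common(top_n),
    }
```

For a column holding `["US", "US", "FR", None]`, the profile reports four rows, a null ratio of 0.25, two distinct values, and `US` as the most frequent value.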
Informatica Enterprise Data Catalog (EDC) is an AI-powered data catalog that provides a machine learning-based discovery engine to scan and catalog data assets across the enterprise, both in the cloud and on-premises.
You can begin cataloging metadata from an AWS environment by using AWS Glue as an accelerator to ingest cloud system metadata into Informatica EDC, without having to re-crawl or re-scan every underlying data source.
Informatica EDC is available via AWS Marketplace for users to try out, even for customers with current licenses purchased through other channels. To learn more about Informatica’s cloud-native data management solutions for AWS, go to informatica.com/aws.
Informatica – AWS Partner Spotlight
Informatica is an AWS Competency Partner whose Enterprise Data Catalog (EDC) scans and catalogs an enterprise’s data assets, whether hosted on the cloud or stored on-premises.