AWS Storage Blog

Streamline data sharing and access control with Informatica Cloud Data Marketplace and Amazon S3 Access Grants

Organizations are modernizing their data lakes on Amazon Simple Storage Service (Amazon S3) to handle the ever-growing data volume and speed while meeting the demands of analytics, machine learning (ML), artificial intelligence (AI), and generative AI applications. To enable a data-driven culture and remain innovative, the data platform must allow for data-centric collaboration across business and IT personas in the organization. This means enabling prompt access and secure delivery, while adhering to the regulatory and compliance standards, and meeting the user’s privacy expectations.

However, many organizations find themselves at a crossroads between providing timely access and maintaining the right level of access for authorized users who handle sensitive and personally identifiable data. To address this challenge, organizations need to move away from complex, inconsistent, and manual data sharing processes. They also need to re-evaluate policies that were originally established solely to regulate access. To overcome these obstacles and lay the foundation for more efficient and secure data management practices, organizations need a modern, cloud-native, automated data governance and access control framework that:

  • Fosters collaboration across data producers, owners, and consumers through a self-service request and approval workflow for data sharing and delivery.
  • Supports fine-grained access controls to protect sensitive information in a data lake at all levels – objects, rows, columns, and even individual cells.
  • Scales complex or large permission and access control configurations for data in Amazon S3 across users, roles, and applications.

This post details how organizations can use the integration of the Informatica Intelligent Data Management Cloud™ (IDMC) with Amazon S3 Access Grants to streamline sharing of and access to their data lakes on Amazon S3, while making sure the right set of guardrails is in place to protect sensitive information.

IDMC is an AI-powered, metadata-driven, persona-based, cloud-native platform built to empower data professionals with comprehensive and cohesive cloud data management capabilities to discover, catalog, ingest, cleanse, integrate, govern, secure, prepare, and master data.

Amazon S3 Access Grants

Amazon S3 Access Grants helps you manage Amazon S3 permissions for your data lakes at scale. With S3 Access Grants, you specify permissions as scalable, intuitive grants. Thereafter, when users or applications want to access Amazon S3, they can request temporary, least-privilege credentials from S3 Access Grants. They can then use the S3 Access Grants-vended credentials to access Amazon S3. Additionally, S3 Access Grants logs the end-user identity, as well as the application used to access Amazon S3 data, in AWS CloudTrail. This helps provide a detailed audit history for all access to the data in your S3 buckets.
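As a rough illustration of the credential request described above, the following Python sketch builds the parameters for the S3 Access Grants GetDataAccess call and, when run against a real account, requests temporary credentials through boto3. The account ID, bucket name, and prefix are placeholder assumptions, and the helper function names are ours for illustration rather than part of any SDK.

```python
from typing import Any, Dict


def build_data_access_request(
    account_id: str,
    s3_target: str,
    permission: str = "READ",
    duration_seconds: int = 3600,
) -> Dict[str, Any]:
    """Build keyword arguments for the s3control GetDataAccess call."""
    if permission not in {"READ", "WRITE", "READWRITE"}:
        raise ValueError(f"unsupported permission: {permission}")
    return {
        "AccountId": account_id,
        "Target": s3_target,
        "Permission": permission,
        "DurationSeconds": duration_seconds,
    }


def request_temporary_credentials(request: Dict[str, Any]) -> Dict[str, Any]:
    """Call S3 Access Grants; needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    s3control = boto3.client("s3control")
    return s3control.get_data_access(**request)["Credentials"]


# Placeholder values for illustration only (assumptions, not from the post).
example_request = build_data_access_request(
    account_id="111122223333",
    s3_target="s3://example-data-lake/billing/*",
)
```

Calling request_temporary_credentials(example_request) only succeeds if a matching grant exists for the caller. Keeping the request builder separate from the network call makes the parameters easy to inspect before anything is sent.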

With S3 Access Grants, you can enforce granular, least-privilege Amazon S3 permissions at scale. It is a simple and scalable way to complement existing resource-level controls such as S3 bucket policies.

As an AWS Data & Analytics partner, Informatica offers simplified and streamlined data sharing and access control that now integrates with Amazon S3 Access Grants. This integration, available at the launch of the S3 Access Grants feature, highlights Informatica's commitment to enhancing cloud data management and providing solutions that address key aspects of data sharing and governance on AWS.

Solution overview

Figure 1: Architecture overview

As shown in Figure 1, this solution involves the following IDMC services:

This solution enables data owners to set up fine-grained access rules for their data assets with Informatica's policy builder, share metadata and other information related to those assets (e.g., data quality, data usage policy), and allow data consumers to browse and request access through Informatica Cloud Data Marketplace (CDMP). Upon approval, which can happen either through an automated approval process or by authorization from the data owner, a secure version of the dataset is automatically delivered in accordance with the configured policies. Then, the permissions for the newly provisioned data are configured with Amazon S3 Access Grants APIs, promptly fulfilling the data consumer's request. In CDMP, data owners get a single-pane view of the consumers who have access to their data and can withdraw access, if needed, through an automated workflow.

Solution walkthrough

Figure 2: Solution workflow

In this scenario (as shown in Figure 2), we have three data community personas:

  • A data steward who centrally defines and establishes protection and usage policies across the organization.
  • A data owner responsible for data quality, enrichment, and curation of the billing dataset to maximize its usefulness for the company, while making sure that only authorized users have the right level of access for the necessary duration.
  • A data scientist (consumer) requesting access to the sensitive billing data, available in an Amazon S3 data lake, to analyze and predict user behavior based on changes in the pricing model.

Figure 3: Informatica Data Access Policy Builder

The data steward defines the organization's protection policy (as shown in Figure 3) for datasets that contain user billing data. When new billing data is added to the data lake and classified in the billing domain by Informatica's AI engine, the defined policies are automatically associated with the dataset based on the underlying data model and data column classification. Based on the defined policy, the 'Email', 'First Name', and 'Social Security Number' columns are tokenized using consistent hashing to preserve referential integrity.
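To show why consistent hashing preserves referential integrity, here is a minimal Python sketch. It is not Informatica's implementation, which this post does not describe. It uses keyed HMAC-SHA-256 as a stand-in deterministic tokenizer, and the key, the sample rows, and the function names are all hypothetical.

```python
import hashlib
import hmac

# Columns named in the policy described above.
SENSITIVE_COLUMNS = ("Email", "First Name", "Social Security Number")


def tokenize(value: str, key: bytes) -> str:
    """Map a value to a deterministic token using HMAC-SHA-256.

    The digest is truncated to 16 hex characters for readability;
    a real system would weigh truncation against collision risk.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def protect_row(row: dict, key: bytes) -> dict:
    """Return a copy of the row with sensitive columns tokenized."""
    return {
        column: tokenize(value, key) if column in SENSITIVE_COLUMNS else value
        for column, value in row.items()
    }


key = b"demo-key-not-for-production"  # hypothetical key for illustration
billing = {"Email": "ana@example.com", "First Name": "Ana", "Plan": "Pro"}
invoice = {"Email": "ana@example.com", "Amount": "42.00"}

protected_billing = protect_row(billing, key)
protected_invoice = protect_row(invoice, key)

# The same email produces the same token in both records, so the two
# datasets can still be joined on the tokenized column.
same_token = protected_billing["Email"] == protected_invoice["Email"]
```

Because the token depends only on the input value and the key, every occurrence of a given email maps to the same token, while the original value never appears in the protected copy.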

Figure 4: Informatica Cloud Data Marketplace

Using CDMP (shown in Figure 4), the data owner shares detailed information about the billing dataset, such as its quality and lineage. This helps data scientists and other data consumers in the organization quickly understand the dataset's lifecycle and features, making it easier for them to decide whether to request access. The data owner also defines a set of delivery targets that specify how and where the dataset is provisioned for the consumer. In this case, given the sensitive information in the billing dataset, the data owner declares the delivery target as 'CDAM – Amazon S3 Access Grants', so that the billing dataset is consumed subject to both Informatica's protection policy and the access permissions defined in S3 Access Grants.

Figure 5: Data access request workflow (CDMP)

The data scientist explores categories of data assets in CDMP. Upon finding the needed billing dataset for training a model to predict consumer behavior with pricing changes, they submit an order to access the sensitive dataset. As part of this order, the data scientist also declares their intended use, provides a business justification, and selects ‘CDAM – Amazon S3 Access Grants’ as the delivery target (as shown in Figure 5).

Figure 6: Data access approval workflow (CDMP)

The billing data owner receives a notification that a new order is pending approval. Once the order is approved, an automated workflow enforces the data access protection policy defined by the data steward. Following this, a de-identified and protected copy of the dataset is provisioned for the data scientist. Finally, the needed object-level access is granted with Amazon S3 Access Grants APIs, allowing the data consumer to access the de-identified copy of the billing dataset. After the fulfillment process concludes, the data scientist receives a notification confirming the order's completion. The notification includes details about the provisioned dataset. The entire timeline of the access fulfillment process (as shown in Figure 6) is maintained for audit. For a dataset without any sensitive information, the data owner can also configure an automatic approval workflow within CDMP.
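The grant step in this workflow can be sketched with the S3 Access Grants CreateAccessGrant API. The post does not show how CDAM calls the API, so the following Python is only an assumption-laden illustration. The account ID, location ID, sub-prefix, and IAM role ARN are placeholders, and the helper names are hypothetical.

```python
from typing import Any, Dict


def build_access_grant_request(
    account_id: str,
    location_id: str,
    sub_prefix: str,
    grantee_role_arn: str,
    permission: str = "READ",
) -> Dict[str, Any]:
    """Build keyword arguments for the s3control CreateAccessGrant call."""
    return {
        "AccountId": account_id,
        "AccessGrantsLocationId": location_id,
        # Narrows the grant to the provisioned copy within the location.
        "AccessGrantsLocationConfiguration": {"S3SubPrefix": sub_prefix},
        "Grantee": {
            "GranteeType": "IAM",
            "GranteeIdentifier": grantee_role_arn,
        },
        "Permission": permission,
    }


def create_grant(request: Dict[str, Any]) -> str:
    """Create the grant and return its ID; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    s3control = boto3.client("s3control")
    return s3control.create_access_grant(**request)["AccessGrantId"]


# Placeholder values for illustration only (assumptions, not from the post).
grant_request = build_access_grant_request(
    account_id="111122223333",
    location_id="example-location-id",
    sub_prefix="orders/order-123/*",
    grantee_role_arn="arn:aws:iam::111122223333:role/ExampleDataScientistRole",
)
```

Granting read-only access to a prefix that holds only the provisioned copy keeps the permission scoped to what the order approved.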

Figure 7: Access data from Amazon SageMaker using Amazon S3 Access Grants

The data scientist can now start training the model within Amazon SageMaker using the billing dataset. The data scientist uses the Amazon S3 Access Grants SDK in SageMaker (as shown in Figure 7) to receive the credentials needed to read the de-identified data. The data remains subject to the protection policy defined by the data steward in Informatica's data access management: the first names, email addresses, and social security numbers in the billing dataset are tokenized.
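As a hedged sketch of what reading the data with vended credentials might look like in a notebook, the following Python maps the Credentials structure returned by GetDataAccess onto a boto3 S3 client. It uses plain boto3 rather than whatever tooling Figure 7 shows. The credential values, bucket, and key are fake placeholders, and the function names are hypothetical.

```python
from typing import Any, Dict


def s3_client_kwargs(credentials: Dict[str, Any]) -> Dict[str, str]:
    """Translate GetDataAccess credentials into boto3 client arguments."""
    return {
        "aws_access_key_id": credentials["AccessKeyId"],
        "aws_secret_access_key": credentials["SecretAccessKey"],
        "aws_session_token": credentials["SessionToken"],
    }


def read_object(credentials: Dict[str, Any], bucket: str, key: str) -> bytes:
    """Read one object using the temporary credentials; needs boto3."""
    import boto3  # imported lazily so the mapping above works without boto3

    s3 = boto3.client("s3", **s3_client_kwargs(credentials))
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()


# Fake values shaped like the Credentials field of a GetDataAccess response.
fake_credentials = {
    "AccessKeyId": "EXAMPLEKEYID",
    "SecretAccessKey": "example-secret",
    "SessionToken": "example-session-token",
    "Expiration": "2030-01-01T00:00:00Z",
}
client_kwargs = s3_client_kwargs(fake_credentials)
```

Because the credentials are temporary, a long-running training job would need to request fresh ones before the expiration time.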

Figure 8: Data access withdrawn (CDMP)

After the data scientist finishes training the model in Amazon SageMaker and no longer needs access to the data, they can request a withdrawal of access through CDMP. Additionally, the data owner can revoke access at any time, if necessary.

Figure 9: Data access denied within Amazon SageMaker
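The withdrawal step can be pictured as deleting the corresponding grant with the S3 Access Grants DeleteAccessGrant API. The post does not show how CDMP performs this, so the Python below is a sketch under assumptions. The grant records mimic the shape of ListAccessGrants results with made-up values, and the helper names are hypothetical.

```python
from typing import Any, Dict, List


def select_grant_ids(
    grants: List[Dict[str, Any]],
    grantee_identifier: str,
    grant_scope: str,
) -> List[str]:
    """Pick the IDs of grants for one grantee and one S3 scope."""
    return [
        grant["AccessGrantId"]
        for grant in grants
        if grant["Grantee"]["GranteeIdentifier"] == grantee_identifier
        and grant["GrantScope"] == grant_scope
    ]


def revoke_grants(account_id: str, grant_ids: List[str]) -> None:
    """Delete each grant; needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the filter above works without boto3

    s3control = boto3.client("s3control")
    for grant_id in grant_ids:
        s3control.delete_access_grant(AccountId=account_id, AccessGrantId=grant_id)


# Made-up records shaped like entries in a ListAccessGrants response.
role_arn = "arn:aws:iam::111122223333:role/ExampleDataScientistRole"
sample_grants = [
    {
        "AccessGrantId": "grant-1",
        "Grantee": {"GranteeType": "IAM", "GranteeIdentifier": role_arn},
        "GrantScope": "s3://example-data-lake/orders/order-123/*",
    },
    {
        "AccessGrantId": "grant-2",
        "Grantee": {"GranteeType": "IAM", "GranteeIdentifier": role_arn},
        "GrantScope": "s3://example-data-lake/other/*",
    },
]
to_revoke = select_grant_ids(
    sample_grants, role_arn, "s3://example-data-lake/orders/order-123/*"
)
```

Once a grant is deleted, later credential requests for that scope should no longer match a grant and are expected to be denied, as Figure 9 illustrates. Credentials that were already issued may remain usable until they expire.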

Conclusion

In this post, we illustrated a streamlined, self-service data access management solution that gives data stewards the ability to enforce data protection measures. It helps ensure appropriate data usage and controlled access for authorized data consumers, without sacrificing prompt access to and delivery of data. This approach plays a crucial role in fostering collaboration and data sharing across the organization and in building a data-driven culture.

Miguel Cunhal

Miguel Cunhal is a Principal Product Manager at Informatica. He works with customers and engineering teams to build new policy enforcement features that empower organizations to protect their valuable data assets and create enhanced value for their end users. He is passionate about finding new ways to protect and respect data privacy and security.

Huey Han

Huey Han is a Senior Product Manager for Amazon S3. He focuses on data lake, analytics, and data governance at S3. He is based in New York City. In his spare time, Huey enjoys martial arts.

Rajeev Srinivasan

Rajeev Srinivasan is a Director of Technical Alliance, Strategic Ecosystem at Informatica. He leads the strategic technical partnership with AWS to bring needed and innovative solutions and capabilities into the hands of the customers. Along with customer obsession, he has a passion for data and cloud technologies, and riding his Harley.

Weifan Liang

Weifan Liang is a Senior Partner Solutions Architect at AWS. He works closely with AWS top strategic data analytics software partners to drive product integration, build optimized architecture, develop long-term strategy, and provide thought leadership. Innovating together with partners, Weifan strives to help customers accelerate business outcomes with AI-powered digital transformation.