AWS Big Data Blog

Apply enterprise data governance and management using AWS Lake Formation and AWS IAM Identity Center

In today’s rapidly evolving digital landscape, enterprises across regulated industries face a critical challenge as they navigate their digital transformation journeys: effectively managing and governing data from legacy systems that are being phased out or replaced. This historical data, often containing valuable insights and subject to stringent regulatory requirements, must be preserved and made accessible to authorized users throughout the organization.

Failure to address this issue can lead to significant consequences, including data loss, operational inefficiencies, and potential compliance violations. Moreover, organizations are seeking solutions that not only safeguard this legacy data but also provide seamless access based on existing user entitlements, while maintaining robust audit trails and governance controls. As regulatory scrutiny intensifies and data volumes continue to grow exponentially, enterprises must develop comprehensive strategies to tackle these complex data management and governance challenges, making sure they can use their historical information assets while remaining compliant and agile in an increasingly data-driven business environment.

In this post, we explore a solution using AWS Lake Formation and AWS IAM Identity Center to address the complex challenges of managing and governing legacy data during digital transformation. We demonstrate how enterprises can effectively preserve historical data while enforcing compliance and maintaining user entitlements. This solution enables your organization to maintain robust audit trails, enforce governance controls, and provide secure, role-based access to data.

Solution overview

This is a comprehensive AWS based solution designed to address the complex challenges of managing and governing legacy data during digital transformation.

In this blog post, there are three personas:

  1. Data Lake Administrator (with admin level access)
  2. User Silver from the Data Engineering group
  3. User Lead Auditor from the Auditor group.

You will see how different personas in an organization can access the data without the need to modify their existing enterprise entitlements.

Note: Most of the steps here are performed by Data Lake Administrator, unless specifically mentioned for other federated/user logins. If the text specifies “You” to perform this step, then it assumes that you are a Data Lake administrator with admin level access.

In this solution you move your historical data into Amazon Simple Storage Service (Amazon S3) and apply data governance using Lake Formation. The following diagram illustrates the end-to-end solution.

The workflow steps are as follows:

  1. You will use IAM Identity Center to apply fine-grained access control through permission sets. You can integrate IAM Identity Center with an external corporate identity provider (IdP). In this post, we have used Microsoft Entra ID as an IdP, but you can use another external IdP like Okta.
  2. The data ingestion process is streamlined through a robust pipeline that combines AWS Database Migration service (AWS DMS) for efficient data transfer and AWS Glue for data cleansing and cataloging.
  3. You will use AWS LakeFormation to preserve existing entitlements during the transition. This makes sure the workforce users retain the appropriate access levels in the new data store.
  4. User personas Silver and Lead Auditor can use their existing IdP credentials to securely access the data using Federated access.
  5. For analytics, Amazon Athena provides a serverless query engine, allowing users to effortlessly explore and analyze the ingested data. Athena workgroups further enhance security and governance by isolating users, teams, applications, or workloads into logical groups.

The following sections walk through how to configure access management for two different groups and demonstrate how the groups access data using the permissions granted in Lake Formation.

Prerequisites

To follow along with this post, you should have the following:

  • An AWS account with IAM Identity Center enabled. For more information, see Enabling AWS IAM Identity Center.
  • Set up IAM Identity Center with Entra ID as an external IdP.
  • In this post, we use users and groups in Entra ID. We have created two groups: Data Engineering and Auditor. The user Silver belongs to the Data Engineering and Lead Auditor belongs to the Auditor.

Configure identity and access management with IAM Identity Center

Entra ID automatically provisions (synchronizes) the users and groups created in Entra ID into IAM Identity Center. You can validate this by examining the groups listed on the Groups page on the IAM Identity Center console. The following screenshot shows the group Data Engineering, which was created in Entra ID.

If you navigate to the group Data Engineering in IAM Identity Center, you should see the user Silver. Similarly, the group Auditor has the user Lead Auditor.

You now create a permission set, which will align to your workforce job role in IAM Identity Center. This makes sure that your workforce operates within the boundary of the permissions that you have defined for the user.

  1. On the IAM Identity Center console, choose Permission sets in the navigation pane.
  2. Click Create Permission set. Select Custom permission set and then click Next. In the next screen you will need to specify permission set details.
  3. Provide a permission set a name (for this post, Data-Engineer) while keeping rest of the option values to its default selection.
  4. To enhance security controls, attach the inline policy text described here to Data-Engineer permission set, to restrict the users’ access to certain Athena workgroups. This additional layer of access management makes sure that users can only operate within the designated workgroups, preventing unauthorized access to sensitive data or resources.

For this post, we are using separate Athena workgroups for Data Engineering and Auditors. Pick a meaningful workgroup name (for example, Data-Engineer, used in this post) which you will use during the Athena setup. Provide the AWS Region and account number in the following code with the values relevant to your AWS account.

arn:aws:athena:<region>:<youraccountnumber>:workgroup/Data-Engineer

Edit the inline policy for Data-Engineer permission set. Copy and paste the following JSON policy text, replace parameters for the arn as suggested earlier and save the policy.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "athena:ListEngineVersions",
        "athena:ListWorkGroups",
        "athena:ListDataCatalogs",
        "athena:ListDatabases",
        "athena:GetDatabase",
        "athena:ListTableMetadata",
        "athena:GetTableMetadata"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "athena:BatchGetQueryExecution",
        "athena:GetQueryExecution",
        "athena:ListQueryExecutions",
        "athena:StartQueryExecution",
        "athena:StopQueryExecution",
        "athena:GetQueryResults",
        "athena:GetQueryResultsStream",
        "athena:CreateNamedQuery",
        "athena:GetNamedQuery",
        "athena:BatchGetNamedQuery",
        "athena:ListNamedQueries",
        "athena:DeleteNamedQuery",
        "athena:CreatePreparedStatement",
        "athena:GetPreparedStatement",
        "athena:ListPreparedStatements",
        "athena:UpdatePreparedStatement",
        "athena:DeletePreparedStatement",
        "athena:UpdateNamedQuery",
        "athena:UpdateWorkGroup",
        "athena:GetWorkGroup",
        "athena:CreateWorkGroup"
      ],
      "Resource": [
        "arn:aws:athena:<region>:<youraccountnumber>:workgroup/Data-Engineer"
      ]
    },
    {
      "Sid": "BaseGluePermissions",
      "Effect": "Allow",
      "Action": [
        "glue:CreateDatabase",
        "glue:DeleteDatabase",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:UpdateDatabase",
        "glue:CreateTable",
        "glue:DeleteTable",
        "glue:BatchDeleteTable",
        "glue:UpdateTable",
        "glue:GetTable",
        "glue:GetTables",
        "glue:BatchCreatePartition",
        "glue:CreatePartition",
        "glue:DeletePartition",
        "glue:BatchDeletePartition",
        "glue:UpdatePartition",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:BatchGetPartition",
        "glue:StartColumnStatisticsTaskRun",
        "glue:GetColumnStatisticsTaskRun",
        "glue:GetColumnStatisticsTaskRuns"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Sid": "BaseQueryResultsPermissions",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads",
        "s3:ListMultipartUploadParts",
        "s3:AbortMultipartUpload",
        "s3:CreateBucket",
        "s3:PutObject",
        "s3:PutBucketPublicAccessBlock"
      ],
      "Resource": [
        "arn:aws:s3:::aws-athena-query-results-Data-Engineer"
      ]
    },
    {
      "Sid": "BaseSNSPermissions",
      "Effect": "Allow",
      "Action": [
        "sns:ListTopics",
        "sns:GetTopicAttributes"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Sid": "BaseCloudWatchPermissions",
      "Effect": "Allow",
      "Action": [
        "cloudwatch:PutMetricAlarm",
        "cloudwatch:DescribeAlarms",
        "cloudwatch:DeleteAlarms",
        "cloudwatch:GetMetricData"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Sid": "BaseLakeFormationPermissions",
      "Effect": "Allow",
      "Action": [
        "lakeformation:GetDataAccess"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}

The preceding inline policy restricts anyone mapped to Data-Engineer permission sets to only the Data-Engineer workgroup in Athena. The users with this permission set will not be able to access any other Athena workgroup.

Next, you assign the Data-Engineer permission set to the Data Engineering group in IAM Identity Center.

  1. Select AWS accounts in the navigation pane and then select the AWS account (for this post, workshopsandbox).
  2. Select Assign users and groups to choose your groups and permission sets. Choose the group Data Engineering from the list of Groups, then select Next. Choose the permission set Data-Engineer from the list of permission sets, then select Next. Finally review and submit.
  3. Follow the previous steps to create another permission set with the name Auditor.
  4. Use an inline policy similar to the preceding one to restrict access to a specific Athena workgroup for Auditor.
  5. Assign the permission set Auditor to the group Auditor.

This completes the first section of the solution. In the next section, we create the data ingestion and processing pipeline.

Create the data ingestion and processing pipeline

In this step, you create a source database and move the data to Amazon S3. Although the enterprise data often resides on premises, for this post, we create an Amazon Relational Database Service (Amazon RDS) for Oracle instance in a separate virtual private cloud (VPC) to mimic the enterprise setup.

  1. Create an RDS for Oracle DB instance and populate it with sample data. For this post, we use the HR schema, which you can find in Oracle Database Sample Schemas.
  2. Create source and target endpoints in AWS DMS:
    • The source endpoint demo-sourcedb points to the Oracle instance.
    • The target endpoint demo-targetdb is an Amazon S3 location where the relational database will be stored in Apache Parquet format.

The source database endpoint will have the configurations required to connect to the RDS for Oracle DB instance, as shown in the following screenshot.

The target endpoint for the Amazon S3 location will have an S3 bucket name and folder where the relational database will be stored. Additional connection attributes, like DataFormat, can be provided on the Endpoint settings tab. The following screenshot shows the configurations for demo-targetdb.

Set the DataFormat to Parquet for the stored data in the S3 bucket. Enterprise users can use Athena to query the data held in Parquet format.

Next, you use AWS DMS to transfer the data from the RDS for Oracle instance to Amazon S3. In large organizations, the source database could be located anywhere, including on premises.

  1. On the AWS DMS console, create a replication instance that will connect to the source database and move the data.

You need to carefully select the class of the instance. It should be proportionate to the volume of the data. The following screenshot shows the replication instance used in this post.

  1. Provide the database migration task with the source and target endpoints, which you created in the previous steps.

The following screenshot shows the configuration for the task datamigrationtask.

  1. After you create the migration task, select your task and start the job.

The full data load process will take a few minutes to complete.

You have data available in Parquet format, stored in an S3 bucket. To make this data accessible for analysis by your users, you need to create an AWS Glue crawler. The crawler will automatically crawl and catalog the data stored in your Amazon S3 location, making it available in Lake Formation.

  1. When creating the crawler, specify the S3 location where the data is stored as the data source.
  2. Provide the database name myappdb for the crawler to catalog the data into.
  3. Run the crawler you created.

After the crawler has completed its job, your users will be able to access and analyze the data in the AWS Glue Data Catalog with Lake Formation securing access.

  1. On the Lake Formation console, choose Databases in the navigation pane.

You will find mayappdb in the list of databases.

Configure data lake and entitlement access

With Lake Formation, you can lay the foundation for a robust, secure, and compliant data lake environment. Lake Formation plays a crucial role in our solution by centralizing data access control and preserving existing entitlements during the transition from legacy systems. This powerful service enables you to implement fine-grained permissions, so your workforce users retain appropriate access levels in the new data environment.

  1. On the Lake Formation console, choose Data lake locations in the navigation pane.
  2. Choose Register location to register the Amazon S3 location with Lake Formation so it can access Amazon S3 on your behalf.
  3. For Amazon S3 path, enter your target Amazon S3 location.
  4. For IAM role¸ keep the IAM role as AWSServiceRoleForLakeFormationDataAccess.
  5. For the Permission mode, select Lake Formation option to manage access.
  6. Choose Register location.

You can use tag-based access control to manage access to the database myappdb.

  1. Create an LF-Tag data classification with the following values:
    • General – To imply that the data is not sensitive in nature.
    • Restricted – To imply generally sensitive data.
    • HighlyRestricted – To imply that the data is highly restricted in nature and only accessible to certain job functions.

  2. Navigate to the database myappdb and on the Actions menu, choose Edit LF-Tags to assign an LF-Tag to the database. Choose Save to apply the change.

As shown in the following screenshot, we have assigned the value General to the myappdb database.

The database myappdb has 7 tables. For simplicity, we work with the table jobs in this post. We apply restrictions to the columns of this table so that its data is visible to only the users who are authorized to view the data.

  1. Navigate to the jobs table and choose Edit schema to add LF-Tags at the column level.
  2. Tag the value HighlyRestricted to the two columns min_salary and max_salary.
  3. Choose Save as new version to apply these changes.

The goal is to restrict access to these columns for all users except Auditor.

  1. Choose Databases in the navigation pane.
  2. Select your database and on the Actions menu, choose Grant to provide permissions to your enterprise users.
  3. For IAM users and roles, choose the role created by IAM Identity Center for the group Data Engineer.  Choose the IAM role with prefix AWSResrevedSSO_DataEngineer from the list. This role is created as a result of creating permission sets in IAM identity Center.
  4. In the LF-Tags section, select option Resources matched by LF-Tags. The choose Add LF-Tag key-value pair. Provide the LF-Tag key data classification and the values as General and Restricted. This grants the group of users (Data Engineer) to the database myappdb as long as the group is tagged with the values General and Restricted.
  5. In the Database permissions and Table permissions sections, select the specific permissions you want to give to the users in the group Data Engineering. Choose Grant to apply these changes.
  6. Repeat these steps to grant permissions to the role for the group Auditor. In this example, choose IAM role with prefix AWSResrevedSSO_Auditor and give the data classification LF-tag to all possible values.
  7. This grant implies that the personas logging in with the Auditor permission set will have access to the data that is tagged with the values General, Restricted, and Highly Restricted.

You have now completed the third section of the solution. In the next sections, we demonstrate how the users from two different groups—Data Engineer and Auditor—access data using the permissions granted in Lake Formation.

Log in with federated access using Entra ID

Complete the following steps to log in using federated access:

  1. On the IAM Identity Center console, choose Settings in the navigation pane.
  2. Locate the URL for the AWS access portal.
  3. Log in as the user Silver.
  4. Choose your job function Data-Engineer (this is the permission set from IAM Identity Center).

Perform data analytics and run queries in Athena

Athena serves as the final piece in our solution, working with Lake Formation to make sure individual users can only query the datasets they’re entitled to access. By using Athena workgroups, we create dedicated spaces for different user groups or departments, further reinforcing our access controls and maintaining clear boundaries between different data domains.

You can create Athena workgroup by navigating to Amazon Athena in AWS console.

  • Select Workgroups from left navigation and choose Create Workgroup.
  • On the next screen, provide workgroup name Data-Engineer and leave other fields as default values.
    • For the query result configuration, select the S3 location for the Data-Engineer workgroup.
  • Chose Create workgroup.

Similarly, create a workgroup for Auditors. Choose a separate S3 bucket for Athena Query results for each workgroup. Ensure that the workgroup name matches with the name used in arn string of the inline policy of the permission sets.

In this setup, users can only view and query tables that align with their Lake Formation granted entitlements. This seamless integration of Athena with our broader data governance strategy means that as users explore and analyze data, they’re doing so within the strict confines of their authorized data scope.

This approach not only enhances our security posture but also streamlines the user experience, eliminating the risk of inadvertent access to sensitive information while empowering users to derive insights efficiently from their relevant data subsets.

Let’s explore how Athena provides this powerful, yet tightly controlled, analytical capability to our organization.

When user Silver accesses Athena, they’re redirected to the Athena console. According to the inline policy in the permission set, they have access to the Data-Engineer workgroup only.

After they select the correct workgroup Data-Engineer from the Workgroup drop-down menu and the myapp database, it displays all columns except two columns. The min_sal and max_sal columns that were tagged as HighlyRestricted are not displayed.

This outcome aligns with the permissions granted to the Data-Engineer group in Lake Formation, making sure that sensitive information remains protected.

If you repeat the same steps for federated access and log in as Lead Auditor, you’re similarly redirected to the Athena console. In accordance with the inline policy in the permission set, they have access to the Auditor workgroup only.

When they select the correct workgroup Auditor from the Workgroup dropdown menu and the myappdb database, the job table will display all columns.

This behavior aligns with the permissions granted to the Auditor workgroup in Lake Formation, making sure all information is accessible to the group Auditor.

Enabling users to access only the data they are entitled to based on their existing permissions is a powerful capability. Large organizations often want to store data without having to modify queries or adjust access controls.

This solution enables seamless data access while maintaining data governance standards by allowing users to use their current permissions. The selective accessibility helps balance organizational needs for storage and data compliance. Companies can store data without compromising different environments or sensitive information.

This granular level of access within data stores is a game changer for regulated industries or businesses seeking to manage data responsibly.

Clean up

To clean up the resources that you created for this post and avoid ongoing charges, delete the following:

  • IAM Identity Center application in Entra ID
  • IAM Identity Center configurations
  • RDS for Oracle and DMS replication instances.
  • Athena workgroups and the query results in Amazon S3
  • S3 buckets

Conclusion

This AWS powered solution tackles the critical challenges of preserving, safeguarding, and scrutinizing historical data in a scalable and cost-efficient way. The centralized data lake, reinforced by robust access controls and self-service analytics capabilities, empowers organizations to maintain their invaluable data assets while enabling authorized users to extract valuable insights from them.

By harnessing the combined strength of AWS services, this approach addresses key difficulties related to legacy data retention, security, and analysis. The centralized repository, coupled with stringent access management and user-friendly analytics tools, enables enterprises to safeguard their critical information resources while simultaneously empowering sanctioned personnel to derive meaningful intelligence from these data sources.

If your organization grapples with similar obstacles surrounding the preservation and management of data, we encourage you to explore this solution and evaluate how it could potentially benefit your operations.

For more information on Lake Formation and its data governance features, refer to AWS Lake Formation Features.


About the authors

Manjit Chakraborty is a Senior Solutions Architect at AWS. He is a Seasoned & Result driven professional with extensive experience in Financial domain having worked with customers on advising, designing, leading, and implementing core-business enterprise solutions across the globe. In his spare time, Manjit enjoys fishing, practicing martial arts and playing with his daughter.

Neeraj Roy is a Principal Solutions Architect at AWS based out of London. He works with Global Financial Services customers to accelerate their AWS journey. In his spare time, he enjoys reading and spending time with his family.

Evren Sen is a Principal Solutions Architect at AWS, focusing on strategic financial services customers. He helps his customers create Cloud Center of Excellence and design, and deploy solutions on the AWS Cloud. Outside of AWS, Evren enjoys spending time with family and friends, traveling, and cycling.