AWS Machine Learning Blog

Control and audit data exploration activities with Amazon SageMaker Studio and AWS Lake Formation

May 2024: This post was reviewed and updated to use a new dataset, reflect the updated Studio experience and AWS IAM Identity Center.

Certain industries are required to audit all access to their data. This includes auditing exploratory activities performed by data scientists, who usually query data from within machine learning (ML) notebooks.

This post walks you through the steps to implement access control and auditing capabilities on a per-user basis, using Amazon SageMaker Studio notebooks and AWS Lake Formation access control policies. This how-to guide is based on the Machine Learning Lens for the AWS Well-Architected Framework, following the design principles described in the Security Pillar:

  • Restrict access to ML systems
  • Ensure data governance
  • Enforce data lineage
  • Enforce regulatory compliance

This post provides guidance for customers already using AWS Identity and Access Management (IAM) users and groups to manage identities, and also for customers using AWS IAM Identity Center. Note, however, that our best practice for identity management is to use IAM Identity Center, or federation with IAM roles, so that users access AWS accounts using temporary credentials.

Overview of solution

This implementation uses Amazon Athena and the PyAthena client on a Studio notebook to query data on a data lake registered with Lake Formation.

Studio is the first fully integrated development environment (IDE) for ML. Studio provides a single, web-based visual interface where you can perform all the steps required to build, train, and deploy ML models. Studio notebooks are collaborative notebooks that you can launch quickly, without setting up compute instances or file storage beforehand.

Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL. Athena is serverless, so there is no infrastructure to set up or manage, and you pay only for the queries you run.

Lake Formation is a fully managed service that makes it easier for you to build, secure, and manage data lakes. Lake Formation simplifies and automates many of the complex manual steps that are usually required to create data lakes, including securely making that data available for analytics and ML.

For an existing data lake registered with Lake Formation, the following diagram illustrates the proposed implementation.

The workflow includes the following steps:

  1. Data scientists access the AWS Management Console using their identity, which can be an IAM Identity Center user name, a federated identity with an IAM role (both options align with our best practice of using temporary credentials to access AWS accounts), or an IAM user account belonging to an IAM group. On the console, data scientists open Studio using individual user profiles. Each user profile has an associated execution role, which the user assumes while working on a Studio notebook.

The diagram depicts two data scientists who require different permissions over data in the data lake. For example, in a data lake containing personally identifiable information (PII), Data Scientist 1 has full access to every table in the Data Catalog, whereas Data Scientist 2 has limited access to a subset of tables (or columns) containing non-PII data.

  1. The Studio notebook is associated with a SageMaker JupyterLab application. The PyAthena client allows you to run exploratory ANSI SQL queries on the data lake through Athena, using the execution role assumed by the user while working with Studio.
  2. Athena sends a data access request to Lake Formation, with the user profile execution role as principal. Data permissions in Lake Formation offer database-, table-, and column-level access control, restricting access to metadata and the corresponding data stored in Amazon S3. Lake Formation generates short-term credentials to be used for data access, and tells Athena which columns the principal is allowed to access.
  3. Athena uses the short-term credentials provided by Lake Formation to access the data lake storage in Amazon S3, and retrieves the data matching the SQL query. Before returning the query result, Athena filters out columns that aren’t included in the data permissions reported by Lake Formation.
  4. Athena returns the SQL query result to the Studio notebook.
  5. Lake Formation records data access requests and other activity history for the registered data lake locations. AWS CloudTrail also records these and other API calls made to AWS during the entire flow, including Athena query requests.
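To make the query path in this workflow more concrete, the following is a minimal sketch of how a Studio notebook might query the data lake with PyAthena. It assumes the query results bucket follows the naming pattern used in the execution policy later in this post. The helper names (staging_dir, build_query, run_query) are hypothetical and aren't part of PyAthena.

```python
def staging_dir(region: str, account_id: str) -> str:
    """S3 location for Athena query results (bucket pattern assumed from this post's policy)."""
    return f"s3://sagemaker-audit-control-query-results-{region}-{account_id}/"


def build_query(database: str, table: str, limit: int = 10) -> str:
    """Simple exploratory query; Lake Formation decides which columns are returned."""
    return f'SELECT * FROM "{database}"."{table}" LIMIT {limit}'


def run_query(region: str, account_id: str, database: str, table: str):
    """Run the query through Athena with the credentials of the notebook's execution role."""
    # Imported here so the helpers above can be used without PyAthena installed.
    from pyathena import connect

    cursor = connect(
        s3_staging_dir=staging_dir(region, account_id),
        region_name=region,
    ).cursor()
    cursor.execute(build_query(database, table))
    columns = [col[0] for col in cursor.description]
    return columns, cursor.fetchall()
```

Because the notebook never passes explicit credentials, the query runs as the user profile's execution role, which is the principal that Lake Formation evaluates and that appears in the audit trail.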

Walkthrough overview

In this walkthrough, we show you how to implement access control and audit using a Studio notebook and Lake Formation. You perform the following activities:

  1. Register a new database in Lake Formation.
  2. Create the required IAM resources.
  3. Grant data permissions with Lake Formation.
  4. Set up Studio.
  5. Test Lake Formation access control policies using a Studio notebook.
  6. Audit data access activity with Lake Formation and CloudTrail.

If you prefer to skip the initial setup activities and jump directly to testing and auditing, you can download the AWS CloudFormation template and deploy it in a Region that supports Studio and Lake Formation. When deploying the template, provide the following parameters:

  • Authentication method for Studio:
    • IAM with IAM users – Suitable when using IAM users and groups to manage identities.
    • IAM with AWS account federation (external IdP) – Suitable when using IAM Identity Center to manage access into AWS accounts with temporary credentials, which aligns with our best practices for managing identities.
  • Studio profile name for a data scientist with full access to the dataset. The default name is data-scientist-full. If you’re using IAM users as your authentication method, an IAM user with the same name is also created. The password for the user is created automatically and stored as a secret in AWS Secrets Manager.
  • Studio profile name for a data scientist with limited access to the dataset. The default user name is data-scientist-limited. If you’re using IAM users for authentication, an IAM user with the same name is also created. The password for the user is created automatically and stored as a secret in Secrets Manager.
  • Names for the database and table to be created for the dataset. The default names are ecommerce_reviews_db and ecommerce_reviews, respectively.
  • VPC and subnets that are used by Studio.

If you use IAM Identity Center and decide to deploy the CloudFormation template, after the CloudFormation stack is complete, you must follow the sections IAM resources for authentication using federation and Create the required IAM Identity Center permission set in this post. Then you can go directly to the section Test Lake Formation access control policies.

If you use IAM users and groups to manage identities and deploy the CloudFormation template, after the CloudFormation stack is complete, you can go directly to the section Test Lake Formation access control policies in this post.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account.
  • A data lake set up in Lake Formation with a Lake Formation administrator. For general guidance on how to set up Lake Formation, see Getting started with AWS Lake Formation.
  • Basic knowledge of creating IAM policies, roles, users, and groups.
  • If using IAM Identity Center, knowledge of creating IAM Identity Center permission sets and assigning them to users and groups.

Register a new database in Lake Formation

For this post, we use the Women’s E-Commerce Clothing Reviews Dataset to demonstrate how to provide granular access to the data lake for different data scientists. If you already have a dataset registered with Lake Formation that you want to use, you can skip this section and go to Create required IAM resources.

To register the Women’s E-Commerce Clothing Reviews Dataset in Lake Formation, complete the following steps:

  1. Sign in to the console with the credentials associated with a Lake Formation administrator, based on your authentication method (IAM, IAM Identity Center, or federation with an external IdP).
  2. On the Lake Formation console, open the navigation pane by choosing the menu icon (three horizontal lines) at the top left of the console. Under Data catalog, choose Databases.
  3. Choose Create Database.
  4. In Database details, select Database to create the database in your own account.
  5. For Name, enter a name for the database, such as ecommerce_reviews_db.
  6. For Location, enter s3://sagemaker-example-files-prod-us-east-1/datasets/tabular/Women’s_clothing_ecommerce/.
  7. Under Default permissions for newly created tables, make sure to clear the option Use only IAM access control for new tables in this database.

  1. Choose Create database.
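If you prefer to script these console steps, the following is a minimal sketch using the AWS Glue API through boto3. It assumes that creating the database programmatically is acceptable in your environment. The function names are hypothetical, and the empty CreateTableDefaultPermissions list is the API counterpart of clearing the Use only IAM access control option.

```python
def database_input(name: str, location_uri: str) -> dict:
    """Build the DatabaseInput structure for glue.create_database."""
    return {
        "Name": name,
        "LocationUri": location_uri,
        # An empty list means new tables do not get the default
        # IAMAllowedPrincipals grant, so Lake Formation permissions apply.
        "CreateTableDefaultPermissions": [],
    }


def create_database(name: str, location_uri: str, region: str) -> None:
    """Create the database in the Data Catalog of the current account."""
    # Imported here so database_input can be used without boto3 installed.
    import boto3

    glue = boto3.client("glue", region_name=region)
    glue.create_database(DatabaseInput=database_input(name, location_uri))
```

You would call create_database with the database name and the Amazon S3 location entered in the previous steps, using the credentials of a Lake Formation administrator.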

The Women’s E-Commerce Clothing Reviews Dataset is currently available in CSV format. To create a table in the data lake for the CSV dataset, you can use an AWS Glue crawler or manually create the table using queries in Athena.

  1. On the Athena console, select Query your data with Trino SQL, then choose Launch query editor.

If you haven’t specified a query result location before, follow the instructions in Specifying a Query Result Location.

  1. Choose the data source AwsDataCatalog.
  2. Choose the database created in the previous step.
  3. In the query editor, enter the following query:
    CREATE EXTERNAL TABLE IF NOT EXISTS ecommerce_reviews(
     ID string,
     Clothing_ID string,
     Age string,
     Title string,
     Review_Text string,
     Rating string,
     Recommended_IND string,
     Positive_Feedback_Count string,
     Division_Name string,
     Department_Name string,
     Class_Name string)
    ROW FORMAT SERDE 
     'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
    WITH SERDEPROPERTIES ( 
     'escapeChar'='\\', 
     'separatorChar'=',') 
    LOCATION 's3://sagemaker-example-files-prod-us-east-1/datasets/tabular/Women’s_clothing_ecommerce/' 
    TBLPROPERTIES (
     'has_encrypted_data'='false',
     'skip.header.line.count'='1'
    );
  4. Choose Run query.

You should receive a Query successful response when the table is created.

  1. On the Lake Formation console, in the navigation pane, under Data catalog, choose Tables.
  2. Under Tables, in the search bar enter the table name ecommerce_reviews.
  3. Verify that you can see the table details.

  1. Scroll down to see the table schema.

Finally, you register the database location with Lake Formation so the service can start enforcing data permissions on the database.

  1. On the Lake Formation console, in the navigation pane, under Administration, choose Data lake locations.
  2. On the Data lake locations page, choose Register location.
  3. For Amazon S3 path, enter s3://sagemaker-example-files-prod-us-east-1/datasets/tabular/Women’s_clothing_ecommerce/.
  4. For IAM role, you can keep the default role.
  5. For Permission mode, leave Hybrid access mode – new selected.
  6. Choose Register location.
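The same registration can be sketched with boto3. This is an illustration under assumptions rather than the method used by the console: it assumes the default role corresponds to the Lake Formation service-linked role, and that your boto3 version supports the HybridAccessEnabled parameter of register_resource. The helper names are hypothetical.

```python
def s3_uri_to_arn(s3_uri: str) -> str:
    """Convert an s3://bucket/prefix/ URI into the ARN form expected by Lake Formation."""
    prefix = "s3://"
    if not s3_uri.startswith(prefix):
        raise ValueError(f"Not an S3 URI: {s3_uri}")
    return "arn:aws:s3:::" + s3_uri[len(prefix):].rstrip("/")


def register_location(s3_uri: str, region: str) -> None:
    """Register the data lake location so Lake Formation can vend credentials for it."""
    # Imported here so s3_uri_to_arn can be used without boto3 installed.
    import boto3

    lakeformation = boto3.client("lakeformation", region_name=region)
    lakeformation.register_resource(
        ResourceArn=s3_uri_to_arn(s3_uri),
        UseServiceLinkedRole=True,   # assumed equivalent of keeping the default role
        HybridAccessEnabled=True,    # assumed equivalent of Hybrid access mode
    )
```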

Create required IAM resources

To demonstrate how you can provide differentiated access to the dataset registered in the previous step, you first need to create IAM policies and roles. If you’re using IAM users for authentication, you also need to create a group and users. The implementation leverages attribute-based access control (ABAC) to define IAM permissions.

IAM resources for authentication using federation

The following diagram illustrates the resources you configure in this section if using federated identities with IAM Identity Center (aligned with our best practice of using temporary credentials to access AWS accounts).

In this section, you complete the following high-level steps for users authenticated into IAM Identity Center, assuming the users are utilizing their Microsoft Active Directory (AD) email credentials:

  1. Create a permission set named DataScientist and assign IAM Identity Center access into your AWS account to the users data-scientist-full@domain and data-scientist-limited@domain to control their federated access to the console and to Studio.
  2. Add a custom inline policy to the permission set.

The policy allows users in the group to access Studio, but only using a SageMaker user profile with a tag that matches their AD user name. The AD user name can be sent as an attribute from an external identity provider into Identity Center, and then used for access control. The policy also denies the use of SageMaker notebook instances, allowing Studio notebooks only.

  1. For each Active Directory user, create individual IAM roles, which are used as user profile execution roles in Studio later.

The naming convention for these roles consists of a common prefix followed by the corresponding AD user name. This allows you to audit activities on Studio notebooks, which are logged using Studio’s execution roles, and trace them back to the individual users who performed the activities. For this post, we use the prefix SageMakerStudioExecutionRole_.

  1. Create a managed policy named SageMakerUserProfileExecutionPolicy and assign it to each of the IAM roles.

The policy establishes coarse-grained access permissions to the data lake.
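To illustrate why the role naming convention matters for auditing, the following sketch derives a role name from a user name and recovers the user name from the principal ARN of an assumed role, as it might appear in an audit record. The function names and the example session name are hypothetical; only the prefix comes from this post.

```python
import re
from typing import Optional

ROLE_PREFIX = "SageMakerStudioExecutionRole_"

# Matches ARNs such as
# arn:aws:sts::123456789012:assumed-role/SageMakerStudioExecutionRole_<user>/<session>
_ASSUMED_ROLE = re.compile(
    r"^arn:aws:sts::\d{12}:assumed-role/" + re.escape(ROLE_PREFIX) + r"(?P<user>[^/]+)/"
)


def execution_role_name(user_name: str) -> str:
    """Role name for a given user, following the convention in this post."""
    return ROLE_PREFIX + user_name


def user_from_principal_arn(arn: str) -> Optional[str]:
    """Return the user name encoded in an assumed-role ARN, or None if it doesn't match."""
    match = _ASSUMED_ROLE.match(arn)
    return match.group("user") if match else None
```

With one role per user, any logged activity that carries the role ARN can be mapped back to a single data scientist without additional lookups.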

IAM resources for authentication using IAM

The following diagram illustrates the resources you configure in this section if using IAM users for authentication.

In this section, you complete the following high-level steps:

  1. Create an IAM group named DataScientists containing two users, data-scientist-full and data-scientist-limited, to control their access to the console and to Studio.
  2. Create a managed policy named DataScientistGroupPolicy and assign it to the group.

The policy allows users in the group to access Studio, but only using a SageMaker user profile with a tag that matches their IAM user name. It also denies the use of SageMaker notebook instances, allowing Studio notebooks only.

  1. For each IAM user, create individual IAM roles, which are used as user profile execution roles in Studio later.

The naming convention for these roles consists of a common prefix followed by the corresponding IAM user name. This allows you to audit activities on Studio notebooks—which are logged using Studio’s execution roles—and trace them back to the individual IAM users who performed the activities. For this post, we use the prefix SageMakerStudioExecutionRole_.

  1. Create a managed policy named SageMakerUserProfileExecutionPolicy and assign it to each of the IAM roles.

The policy establishes coarse-grained access permissions to the data lake.

Follow the remainder of this section to create the IAM resources described, depending on whether you use federated identities with Identity Center or IAM users. The permissions configured in this section grant common, coarse-grained access to data lake resources for all the IAM roles. In a later section, you use Lake Formation to establish fine-grained access permissions to Data Catalog resources and Amazon S3 locations for individual roles.

Create the required IAM Identity Center permission set (only for authentication using federation)

To create your permission set and assign it to your data scientists, complete the following steps:

  1. Sign in to the console using an IAM principal with permissions to create permission sets and assign access to users and groups into your AWS account.
  2. If using AWS Managed Microsoft AD directory, on the IAM Identity Center console, verify that the user attribute email is mapped to the attribute ${dir:windowsUpn} in Active Directory.
  3. On the IAM Identity Center console, enable attributes for access control and select the mapped attribute.
    1. On the Attributes for access control page, for Key, enter studiouserid.
    2. For Value (optional), choose or enter ${user:email}.
  4. Create a custom permission set named DataScientist, based on custom permissions. Under Create a custom permissions policy, use the following JSON policy document to provide permissions:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": [
                    "sagemaker:DescribeDomain",
                    "sagemaker:ListDomains",
                    "sagemaker:ListUserProfiles",
                    "sagemaker:ListApps"
                ],
                "Resource": "*",
                "Effect": "Allow",
                "Sid": "AmazonSageMakerStudioReadOnly"
            },
            {
                "Action": "sagemaker:AddTags",
                "Resource": "*",
                "Effect": "Allow",
                "Sid": "AmazonSageMakerAddTags"
            },
            {
                "Condition": {
                    "StringEquals": {
                        "sagemaker:ResourceTag/studiouserid": "${aws:PrincipalTag/studiouserid}"
                    }
                },
                "Action": [
                    "sagemaker:CreatePresignedDomainUrl",
                    "sagemaker:DescribeUserProfile"
                ],
                "Resource": "*",
                "Effect": "Allow",
                "Sid": "AmazonSageMakerAllowedUserProfile"
            },
            {
                "Condition": {
                    "StringNotEquals": {
                        "sagemaker:ResourceTag/studiouserid": "${aws:PrincipalTag/studiouserid}"
                    }
                },
                "Action": [
                    "sagemaker:CreatePresignedDomainUrl",
                    "sagemaker:DescribeUserProfile"
                ],
                "Resource": "*",
                "Effect": "Deny",
                "Sid": "AmazonSageMakerDeniedUserProfiles"
            },
            {
                "Action": [
                    "sagemaker:CreatePresignedNotebookInstanceUrl",
                    "sagemaker:*NotebookInstance",
                    "sagemaker:*NotebookInstanceLifecycleConfig",
                    "sagemaker:CreateUserProfile",
                    "sagemaker:DeleteDomain",
                    "sagemaker:DeleteUserProfile"
                ],
                "Resource": "*",
                "Effect": "Deny",
                "Sid": "AmazonSageMakerDeniedServices"
            }
        ]
    }
    

The policy allows users to access Studio, but only using a SageMaker user profile with a tag that matches their Active Directory user name. It also denies the use of SageMaker notebook instances, allowing Studio notebooks only.
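The tag-matching logic of this policy can be pictured with a toy model. The following sketch is not the IAM policy evaluation engine. It only mirrors the intent of the AmazonSageMakerAllowedUserProfile and AmazonSageMakerDeniedUserProfiles statements, and it assumes that a missing tag is treated as a mismatch. The function name is hypothetical.

```python
from typing import Optional


def can_open_profile(principal_tag: Optional[str], profile_tag: Optional[str]) -> bool:
    """Toy model: access is granted only when both studiouserid values exist and match.

    In every other case the request ends up denied, either by the explicit
    Deny statement or because no Allow statement applies.
    """
    return principal_tag is not None and principal_tag == profile_tag
```

For example, a user whose studiouserid attribute is data-scientist-full@domain can open only the user profile tagged with that same value.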

  1. Assign access into your AWS account to a group containing the data scientist users.
    1. On the Select users or groups page, enter a group name containing the data scientist users in your connected directory.
    2. On the Select permission sets page, select the DataScientist permission set.

Create the required IAM group and users (only for authentication using IAM)

To create your group and users, complete the following steps:

  1. Sign in to the console using an IAM user with permissions to create groups, users, roles, and policies.
  2. On the IAM console, create a new IAM managed policy named DataScientistGroupPolicy. On the JSON tab, use the following JSON policy document to provide permissions:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowStudioReadOnly",
                "Action": [
                    "sagemaker:DescribeDomain",
                    "sagemaker:ListDomains",
                    "sagemaker:ListUserProfiles",
                    "sagemaker:ListImage*",
                    "sagemaker:DescribeSpace",
                    "sagemaker:ListSpaces",
                    "sagemaker:ListTags",
                    "sagemaker:DescribeApp",
                    "sagemaker:ListApps"
                ],
                "Resource": "*",
                "Effect": "Allow"
            },
            {
                "Sid" : "AllowAddTagsForApp",
                "Effect" : "Allow",
                "Action" : [
                  "sagemaker:AddTags"
                ],
                "Resource" : [
                  "arn:aws:sagemaker:*:*:app/*"
                ]
              },
            {
                "Sid": "AmazonSageMakerAllowedUserProfile",
                "Action": [
                    "sagemaker:DescribeUserProfile",
                    "sagemaker:CreatePresignedDomainUrl"
                ],
                "Resource": "arn:aws:sagemaker:*:*:user-profile/${sagemaker:DomainId}/${sagemaker:UserProfileName}",
                "Effect": "Allow"
            },
            {
                "Action": "sagemaker:DescribeUserProfile",
                "Effect": "Deny",
                "NotResource": "arn:aws:sagemaker:*:*:user-profile/${sagemaker:DomainId}/${sagemaker:UserProfileName}",
                "Sid": "AmazonSageMakerDeniedUserProfiles"
            },
            {
                "Sid" : "RestrictMutatingActionsOnSpacesToOwnerUserProfile",
                "Effect" : "Allow",
                "Action" : [
                  "sagemaker:CreateSpace",
                  "sagemaker:UpdateSpace",
                  "sagemaker:DeleteSpace"
                ],
                "Resource" : "arn:aws:sagemaker:*:*:space/${sagemaker:DomainId}/*",
                "Condition" : {
                  "ArnLike" : {
                    "sagemaker:OwnerUserProfileArn" : "arn:aws:sagemaker:*:*:user-profile/${sagemaker:DomainId}/${sagemaker:UserProfileName}"
                  },
                  "StringEquals" : {
                    "sagemaker:SpaceSharingType" : [
                      "Private",
                      "Shared"
                    ]
                  }
                }
              },
              {
                "Sid" : "RestrictMutatingActionsOnPrivateSpaceAppsToOwnerUserProfile",
                "Effect" : "Allow",
                "Action" : [
                  "sagemaker:CreateApp",
                  "sagemaker:DeleteApp"
                ],
                "Resource" : "arn:aws:sagemaker:*:*:app/${sagemaker:DomainId}/*/*/*",
                "Condition" : {
                  "ArnLike" : {
                    "sagemaker:OwnerUserProfileArn" : "arn:aws:sagemaker:*:*:user-profile/${sagemaker:DomainId}/${sagemaker:UserProfileName}"
                  },
                  "StringEquals" : {
                    "sagemaker:SpaceSharingType" : [
                      "Private"
                    ]
                  }
                }
              },
            {
                "Action": [
                    "lakeformation:GetDataAccess",
                    "glue:GetTable",
                    "glue:GetTables",
                    "glue:SearchTables",
                    "glue:GetDatabase",
                    "glue:GetDatabases",
                    "glue:GetPartitions"
                ],
                "Resource": "*",
                "Effect": "Allow",
                "Sid": "LakeFormationPermissions"
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:CreateBucket",
                    "s3:GetObject",
                    "s3:PutObject"
                ],
                "Resource": [
                    "arn:aws:s3:::sagemaker-audit-control-query-results-<aws region>-<account id>",
                    "arn:aws:s3:::sagemaker-audit-control-query-results-<aws region>-<account id>/*"
                ]
            },
            {
                "Action": "iam:PassRole",
                "Resource": "*",
                "Effect": "Allow",
                "Sid": "AmazonSageMakerStudioIAMPassRole"
            },
            {
                "Action": "sts:AssumeRole",
                "Resource": "*",
                "Effect": "Deny",
                "Sid": "DenyAssummingOtherIAMRoles"
            }
        ]
    }

The policy allows users in the group to access Studio, but only using a SageMaker user profile with a tag that matches their IAM user name. It also denies the use of SageMaker notebook instances, allowing Studio notebooks only.

  1. Create an IAM group.
    1. For Group name, enter DataScientists.
    2. Search for and attach the AWS managed policy named DataScientist and the IAM policy created in the previous step.
  2. Create two IAM users named data-scientist-full and data-scientist-limited.

Alternatively, you can provide names of your choice, as long as they’re a combination of lowercase letters, numbers, and hyphen (-). Later, you also give these names to their corresponding SageMaker user profiles, which at the time of writing only support those characters.
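If you choose your own names, a quick check such as the following sketch can catch invalid characters before you create the users and profiles. It encodes only the rule stated above (lowercase letters, numbers, and hyphens). The function name is hypothetical, and the service may enforce additional constraints such as length limits.

```python
import re

# Only lowercase letters, digits, and hyphens, as described in this post.
_VALID_NAME = re.compile(r"[a-z0-9-]+")


def is_valid_profile_name(name: str) -> bool:
    """Return True if the name uses only the characters allowed in this walkthrough."""
    return bool(_VALID_NAME.fullmatch(name))
```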

Create the required IAM roles

To create your roles, complete the following steps:

  1. On the IAM console, create a new managed policy named SageMakerUserProfileExecutionPolicy. Use the following JSON policy document to provide permissions, providing your AWS Region and AWS account ID:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowStudioReadOnly",
            "Action": [
                "sagemaker:DescribeDomain",
                "sagemaker:ListDomains",
                "sagemaker:ListUserProfiles",
                "sagemaker:ListImage*",
                "sagemaker:DescribeSpace",
                "sagemaker:ListSpaces",
                "sagemaker:ListTags",
                "sagemaker:DescribeApp",
                "sagemaker:ListApps"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Sid" : "AllowAddTagsForApp",
            "Effect" : "Allow",
            "Action" : [
              "sagemaker:AddTags"
            ],
            "Resource" : [
              "arn:aws:sagemaker:*:*:app/*"
            ]
          },
        {
            "Sid": "AmazonSageMakerAllowedUserProfile",
            "Action": [
                "sagemaker:DescribeUserProfile",
                "sagemaker:CreatePresignedDomainUrl"
            ],
            "Resource": "arn:aws:sagemaker:*:*:user-profile/${sagemaker:DomainId}/${sagemaker:UserProfileName}",
            "Effect": "Allow"
        },
        {
            "Action": "sagemaker:DescribeUserProfile",
            "Effect": "Deny",
            "NotResource": "arn:aws:sagemaker:*:*:user-profile/${sagemaker:DomainId}/${sagemaker:UserProfileName}",
            "Sid": "AmazonSageMakerDeniedUserProfiles"
        },
        {
            "Sid" : "RestrictMutatingActionsOnSpacesToOwnerUserProfile",
            "Effect" : "Allow",
            "Action" : [
              "sagemaker:CreateSpace",
              "sagemaker:UpdateSpace",
              "sagemaker:DeleteSpace"
            ],
            "Resource" : "arn:aws:sagemaker:*:*:space/${sagemaker:DomainId}/*",
            "Condition" : {
              "ArnLike" : {
                "sagemaker:OwnerUserProfileArn" : "arn:aws:sagemaker:*:*:user-profile/${sagemaker:DomainId}/${sagemaker:UserProfileName}"
              },
              "StringEquals" : {
                "sagemaker:SpaceSharingType" : [
                  "Private",
                  "Shared"
                ]
              }
            }
          },
          {
            "Sid" : "RestrictMutatingActionsOnPrivateSpaceAppsToOwnerUserProfile",
            "Effect" : "Allow",
            "Action" : [
              "sagemaker:CreateApp",
              "sagemaker:DeleteApp"
            ],
            "Resource" : "arn:aws:sagemaker:*:*:app/${sagemaker:DomainId}/*/*/*",
            "Condition" : {
              "ArnLike" : {
                "sagemaker:OwnerUserProfileArn" : "arn:aws:sagemaker:*:*:user-profile/${sagemaker:DomainId}/${sagemaker:UserProfileName}"
              },
              "StringEquals" : {
                "sagemaker:SpaceSharingType" : [
                  "Private"
                ]
              }
            }
          },
        {
            "Action": [
                "lakeformation:GetDataAccess",
                "glue:GetTable",
                "glue:GetTables",
                "glue:SearchTables",
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetPartitions"
            ],
            "Resource": "*",
            "Effect": "Allow",
            "Sid": "LakeFormationPermissions"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:CreateBucket",
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::sagemaker-audit-control-query-results-<aws region>-<account id>",
                "arn:aws:s3:::sagemaker-audit-control-query-results-<aws region>-<account id>/*"
            ]
        },
        {
            "Action": "iam:PassRole",
            "Resource": "*",
            "Effect": "Allow",
            "Sid": "AmazonSageMakerStudioIAMPassRole"
        },
        {
            "Action": "sts:AssumeRole",
            "Resource": "*",
            "Effect": "Deny",
            "Sid": "DenyAssummingOtherIAMRoles"
        }
    ]
}

This policy provides limited IAM permissions to Studio. For more information on recommended policies for team groups in Studio, see Configuring Amazon SageMaker Studio for teams and groups with complete resource isolation. The policy also provides common coarse-grained IAM permissions to the data lake, leaving Lake Formation permissions to control access to Data Catalog resources and Amazon S3 locations for individual users and roles. This is the recommended method for granting access to data in Lake Formation. For more information, see Methods for Fine-Grained Access Control.

  1. Create an IAM role for SageMaker for the first data scientist (data-scientist-full), which is used as the corresponding user profile’s execution role.
    1. On the Attach permissions policy page, the AWS managed policy AmazonSageMakerFullAccess is attached by default. You remove this policy later, to maintain least privilege.
    2. For Tags, add the key userprofilename and the value data-scientist-full.
    3. For Role name, use the naming convention introduced at the beginning of this section to name the role SageMakerStudioExecutionRole_data-scientist-full.
  2. To add the remaining policies, on the Roles page, choose the role name you just created.
  3. Under Permissions, remove the policy AmazonSageMakerFullAccess.
  4. Choose Attach policies.
  5. Search for and select the SageMakerUserProfileExecutionPolicy and AmazonAthenaFullAccess policies.
  6. Choose Attach policy.
  7. Repeat the previous steps to create an IAM role for the second data scientist (data-scientist-limited).
    1. For Tags, add the key userprofilename and the value data-scientist-limited.
    2. For Role name, use the naming convention, such as SageMakerStudioExecutionRole_data-scientist-limited.
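The role creation steps above can also be sketched with boto3. This is an illustrative outline and not the exact configuration produced by the console. It assumes a trust policy that lets the SageMaker service assume the role, and that SageMakerUserProfileExecutionPolicy already exists in your account. The function names are hypothetical.

```python
import json


def trust_policy() -> dict:
    """Trust policy allowing SageMaker to assume the execution role (assumed console default)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "sagemaker.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def role_request(user_name: str) -> dict:
    """Arguments for iam.create_role, using this post's naming and tagging convention."""
    return {
        "RoleName": f"SageMakerStudioExecutionRole_{user_name}",
        "AssumeRolePolicyDocument": json.dumps(trust_policy()),
        "Tags": [{"Key": "userprofilename", "Value": user_name}],
    }


def create_execution_role(user_name: str, account_id: str) -> None:
    """Create the role and attach the two policies named in the steps above."""
    # Imported here so the builders above can be used without boto3 installed.
    import boto3

    iam = boto3.client("iam")
    request = role_request(user_name)
    iam.create_role(**request)
    for policy_arn in (
        f"arn:aws:iam::{account_id}:policy/SageMakerUserProfileExecutionPolicy",
        "arn:aws:iam::aws:policy/AmazonAthenaFullAccess",
    ):
        iam.attach_role_policy(RoleName=request["RoleName"], PolicyArn=policy_arn)
```

Creating the role this way never attaches AmazonSageMakerFullAccess, so there is no policy to remove afterward.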

Grant data permissions with Lake Formation

Before data scientists can work on a Studio notebook, you grant the individual execution roles created in the previous section access to the Women’s E-Commerce Clothing Reviews Dataset (or your own dataset). For this post, we implement different data permission policies for each data scientist to demonstrate how to grant granular access using Lake Formation.

  1. Sign in to the console with the credentials associated with a Lake Formation administrator, based on your authentication method (IAM, IAM Identity Center, or federation with an external IdP).
  2. On the Lake Formation console, in the navigation pane, choose Tables.
  3. On the Tables page, select the table you created earlier (ecommerce_reviews).
  4. On the Actions menu, under Permissions, choose Grant.

We grant full access to the Women’s E-Commerce Clothing Reviews Dataset table for the first data scientist.

  1. Select My account.
  2. For IAM users and roles, choose the execution role associated to the first data scientist, SageMakerStudioExecutionRole_data-scientist-full.
  3. For Table permissions and Grantable permissions, select Select.
  4. Choose Grant.

We repeat the process to grant limited access to the dataset for the second data scientist.

  1. On the Tables page, select the table you created earlier.
  2. On the Actions menu, under Permissions, choose Grant.
  3. Select My account.
  4. For IAM users and roles, choose the execution role associated to the second data scientist, SageMakerStudioExecutionRole_data-scientist-limited.
  5. For Columns, choose Include columns.
  6. Choose a subset of columns, such as Clothing_id, title, rating, division_name, department_name, class_name, recommended_ind, and positive_feedback_count.
  7. For Table permissions and Grantable permissions, select Select.
  8. Choose Grant.
  9. To verify the data permissions you have granted, on the Lake Formation console, in the navigation pane, choose Tables.
  10. On the Tables page, select your table.
  11. On the Actions menu, under Permissions, choose View permissions to open the Data permissions page.

You see a list of permissions granted for the table, including the permissions you just granted and permissions for the Lake Formation admin.

If you see the principal IAMAllowedPrincipals listed on the Data permissions menu for the table, you must remove it. Select the principal and choose Revoke. On the Revoke permissions page, choose Revoke.
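
The two grants described in this section can be expressed with the Lake Formation API as well. This is a sketch under stated assumptions: the database name is a placeholder you must supply, the column names are assumed to be stored in lowercase in the Data Catalog, and the helper names are hypothetical.

```python
TABLE_NAME = "ecommerce_reviews"

# Subset of columns granted to the second data scientist.
# Assumption: names are lowercase in the Data Catalog.
LIMITED_COLUMNS = [
    "clothing_id",
    "title",
    "rating",
    "division_name",
    "department_name",
    "class_name",
    "recommended_ind",
    "positive_feedback_count",
]


def build_grant_request(role_arn: str, database: str, columns=None) -> dict:
    """Build keyword arguments for lakeformation.grant_permissions.

    With columns=None the grant covers the whole table; otherwise it is
    restricted to the listed columns.
    """
    if columns:
        resource = {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": TABLE_NAME,
                "ColumnNames": list(columns),
            }
        }
    else:
        resource = {"Table": {"DatabaseName": database, "Name": TABLE_NAME}}
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": resource,
        "Permissions": ["SELECT"],
        "PermissionsWithGrantOption": ["SELECT"],
    }


def grant_select(role_arn: str, database: str, columns=None) -> None:
    """Apply the grant (calls AWS as a Lake Formation administrator)."""
    import boto3  # imported lazily so the builder works without AWS access

    client = boto3.client("lakeformation")
    client.grant_permissions(**build_grant_request(role_arn, database, columns))
```

Passing LIMITED_COLUMNS for the data-scientist-limited role and no columns for the data-scientist-full role would reproduce the two policies configured in the console.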

Set up SageMaker Studio

You now onboard to Studio and create two user profiles, one for each data scientist.

When you onboard to Studio using IAM authentication, Studio creates a domain for your account. A domain consists of a list of authorized users and a variety of security, application, policy, and Amazon Virtual Private Cloud (Amazon VPC) configurations.

By default, all traffic goes over the internet through a SageMaker system VPC. Alternatively, instead of using the default SageMaker internet access, you could secure how Studio accesses resources by defining a VPC Only mode for the domain. This is beyond the scope of this post, but you can find additional details in Securing Amazon SageMaker Studio connectivity using a private VPC.

If you already have a Studio domain running, you can skip the onboarding process and follow the steps to create the SageMaker user profiles.

Onboard to Studio

To onboard to Studio, complete the following steps:

  1. Sign in to the console with the credentials of a user with service administrator permissions for SageMaker, based on your authentication method (IAM, IAM Identity Center, or federation with an external IdP).
  2. On the SageMaker console, in the navigation pane, choose Studio.
  3. On the Amazon SageMaker Studio menu, under Get started, choose Create a SageMaker domain, then choose Set up for organizations.
  4. On the Domain details page, enter your domain name, then choose Next.
  5. On Users and ML Activities, choose Login through IAM.

You have the option to create a new role for the Studio Domain. You’re not using this execution role for the SageMaker user profiles that you create later. SageMaker creates a new IAM AmazonSageMaker-ExecutionPolicy role with attached policies based on the ML activities to which you want the role to have access. You can select up to 10 ML activities.

  1. For S3 Bucket Access, enter sagemaker-example-files-prod-us-east-1.
  2. On Applications page, choose SageMaker Studio, then choose Next.
  3. Under Network, select Virtual Private Cloud (VPC) Only. Then choose the private VPC that is used for communication with the SageMaker API, SageMaker runtime and other AWS services.
  4. For Subnet(s), choose multiple subnets in the VPC from different Availability Zones.
  5. For Security Group(s), choose the security groups that will be associated with the RStudioServerPro App and Space Apps, then choose Next.

You have the option to choose an encryption key for the Amazon Elastic File System (Amazon EFS) and Amazon Elastic Block Store (Amazon EBS) file systems used by the domain. You cannot change the encryption key after encrypting your Amazon EFS and Amazon EBS file systems.

Additionally, you can specify the default and maximum space size according to your requirements.

  1. Choose Next to move to the Review and create step, then choose Submit.
  2. On the Studio Control Panel, under Studio Summary, wait for the status to change to InService and the Add user button to be enabled.
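
For readers who prefer the API, the onboarding choices above (IAM authentication and VPC Only networking) map to parameters of the SageMaker create_domain call. The sketch below is illustrative only: the function names are hypothetical, and the role ARN, VPC ID, subnet IDs, and security group IDs are values you supply.

```python
def build_domain_request(domain_name, execution_role_arn, vpc_id,
                         subnet_ids, security_group_ids=None) -> dict:
    """Build keyword arguments for sagemaker.create_domain."""
    user_settings = {"ExecutionRole": execution_role_arn}
    if security_group_ids:
        user_settings["SecurityGroups"] = list(security_group_ids)
    return {
        "DomainName": domain_name,
        "AuthMode": "IAM",                  # "Login through IAM" in the console
        "DefaultUserSettings": user_settings,
        "VpcId": vpc_id,
        "SubnetIds": list(subnet_ids),
        "AppNetworkAccessType": "VpcOnly",  # "Virtual Private Cloud (VPC) Only"
    }


def create_domain_and_wait(request: dict, poll_seconds: int = 30) -> str:
    """Create the domain and poll until it is InService (calls AWS)."""
    import time

    import boto3  # imported lazily so the builder works without AWS access

    sagemaker = boto3.client("sagemaker")
    domain_arn = sagemaker.create_domain(**request)["DomainArn"]
    domain_id = domain_arn.rsplit("/", 1)[-1]
    while True:
        status = sagemaker.describe_domain(DomainId=domain_id)["Status"]
        if status == "InService":
            return domain_id
        if "Failed" in status:
            raise RuntimeError(f"Domain {domain_id} ended in status {status}")
        time.sleep(poll_seconds)
```

The polling loop plays the same role as waiting for the InService status in the console before adding users.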

Create the SageMaker user profiles

To create your SageMaker user profiles with the studiouserid tag, complete the following steps:

  1. On the SageMaker console, in the navigation pane, choose Domains under Admin configurations. Then choose your domain.
  2. On the Domain details page, choose Add user.
    1. For Name, enter the name of the first data scientist user, data-scientist-full.
    2. For Execution role, select the IAM role created for this user, SageMakerStudioExecutionRole_data-scientist-full.
    3. Under Tags, choose Add tag. For Key, enter studiouserid. For Value, enter the identity (IAM Identity Center, external IdP, or IAM) of the first data scientist, depending on your authentication method. Then choose Next.
    4. Leave the default values for Studio settings and RStudio settings.
    5. You will not use SageMaker Canvas during this walkthrough. For Canvas settings, you can disable all the settings. Then choose Submit.
  3. Repeat the previous step to create a second user profile.
    1. For Name, enter the name of the second data scientist user, data-scientist-limited.
    2. For Execution role, select the IAM role created for this user, SageMakerStudioExecutionRole_data-scientist-limited.
    3. Under Tags, choose Add tag. For Key, enter studiouserid. For Value, enter the identity (IAM Identity Center, external IdP, or IAM) of the second data scientist, depending on your authentication method. Then choose Next.
    4. Leave the default values for Studio settings and RStudio settings.
    5. You will not use SageMaker Canvas during this walkthrough. For Canvas settings, you can disable all the settings. Then choose Submit.
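
The same user profiles can be created through the API. The following sketch assumes the execution roles follow the naming convention used in this post. The function names and the domain_id, account_id, and studio_user_id parameters are hypothetical placeholders.

```python
ROLE_PREFIX = "SageMakerStudioExecutionRole_"


def build_user_profile_request(domain_id: str, profile_name: str,
                               studio_user_id: str, account_id: str) -> dict:
    """Build keyword arguments for sagemaker.create_user_profile."""
    role_arn = f"arn:aws:iam::{account_id}:role/{ROLE_PREFIX}{profile_name}"
    return {
        "DomainId": domain_id,
        "UserProfileName": profile_name,
        # The studiouserid tag carries the identity of the data scientist.
        "Tags": [{"Key": "studiouserid", "Value": studio_user_id}],
        "UserSettings": {"ExecutionRole": role_arn},
    }


def create_user_profile(domain_id: str, profile_name: str,
                        studio_user_id: str, account_id: str) -> None:
    """Create one user profile (calls AWS)."""
    import boto3  # imported lazily so the builder works without AWS access

    boto3.client("sagemaker").create_user_profile(
        **build_user_profile_request(domain_id, profile_name,
                                     studio_user_id, account_id)
    )
```

Calling create_user_profile twice, once for data-scientist-full and once for data-scientist-limited, would correspond to the two profiles added in the console.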

Test Lake Formation access control policies

You now test the implemented Lake Formation access control policies by opening Studio using both user profiles. For each user profile, you run the same Studio notebook containing Athena queries. You should see different query outputs for each user profile, matching the data permissions implemented earlier.

  1. Sign in to the console with the credentials associated to the first data scientist (data-scientist-full), based on your authentication method (IAM, IAM Identity Center, or federation with an external IdP).
  2. On the SageMaker console, in the navigation pane, choose Amazon SageMaker Studio.
  3. On the Studio Control Panel, choose user name data-scientist-full.
  4. Choose Open Studio.
  5. Wait for Studio to load.

Due to the IAM policies attached to the IAM user, you can only open Studio with a user profile matching the IAM user name.

  1. In Studio, launch JupyterLab. Choose Create JupyterLab space, and provide a name such as jupyterlab-full-profile. Then choose Create Space, and wait for it to be created.
  2. For Image, select Latest, then choose Run space and wait for it to start.
  3. Choose Open JupyterLab and wait for JupyterLab to open in a new window.
  4. In the JupyterLab window, navigate to the launcher and select Terminal from the Other section.
  5. At the command prompt, run the following command to import a sample notebook to test Lake Formation data permissions:
git clone https://github.com/aws-samples/amazon-sagemaker-studio-audit.git
  1. In the left sidebar, choose the file browser.
  2. Navigate to amazon-sagemaker-studio-audit.
  3. Open the notebook folder.
  4. Choose sagemaker-studio-audit-control.ipynb to open the notebook.
  5. In the Select Kernel dialog, choose Python 3 (Data Science).
  6. Choose Select.
  7. Wait for the kernel to load.

  1. Starting from the first code cell in the notebook, press Shift+Enter to run the code cell.
  2. Continue running all the code cells, waiting for the previous cell to finish before running the following cell.

After running the last SELECT query, because the user has full SELECT permissions for the table, the query output includes all the columns in the ecommerce_reviews table.

  1. Go back to SageMaker Studio and stop the jupyterlab-full-profile JupyterLab space, as part of the clean-up process.
  2. Close the Studio browser tab.
  3. Repeat the previous steps in this section, this time signing in with the credentials associated to the second data scientist (data-scientist-limited) and opening Studio with this user.
  4. Don’t run the code cell in the section Create S3 bucket for query output files.

For this user, after running the same SELECT query in the Studio notebook, the query output only includes a subset of columns for the ecommerce_reviews table.
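
To make the difference between the two profiles concrete, the following sketch shows how a notebook cell could list the columns returned by the same query under each profile. It is not the code from the sample notebook. It assumes PyAthena is installed, and the staging directory, Region, and database name are placeholders you supply.

```python
def column_names(description) -> list:
    """Extract column names from a DB-API cursor.description sequence."""
    return [column[0] for column in description]


def hidden_columns(all_columns, visible_columns) -> list:
    """Return the columns present in the table but missing from the result."""
    visible = set(visible_columns)
    return [name for name in all_columns if name not in visible]


def run_select(staging_dir: str, region: str, database: str, limit: int = 10):
    """Run SELECT * on the table and return (column names, rows) (calls AWS)."""
    from pyathena import connect  # imported lazily; requires PyAthena

    cursor = connect(
        s3_staging_dir=staging_dir,
        region_name=region,
        schema_name=database,
    ).cursor()
    cursor.execute(f"SELECT * FROM ecommerce_reviews LIMIT {int(limit)}")
    return column_names(cursor.description), cursor.fetchall()
```

Running run_select under data-scientist-full and again under data-scientist-limited, then passing both column lists to hidden_columns, would show which columns Lake Formation filtered out for the second user.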

Audit data access activity with Lake Formation and CloudTrail

In this section, we explore the events associated to the queries performed in the previous section. The Lake Formation console includes a dashboard where it centralizes all CloudTrail logs specific to the service, such as GetDataAccess. These events can be correlated with other CloudTrail events, such as Athena query requests, to get a complete view of the queries users are running on the data lake.

Alternatively, instead of filtering individual events in Lake Formation and CloudTrail, you could run SQL queries to correlate CloudTrail logs using Athena. Such integration is beyond the scope of this post, but you can find additional details in Using the CloudTrail Console to Create an Athena Table for CloudTrail Logs and Analyze Security, Compliance, and Operational Activity Using AWS CloudTrail and Amazon Athena.

Audit data access activity with Lake Formation

To review activity in Lake Formation, complete the following steps:

  1. Sign out of the AWS account.
  2. Sign in to the console with the credentials associated to a Lake Formation administrator, based on your authentication method (IAM, IAM Identity Center, or federation with an external IdP).
  3. On the Lake Formation console, in the navigation pane, choose Dashboard.

Under Recent access activity, you can find the events associated to the data access for both users.

  1. Choose the most recent event with event name GetDataAccess.
  2. Choose View event.

Among other attributes, each event includes the following:

  • Event date and time
  • Event source (Lake Formation)
  • Athena query ID
  • Table being queried
  • IAM user embedded in the Lake Formation principal, based on the chosen role name convention
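
Because the execution role name embeds the user profile name, the user behind an event can be recovered from the principal ARN. The helper below is a hypothetical sketch of that parsing step. It assumes only the SageMakerStudioExecutionRole_ prefix used in this post and accepts either an IAM role ARN or an assumed-role ARN.

```python
ROLE_PREFIX = "SageMakerStudioExecutionRole_"


def profile_from_principal(principal_arn: str):
    """Return the user profile name embedded in a role ARN, or None.

    Scans every "/"-separated segment so that role paths and the trailing
    session name of an assumed-role ARN do not interfere.
    """
    for segment in principal_arn.split("/"):
        if segment.startswith(ROLE_PREFIX):
            return segment[len(ROLE_PREFIX):]
    return None
```

For a principal whose role does not follow the convention, the function returns None, which is a useful signal that the access did not come from one of the Studio execution roles.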

Audit data access activity with CloudTrail

To review activity in CloudTrail, complete the following steps:

  1. On the CloudTrail console, in the navigation pane, choose Event history.
  2. On the Event history menu, for Filter, choose Event name.
  3. Enter StartQueryExecution.
  4. Expand the most recent event.

This event includes additional parameters that are useful to complete the audit analysis, such as the following:

  • Event source (Athena).
  • Athena query ID, matching the query ID from Lake Formation’s GetDataAccess.
  • Query string. As of July 3, 2023, the query string has a value of ***OMITED***.
  • Output location. The query output is stored in CSV format in this Amazon S3 location. Files for each query are named using the query ID.

To obtain the details of your query, complete the following steps:

  1. Copy the query execution ID.
  2. On the Athena console, on the Recent queries tab, enter the query execution ID.
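
The lookup above can also be sketched in code. The example assumes boto3 and permissions for CloudTrail lookup_events and Athena get_query_execution. The function names are hypothetical. The fields read from the event (requestParameters.resultConfiguration.outputLocation and responseElements.queryExecutionId) are parsed defensively, so a missing field yields None rather than an error.

```python
import json


def summarize_start_query_event(cloudtrail_event_json: str) -> dict:
    """Pull the audit-relevant fields out of one StartQueryExecution record."""
    record = json.loads(cloudtrail_event_json)
    request = record.get("requestParameters") or {}
    response = record.get("responseElements") or {}
    result_config = request.get("resultConfiguration") or {}
    return {
        "event_source": record.get("eventSource"),
        "query_execution_id": response.get("queryExecutionId"),
        "output_location": result_config.get("outputLocation"),
        "query_string": request.get("queryString"),  # may be masked in the log
    }


def recent_queries(max_results: int = 5) -> list:
    """Look up recent StartQueryExecution events and fetch the query text."""
    import boto3  # imported lazily so the parser works without AWS access

    cloudtrail = boto3.client("cloudtrail")
    athena = boto3.client("athena")
    events = cloudtrail.lookup_events(
        LookupAttributes=[
            {"AttributeKey": "EventName", "AttributeValue": "StartQueryExecution"}
        ],
        MaxResults=max_results,
    )["Events"]
    summaries = []
    for event in events:
        summary = summarize_start_query_event(event["CloudTrailEvent"])
        query_id = summary["query_execution_id"]
        if query_id:
            # Athena keeps the query text even when CloudTrail masks it.
            execution = athena.get_query_execution(QueryExecutionId=query_id)
            summary["query_string"] = execution["QueryExecution"]["Query"]
        summaries.append(summary)
    return summaries
```

Fetching the query text from Athena by execution ID is the programmatic counterpart of searching the Recent queries tab.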

Clean up your resources

To avoid incurring future charges, delete the resources created during this walkthrough.

If you followed this walkthrough using the CloudFormation template, after shutting down the Studio apps for each user profile, delete the stack to delete the remaining resources.

If you encounter any errors, open the Studio Control Panel and verify that all the apps for every user profile are in Deleted state before deleting the stack.

If you didn’t use the CloudFormation template, you can manually delete the resources you created:

  1. On the Amazon SageMaker domain page, choose each user profile, then choose Edit.
  2. Choose Delete user.
  3. When all users are deleted, go to the Space management tab and repeat the following for each shared space: choose Confirm deletion under Delete app for every app.
  4. When all users and shared spaces are deleted, choose the domain settings. Choose Edit. On the General settings page, choose Delete domain.
  5. On the Lake Formation console, delete the table and the database created for the Women’s Clothing E-Commerce Reviews Dataset.
  6. Remove the data lake location for the dataset.
  7. On the IAM console, delete the IAM users, group, and roles created for this walkthrough.
  8. Delete the policies you created for these principals.
  9. On the Amazon S3 console, empty and delete the bucket created for storing Athena query results (starting with sagemaker-audit-control-query-results-), and the bucket created by Studio to share notebooks (starting with sagemaker-studio-).
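
The bucket cleanup in the last step can be scripted. The sketch below is hypothetical and deliberately cautious: it only selects candidate buckets by the two prefixes named above, and the delete helper must be called explicitly for each bucket. Review the candidate list first, because the sagemaker-studio- prefix may match buckets that are unrelated to this walkthrough.

```python
BUCKET_PREFIXES = (
    "sagemaker-audit-control-query-results-",
    "sagemaker-studio-",
)


def select_walkthrough_buckets(bucket_names) -> list:
    """Return the bucket names that start with one of the known prefixes."""
    return [name for name in bucket_names if name.startswith(BUCKET_PREFIXES)]


def list_candidate_buckets() -> list:
    """List buckets in the account that match the prefixes (calls AWS)."""
    import boto3  # imported lazily so the filter works without AWS access

    buckets = boto3.client("s3").list_buckets()["Buckets"]
    return select_walkthrough_buckets(bucket["Name"] for bucket in buckets)


def empty_and_delete(bucket_name: str) -> None:
    """Delete every object in the bucket, then the bucket itself (calls AWS).

    Note: for a versioned bucket, object versions must also be removed
    before the bucket can be deleted; this sketch does not handle that.
    """
    import boto3

    bucket = boto3.resource("s3").Bucket(bucket_name)
    bucket.objects.all().delete()
    bucket.delete()
```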

Conclusion

This post described how to implement access control and auditing capabilities on a per-user basis in ML projects, using Studio notebooks, Athena, and Lake Formation to enforce access control policies when performing exploratory activities in a data lake.

Thank you for following this walkthrough. We invite you to implement it using the associated CloudFormation template. You’re also welcome to visit the GitHub repo for the project.


About the Authors

Rodrigo Alarcon is a Principal ML Strategy Solutions Architect with AWS based out of Santiago, Chile. Rodrigo has over 10 years of experience in IT security and network infrastructure. His interests include machine learning and cybersecurity.

Yeimy Arevalo is a Sr. AI/ML Architect with AWS based out of Bogotá, Colombia. Yeimy has over 11 years of experience in Data Science. Her interests include machine learning and generative AI.

Francisco Fagas is a Sr. Solutions Architect with AWS based out of Santiago, Chile. Francisco has over 15 years of experience in IT. His interests include machine learning and analytics.

Alejandro Martínez is a Sr. Technical Account Manager with AWS based out of Mexico City, Mexico. Alejandro has over 25 years of experience in IT. His interests include machine learning, analytics and High Performance Computing.