Posted On: Dec 5, 2022

Amazon SageMaker Studio is a fully integrated development environment (IDE) for machine learning. Studio comes with built-in integration with Amazon EMR so that data scientists can interactively prepare data at petabyte scale using frameworks such as Apache Spark right from Studio notebooks. We’re excited to announce that SageMaker Studio now supports applying fine-grained data access control with AWS Lake Formation when accessing data through Amazon EMR.

Until now, all jobs that you ran on the EMR cluster used the same IAM role, the cluster’s EC2 Instance Profile, to access data. Therefore, to run jobs that needed access to different data sources (for example, different S3 buckets), you had to configure the EC2 Instance Profile with policies that allowed access to the union of all such data sources. Additionally, to give groups of users different levels of access to data, you had to create a separate cluster for each group, resulting in operational overhead. Separately, jobs submitted to EMR from Studio notebooks were unable to apply fine-grained data access control with AWS Lake Formation.

Starting today, when you connect to EMR clusters from SageMaker Studio notebooks, you can choose the IAM role (called the runtime IAM role) that you want to connect with. Apache Spark, Hive, or Presto jobs created from Studio notebooks will access only the data and resources permitted by policies attached to the runtime role. Also, when data is accessed from data lakes managed with AWS Lake Formation, you can enforce table-level and column-level access using policies attached to the runtime role. With this new capability, multiple SageMaker Studio users can connect to the same EMR cluster, each using a runtime role scoped with customized data access permissions. User sessions are completely isolated from one another on the shared cluster. With this feature, customers can simplify provisioning of EMR clusters, thus reducing operational overhead and saving costs.
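
As a rough illustration of what "a runtime role scoped with customized data access permissions" could look like, the following Python sketch builds a standard IAM policy document that grants read-only access to a single S3 prefix. The bucket name `example-data-lake`, the prefixes `marketing` and `finance`, and the helper `build_runtime_role_policy` are hypothetical and do not come from the announcement. The sketch only constructs the policy JSON; it does not create roles or call any AWS API, and a real runtime role would likely need further permissions and a trust policy not shown here.

```python
import json


def build_runtime_role_policy(bucket: str, prefix: str) -> dict:
    """Return an IAM policy document granting read-only access to one S3 prefix.

    Hypothetical helper for illustration; bucket and prefix are assumptions.
    """
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Allow listing only the keys under the given prefix.
                "Sid": "ListPrefixOnly",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Allow reading objects under the given prefix and nothing else.
                "Sid": "ReadObjectsUnderPrefix",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            },
        ],
    }


# Two user groups sharing one cluster, each with its own scoped policy.
marketing_policy = build_runtime_role_policy("example-data-lake", "marketing")
finance_policy = build_runtime_role_policy("example-data-lake", "finance")

print(json.dumps(marketing_policy, indent=2))
```

Under these assumptions, each group would connect to the same cluster with a runtime role carrying its own policy, instead of the EC2 Instance Profile holding the union of both.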

This feature is generally available in SageMaker Studio when connecting to Amazon EMR 6.9 in the following AWS Regions: US East (Ohio), US East (N. Virginia), US West (Oregon), Europe (Paris). To learn more, see this blog.