AWS for Industries

FSI Services Spotlight: Featuring AWS Lake Formation

Welcome back to the Financial Services Industry (FSI) Service Spotlight monthly blog series. Each month we look at five key considerations that FSI customers should focus on to help streamline cloud service approval for one particular service. Each of the five key considerations includes specific guidance, suggested reference architectures, and technical code that can be used to streamline service approval for the featured service. This guidance should be adapted to suit your own specific use case and environment.

For this edition of the Service Spotlight, we’re covering AWS Lake Formation, a fully managed service that helps you build, secure, and manage your data lake at scale. Lake Formation provides a central console where you can discover data sources, set up transformation jobs to move data to an Amazon Simple Storage Service (Amazon S3) data lake, remove duplicates and match records, catalog data for access by analytic tools, configure data access and security policies, and audit and control access from AWS analytic and machine learning (ML) services.

Lake Formation provides a unified security model for managing permissions to AWS data lake environments based on AWS Glue and Amazon S3. It provides granular access control at the catalog, database, table, and underlying data levels.

Security settings and access controls are defined and enforced at the table, column, row, and cell levels for all of the users and services that access your data. You can access the data in your data lake through various analytics services, such as AWS Glue, Amazon Athena, Amazon Redshift, Amazon QuickSight, and Amazon EMR (using Zeppelin notebooks with Apache Spark), and access through each of them stays compliant with your defined policies. Lake Formation lets you configure and manage permissions for your data lake without manually integrating multiple underlying AWS services.

Many organizations are leveraging Lake Formation as part of their strategy to retire technical debt and modernize applications on AWS. For example, Southwest Airlines used Lake Formation, Amazon S3, and Athena to build its first cloud-native data lake. This provided Southwest with new analytics capabilities that delivered a competitive edge for their data scientists, improved flight-time predictions, and reduced airspace congestion.

JPMorgan Chase Bank, N.A. (JPMC) is a 200-year-old financial institution with holdings of approximately $3.2 trillion and operations worldwide. JPMC has leveraged Lake Formation as part of their data mesh architecture. JPMC’s data mesh architecture aligns their data technology solutions to their data product strategy. A blueprint is provided for instantiating data lakes that implements the mesh architecture in a standardized way using a defined set of cloud services. This lets them enable data sharing across the enterprise while giving data owners the control and visibility required to manage their data effectively. Read more about their experience in the post, “How JPMorgan Chase built a data mesh architecture to drive significant value to enhance their enterprise data platform”.

Achieving Compliance with Lake Formation

Lake Formation is an AWS managed service. Third-party auditors regularly assess the security and compliance of AWS services as part of multiple AWS compliance programs. As part of the AWS shared responsibility model, the Lake Formation and AWS Glue services are in the scope of the following compliance programs. You can obtain the corresponding compliance reports under an AWS non-disclosure agreement (NDA) through AWS Artifact.

AWS Glue

  • C5
  • DoD CC SRG (IL2-IL5)
  • ENS High
  • FINMA
  • GSMA (Ohio and Paris)
  • IRAP
  • K-ISMS
  • OSPAR
  • PiTuKri
  • SOC 1, 2, 3

AWS Glue (including Lake Formation)

  • CSA STAR CCM v3.0.1
  • HIPAA BAA
  • HITRUST CSF
  • ISMAP
  • ISO/IEC 27001:2013, 27017:2015, 27018:2019, and ISO/IEC 9001:2015
  • MTCS (Regions: US-East, US-West, Singapore, Seoul)
  • PCI
  • FedRAMP (Moderate and High)

Your scope of the shared responsibility model when using Lake Formation is determined by the sensitivity of your data, your organization’s compliance objectives, and the applicable laws and regulations. AWS provides several resources for meeting your compliance objectives.

Data Protection with Lake Formation

Data protection is the process of preventing critical information from being corrupted, compromised, or lost. Encryption is a recommended practice for protecting the confidentiality and integrity of the data being processed, both in transit and at rest.

Encryption can be enabled individually on Amazon S3 and on the AWS Glue Data Catalog by using AWS Key Management Service (AWS KMS). Lake Formation then provides permission management for the encrypted Data Catalog and for the encrypted datasets stored on Amazon S3.

  • Encrypting the AWS Glue Data Catalog: Encryption for AWS Glue Data Catalog objects may be enabled in the Data Catalog Settings section of the AWS Glue interface by passing the symmetric AWS KMS key. The encrypted objects include the databases, tables, partitions, table versions, connections, and user-defined functions. For detailed steps, refer to the AWS Glue edition of the FSI Services Spotlight.
  • Encrypting the ETL Process: AWS Glue supports data encryption at rest for Authoring Jobs in AWS Glue and Developing Scripts using development endpoints with keys that you manage in AWS KMS. For detailed steps, refer to the AWS Glue edition of the FSI Services Spotlight.
  • Encrypting the Amazon S3 Bucket: Lake Formation supports permissions management on datasets stored on Amazon S3 when server-side AWS KMS encryption is used. This approach provides automatic server-side encryption with keys managed by AWS KMS. Amazon S3 encrypts data in transit during cross-Region replication. Moreover, it lets you use separate accounts for the source and destination Regions to protect against malicious insider deletions. These encryption capabilities provide a secure foundation for all of the data in your data lake. To enable encryption, refer to the Amazon S3 documentation on protecting data using server-side encryption. Both customer managed AWS KMS keys and AWS managed keys are supported. Note that client-side encryption and decryption are not supported.
  • Encrypting the Governed tables:
    • Governed tables work as expected for data encryption at rest with AWS Glue managing the encryption key. The AWS Identity and Access Management (IAM) role associated with the Amazon S3 location where the governed tables reside must have AWS KMS permission to encrypt/decrypt.
    • Governed tables work as expected with Data Catalog metadata encryption enabled. The IAM role associated with the Amazon S3 location where the governed tables reside must have AWS KMS permission to encrypt/decrypt. Additionally, you must grant both the IAM role and the Lake Formation service permission to encrypt and decrypt with the key.
    • The default Lake Formation service-linked role (SLR) cannot be used for encrypted governed tables. You must use a custom IAM role with Amazon S3, AWS KMS, and Amazon CloudWatch permissions.
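The Data Catalog and Amazon S3 bucket settings described in the list above can be written down as request payloads. The following Python sketch only builds those payloads as dictionaries and does not call AWS. It assumes the shapes used by the Amazon S3 PutBucketEncryption and AWS Glue PutDataCatalogEncryptionSettings APIs. The key ARN and the function names are placeholders introduced for illustration.

```python
import json

# Placeholder key ARN for illustration only; substitute your own symmetric KMS key.
KMS_KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"


def s3_default_encryption(kms_key_arn: str) -> dict:
    """Bucket default encryption that uses server-side encryption with AWS KMS."""
    return {
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": kms_key_arn,
                }
            }
        ]
    }


def glue_catalog_encryption(kms_key_arn: str) -> dict:
    """Data Catalog encryption at rest with a symmetric AWS KMS key."""
    return {
        "EncryptionAtRest": {
            "CatalogEncryptionMode": "SSE-KMS",
            "SseAwsKmsKeyId": kms_key_arn,
        }
    }


if __name__ == "__main__":
    print(json.dumps(s3_default_encryption(KMS_KEY_ARN), indent=2))
    print(json.dumps(glue_catalog_encryption(KMS_KEY_ARN), indent=2))
```

If you use the AWS SDK for Python, payloads of this shape would typically be passed as the ServerSideEncryptionConfiguration argument of put_bucket_encryption and the DataCatalogEncryptionSettings argument of put_data_catalog_encryption_settings. Verify the parameters against the current SDK documentation before relying on them.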

To enable encryption and decryption with the configured AWS KMS key, the AWS KMS key policy must include a statement that allows the lakeformation.amazonaws.com service principal to use the key, as shown in the following example:

{
    "Effect": "Allow",
    "Principal": {
        "Service": [
            "lakeformation.amazonaws.com"
        ]
    },
    "Action": [
        "kms:Encrypt",
        "kms:Decrypt",
        "kms:ReEncrypt*",
        "kms:DescribeKey"
    ],
    "Resource": "*"
}
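A key policy usually already contains other statements, so the statement above has to be merged in rather than used on its own. The following Python sketch shows one way to do that without duplicating the statement on repeated runs. The Sid value, the sample existing policy, and the function name are assumptions made for illustration; the sketch does not call AWS KMS.

```python
import copy

# The statement from the example above, with a hypothetical Sid added so it can be
# recognized on later runs.
LAKE_FORMATION_STATEMENT = {
    "Sid": "AllowLakeFormationUseOfKey",
    "Effect": "Allow",
    "Principal": {"Service": ["lakeformation.amazonaws.com"]},
    "Action": ["kms:Encrypt", "kms:Decrypt", "kms:ReEncrypt*", "kms:DescribeKey"],
    "Resource": "*",
}


def add_statement(policy: dict, statement: dict) -> dict:
    """Return a copy of policy with statement appended, unless its Sid is already present."""
    merged = copy.deepcopy(policy)
    statements = merged.setdefault("Statement", [])
    if not any(s.get("Sid") == statement["Sid"] for s in statements):
        statements.append(copy.deepcopy(statement))
    return merged


# Sample existing key policy (illustrative placeholder account ID).
EXISTING_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "EnableRootAccountAdministration",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
            "Action": "kms:*",
            "Resource": "*",
        }
    ],
}

if __name__ == "__main__":
    updated = add_statement(EXISTING_POLICY, LAKE_FORMATION_STATEMENT)
    print([s["Sid"] for s in updated["Statement"]])
```

Returning a copy leaves the input policy untouched, which makes it easier to compare the two documents before applying a change.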

Isolation of compute environments with Lake Formation

Lake Formation is a managed service that doesn’t have any compute resources within the customer’s portion of the shared responsibility model. As a managed service, Lake Formation is protected by the AWS global network security procedures that are described in the AWS Architecture Center: Security, Identity, & Compliance.

If you use Amazon Virtual Private Cloud (Amazon VPC) to host your AWS resources, then you can establish a private connection between your VPC and Lake Formation by creating an interface VPC endpoint. Interface endpoints are powered by AWS PrivateLink, a technology that lets you privately access Lake Formation APIs without an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection. Use this connection so that Lake Formation can communicate with the resources in your VPC without going through the public internet. Instances in your VPC don’t need public IP addresses to communicate with Lake Formation APIs. Traffic between your VPC and Lake Formation doesn’t leave the Amazon network. Each interface endpoint is represented by one or more Elastic Network Interfaces in your subnets.

VPC endpoint policies are also supported by Lake Formation. A VPC endpoint policy is an IAM resource policy that you attach to an endpoint to control access to Lake Formation through that endpoint. The following information is specified in the policy:

  • The principal that can perform actions.
  • The actions that can be performed.
  • The resources on which actions can be performed.

The following example VPC endpoint policy for Lake Formation allows for credential vending using Lake Formation permissions, while restricting access to a specific account. You could use this policy to run queries using Lake Formation permissions from an Amazon Redshift cluster or an Amazon EMR cluster located in a private subnet.

{
    "Statement": [{
        "Effect": "Allow",
        "Action": "lakeformation:GetDataAccess",
        "Resource": "*",
        "Principal": "*",
        "Condition": {
            "StringEquals": {
                "aws:PrincipalAccount": "111122223333"
            }
        }
    }]
}

If you do not attach a policy when you create an endpoint, then a default policy that allows full access to the service is attached.
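To avoid relying on the default full-access policy, the endpoint policy shown above can be generated per account. The following Python sketch builds the same policy document for a given account ID and rejects IDs that are not 12 digits. The function name is hypothetical, and the sketch only produces JSON; it does not create an endpoint.

```python
import json
import re


def lake_formation_endpoint_policy(account_id: str) -> dict:
    """Build an endpoint policy that allows credential vending for one account only."""
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError("AWS account IDs are 12 digits: %r" % account_id)
    return {
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "lakeformation:GetDataAccess",
                "Resource": "*",
                "Principal": "*",
                "Condition": {
                    "StringEquals": {"aws:PrincipalAccount": account_id}
                },
            }
        ]
    }


if __name__ == "__main__":
    # 111122223333 is the placeholder account ID used in the example above.
    print(json.dumps(lake_formation_endpoint_policy("111122223333"), indent=4))
```

The resulting JSON string is what you would supply as the policy document when you create or modify the interface endpoint, for example through the console, AWS CloudFormation, or an SDK.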

Automating audits with APIs with Lake Formation

Lake Formation is integrated with AWS CloudTrail, a service that provides a record of actions taken by a user, role, or an AWS service. Lake Formation provides comprehensive audit logs with CloudTrail to monitor access and show compliance with centrally defined policies. You can audit data access history across analytics and machine learning services that read the data in your data lake via Lake Formation. Lake Formation API actions are logged by CloudTrail. For example, calls to the PutDataLakeSettings, GrantPermissions, and RevokePermissions actions generate entries in the CloudTrail log files. This lets you see which users or roles have attempted to access what data, with which services, and when.
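As a rough illustration of how such entries could be reviewed automatically, the following Python sketch filters the records of a CloudTrail log file for Lake Formation permission changes. It assumes the standard CloudTrail log file layout, a JSON object with a top-level Records array. The sample records, function names, and role name are made up for the example.

```python
import json

LAKE_FORMATION_SOURCE = "lakeformation.amazonaws.com"
PERMISSION_EVENTS = frozenset(
    {"PutDataLakeSettings", "GrantPermissions", "RevokePermissions"}
)


def lake_formation_events(log_text, event_names=PERMISSION_EVENTS):
    """Return the Lake Formation records in a CloudTrail log file that match event_names."""
    records = json.loads(log_text).get("Records", [])
    return [
        record
        for record in records
        if record.get("eventSource") == LAKE_FORMATION_SOURCE
        and record.get("eventName") in event_names
    ]


def summarize(record):
    """Reduce a record to (time, action, caller) for an audit report."""
    identity = record.get("userIdentity", {})
    caller = identity.get("arn") or identity.get("principalId")
    return (record.get("eventTime"), record.get("eventName"), caller)


# Made-up sample log content for illustration only.
SAMPLE_LOG = json.dumps({
    "Records": [
        {
            "eventTime": "2023-01-01T00:00:00Z",
            "eventSource": "lakeformation.amazonaws.com",
            "eventName": "GrantPermissions",
            "userIdentity": {"arn": "arn:aws:iam::111122223333:role/DataLakeAdmin"},
        },
        {
            "eventTime": "2023-01-01T00:01:00Z",
            "eventSource": "s3.amazonaws.com",
            "eventName": "GetObject",
            "userIdentity": {"principalId": "EXAMPLE"},
        },
        {
            "eventTime": "2023-01-01T00:02:00Z",
            "eventSource": "lakeformation.amazonaws.com",
            "eventName": "GetDataAccess",
            "userIdentity": {"principalId": "EXAMPLE:GlueJobRunnerSession"},
        },
    ]
})

if __name__ == "__main__":
    for event in lake_formation_events(SAMPLE_LOG):
        print(summarize(event))
```

Passing a different set of event names, such as {"GetDataAccess"}, turns the same filter into a data access report instead of a permission change report.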

The following example shows a CloudTrail log entry for the GetDataAccess action. Principals do not directly call this API. Rather, GetDataAccess is logged whenever a principal or integrated AWS service requests temporary credentials to access data in a data lake location that is registered with Lake Formation.

{
    "eventVersion": "xxx",
    "userIdentity": {
        "type": "AWSAccount",
        "principalId": "ABSXYZ:GlueJobRunnerSession",
        "accountId": "11223333444"
    },
    "eventSource": "lakeformation.amazonaws.com",
    "eventName": "GetDataAccess",
    ...
    ...
    "additionalEventData": {
        "requesterService": "GLUE_JOB",
        "lakeFormationPrincipal": "arn:aws:iam::11223333444:role/ETL-Glue-Role",
        "lakeFormationRoleSessionName": "AWSLF-00-GL-11223333444-Punit123"
    },
    ...
}

The session name is formatted as: AWSLF-<version-number>-<query-engine-code>-<account-id>-<suffix>

  • version-number: The version of this format is currently 00. If the session name format changes, then the next version will be 01.
  • query-engine-code: This indicates the entity that accessed the data. The current values are:
    • GL: AWS Glue ETL job
    • AT: Amazon Athena
    • RE: Amazon Redshift Spectrum
  • account-id: The AWS account ID that requested credentials from Lake Formation.
  • suffix: A randomly generated string.
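The session name format described above is regular enough to parse when you process audit logs. The following Python sketch splits a session name into its parts and maps the query engine code to the service names listed above. The regular expression is an assumption derived from that description: it treats the version and account ID as digit runs and everything after the account ID as the suffix. The type and function names are hypothetical.

```python
import re
from typing import NamedTuple

# Query engine codes as listed above.
QUERY_ENGINES = {
    "GL": "AWS Glue ETL job",
    "AT": "Amazon Athena",
    "RE": "Amazon Redshift Spectrum",
}

# AWSLF-<version-number>-<query-engine-code>-<account-id>-<suffix>
SESSION_NAME_RE = re.compile(
    r"^AWSLF-(?P<version>\d+)-(?P<engine>[A-Z]+)-(?P<account>\d+)-(?P<suffix>.+)$"
)


class SessionName(NamedTuple):
    version: str
    engine_code: str
    engine: str
    account_id: str
    suffix: str


def parse_session_name(name: str) -> SessionName:
    """Split a lakeFormationRoleSessionName value into its documented parts."""
    match = SESSION_NAME_RE.match(name)
    if match is None:
        raise ValueError("Not a Lake Formation session name: %r" % name)
    code = match["engine"]
    return SessionName(
        version=match["version"],
        engine_code=code,
        engine=QUERY_ENGINES.get(code, "unknown"),
        account_id=match["account"],
        suffix=match["suffix"],
    )


if __name__ == "__main__":
    # Session name taken from the CloudTrail example above.
    print(parse_session_name("AWSLF-00-GL-11223333444-Punit123"))
```

Unknown engine codes are reported as "unknown" rather than rejected, since the list above describes only the current values.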

Operational access and security with Lake Formation

AWS customers in FSI may require visibility into any access of their data stored on AWS. You can review third-party auditor reports, such as the AWS SOC 2 Type II report, ISO 27001, and others, in AWS Artifact.

Using Lake Formation, customers can build a common data access and governance framework for data in the data lake. Lake Formation is a service built on AWS Glue. Lake Formation can be used to manage AWS Glue crawlers, AWS Glue ETL jobs, the Data Catalog, security settings, and access control. Once the data is securely stored in the data lake, customers can access the data through their choice of analytics services, as shown in the following diagram.

Diagram illustrating how the Lake Formation service works. Data stores flow into the data lake where Lake Formation is shown managing AWS Glue functionality to provide self-service access to users through analytics services such as Amazon Athena, Amazon Redshift, and Amazon EMR.

Using an administrative role in Lake Formation, customers can define security policy-based rules for users and applications. Moreover, integration with IAM authenticates those users and roles. Once the rules are defined, Lake Formation enforces the access controls. Related use cases include:

  • When Athena users select the AWS Glue Data Catalog in the query editor, they can query only the databases, tables, and columns on which they have Lake Formation permissions.
  • When Amazon Redshift users create an external schema on a database in the AWS Glue Data Catalog, they can query only the tables and columns in that schema on which they have Lake Formation permissions.
  • Lake Formation-enabled Amazon EMR clusters enforce Lake Formation permissions through Apache Zeppelin or Amazon EMR Notebooks when Apache Spark is used.
  • When a QuickSight Enterprise Edition author creates a dataset in an Amazon S3 location that is registered with Lake Formation, the author must have the Lake Formation SELECT permission on the data. QuickSight readers who have access to the shared dashboard see the data under the same rules that the author has been assigned.
  • For AWS Glue console operations (such as viewing a list of tables) and all AWS Glue API operations, users can access only the databases and tables on which they have Lake Formation permissions.
  • For AWS Glue ETL jobs and Athena, you may restrict access to certain data in query results by including data filtering specifications. Lake Formation uses data filtering to achieve column-level security, row-level security, and cell-level security. You can implement this by creating named data filters and specifying a data filter when you grant the SELECT Lake Formation permission on tables. When you create a data filter, you provide a set of columns and a filter expression for rows that must be included.
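To make the data filter idea in the last item concrete, the following Python sketch models a filter as a set of columns plus a row condition and applies it to a few in-memory rows. This is only a toy model of the effect on query results, not how Lake Formation evaluates filters: in Lake Formation the row condition is a filter expression attached to a named data filter, whereas here a Python predicate stands in for it. The table, column names, and function name are made up.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


def apply_data_filter(
    rows: Iterable[Row],
    columns: List[str],
    row_filter: Callable[[Row], bool],
) -> List[Row]:
    """Keep only rows matching row_filter, and within them only the listed columns."""
    return [
        {column: row[column] for column in columns if column in row}
        for row in rows
        if row_filter(row)
    ]


# Made-up sample table for illustration only.
ACCOUNTS = [
    {"account_id": "A1", "region": "EU", "balance": 100, "tax_id": "X"},
    {"account_id": "A2", "region": "US", "balance": 250, "tax_id": "Y"},
    {"account_id": "A3", "region": "EU", "balance": 75, "tax_id": "Z"},
]

if __name__ == "__main__":
    visible = apply_data_filter(
        ACCOUNTS,
        columns=["account_id", "balance"],           # column-level security
        row_filter=lambda row: row["region"] == "EU",  # row-level security
    )
    print(visible)
```

Restricting columns alone gives column-level security, restricting rows alone gives row-level security, and combining both gives cell-level security: the principal sees only the listed columns of the matching rows.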

Each time a principal (user or role) runs a query through an integrated AWS service (AWS Glue, Athena, Amazon EMR, Amazon SageMaker, and so on) on data registered with Lake Formation, Lake Formation verifies that the principal has the appropriate permissions on the database, the table, and the underlying Amazon S3 objects. If the principal has access, then Lake Formation vends temporary credentials to the requesting service, and the query runs.

Diagram illustrating how access to AWS Services is provided through Lake Formation. Lake Formation is vending temporary credentials to AWS Services to run the query for the end-service.

Summary

In this post, we reviewed Lake Formation and highlighted key information that can help FSI customers accelerate the approval of the service within these five categories:

  • Achieving compliance
  • Data protection
  • Isolation of compute environments
  • Automating audits with APIs
  • Operational access and security

While this isn’t a one-size-fits-all approach, this guidance can be adapted to meet your organization’s security and compliance requirements, and it provides a consolidated list of key areas to review for Lake Formation.

Be sure to visit our AWS Financial Services Industry blog channel and stay tuned for more financial services news and best practices.

Punit Jain

Punit Jain is a Solutions Architect at AWS who has been working with Digital Native Businesses. He’s passionate about technology and helping customers architect and build innovative, secure, resilient, and efficient solutions for complex business problems. He is also known as a regional security expert for his contributions towards improving the security posture of many AWS customers.

Rodney Underkoffler

Rodney Underkoffler is a Senior Solutions Architect at AWS, focused on guiding enterprise customers on their cloud journey. He has a background in infrastructure, security, and IT business practices. He is passionate about technology and enjoys building and exploring new solutions and methodologies.