FSI Services Spotlight: Featuring AWS Glue

In this edition of the Financial Services Industry (FSI) Services Spotlight monthly blog series, we continue to highlight five key considerations of a particular service that FSI customers should focus on to help streamline cloud service approval. Each of the five areas will include specific guidance, suggested reference architectures and technical code that can help streamline service approval for the featured service, which may need to be adapted to your specific use case and environment.

For this edition of the Service Spotlight, we are covering AWS Glue, a fully managed extract, transform, and load (ETL) tool that makes it easy to prepare and load data for analytics, machine learning, and development. You simply point AWS Glue to your data source, and AWS Glue discovers your data and stores the associated metadata (e.g. table definition and schema) in the AWS Glue Data Catalog. Once catalogued, your data is immediately searchable, queryable, and available for ETL. Financial institutions are increasingly adopting managed services, such as AWS Glue, in order to avoid the expense of managing and maintaining infrastructure while quickly delivering outcomes and focusing efforts on their core competencies.

Financial institutions are leveraging AWS Glue as an essential part of many of their workloads. A common use case across the industry is implementing AWS Glue as part of an Amazon S3 data lake. The AWS Glue service provides the Data Catalog and transformation services necessary for modern data analytics. An example of this can be seen with TNG FinTech Group, which takes advantage of AWS Glue to enable serverless queries with extract, transform, and load functionality. TNG uses Amazon Athena to automate queries in their wallet service, so customers can easily retrieve information on past transactions, with no manual intervention required from TNG staff. TNG FinTech Group relies on AWS Glue to transform raw data from the CSV file format before it is integrated into Amazon Athena for processing or transferred to the data lake on Amazon S3.

“In the past, we had to do a lot of coding work by ourselves and scalability was also a problem with CSV files. After moving to AWS, we can leverage cloud resources like AWS Glue and Amazon Athena to shorten the processing time,” says Chris Chan, head of engineering at TNG.

AWS Glue development environments

AWS Glue offers a number of development environments that cater to different skill sets: visual ETL development for data engineers, notebook-style development for data scientists, and no-code development for data analysts.

AWS Glue Studio allows you to author highly scalable ETL jobs for distributed processing without becoming an Apache Spark expert. You can define your ETL process in the drag-and-drop job editor and AWS Glue automatically generates the code to extract, transform, and load your data. The code is generated in Scala or Python and written for Apache Spark.
AWS Glue DataBrew is a visual data preparation tool that makes it easy for data analysts and data scientists to clean and normalize data to prepare it for analytics and machine learning.
With AWS Glue Elastic Views (Preview), you can use familiar Structured Query Language (SQL) to quickly create a virtual table—a materialized view—from multiple different source data stores. AWS Glue Elastic Views copies data from each source data store and creates a replica in a target data store. AWS Glue Elastic Views continuously monitors for changes to data in your source data stores and provides updates to the materialized views in your target data stores automatically, ensuring data accessed through the materialized view is always up-to-date.

Achieving compliance with AWS Glue

AWS Glue is an AWS managed service. Third-party auditors regularly assess its security and compliance as part of multiple AWS compliance programs. Under the AWS shared responsibility model, the AWS Glue service is in scope for the following compliance programs. You can obtain the corresponding compliance reports under an AWS non-disclosure agreement (NDA) through AWS Artifact.

C5
CSA STAR CCM v3.0.1
DoD CC SRG (IL2-IL5)
ENS High
FedRAMP (Moderate and High)
FINMA
HIPAA
HITRUST CSF
IRAP
ISMAP
ISO/IEC 27001:2013, 27017:2015, 27018:2019, and ISO/IEC 9001:2015
K-ISMS
MTCS (Regions: US-East, US-West, Singapore, Seoul)
OSPAR
PCI
SOC 1, 2, 3

Your scope of the shared responsibility model when using AWS Glue is determined by the sensitivity of your data, your organization’s compliance objectives, and applicable laws and regulations. AWS provides several resources for compliance validation.

Data protection with AWS Glue

Data protection is the process of preventing critical information from being corrupted, compromised, or lost. Encryption is a recommended practice for ensuring the confidentiality and integrity of the data being processed, both in transit and at rest.

At-Rest Encryption:

Encrypting the AWS Glue Data Catalog: Encryption for AWS Glue Data Catalog objects may be enabled in the Data Catalog Settings section of the AWS Glue interface by passing the symmetric KMS key. The encrypted objects include the databases, tables, partitions, table versions, connections, and user-defined functions.

Metadata encryption checkbox: When enabled, it encrypts all the objects in the Data Catalog with the selected AWS KMS key. Also, note when encryption is enabled, the client that is accessing the Data Catalog must have AWS KMS permissions.
Encrypt connection passwords: When enabled, the password you provide for connection creation is encrypted with the selected AWS KMS key.

The following screenshot illustrates the encryption options for your AWS Glue Data Catalog.

Figure 1: Glue Data Catalog settings
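
The same two options can also be set through the API. The following is a minimal sketch, assuming the boto3 SDK and a hypothetical KMS key ARN; the helper names are ours for illustration and are not part of AWS Glue. The builder only assembles the request for the PutDataCatalogEncryptionSettings operation so that it can be reviewed before anything is sent.

```python
import json


def catalog_encryption_settings(kms_key_id: str) -> dict:
    """Build DataCatalogEncryptionSettings that enable both options."""
    return {
        # Metadata encryption: encrypt Data Catalog objects with the key.
        "EncryptionAtRest": {
            "CatalogEncryptionMode": "SSE-KMS",
            "SseAwsKmsKeyId": kms_key_id,
        },
        # Encrypt connection passwords with the same key.
        "ConnectionPasswordEncryption": {
            "ReturnConnectionPasswordEncrypted": True,
            "AwsKmsKeyId": kms_key_id,
        },
    }


def apply_catalog_encryption(settings: dict) -> None:
    """Send the settings to AWS Glue (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder runs without the SDK

    boto3.client("glue").put_data_catalog_encryption_settings(
        DataCatalogEncryptionSettings=settings
    )


# Hypothetical key ARN, for illustration only.
KEY_ARN = "arn:aws:kms:us-east-1:123456789012:key/example-key-id"
settings = catalog_encryption_settings(KEY_ARN)
print(json.dumps(settings, indent=2))
```

Calling apply_catalog_encryption(settings) would apply the change to the Data Catalog in the caller's account and Region.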

Encrypting the ETL Process: AWS Glue supports data encryption at rest, with keys that you manage in AWS KMS, both for jobs that you author in AWS Glue and for scripts that you develop using development endpoints.

Security configuration: Create a security configuration to define at-rest encryption for Amazon S3, Amazon CloudWatch Logs, and job bookmarks with an AWS KMS key.
When you attach a security configuration to a crawler or ETL job, the IAM roles that are passed must have permissions to the specified AWS KMS key.

Figure 2: AWS Glue at-rest encryption settings
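
As a sketch of how such a security configuration might be defined in code, the following builds the EncryptionConfiguration structure accepted by the CreateSecurityConfiguration operation. It assumes boto3, a single KMS key for all three targets, and hypothetical names (the key ARN and "fsi-glue-encryption" are placeholders).

```python
def encryption_configuration(kms_key_arn: str) -> dict:
    """Build at-rest encryption settings for S3, CloudWatch Logs and bookmarks."""
    return {
        # Data written to Amazon S3 by the job or crawler.
        "S3Encryption": [
            {"S3EncryptionMode": "SSE-KMS", "KmsKeyArn": kms_key_arn}
        ],
        # Logs published to Amazon CloudWatch Logs.
        "CloudWatchEncryption": {
            "CloudWatchEncryptionMode": "SSE-KMS",
            "KmsKeyArn": kms_key_arn,
        },
        # Job bookmarks are encrypted client-side before being stored.
        "JobBookmarksEncryption": {
            "JobBookmarksEncryptionMode": "CSE-KMS",
            "KmsKeyArn": kms_key_arn,
        },
    }


def create_security_configuration(name: str, config: dict) -> None:
    """Create the configuration in AWS Glue (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builder runs without the SDK

    boto3.client("glue").create_security_configuration(
        Name=name, EncryptionConfiguration=config
    )


# Hypothetical key ARN, for illustration only.
config = encryption_configuration(
    "arn:aws:kms:us-east-1:123456789012:key/example-key-id"
)
```

The resulting configuration would then be referenced by name, for example create_security_configuration("fsi-glue-encryption", config), when creating a job or crawler.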

In-Transit Encryption:

For data in transit, AWS offers Secure Sockets Layer (SSL) encryption. In order to ensure the encryption of data containing sensitive information while in transit, AWS Glue should be configured to use JDBC connections to data stores with SSL/TLS. Enable the “Require SSL connection” property while creating AWS Glue connections. When enabled, if AWS Glue cannot connect using SSL, the job run, crawler, or ETL statements in a development endpoint will fail.
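
In the API, the “Require SSL connection” option corresponds to the JDBC_ENFORCE_SSL connection property. The sketch below, assuming boto3 and hypothetical connection details, builds a ConnectionInput for the CreateConnection operation with that property switched on. In practice the credentials would come from a secrets store rather than from code.

```python
def jdbc_connection_input(name: str, jdbc_url: str,
                          username: str, password: str) -> dict:
    """Build a JDBC ConnectionInput that refuses non-SSL connections."""
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": jdbc_url,
            "USERNAME": username,
            "PASSWORD": password,
            # Equivalent of ticking "Require SSL connection" in the console.
            "JDBC_ENFORCE_SSL": "true",
        },
    }


def create_connection(connection_input: dict) -> None:
    """Create the connection in AWS Glue (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builder runs without the SDK

    boto3.client("glue").create_connection(ConnectionInput=connection_input)


# All values below are placeholders for illustration.
connection = jdbc_connection_input(
    name="traffic-monitored-connection",
    jdbc_url="jdbc:postgresql://db.example.internal:5432/ledger",
    username="etl_user",
    password="replace-me",
)
```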

For JDBC connections, AWS Glue only connects over SSL with certificate and host name validation. SSL connection support is available for:

Amazon Redshift
Amazon Relational Database Service (RDS) for MySQL
Amazon Aurora MySQL and PostgreSQL
Amazon Managed Streaming for Apache Kafka (MSK)
Microsoft SQL Server
Oracle

In addition, to keep the data that ETL jobs write to Amazon S3 encrypted once it arrives, the setting for server-side encryption (SSE-S3 or SSE-KMS) should be passed as a parameter to ETL jobs run with AWS Glue.

Similarly, when using AWS Glue DataBrew, you can configure it to use JDBC connections to data stores with SSL/TLS encryption. When connecting to JDBC data sources, DataBrew uses the settings on your AWS Glue connection, including the “Require SSL connection” option. Additionally, to maintain encryption while at rest in S3 buckets, the setting for server-side encryption (SSE-S3 or SSE-KMS) should be passed as a parameter to DataBrew jobs.

Isolation of compute environments with AWS Glue

AWS Glue is a managed service that doesn’t have any compute resources within the customer’s portion of the shared responsibility model. As a managed service, AWS Glue is protected by the AWS global network security procedures that are described in the AWS Architecture Center: Security, Identity, & Compliance.

You can establish a private connection between your VPC and AWS Glue by creating an interface VPC endpoint. Interface endpoints are powered by AWS PrivateLink, a technology that enables you to privately access AWS Glue APIs or JDBC/ODBC endpoints without an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection. Using an interface VPC endpoint, instances in your VPC don’t need public IP addresses to communicate with AWS Glue. The use of interface VPC endpoints also ensures that traffic between your VPC and AWS Glue does not leave the Amazon network.
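
A minimal sketch of creating such an interface endpoint follows, assuming boto3 and hypothetical VPC, subnet, and security group IDs. It builds the arguments for the Amazon EC2 CreateVpcEndpoint operation; the service name follows the com.amazonaws.<region>.glue pattern.

```python
def glue_endpoint_request(region: str, vpc_id: str,
                          subnet_ids: list, security_group_ids: list) -> dict:
    """Build CreateVpcEndpoint arguments for an AWS Glue interface endpoint."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.glue",
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
        # Resolve the regional Glue hostname to the endpoint's private IPs.
        "PrivateDnsEnabled": True,
    }


def create_endpoint(region: str, request: dict) -> None:
    """Create the endpoint (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder runs without the SDK

    boto3.client("ec2", region_name=region).create_vpc_endpoint(**request)


# Placeholder identifiers for illustration only.
request = glue_endpoint_request(
    "us-east-1", "vpc-id1234", ["subnet-id1234"], ["sg-id1234"]
)
```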

Resource-based policies are attached to the Data Catalog itself rather than to an IAM identity, and you can use them to restrict access to your Data Catalog resources. These resources include databases, tables, connections, and user-defined functions, along with the Data Catalog APIs that interact with these resources. Resource policies can’t be attached to other AWS Glue resources such as jobs, triggers, development endpoints, crawlers, or classifiers.

Consider the following example. Suppose that the following policy is attached to the Data Catalog in Account 1. It grants the IAM identity admin in Account 1 permission to create tables in the database DB in Account 1. It also grants the same permission to the “etlaccess” role in Account 2.

{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "glue:CreateTable"
        ],
        "Principal": {"AWS": [
          "arn:aws:iam::account-1-id:user/admin",
          "arn:aws:iam::account-2-id:role/etlaccess"
        ]},
        "Resource": [
          "arn:aws:glue:us-east-1:account-1-id:table/db/books",
          "arn:aws:glue:us-east-1:account-1-id:database/db",
          "arn:aws:glue:us-east-1:account-1-id:catalog"
        ]
      }
    ]
}
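
One way to attach a policy like this programmatically is the PutResourcePolicy operation, which takes the policy as a JSON string. The sketch below assumes boto3 and hypothetical 12-digit account IDs in place of the account-1-id and account-2-id placeholders; the helper names are ours.

```python
import json


def catalog_resource_policy(account_1: str, account_2: str,
                            region: str = "us-east-1") -> dict:
    """Rebuild the example policy with concrete account IDs."""
    prefix = f"arn:aws:glue:{region}:{account_1}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["glue:CreateTable"],
                "Principal": {
                    "AWS": [
                        f"arn:aws:iam::{account_1}:user/admin",
                        f"arn:aws:iam::{account_2}:role/etlaccess",
                    ]
                },
                "Resource": [
                    f"{prefix}:table/db/books",
                    f"{prefix}:database/db",
                    f"{prefix}:catalog",
                ],
            }
        ],
    }


def put_policy(policy: dict) -> None:
    """Attach the policy to the Data Catalog (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builder runs without the SDK

    boto3.client("glue").put_resource_policy(PolicyInJson=json.dumps(policy))


# Hypothetical account IDs, for illustration only.
policy = catalog_resource_policy("111111111111", "222222222222")
```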

Access can be further restricted to specific AWS services by leveraging the aws:CalledVia condition key, which allows you to create distinct access rules for the actions performed by your IAM principals. You can also restrict access to a specific VPC endpoint using the aws:SourceVpce condition key. Further details, including an example policy demonstrating how to limit access to AWS Glue, can be found in this blog post.

Automating audits with APIs with AWS Glue

AWS Config monitors the configuration of resources and provides some out-of-the-box rules to alert when resources fall into a non-compliant state. While there are no out-of-the-box managed rules for AWS Glue, there are many rules provided for the services that AWS Glue commonly interacts with, such as Amazon S3, Amazon RDS, and Amazon Redshift. These rules can be applied to ensure that appropriate data protection policies are in place. An example of this is the “s3-bucket-ssl-requests-only” rule, which checks if Amazon S3 buckets have policies that require requests to use SSL (HTTPS). The rule is COMPLIANT if buckets explicitly deny access to HTTP requests. The rule is NON_COMPLIANT if bucket policies allow HTTP requests. Alerts can be delivered via Amazon SNS if a resource is determined to be non-compliant. AWS Config also includes automatic remediation capabilities with AWS Config rules. The automatic remediation feature gives you the ability to associate remediation actions with AWS Config rules and to execute them automatically to address non-compliant resources without manual intervention.
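
As an illustration, the managed rule mentioned above could be enabled with the AWS Config PutConfigRule operation. This is a sketch that assumes boto3 and an AWS Config recorder that is already running in the account; the helper names are hypothetical.

```python
def ssl_requests_only_rule() -> dict:
    """Build the ConfigRule for the managed s3-bucket-ssl-requests-only rule."""
    return {
        "ConfigRuleName": "s3-bucket-ssl-requests-only",
        # Evaluate S3 buckets only.
        "Scope": {"ComplianceResourceTypes": ["AWS::S3::Bucket"]},
        # Owner "AWS" selects a managed rule by its identifier.
        "Source": {
            "Owner": "AWS",
            "SourceIdentifier": "S3_BUCKET_SSL_REQUESTS_ONLY",
        },
    }


def enable_rule(rule: dict) -> None:
    """Register the rule with AWS Config (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builder runs without the SDK

    boto3.client("config").put_config_rule(ConfigRule=rule)


rule = ssl_requests_only_rule()
```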

A wide array of options are available to monitor usage and detect issues. AWS Glue integrates with AWS CloudTrail to automatically log actions taken by a user, role, or by an AWS service in AWS Glue. CloudTrail captures all API calls for AWS Glue as events. The calls captured include calls from the AWS Glue console and code calls to the AWS Glue API operations.

Following is an example of what a CloudTrail log looks like for a successful DeleteCrawler action:

{
  "eventVersion": "1.05",
  "userIdentity": {
    "type": "IAMUser",
    "principalId": "AKIAIOSFODNN7EXAMPLE",
    "arn": "arn:aws:iam::123456789012:user/johndoe",
    "accountId": "123456789012",
    "accessKeyId": "AKIAIOSFODNN7EXAMPLE",
    "userName": "johndoe"
  },
  "eventTime": "2017-10-11T22:29:49Z",
  "eventSource": "glue.amazonaws.com",
  "eventName": "DeleteCrawler",
  "awsRegion": "us-east-1",
  "sourceIPAddress": "72.21.198.64",
  "userAgent": "aws-cli/1.11.148 Python/3.6.1 Darwin/16.7.0 botocore/1.7.6",
  "requestParameters": {
    "name": "tes-alpha"
  },
  "responseElements": null,
  "requestID": "b16f4050-aed3-11e7-b0b3-75564a46954f",
  "eventID": "e73dd117-cfd1-47d1-9e2f-d1271cad838c",
  "eventType": "AwsApiCall",
  "recipientAccountId": "123456789012"
}
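
Records in this shape lend themselves to simple automated review. The following sketch uses only the Python standard library and assumes the events have already been loaded as dictionaries (for example, from CloudTrail log files delivered to Amazon S3). The choice to flag every action whose name starts with “Delete” is an illustrative assumption, not an AWS recommendation.

```python
def destructive_glue_events(events: list) -> list:
    """Return who/what/when summaries for AWS Glue Delete* API calls."""
    findings = []
    for event in events:
        if event.get("eventSource") != "glue.amazonaws.com":
            continue  # not an AWS Glue call
        if not event.get("eventName", "").startswith("Delete"):
            continue  # only flag destructive actions
        params = event.get("requestParameters") or {}
        findings.append({
            "time": event.get("eventTime"),
            "user": event.get("userIdentity", {}).get("arn"),
            "action": event["eventName"],
            "target": params.get("name"),
        })
    return findings


# A trimmed copy of the record shown above, plus an unrelated S3 event.
sample_events = [
    {
        "eventTime": "2017-10-11T22:29:49Z",
        "eventSource": "glue.amazonaws.com",
        "eventName": "DeleteCrawler",
        "userIdentity": {"arn": "arn:aws:iam::123456789012:user/johndoe"},
        "requestParameters": {"name": "tes-alpha"},
    },
    {"eventSource": "s3.amazonaws.com", "eventName": "DeleteObject"},
]

for finding in destructive_glue_events(sample_events):
    print(finding)
```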

Note that CloudTrail does not log all information regarding the calls. For example, it doesn’t log certain sensitive information, such as the ConnectionProperties used in connection requests, and it logs a null instead of the responses returned by the following APIs:

BatchGetPartition	GetCrawlers	GetJobs	GetTable
CreateScript	GetCrawlerMetrics	GetJobRun	GetTables
GetCatalogImportStatus	GetDatabase	GetJobRuns	GetTableVersions
GetClassifier	GetDatabases	GetMapping	GetTrigger
GetClassifiers	GetDataflowGraph	GetObjects	GetTriggers
GetConnection	GetDevEndpoint	GetPartition	GetUserDefinedFunction
GetConnections	GetDevEndpoints	GetPartitions	GetUserDefinedFunctions
GetCrawler	GetJob	GetPlan

Operational access and security with AWS Glue

AWS customers in the financial services industry may require visibility into any access to their data stored on AWS. You can review third-party auditor reports such as the AWS SOC 2 Type II report, ISO 27001, and others in AWS Artifact.

Using identity-based policies (IAM policies), you can grant a user or group in your account permission to create, access, or edit an AWS Glue resource, such as a table in the AWS Glue Data Catalog, by attaching a policy to that user or group.

The following is an example of an identity-based policy that allows specific AWS Glue actions. The Resource value limits these actions to selected tables and databases in the Asia Pacific (Mumbai) Region, using a placeholder for the specific AWS account number.

{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Sid": "GetTables",
        "Effect": "Allow",
        "Action": [
          "glue:GetTable",
          "glue:GetTables",
          "glue:GetDatabase",
          "glue:GetDatabases"
        ],
        "Resource": [
          "arn:aws:glue:ap-south-1:account-id:catalog",
          "arn:aws:glue:ap-south-1:account-id:database/db",
          "arn:aws:glue:ap-south-1:account-id:table/db/books"
        ]
      }
    ]
}

It’s important to implement IAM policies that follow the principle of least privilege and enforce separation of duties with appropriate authorization for each interaction with your AWS resources. Least privilege is the practice of granting only the permissions required to complete a task. AWS Glue provides several controls to help you accomplish this.

Attribute-based access control

AWS Glue provides tag-based access controls on your AWS Glue connections. This gives you the ability to restrict create, update, get, and delete actions based on the resource tag.

Following is an example of a policy statement that allows the GetConnection action only for resources whose “tagKey” tag has the value “dev-test”.

{
    "Effect": "Allow",
    "Action": [
      "glue:GetConnection"
    ],
    "Resource": "*",
    "Condition": {
      "ForAnyValue:StringEquals": {
        "aws:ResourceTag/tagKey": "dev-test"
      }
    }
}

Condition context keys

AWS Glue provides several new IAM condition keys including: “glue:VpcIds”, “glue:SubnetIds” and “glue:SecurityGroupIds”. IAM condition keys enable you to further refine the conditions under which an IAM policy statement applies. You can use the new condition keys in IAM policies when granting permissions to create and update AWS Glue jobs. They can be used to restrict jobs to run only within the designated VPC environment. Note that the VPC setting is not a direct input from the CreateJob request, but rather it is inferred from the job “connections” field which points to an AWS Glue connection.

The following is an example of a policy statement that specifies the condition keys for the CreateJob and UpdateJob actions.

{
    "Effect": "Allow",
    "Action": [
      "glue:CreateJob",
      "glue:UpdateJob"
    ],
    "Resource": [
      "*"
    ],
    "Condition": {
      "ForAnyValue:StringLike": {
        "glue:VpcIds": [
          "vpc-id1234"
        ]
      }
    }
}

Two additional context keys have also been added for AWS Glue role sessions:

AWS Glue adds the “glue:CredentialIssuingService=glue.amazonaws.com” context key to each role session that AWS Glue makes available to the job execution environment and development endpoint. AWS Glue also adds the “glue:RoleAssumedBy=glue.amazonaws.com” context key to each role session that AWS Glue creates when calling other AWS services on your behalf.

You can specify the condition context keys in a policy, as shown below, and attach that policy to the role used by an AWS Glue job. This ensures that certain actions are allowed or denied based on whether or not the role session is being used within an AWS Glue job execution environment.

{
    "Effect": "Allow",
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::confidential-bucket/*",
    "Condition": {
        "StringEquals": {
            "glue:CredentialIssuingService": "glue.amazonaws.com"
        }
    }
}

Disabling the AWS Glue service proxy

For AWS Glue Spark jobs with a connection to a customer VPC, AWS Glue uses a local proxy to send traffic through the AWS Glue service VPC for requests to S3 in order to download customer scripts and libraries, as well as for requests to CloudWatch to publish job logs and metrics. This proxy allows jobs to function normally even if the customer VPC does not have a valid route to S3 and CloudWatch. The special job parameter “disable-proxy” gives you the option to change this behavior: it disables the service proxy and forces all of these calls through your VPC. This helps ensure that all AWS calls originating from your script obey any customized network control policies that you place on the role sessions making these calls.

Note: When using this feature, you need to ensure that your VPC has configured a route to Amazon S3, AWS Glue and CloudWatch using a NAT gateway or VPC endpoint. Without a properly configured route to these services, jobs may fail or be unable to publish logs and job metrics to CloudWatch.

The following is an example that creates an AWS Glue job using disable-proxy:

aws glue create-job \
    --name no-proxy-job \
    --role GlueDefaultRole \
    --command "Name=glueetl,ScriptLocation=s3://my-bucket/glue-script.py" \
    --connections Connections="traffic-monitored-connection" \
    --default-arguments '{"--disable-proxy" : "true"}'
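
For teams that provision jobs from code rather than from the CLI, the same request can be sketched with boto3. The example assumes boto3 is installed and reuses the names from the CLI command above; the helper functions are hypothetical.

```python
def no_proxy_job_request(name: str, role: str,
                         script_location: str, connection: str) -> dict:
    """Build CreateJob arguments with the service proxy disabled."""
    return {
        "Name": name,
        "Role": role,
        "Command": {"Name": "glueetl", "ScriptLocation": script_location},
        # The job's VPC settings are taken from this AWS Glue connection.
        "Connections": {"Connections": [connection]},
        # Job arguments are passed as strings, as in the CLI example.
        "DefaultArguments": {"--disable-proxy": "true"},
    }


def create_job(request: dict) -> None:
    """Create the job in AWS Glue (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder runs without the SDK

    boto3.client("glue").create_job(**request)


job_request = no_proxy_job_request(
    name="no-proxy-job",
    role="GlueDefaultRole",
    script_location="s3://my-bucket/glue-script.py",
    connection="traffic-monitored-connection",
)
```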

Using AWS Lake Formation to enforce AWS Glue access controls

Using AWS Lake Formation, customers can build a common data access and governance framework for data in the data lake. AWS Lake Formation is a service built on AWS Glue. AWS Lake Formation can be used to manage AWS Glue crawlers, AWS Glue ETL jobs, the Data Catalog, security settings, and access control. Once the data is securely stored in the data lake, customers can access the data through their choice of analytics services, as shown in the following diagram.

Figure 3: How AWS Lake Formation works

Using an administrative role in Lake Formation, customers can define security policy-based rules for users and applications, and integration with AWS IAM authenticates those users and roles. Once the rules are defined, Lake Formation enforces the access controls. Related use cases include:

When Amazon Athena users select the AWS Glue Data Catalog in the query editor, they can query only the databases, tables, and columns that they have Lake Formation permissions on.
When Amazon Redshift users create an external schema on a database in the AWS Glue Data Catalog, they can query only the tables and columns in that schema on which they have Lake Formation permission.
For AWS Glue console operations (such as viewing a list of tables) and all API operations, AWS Glue users can access only the databases and tables on which they have Lake Formation permission.

Each time an AWS Glue principal (user, group, or role) runs a query on data registered with Lake Formation, Lake Formation verifies that the principal has the appropriate permissions to the database, table, and the underlying Amazon S3 objects. If the principal has access, Lake Formation vends temporary credentials to AWS Glue, and the query runs.

Figure 4: Access flow to AWS Glue when using AWS Lake Formation
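
The permissions that Lake Formation checks in this flow are granted by a data lake administrator. The sketch below, assuming boto3 and hypothetical principal, database, table, and column names, builds a GrantPermissions request that gives a role SELECT on two columns of a table.

```python
def column_select_grant(principal_arn: str, database: str,
                        table: str, columns: list) -> dict:
    """Build GrantPermissions arguments for column-level SELECT access."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


def grant(request: dict) -> None:
    """Apply the grant in Lake Formation (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builder runs without the SDK

    boto3.client("lakeformation").grant_permissions(**request)


# Placeholder names for illustration only.
grant_request = column_select_grant(
    "arn:aws:iam::123456789012:role/etlaccess",
    "db",
    "books",
    ["title", "author"],
)
```

With a grant like this in place, a query from that role through Amazon Athena or AWS Glue would only be able to read the listed columns.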

Summary

In this post, we reviewed AWS Glue and highlighted key information that can help FSI customers accelerate the approval of the service within these five categories:

Achieving compliance
Data protection
Isolation of compute environments
Automating audits with APIs
Operational access and security

While not a one-size-fits-all approach, this guidance can be adapted to meet your organization’s security and compliance requirements, and it provides a consolidated list of key areas for AWS Glue.

In the meantime, be sure to visit our AWS Financial Services Industry blog channel and stay tuned for more financial services news and best practices.

AWS for Industries