AWS for Industries

Introducing a New Blog Series, Service Spotlight for Financial Services: Featuring AWS Glue DataBrew

Intro to service

Over the last 20 years, the number of data analytics users has grown with the surge in the variety and volume of data that companies can collect. To prepare data for analytics or machine learning projects, data scientists and analysts have to either use spreadsheets to explore and experiment with data, or rely on data engineers and ETL developers to transform data into the required format. This often delays projects by weeks or months, as customers can spend up to 80% of their time on data preparation tasks rather than actually analyzing the data.

AWS Glue DataBrew makes it easy for data scientists and data analysts to clean and normalize data using a visual interface, reducing the time it takes to prepare data by up to 80%. With Glue DataBrew, you can visualize, clean, and normalize data directly from your data lake, data warehouses, and databases. You can choose from over 250 built-in transformations to automate data cleaning and normalization tasks, and save these transformation steps so they’re applied to new data as it comes in. You can evaluate the quality of your data by profiling it to understand data patterns and detect anomalies, all without writing a single line of code. As part of AWS Glue, Glue DataBrew is serverless so you don’t need to manage the infrastructure. You pay only for what you use, with no upfront commitment.

Financial institutions can leverage DataBrew to mask and redact sensitive data, assess data quality rules, and filter anomalies to help improve downstream accuracy.

Achieving compliance with AWS Glue DataBrew

AWS Glue DataBrew is a managed service. Third-party auditors regularly assess its security and compliance as part of multiple AWS compliance programs. Under the AWS shared responsibility model, AWS Glue DataBrew is in scope of the following compliance programs. You can obtain the corresponding compliance reports under an AWS non-disclosure agreement (NDA) through AWS Artifact.

  • DoD CC SRG (IL2 – East/West)
  • FedRAMP (Moderate – East/West)
  • FINMA
  • HIPAA
  • HITRUST CSF
  • ISO/IEC 27001:2013, 27017:2015, 27018:2019, 27701:2019, ISO/IEC 9001:2015 and CSA STAR CCM v3.0.1
  • OSPAR
  • PCI
  • SOC 1, 2, 3
  • GSMA (US-East Ohio)
  • PiTuKri
  • MTCS (Regions: US-East, US-West, Singapore, Seoul)

Your scope of the shared responsibility model when using AWS Glue DataBrew is determined by the sensitivity of your data, your organization’s compliance objectives, and applicable laws and regulations. AWS provides several resources for compliance validation.

Data protection with AWS Glue DataBrew

Data protection is the process of preventing critical information from being corrupted, compromised, or lost. Encryption is a recommended practice for ensuring the confidentiality and integrity of the data being processed, both in transit and at rest.

At-Rest Encryption:
Encrypting the DataBrew project and jobs: Projects and jobs can read encrypted data, and jobs can write encrypted data using AWS Key Management Service (AWS KMS). You can use KMS keys to encrypt the job logs that are generated by DataBrew jobs. You can specify encryption keys using the DataBrew console or the DataBrew API.

Setting up encryption when creating jobs: When encryption is enabled, DataBrew encrypts the job output file using SSE-S3 or AWS KMS, depending on your selection.
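
As a rough illustration of the API route, the following sketch assembles the request parameters for a recipe job that writes KMS-encrypted output. The parameter names (EncryptionMode, EncryptionKeyArn, LogSubscription, Outputs) follow the DataBrew CreateRecipeJob operation as we understand it. The helper build_encrypted_job_request is hypothetical, and the job, project, bucket, role, and key values are placeholders, not values from this post.

```python
import json

# Encryption modes that a DataBrew job accepts for its output files.
VALID_MODES = {"SSE-S3", "SSE-KMS"}


def build_encrypted_job_request(name, project, role_arn, bucket,
                                mode="SSE-KMS", kms_key_arn=None):
    """Assemble keyword arguments for a CreateRecipeJob call (hypothetical helper)."""
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported encryption mode: {mode}")
    if mode == "SSE-KMS" and not kms_key_arn:
        raise ValueError("SSE-KMS requires a KMS key ARN")

    request = {
        "Name": name,
        "ProjectName": project,
        "RoleArn": role_arn,
        "EncryptionMode": mode,
        "LogSubscription": "ENABLE",  # publish job logs to CloudWatch Logs
        "Outputs": [{"Location": {"Bucket": bucket}}],
    }
    if mode == "SSE-KMS":
        request["EncryptionKeyArn"] = kms_key_arn
    return request


# Placeholder values; replace them with your own resources.
request = build_encrypted_job_request(
    name="poc-encrypted-job",
    project="poc-transform",
    role_arn="arn:aws:iam::111122223333:role/service-role/AWSGlueDataBrewServiceRole-poc",
    bucket="example-output-bucket",
    kms_key_arn="arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
)
print(json.dumps(request, indent=2))

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("databrew").create_recipe_job(**request)
```

Building the request separately from sending it makes it easy to review or unit test the encryption settings before any job is created.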

When you enable encryption, it applies to both the job output in Amazon S3 and the job logs in CloudWatch. The IAM role that you pass must have the following AWS KMS permissions. Following is an example of a policy that provides access to an AWS KMS key in account-1.

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "kms:GenerateDataKey",
                    "kms:GenerateDataKeyPair"
                ],
                "Resource": [
                    "arn:aws:kms:us-east-1:account-1-id:key/5f1ffef6-3318-4df7-8354-71a16eb58ed7"
                ]
            }
        ]
    }

Ensure that the AWS KMS key is set to “Enabled” before it is used.

In Transit Encryption:
For data in transit, AWS offers Secure Sockets Layer (SSL) encryption. DataBrew support for JDBC data sources comes through AWS Glue. When connecting to JDBC data sources, DataBrew uses the settings on your AWS Glue connection, including the Require SSL connection option.

Notes:

  • AWS Glue DataBrew supports symmetric AWS KMS keys.
  • Encryption of data in staging area: When you use an Amazon Redshift dataset, objects unloaded to the provided temporary directory are encrypted with SSE-S3.

Identifying and accessing personally identifiable information (PII)

When you build analytic functions or machine learning models, you need safeguards to prevent exposure of personally identifiable information (PII) data. PII is personal data that can be used to identify an individual, such as an address, bank account number, or phone number.

DataBrew provides data masking mechanisms to obfuscate PII data during the data preparation process. Identifying and masking PII data in DataBrew involves building a set of transforms that you can use to redact PII data. As part of this process, DataBrew provides PII detection and statistics in the Data Profile overview dashboard on the DataBrew console.

You can use the following data-masking techniques:

  • Substitution – Replace PII data with other authentic-looking values.
  • Shuffling – Shuffle the value from the same column in different rows.
  • Deterministic encryption – Apply deterministic encryption algorithms to the column values. Deterministic encryption always produces the same ciphertext for a value.
  • Probabilistic encryption – Apply probabilistic encryption algorithms to the column values. Probabilistic encryption produces different ciphertext each time that it’s applied.
  • Decryption – Decrypt columns based on encryption keys.
  • Nulling out or deletion – Replace a particular field with a null value or delete the column.
  • Masking out – Use character scrambling or mask certain portions in the columns.
  • Hashing – Apply hash functions to the column values.
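
To make a few of these techniques concrete, the following is a minimal sketch in plain Python. It illustrates the concepts only and does not reproduce how DataBrew implements its transforms. All function names and the sample phone numbers are hypothetical, and an HMAC stands in for deterministic encryption purely to show the "same input, same output" property (unlike encryption, it cannot be reversed).

```python
import hashlib
import hmac
import random


def mask_out(value, visible=4, mask_char="#"):
    """Masking out: hide all but the last `visible` characters."""
    hidden = max(len(value) - visible, 0)
    return mask_char * hidden + value[hidden:]


def hash_value(value):
    """Hashing: one-way SHA-256 digest of the value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def keyed_token(value, secret):
    """Deterministic keyed transform: same input and key give the same token."""
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()


def shuffle_column(values, seed=None):
    """Shuffling: reorder the values of one column across rows."""
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def null_out(values):
    """Nulling out: replace every value in the column with None."""
    return [None for _ in values]


phones = ["206-555-0100", "206-555-0101", "206-555-0102"]
secret = b"demo-only-secret"  # hypothetical key; never hard-code real keys

print([mask_out(p) for p in phones])
print(hash_value(phones[0])[:16])
print(keyed_token(phones[0], secret) == keyed_token(phones[0], secret))
print(sorted(shuffle_column(phones, seed=7)) == sorted(phones))
print(null_out(phones))
```

Deterministic transforms preserve the ability to join or group on the protected column, whereas shuffling and nulling out trade that ability for stronger obfuscation.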

Isolation of compute environments

AWS Glue DataBrew is a managed service that doesn’t have any compute resources within the customer’s portion of the shared responsibility model. As a managed service, DataBrew is protected by the AWS global network security procedures that are described in the AWS Architecture Center: Security, Identity, & Compliance.

DataBrew is a fully managed service that runs in a secure AWS data center and does not require customers to maintain physical or virtual machines. Environments for a job or project are isolated from any other environments. There is no sharing of compute resources across accounts, nor within the same customer account between jobs or projects. DataBrew accesses your data using the provided permissions only during project or job execution and does not persist your data anywhere. In addition, DataBrew does not support cross-region data processing.

You can establish a private connection between your VPC and DataBrew by creating an interface VPC endpoint. Interface endpoints are powered by AWS PrivateLink, a technology that enables you to privately access DataBrew APIs without an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection. A DataBrew VPC endpoint is not required to use DataBrew with your VPC. For more information, see Using AWS Glue DataBrew with your VPC.

The use of interface VPC endpoints also ensures that traffic between your VPC and DataBrew does not leave the Amazon network. To use AWS Glue DataBrew with an interface VPC endpoint in a private VPC subnet without NAT, you must have a gateway VPC endpoint to Amazon S3 and a VPC endpoint for the AWS Glue interface (see diagram below).

Figure 1: Using AWS Glue DataBrew with VPC endpoints
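
The endpoint layout in the figure can be sketched as three Amazon EC2 CreateVpcEndpoint requests. This is a hedged illustration: the helper endpoint_requests and all resource IDs are hypothetical, and the DataBrew service name com.amazonaws.<region>.databrew is an assumption that you should confirm against the services listed for your Region.

```python
import json


def endpoint_requests(region, vpc_id, subnet_ids, security_group_ids, route_table_ids):
    """Build CreateVpcEndpoint parameters for DataBrew, AWS Glue, and Amazon S3."""
    interface_common = {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        "SubnetIds": subnet_ids,
        "SecurityGroupIds": security_group_ids,
        "PrivateDnsEnabled": True,
    }
    return [
        # Interface endpoint for DataBrew API calls (service name is an assumption).
        {**interface_common, "ServiceName": f"com.amazonaws.{region}.databrew"},
        # Interface endpoint for AWS Glue.
        {**interface_common, "ServiceName": f"com.amazonaws.{region}.glue"},
        # Gateway endpoint for Amazon S3, attached to route tables, not subnets.
        {
            "VpcEndpointType": "Gateway",
            "VpcId": vpc_id,
            "ServiceName": f"com.amazonaws.{region}.s3",
            "RouteTableIds": route_table_ids,
        },
    ]


# Placeholder IDs; replace them with the resources in your private subnet.
requests = endpoint_requests(
    region="us-east-1",
    vpc_id="vpc-EXAMPLE",
    subnet_ids=["subnet-EXAMPLE"],
    security_group_ids=["sg-EXAMPLE"],
    route_table_ids=["rtb-EXAMPLE"],
)
print(json.dumps(requests, indent=2))

# With boto3 installed and credentials configured, each request could be sent as:
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   for params in requests:
#       ec2.create_vpc_endpoint(**params)
```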

With IAM identity-based policies, you can specify allowed or denied actions and resources, and also the conditions under which actions are allowed or denied. DataBrew supports specific actions, resources, and condition keys. Following is an example of a policy that allows a user to view a DataBrew project, but only if the project tag Owner has the value of that user’s user name. This policy also grants the permissions necessary to complete this action on the console.

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListProjectsInConsole",
                "Effect": "Allow",
                "Action": "databrew:ListProjects",
                "Resource": "*"
            },
            {
                "Sid": "ViewProjectIfOwner",
                "Effect": "Allow",
                "Action": "databrew:DescribeProject",
                "Resource": "arn:aws:databrew:*:*:project/*",
                "Condition": {
                    "StringEquals": {"aws:ResourceTag/Owner": "${aws:username}"}
                }
            }
        ]
    }
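
The effect of the Owner tag condition can be pictured with a deliberately simplified model. The sketch below is not how IAM evaluates policies internally (real evaluation also weighs explicit denies, other attached policies, and permission boundaries); the function can_view_project and the project names and tags are hypothetical.

```python
def can_view_project(username, project_tags):
    """Simplified model of a StringEquals condition on the Owner tag.

    The comparison is case-sensitive, and a project without an Owner tag
    never matches.
    """
    owner = project_tags.get("Owner")
    return owner is not None and owner == username


# Hypothetical projects and their tags.
projects = {
    "poc-transform": {"Owner": "joe"},
    "risk-model-prep": {"Owner": "maria"},
    "untagged-project": {},
}

for name, tags in projects.items():
    print(name, can_view_project("joe", tags))
```

In this model, the user joe can view only poc-transform, which is the behavior that tagging each project with its owner is meant to achieve.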

Automating audits with APIs with AWS Glue DataBrew

AWS Glue DataBrew API calls are recorded as databrew:Action, not as glue:Action as is the case for AWS Glue API calls.

AWS Config monitors the configuration of resources and provides some out-of-the-box rules to alert you when resources fall into a non-compliant state. While there are no out-of-the-box managed rules for AWS Glue DataBrew, there are many rules for the services that Glue DataBrew commonly interacts with, such as Amazon S3, AWS KMS, Amazon RDS, Amazon Redshift, Amazon AppFlow, Snowflake, and other JDBC connections.

A wide array of options are available to monitor usage and detect issues. AWS Glue DataBrew integrates with AWS CloudTrail to automatically log actions taken by a user, role, or by an AWS service in AWS Glue DataBrew. CloudTrail captures all API calls for AWS Glue DataBrew as events. The calls captured include calls from the AWS Glue DataBrew console and code calls to the AWS Glue DataBrew API operations.

Following is an example of what a CloudTrail log looks like for a successful CreateProject action:

    {
        "eventVersion": "1.08",
        "userIdentity": {
            "type": "AssumedRole",
            "principalId": "AIDACKCEVSQ6C2EXAMPLE:joe-example",
            "arn": "arn:aws:sts::1234567890:assumed-role/myrole/joe-example",
            "accountId": "1234567890",
            "accessKeyId": "AKIAIOSFODNN7EXAMPLE",
            "sessionContext": {
                "sessionIssuer": {
                    "type": "Role",
                    "principalId": "AIDACKCEVSQ6C2EXAMPLE",
                    "arn": "arn:aws:iam::1234567890:role/myrole",
                    "accountId": "1234567890",
                    "userName": "joe"
                },
                "webIdFederationData": {},
                "attributes": {
                    "creationDate": "2022-02-14T21:54:42Z",
                    "mfaAuthenticated": "false"
                }
            }
        },
        "eventTime": "2022-02-14T21:57:36Z",
        "eventSource": "databrew.amazonaws.com",
        "eventName": "CreateProject",
        "awsRegion": "ap-northeast-1",
        "sourceIPAddress": "192.0.2.0",
        "userAgent": "example/user/agent",
        "requestParameters": {
            "RecipeName": "transform-recipe",
            "DatasetName": "poc-dataset",
            "Sample": {
                "Size": 1000,
                "Type": "FIRST_N"
            },
            "RoleArn": "arn:aws:iam::1234567890:role/service-role/AWSGlueDataBrewServiceRole-poc",
            "Name": "poc-transform"
        },
        "responseElements": {
            "Name": "repro-transform"
        },
        "requestID": "6090e6e9-2c76-409a-90dd-984b85822c80",
        "eventID": "d48810c2-0379-4986-8aa3-27f26a702f04",
        "readOnly": false,
        "eventType": "AwsApiCall",
        "managementEvent": true,
        "recipientAccountId": "1234567890",
        "eventCategory": "Management"
    }
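
For audit automation, records like the one above can be filtered programmatically. The following sketch assumes a CloudTrail log file whose top-level Records array holds the events, and it keeps only DataBrew events that are not read-only. The function name databrew_write_events and the trimmed sample records are made up for illustration.

```python
import json

DATABREW_SOURCE = "databrew.amazonaws.com"


def databrew_write_events(log_text):
    """Summarize the non-read-only DataBrew events in a CloudTrail log file."""
    records = json.loads(log_text).get("Records", [])
    return [
        {
            "time": r.get("eventTime"),
            "action": r.get("eventName"),
            "principal": r.get("userIdentity", {}).get("arn"),
            "region": r.get("awsRegion"),
        }
        for r in records
        if r.get("eventSource") == DATABREW_SOURCE and r.get("readOnly") is False
    ]


# Trimmed, made-up records in the shape of the example above.
sample_log = json.dumps({
    "Records": [
        {
            "eventTime": "2022-02-14T21:57:36Z",
            "eventSource": "databrew.amazonaws.com",
            "eventName": "CreateProject",
            "awsRegion": "ap-northeast-1",
            "readOnly": False,
            "userIdentity": {
                "arn": "arn:aws:sts::1234567890:assumed-role/myrole/joe-example"
            },
        },
        {
            "eventTime": "2022-02-14T21:58:02Z",
            "eventSource": "databrew.amazonaws.com",
            "eventName": "ListProjects",
            "awsRegion": "ap-northeast-1",
            "readOnly": True,
            "userIdentity": {
                "arn": "arn:aws:sts::1234567890:assumed-role/myrole/joe-example"
            },
        },
        {
            "eventTime": "2022-02-14T21:59:10Z",
            "eventSource": "glue.amazonaws.com",
            "eventName": "CreateJob",
            "awsRegion": "ap-northeast-1",
            "readOnly": False,
            "userIdentity": {
                "arn": "arn:aws:sts::1234567890:assumed-role/myrole/joe-example"
            },
        },
    ]
})

for event in databrew_write_events(sample_log):
    print(event["time"], event["action"], event["principal"])
```

Filtering on eventSource rather than on the event name keeps AWS Glue events out of the result, which matters because the two services log under different sources.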

Operational access and security

AWS customers in the financial services industry may require visibility to any access of their data stored on AWS. You can review third-party auditor reports such as the AWS SOC 2 Type II report, ISO 27001, and others in AWS Artifact.

AWS provides managed policies specific to DataBrew. To restrict administrative access, you can attach the AWSDataBrewConsoleAccess policy only to the users in the account who need to use the DataBrew console. This policy provides full access to AWS Glue DataBrew through the AWS Management Console, including select access to related services (for example, Amazon S3, AWS KMS, and AWS Glue). You can combine it with multi-factor authentication (MFA) for enhanced security.

AWS Glue DataBrew (service prefix: databrew) provides service-specific resources, actions, and condition context keys for use in IAM permission policies.

Using identity-based policies (IAM policies), you can give a user or role principal in your account permissions to create, access, or edit an AWS Glue DataBrew resource, such as a project, dataset, ruleset, recipe, job, or schedule, by attaching a policy to that principal. AWS Glue DataBrew does not support resource-based policies. AWS Glue DataBrew defines aws:RequestTag/${TagKey}, aws:ResourceTag/${TagKey}, and aws:TagKeys as condition keys that can be used in the Condition element of an IAM policy to implement attribute-based access control (ABAC). More information on the IAM mechanism can be found in the Developer Guide documentation.

As a best practice for granting least privilege to users, actions with “Write” level access, such as PublishRecipe, StartJobRun, StopJobRun, StartProjectSession, SendProjectSessionAction, UntagResource, CreateDataset, DeleteDataset, and UpdateDataset, should be restricted through IAM, and logged and monitored through CloudWatch and CloudTrail logs.

The details on actions defined by AWS Glue DataBrew can be found in the Service Authorization Reference documentation. The details on setting up IAM policies for AWS Glue DataBrew can be found in the Developer Guide documentation.

Following is a sample IAM policy:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListProjectsInConsole",
                "Effect": "Allow",
                "Action": "databrew:ListProjects",
                "Resource": "*"
            },
            {
                "Sid": "ViewProjectIfOwner",
                "Effect": "Allow",
                "Action": "databrew:DescribeProject",
                "Resource": "arn:aws:databrew:*:*:project/*",
                "Condition": {
                    "StringEquals": {"aws:ResourceTag/Owner": "${aws:username}"}
                }
            }
        ]
    }

Conclusion

In this post, we reviewed AWS Glue DataBrew and highlighted key information that can help FSI customers accelerate the approval of the service within these five categories: achieving compliance, data protection, isolation of compute environments, automating audits with APIs, and operational access and security. While not a one-size-fits-all approach, the guidance can be adapted to meet your organization’s security and compliance requirements, and it provides a consolidated list of key areas for DataBrew.

Bala KP

Bala KP is a Sr Partner Solutions Architect at Amazon Web Services. He helps global system integrator partners and customers in the financial services and insurance domain to move their most sensitive workloads to AWS.

Nripendra Shrestha

Nripendra Shrestha is a Sr. Solutions Architect at Amazon Web Services. He works closely with multiple FSI customers in Japan to enable them to run critical workloads in AWS while adhering to regulatory requirements.

Cheng Xu

Cheng Xu is a principal solution architect with Amazon Web Services, having 20+ years of financial industry experience. He works with global system integrators on strategic solution development for banking and insurance customers following AWS best practices. When not working, he enjoys street food, RV travel and home improvement.