Redacting sensitive information with user-defined functions in Amazon Athena
Amazon Athena now supports user-defined functions (in Preview), a feature that enables you to write custom scalar functions and invoke them in SQL queries. Although Athena provides built-in functions, UDFs enable you to perform custom processing such as compressing and decompressing data, redacting sensitive data, or applying customized decryption. You can write your UDFs in Java using the Athena Query Federation SDK. When a UDF is used in a SQL query submitted to Athena, it’s invoked and run on AWS Lambda. You can use UDFs in both SELECT and FILTER clauses of a SQL query, and invoke multiple UDFs in the same query. Athena UDF functionality is available in Preview mode in the US East (N. Virginia) Region.
This blog post covers basic functionalities of Athena UDF, writing your own UDF, building and publishing your UDF to AWS Serverless Application Repository, configuring the UDF connector application, and using the UDF in Athena queries. You can implement use cases involving sensitive data, such as personal identifiable information (PII), credit cards, or Social Security numbers (SSN). In this post, we deploy a Redact UDF and use that to mask sensitive information.
Before creating your development environment, you must have the following prerequisites:
- An AWS account that provides access to AWS services
- An AWS Identity and Access Management (IAM) user with an access key and secret key to configure the AWS Command Line Interface (AWS CLI)
- The IAM user has permissions to create an IAM role and policies, and create stacks in AWS CloudFormation
- Git, Maven, and an AWS Cloud9 integrated development environment (IDE) installed
To set up the development environment and address these prerequisites, deploy the CloudFormation template in the first part of this series, Extracting and joining data from multiple data sources with Athena Federated Query. The post provides instructions on building the required test environment and resources using the CloudFormation template.
Creating your IDE
After you deploy the CloudFormation template, you need to create the required AWS resources. To create the development environment to build and deploy the UDF, we use an AWS Cloud9 IDE. On the AWS Cloud9 console, locate your environment and choose Open IDE.
AWS Cloud9 Resize
The AWS Cloud9 IDE comes with a default 10 GB disk space, which can fill quickly when setting up the development environment, so you should resize it.
- Run the following command in the AWS Cloud9 IDE terminal to get the resize script:
- Run the script by issuing the following command on the terminal to resize the disk to 20 GB:
- Check the free space on the disk with the following code:
You should see something like the following screenshot.
Setting up the IDE
Next, you clone the SDK and prepare your IDE.
- Make sure that Git is installed on your system by entering the following code:
- To install the Athena Query Federation SDK, enter the following command at the command line to clone the SDK repository. This repository includes the SDK, examples, and a suite of data source connectors.
If you’re working on a development machine that already has Apache Maven, the AWS CLI, and the AWS Serverless Application Model build tool installed, you can skip this step.
- From the root of the
aws-athena-query-federationdirectory that you created when you cloned the repository, run the
prepare_dev_env.shscript that prepares your development environment:
This script requires manual inputs to run (choosing Enter as needed during the setup steps when prompted). You can edit this script to remove the manual inputs if you want to automate the setup entirely.
- Update your shell to source new variables created by the installation process or restart your terminal session:
- Run the following code from the
athena-federation-sdkdirectory within the GitHub project you checked out earlier:
Adding the UDF code and publishing the connector
In this section, you add your UDF function, build the JAR file, and deploy the connector.
- In the AWS Cloud9 IDE, expand the
aws-athena-query-federationproject and navigate to the
- Choose the file (double-click) to open it for editing.
Now we add the UDF code for a
String Redact function, which redacts a string to show only the last four characters. You can use this UDF function to mask sensitive information.
- Enter the following code:
You can also copy the modified code with the following command (which must be run from the
After copying the file, you can open it in the AWS Cloud9 IDE to see its contents.
- To build the JAR file, save the file and run
mvn clean installto build your project:
After it successfully builds, a JAR file is created in the target folder of your project named
artifactId-version.jar, where artifactId is the name you provided in the Maven project, for example,
- From the
athena-udfsdirectory, run the following code to publish the connector to your private AWS Serverless Application Repository. The S3_BUCKET_NAME in the command is the Amazon Simple Storage Service (Amazon S3) location where a copy of the connector's code is stored for the AWS Serverless Application Repository to retrieve it.
This allows users with relevant permission levels to deploy instances of the connector via a one-click form.
When the connector is published successfully, it looks like the following screenshot.
AthenaUserDefinedFunctions, choose the link shown in the terminal after the publish is successful or navigate to the AWS Serverless Application Repository by choosing Available Applications, Private applications.
Setting up the UDF connector
Now that the UDF connector code is published, we can install the UDF connector to use with Athena.
- Choose the
AthenaUserDefinedFunctionsapplication listed on the Private applications section in the AWS Serverless Application Repository.
- For Application name, leave it as the default name
- For SecretNameorPrefix, enter a secret name if you have already saved it in AWS Secrets Manager; otherwise, enter
- For LambdaFunctionName, enter
- Leave the remaining fields as default.
- Select I acknowledge that this app creates custom IAM roles.
- Choose Deploy.
Querying with UDF in Athena
Now that the UDF connector code is deployed, we can run Athena queries that use the UDF.
If you ran the CloudFormation template from Part 1 of this blog series, the
AmazonAthenaPreviewFunctionality workgroup was already created. If not, choose Create Workgroup on the Athena console and create a workgroup named
AmazonAthenaPreviewFunctionality and set up your query result location in Amazon S3.
To proceed, make sure you are in the workgroup
AmazonAthenaPreviewFunctionality. If not, choose the workgroup
AmazonAthenaPreviewFunctionality and choose Switch workgroup.
You can now run a query to use the Redact UDF to mask sensitive information from PII columns. To show the comparison, we have included the PII column and masked data as part of the query results. If you ran the CloudFormation template from Part 1 of this series, you can navigate to the Saved Queries on the Athena console and choose RedactUdfCustomerAddress.
The following screenshot shows your query.
After the query runs, you should see results like the following screenshot. The
redact_name, redact_phone, and
redact_address columns only show the last four characters.
To clean up the resources created as part of your CloudFormation template, complete the following steps:
- On the Amazon S3 console, empty and delete the bucket
- If you’re using the AWS CLI, delete the objects in the athena-federation-workshop-<account-id> bucket with the following code (make sure you’re running this command on the correct bucket):
- Use the AWS CloudFormation console or AWS CLI to delete the stacks
In this post, you learned about Athena user-defined functions, how to create your own UDF, and how to deploy it to a private AWS Serverless Application Repository. You also learned how to configure the UDF and use it in your Athena queries. In the next post of this series, we discuss and demonstrate how to use a machine learning (ML) anomaly detection model developed on Amazon SageMaker and use that model in Athena queries to invoke an ML inference function to detect anomaly values in our orders dataset.
About the Authors
Saurabh Bhutyani is a Senior Big Data specialist solutions architect at Amazon Web Services. He is an early adopter of open source Big Data technologies. At AWS, he works with customers to provide architectural guidance for running analytics solutions on Amazon EMR, Amazon Athena, AWS Glue, and AWS Lake Formation.
Amir Basirat is a Big Data specialist solutions architect at Amazon Web Services, focused on Amazon EMR, Amazon Athena, AWS Glue and AWS Lake Formation, where he helps customers craft distributed analytics applications on the AWS platform. Prior to his AWS Cloud journey, he worked as a Big Data specialist for different technology companies. He also has a PhD in computer science, where his research was primarily focused on large-scale distributed computing and neural networks.