AWS Compute Blog

Building a serverless tokenization solution to mask sensitive data

This post is courtesy of Anuj Gupta, Senior Solutions Architect, and Steven David, Senior Solutions Architect.

Customers tell us that security and compliance are top priorities regardless of industry or location. Government and industry regulations are regularly updated and companies must move quickly to remain compliant. Organizations must balance the need to generate value from data and to ensure data privacy. There are many situations where it is prudent to obfuscate data to reduce the risk of exposure, while also improving the ability to innovate.

This blog discusses data obfuscation and how it can be used to reduce the risk of unauthorized access. It can also simplify PCI DSS compliance by reducing the number of components for which this compliance may apply.

Comparing tokenization and encryption

There is a difference between encryption and tokenization. Encryption is the process of using an algorithm to transform plaintext into ciphertext. An algorithm and an encryption key are required to decrypt the original plaintext.

Tokenization is the process of transforming a piece of data into a random string of characters called a token. It does not have direct meaningful value in relation to the original data. Tokens serve as a reference to the original data, but cannot be used to derive that data.

Unlike encryption, tokenization does not use a mathematical process to transform the sensitive information into the token. Instead, tokenization uses a database, often called a token vault, which stores the relationship between the sensitive value and the token. The real data in the vault is then secured, often via encryption. The token value can be used in various applications as a substitute for the original data.

For example, for processing a recurring credit card payment, the token is submitted to the vault. The index is used to fetch the original data for use in the authorization process. Recently, tokens are also being used to secure other types of sensitive or personally identifiable information. This includes data like social security numbers (SSNs), telephone numbers, and email addresses.

Overview

In this blog, we show how to design a secure, reliable, scalable, and cost-optimized tokenization solution. It can be integrated with applications to generate tokens, store ciphertext in an encrypted token vault, and exchange tokens for the original text.

In an example use-case, a data analyst needs access to a customer database. The database includes the customer’s name, SSN, credit card, order history, and preferences. Some of the customer information qualifies as sensitive data. To enforce the required information security policy, you must enforce methods such as column level access, role-based control, column level encryption, and protection from unauthorized access.

Providing access to the customer database increases the complexity of managing fine-grained access policies. Tokenization replaces the sensitive data with random unique tokens, which are stored in an application database. This lowers the complexity and the cost of managing access, while helping with data protection.

Walkthrough

This serverless application uses Amazon API Gateway, AWS Lambda, Amazon Cognito, Amazon DynamoDB, and the AWS KMS.

Serverless architecture diagram

The client authenticates with Amazon Cognito and receives an authorization token. This token is used to validate calls to the Customer Order Lambda function. The function calls the tokenization layer, providing sensitive information in the request. This layer includes the logic to generate unique random tokens and store encrypted text in a cipher database.

Lambda calls KMS to obtain an encryption key. It then uses the DynamoDB client-side encryption library to encrypt the original text and store the ciphertext in the cipher database. The Lambda function retrieves the generated token in the response from the tokenization layer. This token is then stored in the application database for future reference.

The KMS makes it easy to create and manage cryptographic keys. It provides logs of all key usage to help you meet regulatory and compliance needs.

One of the most important decisions when using the DynamoDB Encryption Client is selecting a cryptographic materials provider (CMP). The CMP determines how encryption and signing keys are generated, whether new key materials are generated for each item or are reused. It also sets the encryption and signing algorithms that are used. To identify a CMP for your workload, refer to this documentation.

The current solution selects the Direct KMS Provider as the CMP. This cryptographic materials provider returns a unique encryption key and signing key for every table item. To do this, it calls KMS every time you encrypt or decrypt an item.

The KMS process

  • To generate encryption materials, the Direct KMS Provider asks AWS KMS to generate a unique data key for each item using a customer master key (CMK) that you specify. It derives encryption and signing keys for the item from the plaintext copy of the data key, and then returns the encryption and signing keys, along with the encrypted data key, which is stored in the material description attribute of the item.
  • The item encryptor uses the encryption and signing keys and removes them from memory as soon as possible. Only the encrypted copy of the data key from which they were derived is saved in the encrypted item.
  • To generate decryption materials, the Direct KMS Provider asks AWS KMS to decrypt the encrypted data key. Then, it derives verification and signing keys from the plaintext data key, and returns them to the item encryptor.

The item encryptor verifies the item and, if verification succeeds, decrypts the encrypted values. Finally, it removes the keys from memory as soon as possible.

For enhanced security, the example creates the Lambda function inside a VPC with a security group attached to allow incoming HTTPS traffic from only private IPs. The Lambda function connects to DynamoDB and KMS via VPC endpoints instead of going through the public internet. It connects to DynamoDB using a service gateway endpoint and to KMS using an interface endpoint providing a highly available and secure connection.

Additionally, VPC endpoints can use endpoint policies to enforce allowing only permitted operations for KMS and DynamoDB over this connection. To further control the management of encryption keys, the KMS master key has a resource-based policy. It allows the Lambda layer to generate data keys for encryption and decryption, and restrict any administrative activity on master key.

To deploy this solution, follow the instructions in the aws-serverless-tokenization GitHub repo. The AWS Serverless Application Model (AWS SAM) template allows you to quickly deploy this solution into your AWS account.

Understanding the code

The solution uses the tokenizer package, deployed as a Lambda layer. It uses Python UUID4 to generate random values. You can optionally update the logic in hash_gen.py to use your own tokenization technique. For example, you could generate tokens with same length as the original text, preserving the format in the generated token.

The ddb_encrypt_item.py file contains the logic for encrypting DynamoDB items and uses a DynamoDB client-side encryption library. To learn more about how this library works, refer to this documentation.

There are three methods used in the application logic:

  • Encrypt_item encrypts the plaintext using the KMS customer managed key. In AttributeActions actions, you can specify if you don’t want to encrypt a portion of the plaintext. For example, you might exclude keys in the JSON input from being encrypted. It also requires a partition key to index the encrypted text in the DynamoDB table. The hash key is used as the name of the partition key in the DynamoDB table. The value of this partition key is the UUID token generated in the previous step.
def encrypt_item (plaintext_item,table_name):
    table = boto3.resource('dynamodb').Table(table_name)

    aws_kms_cmp = AwsKmsCryptographicMaterialsProvider(key_id=aws_cmk_id)

    actions = AttributeActions(
        default_action=CryptoAction.ENCRYPT_AND_SIGN,
        attribute_actions={'Account_Id': CryptoAction.DO_NOTHING}
    )

    encrypted_table = EncryptedTable(
        table=table,
        materials_provider=aws_kms_cmp,
        attribute_actions=actions
    )
    response = encrypted_table.put_item(Item=plaintext_item)
  • Get_decrypted_item gets the plaintext for a given partition key. For example, the UUID token using the KMS customer managed key.
  • Get_Item gets the obfuscated text, for example the ciphertext stored in the DynamoDB table for the provided partition key.

The dynamodb-encryption-sdk requires cryptography libraries as a dependency. Both of these libraries are platform-dependent and must be installed for a specific operating system. Since Lambda functions use Amazon Linux, you must install these libraries for Amazon Linux even if you are developing application code on different operating system. To do this, use the get_AMI_packages_cryptography.sh script to download the Docker image, install dependencies within the image, and export files to be used by our Lambda layer.

If you are processing DynamoDB items at a high frequency and large scale, you might exceed the AWS KMS requests-per-second limit, causing processing delays. You can use tools such as JMeter to test the required throughput based on the expected traffic for this serverless application. If you need to exceed a quota, you can request a quota increase in Service Quotas. Use the Service Quotas console or the RequestServiceQuotaIncrease operation. For details, see Requesting a quota increase in the Service Quotas User Guide. If Service Quotas for AWS KMS are not available in the AWS Region, create a case in the AWS Support Center.

After following this walkthrough, to avoid incurring future charges, delete the resources following step 7 of the README file.

Conclusion

This post shows how to use AWS Serverless services to design a secure, reliable, and cost-optimized tokenization solution. It can be integrated with applications to protect sensitive information and manage access using strict controls with less operational overhead.