How Amazon built a highly scalable and secure tokenization solution on AWS

Protecting sensitive personal data (such as payment and health data) is one of Amazon’s highest priorities. In support of this goal, Amazon developed Lumos, a highly secure and scalable internal service that provides low-latency APIs to tokenize sensitive data. Lumos is a cloud-native application built on Amazon Web Services (AWS) serverless and security offerings, processing tens of thousands of requests per second and scaling to more than six billion tokens. Lumos is built to conform with the compliance requirements of the Payment Card Industry Data Security Standard (PCI-DSS) and the Health Insurance Portability and Accountability Act (HIPAA).

With more than 240 data protection and privacy laws worldwide, it’s becoming more important for organizations to protect customer data at scale. Organizations need to adapt to changing data privacy regulations within a short amount of time to remain compliant. Tokenization has been emerging as an effective method to improve data security and reduce audit scope. You can learn more about tokenization, and why you should consider it for your needs, in the blog post How to use tokenization to improve data security and reduce audit scope.

This blog post demonstrates how Lumos uses AWS services to deliver a highly scalable, cost-effective, and reliable tokenization solution, and illustrates the design patterns that keep it highly secure.

Why data tokenization is an effective method for protecting sensitive data

Tokenization is a form of pseudonymization, a de-identification procedure that replaces sensitive data, such as personally identifiable information (PII), with unique tokens. Unlike data masking, tokenization is reversible by means of de-tokenization, but a token cannot be reverse engineered to recover the sensitive data. Therefore, it suits use cases where reversibility provides value.
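As a minimal sketch of the idea (not Lumos’s implementation), the toy vault below issues random tokens and reverses them by lookup. The class name `TokenVault`, the `tok_` prefix, and the choice to return the same token for a repeated value are assumptions made for illustration; a real service would encrypt stored values and persist them in a durable data store.

```python
import secrets


class TokenVault:
    """Toy in-memory vault (hypothetical); values are NOT encrypted here."""

    def __init__(self) -> None:
        self._token_to_value: dict[str, str] = {}
        self._value_to_token: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        # Assumption: a repeated value maps to the token issued earlier.
        if value in self._value_to_token:
            return self._value_to_token[value]
        # The token is random, so it carries no information about the value.
        token = "tok_" + secrets.token_hex(16)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        # Reversal is a lookup in the vault, not a computation on the token.
        return self._token_to_value[token]


vault = TokenVault()
card_token = vault.tokenize("4111111111111111")
original = vault.detokenize(card_token)
```

Because the token is generated independently of the value, possession of the token alone reveals nothing; recovering the data requires access to the vault.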

Tokenization offers a more secure way to mask data than encryption: it replaces sensitive information with non-sensitive tokens that can be reversed only with authorized access, adding a layer of data security and supporting compliance. Building a centralized tokenization solution streamlines governance and upholds the principle of data minimization by processing only essential personal data. This consolidation supports consistent policy enforcement across applications, enhancing security and lowering breach risks. A scalable, low-latency tokenization solution can support both real-time and batch use cases across multiple applications in an organization.

By using centralized tokenization APIs, you can tokenize sensitive data as soon as it enters your data capture application, limiting the exposure of sensitive data. This helps prevent sensitive data from being persisted in different data stores and sent to multiple systems. In most cases, de-tokenization is needed only for displaying sensitive data, and workloads can persist or send tokens to other workloads without access to the sensitive information. The de-tokenization method can include fine-grained authorization and access logging to provide information for auditing purposes. Tokenization and de-tokenization methods can handle various data types, each with different sensitivity levels, and ensure that only users with the appropriate roles can access sensitive data.

Using tokenization solutions can also result in lower compliance costs and reduce exposure to costly data breaches. Additional cost savings can be achieved by re-purposing a centralized tokenization solution across multiple workloads with streamlined security controls.

What is Lumos?

Amazon’s Lumos is built using AWS services and is designed to tokenize sensitive data such as PII and data in scope for PCI and HIPAA. Lumos offers API-level integration for a variety of use cases inside Amazon. Amazon retail payments uses it for PCI compliance during credit card processing and for PII redaction when storing customers’ private data. Amazon India uses Lumos to store and process credit card information in compliance with both PCI standards and data localization regulations. Consumers of Lumos use a centralized API that is designed to consistently protect sensitive data across multiple Amazon businesses, their various workloads, and multiple technologies.

As of today, Lumos can scale to more than six billion tokens and processes tens of thousands of requests per second with double-digit millisecond latency. Because Lumos is available through a centralized API, it can support new data protection and privacy requirements across different businesses quickly and cost-effectively.

Lumos inherently enforces a wide array of security controls, making it secure by design. It uses AWS services to reduce the overhead of system patching, vulnerability management, audit trails, and monitoring in order to help meet compliance requirements. Lumos employs a comprehensive security solution to counter both internal and external risks. This includes a Zero Trust model, continuous monitoring, and data protection, alongside a multi-layered defense strategy. The security model emphasizes the principle of least privilege, AWS security controls, automation, and multi-person approval. The security solution also establishes robust data perimeter protection on AWS by using Amazon Virtual Private Cloud (Amazon VPC) to break down security perimeters into small zones, together with VPC endpoints, security groups, and network access control lists (ACLs), to help protect sensitive data and prevent unauthorized access.

Prior to migrating to AWS, the legacy on-premises solution required significant effort to keep security patches current and was susceptible to hardware and network availability issues. These issues are now addressed through AWS compute and network infrastructure. Additionally, Amazon Elastic Container Service auto scaling and AWS Cloud Development Kit constructs allow for horizontal scaling, and multi-Region replication in Amazon DynamoDB and AWS Key Management Service allows for geographic expansion.

Lumos’s AWS-based architecture enables Amazon to meet critical availability, resiliency, and operational requirements. It reduced the time to scale while providing flexibility to meet specific regional requirements and improving compliance management. With the AWS-based design, Lumos increased availability from 99.9% to 99.999%, resulting in approximately nine additional hours of uptime annually. The new design not only enhances operational efficiency but also supports robust disaster recovery and business continuity, meeting Amazon’s dynamic and global operational requirements.
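The "approximately nine additional hours" figure follows directly from the two availability percentages. The short calculation below is a worked check, assuming a 365-day year; the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability_percent: float) -> float:
    """Expected downtime per year for a given availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


downtime_before = annual_downtime_hours(99.9)    # about 8.76 hours
downtime_after = annual_downtime_hours(99.999)   # about 0.0876 hours (~5 minutes)
uptime_gained = downtime_before - downtime_after  # about 8.67 hours
```

The gain of roughly 8.67 hours per year rounds to the nine hours quoted above.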

Exploring Lumos Architecture

Lumos consists of two core components: tokenization and transmission.

Tokenization is the process of converting sensitive data into tokens. For example, when customers enter their credit card numbers during online shopping, Lumos persists the encrypted sensitive data in Amazon DynamoDB and generates a token that is specific to the data. Lumos also offers a de-tokenization API, providing programmatic access exclusively to authorized users, allowing tokens to be reverted to their original data when necessary.

Eliminating human access to the data is a core principle of Lumos’s design. As a best practice, only a limited set of systems, such as Lumos Transmission, are granted access to the de-tokenization API. Lumos uses resource-based access control policies to determine who has access to de-tokenized data, based on business justification and role. Each action is logged, and traces are analyzed to provide alerts about possible malicious activities. This fine-grained control helps to ensure data security while facilitating data processing.
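The combination of role-based authorization and per-action logging can be sketched as follows. This is an illustration under stated assumptions, not Lumos’s code: the `ALLOWED` table, the `Caller` type, the role names, and the logger name are all hypothetical, and a plain dictionary stands in for the token store.

```python
import logging
from dataclasses import dataclass

audit_log = logging.getLogger("lumos_sketch.audit")  # hypothetical logger name

# Hypothetical policy: (caller role, data type) pairs allowed to de-tokenize.
ALLOWED = {("transmission-service", "payment_card")}


@dataclass(frozen=True)
class Caller:
    principal: str
    role: str


class AccessDenied(Exception):
    """Raised when the policy does not allow the request."""


def detokenize(caller: Caller, token: str, data_type: str, store: dict) -> str:
    allowed = (caller.role, data_type) in ALLOWED
    # Record every attempt, allowed or denied, but never the sensitive value.
    audit_log.info(
        "detokenize principal=%s role=%s type=%s token=%s allowed=%s",
        caller.principal, caller.role, data_type, token, allowed,
    )
    if not allowed:
        raise AccessDenied(f"{caller.role} may not de-tokenize {data_type}")
    return store[token]
```

Logging before the authorization decision is enforced means denied attempts leave the same audit trail as successful ones, which is what makes the log useful for detecting misuse.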

The transmission component focuses on securely transmitting messages that contain sensitive data to authorized external endpoints (for example, a payment partner). This component interfaces with the tokenization service to de-tokenize sensitive data before transmission, without exposing the data to the caller. For instance, when customers place an order on Amazon, Lumos de-tokenizes sensitive data and sends the encrypted data to payment partners through secure channels, helping to maintain the confidentiality of sensitive data during transmission.

Lumos prioritizes resource isolation by design, and the isolation is consistently maintained throughout data operations. Lumos creates isolation boundaries between the resources of data owners (tenants) and the actors who interact with the data on behalf of tenants (clients). Lumos uses an AWS multi-account architecture not only to enhance security, but also to provide additional benefits such as scaling resources while adhering to AWS service quotas. Lumos inherits the security provided by AWS serverless services. The security value of serverless on AWS includes the ability to offload a wide range of security responsibilities to AWS, such as ensuring the underlying infrastructure is secure from attacks, which allows the Lumos application team to concentrate more on securing their application code and data.

Lumos uses AWS Shield Advanced and AWS WAF along with Amazon API Gateway to provide a layered defense. Lumos uses a custom hardened container image in AWS Fargate, hosted within a VPC. AWS Fargate provides a managed serverless environment without the overhead of maintaining the resilience of the compute layer and the security of the operating system. Amazon DynamoDB global tables are used to persist the encrypted data across multiple AWS Regions and to meet data residency requirements. Based on requirements defined by the Amazon security team, Lumos uses AWS Key Management Service (AWS KMS), AWS Certificate Manager, and AWS Secrets Manager for key management, and uses AWS Payment Cryptography (with DUKPT, Derived Unique Key Per Transaction) for payment system integration.

Figure 1 shows the AWS architecture for Lumos.

Figure 1: Lumos AWS architecture

Lumos consists of a number of AWS services to process and protect sensitive data, keeping the data within a secure boundary. As shown in Figure 1, the workflow is as follows:

The client application makes a call to Lumos through an API. Lumos APIs are protected by a layered defense mechanism using AWS Shield, AWS WAF, Amazon API Gateway and other network security controls.
AWS Shield provides protection against Distributed Denial of Service (DDoS) attacks for AWS resources at the network layer (layer 3), transport layer (layer 4), and application layer (layer 7).
AWS WAF filters unwanted and malicious traffic based on a number of WAF rules designed to protect against application layer threats and allow legitimate requests.
API Gateway performs request validation and authorization, and protects the service by using throttling and burst limit controls.
The requests are then directed to a private AWS Network Load Balancer within an Amazon VPC and forwarded to an application running on AWS Fargate. The application makes downstream calls to several AWS services to facilitate tokenization, detokenization, and other service operations.
For the tokenization operation, only the token and non-sensitive data are returned to the client application.
For a detokenization request, only the trusted Transmission service is authorized to perform the operation. The Transmission client provides a token as input and receives encrypted data to communicate with partners.
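The throttling and burst limits in the API Gateway step can be pictured with a token bucket, which is how AWS documentation describes API Gateway throttling. The sketch below is a generic, simplified token bucket and not API Gateway’s implementation; the class name and the explicit `now` parameter (used here so the behavior is deterministic) are choices made for illustration.

```python
class TokenBucket:
    """Simplified token bucket: `rate` requests/second, up to `burst` at once."""

    def __init__(self, rate: float, burst: int, now: float = 0.0) -> None:
        self.rate = rate            # tokens added per second (steady-state limit)
        self.burst = burst          # bucket capacity (burst limit)
        self.tokens = float(burst)  # start with a full bucket
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at the burst capacity.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # the caller would be throttled


bucket = TokenBucket(rate=2, burst=3)
burst_results = [bucket.allow(now=0.0) for _ in range(4)]
```

With these settings, three simultaneous requests pass and the fourth is throttled until the bucket refills at two tokens per second.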

Resilience and Observability

For high availability, AWS Fargate is deployed across multiple Availability Zones and scales up and down to handle variations in traffic patterns in a cost-effective manner. Lumos utilizes AWS Fargate for primary compute due to its granular control over the environment, allowing customization of container configurations, network settings, and resource allocation. This setup enables the use of secure Docker images (via Amazon Elastic Container Registry), the implementation of a custom intrusion detection system (IDS), and restricted access to the main application process. Additionally, AWS Fargate supports long-running processes, facilitating batch processing for Lumos Tokenization.

The application logs all actions and streams the logs to Amazon OpenSearch Service by using Amazon Kinesis, making every API action or change available for detecting malicious activity. As of the publication of this post, Lumos tokenization hosts more than 400 detections (including many from AWS Shield and Amazon GuardDuty) to provide defense in depth against malicious activity. It has automated detections built through analysis and observations performed on Amazon CloudWatch, Amazon Kinesis, and Amazon OpenSearch Service.
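To give a sense of what a log-based detection might look like, the sketch below flags principals that issue an unusually high number of de-tokenize calls within a sliding time window. The rule, the event shape `(timestamp, principal, action)`, and the thresholds are hypothetical; the post does not describe the individual detections Lumos runs.

```python
from collections import defaultdict, deque


def find_bursts(events, max_calls: int, window_seconds: float) -> set:
    """Return principals with more than max_calls detokenize events in a window."""
    recent = defaultdict(deque)  # principal -> timestamps inside the window
    flagged = set()
    for ts, principal, action in sorted(events):
        if action != "detokenize":
            continue
        timestamps = recent[principal]
        timestamps.append(ts)
        # Drop timestamps that have fallen out of the sliding window.
        while timestamps and ts - timestamps[0] > window_seconds:
            timestamps.popleft()
        if len(timestamps) > max_calls:
            flagged.add(principal)
    return flagged


sample_events = [
    (0, "svc-a", "detokenize"),
    (1, "svc-a", "detokenize"),
    (2, "svc-a", "detokenize"),
    (0, "svc-b", "detokenize"),
    (100, "svc-b", "detokenize"),
]
suspicious = find_bursts(sample_events, max_calls=2, window_seconds=10)
```

In a streaming setup the same logic would run incrementally per event rather than over a sorted batch.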

Ease of region expansion

Lumos was initially launched in a single AWS Region and rapidly expanded to two additional Regions within a few months, showcasing the flexibility inherent in AWS’s multi-Region backbone. Using services such as Amazon DynamoDB global tables and AWS KMS multi-Region keys, Lumos seamlessly extended its operations across Regions. Additionally, AWS CloudFormation and the AWS Cloud Development Kit streamlined infrastructure management, enabling Lumos to replicate its environment across Regions with minimal effort.

This AWS architecture provided Amazon with significant cost savings across multiple areas, including security, infrastructure management, scaling, database management, and data recovery. By leveraging AWS services, the architecture freed up the bandwidth of development and database administration teams, allowing them to be redirected toward addressing new business needs and driving innovation. As a result, Amazon was able to reallocate resources more efficiently, fostering greater agility and responsiveness to evolving business requirements.

Amazon leverages Lumos as a centralized tokenization platform across multiple business verticals, thanks to its robust multi-tenant architecture. Lumos’s design allows it to horizontally scale, dynamically provisioning storage and secret management resources for each business vertical or client. This ensures effective data isolation and uniform security standards across all features, regardless of the scale or specific needs of the vertical. The ability to quickly launch new business initiatives with consistent security protocols also accelerates time-to-market. Additionally, Lumos reduces redundant efforts, leading to significant cost savings and operational efficiency.

Security patterns

Let’s look at some of the security design patterns in Lumos that help it to process sensitive data securely.

Allow only programmatic access to production resources
To reduce the risk of unauthorized access, Lumos allows only programmatic access to production resources. For troubleshooting purposes, Lumos provides detailed logs that are accessible to people for debugging, and tools are in place to detect sensitive data in logs. In case of emergency (as outlined in our standard operating procedures), Lumos offers a break-glass mechanism, enabling human workers to access limited resources in emergency situations, while maintaining least privilege access. The permission boundary follows these policies:

All actions are disallowed by default
All permissions are strictly scoped and time-bound
Changing the permissions workflow is configurable and requires approval from multiple parties
High-privilege operator access goes through multiple layers of defense for enhanced protection
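The first three policies above (default deny, scoped and time-bound permissions, multi-party approval) can be sketched as a single check. This is an illustrative model only: the `Grant` type, the `REQUIRED_APPROVERS` threshold, and the rule that operators cannot approve their own access are assumptions, not details from Lumos.

```python
from dataclasses import dataclass

REQUIRED_APPROVERS = 2  # hypothetical multi-party approval threshold


@dataclass(frozen=True)
class Grant:
    operator: str
    action: str
    resource: str
    expires_at: float     # time-bound: the grant is invalid at or after this time
    approvers: frozenset  # identities that approved the grant


def is_allowed(grants, operator: str, action: str, resource: str, now: float) -> bool:
    for grant in grants:
        scoped = (grant.operator, grant.action, grant.resource) == (operator, action, resource)
        unexpired = now < grant.expires_at
        # Assumption: an operator's own approval does not count.
        approved = len(grant.approvers - {operator}) >= REQUIRED_APPROVERS
        if scoped and unexpired and approved:
            return True
    return False  # default deny: no matching grant means no access


grant = Grant("alice", "logs:Read", "queue-1", 100.0, frozenset({"bob", "carol"}))
```

Any request that does not match an unexpired, sufficiently approved grant for exactly that operator, action, and resource falls through to the final `return False`.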

The Lumos permission boundary uses attribute-based access control (ABAC), matching AWS tags against user role properties to allow access to AWS resources. At the organization level in AWS Organizations, Lumos has service control policies (SCPs) that define a boundary for each AWS Identity and Access Management (IAM) role, allowing only specific services and specific actions.
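A common way to express tag matching in IAM is a condition that compares a resource tag with the caller’s principal tag. The policy document below is a sketch of that pattern built as a Python dictionary; the tag key `team` and the choice of AWS Secrets Manager as the target service are assumptions for illustration, not Lumos’s actual policy.

```python
import json

# Hypothetical ABAC policy: allow reading a secret only when the secret's
# "team" tag equals the "team" tag on the calling principal.
abac_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "secretsmanager:GetSecretValue",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/team": "${aws:PrincipalTag/team}"
                }
            },
        }
    ],
}

abac_policy_json = json.dumps(abac_policy, indent=2)
```

Because access follows the tags rather than a list of resource names, new resources become accessible to the right roles as soon as they are tagged, without a policy change.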

Isolation Boundaries
To reduce the impact of privilege escalation, Lumos uses a technique called compartmentalization, which separates components along trust boundaries. To fine-tune permissions, Lumos has team-based and role-based access control. The layered approach starts at the organization level to deny all by default, then creates permission boundaries for each team. The Lumos team continuously baselines permissions based on usage, until absolute least privilege is reached. The layered approach allows us to define guardrails and detective controls for each layer, and to make changes independently to further improve security posture. If privilege escalation happens within one of the layers, the impact is limited to the permissions within that layer.
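One way to picture why an escalation in one layer stays contained is to model each layer as an allow-list and take the intersection. The sketch below is a simplification that ignores explicit deny statements and resource-based policies, and the layer contents are made-up examples.

```python
def effective_permissions(*layers: set) -> set:
    """Actions permitted only when every layer allows them (denies not modeled)."""
    if not layers:
        return set()  # default deny when nothing is granted
    result = set(layers[0])
    for layer in layers[1:]:
        result &= layer
    return result


# Hypothetical layers, from broadest to narrowest.
organization_scp = {"kms:Decrypt", "dynamodb:GetItem", "dynamodb:PutItem", "s3:GetObject"}
team_boundary = {"kms:Decrypt", "dynamodb:GetItem", "dynamodb:PutItem"}
role_policy = {"dynamodb:GetItem", "iam:CreateUser"}

allowed_actions = effective_permissions(organization_scp, team_boundary, role_policy)
```

Here `iam:CreateUser` appears in the role policy but is not usable, because the outer layers never allow it; widening one layer cannot grant more than the other layers permit.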

Security Governance
Security governance is a continuous process, because the threat landscape changes as systems and applications evolve. Lumos has mechanisms in place to continuously look for new threats and implement mitigation steps. Security risks can arise from outdated assumptions, so Lumos uses different frameworks, such as MITRE ATT&CK, attack trees, and STRIDE, to do threat modeling and identify threat vectors and corresponding security controls. To verify that security controls work as expected every time, Lumos performs regular automated tests to check the state of the security controls. The tests run in our production system, capture the results in a dashboard, and generate alarms in the case of failures. These tests produce logs that are analyzed and reported in AWS Security Hub.

Lumos uses Security Hub as a consolidated security finding dashboard, and we have built automation to remediate findings. The Lumos team conducts game day simulations to identify new vulnerabilities and threats, enhancing our incident response playbooks.

Lumos has a number of detective controls that use AWS Config, Amazon Macie, AWS Trusted Advisor, and Amazon GuardDuty to check for deviations from the desired configuration, the presence of sensitive data in undesired locations, and unwanted network access. Lumos has security measures implemented to reduce the risk of data exposure, such as enforcing that VPCs with critical components don’t have an internet gateway attached, and that external IAM principals can’t access resources within an AWS account. In addition, Lumos has mitigations for known risks, such as denying public Amazon Simple Storage Service (Amazon S3) access for the entire organization by using SCPs.

Data perimeter

To make sure that trusted entities access only trusted resources over an expected network, Lumos uses IAM condition keys. This helps to ensure that approved actions can only be performed by trusted entities on allowed resources. IAM Access Analyzer and Network Access Analyzer provide insights that are used to set fine-grained access permissions, defining what’s required to perform intended actions on a resource. Lumos defines the data perimeter using multiple policy sets, including: a) an IAM policy with condition keys, b) a resource policy with condition keys, c) VPC endpoint policies, among others. This helps ensure that access validation happens at each layer within the data perimeter.

To achieve network isolation, Lumos runs within a VPC with no internet or NAT gateway attached to it. All connections to AWS services are established using Amazon VPC endpoints. The Amazon VPC endpoint policies control the access to resources within the VPC based on condition keys. AWS CloudTrail logs provide insights into what actions happened, who took those actions, and when. Lumos uses this information to enforce detective controls and alert on anomalies. Lumos also uses Amazon Macie to detect sensitive data in logs and take appropriate action based on findings.
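The policy sets described in this section can be sketched with standard IAM global condition keys. The two documents below follow a commonly used data perimeter pattern and are not Lumos’s actual policies; the organization ID, VPC endpoint ID, and bucket name are placeholders.

```python
import json

ORG_ID = "o-exampleorgid"            # placeholder organization ID
VPCE_ID = "vpce-0123456789abcdef0"   # placeholder VPC endpoint ID
BUCKET_ARN = "arn:aws:s3:::example-bucket"  # placeholder resource

# VPC endpoint policy: only principals and resources that belong to the
# organization may be used through this endpoint.
vpc_endpoint_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": "*",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:PrincipalOrgID": ORG_ID,
                    "aws:ResourceOrgID": ORG_ID,
                }
            },
        }
    ],
}

# Resource policy: deny any request that does not arrive through the endpoint.
resource_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [BUCKET_ARN, BUCKET_ARN + "/*"],
            "Condition": {"StringNotEquals": {"aws:SourceVpce": VPCE_ID}},
        }
    ],
}

endpoint_policy_json = json.dumps(vpc_endpoint_policy, indent=2)
```

The endpoint policy constrains who and what can be reached from inside the network, while the resource policy constrains where requests may come from, so a request has to satisfy both checks.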

Conclusion

With the help of AWS services, the Amazon Lumos team built a large-scale, cost-effective, and highly secure tokenization solution. Centralized data protection provides a powerful tool to protect sensitive information and minimize compliance scope.

To start creating your own AWS serverless solution, we recommend that you read the AWS blog post Building a serverless tokenization solution to mask sensitive data.

To learn more about Lumos, watch the AWS re:Invent 2021 – Lumos video.
