This Guidance demonstrates how to customize and apply normalization rules to data as it arrives and prepare it for AWS Entity Resolution. Normalization standardizes the input data through tasks such as removing extra spaces and special characters or converting text to lowercase. This Guidance provides AWS Cloud Development Kit (AWS CDK) code that demonstrates how to read the input data from an Amazon Simple Storage Service (Amazon S3) bucket, apply the normalization rules, and prepare the resulting dataset for use in AWS Entity Resolution.
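As a minimal sketch of what such normalization rules can look like, the snippet below trims whitespace, lowercases text, strips special characters, and collapses repeated spaces. The function and field names are illustrative assumptions, not taken from the Guidance's CDK code:

```python
import re

def normalize_value(value: str) -> str:
    """Apply basic normalization: trim, lowercase, drop special
    characters, and collapse repeated whitespace."""
    value = value.strip().lower()
    value = re.sub(r"[^a-z0-9\s]", "", value)  # remove special characters
    value = re.sub(r"\s+", " ", value)         # collapse extra spaces
    return value.strip()

def normalize_record(record: dict) -> dict:
    """Normalize every string field in a record; leave other types as-is."""
    return {
        key: normalize_value(val) if isinstance(val, str) else val
        for key, val in record.items()
    }

# Example: a raw record with inconsistent casing and spacing
raw = {"name": "  Jane   DOE!! ", "city": " New  York. "}
clean = normalize_record(raw)  # {"name": "jane doe", "city": "new york"}
```

In the actual Guidance, rules like these would run inside an AWS Glue job over records read from the S3 input bucket rather than over in-memory dictionaries.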


Architecture Diagram



Well-Architected Pillars

The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.

The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.

  • Every service in this Guidance publishes metrics to Amazon CloudWatch, where you can configure dashboards and alarms. Amazon CloudWatch Events delivers a near real-time stream of system events that describe changes in resources. Use alarms or Amazon Simple Notification Service (Amazon SNS) to notify incident management systems of events and escalate based on severity.

    Read the Operational Excellence whitepaper 
  • The data at rest in the S3 bucket is encrypted. AWS Glue supports using resource policies to control access to AWS Glue Data Catalog resources. These resources include databases, tables, connections, and user-defined functions, along with the AWS Glue Data Catalog APIs that interact with these resources. You can turn on encryption of objects in the AWS Glue Data Catalog and encrypt connection passwords using AWS Key Management Service (AWS KMS).

    Read the Security whitepaper 
  • This Guidance relies on throttling and retries for reliability. AWS Glue is subject to Region-specific service quotas that may affect reliability. You can contact AWS Support to request a quota increase based on your workload. Additionally, you can use Step Functions to configure retries, backoff rates, maximum attempts, intervals, and timeouts for any failed AWS Glue job.

    Read the Reliability whitepaper 
  • You can experiment and test each Guidance component, enabling you to perform comparative testing against varying load levels, configurations, and services. For example, auto scaling is available for AWS Glue extract, transform, load (ETL) jobs. With auto scaling enabled, AWS Glue automatically adds and removes workers from the cluster depending on the parallelism at each stage of the job run.

    Read the Performance Efficiency whitepaper 
  • The services in this Guidance are designed to help you optimize costs based on your workload. When AWS Glue performs data transformations, you pay for infrastructure only while processing is running. For the AWS Glue Data Catalog, you pay a monthly fee for storing and accessing the metadata. With Amazon S3, you pay for storing objects in buckets. With the Amazon EventBridge free tier, you can schedule rules to initiate data processing using the Step Functions workflow, and you will be charged based on the number of state transitions. 

    Read the Cost Optimization whitepaper 
  • Serverless services used in this Guidance, such as AWS Glue and Amazon S3, automatically optimize resource utilization in response to demand. You can also use Amazon S3 lifecycle configuration to define policies that move objects to different storage classes based on access patterns. This frees up storage resources that would otherwise be consumed by infrequently accessed objects. 

    Read the Sustainability whitepaper 
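The retry behavior described under the Reliability pillar can be sketched as an Amazon States Language fragment for a Step Functions state that runs a Glue job synchronously. The state name, job name, and retry values below are illustrative assumptions, not the Guidance's actual configuration:

```python
import json

# Hypothetical Glue job task with retry/backoff settings.
# "NormalizeData" and "normalize-input-data" are placeholder names.
glue_task_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::glue:startJobRun.sync",
    "Parameters": {"JobName": "normalize-input-data"},
    "Retry": [
        {
            # Retry on concurrency-limit errors and generic task failures
            "ErrorEquals": [
                "Glue.ConcurrentRunsExceededException",
                "States.TaskFailed",
            ],
            "IntervalSeconds": 30,   # wait before the first retry
            "MaxAttempts": 3,        # give up after three retries
            "BackoffRate": 2.0,      # double the interval each attempt
        }
    ],
    "TimeoutSeconds": 3600,          # fail the state after one hour
    "End": True,
}

state_machine_definition = json.dumps(
    {"StartAt": "NormalizeData", "States": {"NormalizeData": glue_task_state}}
)
```

A definition like this could be passed to Step Functions directly or generated from CDK; tuning `IntervalSeconds`, `MaxAttempts`, and `BackoffRate` to your Glue quota and workload is the key design choice.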

Implementation Resources

A detailed guide is provided for experimenting with this Guidance in your AWS account. It walks through each stage, including deployment, usage, and cleanup.

The sample code is a starting point. It is industry validated, prescriptive but not definitive, and a peek under the hood to help you begin.



The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.

References to third-party services or organizations in this Guidance do not imply an endorsement, sponsorship, or affiliation between Amazon or AWS and the third party. Guidance from AWS is a technical starting point, and you can customize your integration with third-party services when you deploy the architecture.
