Guidance for Customizing Normalization Library for AWS Entity Resolution
Overview
How it works
The architecture diagram for this Guidance shows the key components and how they interact, and walks through the architecture's structure and functionality step by step.
Well-Architected Pillars
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
Operational Excellence
Every service in this Guidance publishes metrics to Amazon CloudWatch, where you can configure dashboards and alarms. Amazon EventBridge (formerly Amazon CloudWatch Events) delivers a near real-time stream of system events that describe changes in resources. Use CloudWatch alarms or Amazon Simple Notification Service (Amazon SNS) to notify incident management systems of events and escalate based on severity.
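As a minimal sketch of the alarm-and-notify pattern described above, the following Python builds the parameters for a CloudWatch alarm that fires when a Step Functions execution fails and sends the alarm to an SNS topic. The alarm name, ARNs, period, and threshold are hypothetical placeholders, not values taken from this Guidance; the metric shown (ExecutionsFailed in the AWS/States namespace) is one reasonable choice among many.

```python
import json


def build_failure_alarm(state_machine_arn: str, sns_topic_arn: str) -> dict:
    """Return keyword arguments shaped for the CloudWatch PutMetricAlarm API."""
    return {
        "AlarmName": "example-workflow-failed",  # hypothetical name
        "Namespace": "AWS/States",               # Step Functions metrics namespace
        "MetricName": "ExecutionsFailed",
        "Dimensions": [{"Name": "StateMachineArn", "Value": state_machine_arn}],
        "Statistic": "Sum",
        "Period": 300,                           # assumed: evaluate in 5-minute windows
        "EvaluationPeriods": 1,
        "Threshold": 1,                          # assumed: alarm on the first failure
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",      # no executions means no alarm
        "AlarmActions": [sns_topic_arn],         # SNS topic that notifies responders
    }


# Placeholder ARNs for illustration only.
alarm = build_failure_alarm(
    "arn:aws:states:us-east-1:111122223333:stateMachine:example-workflow",
    "arn:aws:sns:us-east-1:111122223333:example-incident-topic",
)
print(json.dumps(alarm, indent=2))
```

If boto3 is installed and credentials are configured, these parameters could be passed as boto3.client("cloudwatch").put_metric_alarm(**alarm).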
Security
The data at rest in the S3 bucket is encrypted. AWS Glue supports using resource policies to control access to AWS Glue Data Catalog resources. These resources include databases, tables, connections, and user-defined functions, along with the AWS Glue Data Catalog APIs that interact with these resources. You can turn on encryption of objects in the AWS Glue Data Catalog and encrypt connection passwords using AWS Key Management Service (AWS KMS).
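The Data Catalog encryption options mentioned above can be sketched as a request payload. The example below builds the settings structure accepted by the AWS Glue PutDataCatalogEncryptionSettings API; the KMS key ARN is a hypothetical placeholder, and using the same key for both metadata and connection passwords is an assumption made for brevity.

```python
import json


def build_catalog_encryption_settings(kms_key_arn: str) -> dict:
    """Return keyword arguments shaped for Glue's PutDataCatalogEncryptionSettings."""
    return {
        "DataCatalogEncryptionSettings": {
            # Encrypt Data Catalog objects (databases, tables, and so on) with AWS KMS.
            "EncryptionAtRest": {
                "CatalogEncryptionMode": "SSE-KMS",
                "SseAwsKmsKeyId": kms_key_arn,
            },
            # Encrypt connection passwords and return them encrypted in API responses.
            "ConnectionPasswordEncryption": {
                "ReturnConnectionPasswordEncrypted": True,
                "AwsKmsKeyId": kms_key_arn,
            },
        }
    }


# Placeholder key ARN for illustration only.
settings = build_catalog_encryption_settings(
    "arn:aws:kms:us-east-1:111122223333:key/example-key-id"
)
print(json.dumps(settings, indent=2))
```

With boto3 available, the payload could be applied as boto3.client("glue").put_data_catalog_encryption_settings(**settings).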
Reliability
Workloads built on this Guidance should implement throttling and retries. AWS Glue is subject to Region-specific service quotas that may affect reliability. You can contact AWS Support to request a quota increase based on your workload. Additionally, you can use Step Functions to set retries, backoff rates, maximum attempts, intervals, and timeouts for any failed AWS Glue job.
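To make the retry settings concrete, here is a sketch of a Step Functions Task state, written in Amazon States Language and generated from Python, that runs an AWS Glue job and retries it on failure. The job name and the specific interval, attempt, backoff, and timeout values are illustrative assumptions, not values from this Guidance. The helper retry_delays shows the waits that result from the interval and backoff rate.

```python
import json


def glue_task_state(job_name: str) -> dict:
    """Return an Amazon States Language Task state that runs a Glue job with retries."""
    return {
        "Type": "Task",
        # Service integration that starts the job and waits for it to finish.
        "Resource": "arn:aws:states:::glue:startJobRun.sync",
        "Parameters": {"JobName": job_name},
        "TimeoutSeconds": 3600,          # assumed: fail the state after one hour
        "Retry": [
            {
                "ErrorEquals": ["States.ALL"],
                "IntervalSeconds": 30,   # wait before the first retry
                "MaxAttempts": 3,        # number of retries after the first failure
                "BackoffRate": 2.0,      # multiply the wait by this on each retry
            }
        ],
        "End": True,
    }


def retry_delays(interval_seconds: float, backoff_rate: float, max_attempts: int) -> list:
    """Return the wait, in seconds, before each retry attempt."""
    return [interval_seconds * backoff_rate ** attempt for attempt in range(max_attempts)]


state = glue_task_state("example-normalization-job")  # hypothetical job name
policy = state["Retry"][0]
delays = retry_delays(policy["IntervalSeconds"], policy["BackoffRate"], policy["MaxAttempts"])

print(json.dumps(state, indent=2))
print(delays)  # [30.0, 60.0, 120.0]
```

An exponential backoff like this spaces retries further apart each time, which gives a throttled AWS Glue API or a busy job queue time to recover.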
Performance Efficiency
You can experiment and test each Guidance component, enabling you to perform comparative testing against varying load levels, configurations, and services. For example, auto scaling is available for AWS Glue extract, transform, load (ETL) jobs. With auto scaling enabled, AWS Glue automatically adds and removes workers from the cluster depending on the parallelism at each stage of the job run.
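As an illustration of the auto scaling option, the sketch below builds a job definition shaped for the AWS Glue CreateJob API with the --enable-auto-scaling job parameter set. The job name, role ARN, script location, Glue version, worker type, and worker count are hypothetical placeholders. With auto scaling enabled, the worker count acts as an upper bound rather than a fixed size.

```python
import json


def build_autoscaling_job(name: str, role_arn: str, script_location: str, max_workers: int) -> dict:
    """Return keyword arguments shaped for Glue's CreateJob API with auto scaling on."""
    if max_workers < 2:
        raise ValueError("max_workers must be at least 2")
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",                # Spark ETL job
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",                 # assumed; auto scaling needs Glue 3.0 or later
        "WorkerType": "G.1X",                 # assumed worker type
        "NumberOfWorkers": max_workers,       # upper bound when auto scaling is enabled
        "DefaultArguments": {"--enable-auto-scaling": "true"},
    }


# Placeholder values for illustration only.
job = build_autoscaling_job(
    name="example-normalization-job",
    role_arn="arn:aws:iam::111122223333:role/example-glue-role",
    script_location="s3://example-bucket/scripts/normalize.py",
    max_workers=10,
)
print(json.dumps(job, indent=2))
```

With boto3 available, the definition could be submitted as boto3.client("glue").create_job(**job). Running the same job with different max_workers values is one way to do the comparative testing described above.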
Cost Optimization
The services in this Guidance are designed to help you optimize costs based on your workload. When AWS Glue performs data transformations, you pay for infrastructure only while the processing is running. For the AWS Glue Data Catalog, you pay a monthly fee for storing and accessing the metadata. With Amazon S3, you pay for storing objects in buckets. With the Amazon EventBridge free tier, you can schedule rules that initiate data processing through the Step Functions workflow; Step Functions charges are based on the number of state transitions.
Sustainability
Serverless services used in this Guidance, such as AWS Glue and Amazon S3, automatically optimize resource utilization in response to demand. You can also use an Amazon S3 lifecycle configuration to define policies that move objects to different storage classes based on access patterns. This helps free up storage resources that would otherwise be used unnecessarily to store infrequently accessed objects.
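A lifecycle policy of the kind described above might look like the following sketch, which builds the configuration accepted by the Amazon S3 PutBucketLifecycleConfiguration API. The rule ID, prefix, and day thresholds are assumptions chosen for illustration; pick values that match how often your data is actually read.

```python
import json


def build_lifecycle_configuration(prefix: str, ia_after_days: int, glacier_after_days: int) -> dict:
    """Return an S3 lifecycle configuration that tiers objects as they age."""
    if ia_after_days >= glacier_after_days:
        raise ValueError("ia_after_days must be smaller than glacier_after_days")
    return {
        "Rules": [
            {
                "ID": "tier-infrequently-accessed-data",  # hypothetical rule name
                "Filter": {"Prefix": prefix},             # apply only to this key prefix
                "Status": "Enabled",
                "Transitions": [
                    # Move to Standard-Infrequent Access first, then to Glacier.
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


# Placeholder prefix and thresholds for illustration only.
lifecycle = build_lifecycle_configuration("normalized-output/", 30, 90)
print(json.dumps(lifecycle, indent=2))
```

With boto3 available, the configuration could be applied as boto3.client("s3").put_bucket_lifecycle_configuration(Bucket="example-bucket", LifecycleConfiguration=lifecycle), where the bucket name is again a placeholder.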
Deploy with confidence
We'll walk you through it
Dive deep into the implementation guide for additional customization options and service configurations that you can tailor to your specific needs.
Let's make it happen
Ready to deploy? Review the sample code on GitHub for detailed deployment instructions to deploy as-is or customize to fit your needs.