Guidance for Deduplicating Syndicated Data on AWS

Go to sample code

Overview

This Guidance shows how large enterprise customers can identify and manage duplicate datasets distributed across multiple AWS accounts. It helps users search for identical or highly similar data tables so that redundant data assets can be found. Procurement teams get a comprehensive, searchable data inventory, which helps them avoid purchasing the same dataset more than once. In this way, the Guidance helps organizations improve their data management practices and reduce costs by eliminating duplicated data.

How it works

This architecture diagram shows how to obtain an aggregated view of similar tables across multiple AWS accounts within an AWS Organization.

Deploy with confidence

Ready to deploy? Review the sample code on GitHub for detailed deployment instructions to deploy as-is or customize to fit your needs.

Well-Architected Pillars

The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.

This Guidance is designed to be fully serverless, which reduces the operational overhead and complexity of maintaining infrastructure. AWS Lambda, Amazon OpenSearch Service, and other managed services allow the system to scale automatically without manual intervention. This Guidance also outlines a systematic approach to handling data updates and changes, with a user enrichment building block that supports automated, scheduled, and event-driven updates.

Read the Operational Excellence whitepaper

This Guidance restricts access to the data stored in OpenSearch by granting permissions only to allowlisted users, roles, or principals. Furthermore, this Guidance defines the access control and authorization mechanisms for both administrative and remote users. This includes specifying the appropriate permissions and privileges required for different user roles to access and interact with the various components of the system. For example, administrative users may be granted full control over the configuration and management of the Guidance, while remote users could be limited to read-only access or specific data querying capabilities.
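As an illustration only, the allowlist idea above could be expressed as a resource-based access policy. The sketch below assumes a provisioned Amazon OpenSearch Service domain secured with an IAM-style policy; if the deployment uses a different access mechanism, the shape will differ. The function name, the ARNs, and the split between "admin" and "remote" principals are hypothetical and are not taken from the sample code.

```python
import json


def build_domain_access_policy(domain_arn, admin_arns, reader_arns):
    """Return an allowlist-style access policy as a plain dict.

    Hypothetical helper: only principals listed here are granted access.
    Admins get all HTTP methods; remote users get read-style methods only.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AdminFullAccess",
                "Effect": "Allow",
                "Principal": {"AWS": sorted(admin_arns)},
                "Action": "es:ESHttp*",
                "Resource": f"{domain_arn}/*",
            },
            {
                "Sid": "RemoteReadOnly",
                "Effect": "Allow",
                "Principal": {"AWS": sorted(reader_arns)},
                # GET/HEAD only; searches sent as POST would need a
                # different grant, which is left out of this sketch.
                "Action": ["es:ESHttpGet", "es:ESHttpHead"],
                "Resource": f"{domain_arn}/*",
            },
        ],
    }


# Example with placeholder account IDs and role names (assumptions).
policy = build_domain_access_policy(
    "arn:aws:es:us-east-1:111122223333:domain/example-domain",
    admin_arns=["arn:aws:iam::111122223333:role/ExampleAdminRole"],
    reader_arns=["arn:aws:iam::444455556666:role/ExampleRemoteReader"],
)
policy_json = json.dumps(policy, indent=2)
```

Keeping the policy as generated data rather than hand-edited JSON makes it easier to review exactly which principals are on the allowlist when accounts are added.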

Read the Security whitepaper

The serverless architecture of this Guidance, with its inherent capability to automatically scale resources and self-heal, promotes the overall reliability of the system. Additionally, the use of Amazon SQS to manage data updates and changes helps ensure message durability and delivery. Furthermore, this Guidance provides the ability to incrementally add new AWS accounts and Regions, which supports the scalability and fault tolerance of the overall system.
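To make the role of Amazon SQS more concrete, here is a minimal sketch of a Lambda handler that consumes update messages from a queue and reports only the failed messages back, so that successfully processed messages are not retried. It assumes the event source mapping has partial batch responses (`ReportBatchItemFailures`) enabled. The message format and the `apply_table_update` function are hypothetical stand-ins, not the Guidance's actual implementation.

```python
import json


def apply_table_update(update):
    """Hypothetical placeholder for re-indexing one table's metadata."""
    if "table_id" not in update:
        raise ValueError("update is missing table_id")
    return update["table_id"]


def handler(event, context=None):
    """Process an SQS batch and return the IDs of messages that failed.

    Messages listed under batchItemFailures are made visible on the
    queue again; all other messages in the batch are deleted.
    """
    failures = []
    for record in event.get("Records", []):
        try:
            apply_table_update(json.loads(record["body"]))
        except (ValueError, KeyError, TypeError):
            # json.JSONDecodeError is a subclass of ValueError.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


# Local usage example with a hand-built event (no AWS calls involved).
sample_event = {
    "Records": [
        {"messageId": "1", "body": json.dumps({"table_id": "sales_2023"})},
        {"messageId": "2", "body": "not valid json"},
    ]
}
result = handler(sample_event)
```

Because the queue retains any message that is not acknowledged, a malformed or temporarily unprocessable update is retried (or moved to a dead-letter queue, if one is configured) instead of being silently lost.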

Read the Reliability whitepaper

The use of Amazon OpenSearch Service, a fully managed service, together with vector databases, helps ensure efficient query performance and data retrieval within this Guidance. This Guidance also uses K-Means clustering to group similar data tables, which can improve the performance of similarity searches.
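The following is a small, self-contained sketch of how K-Means can group tables whose vector representations are close together. It is not the Guidance's implementation: the two-dimensional vectors and table names are made up for illustration, and the centroids are initialized from the first `k` vectors only to keep the example deterministic (a real deployment would more likely use a library or managed implementation with better initialization).

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans(vectors, k, iterations=20):
    """Plain K-Means; returns (cluster index per vector, centroids)."""
    centroids = [list(v) for v in vectors[:k]]  # deterministic init (sketch only)
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            new_centroids.append(mean(members) if members else centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return assignments, centroids


def group_names(names, assignments):
    """Collect table names by cluster index."""
    groups = {}
    for name, cluster in zip(names, assignments):
        groups.setdefault(cluster, []).append(name)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical table embeddings (assumed values, not real data).
tables = {
    "sales_2023_a": [0.90, 0.10],
    "weather": [0.10, 0.95],
    "sales_2023_b": [0.88, 0.12],
    "weather_copy": [0.12, 0.93],
}
names = list(tables)
assignments, centroids = kmeans([tables[n] for n in names], k=2)
groups = group_names(names, assignments)
```

Once tables are grouped this way, a similarity search for a new table only needs to compare it against the members of its nearest cluster rather than against every table in the inventory.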

The serverless architecture of this Guidance, combined with the use of managed services such as Lambda and Amazon SageMaker, helps optimize resource utilization and reduce the need for manual performance tuning.

Read the Performance Efficiency whitepaper

The serverless architecture of this Guidance, with its pay-as-you-go pricing model, can help reduce the overall cost of running the system because resources are consumed only when needed. Additionally, managed services such as Amazon OpenSearch Service and Amazon SageMaker can help organizations avoid the overhead of managing and maintaining the underlying infrastructure.

Read the Cost Optimization whitepaper

Through right-sized, transient resources that avoid excess idling, this Guidance minimizes energy consumption and hardware waste. For example, rather than pre-provisioning servers that run continually even when unused, Lambda functions are invoked on demand, only when needed. Each function is individually configured with the amount of memory required to complete its designated task (Lambda allocates CPU capacity in proportion to the configured memory), which avoids over-provisioning. By allocating compute power when workloads arrive and releasing it after use, Lambda eliminates the resource waste of idle servers.

Read the Sustainability whitepaper

Disclaimer

The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.
