Guidance for Deduplicating Syndicated Data on AWS
Overview
How it works
This architecture diagram shows how to obtain an aggregated view of similar tables across multiple AWS accounts within an AWS Organization.
Deploy with confidence
Ready to deploy? Review the sample code on GitHub, which includes detailed deployment instructions. You can deploy it as-is or customize it to fit your needs.
Well-Architected Pillars
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
This Guidance is designed to be fully serverless, which reduces the operational overhead and complexity of maintaining infrastructure. AWS Lambda, Amazon OpenSearch Service, and the other managed services it uses scale automatically, without manual intervention. This Guidance also outlines a systematic approach to handling data updates and changes: a user enrichment building block supports automated, scheduled, and event-driven updates.
This Guidance restricts access to the data stored in Amazon OpenSearch Service by granting permissions only to allowlisted users, roles, or other principals. It also defines the access control and authorization mechanisms for both administrative and remote users, specifying the permissions each user role needs to access and interact with the components of the system. For example, administrative users may be granted full control over the configuration and management of the Guidance, while remote users could be limited to read-only access or specific data querying capabilities.
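The allowlisting described above could be expressed as a resource-based policy on the search domain. The following is a minimal sketch, not the policy this Guidance deploys: the function name, the two role tiers, the example ARNs, and the exact action sets are assumptions made for illustration. In particular, "read-only" is approximated here as HTTP GET and HEAD; search requests sent with POST would need additional permissions.

```python
import json

# Assumed action sets; the Guidance does not specify the exact permissions.
READ_ONLY_ACTIONS = ["es:ESHttpGet", "es:ESHttpHead"]
ADMIN_ACTIONS = ["es:ESHttp*"]


def build_domain_policy(domain_arn, admin_arns, reader_arns):
    """Build a policy document that allows only the allowlisted principals.

    Principals that appear in neither list get no statement, so they are
    denied by default.
    """
    statements = []
    if admin_arns:
        statements.append({
            "Sid": "AdminFullHttpAccess",
            "Effect": "Allow",
            "Principal": {"AWS": sorted(admin_arns)},
            "Action": ADMIN_ACTIONS,
            "Resource": f"{domain_arn}/*",
        })
    if reader_arns:
        statements.append({
            "Sid": "RemoteReadOnlyAccess",
            "Effect": "Allow",
            "Principal": {"AWS": sorted(reader_arns)},
            "Action": READ_ONLY_ACTIONS,
            "Resource": f"{domain_arn}/*",
        })
    return {"Version": "2012-10-17", "Statement": statements}


if __name__ == "__main__":
    # Hypothetical account IDs, role names, and domain name.
    policy = build_domain_policy(
        "arn:aws:es:us-east-1:111122223333:domain/example-domain",
        admin_arns=["arn:aws:iam::111122223333:role/ExampleAdmin"],
        reader_arns=["arn:aws:iam::444455556666:role/ExampleRemoteReader"],
    )
    print(json.dumps(policy, indent=2))
```

Keeping the two tiers in separate statements makes it straightforward to add or remove a remote account without touching the administrative grant.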
The serverless architecture of this Guidance, with its inherent capability to automatically scale resources and self-heal, promotes the overall reliability of the system. Additionally, the use of Amazon SQS to manage data updates and changes helps ensure message durability and delivery. Furthermore, this Guidance provides the ability to incrementally add new AWS accounts and Regions, which supports the scalability and fault tolerance of the overall system.
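To illustrate how queued updates can be processed without losing messages, here is a minimal sketch of a Lambda handler that consumes an Amazon SQS batch and reports only the failed messages back to the queue. It assumes a Lambda function is the consumer and that the event source mapping has `ReportBatchItemFailures` enabled; the `apply_update` helper and the `table` field in the message body are hypothetical and not taken from this Guidance.

```python
import json


def apply_update(update):
    """Hypothetical placeholder for the real work, such as re-indexing a table."""
    if "table" not in update:
        raise ValueError("update is missing the 'table' field")
    return update["table"]


def handler(event, context):
    """Process each SQS record; return the IDs of the records that failed.

    With partial batch responses enabled, only the listed messages become
    visible on the queue again for retry; the rest are deleted.
    """
    failures = []
    for record in event.get("Records", []):
        try:
            apply_update(json.loads(record["body"]))
        except (ValueError, KeyError, TypeError):
            # Malformed JSON raises a ValueError subclass; report it for retry.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


if __name__ == "__main__":
    sample_event = {
        "Records": [
            {"messageId": "m1", "body": json.dumps({"table": "orders"})},
            {"messageId": "m2", "body": "not json"},
        ]
    }
    print(handler(sample_event, None))
```

Reporting failures per message, rather than raising for the whole batch, avoids reprocessing updates that already succeeded. Messages that keep failing would typically be routed to a dead-letter queue configured on the source queue.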
The use of Amazon OpenSearch Service, a fully managed service, together with vector database capabilities, supports efficient query performance and data retrieval within this Guidance. This Guidance also uses K-Means clustering to group similar data tables, which can speed up similarity searches by narrowing the set of candidates to compare.
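The idea behind clustering before searching can be sketched in a few lines of plain Python. This is an illustration only, not the implementation used by this Guidance: the vectors stand in for hypothetical table embeddings, the centroids are initialized deterministically from the first `k` points, and a production system would more likely run clustering in a managed service and search in a vector index.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def k_means(points, k, iterations=20):
    """Lloyd's algorithm with deterministic initialization (first k points)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    # Recompute labels so they match the final centroids.
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


def search(query, points, centroids, labels):
    """Return the index of the closest point, scanning only the query's cluster."""
    cluster = nearest(query, centroids)
    candidates = [i for i, label in enumerate(labels) if label == cluster]
    if not candidates:  # fall back to a full scan if the cluster is empty
        candidates = range(len(points))
    return min(candidates, key=lambda i: squared_distance(query, points[i]))


if __name__ == "__main__":
    # Hypothetical 2-D embeddings of four tables forming two obvious groups.
    embeddings = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    centroids, labels = k_means(embeddings, k=2)
    print(labels)
    print(search((9.0, 9.0), embeddings, centroids, labels))
```

The search compares the query against `k` centroids and then only against the members of one cluster, instead of against every table. The trade-off is that a true nearest neighbor lying just across a cluster boundary can be missed.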
The serverless architecture of this Guidance, combined with the use of managed services such as Lambda and Amazon SageMaker, helps optimize resource utilization and reduce the need for manual performance tuning.
The serverless architecture of this Guidance follows a pay-as-you-go pricing model, which can help reduce the overall cost of running the system because resources are consumed only when needed. Additionally, the use of managed services, such as Amazon OpenSearch Service and Amazon SageMaker, can help organizations avoid the overhead of managing and maintaining the underlying infrastructure.
Through right-sized, transient resources that avoid excess idling, this Guidance minimizes energy consumption and hardware waste. For example, rather than pre-provisioning servers that run continually even when unused, Lambda functions are invoked on demand, only when needed. Each function is individually configured with the amount of memory its task requires, and Lambda allocates CPU capacity in proportion to that memory, which avoids over-provisioning. By allocating compute capacity when workloads arrive and releasing it after use, Lambda avoids the resource waste of idle servers.