Guidance for Customer Data Platform on AWS

This Guidance shows how you can build a well-architected customer data platform with data from a broad range of sources, including contact centers, email, web and mobile entries, point of sale (POS) transactions, and customer relationship management (CRM) systems. It explores each stage of building the platform, starting with the extraction of batched and real-time data streams. Next, this Guidance shows how to cleanse, enrich, and process the data to create a unified customer record across all data sources. Finally, the processed data is ready for analysis and collaboration, all in a restricted, secure environment where you set the controls. The data can be used to build more personalized customer experiences and to enhance the monetization of your marketing campaigns.

Architecture Diagram

Guidance Architecture Diagram for Customer Data Platform on AWS

Step 1
Data sources for building a customer 360 profile include website and mobile application events, advertising events, social media events, and transactional data from multiple systems of record and third-party data sets. This data is available for consumption in multiple formats and through multiple protocols, for example, software as a service (SaaS) applications, batch files, cloud data shares, databases, and data marketplaces.

Step 2
Near real-time data ingestion is achieved through Amazon Kinesis, Amazon Managed Streaming for Apache Kafka (Amazon MSK), and Amazon API Gateway. Batch data ingestion uses AWS Transfer Family, AWS Database Migration Service (AWS DMS), and Amazon AppFlow. The Amazon AppFlow Custom Connector Software Development Kit (SDK) is used to build custom connectors that pull data from system-of-record APIs. AWS Data Exchange subscriptions provide access to third-party data in multiple modes.

Step 3
In near real-time data stream processing, the ingestion services collect data, apply near real-time data transformations using AWS Lambda, and store the data in Amazon DynamoDB. A DynamoDB stream, processed by Lambda, propagates the data downstream in near real time.
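
The transformation step can be sketched as a Lambda handler that decodes Kinesis records and emits DynamoDB-ready items. This is a minimal sketch, not the Guidance's own code: the field names (`customer_id`, `email`, `event_type`, `event_time`) and the injected `put_item` callable are assumptions made for illustration, and only the `Records[].kinesis.data` base64 envelope reflects the documented Kinesis event shape.

```python
import base64
import json
from typing import Any, Callable, Dict


def normalize_event(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Light transformation: stringify the key, trim and lowercase the email."""
    return {
        "customer_id": str(raw["customer_id"]),          # hypothetical partition key
        "email": str(raw.get("email", "")).strip().lower(),
        "event_type": raw.get("event_type", "unknown"),
        "event_time": raw.get("event_time"),
    }


def handler(event: Dict[str, Any], context: Any,
            put_item: Callable[[Dict[str, Any]], None] = print) -> Dict[str, int]:
    """Decode each Kinesis record, transform it, and pass the item to put_item.

    put_item is injected so the sketch runs without AWS access; in a deployment
    it would wrap a DynamoDB PutItem call.
    """
    written = 0
    for record in event.get("Records", []):
        # Kinesis delivers the payload base64-encoded under kinesis.data.
        payload = base64.b64decode(record["kinesis"]["data"])
        put_item(normalize_event(json.loads(payload)))
        written += 1
    return {"written": written}
```

Injecting the writer keeps the transformation logic testable on its own, which is one reasonable way to separate it from the storage call.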

Step 4
In batch data processing, the ingestion services collect and store raw data in Amazon Simple Storage Service (Amazon S3).

Step 5
AWS Step Functions orchestrates AWS Glue data pipeline jobs to clean and validate data. The cleansed data is passed to an Identity Resolution workflow. This workflow is built using AWS Entity Resolution.
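
As an illustration of this orchestration, the following sketch builds an Amazon States Language definition as a Python dictionary: two AWS Glue jobs run in sequence, followed by an identity resolution task. The state names, job names, and workflow name are hypothetical. The `glue:startJobRun.sync` resource is the Step Functions integration for AWS Glue; invoking AWS Entity Resolution through an `aws-sdk` integration with a `WorkflowName` parameter is an assumption of this sketch and should be checked against your own workflow.

```python
import json
from typing import Any, Dict, List


def build_state_machine(glue_clean_job: str, glue_validate_job: str,
                        matching_workflow: str) -> Dict[str, Any]:
    """Return a state machine definition: clean, validate, then resolve identities."""
    return {
        "Comment": "Cleanse and validate data, then run identity resolution",
        "StartAt": "CleanData",
        "States": {
            "CleanData": {
                "Type": "Task",
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": glue_clean_job},
                "Next": "ValidateData",
            },
            "ValidateData": {
                "Type": "Task",
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": glue_validate_job},
                "Next": "ResolveIdentities",
            },
            "ResolveIdentities": {
                "Type": "Task",
                # Assumption: AWS SDK service integration for Entity Resolution.
                "Resource": "arn:aws:states:::aws-sdk:entityresolution:startMatchingJob",
                "Parameters": {"WorkflowName": matching_workflow},
                "End": True,
            },
        },
    }


def execution_order(definition: Dict[str, Any]) -> List[str]:
    """Follow the Next pointers from StartAt to list states in run order."""
    order: List[str] = []
    name = definition["StartAt"]
    while name is not None:
        order.append(name)
        name = definition["States"][name].get("Next")
    return order


if __name__ == "__main__":
    print(json.dumps(build_state_machine("cdp-clean", "cdp-validate", "cdp-matching"), indent=2))
```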

Step 6
Data processing and transient data storage for the Identity Resolution workflow use the Amazon S3 clean zone bucket. The Amazon S3 curated zone bucket stores the final output of data processing for consumption.

Step 7
The unified customer profile is stored in Amazon S3 and used for segmentation. Artificial intelligence and machine learning (AI/ML) models for segmentation are developed and deployed using Amazon SageMaker. The unified view of customer profiles for contact center applications is stored in Amazon Connect Customer Profiles. Next Best Item recommendations for cross-sell or upsell are created from the unified customer view using Amazon Personalize.

Step 8
Amazon Pinpoint uses the unified customer profile to conduct multi-channel outbound marketing. Amazon Connect uses the unified customer profile to enhance the customer's experience in call centers. Audiences are uploaded to advertising platforms using Amazon AppFlow integrations.

Step 9
AWS Clean Rooms is used for privacy-enhanced data collaborations to support media planning, audience activation, and measurement use cases. The customer 360 profile is made available for API-based consumption using DynamoDB, Lambda, and API Gateway.
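
The API-based consumption path can be sketched as a Lambda function behind an API Gateway proxy integration. This is an illustrative sketch only: the `customer_id` path parameter and the injected `get_profile` lookup are assumptions, and in a deployment the lookup would wrap a DynamoDB GetItem call against the profile table.

```python
import json
from typing import Any, Callable, Dict, Optional

Profile = Dict[str, Any]


def _response(status: int, body: Dict[str, Any]) -> Dict[str, Any]:
    """Build the response shape expected by an API Gateway proxy integration."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(body),
    }


def get_profile_handler(event: Dict[str, Any], context: Any,
                        get_profile: Callable[[str], Optional[Profile]]) -> Dict[str, Any]:
    """Return the customer 360 profile for the requested customer_id."""
    # pathParameters is null in the proxy event when no path parameters are present.
    customer_id = (event.get("pathParameters") or {}).get("customer_id")
    if not customer_id:
        return _response(400, {"message": "customer_id is required"})
    profile = get_profile(customer_id)  # hypothetical lookup, e.g. DynamoDB GetItem
    if profile is None:
        return _response(404, {"message": "profile not found"})
    return _response(200, profile)
```

For a local check, a dictionary's `get` method can stand in for the lookup, for example `get_profile_handler(event, None, {"c-1": {...}}.get)`.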

Step 10
Amazon Redshift stores clean, modeled data for fast and repeated queries. Amazon QuickSight provides large-scale data analysis and visualization. Amazon Athena enables data exploration and querying.

Step 11
Customer 360 profile data is uploaded to paid media ad platforms such as Amazon Marketing Cloud and Amazon DSP for online media targeting. Marketing platforms and other SaaS solutions use the customer 360 profile data for marketing and data monetization use cases. Media platforms use customer 360 profiles for website and mobile app personalization.

Step 12
AWS Lake Formation defines access controls on AWS Glue catalog tables, columns, and rows in the data lake. AWS Identity and Access Management (IAM) securely manages identities and access to AWS services and resources.
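
A column-level grant of the kind described here might look like the following request for the Lake Formation `GrantPermissions` API. The sketch only builds the request; the role ARN, database, table, and column names used with it are hypothetical.

```python
from typing import Any, Dict, Iterable


def column_grant_request(principal_arn: str, database: str, table: str,
                         columns: Iterable[str]) -> Dict[str, Any]:
    """Build a GrantPermissions request that allows SELECT on specific columns only."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


if __name__ == "__main__":
    # Hypothetical names: an analyst role that may read non-sensitive columns only.
    print(column_grant_request(
        "arn:aws:iam::111122223333:role/cdp-analyst",
        "cdp_curated", "customer_profile", ["customer_id", "segment"],
    ))
```

With boto3 available, the dictionary could be passed as `boto3.client("lakeformation").grant_permissions(**request)`.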

Well-Architected Pillars

The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.

The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.

Operational Excellence

This Guidance has observability built in: every AWS service publishes metrics to Amazon CloudWatch, where you can configure dashboards and alarms. By using CloudWatch alarms and Amazon Simple Notification Service (Amazon SNS), you are notified of incidents and can respond appropriately.
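
As one example of such an alarm, the sketch below builds the parameters for a CloudWatch `PutMetricAlarm` call that watches the `Errors` metric of a Lambda function and notifies an SNS topic. The function name, topic ARN, five-minute period, and threshold are assumptions chosen for illustration, not values prescribed by this Guidance.

```python
from typing import Any, Dict


def lambda_error_alarm(function_name: str, sns_topic_arn: str,
                       threshold: int = 1) -> Dict[str, Any]:
    """Build PutMetricAlarm parameters for errors on one Lambda function."""
    return {
        "AlarmName": f"{function_name}-errors",
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 300,                 # seconds; assumed evaluation window
        "EvaluationPeriods": 1,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # no invocations is not treated as an incident
        "AlarmActions": [sns_topic_arn],     # the SNS topic receives the notification
    }
```

With boto3 available, the dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`.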

Read the Operational Excellence whitepaper

Security

IAM policies follow the principle of least privilege and are restricted to specific resources and operations, which supports secure access for both people and machines. To further protect resources in this Guidance, secrets and configuration items are centrally managed and secured using AWS Key Management Service (AWS KMS). To protect data at rest, the Amazon S3 buckets are encrypted using AWS KMS keys. Data in transit is encrypted and transferred over HTTPS.

Additionally, public access is blocked on all of the Amazon S3 buckets. Access to DynamoDB is required only from within a virtual private cloud (VPC), so a VPC endpoint limits access to the required VPC. This keeps that traffic off the public internet.
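
A least-privilege policy combining these ideas could be sketched as follows. The policy allows only two read operations on a single DynamoDB table, and only for requests that arrive through a given VPC endpoint, using the `aws:SourceVpce` global condition key. The statement ID, table ARN, and endpoint ID are hypothetical, and the chosen actions are an assumption about what a read-only consumer needs.

```python
import json
from typing import Any, Dict


def table_read_policy(table_arn: str, vpc_endpoint_id: str) -> Dict[str, Any]:
    """Build an IAM policy scoped to one table, two read actions, one VPC endpoint."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadProfilesThroughVpcEndpointOnly",
                "Effect": "Allow",
                "Action": ["dynamodb:GetItem", "dynamodb:Query"],
                "Resource": table_arn,
                "Condition": {"StringEquals": {"aws:SourceVpce": vpc_endpoint_id}},
            }
        ],
    }


if __name__ == "__main__":
    policy = table_read_policy(
        "arn:aws:dynamodb:us-east-1:111122223333:table/customer-profile",
        "vpce-0123456789abcdef0",
    )
    print(json.dumps(policy, indent=2))
```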

Read the Security whitepaper

Reliability

By deploying this Guidance, you also implement a highly available network topology in multiple ways. First, every service and technology chosen for each architecture layer is serverless and fully managed by AWS, making the overall architecture elastic, highly available, and fault-tolerant. Second, DynamoDB has a point-in-time recovery feature that provides continuous backups of your tables and enables you to restore your table data to any point in time in the preceding 35 days. Third, Amazon S3 offers industry-leading durability, availability, performance, security, and virtually unlimited scalability at very low cost. Finally, AWS serverless services, including Lambda, are fault-tolerant and designed to handle failures. If a service invokes a Lambda function and there is a service disruption, Lambda invokes the function in a different Availability Zone.
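
The point-in-time recovery behavior can be illustrated with a short sketch that builds the DynamoDB `UpdateContinuousBackups` and `RestoreTableToPointInTime` requests and checks a requested restore time against the 35-day window. The table names are hypothetical, and the check is a simplification: the earliest restorable time also depends on when point-in-time recovery was enabled for the table.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Optional

PITR_WINDOW = timedelta(days=35)  # recovery window stated for DynamoDB PITR


def enable_pitr_request(table_name: str) -> Dict[str, Any]:
    """Build an UpdateContinuousBackups request that turns PITR on."""
    return {
        "TableName": table_name,
        "PointInTimeRecoverySpecification": {"PointInTimeRecoveryEnabled": True},
    }


def is_restorable(restore_time: datetime, now: datetime) -> bool:
    """True if restore_time is not in the future and is within the last 35 days."""
    return restore_time <= now and now - restore_time <= PITR_WINDOW


def restore_request(source_table: str, target_table: str, restore_time: datetime,
                    now: Optional[datetime] = None) -> Dict[str, Any]:
    """Build a RestoreTableToPointInTime request, rejecting out-of-window times."""
    now = now or datetime.now(timezone.utc)
    if not is_restorable(restore_time, now):
        raise ValueError("restore_time is outside the 35-day recovery window")
    return {
        "SourceTableName": source_table,
        "TargetTableName": target_table,  # the restore always creates a new table
        "RestoreDateTime": restore_time,
    }
```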

Read the Reliability whitepaper

Performance Efficiency

The services selected for this Guidance are designed to enhance your workload performance. For example, by using serverless technologies, you provision only the resources you use. The serverless architecture reduces the amount of underlying infrastructure you need to manage, allowing you to focus on solving your business needs. You can also use automated deployments to deploy the different components of this Guidance into any AWS Region quickly, which supports data residency requirements and reduces latency.

Also, all components of this Guidance are collocated in a specific Region and use a serverless stack, which avoids the need for you to make location decisions about your infrastructure apart from the Region choice.

Read the Performance Efficiency whitepaper

Cost Optimization

By using serverless technologies and managed services, you only pay for the resources you consume, helping you control costs. Another way this Guidance can help optimize costs is by helping you plan for data transfer charges. To do this, we recommend you identify data egress points and evaluate the use of network services like AWS PrivateLink and AWS Direct Connect to reduce data transfer costs.

To further optimize compute costs for this Guidance, scope your near real-time data ingestion so that you can use Amazon Kinesis Data Streams in provisioned capacity mode. Provisioned capacity mode is best suited for predictable application traffic: traffic that is consistent, that increases gradually, or for which you can forecast capacity requirements to control costs. Similarly, for DynamoDB, use provisioned capacity mode for predictable workloads to rein in costs. Also, when AWS Glue performs data transformations, you pay for the infrastructure only while the processing is occurring. In addition, through a tenant isolation model and resource tagging, you can automate cost usage alerts and measure costs specific to each tenant, application module, and service.
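
Forecasting capacity for provisioned mode comes down to simple arithmetic. The sketch below estimates the number of Kinesis Data Streams shards from forecast traffic, assuming the standard per-shard limits of 1 MB/s or 1,000 records/s for writes and 2 MB/s for shared-throughput reads. The function name and the sample figures are illustrative, and the estimate ignores headroom for traffic spikes and uneven partition keys.

```python
import math

# Per-shard limits assumed by this sketch (shared-throughput consumers).
WRITE_MB_PER_SHARD = 1.0
WRITE_RECORDS_PER_SHARD = 1000
READ_MB_PER_SHARD = 2.0


def shards_needed(write_mb_per_s: float, records_per_s: float,
                  read_mb_per_s: float) -> int:
    """Return the smallest shard count that satisfies all three limits."""
    by_write = math.ceil(write_mb_per_s / WRITE_MB_PER_SHARD)
    by_records = math.ceil(records_per_s / WRITE_RECORDS_PER_SHARD)
    by_read = math.ceil(read_mb_per_s / READ_MB_PER_SHARD)
    return max(1, by_write, by_records, by_read)  # a stream has at least one shard


if __name__ == "__main__":
    # 3.5 MB/s in, 2,500 records/s, 7 MB/s out: write and read throughput each need 4 shards.
    print(shards_needed(3.5, 2500, 7.0))
```

When many small records dominate, the record-count limit rather than byte throughput usually decides the shard count, which is why the estimate takes the maximum of the three.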

Read the Cost Optimization whitepaper

Sustainability

This Guidance scales to continually match the needs of your workloads with only the minimum resources required through the extensive use of serverless services. The efficient use of these resources also reduces the overall energy required to operate your workloads. This Guidance also uses purpose-built data stores for specific workloads, which minimizes the resources provisioned. For example, Amazon S3 is used for data lake storage, and DynamoDB is used to support low-latency queries.

Finally, all of the services used in this Guidance are managed services that allocate hardware according to the workload demand. We recommend using the provisioned capacity options (as mentioned previously) in the services when available, and when the workload is predictable, to reduce cost.

Read the Sustainability whitepaper

Implementation Resources

A detailed guide is provided for you to experiment with and use within your AWS account. It examines each stage of building the Guidance, including deployment, usage, and cleanup, to prepare it for deployment.

The sample code is a starting point. It is industry validated, prescriptive but not definitive, and a peek under the hood to help you begin.

Open implementation guide

Open sample code on GitHub

Disclaimer

The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.
