AWS Architecture Blog
An overview and architecture of building a Customer Data Platform on AWS
The deprecation of digital consumer identifiers, such as third-party cookies and mobile advertising IDs, and the rapid growth of data from expanding consumer touchpoints, has created challenges in identifying, managing, and reaching customers in digital channels. Organizations must rethink their strategies for collecting and storing customer data. Customer Data Platforms (CDPs) collect, aggregate, and organize customer data sources, and create individual centralized customer profiles to better manage and understand customers.
Independent Software Vendors (ISVs) in the advertising and marketing industry vertical can aid many companies in achieving these goals. The ISVs can help organizations with the heavy lifting required to build, secure, govern, and maintain near real-time, high volume CDPs. However, building these types of vendor solutions, that support a large number of customers, data volumes, and use cases, is a complex undertaking with often unforeseen challenges.
This post examines the logical architecture of the CDP to provide guidance to help reduce complexity, increase agility, improve operational excellence, and optimize cost. Although the material can benefit advanced marketers evaluating building a CDP, the intent of this post is to provide guidance to those ISVs that are facing these challenges for multiple clients.
Marketing CDP logical architecture
The principal challenge of the CDP architecture is to integrate data from many disparate sources and types. Envision a data lake-centric approach based on a layered architecture where the sources of customer data flow through six logical layers: Ingestion; Processing; Storage; Unified Governance (and Security); Cataloging; and Consumption.
A layered, component-oriented architecture promotes separation of concerns, decoupling of tasks, and the flexibility required to build each component consistent with best practices. These components provide the agility necessary to quickly integrate new data sources and support new analytics or product capabilities. The components are depicted in the image below in the conceptual logical model of the marketing CDP and are then described in this post.
Logical ingestion layer
The ingestion layer is responsible for collecting data across various customer touchpoints. It provides the ability to connect with internal and external data sources. It can ingest batch, near real-time, and real-time data into the storage layer. This layer aggregates data from multiple source systems and therefore elevates the marketing CDP as the primary repository of marketing and advertising data across an organization. Sources can include cloud or on-premise data sources, streaming data, file stores, third-party Software as a Service (SaaS) connectors and APIs. Separating this component into three layers reduces complexity while providing agility to the process.
Logical storage layer
A scalable, flexible, resilient, and reliable storage layer is critical to the value proposition of a marketing CDP. It consists of three distinct storage areas:
- Raw Zone – Contains ingested data in its original, immutable format, which can be used to source additional attributes in the future. It can also be used to restore data in certain disaster recovery scenarios. This layer acts as an immutable record of what has happened/been observed historically so that we can use that immutable data to generate a source of fact.
- Clean Zone – Contains the first transformation of raw data, including conversions to an efficient data format such as Parquet or Avro, as well as basic data quality validations. This layer also acts as an ad-hoc layer to develop answers to unknown question in reasonable time frames so that they can be migrated to the curated zone.
- Curated Zone – Contains data, organized by subject area, that is ready for consumption by users and applications For a marketing CDP, this includes identity resolution, data enrichment, customer segmentation, and aggregation.
Automated data archival can be configured individually for each layer, and aligned to compliance requirements set by the organization. Access to these layers is controlled at a granular level to ensure a secure and collaborative data exchange and exploration.
Logical cataloging layer
The cataloging layer provides a centralized governance control, including mechanisms for data access control, versioning, and metadata exploration. It provides the ability to track the schema and the partitioning of datasets. This layer makes the datasets discoverable. The Governance capabilities of the Catalog ensure standardization for audit purposes.
Logical processing layer
This layer is responsible for transforming data into a consumable state by applying business rules for data validation, identity resolution, segmentation, normalization, profile aggregation, and machine learning (ML) processing. This layer comprises custom application logic. The compute resources for this layer are designed to scale independently from storage to handle large data volumes; support schema-on-read, support partitioned data and diverse data formats; and orchestrate event-based data processing pipelines.
Logical consumption layer
The consumption layer is responsible for providing scalable tools to gain insights from the vast amount of data in the marketing CDP.
- Analytics layer – Enables consumption by all user personas through several purpose-built analytics tools that support analysis methods, including ad-hoc SQL queries, batch analytics, business intelligence (BI) dashboards and ML-based insights. Components in this layer should support schema-on-read, data partitioning, and a variety of formats.
- Data collaboration layer – Consists of data clean rooms where organizations can aggregate customer data from different marketing channels or lines of business, and combine it with first-party data to gain insights while enforcing security, anonymization, and compliance controls.
- Activation layer – This layer integrates customer profiles with the organization. It can also integrate with third-party SaaS providers in the advertising and marketing industry, and is capable of enriching data sets for consumption.
Logical security and governance layer
The Security and Governance layer is responsible for providing mechanisms for access control, encryption, auditing, and data privacy. CDP platforms must securely organize and control the flow of customer event and attribute data. The CDP must manage data, regardless of ingestion method, to unify that data to unique customer profiles, centralizing audience segmentation, and forwarding data to your purpose-built data stores.
Privacy regulations, which often vary by region or country, make it necessary to focus on collecting only the vital data for your marketing efforts. The CDP must align to a standards-based security process. There must be procedures in place to audit data collection, follow least privilege data access, and avoid data silos.
A marketing CDP must include the following security and governance aspects:
- Encryption at rest – Data must be persisted in encrypted format to protect it from unauthorized access.
- Encryption in transit – To protect data in transit, encryption protocols such as TLS and certificates to create a secure HTTPS connection to make API requests.
- Key management – Keys must be managed securely because they grant access to data.
- Secrets management – Secrets, such as application passwords and login credentials, must be protected from unintended access.
- Fine-grained access controls – Control data access to only those users that have the right to see the data.
- Data archival – Users need to take advantage of storage tiers and data lifecycle policies, which automatically move data to lower cost tiers over time, based on expected access patterns.
- Auditing – It is critical to monitor and record all activity within the environment with the goal of being able to analyze activity down to individual API call level.
- Data masking – It is important to allow users the ability to automatically detect and optionally mask, substitute, or encrypt/decrypt Personally Identifiable Information (PII). This helps outputs of the CDP to comply with such standards as HIPAA and GDPR.
- Compliance programs – Compliance frameworks such as SOC2, GDPR, CCPA, and others can be attested by tying together governance-focused, audit-friendly service features with applicable compliance or audit standards.
Conclusion: Using the CDP to better manage customers
In this post, we reviewed a logical CDP data architecture that addresses several complexities at scale, using a decoupled, component-driven architecture. The Data Analytics Lens can provide further guidance when designing, deploying, and architecting analytics solution workloads. In addition, ISVs should consider a serverless model for implementation, which helps optimize cost and scalability while reducing the required maintenance on the system.