This Guidance introduces data ingestion patterns for connecting advertising and marketing data to AWS services. Data can come from a variety of data stores and, once activated, can be used for setting up a customer 360 profile, an AWS Clean Rooms collaboration, artificial intelligence and machine learning (AI/ML) training, and analytics applications. This Guidance includes an overview architecture diagram demonstrating the data pipeline, in addition to six architectural patterns that show different approaches to provisioning data for your analytical workloads.
Architecture Diagram
-
Overview
-
API Pull Pattern with AWS Lambda
-
API Pull Pattern with Amazon AppFlow
-
Push Pattern with Amazon S3
-
Batch Pull and Change Data Capture Pattern
-
Managed File Transfer Pattern
-
File Replication Pattern
-
Overview
-
This architecture diagram shows an overview of how to connect data stored in a variety of data sources to AWS. To review the architectural patterns, open the other tabs.
Step 1
Sources of data needed for advertising and marketing analytics belong to one of three categories: software as a service (SaaS) applications, relational databases, or file storage.
Step 2
Use AWS services purpose-built for data ingestion to connect and pull data from data sources. The subsequent architecture patterns detail the ingestion service for each type of data source.
Step 3
Use a cloud data storage “raw” zone as the destination for data ingestion services.
Step 4
Use extract, transform, load (ETL) data processing jobs to transform data in a way that meets data consumption needs.
Step 5
Store the transformed data in a cloud data storage “clean” zone. Catalog the data as relational tables in a data catalog service.
Step 6
To build analytical applications, make the cataloged data available to consuming services such as AWS Clean Rooms, AWS Entity Resolution, Amazon SageMaker, Amazon Redshift, Amazon Athena, and Amazon QuickSight (see the sketch following these steps).
Step 7
Build a unified observability stack that delivers the following functionality: a workflow metadata repository; workflow trigger events; task chaining to form an end-to-end workflow with consumption workloads; observability notifications; and log capture with detailed observability dashboards.
Step 8
Implement security and access control to achieve the following functionality: least privilege access to specific resources and operations; encryption for data at rest and data in transit; storage of hashing keys for personally identifiable information (PII); and monitoring of logs and metrics across all services used in this Guidance.
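As an illustration of Step 6, here is a minimal sketch of querying the cataloged clean-zone data with Amazon Athena through boto3. The database, table, column, and results-bucket names are hypothetical placeholders, not part of this Guidance.

```python
import boto3

athena = boto3.client("athena")

# Hypothetical catalog database and results location for illustration.
DATABASE = "clean_zone"
OUTPUT_LOCATION = "s3://example-athena-results/"

# Run a query against a hypothetical cataloged table from a consuming application.
response = athena.start_query_execution(
    QueryString=(
        "SELECT campaign_id, SUM(impressions) AS impressions "
        "FROM campaign_performance GROUP BY campaign_id"
    ),
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
)
print(response["QueryExecutionId"])
```

-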
API Pull Pattern with AWS Lambda
-
This architecture diagram shows an API pull pattern with AWS Lambda for Amazon Ads and Amazon Selling Partner APIs. To review the other architectural patterns, open the other tabs.
Step 1
Amazon EventBridge schedules a job that starts an AWS Step Functions state machine. The state machine orchestrates a series of AWS Lambda functions to facilitate report creation.
Step 2
The state machine invokes a Lambda function (Create Request Execution) to create a report request from the Amazon Ads API or Amazon Selling Partner API.
Step 3
Store API credentials within AWS Secrets Manager, and use them when making calls to the APIs. The state machine then moves into a series of polling steps, invoking a Lambda function (Check Report Status) to check the report request status before downloading the report.
Step 4
Amazon DynamoDB stores the metadata for each downloaded report.
Step 5
The Lambda function (Download Report) writes the report into a raw Amazon Simple Storage Service (Amazon S3) bucket with a prefix that contains the specific report type and report date (see the sketch following these steps). Lambda uses the AWS managed AWS Key Management Service (AWS KMS) key to encrypt the reports as they’re written to the S3 bucket.
Step 6
A Step Functions state machine is invoked by notifications from Amazon S3 as objects are inserted into the bucket. When no more objects are received after a given time, the data transformation Step Functions state machine starts and invokes a Lambda function.
Step 7
A Lambda function (Update Metadata) stores the task token for the Step Functions execution ID in the DynamoDB table. An AWS Glue job reads the data from the raw S3 bucket and transforms it into a usable format.
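Here is a minimal sketch of what the Download Report Lambda function (Step 5) might look like. The event fields, environment variable names, and key layout are hypothetical; the handler writes the report to the raw S3 bucket under a report-type and report-date prefix, then records metadata in DynamoDB.

```python
import datetime
import os

import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")

RAW_BUCKET = os.environ["RAW_BUCKET"]          # raw zone bucket (hypothetical name)
METADATA_TABLE = os.environ["METADATA_TABLE"]  # report metadata table (hypothetical name)


def handler(event, context):
    """Write a downloaded report to the raw zone and record its metadata."""
    report_type = event["reportType"]  # for example, "sponsored-products"
    report_date = event["reportDate"]  # for example, "2024-06-01"

    key = f"{report_type}/{report_date}/report.json"
    s3.put_object(
        Bucket=RAW_BUCKET,
        Key=key,
        Body=event["reportBody"].encode("utf-8"),
        ServerSideEncryption="aws:kms",  # encrypt with the AWS managed KMS key
    )

    # Record metadata for the downloaded report (Step 4).
    dynamodb.Table(METADATA_TABLE).put_item(
        Item={
            "reportId": event["reportId"],
            "reportType": report_type,
            "reportDate": report_date,
            "s3Key": key,
            "downloadedAt": datetime.datetime.utcnow().isoformat(),
        }
    )
    return {"bucket": RAW_BUCKET, "key": key}
```

-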
API Pull Pattern with Amazon AppFlow
-
This architecture diagram shows an API pull pattern with Amazon AppFlow for SaaS application data. To review the other architectural patterns, open the other tabs.
Step 1
EventBridge schedules a job that starts the Step Functions state machine, which includes the launch of Amazon AppFlow.
Step 2
The Amazon AppFlow flow starts by opening the connection to the external data provider and requesting data. The external provider responds with data to Amazon AppFlow.
Step 3
Amazon AppFlow puts the data into the raw S3 bucket and specified prefix. Amazon AppFlow uses an AWS KMS key to encrypt objects written to the raw S3 bucket.
Step 4
A Step Functions state machine is initiated by notifications from Amazon S3 as the Guidance stores objects in the bucket. When no more objects are received after a given time, the data transformation Step Functions state machine starts and continues with the common flow.
Step 5
A Lambda function stores the task token for the Step Functions execution ID in the DynamoDB table and invokes the AWS Glue job (see the sketch following these steps). The AWS Glue job runs, reads data from the raw bucket, and transforms it. AWS Glue writes the transformed data to the clean S3 bucket, along with AWS Glue Data Catalog metadata. The AWS KMS customer managed key (CMK) created by this stack encrypts the bucket contents and the AWS Glue metadata.
Step 6
DynamoDB stores the Step Functions workflow metadata and execution information to support unified observability. Amazon Simple Notification Service (Amazon SNS) then generates observability notifications.
Step 7
Use the following AWS services for security and access: AWS Identity and Access Management (IAM) enables least privilege access to specific resources and operations. AWS KMS provides encryption for data at rest and data in transit. Secrets Manager provides hashing keys for PII. Amazon CloudWatch monitors logs and metrics across all services used in this Guidance.
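Here is a minimal sketch of the Step 5 Lambda function under the Step Functions task-token callback pattern. The table name, job name, and event fields are hypothetical; the function persists the task token so a later step can complete the callback, then starts the AWS Glue job.

```python
import os

import boto3

dynamodb = boto3.resource("dynamodb")
glue = boto3.client("glue")

WORKFLOW_TABLE = os.environ["WORKFLOW_TABLE"]  # workflow metadata table (hypothetical)
GLUE_JOB_NAME = os.environ["GLUE_JOB_NAME"]    # transformation job (hypothetical)


def handler(event, context):
    """Persist the Step Functions task token, then start the AWS Glue job."""
    execution_id = event["executionId"]
    task_token = event["taskToken"]  # injected by a waitForTaskToken task state

    # Store the token so a downstream step can call SendTaskSuccess or SendTaskFailure.
    dynamodb.Table(WORKFLOW_TABLE).put_item(
        Item={"executionId": execution_id, "taskToken": task_token}
    )

    # Start the transformation job, passing the execution ID through as a job argument.
    run = glue.start_job_run(
        JobName=GLUE_JOB_NAME,
        Arguments={"--execution_id": execution_id},
    )
    return {"jobRunId": run["JobRunId"]}
```

-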
Push Pattern with Amazon S3
-
This architecture diagram shows a push pattern with Amazon S3 for SaaS application data. To review the other architectural patterns, open the other tabs.
Step 1
External data sources push raw data files (such as CSV) into a daily partitioned Landing Zone S3 bucket. Refer to external documentation to set up push job inputs like S3 bucket location, access key, and schedule frequency.
Step 2
Create a rule in EventBridge to schedule a Step Functions standard workflow for data processing at the required frequency (see the sketch following these steps).
Step 3
In the workflow, use a Lambda function to do file-level processing, such as Pretty Good Privacy (PGP) decryption. Place the decrypted file in a different S3 bucket prefix.
Step 4
Use AWS Glue jobs to process the decrypted data files in the Landing Zone S3 bucket, and write data in a separate Processed Zone S3 bucket. Write the object in read-optimized Apache Parquet format. Apply attribute-level transformations such as SHA-256 hashing to secure sensitive data. Apply a partitioning scheme as needed to optimize reads.
Step 5
The AWS Glue crawler runs from the workflow to catalog the read-optimized data in the Data Catalog.
Step 6
Use a Lambda function to do post-processing activities, such as moving the source data files to an "archive" prefix location as part of clean-up.
Step 7
Use Amazon SNS to publish a workflow complete event and notify operators and users through email. Use the HTTP or topic subscription options to integrate with other observability tools.
Step 8
Use the following AWS services for security and access: IAM enables least privilege access to specific resources and operations. AWS KMS provides encryption for data at rest and data in transit. Secrets Manager provides hashing keys for PII. CloudWatch monitors logs and metrics across all services used in this Guidance.
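Here is a minimal sketch of Step 2 using boto3: an EventBridge rule on a daily schedule targeting the Step Functions standard workflow. The rule name, cron expression, and ARNs are hypothetical placeholders.

```python
import boto3

events = boto3.client("events")

# Hypothetical names and ARNs for illustration.
RULE_NAME = "daily-data-processing"
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:111122223333:stateMachine:ProcessLandingZone"
EVENTS_ROLE_ARN = "arn:aws:iam::111122223333:role/EventBridgeInvokeStepFunctions"

# Create a scheduled rule that fires once a day at 06:00 UTC.
events.put_rule(
    Name=RULE_NAME,
    ScheduleExpression="cron(0 6 * * ? *)",
    State="ENABLED",
)

# Point the rule at the Step Functions standard workflow.
events.put_targets(
    Rule=RULE_NAME,
    Targets=[
        {
            "Id": "data-processing-workflow",
            "Arn": STATE_MACHINE_ARN,
            "RoleArn": EVENTS_ROLE_ARN,
        }
    ],
)
```

-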
Batch Pull and Change Data Capture Pattern
-
This architecture diagram shows a batch pull and change data capture pattern for RDBMS sources. To review the other architectural patterns, open the other tabs.
Step 1a
Use AWS Glue and pre-built or marketplace connectors to extract the data needed for advertising and marketing analytical use cases from the relational database management system (RDBMS) in batch mode. AWS Glue retrieves the data from data stores and loads it into an S3 bucket. Amazon S3 is configured as a target for storing remote database files in Parquet format.
Step 1b
Use AWS Database Migration Service (AWS DMS) to replicate data stored in compatible relational databases (on-premises or on a cloud) to AWS (see the sketch following these steps).
Step 2
A rule in EventBridge schedules a Step Functions standard workflow for post-upload processing at the required frequency.
Step 3
AWS Glue jobs and workflows do the row-level processing of the decrypted data files and write data in a separate S3 bucket. Write the object in read-optimized Apache Parquet format. Apply attribute-level transformations such as SHA-256 hashing to secure sensitive data. Apply a custom partitioning scheme as needed to optimize reads.
Step 4
The AWS Glue crawler runs from the workflow to catalog the read-optimized data in the Data Catalog.
Step 5
Publish a notification to Amazon SNS to notify operators of the success or failure of the workflow. Use the HTTP or topic subscription options to integrate with other observability tools.
Step 6
Use the following AWS services for security and access: IAM enables least privilege access to specific resources and operations. AWS KMS provides encryption for data at rest and data in transit. Secrets Manager provides hashing keys for PII. CloudWatch monitors logs and metrics across all services used in this Guidance.
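Here is a minimal sketch of Step 1b using boto3: creating an AWS DMS task that performs a full load followed by ongoing change data capture. The endpoint and replication instance ARNs, schema name, and task identifier are hypothetical placeholders.

```python
import json

import boto3

dms = boto3.client("dms")

# Hypothetical ARNs for illustration.
SOURCE_ENDPOINT_ARN = "arn:aws:dms:us-east-1:111122223333:endpoint:source-rdbms"
TARGET_ENDPOINT_ARN = "arn:aws:dms:us-east-1:111122223333:endpoint:s3-raw-zone"
REPLICATION_INSTANCE_ARN = "arn:aws:dms:us-east-1:111122223333:rep:replication-instance"

# Replicate only the tables in a hypothetical "marketing" schema.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-marketing-schema",
            "object-locator": {"schema-name": "marketing", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

# Full load first, then ongoing change data capture (CDC).
dms.create_replication_task(
    ReplicationTaskIdentifier="marketing-full-load-and-cdc",
    SourceEndpointArn=SOURCE_ENDPOINT_ARN,
    TargetEndpointArn=TARGET_ENDPOINT_ARN,
    ReplicationInstanceArn=REPLICATION_INSTANCE_ARN,
    MigrationType="full-load-and-cdc",
    TableMappings=json.dumps(table_mappings),
)
```

-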
Managed File Transfer Pattern
-
This architecture diagram shows a managed file transfer pattern for SFTP data sources. To review the other architectural patterns, open the other tabs.
Step 1
AWS Transfer Family securely migrates data stored in file systems (on-premises or on a cloud) that is needed for advertising and marketing analytical use cases.
Step 2
Raw files from remote servers are uploaded as-is into a raw S3 bucket.
Step 3
Transfer Family managed workflows complete post-upload processing of files, including decrypting, error checking, and formatting changes (see the sketch following these steps).
Step 4
Lambda completes custom post-processing of data before sending it to storage.
Step 5
Processed data that is ready for analysis by applications and other data consumers is stored in a standard S3 bucket.
Step 6
Transfer Family managed workflows invoke a Lambda function when post-upload processing steps fail for a file.
Step 7
Amazon SNS publishes an exception event to notify users through email or other observability tools.
Step 8
Use the following AWS services for security and access: IAM enables least privilege access to specific resources and operations. AWS KMS provides encryption for data at rest and data in transit. Secrets Manager provides hashing keys for PII. CloudWatch monitors logs and metrics across all services used in this Guidance.
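Here is a minimal sketch of Steps 3, 4, and 6 using boto3: a Transfer Family managed workflow with a PGP decrypt step, a custom Lambda post-processing step, and an exception handler. The bucket name and Lambda ARNs are hypothetical placeholders.

```python
import boto3

transfer = boto3.client("transfer")

# Hypothetical bucket and Lambda ARNs for illustration.
RAW_BUCKET = "example-raw-zone"
POST_PROCESS_LAMBDA_ARN = "arn:aws:lambda:us-east-1:111122223333:function:post-process"
EXCEPTION_LAMBDA_ARN = "arn:aws:lambda:us-east-1:111122223333:function:notify-failure"

transfer.create_workflow(
    Description="Post-upload processing for advertising data files",
    Steps=[
        {
            # Step 3: decrypt the uploaded PGP file into the raw bucket.
            "Type": "DECRYPT",
            "DecryptStepDetails": {
                "Name": "decrypt-upload",
                "Type": "PGP",
                "OverwriteExisting": "TRUE",
                "DestinationFileLocation": {
                    "S3FileLocation": {"Bucket": RAW_BUCKET, "Key": "decrypted/"}
                },
            },
        },
        {
            # Step 4: custom post-processing in Lambda.
            "Type": "CUSTOM",
            "CustomStepDetails": {
                "Name": "post-process",
                "Target": POST_PROCESS_LAMBDA_ARN,
                "TimeoutSeconds": 300,
            },
        },
    ],
    # Step 6: invoke a Lambda function if any step fails.
    OnExceptionSteps=[
        {
            "Type": "CUSTOM",
            "CustomStepDetails": {
                "Name": "notify-failure",
                "Target": EXCEPTION_LAMBDA_ARN,
                "TimeoutSeconds": 60,
            },
        }
    ],
)
```

-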
File Replication Pattern
-
This architecture diagram shows a file replication pattern for object storage sources. To review the other architectural patterns, open the other tabs.
Step 1
Install and configure the AWS DataSync agent on a virtual machine in the public cloud where the source object storage is hosted.
Step 2
The DataSync agent and DataSync allow discovery and scheduling of data transfer for both the initial sync and continuous, ongoing sync (see the sketch following these steps).
Step 3
Configure DataSync to store the replicated data in a Landing Zone S3 bucket.
Step 4
Create a rule in EventBridge to schedule a Step Functions standard workflow for data processing at the required frequency.
Step 5
In the workflow, use a Lambda function to do any necessary file- or object-level decryption, and invoke an AWS Glue task to normalize the data.
Step 6
Use AWS Glue jobs and workflows to do data processing of the decrypted data files, and write data in a separate S3 bucket.
Step 7
Write the object in read-optimized Apache Parquet format. Apply attribute-level transformations such as SHA-256 hashing to secure sensitive data. Apply a custom partitioning scheme as needed to optimize reads.
Step 8
Create an AWS Glue crawler, and add it to the workflow to catalog the read-optimized data in the Data Catalog.
Step 9
Use another Lambda function to do post-processing activities, such as moving the source data files to an "archive" prefix location as part of clean-up and to save on storage costs.
Step 10
Use Amazon SNS to publish a workflow complete event and notify operators and users through email. Use the HTTP or topic subscription options to integrate with other observability tools.
Step 11
Use the following AWS services for security and access: IAM enables least privilege access to specific resources and operations. AWS KMS provides encryption for data at rest and data in transit. Secrets Manager provides hashing keys for PII. CloudWatch monitors logs and metrics across all services used in this Guidance.
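Here is a minimal sketch of Steps 2 and 3 using boto3: a scheduled DataSync task that replicates the source object storage location into the Landing Zone S3 bucket. The location ARNs, task name, and schedule are hypothetical placeholders, and the locations are assumed to have been created beforehand.

```python
import boto3

datasync = boto3.client("datasync")

# Hypothetical location ARNs for illustration: the source object storage
# location (reached through the DataSync agent) and the Landing Zone S3 bucket.
SOURCE_LOCATION_ARN = "arn:aws:datasync:us-east-1:111122223333:location/loc-source"
DEST_LOCATION_ARN = "arn:aws:datasync:us-east-1:111122223333:location/loc-landing-zone"

# Create a transfer task that runs every day at 04:00 UTC, covering
# both the initial sync and ongoing incremental syncs.
task = datasync.create_task(
    SourceLocationArn=SOURCE_LOCATION_ARN,
    DestinationLocationArn=DEST_LOCATION_ARN,
    Name="replicate-object-storage-to-landing-zone",
    Schedule={"ScheduleExpression": "cron(0 4 * * ? *)"},
)
print(task["TaskArn"])
```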
Well-Architected Pillars
The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
-
Operational Excellence
The services in this Guidance are serverless, which eliminates the need for users to manage (virtual or bare metal) servers. For example, Step Functions is a serverless managed service for building workflows and reduces undifferentiated heavy lifting associated with building and managing a workflow solution. AWS Glue is a serverless managed service for data processing tasks.
Similarly, the following services eliminate the need for capacity management: Amazon SNS for notifications, AWS KMS for key management, Secrets Manager for secrets, EventBridge for event-driven architectures, DynamoDB for low-latency NoSQL databases, Amazon AppFlow for integrating with third-party applications, Transfer Family for file transfer protocols, DataSync for discovery and sync of remote data sources (on-premises or other clouds), and AWS DMS for a managed data migration service that simplifies migration between supported databases.
-
Security
IAM manages least privilege access to specific resources and operations. AWS KMS provides encryption for data at rest and data in transit, and data files from external sources can additionally be protected with Pretty Good Privacy (PGP) encryption. Secrets Manager provides secrets for remote system access and hashing keys for personally identifiable information (PII). CloudWatch monitors logs and metrics across all services used in this Guidance. As managed services, these not only support a strong security posture, but also help free up time for you to focus your efforts on data and application logic.
-
Reliability
Use of Lambda in the pipeline is limited to file-level processing, such as decryption. This keeps the pipeline from hitting the Lambda 15-minute runtime limit. For all row-level processing, the AWS Glue Spark engine scales to handle large volumes of data. Additionally, you can use Step Functions to set up retries, back-off rates, max attempts, intervals, and timeouts for any failed AWS Glue job.
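For instance, here is a minimal sketch of such a retry policy as an Amazon States Language task state, expressed here as a Python dictionary. The state name, job name, and retry values are hypothetical choices.

```python
# A hypothetical Step Functions task state that runs an AWS Glue job
# synchronously and retries failures with exponential backoff.
run_glue_job_state = {
    "RunGlueJob": {
        "Type": "Task",
        "Resource": "arn:aws:states:::glue:startJobRun.sync",
        "Parameters": {"JobName": "transform-raw-to-clean"},
        "TimeoutSeconds": 3600,  # fail the state if the job hangs
        "Retry": [
            {
                "ErrorEquals": ["States.ALL"],  # retry any error
                "IntervalSeconds": 30,          # wait before the first retry
                "BackoffRate": 2.0,             # double the wait on each attempt
                "MaxAttempts": 3,
            }
        ],
        "End": True,
    }
}
```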
-
Performance Efficiency
The serverless services in this Guidance (including Step Functions, AWS Glue, Lambda, EventBridge, and Amazon S3) reduce the amount of underlying infrastructure you need to manage, allowing you to focus on solving your business needs. You can use automated deployments to quickly deploy the architectural components into any AWS Region while also addressing data residency and low latency requirements.
-
Cost Optimization
When AWS Glue performs data transformations, you only pay for infrastructure during the time the processing is occurring. For Data Catalog, you pay a simple monthly fee for storing and accessing the metadata. With EventBridge Free Tier, you can schedule rules to initiate a data processing workflow. With a Step Functions workflow, you are charged based on the number of state transitions. In addition, through a tenant isolation model and resource tagging, you can automate cost usage alerts to help you measure costs specific to each tenant, application module, and service.
-
Sustainability
Serverless services used in this Guidance (such as AWS Glue, Lambda, and Amazon S3) automatically optimize resource utilization in response to demand. You can extend this Guidance by using Amazon S3 lifecycle configuration to define policies that move objects to different storage classes based on access patterns.
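As an example, here is a minimal sketch of such a lifecycle configuration using boto3. The bucket name, transition days, and storage classes are hypothetical choices.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket name for illustration.
BUCKET = "example-raw-zone"

# Tier raw objects down to colder storage as they age, then expire them.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-raw-zone",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```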
Implementation Resources
A detailed guide is provided to experiment with and use within your AWS account. It walks through each stage of the Guidance, including deployment, usage, and cleanup.
The sample code is a starting point. It is industry validated, prescriptive but not definitive, and a peek under the hood to help you begin.
Disclaimer
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.
References to third-party services or organizations in this Guidance do not imply an endorsement, sponsorship, or affiliation between Amazon or AWS and the third party. Guidance from AWS is a technical starting point, and you can customize your integration with third-party services when you deploy the architecture.