AWS for Industries

Connecting Google Analytics data in AWS Clean Rooms

Businesses across industries increasingly want to complement their first-party data with partner data in order to generate more insights for use cases such as creating a more complete view of their customers, optimizing marketing and advertising campaigns, and improving reporting and measurement. At the same time, these companies want to protect their underlying data and are seeking solutions that let them analyze datasets with their partners in a protected environment while minimizing the movement of their data.

Amazon Web Services (AWS) Clean Rooms helps companies and their partners more easily and securely analyze and collaborate on their collective datasets—without sharing or copying one another’s underlying data. AWS Clean Rooms enables customers to create a secure data clean room in minutes, and collaborate with any other company on AWS to generate unique insights.

A common use case for advertisers and marketers across industries is to collaborate with partners to find aggregate overlapping customer insights across various first- and third-party data sources for campaign planning, activation, and measurement. In some instances, these partner data sources come from software as a service (SaaS) and cloud providers, like web analytics data from Google Analytics.

We will consider the case of an online retailer that plans to introduce a new version of one of its products. The retailer is contemplating an advertising campaign on a social media platform, targeting customers who showed interest in a previous version of the product. Before initiating the campaign, however, it wants to ensure that the target audience is active on that platform. To ascertain this, it will use AWS Clean Rooms to identify the intersection of users who viewed the previous product version (using data from Google Analytics) and users of the social media platform.

Solution Overview

In the solution walkthrough, we will show you how to conduct an AWS Clean Rooms collaboration with Google Analytics data. An AWS Step Functions state machine is triggered on a schedule and, using the Google BigQuery Connector for AWS Glue, ingests Google Analytics data from Google BigQuery into AWS Glue. The data is encrypted and stored in Amazon Simple Storage Service (Amazon S3), and the generated schema is stored in an AWS Glue Data Catalog. The solution uses the AWS Cloud Development Kit (AWS CDK) to automate the deployment of the AWS resources described in this blog.

Finally, once the web analytics event data is ingested into Amazon S3, we will demonstrate how to use AWS Clean Rooms to set restrictive analysis rules, run queries, and collaborate on data tables across multiple AWS accounts. You will be able to perform operations on the data sets such as a join that reveals the users common to the two data sets, correlated by the hashed IP address in the clickstream web analytics data.

Prerequisites

Reference Architecture

Figure 1: Architectural diagram

Architecture Flow

1. Initiating an on-demand connection

a. In your GCP console, go to the IAM & Admin page and then select Service accounts.
b. Create a new service account.

Figure 2: Creating a service account in GCP

c. Under Role, grant the service account the BigQuery Data Transfer role.

Figure 3: Granting permission to service account

d. Under Actions, click on Manage Keys and add a new key.

Figure 4: Managing keys for GCP service accounts

Figure 5: Creating a new key for service account

e. Take the JSON credentials file that was downloaded from your Google Cloud Platform account and convert its contents to base64 format. If you’re using AWS CloudShell, you can run base64 your-key-file.json to perform the conversion.
f. In the AWS Console, go to Secrets Manager in your region of deployment.
g. Click Store a new secret. Under Secret type, select Other type of secret.

Figure 6: Create a new secret within AWS Secrets Manager

h. Under Key/value pairs, enter bigquery_credential as the key, and for the value paste the base64-encoded string created in Step e. On the following page, name the secret bigquery_credentials so that it can be selected when you create the AWS Glue connection in Section 2.
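
If you prefer to script steps e through h, the following boto3 sketch performs the same base64 conversion and secret creation. The key file name, region, and secret name are assumptions; substitute your own values.

# Sketch only: encode the GCP service account key (step e) and store it as a
# secret (steps f-h). File name, region, and secret name are assumptions.
import base64
import json

import boto3

with open("gcp-key.json", "rb") as f:  # the JSON key downloaded in step d
    encoded = base64.b64encode(f.read()).decode("utf-8")

secrets = boto3.client("secretsmanager", region_name="us-east-1")  # your deployment region
secrets.create_secret(
    Name="bigquery_credentials",  # selected when creating the Glue connection in Section 2
    SecretString=json.dumps({"bigquery_credential": encoded}),
)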

2. Securely connecting to the GCP BigQuery dataset

a. Go to the Google BigQuery Connector for AWS Glue (no charge) in the AWS Marketplace.

Figure 7: Subscribe to BigQuery Connector for AWS Glue in AWS Marketplace

b. Choose Continue to Subscribe.
c. Review the terms and conditions, pricing, and other details.
d. Choose Continue to Configuration.
e. Under Fulfillment option, choose the AWS Glue version you are using (3.0).
f. Under Software version, choose your software version.

Figure 8: Configuration of BigQuery Connector for AWS Glue

g. Click Continue to Launch.
h. Under Usage instructions, review the documentation, then click on Activate the Glue connector from AWS Glue Studio. You will be redirected to AWS Glue Studio to create a connection.

Figure 9: Activate the connector

Figure 10: Creating a connection in AWS Glue

i. Under Name, enter a name for your connection (for example, bigquery).
j. Under Connection access, for AWS Secret (optional), select bigquery_credentials.
k. Click Create connection and activate the connector. A message will appear stating that the connection was successfully created, and the connection is now visible on the AWS Glue Studio console.
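
The console flow above is the supported path; for reference, a connection for a Marketplace connector can also be sketched with boto3. The connector URL and class name below are placeholders, so verify the real values against the connector's usage instructions.

# Illustrative sketch: creating a Glue connection for a Marketplace connector
# programmatically. CONNECTOR_URL and CONNECTOR_CLASS_NAME are placeholders;
# take the real values from the connector's usage instructions.
import boto3

glue = boto3.client("glue")
glue.create_connection(
    ConnectionInput={
        "Name": "bigquery",
        "ConnectionType": "MARKETPLACE",
        "ConnectionProperties": {
            "CONNECTOR_TYPE": "Spark",
            "CONNECTOR_URL": "<marketplace-connector-url>",
            "CONNECTOR_CLASS_NAME": "<connector-class-name>",
            "SECRET_ID": "bigquery_credentials",
        },
    }
)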

3. Data ingestion using AWS Glue and Google BigQuery connector for AWS Glue

a. Ensure that you have cloned the GitHub repository mentioned in the Prerequisites section of this blog. The project code uses the Python version of the AWS CDK. To execute the project code, please ensure that you have fulfilled the AWS CDK Prerequisites for Python. The project code also requires that the AWS account is bootstrapped to allow the deployment of the AWS CDK stack.
b. From the directory where you cloned the project in the Prerequisites section, navigate to the cdk.context.json file and update it with your values:

{
  "job_name": "bigquery-analytics",
  "data_bucket_name": "your-s3-bucket-name",
  "dataset_id": "Big Query Dataset ID",
  "parent_project": "GCP Project ID",
  "connection_name": "bigquery",
  "filter": "",
  "job_script": "job-analytics.py",
  "schedule_daily_hour": "3",
  "glue_database": "gcp_analytics",
  "glue_table": "ga4_events",
  "timedelta_days": "1",
  "classification": "confidential"
}

Context Parameter Summary

  • job_name – The name of the AWS Glue job. AWS CDK creates this job.
  • data_bucket_name – An existing S3 bucket name for the data and the AWS Glue job script. Create the S3 bucket in the AWS console and specify its name here.
  • dataset_id – The Google Cloud BigQuery dataset ID containing the Google Analytics 4 export, as it exists in GCP.
  • parent_project – The Google Cloud project ID of the BigQuery database containing the Google Analytics 4 export.
  • connection_name – The AWS Glue connection name created in Section 2, Step i (example name: bigquery).
  • filter – Not currently used; it can be used for BigQuery query filtering. The default is "".
  • job_script – The AWS Glue job script. The default, job-analytics.py, pulls and flattens the nested data from BigQuery; optionally, you can supply your own script for custom transformations (see the sketch after this list).
  • schedule_daily_hour – The daily hour at which the job runs to get the previous day's analytics data; the default is 3 AM. Amazon EventBridge, which is also deployed as part of the AWS CDK stack, triggers the Step Functions state machine on this schedule.
  • glue_database – The AWS Glue database name to be created. AWS CDK creates this database.
  • glue_table – The AWS Glue table name to be created. AWS CDK creates this table.
  • timedelta_days – The number of days in the past for which to pull events: 0 = today, 1 = yesterday. The default is 1.
  • classification – The data classification tag applied to the S3 bucket. The default is confidential.
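
For context on what the default job script does, here is a hypothetical sketch of the kind of flattening job-analytics.py performs on the nested GA4 export. It reads a staged copy from S3 rather than through the connector, and the paths and column names are assumptions, not the repository's actual code.

# Hypothetical sketch of flattening nested GA4 event data; not the actual
# job-analytics.py. GA4 rows nest parameters in an event_params array, which
# is exploded into one row per parameter. Paths and columns are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("ga4-flatten").getOrCreate()

# DataFrame as staged in S3 after being read from BigQuery
events = spark.read.parquet("s3://your-s3-bucket-name/raw/ga4_events/")

flattened = (
    events.withColumn("param", explode(col("event_params")))
    .select(
        col("event_date"),
        col("event_name"),
        col("param.key").alias("param_key"),
        col("param.value.string_value").alias("param_string_value"),
    )
)

flattened.write.mode("append").parquet("s3://your-s3-bucket-name/ga4_events/")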

c. In the same directory, run cdk_deploy.sh to deploy the project. The CDK deploys all the resources indicated in the architectural diagram along with the corresponding Amazon CloudWatch log groups. The project may take up to 30 minutes to deploy.
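
To illustrate how the context values are consumed, here is a rough AWS CDK (Python) sketch of defining the Glue job; it is not the repository's actual stack, and the construct IDs, role, and argument wiring are assumptions.

# Rough sketch, not the repository's stack: wiring cdk.context.json values
# into an AWS Glue job definition. Construct IDs and the role are assumptions.
import aws_cdk as cdk
from aws_cdk import aws_glue as glue
from aws_cdk import aws_iam as iam
from constructs import Construct


class AnalyticsStack(cdk.Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        job_name = self.node.try_get_context("job_name")
        bucket = self.node.try_get_context("data_bucket_name")
        connection = self.node.try_get_context("connection_name")
        script = self.node.try_get_context("job_script")

        role = iam.Role(
            self,
            "GlueJobRole",
            assumed_by=iam.ServicePrincipal("glue.amazonaws.com"),
        )

        glue.CfnJob(
            self,
            "BigQueryJob",
            name=job_name,
            role=role.role_arn,
            glue_version="3.0",  # matches the connector's fulfillment option
            command=glue.CfnJob.JobCommandProperty(
                name="glueetl",
                python_version="3",
                script_location=f"s3://{bucket}/{script}",
            ),
            connections=glue.CfnJob.ConnectionsListProperty(connections=[connection]),
        )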

4. Getting data into AWS Glue tables

a. The first scheduled run is typically up to a day away. To trigger the data flow in AWS Glue immediately, navigate to Step Functions in the AWS console and click gcp-connector-glue.
b. Click Start execution at the top right. The state machine should take about 10 minutes to complete, after which the data will be in your AWS Glue table.
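
You can also start the execution programmatically. A minimal boto3 sketch, assuming the deployed state machine's name contains gcp-connector-glue:

# Minimal sketch: start the state machine without the console. Assumes the
# deployed state machine's name contains "gcp-connector-glue".
import boto3

sfn = boto3.client("stepfunctions")
machines = sfn.list_state_machines()["stateMachines"]
arn = next(
    m["stateMachineArn"] for m in machines if "gcp-connector-glue" in m["name"]
)
execution = sfn.start_execution(stateMachineArn=arn)
print(execution["executionArn"])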

5. Analyzing data from AWS Clean Rooms

a. To set up an AWS Clean Rooms collaboration, go to the AWS console and search for AWS Clean Rooms.
b. Click on Create Collaboration and enter a name for the collaboration.
c. Enter the details and add the other members to the collaboration, along with their account numbers.
d. Designate which member can query and receive the results.

Figure 11: Defining a collaboration in AWS Clean Rooms

e. Click Next until you are able to successfully create the collaboration and memberships.
f. Once the collaboration is configured, open the collaboration and click Configured tables from the left-side navigation pane.
g. Click Configure new table.
h. Select the AWS Glue database that you used for the AWS Glue Connector.
i. Select the table created by the AWS Glue job (the glue_table value from cdk.context.json, for example ga4_events). Optionally, view the schema to make sure it is correct.
j. Under Columns allowed in collaboration, select All columns.
k. Give the table a name and click Configure new table. You can have as many configured tables as you like.
l. Once the table is configured, you need to create an Analysis rule for members to be able to query the table.

Figure 12: Configuring a table in AWS Clean Rooms

Analysis Rules

An analysis rule is a privacy-enhancing control that each data owner sets up on a configured table. An analysis rule determines the type and specifications of queries that can be run on their configured table. To protect your data, you determine the use cases that you want to enable and the queries that you want to restrict in a collaboration.

If no analysis rule is configured, the configured table can be associated with collaborations, but it can’t be queried.

When the collaboration member runs a query, AWS Clean Rooms validates the following:

  • The type of query matches the analysis rule type of the configured tables in the query
  • The query structure matches the expected query structure
  • The columns and tables returned match those defined in the analysis rules

For an aggregation query, AWS Clean Rooms also enforces the query results controls from each configured table in the query. When multiple configured tables in a query set the same query results control, the most restrictive setting is enforced.
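
For reference, the same kind of aggregation rule can be set programmatically. The sketch below is illustrative only; the column names and the minimum-count output constraint are assumptions for this example, not values the walkthrough requires.

# Illustrative only: setting an aggregation analysis rule with boto3 instead
# of the console's guided flow. Column names and the COUNT_DISTINCT minimum
# are example assumptions.
import boto3

cleanrooms = boto3.client("cleanrooms")
cleanrooms.create_configured_table_analysis_rule(
    configuredTableIdentifier="<configured-table-id>",
    analysisRuleType="AGGREGATION",
    analysisRulePolicy={
        "v1": {
            "aggregation": {
                "aggregateColumns": [
                    {"columnNames": ["hashed_ip"], "function": "COUNT_DISTINCT"}
                ],
                "joinColumns": ["hashed_ip"],
                "dimensionColumns": ["event_date", "geo_country"],
                "scalarFunctions": [],
                "outputConstraints": [
                    {"columnName": "hashed_ip", "minimum": 2, "type": "COUNT_DISTINCT"}
                ],
            }
        }
    },
)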

5. Analyzing data from AWS Clean Rooms – Steps Continued

m. Open the configured table created in Section 5, Step g and click Configure analysis rule.

Figure 13: Associating the configured table with a collaboration

n. Click Aggregation rule and then click Guided flow.

Figure 14: Select the analysis rule type

o. Add the column you want to enable for joining and, optionally, select the columns you want to enable for listing. Click Next.

Figure 15: Specify your query controls

Figure 16: Specify query controls

p. Click Configure analysis rule.

Figure 17: Creation of aggregate analysis rule

q. Now that the analysis rule is configured, we will associate this table with the collaboration. Click on Associate to collaboration and select the appropriate collaboration.

Figure 18: Associate the table with a collaboration

r. Select the table, provide a name to be used in the collaboration, and then create a service access role for AWS Clean Rooms to access the data. The console will create a new service role in your account; if you want to create the role yourself, refer to the AWS Clean Rooms documentation for instructions. The table is now ready to be queried by member accounts.

Figure 19: Creation of role to associate the table with the collaboration
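
The same association can be made with boto3, as sketched below; the membership and configured-table identifiers and the role ARN are placeholders.

# Sketch of the equivalent association via boto3. The membership and
# configured-table identifiers and the role ARN are placeholders.
import boto3

cleanrooms = boto3.client("cleanrooms")
cleanrooms.create_configured_table_association(
    name="ga4_events",  # the name the table will have inside the collaboration
    membershipIdentifier="<your-membership-id>",
    configuredTableIdentifier="<configured-table-id>",
    roleArn="arn:aws:iam::111122223333:role/CleanRoomsServiceAccess",
)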

To get the count of users across campaigns based on the hashed IP addresses from the clickstream data, we will now configure the AWS Clean Rooms table in the second member account for querying.

6. Adding the second member’s table

a. Select the Aggregation rule type and choose Guided flow.

Figure 20: Creation of aggregation constraints in member account

b. Next, specify the columns that can be used for joins, as well as the columns that can be listed in results. In this case, we will use the same join column as the first account.

Figure 21: Selecting appropriate columns for join and list controls

Figure 22: Specify query results control

c. Verify the settings and click Configure analysis rule.

Figure 23: Configure analysis rule

Once the Aggregation analysis rule is configured for both tables, you can run a query that lists the users in a geography, along with their corresponding events, by joining on the hashed IP addresses for the same event date across the different campaign data.

d. Go to the AWS Clean Rooms homepage within the AWS console and open the collaboration you just created.
e. Associate the AWS Glue table you just created with this AWS Clean Rooms collaboration.
f. Next, set up the Query results settings to output to an S3 bucket.
g. Go to the Query results settings and select the output format: CSV or Parquet. Click Browse S3 and select an S3 bucket in the same region.

Figure 24: Configuration of query results settings

h. Once the output settings are set, you’re ready to run your first query. Select one of the sample queries for the table (depending on the constraints) and paste it into the query editor. Click Run. Once the execution is complete, the member account that was chosen to receive the results will see them at the bottom of the page.

Figure 25: Running a JOIN query across configured tables
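
Queries can also be submitted programmatically. The sketch below runs an example aggregation that counts overlapping users per event date; the table and column names in the SQL are assumptions, so adapt them to the names you configured.

# Sketch: submit a protected query with boto3. The SQL's table and column
# names (ga4_events, platform_users, hashed_ip) are example assumptions.
import boto3

cleanrooms = boto3.client("cleanrooms")
response = cleanrooms.start_protected_query(
    type="SQL",
    membershipIdentifier="<your-membership-id>",
    sqlParameters={
        "queryString": """
            SELECT a.event_date, COUNT(DISTINCT a.hashed_ip) AS overlap_users
            FROM ga4_events a
            INNER JOIN platform_users b ON a.hashed_ip = b.hashed_ip
            GROUP BY a.event_date
        """
    },
    resultConfiguration={
        "outputConfiguration": {
            "s3": {
                "resultFormat": "CSV",
                "bucket": "<results-bucket>",
                "keyPrefix": "clean-rooms-results/",
            }
        }
    },
)
print(response["protectedQuery"]["id"])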

This concludes the steps for creating an AWS Clean Rooms collaboration with data ingested from Google Analytics. Now that one of the accounts has ingested the data, it is available for querying by joining it with the data available from collaboration members.

There are many things you can do with the collaboration, such as obtaining a list of common users to target in a marketing campaign, finding overlaps with a collaborator’s data, and much more.

Cost Assessment

For pricing information for the primary services used in the solution, visit the pricing pages for AWS Clean Rooms, AWS Glue, Amazon S3, AWS Step Functions, Amazon EventBridge, AWS Secrets Manager, and Amazon CloudWatch.

Cleanup

When you are finished experimenting with this solution, clean up your resources by running the following command:

cdk destroy

This command deletes the resources deployed by the solution. The AWS Secrets Manager secret containing the manually added GCP credentials and the Amazon CloudWatch log groups are retained after the stack is deleted.
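
If you also want to remove the retained secret, a minimal boto3 sketch (using the secret name assumed earlier in this walkthrough):

# Optional: remove the retained Secrets Manager secret. The secret name is
# the one assumed in this walkthrough.
import boto3

boto3.client("secretsmanager").delete_secret(
    SecretId="bigquery_credentials",
    ForceDeleteWithoutRecovery=True,  # skip the default recovery window
)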

Key Considerations and Best Practices

  • The results for queries are saved to the S3 bucket designated by the query runner.
  • Review with your organization's legal and compliance teams that your use case and the controls you set with AWS Clean Rooms comply with your organization's data handling policies.
  • Create separate configured tables for each query use case (for example, audience planning or attribution). You can create multiple configured tables with the same underlying AWS Glue table.
  • Follow the principle of least privilege. Specify columns in the analysis rule; for example, apply only the Dimension columns, List columns, and Join columns necessary for collaboration queries. Least privilege helps mitigate the risk of differencing attacks and of other members reverse engineering your data. Use the Allow list columns feature to note other columns you want to make queryable in the future.
  • Set up AWS Clean Rooms in the same region where your and the other members' S3 buckets and AWS Glue Data Catalogs reside.

Conclusion

In this post, we showed you how to ingest Google Analytics 4 data into AWS using AWS Glue while safely collaborating on the data across AWS accounts using AWS Clean Rooms.

Contact an AWS representative to learn how we can help accelerate your business, and get started with AWS Clean Rooms.

Further Reading

Avichal Chum

Avichal Chum is a Senior Solutions Architect at Amazon Web Services (AWS), supporting enterprise customers with their move to the cloud. Avichal helps customers innovate and design scalable, reliable, well-architected solutions. He is passionate about using different technologies to help customers rapidly scale up and innovate in an easy and cost-effective manner.

Anurag Singh

Anurag Singh is a Solutions Architect at Amazon Web Services (AWS), working as a technical advisor to deliver customer business outcomes using AWS services and technology as customers migrate out of data centers and modernize their applications and technology strategies. Anurag is also passionate about mentoring future generations of engineers and STEM initiatives.

Brian Maguire

Brian Maguire is a Principal Solutions Architect at Amazon Web Services, where he is focused on helping customers build their ideas in the cloud. He is a technologist, writer, teacher, and student who loves learning. Brian is the co-author of the book Scalable Data Streaming with Amazon Kinesis.

Ryan Malecky

Ryan Malecky is a Senior Solutions Architect at Amazon Web Services. He is focused on helping customers gain insights from their data, especially with AWS Clean Rooms.