AWS Big Data Blog

Analyzing Google Analytics data with Amazon AppFlow and Amazon Athena

Software as a service (SaaS) applications are rapidly growing in importance, and the data they generate is essential to include when performing analytics that influence business decisions. Amazon AppFlow is a fully managed integration service that helps you transfer SaaS data to your data lake securely. You can run a data transfer flow on demand, on a schedule, or after an event. You can quickly analyze this data using Amazon Athena and join it with the numerous datasets already stored in Amazon Simple Storage Service (Amazon S3). You can also join multiple SaaS datasets and combine them with operational data sitting in traditional databases such as Amazon Relational Database Service (Amazon RDS) via the Athena federated query feature.

This post walks you through extracting Google Analytics data using Amazon AppFlow and storing it in Amazon S3, so you can query it with Athena.

Architecture overview

The following diagram shows the flow described in this post. You first create a new flow in Amazon AppFlow to transfer Google Analytics data to Amazon S3. The transferred data is multi-line JSON, which Athena doesn’t support, so an AWS Lambda function transforms the JSON file into Apache Parquet format. This transformation lets you run queries efficiently and cost-effectively. The function can also apply other transformations, such as changing the Amazon S3 prefix or storing the data using Hive-style partitions. Amazon AppFlow supports scheduled jobs that extract only new data, so you can build an automated workflow using an Amazon S3 event trigger and a transformation Lambda function. Amazon AppFlow is currently available in 15 Regions; pick the Region where your S3 bucket is located. This walkthrough uses US East (N. Virginia).

Prerequisites

In this post, you use a sample Google account, an OAuth client with appropriate permissions, and Google Analytics data. You can also use your own Google resources. To enable Google Analytics access from Amazon AppFlow, set up a new OAuth client in advance. Complete the following steps:

  1. On the Google API Console (https://console.developers.google.com), choose Library.
  2. Enter analytics in the search field.
  3. Choose Google Analytics API.
  4. Choose ENABLE and return to the previous page.
  5. Choose Google Analytics Reporting API listed in the search results.
  6. Choose ENABLE and return to the main page.
  7. Choose OAuth consent screen.
  8. Create a new Internal app (if you’re using your personal account, choose External).
  9. Add amazon.com as an authorized domain.
  10. Choose Add scope.
  11. Add ../auth/analytics.readonly as Scopes for Google APIs.
  12. Choose Save.
  13. Choose Credentials.
  14. Add OAuth client ID credentials.
  15. Choose Web application.
  16. Enter https://console.aws.amazon.com/ as an authorized JavaScript origins URL.
  17. Enter https://AWSREGION.console.aws.amazon.com/appflow/oauth as an authorized redirect URL. (Replace AWSREGION with the Region you’re working in. If you’re using Amazon AppFlow in us-east-1, enter https://console.aws.amazon.com/appflow/oauth.)
  18. Choose Save.
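
Before moving on, you can optionally confirm that the new OAuth client and scope work by calling the Google Analytics Reporting API directly. The following is a minimal sketch and not part of the Amazon AppFlow setup itself; it assumes you already have an OAuth 2.0 access token issued for the analytics.readonly scope (for example, from Google’s OAuth 2.0 Playground) and a Google Analytics view ID. Both values below are placeholders.

import requests

# Placeholders: substitute a real access token and view ID.
ACCESS_TOKEN = 'ya29.your-oauth2-access-token'
VIEW_ID = '123456789'

# Request hourly user counts by device, the same dimensions and metric
# the Amazon AppFlow flow maps later in this post.
body = {
    'reportRequests': [{
        'viewId': VIEW_ID,
        'dateRanges': [{'startDate': '7daysAgo', 'endDate': 'today'}],
        'dimensions': [{'name': 'ga:dateHour'}, {'name': 'ga:deviceCategory'}],
        'metrics': [{'expression': 'ga:users'}],
    }]
}

resp = requests.post(
    'https://analyticsreporting.googleapis.com/v4/reports:batchGet',
    headers={'Authorization': 'Bearer {}'.format(ACCESS_TOKEN)},
    json=body,
)
resp.raise_for_status()

# Print the first few rows of the report to confirm access works.
print(resp.json()['reports'][0]['data']['rows'][:3])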

Setting up Lambda and Amazon S3

You start by creating a new S3 bucket as your Amazon AppFlow transfer destination. Then you develop a new Lambda function to transform JSON data into Parquet format using the pandas and pyarrow modules. Finally, you set up an Amazon S3 event trigger to automatically call the Lambda function when a new Amazon S3 object is created.

Creating a new S3 bucket

To create an Amazon S3 bucket, complete the following steps:

  1. On the Amazon S3 console, choose Create bucket.
  2. Enter a name for your bucket; for example, appflow-ga-sample.
  3. Choose Create bucket.
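
If you prefer to script this step, the following boto3 sketch creates the same bucket. Bucket names are globally unique, so substitute your own; the Region matches the US East (N. Virginia) Region used in this walkthrough.

import boto3

s3 = boto3.client('s3', region_name='us-east-1')

# In us-east-1 no CreateBucketConfiguration is needed; other Regions
# require CreateBucketConfiguration={'LocationConstraint': '<region>'}.
s3.create_bucket(Bucket='appflow-ga-sample')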

Preparing a .zip file for your Lambda layer

To create a .zip file that includes the pandas and pyarrow modules, complete the following steps:

  1. Set up any environment that can run Docker.
  2. Run the following command:
mkdir python
docker run -it --rm -v $(pwd)/python:/python python:3.6.8 \
    pip install -t /python pandas==0.23.4 pyarrow==0.11.1
zip -r pandas-pyarrow.zip python
  3. On the Amazon S3 console, choose appflow-ga-sample.
  4. Choose Create folder.
  5. Enter a name for your folder; for example, lambda-layer.
  6. Choose Save.
  7. Choose lambda-layer.
  8. Choose Upload.
  9. Choose pandas-pyarrow.zip and choose Upload.
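
If you built the archive on a machine with boto3 configured, you can upload it without the console. A short sketch, assuming the example bucket and folder names above:

import boto3

s3 = boto3.client('s3')

# Upload the layer archive built in the previous step.
s3.upload_file('pandas-pyarrow.zip', 'appflow-ga-sample',
               'lambda-layer/pandas-pyarrow.zip')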

Creating a Lambda layer for Parquet export

To create a Lambda layer, complete the following steps:

  1. On the Lambda console, choose Layers.
  2. Choose Create layer.
  3. For name, enter a name for your layer; for example, pandas-parquet.
  4. Select Upload a file from Amazon S3.
  5. For Amazon S3 link URL, enter the Amazon S3 path of your .zip file; for example, s3://appflow-ga-sample/lambda-layer/pandas-pyarrow.zip.
  6. For Compatible runtimes, choose Python 3.6.
  7. Choose Create.
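
You can also publish the layer programmatically. The following boto3 sketch assumes the .zip file sits at the example bucket and key used earlier; adjust both to match your upload.

import boto3

lambda_client = boto3.client('lambda')

# Publish a new layer version from the archive stored in Amazon S3.
response = lambda_client.publish_layer_version(
    LayerName='pandas-parquet',
    Content={
        'S3Bucket': 'appflow-ga-sample',
        'S3Key': 'lambda-layer/pandas-pyarrow.zip',
    },
    CompatibleRuntimes=['python3.6'],
)
print(response['LayerVersionArn'])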

Creating a Lambda function for data transformation

To create a Lambda function and trigger an Amazon S3 event, complete the following steps:

  1. On the Lambda console, choose Create function.
  2. Select Author from scratch.
  3. For Function name, enter a name for your function; for example, ga-converter.
  4. For Runtime, choose Python 3.6.
  5. Select Create a new role with basic Lambda permissions.
  6. Choose Create function.

  7. In the Lambda function configuration, enter the following code in the lambda_function area of the Function code section.

This Lambda function downloads the Amazon AppFlow output file, extracts the necessary data from the Google Analytics JSON, and transforms it into Parquet format. Finally, it uploads the result to Amazon S3 under a different key name. You can modify the script, especially the dimension and value names, or add other transformations according to your needs.

import json
import pandas as pd
import boto3
from datetime import datetime as dt

def lambda_handler(event, context):
    # Identify the S3 object that triggered this invocation.
    bucket_name = event['Records'][0]['s3']['bucket']['name']
    object_key = event['Records'][0]['s3']['object']['key']

    # Download the Amazon AppFlow output file and parse the JSON.
    s3_client = boto3.client('s3')
    raw_object = s3_client.get_object(Bucket=bucket_name, Key=object_key)
    raw_data = json.loads(raw_object['Body'].read().decode('utf-8'))

    # Extract the date/hour and device dimensions and the user-count metric.
    rows = raw_data['reports'][0]['data']['rows']
    record_dates = [dt.strptime(r['dimensions'][0], '%Y%m%d%H') for r in rows]
    devices = [r['dimensions'][1] for r in rows]
    user_counts = [int(r['metrics'][0]['values'][0]) for r in rows]
    df = pd.DataFrame({
        'year': [r.year for r in record_dates],
        'month': [r.month for r in record_dates],
        'day': [r.day for r in record_dates],
        'hour': [r.hour for r in record_dates],
        'device': devices,
        'user_count': user_counts
    })

    # Write the DataFrame to a temporary Parquet file and upload it under
    # the ga-data/ prefix so it doesn't retrigger this function.
    output_file = dt.now().strftime('%Y%m%d%H%M%S')
    output_path = '/tmp/{}.parquet'.format(output_file)
    df.to_parquet(output_path)

    s3_resource = boto3.resource('s3')
    bucket = s3_resource.Bucket(bucket_name)
    bucket.upload_file(output_path, 'ga-data/{}.parquet'.format(output_file))
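
The exact JSON that Amazon AppFlow writes depends on your flow configuration, but for the three fields mapped later in this post it follows the Google Analytics Reporting API response shape that the function indexes into. The snippet below is an illustrative sample only (the values are invented); it lets you sanity-check the row parsing locally before wiring up the S3 trigger.

# Illustrative sample of the structure the function expects; real
# AppFlow output contains your own dimensions, metrics, and values.
sample = {
    'reports': [{
        'data': {
            'rows': [
                {'dimensions': ['2020061710', 'desktop'],
                 'metrics': [{'values': ['12']}]},
                {'dimensions': ['2020061710', 'mobile'],
                 'metrics': [{'values': ['7']}]},
            ]
        }
    }]
}

rows = sample['reports'][0]['data']['rows']
print([(r['dimensions'][0], r['dimensions'][1],
        int(r['metrics'][0]['values'][0])) for r in rows])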

Adding layers

To add layers to your Lambda function, complete the following steps:

  1. Choose Layers in the Designer section.
  2. Choose Add a layer.
  3. Select Select from list of runtime compatible layers.
  4. For Name, choose pandas-parquet (the layer you created earlier).
  5. For Version, choose 1.
  6. Choose Add.

Increasing your timeout setting

To increase the Lambda timeout setting, complete the following steps:

  1. On the Configuration tab, under Basic settings, choose Edit.
  2. Set the Timeout to 30 seconds.
  3. Choose Save.
  4. On the Configuration tab, choose Save.
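
If you manage the function with scripts, the previous two steps (attaching the layer and raising the timeout) map to a single boto3 call. A sketch, assuming the example function and layer names used above, layer version 1, and the US East (N. Virginia) Region:

import boto3

lambda_client = boto3.client('lambda')
account_id = boto3.client('sts').get_caller_identity()['Account']

# Attach the pandas-parquet layer and raise the timeout to 30 seconds.
lambda_client.update_function_configuration(
    FunctionName='ga-converter',
    Layers=['arn:aws:lambda:us-east-1:{}:layer:pandas-parquet:1'.format(account_id)],
    Timeout=30,
)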

Adding an event trigger

To add an Amazon S3 event trigger, complete the following steps:

  1. In the Designer section, choose Add trigger.
  2. Choose S3, and choose the bucket you created.
  3. For Event type, choose All object create events.
  4. For Prefix, enter raw.
  5. Choose Add.
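
The console configures both the bucket notification and the permission that lets Amazon S3 invoke the function. For reference, a roughly equivalent boto3 sketch using the example names in this post (note that put_bucket_notification_configuration replaces any existing notification configuration on the bucket):

import boto3

account_id = boto3.client('sts').get_caller_identity()['Account']
function_arn = 'arn:aws:lambda:us-east-1:{}:function:ga-converter'.format(account_id)

# Allow Amazon S3 to invoke the function for events from this bucket.
boto3.client('lambda').add_permission(
    FunctionName='ga-converter',
    StatementId='s3-invoke-ga-converter',
    Action='lambda:InvokeFunction',
    Principal='s3.amazonaws.com',
    SourceArn='arn:aws:s3:::appflow-ga-sample',
)

# Invoke the function for every object created under the raw prefix.
boto3.client('s3').put_bucket_notification_configuration(
    Bucket='appflow-ga-sample',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': function_arn,
            'Events': ['s3:ObjectCreated:*'],
            'Filter': {'Key': {'FilterRules': [
                {'Name': 'prefix', 'Value': 'raw'}]}},
        }]
    },
)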

Adding permissions

To add appropriate permissions for this Lambda function to read and write Amazon S3 objects, complete the following steps:

  1. On the Permissions tab, note the execution role name; for example, ga-converter-role-zm6u0f4g.
  2. On the AWS Identity and Access Management (IAM) console, choose Policies.
  3. Choose Create Policy.
  4. On the JSON tab, enter the following policy (replace the example resource arn:aws:s3:::appflow-ga-sample* with your own bucket ARN).
{	
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
             ],
            "Resource": [
                "arn:aws:s3:::appflow-ga-sample*"
            ]
        }
    ]
}
  5. Choose Review policy.
  6. Enter a name for your new policy; for example, lambda-s3-ga-converter-policy.
  7. Choose Create policy.
  8. On the IAM console, choose Roles.
  9. Enter your role name (ga-converter-role-zm6u0f4g) in the search field.
  10. Choose your role.
  11. Choose Attach policies.
  12. Choose lambda-s3-ga-converter-policy.
  13. Choose Attach policy.
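
The same policy can be created and attached with boto3. A sketch, assuming the example role and policy names above; replace the resource ARN with your own bucket:

import json
import boto3

iam = boto3.client('iam')

policy_document = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject"],
        "Resource": ["arn:aws:s3:::appflow-ga-sample*"]
    }]
}

# Create the policy and attach it to the function's execution role.
policy = iam.create_policy(
    PolicyName='lambda-s3-ga-converter-policy',
    PolicyDocument=json.dumps(policy_document),
)
iam.attach_role_policy(
    RoleName='ga-converter-role-zm6u0f4g',
    PolicyArn=policy['Policy']['Arn'],
)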

Setting up Amazon AppFlow

Now you can create a new Amazon AppFlow flow that transfers data from Google Analytics to Amazon S3. Complete the following steps:

  1. On the Amazon AppFlow console, choose Create flow.
  2. Enter a name for your flow; for example, my-ga-flow.
  3. Choose Next.
  4. For Source name, choose Google Analytics.
  5. Choose Create new connection.
  6. Enter your OAuth client ID and client secret, then name your connection; for example, ga-connection.
  7. In the pop-up window, choose to allow amazon.com access to the Google Analytics API.
  8. For Choose Google Analytics object, choose Reports.
  9. For Choose Google Analytics view, choose All Web Site Data.
  10. For Destination name, choose Amazon S3.
  11. For Bucket details, choose the bucket you created.
  12. Enter raw as a prefix.
  13. Select Run on demand.

  14. Choose Next.
  15. Select Manually map fields.
  16. Select the following three fields for Source field name:
    • Time: DIMENSION: ga:dateHour
    • Platform or Device: DIMENSION: ga:deviceCategory
    • User: METRIC: ga:users
  17. Choose Map fields directly.

  18. Choose Next.
  19. In the Add filters section, choose Next.
  20. Choose Create flow.

Running the flow

After creating your new flow, you can run it on demand:

  1. On the Amazon AppFlow console, choose my-ga-flow.
  2. Choose Run flow.

For this walkthrough, you use on-demand execution for ease of understanding. In practice, you can schedule the flow to periodically extract only newly added data, and the Amazon S3 event trigger then transforms each new file automatically.
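
You can also start the flow and check its run status from code. A minimal boto3 sketch, assuming the example flow name used in this post:

import boto3

appflow = boto3.client('appflow')

# Start an on-demand run of the flow.
appflow.start_flow(flowName='my-ga-flow')

# List recent runs and their statuses.
records = appflow.describe_flow_execution_records(flowName='my-ga-flow')
for execution in records['flowExecutions']:
    print(execution['executionId'], execution['executionStatus'])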

Querying via Athena

You need to create an external table before querying. Complete the following steps:

  1. On the Athena console, enter create database appflow_data into the query editor.
  2. Choose Run query.
  3. Enter the following command in the query editor (replace the target bucket name appflow-ga-sample with your own bucket):
CREATE EXTERNAL TABLE appflow_data.ga_sample (
  `year` int,
  `month` int,
  `day` int,
  `hour` int,
  `device` string,
  `user_count` int
 )
STORED AS PARQUET
LOCATION 's3://appflow-ga-sample/ga-data'
tblproperties ("parquet.compression"="SNAPPY")
;
  4. Choose Run query.

Now you can query the Google Analytics data. Enter the following query and run it. The query shows which device categories are most commonly used to access your website, aggregated by day:

SELECT
  year
  , month
  , day
  , device
  , count(user_count) as cnt
FROM
  appflow_data.ga_sample
GROUP BY
  year
  , month
  , day
  , device
ORDER BY
  cnt DESC
LIMIT 10
; 

The following screenshot shows the query results.
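
If you want to embed the same query in a pipeline, you can also submit it through the Athena API. A hedged boto3 sketch; the query result location is an example path and can be any S3 location you control:

import time
import boto3

athena = boto3.client('athena')

query = '''
SELECT year, month, day, device, count(user_count) AS cnt
FROM appflow_data.ga_sample
GROUP BY year, month, day, device
ORDER BY cnt DESC
LIMIT 10
'''

# Start the query and poll until it finishes.
execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={'Database': 'appflow_data'},
    ResultConfiguration={'OutputLocation': 's3://appflow-ga-sample/athena-results/'},
)
query_id = execution['QueryExecutionId']

while True:
    state = athena.get_query_execution(
        QueryExecutionId=query_id)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

# Print each result row (the first row contains the column headers).
if state == 'SUCCEEDED':
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results['ResultSet']['Rows']:
        print([col.get('VarCharValue') for col in row['Data']])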

Summary

This post demonstrated how you can transfer Google Analytics data to Amazon S3 using Amazon AppFlow and analyze it with Amazon Athena. You no longer need to build your own application to extract data from Google Analytics and other SaaS applications. Amazon AppFlow enables you to develop a fully automated data transfer and transformation workflow and an integrated query environment in one place.

 


About the Author

Makoto Shimura is a specialist solutions architect, analytics at Amazon Web Services. He helps customers develop efficient data pipelines on the AWS platform. Previously, he worked as a data engineer, developing a distributed data platform. Outside of work, he loves to spend time with his family, play with his dog, and also play video games.