Mistplay: Improving Business Analytics with Amazon S3 & Amazon Athena
Guest post by Steven Wang, AI Engineer, Mistplay
Mistplay is the world’s leading loyalty program for mobile gamers. Millions of players use Mistplay to discover new games, earn loyalty rewards, and connect with other players. The platform empowers mobile game studios like Playtika, Scopely and Peak to acquire and deeply engage users around the world.
Why Mistplay moved to AWS
At Mistplay, we rely heavily on our data to make informed decisions and take calculated risks. The Mistplay Android application is therefore an indispensable source of data for our team: it generates a large volume of event data capturing user interactions. This data plays a key role in answering major business questions and helping us create the best user experience possible.
Historically, our Android application was designed to send event data to Firebase. From there, we used the out-of-the-box integration between Firebase and BigQuery to expose our event data in BigQuery for further analytics.
However, as our business grew, we encountered challenges that could not be addressed with our existing solution. For example, we noticed increasing quality and accuracy issues affecting the data being processed. Furthermore, infrastructure costs were growing as quickly as our business. The pricing model was not easy to understand, and our usage patterns would often result in unexpectedly expensive bills.
It was clear that our data needed a new home. Moving our event analytics to AWS was a natural choice, since we were already using Amazon S3 and Amazon Athena as our primary data lake. Additionally, we wanted to unify our tooling under one set of services, allowing us to streamline analytics tasks and leverage the security work already in place on AWS. Last but not least, AWS's clear pricing model made budgeting simple.
In this post, we’ll explain how we migrated from Firebase and BigQuery to Amazon S3 and Amazon Athena, and how this improved our analytics capability, cost structure, and operations.
How we migrated

Our migration consisted of two phases. The first involved migrating existing data from BigQuery into Amazon S3 using open source tools and building AWS Glue tables to make it accessible. The second involved directing events from our Android application to Amazon Kinesis Data Firehose using the AWS SDK for Java.
To migrate our existing data from BigQuery to Amazon S3, we employed the following strategy:
- We used the `bq` command-line tool to export our BigQuery tables to Google Cloud Storage buckets. The `bq extract` command runs on Google-managed infrastructure, so there was no need to spin up our own compute resources.
- Once the data was exported, it was cleaned and flattened to make it easier to query, then written out as compressed Parquet files to make consumption within Amazon Athena more cost-effective.
- Next, we used the `gsutil` command-line tool to copy the transformed files into an Amazon S3 bucket. This was done from a number of Amazon EC2 instances, each handling a specific subset of the data.
- Once the data was transferred to Amazon S3, we used AWS Glue crawlers to index our data in the Glue data catalog and make it available for querying with Amazon Athena.
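The transformation and cataloging steps above can be sketched in Python. The event shape, bucket, database, crawler, and role names below are all hypothetical placeholders; `pyarrow` and `boto3` are assumed available and are imported only inside the functions that need them.

```python
def flatten(event: dict, prefix: str = "") -> dict:
    """Recursively flatten nested event fields into dot-separated columns."""
    flat = {}
    for key, value in event.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, f"{name}."))
        else:
            flat[name] = value
    return flat


def write_parquet(rows: list, path: str) -> None:
    """Write flattened rows as a Snappy-compressed Parquet file (needs pyarrow)."""
    import pyarrow as pa
    import pyarrow.parquet as pq
    pq.write_table(pa.Table.from_pylist(rows), path, compression="snappy")


def create_events_crawler() -> None:
    """Register a Glue crawler over the transferred S3 data (needs boto3 + credentials)."""
    import boto3
    boto3.client("glue").create_crawler(
        Name="events-crawler",                                  # hypothetical name
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder ARN
        DatabaseName="events_db",                               # hypothetical database
        Targets={"S3Targets": [{"Path": "s3://example-events-bucket/events/"}]},
    )


event = {"event_name": "game_opened", "user": {"id": "u-123", "country": "CA"}}
print(flatten(event))
# {'event_name': 'game_opened', 'user.id': 'u-123', 'user.country': 'CA'}
```

Flattening nested records into top-level columns keeps the Glue schema simple, and Snappy-compressed Parquet is what makes the later Athena scans cheap.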
Once our existing data was in Amazon S3, we needed a way to send new data from our Android application to AWS. We used Amazon Kinesis Data Firehose delivery streams to do so, which let us ingest, transform, and store event data in a serverless manner.
- We integrated the AWS SDK for Java into our Android application and started pointing our events to a Firehose delivery stream.
- To ingest the data, we created a new Firehose delivery stream with Direct PUT as its source and Amazon S3 as its destination. Since our event data does not always match the schema defined in our Glue table, we utilized source record transformation to convert incoming events to the correct format. We also enabled record format conversion to reduce the size of the delivered objects.
- To save on query costs, we also partitioned the streaming data by day. This allows our analysts to query only the events from the specific days they are interested in. A Glue crawler is scheduled to run every 24 hours to keep the table schema up to date.
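A minimal sketch of the write path and the day-based partitioning described above, assuming a hypothetical stream name and event shape. Our production events are sent from the Android app via the AWS SDK for Java; `boto3` is used here purely for illustration and is imported lazily.

```python
import json
from datetime import datetime, timezone


def day_partition_prefix(epoch_seconds: int) -> str:
    """Build the dt=YYYY-MM-DD/ S3 key prefix used to partition events by day."""
    day = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return day.strftime("dt=%Y-%m-%d/")


def to_record(event: dict) -> bytes:
    """Serialize one event as a compact JSON record for Firehose ingestion."""
    return (json.dumps(event, separators=(",", ":")) + "\n").encode("utf-8")


def send_event(event: dict, stream: str = "events-delivery-stream") -> None:
    """Put a single record on the Firehose delivery stream (needs boto3 + credentials)."""
    import boto3
    boto3.client("firehose").put_record(
        DeliveryStreamName=stream,          # hypothetical stream name
        Record={"Data": to_record(event)},
    )


print(day_partition_prefix(1635724800))  # → dt=2021-11-01/
```

Keying objects under a `dt=YYYY-MM-DD/` prefix is what lets Athena prune partitions, so a query over one day never scans the rest of the dataset.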
This two-phase approach allowed us to migrate our historical data into AWS and integrate it with new events, all while keeping costs down by using a compressed columnar file format.
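To illustrate how the partitioning and columnar format pay off, a day-filtered Athena query only scans the matching objects. The database, table, and results-bucket names below are hypothetical; `boto3` is assumed and imported lazily.

```python
# Hypothetical database, table, and output-bucket names.
QUERY = """
SELECT event_name, COUNT(*) AS events
FROM events_db.app_events
WHERE dt = '2021-11-01'  -- partition pruning: only that day's objects are scanned
GROUP BY event_name
"""


def run_query(sql: str) -> str:
    """Submit the query to Athena and return its execution id (needs boto3 + credentials)."""
    import boto3
    response = boto3.client("athena").start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )
    return response["QueryExecutionId"]
```

Because Athena bills by data scanned, combining the `dt` filter with compressed Parquet is what keeps per-query costs low.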
Conclusion

In this post, we gave a high-level look at how Mistplay migrated its Android application event solution from GCP to AWS. Notably, we highlighted just how straightforward it is to migrate from Firebase Analytics and BigQuery to Amazon S3 and Athena, and to take advantage of the reliability, scalability, and simplicity of serverless data lakes, in addition to the clear pricing of AWS services.
Learn more about Mistplay
To learn more about Mistplay and what we are working on, head over to https://www.mistplay.com/#/
Learn more about the AWS services we used
If you want to learn more about serverless data lake architectures, some helpful resources are below: