Snowplow Analytics provides an event analytics platform. The UK-based company enables its clients to collect granular, customer-level, and event-level data from multiple platforms, including web and mobile, and load that data into structured data stores to support advanced data analytics. Snowplow customers, who include retailers, media companies and gaming companies, mine and visualize data using Business Intelligence tools such as Looker and Tableau, and statistical and modelling tools like R and pandas. Snowplow is an open source platform: businesses can download Snowplow and set it up on their own AWS accounts, giving them complete ownership and control over their event data.

Snowplow is built to enable businesses to perform a wide range of data analytics. The company supports enormous data sets—for example, gaming companies can generate billions of events each day. When the company was founded in 2012, Snowplow Analytics was a batch-based processing system that, while robust, did not allow for continuous, real-time event analysis. Founders Alexander Dean and Yali Sassoon wanted their customers to be able to use the same data to drive operational systems like ad targeting or product recommendation engines. “At the time, there wasn’t an Amazon Elastic MapReduce (Amazon EMR) equivalent for real-time processing,” Dean says. “We really like Amazon EMR—it’s very straightforward to get up and running. With Amazon Redshift, you can figure out so many interesting, cool things offline—but we wanted a tool that would give us insights that we could act on within seconds, in real time.”

The company’s batch-based system could return results overnight, but the founders wanted a way to use that data immediately for decisioning. “The decisioning loop is a model predicated on customer behavior,” Dean explains. “Our goal was to take the customer behavior models that we’d built using tools on offline data in Amazon Redshift and apply it to users while they’re in the middle of behaving a certain way. Let’s say a user is on a car sales site, signing up for a test drive—but before they complete the form, they leave the site. We needed a way to identify that behavior and immediately take action.”
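The test-drive scenario Dean describes can be illustrated with a small sketch. This is not Snowplow's implementation: the event names, the 120-second timeout, and the AbandonmentDetector class are all assumptions made for illustration. It only shows the general idea of watching a stream of events and flagging users who start a form but do not finish it.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical event names and timeout; none of these come from Snowplow.
FORM_START = "form_start"
FORM_SUBMIT = "form_submit"
ABANDON_AFTER_SECONDS = 120


@dataclass
class AbandonmentDetector:
    """Tracks users who started a form and have not yet submitted it."""

    # Maps user ID to the timestamp (in seconds) of the form start.
    open_forms: Dict[str, float] = field(default_factory=dict)

    def observe(self, user_id: str, event_type: str, timestamp: float) -> None:
        """Update state for one incoming event."""
        if event_type == FORM_START:
            self.open_forms[user_id] = timestamp
        elif event_type == FORM_SUBMIT:
            # The user finished the form, so stop tracking them.
            self.open_forms.pop(user_id, None)

    def abandoned(self, now: float) -> List[str]:
        """Return, and stop tracking, users whose form has been open too long."""
        stale = [
            user_id
            for user_id, started in self.open_forms.items()
            if now - started >= ABANDON_AFTER_SECONDS
        ]
        for user_id in stale:
            del self.open_forms[user_id]
        return sorted(stale)
```

In a real deployment the returned user IDs would be handed to whatever system takes action, such as an offer or reminder service. Keeping the detector as plain in-memory state makes the logic easy to test, but a production system would need durable state and a policy for late or out-of-order events.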

No software existed with the combination of scalability, ease of use, and cost-effectiveness Snowplow needed, until Amazon Kinesis was launched in late 2013.

Snowplow uses Amazon Web Services (AWS) to collect and store event-level data using a technical architecture that is linearly scalable. The company uses different data collectors: one runs on Apache Tomcat using AWS Elastic Beanstalk, and another runs on Amazon Elastic Compute Cloud (Amazon EC2), using Elastic Load Balancing and Auto Scaling to manage data collection across multiple instances of Amazon EC2. The company uses Amazon Simple Storage Service (Amazon S3) as a data store, and Scalding on Amazon Elastic MapReduce (Amazon EMR) to validate, clean, and enrich the data. Snowplow uses Amazon Redshift as a database to support analytics.
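The validate, clean, and enrich stage described above can be pictured with a minimal sketch. Snowplow runs this stage with Scalding on Amazon EMR; the Python below is a stand-in chosen for brevity, and the field names, the required-field list, and the user-agent rule are assumptions rather than Snowplow's actual schema or enrichment logic.

```python
import json
from typing import Any, Dict, Iterable, List, Optional, Tuple

# Hypothetical required fields; not Snowplow's real event schema.
REQUIRED_FIELDS = ("event_id", "user_id", "event_type", "timestamp")


def validate(raw_line: str) -> Optional[Dict[str, Any]]:
    """Parse one raw line and return the event, or None if it is invalid."""
    try:
        event = json.loads(raw_line)
    except json.JSONDecodeError:
        return None
    if not isinstance(event, dict):
        return None
    if any(event.get(name) in (None, "") for name in REQUIRED_FIELDS):
        return None
    return event


def clean(event: Dict[str, Any]) -> Dict[str, Any]:
    """Normalize the event type so later grouping is consistent."""
    cleaned = dict(event)
    cleaned["event_type"] = str(event["event_type"]).strip().lower()
    return cleaned


def enrich(event: Dict[str, Any]) -> Dict[str, Any]:
    """Add a derived field; here, a crude platform guess from the user agent."""
    enriched = dict(event)
    useragent = str(event.get("useragent", ""))
    enriched["platform"] = "mobile" if "Mobile" in useragent else "web"
    return enriched


def process(raw_lines: Iterable[str]) -> Tuple[List[Dict[str, Any]], int]:
    """Run every line through the three steps and count rejected lines."""
    good: List[Dict[str, Any]] = []
    rejected = 0
    for line in raw_lines:
        event = validate(line)
        if event is None:
            rejected += 1
            continue
        good.append(enrich(clean(event)))
    return good, rejected
```

Counting rejected lines instead of silently dropping them reflects a common design choice in event pipelines: bad rows stay visible, so a collector or tracker problem can be spotted from the rejection rate.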

In 2014, Snowplow added an Amazon Kinesis stream to its service to capture and store data from client systems. The data is then drip-fed into Amazon Redshift for continuous real-time processing. Data collection for the Amazon Kinesis stream runs on Amazon EC2. “Adding Amazon Kinesis to the mix was like adding rocket fuel,” Dean says. “Thanks to Amazon Kinesis, our users have gone from having data that was fresh yesterday to having data that was fresh 2 minutes ago.” Snowplow Analytics is currently running Amazon Kinesis in beta, but expects to move into production later this year.
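As a rough illustration of what writing an event to an Amazon Kinesis stream involves, the sketch below builds the three arguments that the Kinesis PutRecord call takes (StreamName, Data, and PartitionKey) and passes them to a client. The stream name, the choice of user ID as the partition key, and the JSON encoding are assumptions for illustration, not details of Snowplow's collectors.

```python
import json
from typing import Any, Dict

# Hypothetical stream name; not taken from Snowplow's configuration.
STREAM_NAME = "example-raw-events"


def build_record(event: Dict[str, Any], stream_name: str = STREAM_NAME) -> Dict[str, Any]:
    """Build keyword arguments for a Kinesis put_record call."""
    return {
        "StreamName": stream_name,
        # Kinesis stores the payload as opaque bytes.
        "Data": json.dumps(event, sort_keys=True).encode("utf-8"),
        # Records with the same partition key go to the same shard, so
        # keying on the user keeps each user's events in order.
        "PartitionKey": str(event["user_id"]),
    }


def send_event(client: Any, event: Dict[str, Any]) -> Any:
    """Send one event using any client that exposes put_record(**kwargs)."""
    return client.put_record(**build_record(event))
```

With boto3 installed and AWS credentials configured, the client could be created with boto3.client("kinesis") and passed to send_event. Accepting the client as a parameter keeps the function easy to exercise with a stub and avoids any network access in tests.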

In addition, Snowplow users can input, query, and analyze data from third-party tools. “AWS gives you very simple building blocks that can nonetheless be used to create a robust, customized infrastructure,” Dean says. The new Snowplow data pipeline leveraging Amazon Kinesis is shown below in Figure 1.


Figure 1. Snowplow Data Pipeline

Using Amazon Kinesis has enabled Snowplow to offer real-time feedback loops to its customers, and has reduced analysis time from several hours to seconds—even when Snowplow is ingesting hundreds of millions of events each day. “With Amazon Kinesis, you can actually identify behavioral patterns as they are playing out,” Dean says. “You can provide incentives quickly to change that behavior, encouraging users to meet a certain business goal, whether that’s signing up for a test drive or staying in an online game. By using AWS, we can ingest data rapidly and at scale, and help our customers push offers to their own customers.”

Ease of use has been another bonus. “We were able to get started on Kinesis quickly, easily, and at scale,” Dean says. “And because we’re built on AWS from the ground up, we can continue to evolve the Snowplow platform as AWS evolves. AWS is a great platform to host Snowplow.”

Snowplow’s developer community has expanded as a result. As an open source platform, Snowplow relies on a thriving community of developers to evolve the platform. “Now that we’re using Amazon Kinesis, we’re getting some great new contributors to our projects, and a lot of excellent feedback,” Dean says.

By using Amazon Kinesis, Snowplow can now expand its user base, entering new markets like digital advertising, where the ability to leverage feedback loops quickly is critical. “Using Amazon Kinesis opens up the scope of what Snowplow can do—it has opened us up to new markets and new users,” Dean says.
