Pulse – Using Big Data Analytics to Drive Rich User Features
It's always exciting to find out that an app that has changed how I consume news and blog content on my mobile devices is using AWS to power some of its most engaging features. Such is the case with Pulse, a visual news reading app for iPhone, iPad and Android. Pulse uses Amazon Elastic MapReduce, our hosted Hadoop product, to analyze data from over 11 million users and to deliver the best news stories from a variety of content publishers. Born out of a Stanford launchpad class and awarded for its elegant design by Apple at WWDC 2011, the Pulse app blends a strong high-tech backend with great visual appeal to conquer the eyes of mobile news readers everywhere.
Pulse backend team members from left to right: Simon, Lili, Greg, Leonard
The December 2011 update included a new feature called Smart Dock, which uses Hadoop and a tool called mrjob, developed by Yelp, to analyze users' reading preferences and continuously recommend other articles or sources they might enjoy.
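To give a feel for the kind of analysis involved, here is a minimal sketch of the map/reduce pattern in plain Python. It is not Pulse's pipeline and it does not use Hadoop or the mrjob API; the event format, the sample data, and every function name are hypothetical, chosen only to show how per-user reading counts could be derived from event logs.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical client events as (user_id, source) pairs; the real log
# format used by Pulse is not described in this post.
EVENTS = [
    ("u1", "tech-blog"),
    ("u1", "tech-blog"),
    ("u1", "world-news"),
    ("u2", "world-news"),
    ("u2", "design-mag"),
    ("u2", "world-news"),
]


def map_event(event):
    """Map step: emit one ((user, source), 1) pair per read event."""
    user_id, source = event
    yield (user_id, source), 1


def reduce_counts(key, values):
    """Reduce step: sum the counts emitted for one (user, source) key."""
    return key, sum(values)


def run_job(events):
    """Run map, then a sort-based shuffle, then reduce, all in memory."""
    pairs = sorted(kv for event in events for kv in map_event(event))
    return [
        reduce_counts(key, (count for _, count in group))
        for key, group in groupby(pairs, key=itemgetter(0))
    ]


def top_source_per_user(counts):
    """Pick each user's most-read source from the reduced counts."""
    best = {}
    for (user, source), n in counts:
        if user not in best or n > best[user][1]:
            best[user] = (source, n)
    return best


if __name__ == "__main__":
    print(top_source_per_user(run_job(EVENTS)))
```

On a real cluster the map and reduce steps would run in parallel across many nodes, and the framework would handle the shuffle that the `sorted` and `groupby` calls stand in for here.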
To understand the level of engineering that goes behind such rich customer features, I spoke to Greg Bayer, Backend Engineering Lead at Pulse:
How big is the big data that Pulse analyzes every day?
Our application relies on accurately analyzing client event logs (as opposed to web logs) to extract trends and enable other rich features for our users. To give you a sense of the scale at which we run these analyses, we literally go through millions of events per hour, which translates to as many as 250+ Amazon Elastic MapReduce nodes on any given day. Since we are dealing with event logs, generated by our users from the various platforms on which they access our app (Android, iPhone, iPad, etc.), our logs grow in proportion to our user base. For example, the recent influx of new users from Kindle Fire (Android) means we now have a lot more logs coming in from those devices. Also, since the logs are big, we've found that it is very efficient to write them to disk as fast as possible – directly from devices to Amazon EC2 (see my tandem article on the logging architecture we use and the graph below, which highlights some of our numbers).
For more Pulse numbers, check out the full infographic.
Powering Rich Features for Our Users
Much of our backend is built on industry-standard systems such as Hadoop. The innovation happens in how we leverage these systems to create value. For us, it's all about how we can make the app more fun to use and provide rich features that our users will love. Techies can read about many of these features, in full detail, in the backend section of the Pulse engineering blog.
The Right Choice for Big Data
I joined the team here pretty early on as the first backend engineer. I came to Pulse after working at Sandia National Labs, where I built and managed an in-house 70-node Hadoop cluster. That took an investment of over $100,000, plus operational support and more than 6 months' time, to get it fully fine-tuned. Needless to say, I was fully aware of the cost and resources needed to run something at the scale that Pulse would need to accommodate.
AWS was and still is the only feasible solution for us. I love the flexibility to quickly stand up a cluster of hundreds of nodes and the added flexibility of choosing the pricing scheme that's needed for a job. If I need a job done faster, I can always spin up a very large cluster and get results in minutes, or take advantage of smaller instances and the spot marketplace for Amazon Elastic MapReduce if I'm looking to complete a job that's not time-sensitive. Since an Amazon Elastic MapReduce cluster can simply be turned off when we are done, the cost to run big queries is usually quite reasonable. Consider a cluster of 100 m1.large machines: a set of queries that takes 45 minutes to run on this cluster could cost us approximately $11 – $34 (depending on whether we bid on spot instances or use regular on-demand instances).
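The arithmetic behind a range like that can be sketched in a few lines of Python. This is a back-of-the-envelope illustration only: the hourly rates below are assumptions picked so that the totals land on the quoted $11 and $34, the rounding up to a full billed hour is also an assumption, and any separate Amazon Elastic MapReduce charge is ignored. None of these figures comes from the interview or from a current price list.

```python
import math

# Assumed per-instance hourly rates in USD (hypothetical, chosen to match
# the quoted range; not taken from the interview or an AWS price list).
ON_DEMAND_RATE = 0.34
SPOT_RATE = 0.11


def cluster_cost(nodes, runtime_minutes, hourly_rate_usd):
    """Cost of a transient cluster, assuming each node is billed per started hour."""
    billed_hours = math.ceil(runtime_minutes / 60)
    return nodes * billed_hours * hourly_rate_usd


# The scenario from the text: 100 nodes running for 45 minutes.
spot_total = cluster_cost(100, 45, SPOT_RATE)
on_demand_total = cluster_cost(100, 45, ON_DEMAND_RATE)

if __name__ == "__main__":
    print(f"spot: ${spot_total:.2f}, on-demand: ${on_demand_total:.2f}")
    # prints "spot: $11.00, on-demand: $34.00"
```

Under these assumptions the cost scales linearly with node count and billed hours, which is why shutting the cluster down as soon as a job finishes matters so much.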
Lessons Learned (the bold formatting below is our doing :) )
It is important to consider the trade-offs and choose the right tool for the job. In our experience, AWS provides an exceptional capability to build systems as close to the metal as you like, while still avoiding the burden and inelasticity of owning your own hardware. It also provides some useful abstraction layers and services above the machine level.
By allowing virtual machines (Amazon EC2 instances) to be provisioned quickly and inexpensively, a small engineering team can stay more focused on the development of key product features. Since stopping and starting these instances is painless, it's easy to quickly adapt to changing engineering or business needs, perhaps scaling up to support 10x more users or shutting down a feature after pivoting a business model.
AWS also provides many other useful services that help save engineering time. Many standard systems, such as load balancers or Hadoop clusters, that normally require significant time and specialized knowledge to deploy, can be deployed automatically on Amazon EC2 for almost no setup or maintenance cost.
Simple but powerful services like Amazon S3 and the newly released Amazon DynamoDB make building complex features on AWS even easier. Because bandwidth is fast and free between all AWS services, plugging together several of these services is a great way to bootstrap a scalable infrastructure.
Thanks for your time, Greg & best of luck to the Pulse team!
Related: Pulse Engineering – Scaling to 10M on AWS