AWS Startups Blog

Crittercism: Solving Mobile Performance Issues for One Billion Users

Crittercism, a company based in San Francisco, provides the world’s leading performance management solution for mobile applications, and is devoted to helping mobile and IT Ops teams run better, faster, and smarter mobile apps. Crittercism has grown rapidly since its inception in 2011 and, with the help of AWS, now supports over one billion active users per month. We sat down with Paul Lappas, Vice President of Engineering at Crittercism, to learn more.

How did Crittercism get started and what services does it offer?

Crittercism was founded on the simple premise that the increasing complexity of mobile apps can create performance issues that are difficult to identify, manage, and fix. Many companies that deliver revenue-critical mobile apps don’t have the necessary infrastructure and resources to rapidly find and respond to performance issues in a mobile environment.

That’s where Crittercism can help. We find and fix issues quickly by sifting through massive amounts of app operational analytics. We monitor crashes, and we handle exceptions, transactions, endpoint latency, and errors for you. We also give you the option to configure custom metadata and breadcrumbs. All of this is tied together through real-time dashboards and APIs that prioritize, troubleshoot, and trend performance issues. Our product lets you drill down into the data for individual mobile users and monitor the experience they are having. We also provide actionable, data-driven guidance that helps you solve problems that cause your users pain.

We support all mobile platforms and offer a freemium, subscription-based model, including free basic monitoring for up to 30,000 users. You can get up and running in minutes simply by installing a small SDK.

What is one of your top technical challenges?

A significant challenge we face is sheer scale. Currently, our SDK is installed on approximately one billion devices, and each device sends dozens of raw analytics events to our servers every day. This amounts to roughly 50,000 incoming requests per second. As a point of comparison, Google’s worldwide search traffic is about 40,000 requests per second (according to this site). We obviously don’t have Google’s infrastructure budget, so we have to get very creative to manage our costs.

AWS is critical to helping us manage costs and achieve this scale with a small team. For example, we use Amazon EMR as a core piece of our solution to process, store, and provide actionable data for our customers. The approach that we’ve taken is based on basic distributed system design patterns: the system should a) be fault-tolerant, against both hardware failures and human error; b) create abstractions, which allow us to serve a wide range of workloads and use cases; c) achieve low-latency reads and updates; and d) scale linearly and “out” by adding more servers.

Our technology approach includes implementing a very high-capacity ingest tier using Node.js and Kafka, storing all the raw data in HDFS, and then using EMR jobs to process it and move the result set into a place where it can be served to users. This approach gives us the flexibility to find insights in this raw data to meet our customers’ needs. This might be “fix this crash here” or “this particular API endpoint is having latency issues.” We can also segment the data for customers across device make/model, location, carrier, and OS.
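To make the segmentation idea concrete, here is a minimal sketch in Python. It is not Crittercism's pipeline: the event fields, the sample values, and the `mean_latency_by` helper are all assumptions made for illustration, and a production job would run over HDFS data in EMR rather than an in-memory list.

```python
from collections import defaultdict

# Hypothetical processed events; the field names and values are illustrative only.
EVENTS = [
    {"endpoint": "/api/login", "latency_ms": 120, "carrier": "CarrierA", "os": "iOS"},
    {"endpoint": "/api/login", "latency_ms": 480, "carrier": "CarrierB", "os": "Android"},
    {"endpoint": "/api/cart", "latency_ms": 90, "carrier": "CarrierA", "os": "Android"},
    {"endpoint": "/api/login", "latency_ms": 300, "carrier": "CarrierB", "os": "Android"},
]


def mean_latency_by(events, *dimensions):
    """Group events by the given dimensions and return the mean latency per group."""
    totals = defaultdict(lambda: [0.0, 0])  # group key -> [latency sum, sample count]
    for event in events:
        key = tuple(event[d] for d in dimensions)
        totals[key][0] += event["latency_ms"]
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Overall view per endpoint, then the same metric segmented by carrier.
by_endpoint = mean_latency_by(EVENTS, "endpoint")
by_endpoint_and_carrier = mean_latency_by(EVENTS, "endpoint", "carrier")
```

In this made-up data, segmenting by carrier shows that the slow `/api/login` samples all come from one carrier, which is the kind of narrowing-down the paragraph above describes.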

Can you share some more metrics?

In the past year, we’ve grown to support one billion users with only three DevOps people on staff. Supporting a lot of users means that we need to handle massive amounts of data:

  • 0.5 trillion app sessions
  • 3.5 billion total errors monitored/diagnosed
  • 120 billion total server endpoint samples
  • 150 million mobile transactions managed within three months of launch

To manage all this data, our data engineering team selects which metrics we want to store, and builds the data pipelines to move that data into a warehouse so that anyone in the company can access it to help with projects. We push all of it into Amazon Redshift. Redshift has become the de facto data warehouse solution for our company and the “source of truth” for the various metrics that we capture daily. There’s also a large ecosystem of third-party visualization tools that support Redshift, which makes it really easy for us to visualize and export the data internally.

Because of our cross-platform view, we’re sitting on one of the industry’s largest data sets, and the trends are interesting. We’ve done a lot to share some of this publicly, most recently through the launch of data.crittercism.com, the first industry benchmark portal for mobile apps. And in our product, we have features that enable you to compare your app’s performance versus trends in your vertical or app category. Folks love to know how they are doing versus trends in their industry verticals.

What have you learned about managing data sets this large?

The core challenge is defining the right metrics and then moving them into a place where they can be reliably accessed by analysts or automated systems. This is both a technical challenge and an organizational one, as it requires coordinating between development, product, and business users. In addition, new product offerings might change the type of data being collected, and architectural changes might impact how the data is being processed and where it is stored. There are just so many possible metrics that landing the “right” ones is a bit of trial and error and requires feedback from business users. You need great data engineers who can work across business and technical teams to make this happen.

In general, you need to set up the following solutions to manage large data sets:

  1. Single platform to store the data — Use a single platform as your “source of truth.” It’s important that metrics are stored in one place so that you can easily correlate data together, without needing to add complexity. We use Amazon Redshift as our data platform.
  2. Robust ETL pipelines — Extract the right metrics from your online data stores, process them, and then move them into your data warehouse. Your dev teams need to treat this as a requirement, and they need to consider the impact on your Extract, Transform, and Load (ETL) systems when they add new features or modify the architecture. Also, your DevOps teams should treat this like a first-class production system — including designing for robustness, fault tolerance, and monitoring.
  3. Accessible data — Make the data accessible to as many people as possible within your org, with special attention to product development teams, marketing, and sales ops. Each group will have different levels of skills and needs, so you need to provide interfaces for both machines and humans. For example, you can use SQL and BI visualization tools (there are many SaaS-based BI tools available today) so that non-technical folks can easily write queries and visualize the data.
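The three pieces above can be sketched as a tiny extract, transform, and load job. This is an illustration under stated assumptions, not Crittercism's implementation: Python's built-in `sqlite3` stands in for the warehouse (the interview names Amazon Redshift), and the record fields, the `daily_metrics` table, and the function names are hypothetical.

```python
import sqlite3

# Hypothetical raw records as they might come from an online data store.
RAW_RECORDS = [
    {"app_id": "app-1", "day": "2015-01-01", "kind": "session"},
    {"app_id": "app-1", "day": "2015-01-01", "kind": "crash"},
    {"app_id": "app-1", "day": "2015-01-01", "kind": "session"},
    {"app_id": "app-2", "day": "2015-01-01", "kind": "session"},
    {"app_id": "app-2", "day": "2015-01-01", "kind": "unknown"},  # not a chosen metric
]


def extract(records):
    """Extract: keep only the event kinds that were chosen as metrics."""
    return [r for r in records if r["kind"] in ("session", "crash")]


def transform(records):
    """Transform: roll events up into one row per (app, day)."""
    rollup = {}
    for r in records:
        row = rollup.setdefault((r["app_id"], r["day"]), {"session": 0, "crash": 0})
        row[r["kind"]] += 1
    return [(app, day, c["session"], c["crash"]) for (app, day), c in sorted(rollup.items())]


def load(conn, rows):
    """Load: upsert the rollup so that re-running the job does not duplicate rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_metrics ("
        "app_id TEXT, day TEXT, sessions INTEGER, crashes INTEGER, "
        "PRIMARY KEY (app_id, day))"
    )
    conn.executemany("INSERT OR REPLACE INTO daily_metrics VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def run_pipeline(conn, records):
    load(conn, transform(extract(records)))


# sqlite3 is only a stand-in for the real warehouse in this sketch.
conn = sqlite3.connect(":memory:")
run_pipeline(conn, RAW_RECORDS)

# "Accessible data": once loaded, anyone who knows SQL can query the single store.
crash_rates = conn.execute(
    "SELECT app_id, crashes * 1.0 / sessions FROM daily_metrics ORDER BY app_id"
).fetchall()
```

Keying the table on `(app_id, day)` and upserting is one simple way to make a job safe to retry, which matters if the pipeline is treated as a production system.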

Why was Crittercism “born in the cloud”?

We have a small team, so we can’t afford to spend precious time working on projects that don’t add customer value. We decided early on that we wanted to avoid managing our own infrastructure. I founded an IaaS cloud provider in the past, so I know how complex it is to manage the lifecycle of server infrastructure over time. There are many hidden costs. Plus, you limit the flexibility of your product architecture because you have to make certain decisions upfront about how to build out your infrastructure that are hard to change later.

For example, if you scope your infrastructure for MongoDB, and later decide to shift from MongoDB to Amazon S3, you’re stuck with a bunch of physical servers that you don’t need. In the long run, you can’t optimize for the things that are best for your dev team and customers, which is really what you should be focusing on.

On AWS this is not an issue. We did our due diligence and took a look at TCO for the “build the base and rent the spike” approach. Our cost analysis revealed that staying on AWS was cost equivalent to doing it ourselves. When you factor in the flexibility issue, it was a no-brainer to stay on AWS. We committed to three-year Reserved Instances (RIs) to maximize our cost savings.

How does the AWS regional presence help you achieve your business goals?

We have a global client base and an aggressive channel and partnership strategy, focusing mainly on the EU and emerging markets in Asia. Data privacy and the handling of personal information are a big deal in the EU today, especially in Germany. So there’s a lot of business in the EU that we can tap by opening a point of presence there.

We were pleased to see AWS invest further in the EU by opening a Frankfurt region. We will be launching a presence in Frankfurt in Q1 2015. We’ll be looking at Asia as well, in the second half of 2015. And we can do this without adding more people because of the way we’ve tooled our deployments and managed our infrastructure entirely as code. When you think about it, it’s pretty amazing that we can do that. Ten years ago it was unheard of. We would’ve had to fly our employees out there and ramp up support on-site to get this done. It would’ve taken six months. On AWS, we can do it in less than one month.

How else does AWS support the growth of your business?

In addition to the various services that we use, there are a number of factors that differentiate AWS from other options:

  • AWS Support — We love AWS Support for enterprises. Those folks are essentially an extension of our own team. They are always available day or night to help us with issues or answer questions.
  • AWS regional presence — The regional presence of AWS enables us to reach new audiences and markets with minimal incremental investment.
  • Continuous deployment — AWS provides continuous deployment through automation, which gives us a competitive advantage through accelerated time to market. We can stay lean and service huge numbers of customers.
  • Disaster recovery/business continuity — Because of AWS APIs, we can recreate our entire infrastructure (assuming we have database backups and access to our code repository) in just a few hours.
  • Reliability and uptime — We are able to achieve 99.99% uptime for our service on AWS.
  • Security and compliance — This is a big one. Obviously, there is a shared security model, because AWS doesn’t manage our own application. But I’m confident that AWS will continue to ensure compliance with emerging infrastructure requirements, which means that we can focus on application security on our end. AWS makes compliance and InfoSec a huge priority.

What’s an example of a success story that is unique to mobile apps?

We have a customer in Spain that was receiving complaints from some of their users because the users were unable to access the company’s app. But the company’s server monitoring didn’t show any issues. So they dug deeper by using the geolocation features of our product and discovered that all affected users were clustered in a single building. It turned out that a steel bridge near the building was disrupting mobile phone data and preventing access to the server. It’s an interesting story because this sort of problem is unique to mobile. You need deep visibility into the mobile experience to isolate stuff like this.

What are you working on now that you think is a game-changer?

We have a new feature that monitors mobile transactions, because we really think it’s a game-changer for how folks think about the impact of mobile performance on business revenue. We define a mobile transaction as any series of steps that leads to a business outcome.

For example, in retail this can be a cart checkout; for financial apps, a bank deposit. For non-consumer-facing apps, this might be a time & expense submission. The way it works is simple: you wrap your “transaction” in ‘start’ and ‘end’ tags in source code, and give it a name. In our dashboard, we show you how many transactions succeeded or failed. For a transaction that failed, we give you the reason for the failure. We also tell you how much revenue is at risk, across your entire user base, as a result of the failure. It’s been hugely eye-opening for our clients. They say, “I had no idea that 20% of my transactions are failing” for this or that reason. It helps them focus on fixing the most important issues.
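The start/end wrapping idea might look something like the following Python sketch. This is not the Crittercism SDK: the `TransactionMonitor` class, its `transaction` and `summary` methods, and the per-transaction `value` are hypothetical names invented for illustration, and "revenue at risk" is simplified here to the summed value of the failed transactions.

```python
from contextlib import contextmanager


class TransactionMonitor:
    """Hypothetical recorder for named transactions; not a real SDK API."""

    def __init__(self):
        self.records = []  # tuples of (name, succeeded, failure reason, value)

    @contextmanager
    def transaction(self, name, value=0.0):
        """Mark the start and end of a named transaction worth `value`."""
        try:
            yield
        except Exception as exc:
            # The wrapped block raised: record the failure and its reason, then re-raise.
            self.records.append((name, False, type(exc).__name__, value))
            raise
        else:
            self.records.append((name, True, None, value))

    def summary(self, name):
        """Return success/failure counts and the value tied up in failures."""
        rows = [r for r in self.records if r[0] == name]
        failed = [r for r in rows if not r[1]]
        return {
            "succeeded": len(rows) - len(failed),
            "failed": len(failed),
            "failure_rate": len(failed) / len(rows) if rows else 0.0,
            "revenue_at_risk": sum(r[3] for r in failed),
            "reasons": sorted({r[2] for r in failed}),
        }


# Usage: five made-up checkouts, one of which times out.
monitor = TransactionMonitor()
for amount, ok in [(20.0, True), (35.0, False), (10.0, True), (15.0, True), (50.0, True)]:
    try:
        with monitor.transaction("checkout", value=amount):
            if not ok:
                raise TimeoutError("payment gateway timed out")
    except TimeoutError:
        pass  # the app would handle the error; the monitor has already recorded it

report = monitor.summary("checkout")
```

With this sample data, `report` shows one failed checkout out of five (a 20% failure rate), the failure reason, and the 35.0 in order value attached to the failure.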

It’s hugely gratifying for us to be able to solve problems for so many mobile users. We’re looking forward to what the future holds in store.