
Segment: Scaling a Week-Long Hack Project to Billions of Events a Month


Guest post by Calvin French-Owen, CTO and Co-founder, Segment

Segment is a customer data hub, the central place where you can send your customer data and then analyze it with more than a hundred different tools, all with the flip of a switch. But looking at Segment today, you’d never guess where we started.

A Few Failed Ideas

Believe it or not, we got our start as an education product called ClassMetric. The idea was to create a tool that would let college professors know when the class was confused in real time during the lecture. Students would visit our site on their laptops from the audience, and the professor would have her own dashboard up at the podium.

The good news was that students had no problem using their laptops during class. The bad news was that they weren’t using ClassMetric. They were visiting Facebook, Gmail, and Twitter instead.

Oops.

After enough failed classroom attempts, we decided to switch gears. We clearly weren’t nailing the market, so it was time to try something new. After some brainstorming, we settled on building a new, better analytics product.

We went through a bunch of iterations, everything from segmenting groups of people in every possible way to visualizing where users were dropping off of conversion funnels. The problem was that none of our products were getting very much traction, no matter what we tried.

So we decided on yet another approach. We turned the problem on its head and asked ourselves: “Why aren’t people integrating our tool?”

Asking the Right Question

The answer we arrived at was that analytics are a pain to integrate correctly. Each tool has its own API, so you first have to learn that API and then figure out what data you even want to send. If you’re not careful, the data won’t be mapped correctly either. In the end, it’s a ton of custom work to get the same data across different tools. It’s a tricky problem to get right, and it’s definitely not one engineering teams should be spending their time on. They should be focused on building their core product.

So we decided to solve that problem once and for all. What if we made a simple API that transformed your data into whatever format you wanted?

Since then, we’ve made data portability our core mission. We started with an open-source JavaScript library but quickly expanded into a hosted version, server-side libraries, and mobile SDKs that make it as easy as possible to get your data from point A (your site or app) to point B (analytics, growth, and product tools).
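To make that concrete, here is roughly what sending data through a single API looks like from the browser. The method names mirror the public analytics.js calls, but the user ID, traits, and event properties are invented for illustration, and the sketch assumes the library has already been loaded on the page:

```javascript
// Identify the user once, with whatever traits you know about them.
analytics.identify('user_123', {
  email: 'jane@example.com',
  plan: 'startup'
});

// Track a semantic event with its properties. The single call gets fanned
// out to every tool you've enabled, each in that tool's own format.
analytics.track('Signed Up', {
  plan: 'startup',
  source: 'landing-page'
});
```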

As the routing and transformation layer for all this data on its way from place to place, we were able to build two cool features that give customers a great deal of control over their data. First is the ability to replay historical data into new applications. Second is what we announced last week: direct SQL access to the data in Amazon Redshift.

We built these features because, after all, your data should belong to you. And you should be able to analyze it in any tool you’d like.
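To give a taste of what that Redshift access enables, a query over the loaded data might look something like the sketch below. Redshift speaks the standard Postgres protocol, so any Postgres client works; the connection details and the `tracks` table and column names here are placeholders, not Segment's actual schema:

```javascript
const { Client } = require('pg');

// Placeholder connection details for a Redshift cluster.
const client = new Client({
  host: 'your-cluster.abc123.us-east-1.redshift.amazonaws.com',
  port: 5439,
  database: 'events',
  user: 'analyst',
  password: process.env.REDSHIFT_PASSWORD
});

// Count weekly signups from a hypothetical table of track events.
const sql = `
  SELECT DATE_TRUNC('week', received_at) AS week, COUNT(*) AS signups
  FROM tracks
  WHERE event = 'Signed Up'
  GROUP BY 1
  ORDER BY 1
`;

async function weeklySignups() {
  await client.connect();
  const res = await client.query(sql);
  console.log(res.rows);
  await client.end();
}

weeklySignups().catch(console.error);
```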

Cut Scope, Move Fast

The biggest lesson we’ve learned from our journey comes down to focusing on one product that fills a niche really well. Paul Buchheit has a great quote: “it’s better to make a few people really happy than to make a lot of people semi-happy.”[1] He’s exactly right, but achieving that in practice requires a ton of discipline.

When it comes to building product, my mantra is: “big scope, high quality, fast iterations: choose two.” When you’re a very early stage startup, you need to move fast or else you die. At the same time, you need to ship something good enough that people actually want to use it (and then tell a friend about it).

So early in the product’s life, we cut scope everywhere we could. The very first version of Segment shipped with just eight integrations. There were no mobile or server-side libraries; only JavaScript was supported. We didn’t store the data at all, so there was no notion of exporting or replaying data across tools.

When we did start thinking about storing data, we decided to dump the raw logs into Amazon S3, rather than build transformation into Redshift right away, because we didn’t know how customers would want to use it yet.

Version one simply rendered some JavaScript that loaded other JavaScript. It was basic and rudimentary. But we kept it easy to use.
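To give a feel for how little there was to that first version, here is a rough sketch of the “JavaScript that loads other JavaScript” idea: fetch the project’s settings, then inject a script tag for each enabled integration. The settings URL and response shape are made up for illustration, and the real loader also had to queue API calls until the vendor libraries finished loading:

```javascript
// Sketch of a loader that pulls a project's settings and injects each
// enabled integration's script tag. The URL and settings shape are
// illustrative only.
async function loadIntegrations(writeKey) {
  const res = await fetch('https://cdn.example.com/projects/' + writeKey + '/settings.json');
  const settings = await res.json();

  for (const integration of settings.enabledIntegrations) {
    const script = document.createElement('script');
    script.async = true;
    script.src = integration.scriptUrl; // e.g. the vendor's own snippet
    document.head.appendChild(script);
  }
}

loadIntegrations('YOUR_WRITE_KEY');
```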

Now you might be thinking “I could hack that together in a week.” And you’d be right. We launched seven days after the initial commit.

To move quickly and validate that we were actually achieving product-market fit, we had to think carefully about what we wanted to build. And we really had to leverage existing infrastructure from the open-source community and AWS to avoid spending all our time reinventing the wheel.

Buy, Not Build

Besides, our mission is to make customer data a joy to use. Building a content delivery network isn’t anywhere near our core competency, so why waste precious time that could be spent on our core product? It wouldn’t matter if we built the best CDN of all time if the company died after six months.

Even beginning with that first version, we aggressively outsourced infrastructure to AWS. So how did we do it and what did we learn?

All of our code runs on Amazon EC2 instances in an Amazon VPC. After ClassMetric we made the switch to running on our own private subnets, and it’s made a world of difference. If you’re hosting a service, there’s almost no reason to run on the shared public network instead of inside a VPC. Having dedicated IP ranges you control and tighter network restrictions makes it pretty much a no-brainer.

Because our dynamic JavaScript is cached by Amazon CloudFront, we can serve hundreds of millions of requests a day with only a handful of servers. We started with invalidating specific paths whenever users updated their settings. But since then, we’ve found it easier to use a low max-age on the response-caching headers since invalidations take up to 15 minutes anyway, and clients will never cache a really stale version.
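As a sketch of that caching approach (using Express purely for illustration; the route and settings lookup are placeholders), the settings endpoint just sets a short max-age on its responses:

```javascript
const express = require('express');
const app = express();

// Placeholder for looking up a project's settings; in reality this would
// hit a database or config store.
function loadSettings(writeKey) {
  return { writeKey, enabledIntegrations: [] };
}

// Serve settings with a short max-age. CloudFront and browsers re-fetch
// after 60 seconds, so updated settings propagate quickly without waiting
// on CloudFront invalidations, which can take up to 15 minutes.
app.get('/projects/:writeKey/settings.json', (req, res) => {
  res.set('Cache-Control', 'public, max-age=60');
  res.json(loadSettings(req.params.writeKey));
});

app.listen(3000);
```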

The DNS all runs through Amazon Route 53, and we use Elastic Load Balancing extensively to route requests to our API. We’ve learned that it really pays to set up a weighted routing scheme early for any load balancers you’re using. That way you can add new load balancers with a low weight while they scale up to meet production load.
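Here is roughly what adding a second load balancer at a low weight looks like with the AWS SDK for JavaScript; the hosted zone IDs, record name, DNS name, and weight are all placeholders:

```javascript
const AWS = require('aws-sdk');
const route53 = new AWS.Route53();

// Add a new ELB into the weighted rotation at a low weight so it can warm
// up before taking a full share of traffic. All IDs and names are placeholders.
const params = {
  HostedZoneId: 'YOUR_HOSTED_ZONE_ID',
  ChangeBatch: {
    Changes: [{
      Action: 'UPSERT',
      ResourceRecordSet: {
        Name: 'api.example.com.',
        Type: 'A',
        SetIdentifier: 'api-elb-new',
        Weight: 5, // the existing ELB record might carry weight 95
        AliasTarget: {
          HostedZoneId: 'ELB_HOSTED_ZONE_ID', // the ELB's own hosted zone ID
          DNSName: 'new-api-elb-1234567890.us-east-1.elb.amazonaws.com.',
          EvaluateTargetHealth: false
        }
      }
    }]
  }
};

route53.changeResourceRecordSets(params).promise()
  .then(() => console.log('weighted record added'))
  .catch(console.error);
```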

We store all of the data as flat files in our S3 bucket. It’s great because S3 provides a highly available “source of truth” for the data while allowing us to bypass all the cost and scalability problems of a database.

One thing we hadn’t realized, however, is that you’ll want to take care in naming your keys because of how the data is partitioned under the hood. S3 shards data by key prefix, so a few hot prefixes can hurt your throughput significantly.[2]
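A common way to avoid hot prefixes, and roughly what the S3 documentation recommended at the time,[2] is to lead each key with a short hash so writes fan out across partitions. A sketch, with an invented bucket name and key layout:

```javascript
const crypto = require('crypto');
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// Prefix each key with a few characters of a hash so heavy write traffic
// spreads across S3's internal partitions instead of piling onto one
// lexicographic prefix. The bucket and key layout are illustrative only.
function keyFor(projectId, timestamp) {
  const hash = crypto.createHash('md5')
    .update(projectId + timestamp)
    .digest('hex')
    .slice(0, 4);
  return `${hash}/${projectId}/${timestamp}.json`;
}

s3.putObject({
  Bucket: 'example-raw-events',
  Key: keyFor('project_123', Date.now()),
  Body: JSON.stringify({ event: 'Signed Up' })
}).promise().catch(console.error);
```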

Leveraging New Infrastructure

Today we continue to see the benefits of building atop AWS. Now that it’s grown into a more mature ecosystem, the number of tools that integrate with AWS is staggering.

Our Stackdriver server monitoring hooks seamlessly into Amazon CloudWatch, surfacing all kinds of data about our cluster with almost no configuration. We’ve scripted our deployment and provisioning tools against the AWS API. We use Loggly’s S3 integration to dump our logs directly into our S3 bucket. And as of this week, we transform and load all of our analytics data into Redshift for universal SQL access.
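For a flavor of what scripting against the AWS API looks like, here is a small sketch with the AWS SDK for JavaScript that lists the running instances in one VPC; the region and VPC ID are placeholders:

```javascript
const AWS = require('aws-sdk');
const ec2 = new AWS.EC2({ region: 'us-east-1' });

// A toy example of scripting against the AWS API: list the running
// instances in a single VPC. The VPC ID is a placeholder.
ec2.describeInstances({
  Filters: [
    { Name: 'vpc-id', Values: ['vpc-0123456789abcdef0'] },
    { Name: 'instance-state-name', Values: ['running'] }
  ]
}).promise()
  .then(data => {
    for (const reservation of data.Reservations) {
      for (const instance of reservation.Instances) {
        console.log(instance.InstanceId, instance.PrivateIpAddress);
      }
    }
  })
  .catch(console.error);
```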

As we grow, we’re excited that the different AWS offerings continue to grow with us. We’re currently migrating to containers in production, and we hope to take advantage of the newly announced container service. We’re also investigating how to get the most from Amazon Aurora for querying all this data and from the new AWS Config service for keeping a registry of our services.

Most of all, we’ll continue to use those tools and open-source infrastructure to move fast, achieve quality, and cut scope.

[1] http://www.paulgraham.com/13sentences.html

[2] http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html