According to BackType founder Christopher Golda, all the company’s infrastructure currently resides on Amazon Web Services (AWS), except for an external message transfer agent (MTA). The company made the decision in 2008 to move to AWS, based largely on flexibility. Golda says, “AWS gave us the flexibility to try new ideas extremely rapidly and remain flexible with our existing infrastructure.”
BackType stores about 25 terabytes of compressed binary data on its servers, holding over 100 billion individual records. Its API serves 400 requests per second on average, and the company runs an average of 60 Amazon Elastic Compute Cloud (Amazon EC2) Reserved and Spot Instances at any given time, scaling up to 150 for peak loads with Spot and On-Demand Instances.
Dealing with that volume of data with limited resources has forced the team to be creative: They've invented their own language, Cascalog, to make analysis easy, and their own database, ElephantDB, to simplify delivering analysis results to users. They have even written a system to update traditional batch processing of massive data sets with new information in near real-time.
Golda says, “The backbone of our pipeline is AWS, using Amazon Simple Storage Service (Amazon S3) for storage and Amazon EC2 for servers. We leverage technologies such as Clojure, Python, Hadoop, Cassandra, and Thrift to process this data in batch and real-time.”
The start of the pipeline (see diagram below) is a group of instances that ingest data from Twitter, Facebook, and millions of other sites and social media services. From there, the architecture branches into two pipelines: the traditional batch layer that takes hours to produce results and a "speed layer" that reflects new changes immediately.
“Captured data is fed into the batch layer through processes on each machine called collectors,” explains Golda. “These append new data to a local file, which is then copied over to Amazon S3 periodically. This raw data is put through a process called shredding, which organizes it in two different ways. First, data units are stored with others of the same type. Second, the same data is sliced by time, so everything within a single day will be stored together.” This organization of the data enables BackType to run more efficient queries only against the relevant data.
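The shredding step Golda describes can be illustrated with a minimal sketch. It assumes each raw record carries a `type` field and a Unix `timestamp`, and that a partition maps to a storage prefix of the form `shredded/<type>/<day>/`. These field names and the path layout are hypothetical and are not taken from BackType's code.

```python
from collections import defaultdict
from datetime import datetime, timezone


def shred(records):
    """Group raw records by (type, UTC day), so a query reads only relevant data."""
    partitions = defaultdict(list)
    for record in records:
        # Assumption: "timestamp" is seconds since the Unix epoch.
        day = datetime.fromtimestamp(
            record["timestamp"], tz=timezone.utc
        ).strftime("%Y-%m-%d")
        partitions[(record["type"], day)].append(record)
    return dict(partitions)


def partition_path(record_type, day):
    """Hypothetical storage prefix for one partition."""
    return f"shredded/{record_type}/{day}/"


# Hypothetical sample input: two record types spread over two days.
raw = [
    {"type": "tweet", "timestamp": 1302566400, "body": "a"},
    {"type": "comment", "timestamp": 1302570000, "body": "b"},
    {"type": "tweet", "timestamp": 1302652800, "body": "c"},
]
shredded = shred(raw)
```

With this layout, a job that needs only one day of tweets reads a single partition instead of scanning the full data set.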
The team’s database, ElephantDB, takes all the data from a batch job, and splits it up into shards, each of which is written out to disk as BerkeleyDB-format files. After that, an ElephantDB cluster serves the shards. Unlike many traditional databases, it's read-only, so updating data served from the batch layer requires creating a new set of shards.
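The shard-and-serve pattern can be sketched as follows. This is not ElephantDB's implementation: the hash-modulo sharding scheme, the in-memory dictionaries standing in for on-disk files, and the names `shard_for`, `build_shards`, and `ReadOnlyView` are all assumptions made for illustration.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard index with a hash that is stable across processes."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def build_shards(batch_output, num_shards):
    """Split the complete output of one batch job into num_shards dictionaries."""
    shards = [dict() for _ in range(num_shards)]
    for key, value in batch_output.items():
        shards[shard_for(key, num_shards)][key] = value
    return shards


class ReadOnlyView:
    """Serves lookups from a fixed set of shards; it has no write method."""

    def __init__(self, shards):
        self._shards = shards

    def get(self, key):
        return self._shards[shard_for(key, len(self._shards))].get(key)


# Hypothetical batch results keyed by URL.
batch_v1 = {"example.com/a": 12, "example.com/b": 7}
shards_v1 = build_shards(batch_v1, 4)
view = ReadOnlyView(shards_v1)

# An update means building a new set of shards and swapping the view.
batch_v2 = {"example.com/a": 15, "example.com/b": 7, "example.com/c": 1}
new_view = ReadOnlyView(build_shards(batch_v2, 4))
```

Because a view is never modified after it is built, replacing it wholesale avoids the locking and consistency concerns of in-place updates, at the cost of regenerating every shard.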
Golda says, “The system is robust, scalable, and flexible, but it is not low latency. Because the batch workflow runs once or twice per day, we needed to create a speed layer on top to compensate. Our speed layer is being powered by a pair of technologies we've code-named Storm and Thunderlog. Storm manages the stream processing and guarantees reliability of messages, while Thunderlog is a query language for writing real-time queries against unbounded spouts of data.”
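The article does not say how results from the two layers are combined. One common approach, shown here purely as an assumption, is to add a real-time delta to the most recent batch result and discard the delta once a new batch run has absorbed the same data. The names `SpeedLayer` and `query` are hypothetical and are unrelated to the Storm or Thunderlog APIs.

```python
from collections import Counter


class SpeedLayer:
    """Counts events that arrived after the last batch run (illustrative only)."""

    def __init__(self):
        self.counts = Counter()

    def record(self, key):
        self.counts[key] += 1

    def reset(self):
        # Called once a new batch view covers the events counted so far.
        self.counts.clear()


def query(key, batch_view, speed_layer):
    """Combine the precomputed batch count with the real-time delta."""
    return batch_view.get(key, 0) + speed_layer.counts[key]


# Hypothetical data: the batch view is hours old, the speed layer is current.
batch_view = {"example.com/a": 100}
speed = SpeedLayer()
for _ in range(3):
    speed.record("example.com/a")
before_batch = query("example.com/a", batch_view, speed)

# The next batch run absorbs the three new events, so the delta is dropped.
batch_view = {"example.com/a": 103}
speed.reset()
after_batch = query("example.com/a", batch_view, speed)
```

This sketch works only for results that can be merged by addition, such as counts; other kinds of query would need a different merge step.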
BackType’s system is optimized to reduce costs for each type of workload. BackType uses Reserved Instances to ensure capacity for its database workloads while reducing costs by up to 34% over On-Demand Instances. “Reserved Instances made sense for our use case, because we knew we would run our instances for at least 5 months of the year. We also wanted Amazon to reserve the capacity for us for the entire year. We are a Y-Combinator startup, so many of our Board members are aware of AWS. So, when we presented the plan to purchase Reserved Instances to our board, they knew it was worth the investment since it reduced our costs so much,” says Michael Montano, founder at BackType.
For its batch workloads, BackType uses Spot Instances to maintain elasticity and flexibility while reducing costs by up to 66% over On-Demand Instances. BackType migrated to Spot by updating the Cloudera boot scripts to initialize the instances on startup with the right data and settings. “Since our batch processes are short running processes, the benefit of the price reduction cancels out any costs to restart our clusters,” says Montano. “If we need to restart a cluster, we simply leverage the Amazon EC2 APIs to submit a new bid.” BackType’s bidding strategy is similar to that of other customers. “We bid high enough to reduce the chances of being interrupted. When bidding, [we] choose a maximum price we are comfortable paying in the worst case. However, Amazon always charges us the market price rather than my max bid price, so we typically pay a lot less,” Montano says. In some one-off use cases, such as scaling a database or a handful of components in the speed layer, BackType uses On-Demand Instances. “We typically use On-Demand instances to scale up portions of our system for short periods of time that cannot be interrupted,” says Golda.
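The effect of paying the market price rather than the maximum bid can be shown with a small worked example. All prices below are hypothetical, and the model is deliberately simplified: it charges the market price for each full hour and stops at the first hour in which the market price exceeds the bid. It does not reproduce Amazon's actual billing rules.

```python
from decimal import Decimal


def spot_run_cost(hourly_market_prices, max_bid):
    """Return (total_cost, hours_completed) under a simplified Spot model.

    Each completed hour is charged at the market price, not at max_bid.
    The run is interrupted at the first hour whose market price exceeds max_bid.
    """
    cost = Decimal("0")
    hours = 0
    for price in hourly_market_prices:
        if price > max_bid:
            break
        cost += price
        hours += 1
    return cost, hours


# Hypothetical figures, in dollars per instance-hour.
market = [Decimal(p) for p in ("0.11", "0.12", "0.10", "0.13")]
max_bid = Decimal("0.30")
on_demand_rate = Decimal("0.40")

cost, hours = spot_run_cost(market, max_bid)
on_demand_cost = on_demand_rate * hours
savings = 1 - cost / on_demand_cost

# A bid set too close to the market price is interrupted early.
low_cost, low_hours = spot_run_cost(market, Decimal("0.115"))
```

In this example the high bid never sets the price paid. It only determines whether the run survives a spike in the market price, which matches the strategy Montano describes.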
Looking ahead, the team plans to use additional AWS services. Golda says, “We’d love to use Amazon Elastic MapReduce more extensively, as well as the newly introduced e-mail delivery services.” He adds, “BackType would not have been possible without AWS.”
To learn more, visit http://www.backtype.com/.
Added April 12, 2011