AWS Startups Blog
Ufora: Algorithms Got Smart. So Should Compute.
Guest post by Braxton McKee, CEO, Ufora
The Ufora platform empowers data scientists to analyze any data quickly and easily by automating the engineering.
Credit Modeling: Just Add More Machines
I founded Ufora in 2009, after running Credit Modeling at a large mortgage hedge fund for several years. My basic realization was that adding more machines was a lot more efficient than adding more engineers. The challenge, though, was getting the machines to behave like smart engineers.
During the financial meltdown, I had the unenviable task of trying to get our models to reflect the rapidly changing mortgage markets. The problem was that the relatively simple code describing the business logic we cared about was deeply entangled in complicated code describing the infrastructure logic that made the system fast, scalable, etc. The business logic described things like what would happen if a judge suddenly decreased the interest rate on a borrower’s mortgage (even though the models assumed this would never happen, judges used this as a mechanism to try to keep borrowers in their homes). The infrastructure logic represented the low-level engineering work of getting many machines to operate in a unified way, including fault tolerance, memory management, message passing, cache locality, etc. Carefully threading business logic changes, like the interest rate decrease, through 500,000 lines of C++ infrastructure code without breaking anything turned out to be glacially slow, expensive, and painful.
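To make that entanglement concrete, here’s a deliberately tiny sketch of what the business logic looks like on its own; the function names and numbers are illustrative, not our actual models. The point is that the judicial rate modification is a few lines of business logic, and it should stay that way instead of being threaded through infrastructure code:

```python
# Illustrative sketch only: a toy mortgage cashflow projection.
def monthly_payment(balance, annual_rate, months_remaining):
    """Standard fixed-rate amortization formula."""
    r = annual_rate / 12.0
    if r == 0:
        return balance / months_remaining
    return balance * r / (1.0 - (1.0 + r) ** -months_remaining)

def project_cashflows(balance, annual_rate, months, rate_modification=None):
    """Project payments month by month. `rate_modification = (month, rate)`
    models a judge cutting the borrower's rate mid-loan -- the few-line
    business change that was so painful to thread through 500,000 lines
    of infrastructure code."""
    cashflows = []
    for month in range(months):
        if rate_modification and month == rate_modification[0]:
            annual_rate = rate_modification[1]  # the judge lowers the rate
        payment = monthly_payment(balance, annual_rate, months - month)
        interest = balance * annual_rate / 12.0
        balance -= payment - interest
        cashflows.append(payment)
    return cashflows

# A judge cuts the rate from 6% to 4% in month 12:
flows = project_cashflows(300_000, 0.06, 360, rate_modification=(12, 0.04))
```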
My credit crisis experience convinced me that we could dramatically amplify the value of distributed computing if only there were a way to avoid painstakingly hand-coding the infrastructure logic. This realization, combined with the emergence of inexpensive cloud computing on Amazon Elastic Compute Cloud (Amazon EC2), quickly led me to found Ufora. Four years later, we’re working with some of the largest banks, insurers, credit card companies, and hedge funds to change the way data scientists analyze their data. So, what happened?
Algorithms Got “Smart”
After leaving the hedge fund, I witnessed a revolution in data modeling: the world was shifting from a manual, explicit framework for modeling complex problems to an automatic, implicit one. In other words, data scientists stopped explicitly telling computers what to do, and started telling them how to learn what they wanted them to do. IBM Watson and Google’s cat recognition are classic examples of this shift to an implicit learning approach.
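A minimal sketch of the shift, on a made-up credit risk example (the rule, the features, and the data are all illustrative):

```python
# Explicit: tell the computer exactly what to do -- a hand-written rule.
def is_risky_explicit(loan_to_value, credit_score):
    return loan_to_value > 0.8 and credit_score < 650

# Implicit: give the computer labeled examples and let it learn the rule.
from sklearn.linear_model import LogisticRegression

X = [[0.90, 600], [0.50, 720], [0.85, 640], [0.40, 780]]  # (LTV, score)
y = [1, 0, 1, 0]  # 1 = loan defaulted

model = LogisticRegression(max_iter=1000).fit(X, y)
print(model.predict([[0.70, 660]]))  # the learned rule, applied to new data
```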
But, while these approaches adopted state-of-the-art machine learning techniques to analyze the data, they relied on manually engineered infrastructure systems to implement those models: the same low-level infrastructure code I used to write, describing how to get thousands of machines to behave like a single giant machine. Stanford’s Andrew Ng famously discusses this challenge of relying on human engineering as data science problems scale. This manual approach might make sense for a big-budget, multi-year project like Watson, but most data scientists don’t have the time, capital, or expertise to execute massive engineering implementations that take years to complete.
And who wants to wait that long anyway? At Ufora, our design goal is “Coffee Time”: regardless of the data size or compute intensity, is this model going to compute in the time it takes me to stand up, walk to our office kitchen, deliberately brew myself a cup of Nespresso Caramelito, and return to my desk with a smile? In other words, we’re trying to make data science euphoric again by restoring the gratification and productivity that comes from rapid prototyping — something large datasets have all but destroyed.
Two Options for Scaling Your Machine Learning
As the Ufora team watched this machine learning revolution unfold, we started thinking about data scientists’ options for scaling these newly empowering learning models, which could seemingly do anything, and we were pretty underwhelmed. There were basically two options:
- Option 1 (Fixed Framework) imposes an infrastructure framework on the analysis (e.g., MapReduce). It’s great if you want to set up a scalable job that you’ll run repetitively and will rarely change (e.g., sorting); a sketch of what this feels like follows this list.
- Option 2 (Fixed Functionality) is the opposite: it pre-optimizes the infrastructure for scalability by rigidly constraining the analytics functionality. It works well if you want to run plain-vanilla models that don’t need modifying.
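Here’s a bare-bones imitation of the MapReduce style (generic Python, not any particular framework’s API). Even a one-line statistic has to be contorted into map and reduce phases:

```python
from functools import reduce

# Computing a mean, contorted into MapReduce's map/shuffle/reduce shape.
def mapper(record):
    # Emit (key, (sum, count)) so partial means combine associatively.
    return ("mean", (float(record), 1))

def reducer(pair_a, pair_b):
    (s1, c1), (s2, c2) = pair_a[1], pair_b[1]
    return ("mean", (s1 + s2, c1 + c2))

records = [3.0, 5.0, 7.0, 9.0]
key, (total, count) = reduce(reducer, map(mapper, records))
print(total / count)  # 6.0 -- one line of math, a page of scaffolding
```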
Unfortunately, both of these options force data scientists to make unpleasant trade-offs. Option 1 delivers scale, and sometimes speed, at the expense of ease of use and flexibility; it fails because it still burdens the data scientist with infrastructure constraints that are, at best, a major distraction from modeling. Option 2 restores some ease of use but severely limits flexibility and expressiveness, so it fails under the realities of messy data and domain-specific algorithmic improvements. What we really wanted was data scale, compute speed, ease of use, AND complete algorithmic flexibility. In essence, we wanted R, with the speed of compiled code, at terabyte scale: model anything under the sun with a little bit of code, and have the implementation happen automatically under the hood.
Machine Learning for Machine Learning, or “Smart Compute”
At Ufora, we like to say we’re building “machine learning for machine learning,” mostly because it confuses people, but also because we think it’s the future of Data Science 2.0. Our technology “watches” computations as they run, automatically infers things about your models, and makes “smart” decisions adaptively, on the fly. These “smart” decisions include taking any algorithm you can write and automatically multi-threading it to run in parallel across the available CPU cores on all machines (a handy trick when running in AWS). It means automatically distributing data to keep it in memory across all of your available machines. And it means adaptively reevaluating these decisions in real time while computations are running. For instance, if a machine fails during a computation, Ufora adaptively reallocates the workload until your algorithm finishes. This frees data scientists from ever thinking about a “DAG,” a “job,” or a “data chunk” again.
We call this machine learning for machine learning because it mimics the implicit approach that has been so successful in data analysis and applies it to infrastructure engineering. Rather than requiring you to explicitly describe machine interactions for every model you write, Ufora uses machine learning to learn how to optimally distribute and parallelize your models while they’re running, automatically. The result is to turn your browser into a supercomputer with invisible machine boundaries. It looks and acts like one giant machine with as many processors and as much memory as you need. We call this Smart Compute.
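From the data scientist’s side, the goal is for scalable code to look like ordinary single-machine code. The snippet below is a hypothetical illustration of that idea, not Ufora’s actual API:

```python
# Hypothetical illustration: the user writes a plain, serial-looking loop.
def score(balance):
    # Stand-in for arbitrary, much heavier user-written model logic.
    return sum(balance * 0.01 * (0.99 ** m) for m in range(360))

loans = [100_000.0 + i for i in range(10_000)]

# No DAGs, jobs, or data chunks declared anywhere. A runtime that watches
# execution can observe that the iterations are independent and fan them
# out across every core on every available machine.
results = [score(b) for b in loans]
```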
Smart Compute at Work: Linear Algebra on a 1Mx1M Dense Matrix
How does Smart Compute work in practice? Well, we recently got a request to run a conjugate gradient solver on a 1 million by 1 million dense matrix (that’s right, all 1 trillion entries!). And the best part was that this was just the test case for running a 10 million by 10 million matrix! (As we say around the office, we’ll burn that bridge when we get there.) Holding the matrix in memory would have been impractical for the 1Mx1M matrix (1 trillion double-precision entries is 8TB) and impossible for the 10Mx10M matrix (unless you know of a nearby machine with 800TB of memory), and writing to disk would’ve been impossibly slow.
For this model, each step required a matrix-vector multiply operation. In Ufora, you can create a matrix object that behaves like a matrix, but computes entries as needed rather than holding the entire matrix in memory. By using our “synthetic” matrix (we also have synthetic vectors that are useful for high-dimensional analysis problems), we effectively converted the problem into a compute problem, which is handy because Amazon has a lot of multi-core machines lying around. But, being a startup, we didn’t really want to pay for thousands of processors using Amazon Reserved Instances. Instead, we fired up 1000 cores using Amazon Spot Instances, which are amazingly priced. We knew losing the machines to higher bidders wouldn’t be an issue because Ufora is fault-tolerant: when machines get pulled, the computation just keeps cranking away on the remaining hardware, and gets reallocated as machines come back online. To compute the multiply operation in under one hour, we chose between 500 and 1000 cores (that was a first!) using our Core Selection menu, and within a minute had the machines working in concert in AWS.
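Here’s a minimal sketch of the synthetic matrix idea in plain numpy, leaving out the distribution, parallelism, and fault tolerance Ufora layers on top; the entry formula and class are illustrative. Entries are computed from their indices on demand, so a matrix-vector multiply never materializes the full matrix:

```python
import numpy as np

class SyntheticMatrix:
    """A matrix whose entries are computed from (i, j) on demand, so an
    n-by-n matrix-vector multiply needs O(n) memory instead of O(n^2)."""

    def __init__(self, n, entry):
        self.n = n
        self.entry = entry  # entry(i, cols) -> row of values

    def matvec(self, x):
        cols = np.arange(self.n)
        result = np.zeros(self.n)
        for i in range(self.n):
            # Build row i on the fly, use it, and throw it away.
            result[i] = self.entry(i, cols) @ x
        return result

# Illustrative kernel-style entries; never stored, only computed.
n = 1_000
A = SyntheticMatrix(n, lambda i, j: 1.0 / (1.0 + np.abs(i - j)))
y = A.matvec(np.ones(n))  # holds at most one row of A at a time
```

Because each row is independent, the loop parallelizes trivially, which is exactly the structure that lets the multiply spread across hundreds of cores.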
In the end, we were able to run the matrix-vector multiply in about 45 minutes.
- Total lines of code: 50.
- Amazon’s charge to Ufora for providing more than $1M worth of hardware? Approximately $10.
Leveraging the Latest
It’s well known that cloud computing has drastically lowered the cost of doing business, but even in Ufora’s short lifetime, we’ve been blown away by the leverage that it and complementary products create for us. We’re moving to Docker to modularize our service architecture, and to Chef (via AWS OpsWorks) for automated, testable software configuration management. We rely heavily on AWS CloudFormation for streamlined hardware provisioning and cluster management. Spot Instances allow us to run tests on our own software development pipeline 24/7, and now Ufora’s ability to orchestrate hundreds of Spot Instances simultaneously for ad hoc data analysis has radically changed our conception of what’s possible.
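For a flavor of how scriptable this capacity is, here’s a minimal sketch of bidding for Spot capacity with boto3, the AWS SDK for Python (the AMI ID, price, and counts are placeholders, and this isn’t Ufora’s provisioning code):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Bid for a block of Spot capacity; every value here is a placeholder.
response = ec2.request_spot_instances(
    SpotPrice="0.10",        # max price per instance-hour, in USD
    InstanceCount=32,        # e.g., 32 x 32-vCPU boxes ~ 1,000 cores
    LaunchSpecification={
        "ImageId": "ami-xxxxxxxx",     # placeholder AMI
        "InstanceType": "c3.8xlarge",  # 32 vCPUs per instance
    },
)

# Losing instances to higher bidders just shrinks the pool; a
# fault-tolerant platform reallocates the work automatically.
for req in response["SpotInstanceRequests"]:
    print(req["SpotInstanceRequestId"], req["State"])
```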
In addition to nerding out (always mandatory), the reason we’re so excited about these infrastructure advances is that they allow people to focus on what actually matters (imagine how unproductive we’d all be if we had to engineer an engine every time we wanted to drive a car). At Ufora, we believe bright people are a scarce resource, and their time is precious. When they’re unencumbered by technical limitations they don’t care about, they solve important problems in inspiringly creative ways. Our mission is to empower those people with incredible technology, and let them drive.