Brandon Allgood, co-founder and Director of Computational Science at Numerate, explains that prior to the founding of Numerate, the team built and maintained a number of Beowulf-style clusters in-house. Now, he says, “We only have an IBM ~100 core bladecenter and a 48-core shared memory machine. The bladecenter is used for development and testing. We also maintain a number of database servers in house.”
Allgood notes that after their experience with building and paying to maintain large clusters, the value of Amazon Web Services (AWS) to Numerate was “obvious.” He says, “The cost of maintaining a large cluster is not compelling unless you can maintain a high percentage of utilization over time. We were never able to achieve such an average workload.” While the Numerate team looked carefully at Sun Grid and other compute rental options from groups such as IBM, they found that the other offerings were either too expensive or too inflexible for their spiky use cases. Allgood explains, “When we started using the cloud in 2007, AWS was, in our view, the only real choice for a cloud solution, especially from a security point of view. AWS still has a significant edge in terms of cost, and in terms of our ability to control both the setup and the security of our compute. Amazon has continued to innovate at an incredible pace to keep our costs low and create products that stay ahead of our needs.”
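Allgood's point about utilization can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the cost figures are placeholder units rather than Numerate's or Amazon's actual numbers. It assumes an owned cluster costs the same per core-hour whether or not it is busy, so the cost of each core-hour actually used rises as utilization falls.

```python
def effective_cost_per_used_core_hour(owned_cost_per_core_hour, utilization):
    """Cost of each core-hour actually used on an owned cluster.

    An idle core still costs money, so the fixed cost is spread over
    only the fraction of hours that do useful work.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    return owned_cost_per_core_hour / utilization


def break_even_utilization(owned_cost_per_core_hour, rented_price_per_core_hour):
    """Utilization above which owning is cheaper than renting by the hour."""
    return owned_cost_per_core_hour / rented_price_per_core_hour


# Placeholder figures in arbitrary cost units (assumptions, not real prices).
OWNED = 1.0    # owned cluster: cost per core-hour, busy or idle
RENTED = 4.0   # rented capacity: price per core-hour, paid only when used

threshold = break_even_utilization(OWNED, RENTED)
spiky_cost = effective_cost_per_used_core_hour(OWNED, 0.125)
```

With these placeholder figures, owning only pays off above 25% average utilization; a spiky workload that keeps the cluster busy 12.5% of the time pays twice the rental price for every core-hour it actually uses.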
Numerate has incorporated Amazon Elastic Compute Cloud (Amazon EC2) as a production computational cluster and Amazon Simple Storage Service (Amazon S3) as cache storage. Allgood explains: “We have built a proprietary Java-based drug design platform architected to run on a cloud-based infrastructure to solve large-scale scientific compute problems. Unlike many MapReduce workloads, which tend to be IO-bound, our workloads are typically compute-bound and therefore require as many cores as possible with the minimal memory footprint of a gigabyte per core. Such computational requirements make Amazon EC2 and Amazon S3 perfect for our use.”
As is common for large batch processing workflows, Numerate uses Sun Grid Engine (recently renamed Oracle Grid Engine), an open source framework, to schedule its compute jobs across hundreds of instances. To secure its instances, Numerate uses OpenVPN to create a secure virtual network and has modified the iptables rules on its default Amazon Machine Images (AMIs). Numerate stores processing results in Amazon S3, which they secure using Perl and Apache tools including the JetS3t library. The remainder of the automation is done using Perl and Bash scripts to initialize environments correctly and move data.
Allgood adds, “In February 2011 alone, we have reduced our instance costs by around 50% by leveraging Amazon EC2’s Spot Instances. We were comfortable bidding below the On-Demand price to ensure we were benefiting from significant cost savings, and we typically have gotten all of the instances we need.” Numerate developed an in-house fault-tolerant solution to manage potential node termination, enabling them to restart interrupted instances automatically. “For us, there were minimal changes to our application, on the order of five engineering days, and we were able to get significant cost savings immediately. The main changes we made included minor workflow changes to place persistent Spot bids whenever we added nodes to the cluster and the addition of a monitoring system to autonomously detect and handle AMIs coming up and going down.”
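Numerate's fault-tolerant solution is proprietary and not described in detail, so the following is only a minimal sketch of the general idea: when a Spot node disappears, its unfinished work goes back on the queue, and when a node appears, it picks up the next task. The class and method names are hypothetical, and no AWS or Grid Engine calls are made; a real monitor would invoke `node_up` and `node_down` from whatever mechanism detects instances starting and stopping.

```python
from collections import deque


class SpotAwareQueue:
    """Toy work queue that re-queues a task when its node is terminated."""

    def __init__(self, tasks):
        self.pending = deque(tasks)  # tasks waiting for a node
        self.running = {}            # node_id -> task currently on that node
        self.done = []               # tasks that ran to completion

    def node_up(self, node_id):
        """A node joined the cluster: give it the next pending task, if any."""
        if node_id not in self.running and self.pending:
            self.running[node_id] = self.pending.popleft()

    def node_down(self, node_id):
        """A node was terminated: put its unfinished task back at the front."""
        task = self.running.pop(node_id, None)
        if task is not None:
            self.pending.appendleft(task)

    def task_finished(self, node_id):
        """A node completed its task: record it, then assign the next one."""
        task = self.running.pop(node_id, None)
        if task is not None:
            self.done.append(task)
            self.node_up(node_id)


queue = SpotAwareQueue(["job-a", "job-b", "job-c"])
queue.node_up("node-1")        # node-1 takes job-a
queue.node_up("node-2")        # node-2 takes job-b
queue.node_down("node-1")      # Spot termination: job-a is re-queued
while queue.running:
    queue.task_finished("node-2")  # node-2 works through the remaining jobs
```

In this run the interrupted task is not lost: it returns to the front of the queue and completes on the surviving node, which is the property that makes bidding below the On-Demand price tolerable for batch work.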
Nigel Duffy, Numerate’s CTO, notes that using AWS allows the Numerate team to spend their time doing higher level tasks: “Amazon EC2, in particular, allows us to be more agile and responsive to our partners’ needs, and allows us to deliver on time even in the face of unforeseen challenges.”
Allgood offers some security advice to other developers. Regarding isolation, he says, “On Amazon EC2 we only use eight core high-CPU XL instances. This lowers the risk of a side-channel attack and has the added advantage of getting the most out of the network card.”
Numerate employs access control methods for further security, using encryption wherever possible and practical on Amazon EC2 and Amazon S3.
Another lesson learned, according to Allgood, is, “Do not underestimate the impact of the latency between your organization and the Amazon EC2 facility.”
Numerate is looking at incorporating Amazon Elastic Block Store (Amazon EBS) for future products and capabilities. They are also excited to make use of some of the new features of Amazon EC2, such as having the ability to use their own key-pairs.
The partnership between Numerate and AWS has been invaluable. “Our business is truly enabled by AWS,” says Duffy. “The capital efficiency that AWS offers allows us to share risk with our partners. The flexibility allows us to deliver the results we promise. Combining these enables us to pursue a business with an unlimited upside.”
To learn more, visit http://www.numerate.com/ .
Published November 2010. Updated May 2011.