The Belle Experiment is a particle physics experiment—also known as a High Energy Physics (HEP) experiment—taking place at the KEK B-Factory particle collider in Japan through an international collaboration of 400 physicists and engineers for the purpose of studying matter and anti-matter asymmetries. The experiment is by its nature very compute-intensive and therefore makes for a perfect use case for cloud computing.
A joint team from the Universities of Barcelona and Melbourne is using the DIRAC distributed computing software framework to define and steer the execution of a sizable part of the Belle Experiment's simulation needs for its data reprocessing, using computing resources on Amazon Elastic Compute Cloud (Amazon EC2). The team is using Amazon EC2 as a supplement to its existing large-scale grid computing infrastructure.
In the current exercise, nearly all of DIRAC’s components are running on Virtual Machines, with the results of the cloud-based computation being passed to the Belle experiment’s non-cloud infrastructure (see architecture diagram for details). The management of these tasks is performed from remote Virtual Machines—which are also capable of being migrated to EC2—while the CPU power is provided by “ad hoc” Amazon Machine Images (AMIs) running on AWS.
The CPU power afforded by Amazon EC2 supplements the existing resources from the Belle groups and the Worldwide LHC Computing Grid (WLCG) infrastructure. With its AWS deployment, the team aims to demonstrate the potential for successful integration of AWS with its existing large-scale grid computing resources.
As the Barcelona group shares with AWS, “Amazon EC2 supported the ability to run virtual machine images of our own creation—allowing us to use the cloud as infrastructure. The large community of users surrounding EC2 was instrumental in supporting development efforts, and this was built in turn on a foundation of support from Amazon staff themselves.”
Using AWS, the team has been able to benefit from the economies of the cloud. In particular, the team reports that measuring the cost of using AWS to absorb peaks in demand for computing capacity has “worked especially well.” In this setup, AWS is transparently integrated with other computing resource providers, such as local clusters, grid resources, and other cloud infrastructures.
The underlying elasticity of AWS resources enabled the team to run an initial low-scale phase with only 15-20 EC2 instances for a week-long period and then, in a second phase, to scale capacity from 20 to 250 EC2 instances in the space of four hours. The economy of using this compute-as-a-utility model only for the duration needed has enabled the team to rely increasingly on the AWS cloud computing model.
Moving to Amazon EC2 Spot Instances enabled the University of Melbourne to save 56% per instance hour with negligible changes to their application. Spot Instances allow customers to bid on unused Amazon EC2 capacity and run those instances for as long as their bid exceeds the current Spot Price. When the team from the Universities of Barcelona and Melbourne began looking at Spot Instances, they observed that the Spot Price remained “low and steady.” They chose a simple bidding strategy, selecting a maximum bid price just slightly above the current Spot Price. This strategy enabled them to reduce their cost to an average of $0.21 per hour for each instance versus $0.48 for equivalent On-Demand Instances. Because this maximum bid was low (only slightly above the current Spot Price), the team accepted a greater chance of being outbid and hence of having their Spot Instances interrupted.
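The arithmetic behind these figures, and the bidding rule described above, can be sketched in a few lines of Python. This is an illustration only: the function names and the 5% margin are assumptions made for the example, not the team's actual tooling or bid values, and the code does not call any AWS API.

```python
def savings_fraction(on_demand_price, spot_price):
    """Fraction of the on-demand hourly price saved by paying the spot price."""
    return 1 - spot_price / on_demand_price


def max_bid(current_spot_price, margin=0.05):
    """Hypothetical 'simple' strategy: bid slightly above the current spot price.

    The 5% default margin is an assumption for illustration only.
    """
    return current_spot_price * (1 + margin)


def keeps_running(bid, current_spot_price):
    """An instance keeps running only while the bid exceeds the spot price."""
    return bid > current_spot_price


# Prices quoted in the case study (USD per instance hour).
saving = savings_fraction(on_demand_price=0.48, spot_price=0.21)
print(f"Saving per instance hour: {saving:.0%}")
```

With the quoted prices, the saving works out to 1 - 0.21 / 0.48 = 0.5625, which matches the 56% reported. A bid set only slightly above the current price stops satisfying `keeps_running` after a small upward move in the Spot Price, which is the interruption risk the team accepted.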
According to Professor Sevior, the team's application was easily adapted to run on Spot Instances. The application was originally designed for grid computing, a distributed and geographically dispersed environment in which jobs may fail. It was built on gLite middleware, which securely moves data between instances, and on the DIRAC framework, which manages jobs in a fault-tolerant manner. If instances were interrupted, the team would simply re-run any individual jobs that had not completed for any reason, such as a grid node going offline or an Amazon EC2 Spot Instance being terminated. Professor Sevior concludes, "Spot instances are a great way to engineer Grid Computing workloads to run within a pre-defined cost or within a specific timeframe while still getting the benefits of Amazon EC2 like elasticity."
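The re-run approach described above can be illustrated with a minimal Python sketch. It is not DIRAC's or gLite's actual API: `JobInterrupted`, `run_until_complete`, and the `max_attempts` limit are hypothetical names and parameters chosen for this example, and `run_job` stands in for whatever actually executes a job on a grid node or Spot Instance.

```python
from collections import deque


class JobInterrupted(Exception):
    """Hypothetical signal that a job's node or spot instance went away."""


def run_until_complete(job_ids, run_job, max_attempts=5):
    """Run every job, re-queueing any job that is interrupted.

    run_job is a caller-supplied callable (an assumption of this sketch) that
    returns a result or raises JobInterrupted. Jobs still failing after
    max_attempts tries are reported instead of being retried forever.
    """
    pending = deque((job_id, 1) for job_id in job_ids)
    results = {}
    failed = []
    while pending:
        job_id, attempt = pending.popleft()
        try:
            results[job_id] = run_job(job_id)
        except JobInterrupted:
            if attempt < max_attempts:
                # Put the job back at the end of the queue for another try.
                pending.append((job_id, attempt + 1))
            else:
                failed.append(job_id)
    return results, failed
```

Because each job is independent and can be repeated safely, losing an instance costs only the work of the jobs that were running on it, which is why an interruption-prone resource such as a Spot Instance fits this kind of workload.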
The Melbourne group concludes, “The complete openness and scalability of the Amazon Web Services EC2 service allowed us to easily deploy our very complex application on a large scale supercomputer. We particularly liked the flexibility to build exactly the virtual machines we needed and to transfer data to and from every instance we created. This flexibility and openness allowed us to rapidly deploy the sophisticated collection of programs needed for the Belle experiment and to integrate the results into the world-wide grid. Consequently we were able to accelerate our joint effort with researchers across the world to build exactly the application we needed.”
The team is very excited at the prospect of using AWS for future projects. “We’re very interested in seeing how far EC2 scales, and in innovative ways to save money using cloud computing.”
To learn more, visit http://belle01.ecm.ub.es/ or contact dirac.project@gmail.com.