AWS provides a comprehensive suite of tools for managing scientific computing workloads, including Amazon Elastic Compute Cloud (Amazon EC2) for scaling compute capacity up and down as needed, Amazon Simple Storage Service (Amazon S3) for storing data, and Amazon Elastic MapReduce (Amazon EMR) for managing your Hadoop-based workflows. Amazon EC2 Spot Instances in particular are a pricing model targeted at batch processing use cases, giving you the flexibility of ad hoc provisioning along with significant price savings over other pricing models.
Start Launching Amazon EC2 Spot Instances with AWS CloudFormation
You can now use AWS CloudFormation templates to create and manage a collection of related AWS resources, including Spot Instances. To get you started, we are providing three new CloudFormation templates that are optimized to save you money and manage interruption:
Manage your asynchronous processing using Amazon SQS and Auto Scaling. Launch now!
Get Notifications about your Spot Instances
This code tutorial and sample enable you to use Amazon SNS notifications to be alerted to changes in the state of your Amazon EC2 instances, current Spot Instance requests, and Spot Prices within a particular Region. By leveraging this new code sample, you can set up your applications running on Spot Instances to more easily manage the potential for interruption.
Scientific researchers have complex computational workloads ranging from DNA sequence analysis to particle physics simulations. Regardless of the application, one major issue affects them all: procuring and provisioning cost-effective computation cycles. In typical scientific computing environments, there is a long queue to access shared infrastructure and purchasing dedicated, purpose-built hardware takes time and considerable investment.
Whether you are a doctoral student writing a thesis or a pharmaceutical company performing groundbreaking drug research, you should consider the following questions when evaluating where to run your applications:
How quickly can I start running my applications?
Can I parallelize my work to get it done faster?
What level of elasticity (scaling up and down) is required for my application?
How can I engineer my application to minimize the cost?
Spot Instances enable you to bid on unused Amazon EC2 capacity at whatever price you choose. Customers whose bids exceed the Spot Price gain access to the available Spot Instances, which run for as long as the bid exceeds the Spot Price. Historically, the Spot Price has been 50% to 93% lower than the On-Demand price. Spot Instances work with other services like Amazon S3 and Amazon EMR to help you manage all of your compute needs.
Some example use cases that work well with Spot Instances include:
Genome Sequence Analysis and Data Distribution
Particle Physics Simulations
Artificial Intelligence Research
Scientific Collaboration and Centralized Data Management
Benefits at a Glance
Easy to use. AWS is designed to minimize the heavy lifting of setting up and managing your own IT infrastructure. You can get started with AWS by leveraging the AWS Management Console, a variety of third-party management tools, or the well-documented AWS web service APIs to manage and maintain your cloud infrastructure.
Cost-Effective. You pay only for the compute power, storage, and other resources you use, with no long-term contracts or up-front commitments.
Flexible. AWS enables you to select the operating system, programming language, software tools, application platform, and other services you need. This eases the migration process for existing applications while preserving options to build new ones.
Elastic. AWS enables you to increase or decrease capacity within minutes without waiting in line to get the resources you need. You can provision one, hundreds or even thousands of server instances - allowing you to speed up workloads by adding more instances and shut them down when you are finished.
Sharing and collaboration. AWS creates a common space where you and your collaborators can share data, results, and methods.
Secure. AWS utilizes an end-to-end approach to secure and harden our infrastructure, including physical, operational, and software measures. For more information, see the AWS Security Center.
Potential Cost Savings
Spot Instances enable you to bid for unused Amazon EC2 capacity. Instances are charged the Spot Price, which is set by Amazon EC2 and fluctuates periodically depending on the supply of and demand for Spot Instance capacity. To use Spot Instances, you place a Spot Instance request, specifying the instance type, the Region desired, the number of Spot Instances you want to run, and the maximum price you are willing to pay per instance hour. To determine how that maximum price compares to past Spot Prices, the Spot Price history is available via the Amazon EC2 API and the AWS Management Console. If your maximum price bid exceeds the current Spot Price, your request is fulfilled and your instances will run until either you choose to terminate them or the Spot Price increases above your maximum price (whichever is sooner).
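The fulfillment and interruption rules described above can be sketched as a small simulation. This is only an illustration in plain Python, not a call to the Amazon EC2 API: the function name `simulate_spot_request`, the hourly price list, and the assumption that the Spot Price changes exactly once per hour are all hypothetical.

```python
from typing import List, Optional, Tuple


def simulate_spot_request(
    max_price: float, hourly_spot_prices: List[float]
) -> Tuple[Optional[int], Optional[int], float]:
    """Return (start_hour, end_hour, total_charge) for a single instance.

    Assumption: one Spot Price per hour. The request is fulfilled in the
    first hour where max_price exceeds the Spot Price, and the instance is
    interrupted in the first later hour where the Spot Price rises above
    max_price.
    """
    start: Optional[int] = None
    charge = 0.0
    for hour, price in enumerate(hourly_spot_prices):
        if start is None:
            if max_price > price:
                start = hour          # request fulfilled
            else:
                continue              # still waiting for capacity at our price
        elif price > max_price:
            return start, hour, charge  # interrupted at the start of this hour
        charge += price               # charged the Spot Price, not max_price
    end = len(hourly_spot_prices) if start is not None else None
    return start, end, charge


# Hypothetical price series, in dollars per instance hour.
prices = [0.30, 0.10, 0.12, 0.08, 0.25, 0.09]
start, end, charge = simulate_spot_request(0.20, prices)
```

With a maximum price of 0.20, this instance starts in hour 1, is interrupted in hour 4, and is charged for three hours at the Spot Price in effect during each hour rather than at 0.20.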
The following table displays the Spot Price by instance type for the lowest priced Availability Zone (updated every 5 minutes).
Starting an Instance
Spot Instances can be requested using the AWS Management Console or Amazon EC2 APIs. To start with the AWS Management Console:
Click on “Spot Requests” in the navigation pane on the left.
Click on “Pricing History” to open a view of historical pricing selectable by instance type. This will help you choose a maximum price for your request. Pricing shown is specific to the Availability Zone selected. If no Availability Zone is selected, you will see the prices for each Availability Zone in the Region.
Click on “Request Spot Instances” and proceed through the Launch Instance Wizard process, choosing an AMI and instance type. Enter the number of Spot Instances you would like to request, your maximum price and whether the request is persistent or not. After choosing your key pair and security group(s), you are ready to submit your Spot Instance request.
Building or migrating an application to run on Spot Instances is easy. The sections below outline how to build, migrate, and test applications to be used with Spot Instances.
Building a New Application
If you have the ability to architect your application from scratch, we recommend that you spend some time reading the Common Architectures and Best Practices section of this web page that outlines many of the architectures we have seen other customers use with Spot Instances in the past.
Migrating an Existing Application
Many applications are already architected to be fault tolerant, so migrating your application to run on Spot instances may be fairly easy. During the migration process, we recommend integrating the following best practices:
Track when Spot Instances Start and Terminate: Spot Instances start asynchronously and can be interrupted when the Spot price exceeds your bid price. So, it is important to track the state of your bids and instances. The simplest way to know the current status of your Spot Instances is to monitor your Spot requests and running instances via the AWS Management Console or Amazon EC2 API.
Choose the Maximum Price for your Instance: Remember that the maximum price that you submit as part of your request is not necessarily what you will pay per hour, but is rather the maximum you would be willing to pay to keep it running. Use the Spot Price history via the AWS Management Console or the Amazon EC2 API to help you set a maximum price.
Ensure your Application is Fault Tolerant: Because Spot Instances can be terminated without warning, it is important to build your applications in a way that allows you to make progress even if your application is interrupted. There are many ways to accomplish this, two of which include adding checkpoints to your application and splitting your work into small increments. Using Amazon EBS volumes to store your data is one easy way to protect your data.
Testing Your Setup
When using Spot Instances, it is important to make sure that your application is fault tolerant and will correctly handle interruptions. While we attempt to cleanly terminate your instances, your application should be prepared to deal with an immediate shutdown. You can test your application by running an On-Demand Instance and then terminating it suddenly. This can help you to determine whether or not your application is sufficiently fault tolerant and is able to handle unexpected interruptions.
Because Spot Instances can be terminated with no warning, it is important to build your applications in a way that allows you to make progress even if your application is interrupted. There are many ways to accomplish this, two of which are splitting your work into small increments (via Grid, Hadoop-based, or queue-based architectures) and adding checkpoints to your application. The sections below provide an overview of several commonly used architectures leveraged by existing Spot customers.
Map Reduce Architecture
Apache Hadoop is an open source software framework that supports data-intensive distributed applications. It enables applications to work with thousands of nodes to process petabytes of data through two main components: (1) a fault‐tolerant, distributed storage system and (2) a technique called MapReduce that supports efficient, exhaustive analysis over massive distributed data sets. Hadoop is developed for commodity hardware, can store data with or without schema, and provides linear scalability at petabyte scale. Customers such as Backtype and Fliptop use Amazon Elastic MapReduce, a managed Hadoop service that simplifies Hadoop cluster provisioning, configuration, and management, along with Spot Instances to significantly reduce the cost of their large scale data processing.
Amazon Elastic MapReduce makes it easy to mix Spot Instances with On-Demand or Reserved Instances within the same data processing cluster. This reduces cost and accelerates processing time while removing the risk of cluster failure due to Spot market fluctuations. If Spot Instances are interrupted due to a change in the Spot Price, tasks running on those instances are simply added back to the data processing queue to be handled by the remaining On-Demand instances. Customers can either continue data processing with the reduced cluster size or dynamically add additional instances to the cluster to replace the interrupted instances.
Several example use cases that are ideal for using Spot Instances with Elastic MapReduce include applications in which the customer can scale out to increase the speed of execution, in which flexibility in the time of completion can be used to gain significant cost savings, persistent Hadoop clusters in which significant fluctuations in load require frequent resizing, and for reducing the cost of Hadoop application testing.
As an example, imagine we have a job that is typically executed with 4 On-Demand instances over 14 hours, which would normally cost $28. Now, imagine that we add 5 Spot instances (because the job scales non-linearly) and the job is able to execute in 7 hours. The total cost to run the job would now be $15.75 ($14.00 for the On-Demand instances plus $1.75 for the Spot instances), assuming that the Spot price was 90% less than the On-Demand price. By adding Spot instances, there would be a 50% savings on time and a 44% savings on cost.
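The arithmetic in this example can be checked with a few lines of Python. The On-Demand rate of $0.50 per instance hour is not stated in the text; it is derived here from the $28 figure (4 instances for 14 hours), and the helper name `job_cost` is hypothetical.

```python
def job_cost(on_demand_nodes: int, spot_nodes: int, hours: float,
             on_demand_rate: float, spot_discount: float) -> float:
    """Total cost of a job that runs a mix of On-Demand and Spot instances."""
    spot_rate = on_demand_rate * (1 - spot_discount)
    return hours * (on_demand_nodes * on_demand_rate + spot_nodes * spot_rate)


# Assumption: the hourly rate implied by $28 for 4 instances over 14 hours.
ON_DEMAND_RATE = 28 / (4 * 14)   # $0.50 per instance hour

baseline = job_cost(4, 0, 14, ON_DEMAND_RATE, 0.90)   # On-Demand only
with_spot = job_cost(4, 5, 7, ON_DEMAND_RATE, 0.90)   # 4 On-Demand + 5 Spot

cost_savings = 1 - with_spot / baseline
time_savings = 1 - 7 / 14

print(f"baseline:  ${baseline:.2f}")
print(f"with Spot: ${with_spot:.2f}")
print(f"cost savings: {cost_savings:.0%}, time savings: {time_savings:.0%}")
```

The cost savings come out to 43.75%, which the text rounds to 44%.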
Grids are a form of distributed computing that enables a user to leverage multiple instances to perform parallel computations. Customers like Numerate, Scribd, and University of Barcelona/University of Melbourne use Grid Computing with Spot Instances because this type of architecture can take advantage of the built-in elasticity and low prices of Spot Instances to get work done faster and more cost-effectively.
To get started, a user will break down their work into discrete units called jobs, and then submit that work to a “master node”. These jobs will be queued up, and a process called a “scheduler” will distribute that work out to other instances in the grid, called “worker nodes”. Once the result is computed by the worker node, the master node is notified, and the worker node can take the next operation from the queue. If the job fails or the instance is interrupted, the job will automatically be re-queued by the scheduler process.
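The re-queue behavior described above can be sketched in a few lines. This is a single-process illustration, not the behavior of any particular grid scheduler: `run_grid`, the `Interrupted` exception, and the `max_attempts` limit are hypothetical names and choices introduced for the example.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


class Interrupted(Exception):
    """Raised by a worker when its instance goes away in the middle of a job."""


def run_grid(jobs: List[int], worker: Callable[[int], int],
             max_attempts: int = 5) -> Dict[int, int]:
    """Hand queued jobs to a worker and re-queue any job that is interrupted."""
    queue: Deque[int] = deque(jobs)
    attempts: Dict[int, int] = {}
    results: Dict[int, int] = {}
    while queue:
        job = queue.popleft()
        attempts[job] = attempts.get(job, 0) + 1
        try:
            results[job] = worker(job)      # worker reports back to the master
        except Interrupted:
            if attempts[job] >= max_attempts:
                raise                       # give up rather than loop forever
            queue.append(job)               # scheduler re-queues the job
    return results


# A worker that is interrupted the first time it sees job 2.
seen = set()


def flaky_square(job: int) -> int:
    if job == 2 and job not in seen:
        seen.add(job)
        raise Interrupted()
    return job * job


results = run_grid([1, 2, 3], flaky_square)
```

Job 2 is interrupted once, goes to the back of the queue, and completes on its second attempt, so every job ends up with a result.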
As you work to architect your application, it is important to choose the appropriate amount of work to be included in your job. We recommend breaking your jobs down into a logical grouping based on the time it would take to process. Typically, you will want to create a workload size less than an hour, so that if you have to re-process the workload it doesn’t cost you additional money (you don’t pay for the hour if we interrupt your instance).
Many customers use a Grid scheduler like Oracle Grid Engine or UniCloud to set up a cluster. If you have long-running workloads, the best practice is to run the master node on On-Demand or Reserved Instances, and run the worker nodes on Spot Instances or a mixture of On-Demand, Reserved, and Spot Instances. Alternatively, if you have a workload that takes less than an hour or you are running a test environment, you may want to run all of your instances on Spot. No matter the setup, we recommend that you create a script to automatically re-add instances that may be interrupted. Some existing tools, like StarCluster, can help you manage this process.
Many customers, like DNAnexus and Litmus, have built queue-based architectures, with the ability to handle the potential for a job to fail. Many of these types of applications can easily be extended to leverage Spot by integrating the Spot provisioning APIs.
As an example, imagine an application that leverages Amazon EC2 Spot Instances and Amazon SQS. The application has three SQS queues: To-Process, Processed, and Exception. Based on the depth of the queue, the master node will use the Spot provisioning APIs to scale Spot Instance worker nodes up or down. Alternatively, the Spot Instance can be started as a persistent bid, so that if it fails, it will automatically be restarted. Once a Spot Instance is started, the application will determine which queues to leverage by reading from the user data passed into the instance at launch time or from configuration stored remotely in Amazon SimpleDB or Amazon S3. Worker nodes running on Spot Instances will then select the next job from the To-Process queue and lock the job. Locking the job will prevent other worker nodes from trying to compute that same job until a specified amount of time passes or the job is completely processed. If the job is successfully processed, the worker node will post a reply with the results to the Processed queue, where the master node can perform any additional logic. Alternatively, if the job fails to be processed because it takes too long or the worker node is interrupted, the job will be moved to the Exception queue, enabling the master node to perform any additional specialized logic, like re-queuing the job. If the job failed because the Spot Instance went away, the master node can also choose to start a new Spot Instance if necessary.
When using a queue-based approach, ensure that processing a unit of work is idempotent (can be safely processed multiple times) to ensure that resuming an interrupted task doesn’t cause problems.
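The job locking and idempotency described above can be illustrated with an in-memory stand-in for the To-Process queue. This sketch does not call Amazon SQS: the `LockingQueue` class, its `lock_seconds` parameter, and the explicit `now` timestamps are hypothetical, and the Exception queue is left out for brevity.

```python
from typing import Dict, Optional, Tuple


class LockingQueue:
    """In-memory stand-in for a queue whose messages are locked on receipt."""

    def __init__(self, lock_seconds: float) -> None:
        self.lock_seconds = lock_seconds
        self._messages: Dict[int, str] = {}         # message id -> job body
        self._locked_until: Dict[int, float] = {}   # message id -> unlock time
        self._next_id = 0

    def send(self, body: str) -> int:
        msg_id = self._next_id
        self._next_id += 1
        self._messages[msg_id] = body
        self._locked_until[msg_id] = 0.0
        return msg_id

    def receive(self, now: float) -> Optional[Tuple[int, str]]:
        """Return and lock the first job whose lock has expired, if any."""
        for msg_id, body in self._messages.items():
            if self._locked_until[msg_id] <= now:
                self._locked_until[msg_id] = now + self.lock_seconds
                return msg_id, body
        return None

    def delete(self, msg_id: int) -> None:
        """Remove a job once it has been completely processed."""
        self._messages.pop(msg_id, None)
        self._locked_until.pop(msg_id, None)

    def __len__(self) -> int:
        return len(self._messages)


to_process = LockingQueue(lock_seconds=30)
processed: Dict[str, str] = {}   # keyed by job, so repeating a job is harmless
to_process.send("job-a")

# Worker 1 takes the job at t=0 and is interrupted before it can finish.
first = to_process.receive(now=0)
# Worker 2 polls at t=10: the job is still locked, so nothing is returned.
while_locked = to_process.receive(now=10)
# At t=31 the lock has expired and worker 2 picks up the same job.
second = to_process.receive(now=31)
if second is not None:
    msg_id, body = second
    processed[body] = body.upper()   # idempotent write: same key, same value
    to_process.delete(msg_id)
```

Because the result is written under a key derived from the job itself, processing the job a second time after an interruption overwrites the same entry instead of creating a duplicate.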
Depending on fluctuations in the Spot Price caused by changes in the supply or demand for Spot capacity, Spot Instance requests may not be fulfilled immediately and may be terminated without warning. In order to protect your work from potential interruptions, we recommend inserting regular checkpoints to save your work periodically.
One way customers like BrowserMob manage this interruption is by checkpointing their data. The best practice is to choose the maximum amount of time you are willing to re-process and checkpoint at least that frequently.
There are multiple methods to checkpoint your application, including:
Amazon EBS: Customers map an additional Amazon EBS volume to their Spot instance, and output the state of their application to the volume on a regular basis. If you leverage this method, it is important to ensure your buffers are flushed on a regular basis, to ensure all state is on the Amazon EBS volume.
Amazon S3: Amazon S3 is a durable store where customers can write data. If your application allows you to output results in the form of separate files as you process data, you could use Amazon S3 to store your results. Then, you can just pass around the bucket URL to any process that needs to read the results.
Amazon RDS: If you need a structured data store, you can leverage Amazon RDS to store any of your results. Since Amazon RDS allows you to use MySQL or Oracle databases, you can set up your queries so that work is not committed until you explicitly issue a “commit” command. This method will ensure that you naturally roll back if the process is interrupted.
When using a checkpointing-based approach, ensure that your workload is idempotent between checkpoints so that your workload can be safely processed multiple times if you resume an interrupted task.
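A minimal checkpointing sketch, assuming the state is small enough to serialize as JSON and that a local file stands in for a mounted Amazon EBS volume. The function names, the `every` interval, and the `stop_after` parameter (used only to simulate an interruption) are hypothetical.

```python
import json
import os
import tempfile
from typing import List, Optional


def save_checkpoint(path: str, state: dict) -> None:
    """Write state to a temporary file, flush it, then swap it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())     # make sure buffers reach the volume
    os.replace(tmp_path, path)   # readers never see a half-written checkpoint


def load_checkpoint(path: str) -> dict:
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_index": 0, "total": 0}   # no checkpoint yet: start fresh


def sum_with_checkpoints(items: List[int], path: str, every: int = 2,
                         stop_after: Optional[int] = None) -> Optional[int]:
    """Sum items, checkpointing every `every` items; resume from the last one."""
    state = load_checkpoint(path)
    done_this_run = 0
    for i in range(state["next_index"], len(items)):
        if stop_after is not None and done_this_run == stop_after:
            return None   # simulated interruption: uncheckpointed work is lost
        state["total"] += items[i]
        state["next_index"] = i + 1
        done_this_run += 1
        if state["next_index"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]


with tempfile.TemporaryDirectory() as workdir:
    checkpoint = os.path.join(workdir, "checkpoint.json")
    data = [1, 2, 3, 4, 5]
    first = sum_with_checkpoints(data, checkpoint, every=2, stop_after=3)
    second = sum_with_checkpoints(data, checkpoint, every=2)
```

The first run is interrupted after three items, but only the first two were checkpointed. The second run resumes from that checkpoint, re-processes the third item, and still arrives at the correct total because the total and the position are always saved together.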
DNAnexus provides a unified system of data management and sequence analysis for DNA sequencing centers and researchers. DNAnexus uses Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances to conduct all of its DNA analysis, while Amazon EC2 On-Demand Instances handle the company’s interactive services, such as its client front end portal and visualization tools. DNAnexus also relies on Amazon Simple Storage Service (Amazon S3) to meet the company’s extensive storage demand, which will grow from terabytes into petabytes of data.
Numerate is a Bay Area biotechnology platform company focused on making the drug design process more data-driven, efficient, and predictable. The company has incorporated Amazon EC2 as a production computational cluster and Amazon S3 for cache storage. Numerate enjoys around 50% cost savings by using Amazon EC2 Spot Instances after spending just 5 days of engineering effort.
University of Melbourne / University of Barcelona
In support of the Belle Experiment, an international High Energy Physics (HEP) experiment, a joint team from the University of Barcelona and the University of Melbourne uses Amazon EC2 and the DIRAC distributed computing software framework to define and steer the execution of experiment simulation and data reprocessing. Moving to Amazon EC2 Spot Instances enabled the University of Melbourne to save 56% per instance hour with negligible changes to their application.
Featured Solution Providers Specializing in Scientific Computing
BioTeam Inc. is an independent consulting company owned and operated by scientists focused on “bridging the gap” between science and high performance IT. The deep strength and broad scope of its employees' expertise allow it to offer a wide range of professional services. The company has been using AWS to solve customer-facing problems since 2007. Many years of experience operating in traditional HPC, Cluster, and Grid Computing environments allow BioTeam to offer practical services to clients considering cloud computing. Learn more about BioTeam.
Cycle Computing is a leader in software for running high performance and high throughput computing clusters using open technologies on Amazon EC2. Cycle’s solutions support scientific, financial, business, and engineering applications. Fortune 500 companies rely on the CycleCloud™ offering and CycleServer™ software, combined with open source frameworks like Condor, SGE, and Hadoop, to deploy mission-critical business applications, including drug discovery, risk management calculations, bioinformatics, and computational fluid dynamics, on public clouds like Amazon EC2 and on internal resources. Learn more about CycleComputing.
Eagle Genomics uses Amazon’s EBS, EC2, RDS, S3, Load Balancing and Auto Scaling, as well as command-line tools, to handle and analyze genomic data for pharmaceutical, agricultural and animal health companies, as well as academic centers. Eagle Genomics has recently been using Spot instances in the development of a novel microRNA discovery pipeline for the ARK Genomics at the Roslin Institute, Edinburgh, UK.
Video Tutorial: How to Launch a Spot Instance
Watch this video tutorial to learn how to launch your first Spot Instance. This tutorial covers placing a bid, determining when the instance is fulfilled, and canceling/terminating the instance.
Video Tutorial: Common Spot Use Cases
In this video, we will walk through example spot instance use cases. As a part of this video, we will cover several customer examples including Numerate, Clarity Solutions, Ooyala, and BrowserMob and how they leverage Spot Instances in their architectures.
Guide: How to Track Spot Instance Activity with the Spot Notifications
This code tutorial and sample enable you to generate and manage Amazon SNS notifications for changes in the state of your Amazon EC2 instances, current Spot Instance requests, and Spot Prices within a particular Region. By leveraging this code sample, you can set up your applications running on Spot Instances to more easily manage the potential for interruption.
Video Tutorial: How to Launch a Cluster on Spot
Chris Dagdigian from BioTeam provides a quick overview of how to start a cluster from scratch in around 10-15 minutes on Amazon EC2 Spot Instances using StarCluster. StarCluster is an open source tool, created by a lab at MIT, that makes it easy to set up a new Oracle Grid Engine cluster. During this presentation, Chris walks through the process of installing, setting up, and running simple jobs on a cluster. Additionally, Chris leverages Spot Instances, so that you can potentially get work done faster and save up to 93% off of the On-Demand price. If you are interested in this tutorial, you may also want to see our StarCluster CloudFormation Template.
AWS Public Data Sets provide a centralized repository where data can be shared and seamlessly integrated into AWS cloud-based applications. Examples include the 1000 Genomes Project, an international public-private consortium building the most detailed map of human genetic variation to date; Ensembl’s Annotated Human Genome Data for MySQL, which includes genomes for over 50 species including humans; and Sage Bionetworks’ Human Liver Cohort, which characterizes gene expression in liver samples. Please visit the Public Data Sets web page for more details.
Academic Researchers on AWS
With AWS in Education, educators, academic researchers, and students can apply to obtain free usage credits to tap into the on-demand infrastructure of Amazon Web Services to teach advanced courses, tackle research endeavors and explore new projects. If you are interested in learning more about this program, please visit the AWS in Education web page for more details.