AWS HPC Blog

Petrobras optimizes cost and capacity of HPC applications with Amazon EC2 Spot Instances

This post was contributed by Marcelo Baptista, Lucia Maria de A. Drummond, Luan Teylo, Alan L. Nunes, Vinod Rebello, and Cristina Boeres.

Petrobras, one of the largest integrated energy companies, has long embraced the transformative potential of high performance computing (HPC) to tackle complex challenges and unlock new opportunities. The company invested heavily in a powerful HPC infrastructure designed for seismic data processing, reservoir simulation, and other oil and gas workloads.

Historically, Petrobras primarily used on-premises solutions to run its HPC applications. To overcome traditional HPC limitations, they launched a research project with Universidade Federal Fluminense’s (UFF) Cloud+HPC Lab to develop a hybrid system combining on-premises and cloud resources. With AWS ParallelCluster and Amazon EC2 Spot Instances, Petrobras gained virtually unlimited computing capacity, improving the efficiency and availability of their complex simulations and, most importantly, their productivity.

Today we will tell you about the framework they created to enable this: Sim@Cloud. With it, Petrobras achieved cost savings ranging from 43% to 90% without sacrificing performance. Sim@Cloud minimizes wait time, selects the right instances, and manages Spot Instance revocations.

Minimizing waiting time with AWS

During periods of high demand, Petrobras faced extended waiting times for job execution. This is a common provisioning dilemma, and it is easily summarized:

  • Over-provisioning: resources are sized for peak demand, resulting in unnecessary expense and wasted energy during low-demand periods.
  • Under-provisioning: resources are insufficient to meet peak demand, so job executions are delayed, which slows company decisions, ties up the time of highly qualified and expensive professionals, and can result in financial losses due to inefficiency.
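To make the trade-off concrete, here is a back-of-the-envelope sketch in Python. All the numbers (the demand profile and the per-node-hour prices) are invented for illustration and do not come from Petrobras; the point is only that owning a modest baseline and bursting the excess to the cloud can undercut owning enough nodes for the peak.

```python
# Hourly node demand over a notional half-day (illustrative numbers only)
demand = [200, 220, 250, 400, 800, 1000, 900, 450, 300, 250, 220, 200]

ON_PREM_NODE_HOUR = 1.00  # assumed amortized on-premises cost per node-hour
CLOUD_NODE_HOUR = 1.40    # assumed cloud cost per node-hour

def cost_peak_sized(demand):
    """Over-provisioning: own enough nodes for the peak and pay for
    them every hour, used or not."""
    return max(demand) * len(demand) * ON_PREM_NODE_HOUR

def cost_hybrid(demand, on_prem_capacity):
    """Hybrid: own a smaller baseline and burst the excess to the cloud."""
    owned = on_prem_capacity * len(demand) * ON_PREM_NODE_HOUR
    burst = sum(max(0, d - on_prem_capacity) for d in demand) * CLOUD_NODE_HOUR
    return owned + burst

print(f"peak-sized fleet:  {cost_peak_sized(demand):8,.0f}")
print(f"hybrid, 300 owned: {cost_hybrid(demand, 300):8,.0f}")
```

With these made-up figures, the hybrid fleet costs roughly half as much as one sized for the peak, even though each cloud node-hour is priced higher than an owned one.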

This trade-off between over-provisioning and under-provisioning is challenging, as either case leads to wasted resources. Figure 1 illustrates this scenario using actual Petrobras data. It shows that, over the course of a typical week, on-premises resources generally meet demand but sit partly idle. On certain days, usage reaches 100% of the company’s computational capacity. At that point (represented by the green line), jobs queue up, causing frustration among engineers (have you ever seen a frustrated engineer? I have).

Therefore, we must ask: what is the most effective way to manage this?

Figure 1 – One week of job submissions on Petrobras’s on-premises infrastructure. Curves above 100% (green line) represent peak demand, which results in workloads being manually shifted to the cloud (represented in orange).

Faced with this dilemma, the native elasticity of AWS offers a solution: use on-premises resources for everyday workloads and move peak demand to the cloud. This is precisely what Petrobras is implementing through its partnership with the Cloud+HPC Lab at UFF and Sim@Cloud.

What about costs?

Petrobras’s workloads typically take days to weeks to complete and use hundreds of compute nodes. Their primary concern when moving to the cloud was cost. To address this, the Cloud+HPC Lab at UFF, led by Professor Lúcia Drummond, launched a project focused on optimizing and reducing the cost of running HPC workloads on AWS.

In the first two years of the project, Petrobras selected CMG’s reservoir simulations as the first workload to migrate to AWS during peak demand periods. This application is a substantial part of the company’s workload and is notable for its native, lightweight checkpoint-and-restart capabilities.

Checkpoint and restart allows Petrobras to use Amazon EC2 Spot Instances to keep costs as low as possible while remaining transparent to end users. From the users’ perspective, sending a batch of simulations to Slurm and waiting for results is the same whether it runs on-premises or on the AWS Cloud.

A single user’s workload can comprise more than 300 simulation jobs, and each job can request more than a hundred instances at once. Finding the right instance, at the right time, in the right AWS Region is key to running the HPC workload and optimizing costs.

To solve those challenges, the group developed Sim@Cloud, which integrates natively with Slurm and other internal Petrobras tools, like BR-Kalman.

Sim@Cloud

Sim@Cloud tackles these challenges by selecting the right instances and pricing model (either Spot or On-Demand) for running simulations. It also oversees the execution so it can manage potential Spot Instance revocations.

Sim@Cloud has two main components: the Launcher and the Execution Manager, both responsible for specific management procedures during a simulation’s execution. To illustrate the behavior of the framework, Figure 2 presents the architecture and a step-by-step example of its operation.

Figure 2 – Architecture of the Sim@Cloud framework developed by the Cloud+HPC Lab and Petrobras.

First, a user submits their simulation request to the cluster job scheduler (Slurm). The user defines parameters for execution, like the number of cores, a batch file for execution, and the output directory.
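As a concrete illustration, the submission step might look like the following Python sketch, which assembles a standard sbatch command from the three parameters the post names (core count, batch file, output directory). The file name and output path are hypothetical, not Petrobras’s actual conventions.

```python
import shlex

def build_submit_command(cores: int, batch_file: str, output_dir: str) -> str:
    """Turn the user's parameters into a standard sbatch invocation
    (%j is Slurm's job-id placeholder in the output path)."""
    return shlex.join([
        "sbatch",
        f"--cpus-per-task={cores}",
        f"--output={output_dir}/%j.out",
        batch_file,
    ])

# Hypothetical batch file and output directory:
cmd = build_submit_command(40, "pre_salt.sbatch", "/results/pre_salt")
print(cmd)
```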

With these parameters, Slurm starts the Launcher on the head node. The Launcher invokes the ML-Predictor, a machine learning component that estimates the simulation’s execution time based only on Slurm features, and then the Instance-Selector module, which considers the estimated time, application characteristics, and environment factors like checkpoint overhead to determine the optimal Amazon EC2 instance type, purchasing option, and Region for the simulation task. With this decision made, the Launcher submits the simulation job to Slurm, which allocates the chosen instance and starts the job.
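A minimal sketch of the Instance-Selector’s decision could look like the following. The catalog entries, prices, speedup factors, and the 5% checkpoint-overhead figure are all invented for illustration; the post does not describe the real selection algorithm’s internals.

```python
from dataclasses import dataclass

@dataclass
class Option:
    instance: str
    region: str
    spot: bool             # purchasing model: Spot or On-Demand
    price_per_hour: float  # USD per instance-hour (invented)
    speedup: float         # performance relative to a reference instance

CHECKPOINT_OVERHEAD = 0.05  # assumed 5% runtime penalty for checkpointing on Spot

def pick_option(predicted_hours: float, options: list) -> Option:
    """Pick the cheapest (instance, Region, purchasing model) candidate,
    charging Spot runs the checkpointing overhead."""
    def cost(o: Option) -> float:
        hours = predicted_hours / o.speedup
        if o.spot:
            hours *= 1 + CHECKPOINT_OVERHEAD
        return hours * o.price_per_hour
    return min(options, key=cost)

catalog = [  # illustrative entries only
    Option("c5.9xlarge",  "sa-east-1", spot=False, price_per_hour=2.04, speedup=1.0),
    Option("c5.9xlarge",  "sa-east-1", spot=True,  price_per_hour=0.61, speedup=1.0),
    Option("c6i.8xlarge", "us-east-1", spot=True,  price_per_hour=0.55, speedup=1.1),
]
best = pick_option(predicted_hours=10.0, options=catalog)
print(best.instance, best.region, "Spot" if best.spot else "On-Demand")
```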

The Execution Manager module monitors the simulation’s progress via the Dynamic-Predictor module, which predicts the remaining execution time based on the simulator’s log. The Execution Manager uses this information to select a new instance to resume the simulation if EC2 reclaims a Spot Instance.

When running on a Spot Instance, the Execution Manager uses the Checkpoint-Recorder module to checkpoint at regular intervals. When necessary, this module restarts the application from the latest available checkpoint. The Execution Manager receives the interruption notice from AWS, which is sent two minutes before an interrupted Spot Instance is reclaimed. After receiving the notice, the Checkpoint-Recorder takes a final checkpoint to preserve the simulation’s current progress. In the case of a Spot interruption, the Instance-Selector uses the remaining time predicted by the Dynamic-Predictor to determine the best instance on which to resume the simulation.
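The interruption notice itself is delivered through the EC2 instance metadata service: the spot/instance-action endpoint returns 404 until the two-minute notice is issued. Here is a minimal polling sketch; it uses IMDSv1 for brevity (production code should use IMDSv2 session tokens), and take_checkpoint is a placeholder standing in for the Checkpoint-Recorder.

```python
import time
import urllib.error
import urllib.request

# Link-local IMDS endpoint; returns 404 until the two-minute notice is issued.
IMDS_SPOT = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def interruption_pending(timeout: float = 1.0) -> bool:
    """True once EC2 has scheduled this Spot Instance for interruption."""
    try:
        with urllib.request.urlopen(IMDS_SPOT, timeout=timeout):
            return True
    except (urllib.error.URLError, OSError):
        # 404 (no notice yet), IMDS unreachable, or not running on EC2.
        return False

def watch_and_checkpoint(take_checkpoint, pending=interruption_pending,
                         poll_seconds: float = 5.0) -> None:
    """Poll for the interruption notice, then trigger a final checkpoint."""
    while not pending():
        time.sleep(poll_seconds)
    take_checkpoint()
```

In a real deployment the poll loop would run alongside the simulation, and the final checkpoint must fit inside the two-minute window; the post notes that CMG’s checkpoints are lightweight, which is what makes this workable.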

This process repeats until the simulation completes. The Launcher then saves the execution information in the History-Database and notifies the user. The database stores information related to the simulation execution, such as the AWS Region, purchasing option, instance type, price history, and the simulation’s execution time, facilitating further assessment of execution performance and costs.

Sim@Cloud also limits the number of revocations a job can suffer: after reaching the limit, it restarts the job on available On-Demand capacity.

Note that Sim@Cloud uses a shared file system in the on-premises cluster and a cache in each AWS Region. Amazon FSx for NetApp ONTAP links each AWS Region eligible to run the simulation. Through this cache architecture, data is copied to the other Regions, boosting availability and delivering swift access to the data pertaining to the simulation’s execution. This can also help organizations comply with data sovereignty requirements, since the primary data hosting environment stays in place (for example, adhering to Brazilian government rules on cloud service data handling and security).

Experimental results

To study the effects of Spot interruptions on cost and total run time with Sim@Cloud, we emulated the two-minute interruption warning.

Using Amazon EC2 Metadata Mock (AEMM), we developed a tool to mimic Spot interruptions. It calculates the instance’s interruption time using a Poisson distribution, updates the instance’s metadata to include a termination alert, and shuts down the Spot Instance. This discrete probability distribution is well suited to modeling the occurrence of events within a specific time window, such as the timing of Spot Instance interruptions.

By sampling the distribution with parameter λ, representing the average number of interruptions per hour, and multiplying the result by 3,600 seconds, we determine the termination time τ, the duration for which the instance remains available. The tool tracks the instance’s uptime and generates the termination alert 120 seconds before τ.
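One way to read this sampling step: interruptions arriving as a Poisson process with rate λ per hour have exponentially distributed waiting times, so the time to the first interruption can be drawn as Exp(λ) hours. A small sketch of that reading (the 120-second lead matches the two-minute notice; this is our interpretation, not the project’s published code):

```python
import random

ALERT_LEAD_SECONDS = 120  # AWS issues the interruption notice two minutes early

def sample_termination(lmbda: float, rng: random.Random):
    """Sample a Spot termination time for a rate of `lmbda` interruptions
    per hour. In a Poisson process the waiting time to the first event is
    exponentially distributed, so draw Exp(lmbda) hours and convert to
    seconds. Returns (tau, alert_time) in seconds."""
    tau = rng.expovariate(lmbda) * 3600
    return tau, max(0.0, tau - ALERT_LEAD_SECONDS)

rng = random.Random(42)
for lmbda in (0.1, 0.5, 0.8, 1.0):
    tau, alert = sample_termination(lmbda, rng)
    print(f"λ={lmbda}: terminate at {tau:.0f}s, alert at {alert:.0f}s")
```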

Our evaluation considered four termination rates (λ): 0.1, 0.5, 0.8, and 1.0. As λ increases, so does the probability of a Spot Instance being reclaimed early; λ = 1 indicates a high probability of Spot reclamation within the first hour.
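To put those λ values in perspective: for a Poisson process with rate λ interruptions per hour, the probability of at least one interruption within the first hour is 1 − e^(−λ), which works out to about 63% for λ = 1. A quick check:

```python
import math

# P(at least one interruption in the first hour) for a Poisson process
# with rate λ interruptions per hour is 1 - exp(-λ).
for lam in (0.1, 0.5, 0.8, 1.0):
    p = 1 - math.exp(-lam)
    print(f"λ = {lam}: {p:.0%}")
```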

For our tests, we used a semisynthetic reservoir simulation model workload, referred to as Pre-Salt. This is a representative workload for the complex pre-salt reservoirs in offshore Brazil explored by Petrobras. To save time and budget, we also created a version that simulates 20 years instead of 50. Throughout the rest of this post, we refer to these models as Pre-Salt and Short-Pre-Salt, respectively. In production, these models are typically simulated with the compositional reservoir simulator CMG GEM, which handles the complexity of the Brazilian pre-salt better. For this work, however, we performed the simulations with the CMG IMEX black-oil simulator, always using 40 threads, as it is faster for the purpose of our tests.

Sim@Cloud lowered costs and reduced execution time in various scenarios when compared to the baseline On-Demand Instance in the SA-EAST-1 Region. Figures 3a, 3b, and 3c display the average makespan (the total elapsed time from a job’s submission to its completion) and the average cost of the Sim@Cloud instance selection scheme over three executions per failure rate, for the Short-Pre-Salt and Pre-Salt simulation models.

Figure 3a – Costs and makespans for a first execution of the Short-Pre-Salt model. The dashed lines represent the baseline execution (using the cheapest On-Demand Instance in the home AWS Region SA-EAST-1).

Figure 3b – Costs and makespans for a second execution of the Short-Pre-Salt model. The dashed lines represent the baseline execution (using the cheapest On-Demand Instance in the home AWS Region SA-EAST-1).

Figure 3c – Costs and makespans for the execution of the full Pre-Salt model. The dashed lines represent the baseline execution (using the cheapest On-Demand Instance in the home AWS Region SA-EAST-1).

Even in the worst-case scenario (λ = 1), the cost savings were as high as 43.87% for Short-Pre-Salt Execution2 (Figure 3b) and up to 90.39% for Short-Pre-Salt Execution1 (Figure 3a). In the former case, this came at the expense of a 9.28% increase in makespan, due to the checkpointing required when using Spot Instances; in contrast, Short-Pre-Salt Execution1 achieved an average reduction of 10.42% in makespan. Recovering from a Spot Instance interruption carries reclamation and restart overheads: selecting a replacement instance, initializing it, and restarting the simulation. As failure rates rose, makespan went up, as shown in Figure 3b. For λ = 0.5 and λ = 0.8, the cost savings for Short-Pre-Salt Execution2 were 81.78% and 81.46%, respectively, with makespan increasing by 0.70% and 2.48%. For Execution1, the cost reduction was even better: 90.13% and 90.07%, respectively, while the makespan decreased by 13.52% and 18.35%.

The scenario with λ = 0.1 also showed notable improvements: cost reductions of up to 91.58% (Execution1) and 85.74% (Execution2), with makespan decreasing by 15.97% and 3.92%, respectively. In this scenario, the selection heuristic found a Spot Instance with better performance than the baseline instance at a lower price, and the short execution time of Short-Pre-Salt meant that no Spot interruptions occurred, avoiding the additional overhead of migrating the simulation. This shows how the selection heuristic seeks out cost-effective resources in the Spot market.

For the Pre-Salt model, the results shown in Figure 3c are analogous to those of the Short-Pre-Salt model, with savings achieved in both cost and makespan across the scenarios.

It is important to note that execution times are also affected by the availability of Spot Instances in a Region at the time of job submission. Spot Instance prices may change during runtime, but the primary cause of increased costs in both scenarios with λ = 1.0 was Sim@Cloud opting for an On-Demand Instance to end the simulations after exhausting the permitted Spot interruption limit.

Benefits

This study was productive and opened new possibilities for the way we manage HPC workloads, showing that Amazon EC2 Spot Instances are key to cost optimization. Using Spot Instances can lead to substantial savings on EC2 costs compared with On-Demand rates.

Using Amazon FSx for NetApp ONTAP with FlexCache for efficient caching also made a significant difference, lowering storage expenses and ensuring effortless data access in the cloud with impressive performance. Even given the interruptible nature of Spot Instances, solutions like Sim@Cloud have shown that we can achieve substantial cost savings without compromising workload integrity.

Cloud-based solutions increase the accessibility of HPC applications by remaining transparent to end users, reducing technical barriers, and giving users (scientists and engineers) the additional capacity they need during periods of peak demand.

Conclusion

This project and its positive results were possible because of the close collaboration and support among UFF, Petrobras, and AWS. As the results show, Sim@Cloud is effective at reducing overall costs, regardless of the simulation’s execution time. Even with higher failure rates, longer simulation models can achieve a shorter makespan at low cost.

These results show the maturity of AWS in handling HPC workloads. Large companies such as Petrobras can benefit from this, and from the scale and capacity of AWS, to solve typical provisioning problems, or even extraordinary ones.

AWS worked closely with Petrobras and UFF. This close relationship saved time that would otherwise have been spent just understanding the environment rather than working on research and solutions. It was crucial to the success of the project and the development of Sim@Cloud, helping the group avoid potential bugs and guiding them toward the best decisions.

Acknowledgements

This work was performed by several engineers and analysts, to whom we are most grateful: Alan L. Nunes (UFF), Cristina Boeres (UFF), Daniel B. Sodré (UFF), Felipe A. Portella (Petrobras), José Viterbo (UFF), Luan Teylo (INRIA/Bordeaux), Lúcia M. A. Drummond (UFF), Maicon Dal Moro (Petrobras), Marcelo Baptista (AWS), Paulo F. R. Pereira (Petrobras), Paulo J. B. Estrela (Petrobras), Renzo Q. Malini (Petrobras), Vinod E. F. Rebello (UFF).

Publications

If you would like to know more about this project, here is a list of different papers you can read.

Marcelo Ferreira Baptista

Marcelo Baptista is a Solutions Architect at AWS with over 30 years of IT experience. He specializes in DevOps, compute, and HPC, and assists customers with their technological challenges.

Alan L. Nunes

Alan L. Nunes is a PhD student at the Fluminense Federal University (UFF, Brazil) and the University of Bordeaux (UB, France). His topics of interest include High-Performance Computing, Cloud Computing, Distributed Systems, Federated Learning, and MapReduce.

Cristina Boeres

Cristina Boeres is an Associate Professor at Instituto de Computação, Universidade Federal Fluminense, with a Ph.D. in Computer Science from the University of Edinburgh (United Kingdom).

Felipe Portella

Felipe Portella is an IT consultant at the Brazilian energy company Petróleo Brasileiro S.A. (PETROBRAS), specializing in HPC for petroleum reservoir simulation. He holds a degree in Informatics (2003) and an M.Sc. in Computer Science (2008) from PUC-Rio, Brazil. In 2024, he completed his doctorate in Computer Architecture at UPC-Barcelona Tech in partnership with the Barcelona Supercomputing Center, Spain.

Luan Teylo

Luan Teylo is a tenured researcher at INRIA Bordeaux, France. Since 2021, he has worked with the TADaaM Team, focusing primarily on I/O problems related to HPC platforms. He earned his M.S. and Ph.D. in Computer Science from Fluminense Federal University (UFF). His research interests include distributed algorithms, cloud computing, and scheduling problems.

Lucia Maria de A. Drummond

Lucia Maria de A. Drummond completed her doctorate in Systems and Computer Engineering at the Federal University of Rio de Janeiro in 1994, during which she participated in the development team of the first parallel computer in Brazil. The main article of her thesis received a research incentive prize granted by the Ministry of Science and Technology and Compaq Computer through the Brazilian Academy of Sciences. She is currently a Full Professor at Fluminense Federal University, Brazil, and her research interests are in high-performance computing and cloud computing.

Paulo Estrela

Paulo Estrela has worked as an HPC systems engineer for Petrobras, a Brazilian state-owned oil company, since 2008, building and managing supercomputers for petroleum reservoir simulations. More recently, he has been working on giving Petrobras's supercomputers elasticity, using cloud computing resources to complement on-premises capacity. Some of these supercomputers are listed among the 100 most powerful computers in the world by the top500.org organization.

Vinod Rebello

Vinod Rebello obtained his Ph.D. in Computer Science from the University of Edinburgh (United Kingdom) in 1996. He is currently an associate professor in the Department of Computer Science of the Universidade Federal Fluminense, in Brazil. His research interests focus on various aspects of parallel and distributed computing in grids and clouds, including autonomic computing, scientific applications, resource management, scheduling and fault tolerance, and cybersecurity.