AWS HPC Blog
Optimizing compute-intensive tasks on AWS
Optimizing workloads for performance and cost-effectiveness is crucial for businesses of all sizes – and especially helpful for workloads in the cloud, where there are a lot of levers you can pull to tune how things run.
AWS offers a vast array of instance types in Amazon Elastic Compute Cloud (Amazon EC2) – each with its own performance characteristics and pricing. But choosing the right instance type and fine-tuning your workload for optimal performance can be a daunting task. We offer a managed service, AWS Compute Optimizer, which excels at optimizing long-running workloads and right-sizing servers for persistent services like database backends.
In today’s post, we’re introducing you to another option which addresses a different, but equally critical need. It’s a powerful automation tool designed for tasks that have a defined completion point and may exhibit significant performance variability. This variability can stem from changes in physics calculation methods, compilation configurations, or other workload-specific factors.
Introducing CloudInstanceOptimizer
The CloudInstanceOptimizer is particularly valuable for compute-intensive, batch-style workloads like scientific simulations, machine learning training jobs, or financial modeling, where the interplay between the workload characteristics and the underlying hardware can have a profound impact on performance and cost-efficiency.
By allowing users not only to select the optimal Amazon EC2 instance type but also to fine-tune their workload parameters, the CloudInstanceOptimizer goes beyond simple resource allocation, enabling users to squeeze maximum performance from their chosen Amazon EC2 instances or improve their workload configuration.
Optimization methods are crucial for scenarios with large, intractable parameter spaces. Depending on the solution topology, these methods can often find the approximate global optimal configuration by exploring less than 1% of the search space.
The benefits of intelligent optimizing automation include:
- Efficiency: weeks of manual labor can be reduced to a few hours.
- Cost savings:
  - Reduced human labor costs
  - Lower compute costs due to more efficient discovery of optimal solutions
- Scalability: cloud HPC allows for simultaneous optimization of multiple workloads, further reducing labor costs
Important considerations for users:
- Optimizations rely on workload execution on EC2 instances, which incur costs.
- Costs scale with workload runtime.
- Be mindful of optimization configuration to avoid excessive resource consumption.
- Avoid attempting to run every possible setup, as this can lead to unnecessary expenses.
By leveraging these optimization methods and considering their implications, customers can significantly improve their workflow efficiency and reduce their overall costs.
CloudInstanceOptimizer goals
The CloudInstanceOptimizer addresses two critical challenges faced by AWS users:
- Determining the most suitable EC2 instance type for a given workload.
- Optimizing the workload itself to perform optimally on specific EC2 instance types.
This dual-optimization approach focuses on compute-intensive tasks like machine learning training, finite element analysis (FEM), computational fluid dynamics (CFD), risk analysis, simulations, and computer graphics rendering.
How it works
The CloudInstanceOptimizer operates in two main modes:
- EC2 instance optimization:
  - Users containerize their workload and upload it to Amazon Elastic Container Registry (ECR).
  - They specify EC2 instance families to evaluate and the number of replicate runs.
  - The tool deploys the workload across selected instance types using AWS Batch.
  - Performance metrics and runtime data are collected for each instance type.
  - A comprehensive report is generated with performance comparisons and cost estimates.
- Workload optimization:
  - Users containerize their workload and upload it to Amazon Elastic Container Registry (ECR).
  - Users specify the target EC2 instance type.
  - The tool analyzes the workload and suggests input parameters that could impact parallelization strategies, compilation configurations, and workload configuration.
  - These input perturbations are automatically applied and tested across multiple runs to find the best combination.
  - A detailed report on performance improvements and trade-offs is provided.
Example use case #1: optimizing a finite element analysis (FEM) solver
Let’s walk through an example of how the CloudInstanceOptimizer can be used to find the best EC2 instance for either runtime or cost of a physics-based Finite Element Analysis (FEM) solver. FEM is widely used in engineering for structural analysis, heat transfer, fluid dynamics, and more. These simulations are often computationally intensive and can benefit greatly from proper EC2 instance selection and workload optimization.
Step 1: containerize the FEM solver
First, we need to containerize our FEM solver. We’ll create a Dockerfile that includes all necessary dependencies and the solver itself. We are demonstrating a real application of wave propagation for a ball bouncing on a solid. The mesh, FEM input file, and additional installation details can be found in the GitHub example folder. You can examine each detail of the Dockerfile to understand specifically what is happening for this application, but all you really need to know at this point is that a Dockerfile was created for your application. At the end of the Dockerfile, we have added some source files needed for the CloudInstanceOptimizer, namely utils.py and monitor_system.py, which we use to monitor the system performance profiles and save results to a user-specified Amazon Simple Storage Service (Amazon S3) bucket. It’s important to note that you can customize this container file for your specific application, as needed. We only require that you add the CloudInstanceOptimizer Python source files at the end.
FROM ubuntu:22.04
# We are first installing singularity containers inside the docker
ENV DEBIAN_FRONTEND=noninteractive
# Ensure repositories are up-to-date
RUN apt-get update
# Install debian packages for dependencies
RUN apt-get install -y \
autoconf \
automake \
cryptsetup \
git \
libfuse-dev \
libglib2.0-dev \
libseccomp-dev \
libtool \
pkg-config \
runc \
squashfs-tools \
squashfs-tools-ng \
uidmap \
wget \
zlib1g-dev \
build-essential \
software-properties-common
RUN export VERSION=1.21.0 OS=linux ARCH=amd64 && \
wget https://dl.google.com/go/go$VERSION.$OS-$ARCH.tar.gz && \
tar -C /usr/local -xzvf go$VERSION.$OS-$ARCH.tar.gz && \
rm go$VERSION.$OS-$ARCH.tar.gz
RUN echo 'export PATH=/usr/local/go/bin:$PATH' >> ~/.bashrc
ENV PATH=/usr/local/go/bin:$PATH
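# Build Singularity CE from source (it is written in Go, hence the Go install above),
# so the MOOSE .sif image copied later can be run inside this container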
ENV VERSION='4.1.0'
RUN wget https://github.com/sylabs/singularity/releases/download/v${VERSION}/singularity-ce-${VERSION}.tar.gz
RUN tar -xzf singularity-ce-${VERSION}.tar.gz
WORKDIR "singularity-ce-${VERSION}"
RUN ./mconfig
RUN make -C builddir
RUN make -C builddir install
WORKDIR /
# Setup the python environment
RUN add-apt-repository ppa:deadsnakes/ppa
RUN apt-get install python3.11 python3.11-distutils python3.11-dev -y
RUN apt-get install curl -y
RUN curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
RUN python3.11 get-pip.py
RUN ln -s /usr/bin/python3.11 /usr/bin/python
RUN echo "alias ls='ls -al'" >> /root/.bashrc
# Now copy the ec2benchmarking code
COPY requirements.txt requirements.txt
RUN python3.11 -m pip install -r requirements.txt
# MOOSE Singularity image
COPY ./examples/fem_calculation/moose.sif moose.sif
COPY ./src/utils.py utils.py
COPY ./src/monitor_system.py monitor_system.py
COPY ./examples/fem_calculation/runmoose.sh runmoose.sh
COPY ./examples/fem_calculation/PrismsWithNamedSurfaces.inp PrismsWithNamedSurfaces.inp
COPY ./examples/fem_calculation/stressed.i stressed.i
COPY ./examples/fem_calculation/runscript.sh runscript.sh
RUN chmod 777 runmoose.sh
RUN chmod 777 runscript.sh
WORKDIR /
Step 2: configure the benchmark
Next, we’ll create a JSON configuration file specifying the EC2 instances we want to evaluate and other parameters:
{
"region_name" : "us-east-1",
"s3_bucket_name" : "dummy-bucket",
"container_arn" : "dummy.dkr.ecr.dummy-region.amazonaws.com/dummy-container",
"logGroupName" : "/aws/batch/job",
"replications" : 5,
"review_logs_past_hrs": 1,
"docker_privileged": "True",
"job_timeout": 15,
"run_cmd": "python monitor_system.py --cmd ./runmoose.sh",
"ec2_types":["t2", "t3", "t4", "m4", "m5", "m6", "m7", "c4", "c5", "c6", "c7"]
}
Once you’ve created your container and pushed it to Amazon ECR, you’ll have the container ARN ready to copy into the JSON file. Ensure you update the region you’re in and specify the Amazon S3 bucket where log files will be stored and results will be analyzed.
The “replications” input is crucial for understanding performance variability. Due to various uncontrolled variables, each workload execution on an Amazon EC2 instance will yield slightly different results in terms of runtime, costs, CPU usage, RAM, etc. Comparing the performance of a single execution between two different EC2 instances is inconclusive, as we can’t determine if differences are due to random variability or actual performance disparities. Therefore, we run replicates to measure this variability. The uncertainty in the variability itself decreases with more replicates, but we must balance this with the increased time and cost of running too many benchmarks. A few replicates are typically sufficient to gain a general understanding of performance variability.
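To make this concrete, here is a minimal sketch of the kind of summary the replicates enable. The runtimes below are invented for illustration, not output from the tool:

```python
import statistics

# Hypothetical runtimes (seconds) from five replicate runs on two instance types
runtimes = {
    "c6i.8xlarge": [612.4, 598.7, 605.1, 621.0, 609.3],
    "m5.8xlarge": [734.2, 795.6, 741.8, 768.9, 802.3],
}

for instance, samples in runtimes.items():
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    # Coefficient of variation: run-to-run noise relative to the mean runtime
    print(f"{instance}: mean={mean:.1f}s, stdev={stdev:.1f}s, cv={stdev / mean:.1%}")
```

With even a handful of replicates, you can see whether the gap between two instance types is larger than their run-to-run noise.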
Another important input option is the job_timeout. We need to select a value that allows the workload to complete successfully. However, some configurations or EC2 instances may not support the workload (due to insufficient RAM, too few cores, etc.). These issues can manifest in excessively slow runtimes or cause the EC2 instance to hang. The CloudInstanceOptimizer will automatically terminate these tasks once the specified timeout (in minutes) has been reached.
The ec2_types field specifies the instance families you want to include in the optimization. You can add as many or as few as you want, including both AWS Graviton (Arm64-based) and x86 processors if you’ve provided appropriate inputs (see this example for adding Graviton EC2 instances to the benchmark). The CloudInstanceOptimizer will review each subtype within the families to right-size for your application.
The run_cmd tells the program what to run in the container. This input must always start with the Python interpreter to use, followed by monitor_system.py with --cmd. Users can provide any command after --cmd, which will be executed in the container. This setup allows the monitoring system to begin gathering system information and metadata before executing the workload. Once the workload has completed, a log file will be uploaded to the Amazon S3 bucket. Users are free to modify the code to alter sampling rates or add additional instructions.
We chose this method because (1) it is easy for users to modify; and (2) it avoids issues with Amazon CloudWatch agents potentially combining workload system parameters with extraneous information, or difficulties in deciphering overlapping logs when EC2 instances are reused for efficiency.
"run_cmd": "python monitor_system.py --cmd ./runmoose.sh",
Step 3: deploy and run
We’ll use the AWS CDK to deploy the necessary infrastructure:
cd CDKBenchEnv/
cdk synth -c json_file_path=../examples/fem_calculation/benchmark_config.json
cdk deploy -c json_file_path=../examples/fem_calculation/benchmark_config.json
Then, we’ll run the benchmark:
cd ..
python run_benchmark.py -j ./examples/fem_calculation/fem_benchmark_config.json
Step 4: analyze results
After the benchmark completes, we can use the provided Streamlit app to visualize and analyze the results:
streamlit run ./streamlitApps/analyze_data_streamlit.py
This analysis will display performance metrics and cost estimates for each instance type, helping us choose the most suitable EC2 type for our FEM solver. In this example, if our goal is to minimize the runtime for this specific FEM workload (the next section will optimize the workload itself), we can see that the C6a/C6i/C6g EC2 instance types clearly outperform the other types in terms of both overall runtime and consistency of results. However, among the top five C6 families, there isn’t a clear advantage. In this case, a user would be advised to choose the cheapest option from these top five performers.
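When the leading instance types are statistically indistinguishable on runtime, cost per run becomes the tiebreaker. A rough sketch of that arithmetic follows; the runtimes and prices are placeholders, not values from this benchmark, so substitute your own measurements and the On-Demand pricing for your region:

```python
# Placeholder figures for illustration only -- substitute your measured runtimes
# and the On-Demand pricing for your region.
candidates = {
    # instance: (mean runtime in hours, On-Demand $/hour)
    "c6i.8xlarge": (0.17, 1.36),
    "c6a.8xlarge": (0.17, 1.22),
    "c6g.8xlarge": (0.18, 1.09),
}

for instance, (hours, price) in sorted(candidates.items(),
                                       key=lambda kv: kv[1][0] * kv[1][1]):
    print(f"{instance}: ~${hours * price:.2f} per run")
```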
Example use case #2: optimizing FEM configuration
Now, let’s consider optimizing the workload itself. The CloudInstanceOptimizer includes both a grid scheme and a Bayesian optimization method, which uses ML techniques to learn the relationships and interactions among variables in order to minimize a chosen metric (either a system property or a custom variable). We plan to add more optimization methods in the future. Currently, the tool is designed for two main scenarios:
- A small number of perturbations, where we can run a grid of all combinations to find the best input.
- An extensively large parameter space, but with a result topology that is relatively smooth and non-noisy, allowing the ML to learn efficiently and find the best combination.
If you expect your topology to be quite noisy, we’d like to talk to you. We plan to eventually add genetic algorithms for such scenarios, and would welcome feedback and insights from your experience to help us do this well.
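For intuition about how the Bayesian approach works, here is a conceptual sketch using the open-source scikit-optimize library. This is not the tool’s internal implementation, and the synthetic “runtime” curve is invented purely for illustration; the idea is that a surrogate model proposes each new candidate based on all evaluations seen so far:

```python
from skopt import gp_minimize  # pip install scikit-optimize

def objective(params):
    # Toy stand-in for one benchmark run: a synthetic "runtime" in which adding
    # cores helps at first, then communication overhead starts to dominate.
    n_cores = params[0]
    return 1000.0 / n_cores + 3.0 * n_cores

# Search integer core counts between 1 and 64 with a budget of 20 evaluations.
result = gp_minimize(objective, dimensions=[(1, 64)], n_calls=20, random_state=0)
print("best core count:", result.x[0], "predicted runtime:", round(result.fun, 1))
```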
In this example, we will demonstrate a simple case: finding the optimal number of parallel cores to use with the FEM calculation for a given physics model and mesh. To do this, we’ll modify our configuration to focus on workload optimization:
{
"region_name" : "us-east-1",
"s3_bucket_name" : "dummy-bucket",
"container_arn" : "dummy.dkr.ecr.dummy-region.amazonaws.com/dummy-container",
"logGroupName" : "/aws/batch/job",
"replications" : 1,
"review_logs_past_hrs": 1,
"job_timeout": 20,
"run_cmd": "python /monitor_system.py --cmd ./runmoose.sh ARGS",
"docker_privileged": "True",
"ec2_types":["c6"],
"optimization_iterations": 20,
"optimization_parallel_samples": 5,
"optimization_metric": "runtime",
"optimization_arg1": [1,64],
"optimization_arg1_type": "integer"
}
We’ve set replications to a value of 1, as the Bayesian optimization uses an ML model to find the average response. Since this method explores covariates, it inherently accounts for the effects of replicates, eliminating the need for additional runs. We’ve modified the “ec2_types” to focus on a specific family. Note the addition of ARGS in the run_cmd input. This instructs the program where to insert the input perturbations. The user-provided script is expected to handle these inputs appropriately within the container.
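In this example the receiving script is runmoose.sh, a shell script. The same pattern in a Python entrypoint would look roughly like the sketch below; the solver command, flags, and input file are illustrative assumptions, not the example’s actual script:

```python
import subprocess
import sys

# CloudInstanceOptimizer replaces ARGS with concrete values, which arrive here as
# command-line arguments. In this example the single argument is the core count.
n_cores = int(sys.argv[1])

# Launch the solver with the requested number of MPI ranks. The executable name
# below is illustrative; the actual example drives MOOSE through runmoose.sh.
subprocess.run(["mpiexec", "-n", str(n_cores), "moose-opt", "-i", "stressed.i"],
               check=True)
```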
While this optimization could be run serially, we want to leverage the scalability of the cloud. The optimization_parallel_samples parameter tells the program how many EC2 instances to provision for workload optimization. This significantly speeds up convergence to an optimal solution, but be aware that costs will increase accordingly. The optimization_iterations parameter sets the maximum number of input refinement batches to use. Bayesian optimization is known for its efficiency with low-noise response topologies. However, more discontinuities in the response surface will require a higher number of iterations and may eventually necessitate switching to another method, such as a genetic algorithm.
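Before launching, a quick back-of-the-envelope estimate helps keep the budget in check: the worst case is roughly optimization_iterations × optimization_parallel_samples workload executions. A sketch with placeholder numbers:

```python
# Placeholder values -- substitute your own configuration, measured runtime, and pricing.
iterations = 20           # optimization_iterations
parallel_samples = 5      # optimization_parallel_samples
avg_runtime_hours = 0.25  # typical duration of one workload run
price_per_hour = 1.36     # On-Demand rate of the chosen instance type (example value)

max_runs = iterations * parallel_samples
max_cost = max_runs * avg_runtime_hours * price_per_hour
print(f"worst case: {max_runs} runs, roughly ${max_cost:.0f} in compute")
```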
The optimization_metric value can be set to any system parameter, runtime, cost, or custom_metric. Note that the optimization seeks to minimize values, so if a user wants to maximize a custom metric, they should return its negative value.
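For example, if the quantity of interest is a throughput you want to maximize, negate it before reporting it as the custom metric. A trivial sketch; how your workload actually records the metric is up to you:

```python
def report_custom_metric(throughput_mb_per_s: float) -> float:
    # The optimizer minimizes whatever metric it is given, so report the negative
    # of any quantity you actually want to maximize.
    return -throughput_mb_per_s

print(report_custom_metric(512.0))  # prints -512.0
```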
Lastly, any number of optimization_arg and optimization_arg_type inputs can be provided (e.g., optimization_arg2, optimization_arg3, etc.). The program will replace the ARGS in the command with these values for each perturbation. The parameter space range is determined by the list given in optimization_arg1. The optimization_arg1_type, which can be either integer or float, determines how the optimization perturbs the inputs during exploration.
After running the optimization, we can analyze the results:
streamlit run ./streamlitApps/analyze_optimization_streamlit.py
This will show us how different parameter combinations affect performance, allowing us to fine-tune our FEM solver for optimal performance on the chosen EC2 instance type.
AWS architecture
The AWS architecture can be summarized as follows:
- The user’s workload is containerized and pushed to Amazon ECR.
- The Cloud Development Kit (CDK) stack deploys the necessary AWS Batch infrastructure, including compute environments and job queues for each EC2 instance type to be tested.
- The benchmark script submits jobs to AWS Batch, which runs the containerized workload on each specified EC2 instance type.
- Performance metrics are collected using a monitoring script within the container and stored in Amazon S3.
- After all jobs complete, the results are analyzed and visualized using the provided Streamlit app.
This architecture allows for easy scaling to test hundreds or even thousands of different configurations, making it suitable for organizations of all sizes, from startups to large enterprises with complex workloads.
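For readers curious about what the job-submission step looks like in code, a minimal boto3 sketch follows. The job queue and job definition names are placeholders; the CDK stack creates and manages the real ones for you:

```python
import boto3

batch = boto3.client("batch", region_name="us-east-1")

# Submit one containerized benchmark run to a per-instance-type job queue.
# The queue and job definition names below are placeholders.
response = batch.submit_job(
    jobName="fem-benchmark-c6i",
    jobQueue="benchmark-queue-c6i",
    jobDefinition="fem-benchmark-jobdef",
    containerOverrides={"command": ["python", "monitor_system.py",
                                    "--cmd", "./runmoose.sh"]},
)
print("submitted job:", response["jobId"])
```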
Summary
The CloudInstanceOptimizer is a powerful tool that can significantly improve the performance and cost-effectiveness of your Amazon EC2 workloads. By automating the process of instance selection and workload optimization, it enables data-driven decision-making and can lead to substantial cost savings and performance improvements.
Whether you’re running machine learning training jobs, complex simulations, or any other compute-intensive task, this tool can help you make the most of your AWS resources. By providing a holistic view of both instance performance and workload optimization, it empowers users to achieve the best possible balance between performance and cost.
If you want to request a proof of concept or if you have feedback on the AWS tools, please reach out to us at ask-hpc@amazon.com.