Building a 4x faster and more scalable algorithm using AWS Batch for Amazon Logistics

Amazon Logistics’ science team created an algorithm to improve the efficiency of their supply chain by improving planning decisions. Initially, the algorithm was implemented sequentially in a monolithic architecture and executed on a single high-performance compute node in the AWS Cloud.

Scientists were focusing on solving the algorithmic challenges and were less focused on developing the underlying infrastructure. As the algorithm matured and started being used in production, the team needed a new dynamic and decoupled High Performance Computing (HPC) architecture that could run the algorithm faster and allow for more experimentation.

A team of cloud consultants at AWS Professional Services joined Amazon Logistics’ data scientists and worked on the design and implementation of this new architecture. In this blog post, we discuss the challenges, the proposed solution, the AWS services used to orchestrate the algorithm, and the lessons learned.

Challenges

Scientists had their algorithm deployed on a managed monolithic infrastructure using a single large EC2 instance (m5.24xlarge). The deployed solution could use multiple cores of that single EC2 instance through multiprocessing. Even so, the run time of the algorithm on this infrastructure (up to ~14 hours for a single run) was too long to allow fast experimentation.

Because the existing infrastructure relied on large instances, the cost of ownership was high, and the full capacity of the resources was not always used efficiently. This also limited the ability of the algorithm to scale to bigger datasets and more complex calculations. All these issues created a risk of delayed deliverables for supply chain planning.

Scientists were focused on developing and extending the features of their algorithm. They were not able to allocate resources to improve their infrastructure and move it to an HPC architecture on AWS.

Given these challenges, the new architecture needed to be designed with the following aspects in mind:

Multi-fold decrease in run-time of the algorithm execution
Ease of experimentation and flexibility
Re-usability and modularity
Incorporation of Continuous Integration and Continuous Delivery (CI/CD) and DevOps practices
Removal of silos between infrastructure and algorithm code to create an optimized combination

Solution

New HPC Architecture Proposed

HPC applications are often based on complex algorithms that rely on high performing infrastructure for efficient execution.

These applications need hardware that includes high performance processors, memory, and communication subsystems.

For many applications and workloads, the performance of compute elements must be complemented by comparably high performance storage and networking elements. AWS can provide near-instant access to virtually unlimited computing resources and support the most advanced computing applications.

The newly developed architecture combines managed services, such as AWS Batch, AWS Lambda, and AWS Step Functions, to improve and optimize the performance of the Amazon Logistics’ algorithm.

The monolithic architecture of the original algorithm was decoupled into multiple components, and the workflow is orchestrated using AWS Step Functions. AWS Step Functions is a fully managed service that makes it easier to coordinate the components of distributed applications. It automatically triggers and tracks each step and retries when there are errors.

Decoupling the various components led to parallelized execution of some parts of the algorithm and provided the most benefit in terms of reducing the run time. Breaking the algorithm into steps also ensures that the failure of one component does not bring the whole workflow down and each component can scale independently.
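As a minimal sketch of this decoupling, the following Python builds an Amazon States Language definition in which an AWS Lambda step prepares the input, two AWS Batch jobs run in parallel, and each Batch task has a retry policy. The step names, job definition names, and ARNs are hypothetical placeholders and do not come from the actual Amazon Logistics workflow.

```python
import json

# Placeholder identifiers: hypothetical, not taken from the original solution.
PREPARE_LAMBDA_ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:prepare-input"
JOB_QUEUE_ARN = "arn:aws:batch:REGION:ACCOUNT_ID:job-queue/algorithm-queue"


def batch_branch(job_name: str, job_definition: str) -> dict:
    """Return a one-state branch that submits an AWS Batch job and waits for it."""
    return {
        "StartAt": job_name,
        "States": {
            job_name: {
                "Type": "Task",
                # The ".sync" suffix makes Step Functions wait for job completion.
                "Resource": "arn:aws:states:::batch:submitJob.sync",
                "Parameters": {
                    "JobName": job_name,
                    "JobQueue": JOB_QUEUE_ARN,
                    "JobDefinition": job_definition,
                },
                # Retry a failed job twice, doubling the wait between attempts.
                "Retry": [
                    {
                        "ErrorEquals": ["States.TaskFailed"],
                        "IntervalSeconds": 30,
                        "MaxAttempts": 2,
                        "BackoffRate": 2.0,
                    }
                ],
                "End": True,
            }
        },
    }


def build_definition() -> dict:
    """Assemble the state machine: one Lambda step, then two parallel Batch jobs."""
    return {
        "Comment": "Illustrative decoupled workflow (hypothetical step names)",
        "StartAt": "PrepareInput",
        "States": {
            "PrepareInput": {
                "Type": "Task",
                "Resource": PREPARE_LAMBDA_ARN,
                "Next": "RunComponents",
            },
            "RunComponents": {
                "Type": "Parallel",
                "Branches": [
                    batch_branch("ComponentA", "component-a-jobdef"),
                    batch_branch("ComponentB", "component-b-jobdef"),
                ],
                "End": True,
            },
        },
    }


# JSON text that could be supplied as a state machine definition.
definition_json = json.dumps(build_definition(), indent=2)
```

Because each branch is its own state with its own retry policy, a transient failure in one component is retried without restarting the other branch.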

The activity tasks of the workflow are integrated with AWS Batch and AWS Lambda Functions to run the different components of the algorithm. As a fully managed service, AWS Batch helps developers, scientists, and engineers to run batch computing workloads of any scale. AWS Batch automatically provisions compute resources and optimizes the workload distribution based on the quantity and scale of the workloads. With AWS Batch, there’s no need to install or manage batch computing software, so you can focus your time on analyzing results and solving core business problems.

To select the EC2 instance type, the team benchmarked the algorithm against different EC2 instance types.

After the comparison, the team selected instance types that use 3rd generation Intel Xeon Scalable processors, which deliver significant gains in compute performance, memory capacity and bandwidth, and I/O scalability.

In addition, the modularity of the workflow makes it possible to specify the required resources for each block of the algorithm. AWS Batch can use this information to select the optimal resource type for each step of the simulation.
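One way to express per-step resource requirements is sketched below. It builds the keyword arguments for the AWS Batch SubmitJob API, overriding vCPU and memory for each step. The step names and resource figures are hypothetical and chosen only for illustration; they are not the values used by Amazon Logistics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepResources:
    """Resources requested for one block of the algorithm."""
    vcpus: int
    memory_mib: int


# Hypothetical per-step requirements, for illustration only.
STEP_RESOURCES = {
    "preprocess": StepResources(vcpus=2, memory_mib=4096),
    "optimize": StepResources(vcpus=16, memory_mib=65536),
}


def submit_job_kwargs(step: str, job_queue: str, job_definition: str) -> dict:
    """Build SubmitJob arguments that override resources for the given step."""
    res = STEP_RESOURCES[step]
    return {
        "jobName": f"{step}-job",
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {
            # AWS Batch expects resource values as strings.
            "resourceRequirements": [
                {"type": "VCPU", "value": str(res.vcpus)},
                {"type": "MEMORY", "value": str(res.memory_mib)},
            ]
        },
    }
```

With boto3 installed and credentials configured, these arguments could be passed as `boto3.client("batch").submit_job(**submit_job_kwargs(...))`. Keeping the requirements in one table makes it straightforward to tune a single step without touching the others.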

Figure 1: A diagram of the Step Function workflow of the solution interacting with AWS Lambda, SQS and AWS Batch.

Integrating DevOps with Science

DevOps is the combination of cultural philosophies, practices, and tools that increases an organization’s ability to deliver applications and services at high velocity. The joint team from AWS and Amazon Logistics invested heavily in automation and DevOps throughout their journey.

Continuous Integration (CI) on AWS is a software development practice where developers regularly merge their code changes into a central repository, after which automated builds and tests are run. In the case of data science, the software would be new algorithms to be tested.

By automating the process using CI, scientists can continuously build, test, and deploy their new ideas and algorithms more easily and quickly, without compromising on quality.

The joint team worked on shaping the journey of taking the algorithm from its monolith state to a more modern and decoupled optimized architecture. The team realized that investing in automation and DevOps would enable them to move fast and experiment more.

Starting with DevOps foundations, the team set up a CI/CD pipeline to automate the deployment of the optimized algorithm on AWS. The team replaced and enhanced existing build systems for various algorithm components to integrate natively with the new CI/CD pipeline.

After that, the team started breaking down the algorithm into components and containerizing them to run on AWS Batch.

Figure 2: A diagram of the containerized execution showing the application accessing different AWS resources such as Amazon Simple Storage Service (Amazon S3), Amazon Relational Database Service (Amazon RDS) and FSx for Lustre.

After decoupling the algorithm to a set of AWS Lambda functions and AWS Batch jobs, the team was ready to orchestrate the algorithm together using AWS Step Functions, as detailed above in the HPC proposed architecture. Features such as the Workflow Studio, a new visual workflow designer for AWS Step Functions, allowed the team to rapidly build workflows and test them. It facilitated collaboration as the entire team gained visual understanding of the new architecture being built.

Following the promising results from building the DevOps foundations and decoupling the algorithm, the team moved quickly to package the solution using the AWS Cloud Development Kit (AWS CDK), because modularity was a main design requirement. The joint team gained great flexibility and speed with the first version of the packaged algorithm. Recognizing this potential, the team published a second version, an even faster and more dynamic release of the algorithm, in less than two weeks.

Packaging the algorithm and the optimized infrastructure

In the past, the algorithmic solution and the underlying infrastructure existed within their own silos. The scientists used the same standard infrastructure to test every new idea and algorithm. Running the same infrastructure for different workloads leaves considerable room for improvement.

By using the AWS Cloud Development Kit (AWS CDK), the joint team was able to define and re-architect the underlying infrastructure as Infrastructure as Code (IaC), decouple the algorithm using AWS Lambda and AWS Batch, and orchestrate it using AWS Step Functions.

This all resulted in an optimized and modular algorithm that allowed Amazon Logistics to innovate faster, scale, iterate and experiment frequently over not only the algorithmic code, but also the infrastructure piece.

Future Work

AWS Step Functions has a Map state, which runs a set of workflow steps for each item in an input array. Map iterations run in parallel, with a concurrency limit of 40 at a time. At AWS re:Invent 2022, however, AWS announced Step Functions Distributed Map, which extends the Map state to run a sub-workflow for each item in an array or Amazon S3 dataset. It can pass millions of items to multiple child executions, with a concurrency of up to 10,000 executions at a time. This new feature could unlock new possibilities for Amazon Logistics.
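The difference between the two modes can be sketched as Amazon States Language fragments built in Python. The Lambda ARN, the bucket and prefix arguments, and the `$.items` path are hypothetical placeholders. The concurrency limits of 40 and 10,000 are the figures quoted above.

```python
# Hypothetical placeholder for the per-item worker function.
PROCESS_LAMBDA_ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:process-item"

INLINE_LIMIT = 40            # concurrency limit quoted for the inline Map state
DISTRIBUTED_LIMIT = 10_000   # concurrency limit quoted for Distributed Map


def item_processor(mode: str) -> dict:
    """Sub-workflow executed once per item, in INLINE or DISTRIBUTED mode."""
    config = {"Mode": mode}
    if mode == "DISTRIBUTED":
        # Distributed Map runs each item (or batch) as a child execution.
        config["ExecutionType"] = "STANDARD"
    return {
        "ProcessorConfig": config,
        "StartAt": "ProcessItem",
        "States": {
            "ProcessItem": {
                "Type": "Task",
                "Resource": PROCESS_LAMBDA_ARN,
                "End": True,
            }
        },
    }


def inline_map(max_concurrency: int = INLINE_LIMIT) -> dict:
    """Map state iterating over an array found in the state input."""
    if max_concurrency > INLINE_LIMIT:
        raise ValueError("inline Map concurrency is limited to 40")
    return {
        "Type": "Map",
        "ItemsPath": "$.items",
        "MaxConcurrency": max_concurrency,
        "ItemProcessor": item_processor("INLINE"),
        "End": True,
    }


def distributed_map(bucket: str, prefix: str,
                    max_concurrency: int = DISTRIBUTED_LIMIT) -> dict:
    """Map state iterating over the objects stored under an Amazon S3 prefix."""
    if max_concurrency > DISTRIBUTED_LIMIT:
        raise ValueError("Distributed Map concurrency is limited to 10,000")
    return {
        "Type": "Map",
        "ItemReader": {
            "Resource": "arn:aws:states:::s3:listObjectsV2",
            "Parameters": {"Bucket": bucket, "Prefix": prefix},
        },
        "MaxConcurrency": max_concurrency,
        "ItemProcessor": item_processor("DISTRIBUTED"),
        "End": True,
    }
```

The per-item sub-workflow is identical in both cases. Only the processor mode, the item source, and the concurrency ceiling change, which is what would make a migration to Distributed Map comparatively contained.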

A further architectural improvement would be to add support for AWS Graviton processors to the Geo-spatial algorithm. AWS Graviton processors are designed by AWS to deliver the best price performance in Amazon EC2, and they can help reduce carbon footprint and lower the per-instance cost compared to x86-based instances.

Conclusion

As a result of this partnership, experimentation with the algorithm increased by 50%, and the time needed to run the algorithm and process the data was reduced by a factor of four. Introducing security best practices helped the solution obtain Amazon security certification. The partnership and the sharing of best practices also improved the team’s experience with AWS infrastructure and High Performance Computing.

Jan Hofmann, Research Science Manager, summarized the collaboration as follows: “Partnering with AWS Professional Services team was a great learning experience for our team. Their expertise in high performance computing and DevOps on AWS services resulted in a 4x improvement in our algorithm’s run time and changed the way our team thinks about creating scientific products using AWS services.”

To learn more, read about how to use AWS Batch, AWS Step Functions, Amazon FSx for Lustre, AWS Lambda and AWS CDK.

AWS HPC Blog