Deploying predictive models and simulations at scale using TwinFlow on AWS
This post was contributed by Vidyasagar Ananthan, Senior SA; Ross Pivovar, Senior SA; Satheesh Maheswaran, Senior SA; Cheryl Abundo, Principal SA; and Adam Rasheed, Head of Autonomous Computing, at AWS.
At AWS, we work with customers who use predictive models and simulations for the design and operations of their equipment and processes. These workloads range widely, including engineering design optimization, drug discovery, retail forecasting, and industrial asset optimization, and they share a common need: performing hundreds of thousands of simulations in a price-performant manner with full auditability. In Figure 1, we show four categories of predictive modeling and simulation use cases that our customers employ to improve their business operations. We see customers using simulation to perform engineering design, conduct scenario analysis to forecast outcomes under different input conditions, perform system-of-systems analysis of multi-hierarchy systems to understand the interaction effects of adjacent systems, and deploy predictive digital twins to improve probabilistic predictions.
In this blog, we describe TwinFlow, an open-source framework we developed to build and deploy predictive models at scale on AWS for the above use cases. We describe the general challenges in deploying predictive models at scale and provide examples showing the versatility of the TwinFlow framework. The TwinFlow code has been released and documented in AWS Samples, with a series of starter examples to help you become familiar with the framework.
Challenges in deploying predictive modeling and simulation workloads
We worked with customers to identify common underlying computing patterns for their predictive modeling and simulation workloads. Even though the business use cases varied widely, we found the following common needs:
- Ability to build and deploy hundreds of thousands of models across an at-scale distributed computing architecture in a computationally efficient manner. This is common in situations when modeling many different scenarios, or when deploying models for fleets.
- Ability to orchestrate dynamic pipelines (dynamic directed acyclic graphs, or DAGs), where the workflow execution path is not known a priori and is determined by the predictive modeling algorithm at runtime (algorithms-as-a-pipeline).
- Ability to run tasks within a pipeline asynchronously (as opposed to sequentially) in order to maximize task parallelization for compute performance.
- Ability to run models across a heterogeneous compute architecture within a single workflow pipeline, including hybrid cloud and on-premises architectures. Some models run on the cloud using GPUs, CPUs, or HPC, while others must run locally on-premises.
- Ability to run models written in different languages or simulation software within a single workflow pipeline.
- Ability to apply probabilistic (Bayesian) methods for models to self-calibrate or self-learn based on operational data such as time series data, inspection data, and event data.
- Ability for traceability and auditability for the workflows. For traceability, this includes the ability to “go-back-in-time” and recreate an analysis (say that was done a year prior), using the data, models, and techniques available at that time. For auditability, this includes the ability to know who made what changes and when.
The above requirements span a hybrid combination of high performance computing (HPC) and machine learning (ML) workloads that are not easily addressed with today's available frameworks. For example, HPC simulation workloads are managed using static schedulers (such as Slurm and AWS Batch), while ML workloads are orchestrated using ML toolkits (such as Kubeflow and Airflow) that do not dynamically scale. Modeling frameworks exist today that satisfy different mixes of these requirements; however, none satisfies them all, leaving customers to write custom scripted solutions and make suboptimal compromises. Our aim is to enable our customers and partners to build these solutions at scale on AWS as easily as possible while satisfying all the requirements.
TwinFlow is a framework to orchestrate hundreds of thousands of compute tasks across a distributed, heterogeneous computing architecture with traceability and auditability. Figure 2 shows that TwinFlow consists of three modules to address the undifferentiated heavy lift for at-scale dynamic workflow management and infrastructure provisioning for predictive models.
TwinGraph is the workflow orchestration module; it stores the metadata for each action in a queryable graph for traceability and auditability. In addition to enabling dynamic DAGs, a key differentiator is that TwinGraph enables extreme scaling in graph size and accelerated graph execution. A TwinGraph pipeline can have a mix of tasks running on the cloud and locally on-premises in a hybrid compute mode. Each task can run asynchronously: locally with or without containers (Docker Compose, miniK8s), remotely on Kubernetes clusters (Amazon EKS or equivalent), in queueable batch systems (AWS Batch), or serverless (AWS Lambda, AWS Fargate). Finally, each task within a pipeline can be containerized, and the models can be written in any language invokable through Python APIs (such as Python itself, Fortran, C, or Java) or executed as runtime models from simulation software (such as MATLAB, Ansys TwinBuilder, Siemens XD, or Modelica). Traceability and auditability are accomplished by storing the intermediate outputs and node attributes in an Amazon Neptune database or Apache TinkerGraph, using Gremlin as the query language.
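TwinGraph's own decorator-based API is documented in the AWS Samples repository. Purely as an illustration of the dynamic-DAG pattern it enables, the sketch below uses plain Python asyncio (with hypothetical task and objective names, not TwinGraph calls) to fan simulation tasks out asynchronously and decide the next task from results at runtime:

```python
import asyncio

async def simulate(case_id: int) -> float:
    # Stand-in for a containerized simulation task (e.g., one AWS Batch job).
    await asyncio.sleep(0)          # yield control; real tasks run remotely
    return 1.0 / (1 + case_id)     # hypothetical objective value

async def pipeline(n_cases: int) -> dict:
    # Fan out all simulation tasks asynchronously rather than sequentially.
    results = await asyncio.gather(*(simulate(i) for i in range(n_cases)))
    best = min(range(n_cases), key=lambda i: results[i])
    # Dynamic DAG: the next task is chosen from the results at runtime,
    # so the execution path is not known a priori.
    if results[best] > 0.1:
        refined = await simulate(best + n_cases)   # extra refinement task
        return {"best_case": best, "refined": refined}
    return {"best_case": best}

out = asyncio.run(pipeline(4))
```

In a real TwinGraph pipeline, each `simulate` call would be a containerized task whose attributes are recorded in the graph database for later querying.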
TwinStat is a key component of the TwinFlow framework: a library of algorithms and methods commonly used for probabilistic modeling. Some of these methods provide functionality not available in existing, maintained Python packages, while others are custom methods we developed to simplify the developer's work. The functions can be grouped into the following categories: 1/ Model building; 2/ Sensitivity analysis; 3/ Model calibration; 4/ Uncertainty quantification; 5/ Optimization. We are continuously adding to this library; refer to the AWS Samples repository for the latest list of available methods.
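To give a flavor of the uncertainty-quantification category (this is not TwinStat's actual API), here is a minimal Monte Carlo sketch in plain Python that propagates a Gaussian input uncertainty through a hypothetical model:

```python
import random
import statistics

def model(x: float) -> float:
    # Hypothetical deterministic predictive model.
    return 2.0 * x + 1.0

def monte_carlo_uq(mean: float, std: float, n: int = 10_000, seed: int = 0):
    # Sample the uncertain input and propagate each sample through the
    # model to estimate the output distribution's mean and spread.
    rng = random.Random(seed)
    outputs = [model(rng.gauss(mean, std)) for _ in range(n)]
    return statistics.mean(outputs), statistics.stdev(outputs)

mu, sigma = monte_carlo_uq(mean=3.0, std=0.5)
```

For this linear model the result can be checked analytically: the output mean is 2(3) + 1 = 7 and the output standard deviation is 2(0.5) = 1, which the Monte Carlo estimates approach as `n` grows.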
TwinModules is a library of helper functions that simplify deploying predictive models on AWS. The helper functions address undifferentiated heavy lift and make it easier for developers to use the AWS Cloud, with functionality such as accessing and managing data sources and managing container orchestration. TwinModules provides functionality to connect to AWS data sources, provision infrastructure via managed services, and enable algorithmic event-trigger management. It also provides predefined templates for common use cases such as virtual sensors, signal pre/post-processing, and model self-calibration.
Demonstrating the versatility of TwinFlow for customer workloads
TwinFlow is a versatile framework enabling a variety of workloads across the four use cases in different industries. Below, we describe examples to demonstrate the breadth of TwinFlow.
Simulation-based optimization: Simulation-driven design optimization is used by OEMs performing engineering design of equipment and processes. For example, manufacturers perform thousands of structural analysis simulations to optimize the physical design of subcomponents. Similarly, pharmaceutical companies use simulations to design the layout of new bioreactor manufacturing lines to optimize yield. In Figure 3, we show the results of a component-level engineering design optimization in an automotive use case involving injection of a liquid that hardens into a foam to provide structural strength for vehicle body panels. The challenge here is to find the optimal injection trajectory to maximize the contact surface while minimizing void formation and foam wastage. TwinFlow is used to orchestrate three distinct models running on different compute infrastructure (foam growth on AWS Batch, trajectory optimization in Docker, and trajectory perturbation in local Python) within the optimization algorithm. The optimization ran 1,280 foam growth simulations, generating 11,140 graph elements, with each simulation taking 9-11 minutes (including 5-6 minutes of post-processing).
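The shape of such a simulation-in-the-loop optimization can be caricatured in a few lines. The sketch below is not the actual trajectory optimizer from this example: `foam_coverage` is an invented one-parameter stand-in for an entire foam-growth simulation, and the Gaussian step stands in for the trajectory-perturbation task:

```python
import random

def foam_coverage(angle: float) -> float:
    # Hypothetical stand-in for a foam-growth simulation: contact
    # surface (to be maximized) peaks at an injection angle of 30 degrees.
    return -((angle - 30.0) ** 2)

def optimize(n_iter: int = 200, seed: int = 1) -> float:
    # Simple stochastic hill climb: perturb the trajectory parameter and
    # keep the candidate whenever the simulated objective improves.
    rng = random.Random(seed)
    angle, best = 0.0, foam_coverage(0.0)
    for _ in range(n_iter):
        cand = angle + rng.gauss(0.0, 2.0)   # trajectory perturbation step
        score = foam_coverage(cand)          # one "simulation" call
        if score > best:
            angle, best = cand, score
    return angle

best_angle = optimize()
```

In the real workload, each objective evaluation is a 9-11 minute batch simulation, which is why orchestrating the evaluations asynchronously across distributed compute matters.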
In another example, we use TwinFlow for system-level engineering design optimization of an offshore wind farm (using the International Energy Agency (IEA) benchmark IEA37 wind farm use case). The challenge here is to identify the optimal layout (geographic coordinates) of 64 wind turbines to maximize wind farm energy production while taking into account the wake generated by each wind turbine within the wind farm. The approach uses an analytical wake model (Bastankhah & Porte-Agel, 2014) with added turbulence (Frandsen, 2017), solved using the open-source PyWake package. Figure 4a shows a video of the wind farm layout as the optimization progresses, and Figure 4b shows the convergence curve. The optimization took 450 iterations to converge and 18 minutes to run, with an approximate runtime of 10 seconds per simulation.
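PyWake implements the Gaussian wake model cited above; to keep a sketch self-contained, the toy below uses the simpler Jensen top-hat wake model with illustrative parameter values. It shows the physics the layout optimizer exploits: wider spacing between aligned turbines recovers wake losses and raises total farm power:

```python
import math

def jensen_deficit(x_down: float, ct: float = 0.8, k: float = 0.05,
                   d: float = 130.0) -> float:
    # Jensen top-hat wake model: fractional velocity deficit a distance
    # x_down (m) directly downstream of a turbine with rotor diameter d (m),
    # thrust coefficient ct, and wake decay constant k.
    if x_down <= 0:
        return 0.0
    return (1 - math.sqrt(1 - ct)) / (1 + 2 * k * x_down / d) ** 2

def farm_power(xs, u_inf: float = 10.0) -> float:
    # Aligned row of turbines; power scales with wind speed cubed, and
    # each turbine sees only its nearest upstream wake (a simplification).
    total = 0.0
    for x in xs:
        gaps = [x - xu for xu in xs if xu < x]
        deficit = jensen_deficit(min(gaps)) if gaps else 0.0
        total += (u_inf * (1 - deficit)) ** 3
    return total

tight = farm_power([0.0, 400.0, 800.0])      # closely spaced row
spread = farm_power([0.0, 1000.0, 2000.0])   # widely spaced row
```

The full 64-turbine problem adds wake superposition, wind roses, and geographic constraints, which is what makes each layout evaluation a nontrivial simulation.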
Scenario analysis: It is very common to use simulations to perform what-if scenario analysis to understand the implications of different engineering design choices or policy decisions. To demonstrate the versatility of TwinFlow, we present a very different example: modeling crowd dynamics during commuting hours in the Kensington area of London. We consider two scenarios of different road closures to accommodate construction of a new building. We use TwinFlow to deploy the crowd model, orchestrate the workflow, and run the different scenarios. Figure 5a shows a video of the simulation, where each dot represents an individual walking to their destination. Figure 5b shows quantitatively that the first road closure configuration results in a larger number of people (1,375) reaching their destination within 350 seconds. City planners can use this type of scenario analysis to evaluate different road-closure options and pick the one that minimizes the impact to commuters.
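A scenario comparison of this kind can be caricatured as a simple agent-based simulation. The sketch below is not the actual crowd model; the detour lengths, route lengths, and walking speeds are all invented. It only illustrates the pattern of running the same model under two input scenarios and comparing an outcome metric (arrivals within a time limit):

```python
import random

def commuters_arrived(detour_m: float, t_limit: float = 350.0,
                      n_agents: int = 2000, seed: int = 7) -> int:
    # Each agent walks a route of random length plus the closure detour
    # at a random walking speed; count arrivals within the time limit.
    rng = random.Random(seed)
    arrived = 0
    for _ in range(n_agents):
        route_m = rng.uniform(200.0, 600.0) + detour_m
        speed = rng.uniform(1.2, 1.6)          # walking speed, m/s
        if route_m / speed <= t_limit:
            arrived += 1
    return arrived

# Same seed for both runs, so the scenarios differ only in the detour.
closure_a = commuters_arrived(detour_m=50.0)    # shorter detour
closure_b = commuters_arrived(detour_m=150.0)   # longer detour
```

Fixing the random seed across scenarios is the toy equivalent of the reproducibility TwinFlow provides through traceable workflows: the scenarios differ only in the intended input.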
System of systems analysis: In engineering design of systems, we want to understand the non-linear effects of operational changes in one subsystem on the performance of other subsystems. Orchestrating these multi-hierarchy, multi-fidelity simulations can be challenging, with the overall workflow consisting of a heterogeneous mix of models (computational fluid dynamics, finite element analysis) written in different languages (Fortran, C++) and running on heterogeneous compute (CPUs, GPUs, HPC). Today, such system-of-systems analyses are performed using specialized software or custom code. We used TwinFlow to model a combined cycle power plant consisting of a gas turbine, a heat recovery steam generator, and a steam turbine. Each of these major components itself consists of subcomponents; for example, the gas turbine consists of a compressor, the combustor, and the turbine. The overall power plant hierarchy is shown as a graph in Figure 6a. We used a 0D thermodynamic model of the power plant running on a C5 instance, with the compressor sub-model running serverless on Lambda. Figure 6b shows the compressor performance degradation resulting in increased NOx in the emissions.
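The hierarchical composition can be sketched as one model calling another, with each function standing in for a model that could run on different infrastructure (the sub-model ran on Lambda in our example). Every relationship and number below is illustrative, not the actual plant model:

```python
def compressor_efficiency(mass_flow: float, degradation: float = 0.0) -> float:
    # Sub-model (would run serverless on Lambda in the example above):
    # efficiency falls off with degradation. All numbers are illustrative.
    return 0.88 * (1.0 - degradation) * min(1.0, mass_flow / 100.0)

def plant_nox(mass_flow: float, degradation: float) -> float:
    # 0-D plant-level model (would run on a C5 instance): lower compressor
    # efficiency raises firing temperature, which raises NOx emissions.
    # The inverse relationship here is a toy stand-in, in hypothetical ppm.
    eff = compressor_efficiency(mass_flow, degradation)
    return 20.0 / eff

healthy = plant_nox(100.0, degradation=0.0)
degraded = plant_nox(100.0, degradation=0.10)
```

The orchestration value is that the plant model and sub-model need not share a language, runtime, or machine; the workflow graph carries the intermediate values between them.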
Digital Twin Predictive Modeling (L3/L4): Our digital twin customers tell us that they are seeking to use predictive modeling and simulation digital twins to improve their operations and strategic planning. Example use cases include virtual sensors, anomaly detection, failure prediction, and operational scenario planning for complex industrial facilities and fleets of assets. In a prior blog, we described AWS’ Digital Twin leveling index for customers to understand these L3 Predictive and L4 Living digital twin use cases and the technologies needed to achieve their business objectives.
In our example below, we show an L3 Predictive digital twin virtual sensor that measures exit pressure for a multi-stage natural gas compressor train. Virtual sensors are prediction models used in situations where measurement with physical sensors is too difficult, expensive, or impractical. The compressor train virtual sensor is developed as a Modelica model in Ansys TwinBuilder and deployed on AWS using the Ansys TwinDeployer runtime engine within the TwinFlow framework. The results of the physical and virtual sensors are brought together and visualized in AWS IoT TwinMaker.
For our last example, we demonstrate L4 Living digital twins for predicting the individual battery performance in a fleet of electric vehicles (EVs). This use case is relevant for EV manufacturers, battery manufacturers, and drivers whose considerations include predicting driving range, long-term battery health, and residual value for secondary applications. Answering these questions is challenging because each battery is unique due to its specific environmental operating conditions, usage patterns, and manufacturing variability. In this example, we use TwinFlow to create the initial model of a new battery for each EV, and then individually calibrate each of the battery models using probabilistic Bayesian estimation techniques based on the operational usage pattern (routes driven) and the specific in-vehicle battery voltage measurements.
Figure 8 shows the operational history of one specific EV in the fleet and how the L4 digital twin battery model is periodically calibrated when the model error exceeds the threshold. The error is calculated at the end of each route (blue dot), and if the error is above the threshold, a model update is triggered (red square). The error for the non-updated model prediction (blue line) drifts higher, whereas the updated model prediction (green line) stays near or below the threshold.
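The calibrate-on-threshold loop in Figure 8 can be sketched as follows. The battery model, capacity-fade schedule, and deterministic re-fit below are invented stand-ins (TwinFlow uses probabilistic Bayesian estimation for the actual update); the sketch only shows the trigger logic, where a model update fires when prediction error exceeds the threshold:

```python
def battery_voltage(capacity: float, load: float) -> float:
    # Hypothetical battery model: voltage sag grows as capacity fades.
    return 4.0 - load / capacity

def run_twin(true_caps, threshold: float = 0.05):
    # Per-route loop: compare the twin's prediction to the in-vehicle
    # measurement; recalibrate only when the error exceeds the threshold.
    est_cap = 10.0                   # initial (new-battery) capacity estimate
    updates = 0
    for true_cap in true_caps:       # true capacity fades route by route
        load = 5.0
        measured = battery_voltage(true_cap, load)
        predicted = battery_voltage(est_cap, load)
        if abs(predicted - measured) > threshold:
            # Deterministic re-fit standing in for the Bayesian update.
            est_cap = load / (4.0 - measured)
            updates += 1
    return est_cap, updates

caps = [10.0 - 0.05 * i for i in range(40)]   # gradual capacity fade
final_est, n_updates = run_twin(caps)
```

Because updates only fire when drift accumulates past the threshold, the fleet-scale cost of recalibration stays proportional to actual degradation rather than to the number of routes driven.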
The above set of examples shows the breadth and versatility of the TwinFlow framework. We've shown examples across different industries (automotive, renewable energy, city planning), a variety of modeling methods (computational fluid dynamics, finite elements, agent-based modeling, custom partial differential equation solvers), and a variety of use cases (engineering design, systems modeling, epidemiology, battery performance, component degradation).
In this blog, we shared our TwinFlow framework for customers and partners to develop predictive modeling and simulation workloads for their business operations. We encourage you to learn more by downloading the framework and working through the examples, which take you step-by-step through the process.