AWS HPC Blog

Deploying Level 4 Digital Twin Self-Calibrating Virtual Sensors on AWS

This post was contributed by Ross Pivovar, Solution Architect, Autonomous Computing, and Adam Rasheed, Head of Autonomous Computing at AWS; and Kayla Rossi, Application Engineer, and Orang Vahid, Director of Engineering Services at Maplesoft.

In a previous post we shared the common customer use case of deploying Level 3 Digital Twin (L3 DT) virtual sensors to help operators make more informed decisions in situations where using physical sensors is difficult, expensive, or impractical. One of the challenges is that L3 DT virtual sensors use pre-trained models that deviate from real-world behavior as the physical system degrades and changes over time. Operators then become hesitant to base operational decisions solely on the L3 DT virtual sensor predictions.

Today we’ll describe how to build and deploy L4 DT self-calibrating virtual sensors where operational data is used to automatically calibrate the virtual sensor. This automated self-calibration allows the virtual sensor predictions to adapt and more closely match the real-world, allowing the operators to take proactive corrective actions to prevent failures and optimize performance. For more discussion on the different levels for digital twins, check out our previous post describing the AWS L1-L4 Digital Twin framework.

In this post, we use a Modelica-based model created using Maplesoft’s MapleSim simulation and engineering design software. MapleSim provides the tools to create engineering simulation models of machine equipment and exports them as Functional Mockup Units (FMUs), an industry-standard file format for simulation models. We then use TwinFlow, an AWS open-source framework for building and deploying predictive models at scale, to deploy the model on AWS and calibrate the FMU with probabilistic Bayesian estimation techniques that statistically infer the unmeasured model coefficients from the incoming sensor data.

Roll-to-roll manufacturing

For our use case, we’ll consider the web-handling process in roll-to-roll manufacturing (depicted in Figure 1) used for continuous materials such as paper, film, and textiles. The web-handling process involves unwinding the material from a spool, guiding it through various treatments such as printing or coating, and then winding it onto individual rolls. Precise control of tension, alignment, and speed is essential to ensure smooth processing and maintain product quality.

Figure 1: A short movie showing the dynamics of the web material handling process in roll-to-roll manufacturing. The web material is the sheet passing through the rollers, which control the tension and speed of the manufacturing process. The driven rollers set the web speed and tension.

There are many different failure mechanisms, however for this post, we’ll focus on two failure modes: high tension failures and slip failures.

Figure 2 shows a schematic diagram of the web-handling equipment with the material spans between the rollers labeled from S1 through S12. Figure 2b shows a screenshot of the MapleSim web-handling simulation model of the roll-to-roll manufacturing line.

Figure 2: a) Schematic diagram of the web handling equipment in which each span is labeled, and b) Screenshot of the MapleSim simulation model of the web handling equipment.

Referring to Figure 2, tension failures occur when the tension within the material at a particular span exceeds a threshold (175 Newtons in this example) resulting in web deformities like wrinkling and troughing. Slip failures are more subtle and occur when the relative movement between the web material and rollers isn’t in sync, causing the web to be dragged across the roller. Slip failures are quantified by measuring the slip velocity which is the difference between the linear velocity of the web material and the tangential velocity of the roller. We consider it a slip failure when the slip velocity exceeds a threshold (0.0001 m/s in this case). It can result in difficult-to-detect defects, misalignment, and reduced product quality if not controlled.
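
To make the two failure criteria concrete, here’s a minimal sketch (illustrative only; the thresholds are the example values quoted above, and the function names are our own, not from the MapleSim model):

```python
# Failure thresholds quoted in the text (example values for this scenario).
TENSION_LIMIT_N = 175.0   # span tension failure threshold [N]
SLIP_LIMIT_M_S = 1e-4     # slip velocity failure threshold [m/s]

def slip_velocity(web_speed, roller_omega, roller_radius):
    """Slip velocity: web linear speed minus roller tangential speed [m/s]."""
    return web_speed - roller_omega * roller_radius

def check_failures(span_tension, web_speed, roller_omega, roller_radius):
    """Return (tension_failure, slip_failure) booleans for one span/roller."""
    slip = slip_velocity(web_speed, roller_omega, roller_radius)
    return span_tension > TENSION_LIMIT_N, abs(slip) > SLIP_LIMIT_M_S
```

For example, a span at 180 N with a roller whose tangential speed lags the web by a millimeter per second would trip both checks.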

Tension and slip are measurable variables, but it’s expensive and intrusive to add these sensors at every location along the manufacturing line. It’s more common to have a limited number of these sensors at key locations while measuring the angular velocity (rotation speed, often colloquially referred to as RPM) of each of the rollers and relying on best practices to control the manufacturing process. This results in sub-optimal process control and relies heavily on operator experience. A digital twin of this process can be used to calculate the tension and slip velocity when combined with the proper IoT data.

Both failure mechanisms are the direct result of the rollers becoming more difficult to turn because of dirt and grime fouling the bearings, increasing the viscous friction. In our example, the viscous friction of the rollers is accounted for in the MapleSim model. The model solves the equations of motion, but requires estimates of the viscous friction coefficients to use for the rollers in the model. Viscous friction (damping) is not a directly measurable variable but can be inferred by calibrating the model using the measured angular velocities.

In essence, we’re seeking to infer the viscous friction coefficient by leveraging statistics, the FMU model, and our probabilistic Bayesian estimation methods.

We use the calibrated L4 Digital Twin to create self-calibrating virtual sensors of tension and slip velocity. These L4 DT self-calibrating virtual sensors can then be used by operators to support operational decisions to control the manufacturing process.

Example failure scenario

To understand the process of combining IoT data streams with the FMU model to create an L4 Digital Twin, we’ll examine a failure scenario that we simulated. In our synthetic failure scenario, we introduced defects for rollers 3 and 9 in the form of contamination – leading to increased friction. Here, we linearly increased the bearing viscous damping coefficients from 0 on Day 15 to 0.2 on Day 23 (which we’ve shown in Figure 3).
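
The synthetic degradation profile can be reproduced in a few lines (a sketch; the day numbers and the 0.2 end value come from the scenario above):

```python
def damping_coefficient(day, start_day=15.0, end_day=23.0, final_b=0.2):
    """Synthetic bearing viscous damping ramp: zero before start_day,
    increasing linearly to final_b at end_day, constant afterwards."""
    if day <= start_day:
        return 0.0
    if day >= end_day:
        return final_b
    return final_b * (day - start_day) / (end_day - start_day)
```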

Figure 3: Plot showing the artificially induced increase in viscous damping for rollers 3 and 9 starting from 0 on Day 15 to 0.2 on Day 23.

Figure 4 shows the angular velocity measurements for each of the rollers (both plots show the same data at different y-axis scales). Figure 4b shows that – as expected – at Day 15, the angular velocities of rollers 3 and 9 begin to change because of the increasing viscous damping we introduced to simulate dirt build-up. Since the entire web line is linked via the material, a single roller with increasing resistance affects the angular velocities of the surrounding rollers. In our case, rollers 1, 2, 3, 7, 8, 9, and 10 all experience changes in angular velocity.

By Day 21, the dirt build-up is sufficient to cause a very large change in the angular velocity of roller 9 as we’ve shown in Figure 4.

Figure 4: Incoming IoT data streams for angular velocity. Left and right figures are the same data with the scaling of the y-axis changed to enable observing tiny differences.

Figure 5 shows the corresponding span tensions and slip velocities (that we would measure with IoT sensors if they were installed) leading to tension failures in spans 4, 5, and 6 and slip failure at roller 9. Figure 5a shows spans 4, 5, and 6 exceeding the 175N threshold and Figure 5b shows roller 9 with non-zero slip velocity.

Figure 5: a) Measured span tensions during the dirt build-up scenario, in which spans 4, 5, and 6 exceed the failure threshold for the webbing material, and b) measured roller slip velocities during the dirt build-up scenario, in which roller 9 eventually exhibits significant slippage around Day 21.

As mentioned earlier, in practice operators typically measure only the angular velocity of the rollers – and span tension might be measured using a load cell at only one span across the entire production line.

This example shows the importance of calibrating the system model to use the correct viscous friction coefficients. Without regular calibrations, the model predictions for span tension and roller slip velocity will be incorrect, resulting in the operator being unaware of potential, impending failures.

Using TwinFlow to calibrate the L4 Digital Twin

TwinFlow is the AWS open-source framework for building and deploying millions of predictive models at scale on a distributed, heterogeneous architecture. TwinFlow incorporates the TwinGraph module to orchestrate the model deployment and the TwinStat module which includes statistical methods for building and deploying L4 Digital Twins. These methods include techniques to build quick-execution (seconds or less) response surface models (RSMs) from complex simulation models that could take hours to run and would be too costly and too slow to support operational decisions.
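
As an illustration of the response-surface idea (this is a generic sketch, not the TwinStat API; `slow_simulation` is a stand-in for an expensive model run, not the actual MapleSim physics):

```python
import numpy as np

# Stand-in for an expensive simulation: tension as a function of damping b.
def slow_simulation(b):
    return 150.0 + 120.0 * b + 300.0 * b**2

# Sample the simulation offline at a handful of design points...
b_samples = np.linspace(0.0, 0.25, 8)
tension_samples = np.array([slow_simulation(b) for b in b_samples])

# ...then fit a quadratic response surface that evaluates in microseconds,
# fast enough to support operational decisions.
rsm = np.polynomial.Polynomial.fit(b_samples, tension_samples, deg=2)
```

Once fit, `rsm(b)` replaces `slow_simulation(b)` in the online loop at a tiny fraction of the cost.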

The TwinStat module also includes methods to probabilistically update and calibrate L4 Digital Twins. In our case, the MapleSim model is exported as an FMU which already has a very quick execution time on the order of a few seconds. Since the degradation effect due to roller dirt accumulation is gradual (hours to days), our FMU execution time is more than sufficient – and cost-efficient.

The model calibration techniques in TwinStat include probabilistic Bayesian estimation methods like Unscented Kalman Filters (UKF), Particle Filters (PF), and Gaussian Processes. Each method has its own pros, cons, and preferred scenarios, which we’ll discuss in a future post. For our present use case, UKF provides the necessary accuracy while minimizing compute time. We used UKF in TwinFlow with the MapleSim FMU model to calibrate the viscous coefficients using the measured angular velocity IoT data.

To use TwinFlow, we needed an appropriately sized Amazon EC2 instance. For our specific scenario (a fast FMU with low memory requirements) we wanted to minimize the network overhead, hardware procurement time, and container download and activation time needed to run UKF. The number of FMU executions UKF needs per update scales as 2*D+1, where D is the number of variables included in the UKF. For our example, we have 9 measured angular velocities and 9 unmeasured viscous damping coefficients, for a total of 18 variables, so for each incoming IoT data point UKF will run the FMU 37 times. Thus, we selected an EC2 instance with around 37 cores for optimal runtime performance. In this scenario the FMU is single-threaded (you’d need a larger instance if you used a multi-threaded FMU with UKF). TwinFlow parallelizes the UKF execution with multi-processing, so the more CPUs, the shorter the runtime.
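
The sizing arithmetic looks like this (a sketch with a placeholder in place of the real FMU call; TwinFlow’s actual parallelization differs in detail):

```python
from concurrent.futures import ProcessPoolExecutor

N_MEASURED = 9   # measured roller angular velocities
N_INFERRED = 9   # unmeasured viscous damping coefficients
D = N_MEASURED + N_INFERRED    # UKF state dimension
N_SIGMA_POINTS = 2 * D + 1     # FMU runs per incoming IoT data point (37)

def run_fmu(sigma_point):
    # Placeholder for one single-threaded FMU evaluation at a sigma point.
    return sum(sigma_point)

def propagate_sigma_points(sigma_points):
    # One process per sigma point: with ~37 cores, all FMU runs execute
    # concurrently, so wall-clock time is roughly one FMU run per update.
    with ProcessPoolExecutor(max_workers=N_SIGMA_POINTS) as pool:
        return list(pool.map(run_fmu, sigma_points))
```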

In our calibration example, we used the synthetic dataset for the angular velocity plotted in Figure 4 to represent the real-world measured sensor data. We used UKF to iterate and determine the correct viscous damping coefficients in the FMU model to match the measured angular velocities. Figure 6a shows the values determined by UKF for the viscous damping coefficients, as well as the estimates of uncertainty (shaded regions around the lines).

We see that UKF correctly estimates the coefficients to be approximately zero for all rollers except 3 and 9, which increased to a value of approximately 0.20, giving us confidence that UKF is successfully calibrating the L4 Digital Twin to model the evolving behavior of the web-handling system.

Figure 6b shows that the UKF estimates for angular velocity lie directly on top of the IoT sensor data (open-circle data points). The shaded region represents uncertainty.

Figure 6: UKF estimates for a) the viscous damping coefficients (commonly denoted with the letter “b”) and b) the angular velocity of each roller, with uncertainty estimates (shaded regions). The viscous damping coefficients for rollers 3 and 9 increase from 0 to 0.2, corresponding to the failure scenario we manually introduced when creating the synthetic data, and the UKF angular velocity estimates match the synthetic IoT data.

We can now examine the performance of the L4 Digital Twin self-calibrated virtual sensor for tension and slip by checking that the residuals are close to zero. The residual is the difference between the virtual sensor predicted value and the actual IoT measured data value (in this case our synthetic dataset).

Figure 7 shows the residuals for tension and slip velocity are near zero, meaning that the virtual sensor predicted values closely match the true values, and we can eliminate the costly physical sensors for measuring tension and slip velocity.
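
The residual check itself is straightforward (a sketch; the tolerance value is a hypothetical choice, not one from this post):

```python
def residuals(predicted, measured):
    """Virtual-sensor residuals: predicted values minus measured values."""
    return [p - m for p, m in zip(predicted, measured)]

def within_tolerance(res, tol=1e-3):
    """True when every residual is close enough to zero to trust the sensor."""
    return all(abs(r) <= tol for r in res)
```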

Figure 7: Residuals of the digital twin output for tension (left) and slip velocity (right). Because this example uses noiseless synthetic data, the residuals are almost exactly zero.

Now, we compare the results of using an uncalibrated L3 virtual sensor model with the L4 self-calibrating virtual sensor.

Figure 8a shows that the L3 digital twin initially matches the measured angular velocity, but then misses the changes as the system performance degrades due to dirt accumulation, whereas the L4 self-calibrating digital twin virtual sensor evolves with the real physical system and closely matches the measured data. Similarly, Figures 8b and 8c show that the L3 (uncalibrated) virtual sensors initially match the measured data, but then underpredict the roller 9 slip velocity and the span 6 tension as the system degrades, leaving the operator completely unaware of the impending failure.

The L4 (self-calibrating) virtual sensor predictions, however, closely match the measured data, allowing the operator to take corrective actions prior to failure. The slip failure is particularly noteworthy since it induces difficult-to-detect material imperfections that can make their way into the final product. We see that the L3 uncalibrated virtual sensor initially makes reasonable predictions, but as the web-handling system behavior evolves over time (due to real-world degradation), a self-calibrating L4 virtual sensor is required to adapt the model to the changing real-world conditions.

Figure 8: Comparison of calibrated and uncalibrated digital twins relative to measured data.

AWS Architecture

These are great results, so it’s worth taking a moment to explain what architectural decisions we made that allowed us to get here.

The AWS architecture used for the L4 Digital Twin virtual sensor calibration is shown in Figure 9. In our scenario, we collected sensor data every 10-20 minutes – a sufficient time resolution to capture the failure phenomena of interest.

It’s important to balance increased time resolution against what the use case actually needs, as gathering a lot of data at high frequency results in unnecessary data storage and additional compute costs. We used an Amazon EventBridge scheduler to enable periodic calibration. Alternatively, we could have added logic to the container code to first calculate the error of the digital twin and calibrate only if an error threshold is violated. And since Amazon EventBridge can only handle a maximum of 100 scheduling rules, we’d need to modify the architecture to place a Lambda function between Amazon EventBridge and AWS Batch if we needed to scale to millions of tasks.
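
The error-threshold alternative could look like this (a hypothetical sketch; the threshold value and function names are ours, not TwinFlow’s):

```python
ERROR_THRESHOLD = 0.05   # hypothetical tolerance on mean absolute error

def mean_abs_error(predicted, measured):
    """Mean absolute error between digital twin predictions and IoT data."""
    return sum(abs(p - m) for p, m in zip(predicted, measured)) / len(measured)

def maybe_calibrate(predicted, measured, calibrate):
    """Run the (expensive) UKF calibration only when the digital twin's
    prediction error exceeds the threshold; otherwise skip this cycle."""
    if mean_abs_error(predicted, measured) > ERROR_THRESHOLD:
        calibrate()
        return True
    return False
```

This trades a little extra in-container compute for fewer full calibration runs.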

In Step 1 in Figure 9, we’ve shown how you can download TwinFlow to a temporary EC2 instance, where you can customize, build, and push containers to cloud repositories, and then deploy your infrastructure as code (IaC).

Next, you can modify the example container and insert your own model. The container is then pushed to Amazon Elastic Container Registry (Amazon ECR), where it’s available to all AWS services.

At Step 4, you connect your IoT data from an edge location to a cloud database like AWS IoT SiteWise, a serverless database designed to handle sensor data with user-defined attributes.

At this point, the Amazon EventBridge scheduler calls tasks in an AWS Batch compute environment which loads the customized container, pulls data from AWS IoT SiteWise, calibrates the L4 Digital Twin, saves the calibration to an S3 bucket, makes physics predictions, and saves them back in AWS IoT SiteWise.
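
The per-cycle sequence just described can be sketched as plain orchestration code, with each callable standing in for the corresponding AWS integration (all names here are hypothetical stubs, not actual TwinFlow or AWS APIs):

```python
def run_calibration_cycle(pull_data, calibrate, save_calibration,
                          predict, publish):
    """One scheduled Batch task. Each argument is a callable standing in
    for an AWS integration (SiteWise reads/writes, S3 saves)."""
    data = pull_data()                    # pull recent IoT data (AWS IoT SiteWise)
    params = calibrate(data)              # UKF-calibrate the FMU coefficients
    save_calibration(params)              # persist the calibration (Amazon S3)
    predictions = predict(params, data)   # virtual-sensor physics predictions
    publish(predictions)                  # write back (AWS IoT SiteWise)
    return predictions
```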

AWS Batch selects the optimal EC2 instance type, auto-scales up or out, and logs all task output in Amazon CloudWatch. Finally, we used AWS IoT TwinMaker to create dashboards in Grafana so operators can review the measured and predicted data in near real time – and make operational decisions.

Figure 9: AWS Cloud architecture for periodic calibration of the digital twin.

Conclusion

In this post, we showed how to build an L4 Digital Twin self-calibrating virtual sensor on AWS using an FMU model created by MapleSim.

MapleSim provides the physics model in the form of an FMU, and TwinFlow allows us to use incoming IoT data to probabilistically calibrate the L4 Digital Twin virtual sensor for span tension and roller slip velocity. In future posts, we’ll discuss how to use the calibrated L4 Digital Twin to perform scenario analysis, risk assessment, and process optimization.

If you want to request a proof of concept or if you have feedback on the AWS tools, reach out to us at ask-hpc@amazon.com.

Some of the content and opinions in this blog are those of the third-party author and AWS is not responsible for the content or accuracy of this blog.

Kayla Rossi

Kayla is an Application Engineer at Maplesoft providing customers with advanced engineering support for the digital modeling of their complex production machines. Her project experience includes the simulation and analysis of web handling and converting systems across various industrial applications, including printing and battery production.

Dr. Orang Vahid

Orang has over 20 years of experience in system-level modeling, advanced dynamic systems, frictional vibration and control, automotive noise and vibration, and mechanical engineering design. He is a frequent invited speaker and has published numerous papers on various topics in engineering design. At Maplesoft he is Director of Engineering Services, with a focus on the use and development of MapleSim solutions.

Ross Pivovar

Ross has 14 years of experience in a combination of numerical and statistical method development for both physics simulations and machine-learning. Ross is a Senior Solutions Architect at AWS focusing on development of self-learning digital twins.

Dr. Adam Rasheed

Adam has 26 years of experience in early to mid-stage technology development spanning both industrial and digital domains. At AWS, he leads the Autonomous Computing team, which is developing new markets for HPC-ML workflows associated with autonomous systems, including a focus on digital twins.