AWS for Industries
Serverless self-service portal for running engineering applications with on-premises software licensing
Background
Energy companies are seeking ways to run engineering software in the cloud, especially applications that need large amounts of compute and storage for short periods, so they can provision simulation capacity on demand and shorten project turnaround (for example, a Florida-based utility runs PROMOD, Siemens Simcenter customers run CFD, and Ansys LS-DYNA customers run FEA on AWS).
Driven by cost reduction and operational excellence goals, business users commonly ask whether they can run their software in the cloud while keeping a bundle of licenses installed locally on on-premises servers. In a popular licensing model known as floating (or network) licensing, applications communicate with a license server to check out authentication tokens at startup and return them at exit. These license hosts typically run in the corporate network behind firewalls, and users on the same network (sometimes through VPNs) share the pool of procured licenses.
Problem statement
The end-user license agreement (EULA) between a software vendor and its customers may or may not allow license transfer to the cloud, and transferring can incur additional charges or contract modifications. To give engineering applications access to virtually unconstrained compute power in the cloud while leaving the licenses on premises (avoiding migration effort and the risk of violating the agreement), the hybrid deployment approach illustrated in this AWS whitepaper can provide license access to cloud instances.
Apart from the licensing concern, another challenge for end users performing engineering studies in the cloud is accessing compute instances quickly and securely without having to log in to the AWS console or acquire cloud expertise.
To fill this gap, AWS released Research and Engineering Studio (RES), an open source, web-based portal aimed at enhancing productivity and collaboration for research and engineering. The RES web portal enables not only administrators but also researchers and engineers to access and manage their workspaces. RES uses a project-based management model to define access permissions, allocate resources, and manage budgets for a set of tasks or activities. It is a powerful product for R&D-oriented design and analysis.
However, RES may not be the best fit for hosting commercial engineering software under the previously stated requirement, because its architecture does not suit a hybrid setup of license server and application suites. Furthermore, the RES web application is hosted on an always-on server, which limits scalability and cost efficiency. In this post, we discuss a self-service portal built on a serverless architecture that provides features similar to RES, tailored for engineers running professional software with on-premises licensing. We also provide a user implementation case.
Solution: Web portal built with AWS Amplify frontend and serverless backend
Almost every office-based engineer knows how to browse the web and use a remote desktop connection, but only a few have hands-on experience with the cloud or AWS. The objective of this self-service web portal is to create a user-centric UI and a one-stop shop for checking out compute resources on AWS for engineering studies that need a license from an on-premises host.
The following solution overview illustrates the user portal built on a serverless architecture and how it manages compute resource checkout for engineering analysis and simulation with a hybrid cloud setup.
Figure 1. Architecture overview of self-service web portal for engineering study platform with hybrid cloud setup
AWS Amplify streamlines the frontend hosting through a fully-managed continuous integration and continuous delivery (CI/CD) and hosting service for fast, secure, and reliable static and server-side rendered apps that scale with business demands. Users can build their frontend UI or bring an existing web application built in modern web frameworks such as React, Angular, Vue, and Next.js to suit their needs. Specifically in this solution, some basic information should be included in the frontend to facilitate the use of AWS resources:
- The instance types allowed to select from a pre-approved list
- Compute environment info, for example, Windows or Linux, CPU/GPU/Memory
- Software suite such as basic modules and add-ons
- Total active licenses/available licenses
Furthermore, there should be buttons on the UI for engineers to do the following:
- Select/change the instance type for their application
- Start/stop/terminate instances
- Open a DCV session in a new browser tab
- Copy Windows login password to clipboard
An example web portal is shown in the following diagram:
Figure 2. An example self-service web portal for a user to check out compute resources
Integrating Amazon Cognito with Amplify authenticates users in the frontend. Amazon Cognito can also authorize REST API calls made through Amazon API Gateway or GraphQL API calls made through AWS AppSync. A user's IT team can integrate Amazon Cognito with their SAML-compliant identity provider (IdP), for example, Active Directory Federation Services (ADFS). This enables them to authenticate the application users through either their existing on-premises Active Directory or an AWS-managed Active Directory, so there is no need to grant end users access to an AWS account.
On the other side, the portal has a serverless backend composed of AWS Lambda and AWS Step Functions as the compute layer, in tandem with Amazon DynamoDB, AWS Secrets Manager, and AWS Systems Manager Parameter Store as the storage layer. From the compute perspective, a Lambda function is added behind each API to handle instance start, stop, termination (administrators only), and password retrieval for a Windows instance, respectively. An optional function can generate a secure URL with a session ID and an authentication token for a DCV connection, if a DCV Gateway and DCV Session Manager are configured to provide secure access sessions (see the instructions in this AWS post). A Step Functions workflow is created to "safe stop" the compute instance in case an engineering workspace is open with unsaved changes; a detailed explanation of how it works is given later in this post. From the storage point of view, DynamoDB tables store user-specific information such as the selected instance type, instance state, instance uptime, private IP, and private DNS, as well as shared data such as the total and available licenses of a software suite and user occupancy. Secrets Manager is an ideal service for storing the login credentials for a Windows instance or the private key for a Linux instance, while Parameter Store holds environment variables as key-value pairs for the Lambda functions and Amplify configuration (see this Amplify document).
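As a minimal sketch of the compute layer just described, the following Lambda handler routes the portal's actions to the matching Amazon EC2 API calls, with termination gated to administrators. The function name, event fields, and the "admins" group name are assumptions, not the post's exact implementation:

```python
import json

ALLOWED_ACTIONS = {"start", "stop", "terminate"}

def handle_instance_action(action, instance_id, ec2=None, is_admin=False):
    """Route a portal action to the matching EC2 API call.
    Terminate is restricted to administrators, per the portal design."""
    if action not in ALLOWED_ACTIONS:
        return {"statusCode": 400,
                "body": json.dumps({"error": f"unsupported action: {action}"})}
    if action == "terminate" and not is_admin:
        return {"statusCode": 403,
                "body": json.dumps({"error": "terminate is admin-only"})}
    if ec2 is None:
        import boto3  # deferred so the routing logic is testable without the AWS SDK
        ec2 = boto3.client("ec2")
    if action == "start":
        ec2.start_instances(InstanceIds=[instance_id])
    elif action == "stop":
        ec2.stop_instances(InstanceIds=[instance_id])
    else:
        ec2.terminate_instances(InstanceIds=[instance_id])
    return {"statusCode": 200,
            "body": json.dumps({"instanceId": instance_id, "action": action})}

def lambda_handler(event, context):
    # Assumes an API Gateway proxy event with a Cognito authorizer;
    # the "admins" group name is a placeholder for your own admin group.
    body = json.loads(event.get("body") or "{}")
    claims = event.get("requestContext", {}).get("authorizer", {}).get("claims", {})
    is_admin = "admins" in claims.get("cognito:groups", "")
    return handle_instance_action(body.get("action"), body.get("instanceId"),
                                  is_admin=is_admin)
```

Keeping the routing logic separate from the boto3 client also makes it straightforward to unit test with a stubbed EC2 client.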
The application stack is architected around Amazon Elastic Compute Cloud (Amazon EC2). Depending on the engineering software requirements, a range of instance types should be offered to end users. Compute-optimized instances such as c7i and c7a suit compute-intensive engineering applications. Memory-optimized instances such as r4 and x1 are good choices for memory-intensive workloads. Accelerated computing instances such as g4 and g5 should be considered for applications that require graphics rendering, and instances such as hpc7a and hpc6id are well suited to simulations that need high performance computing (HPC). The users' compute instances can connect to an Amazon FSx file system (OpenZFS or NetApp ONTAP) for data sharing. Once configured, the commercial software installed on these instances can communicate with the on-premises license server to request license tokens through a dedicated connection (either AWS Site-to-Site VPN or AWS Direct Connect), while end users can open a browser tab to connect to the compute instances' virtual desktops through Amazon DCV Connection Gateway or directly through DCV Server (a server-side service installed on Amazon EC2). To keep the virtualized environment secure, reliable, and up to date for engineering studies, EC2 Image Builder provides an automated pipeline for image management.
License tracking
To track software license usage for end users, query the application's license manager periodically (if an API or local utility is provided) and send the license quotas to a DynamoDB table through the dedicated connection and AWS PrivateLink. Another option is to install the Amazon CloudWatch agent on the on-premises license host to ship the license manager logs to Amazon CloudWatch. A Lambda function then parses the logs and writes the license status (for example, the total number of licenses in use and the number still available) to the DynamoDB table. Amplify can query the table directly through AWS AppSync and surface the license status in the UI, as shown in the preceding figure.
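The log-parsing option can be sketched as follows. The FlexLM-style `lmstat` line format and the `LicenseStatus` table schema are assumptions; adjust them to your license manager's actual output:

```python
import re

# Matches FlexLM-style lmstat lines, for example:
#   Users of PSCAD:  (Total of 16 licenses issued;  Total of 5 licenses in use)
# The exact format is an assumption; adapt the pattern to your license manager.
LMSTAT_LINE = re.compile(
    r"Users of (?P<feature>\S+?):\s+\(Total of (?P<total>\d+) licenses? issued;"
    r"\s+Total of (?P<used>\d+) licenses? in use\)"
)

def parse_license_status(log_text):
    """Extract per-feature license counts from license manager log text."""
    status = {}
    for m in LMSTAT_LINE.finditer(log_text):
        total, used = int(m["total"]), int(m["used"])
        status[m["feature"]] = {"total": total, "used": used,
                                "available": total - used}
    return status

def write_status(table, status):
    """Persist the counts; table is a boto3 DynamoDB Table resource,
    for example boto3.resource("dynamodb").Table("LicenseStatus")."""
    for feature, counts in status.items():
        table.put_item(Item={"feature": feature, **counts})
```

The Lambda function subscribed to the CloudWatch log group would decode the log events, concatenate their messages, and pass them through `parse_license_status` before writing to DynamoDB.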
Instance Safe Stop Workflow with Step Functions
Per the cost optimization pillar of the AWS Well-Architected Framework, idle instances should be stopped or terminated to avoid unnecessary costs. Amazon EC2 doesn't charge for compute on stopped instances (attached EBS volumes still accrue storage charges), so stopping an unused instance usually suffices unless a launch from a new Amazon Machine Image (AMI) is needed to pick up software updates required for the study. RES can automatically stop a virtual desktop if the instance stays idle for over 15 minutes, which is realized through CloudWatch alarm actions. In reality, engineering application users don't expect an abrupt stop without advance notification, because they may need to save a workspace in which they have adjusted a model or settings. To mitigate the loss of temporary data, a more considerate workflow can be implemented with Step Functions to perform a pre-stop completion check. These tasks can be added to the state machine before actually stopping the instance: save the software workspace, copy intermediate results to shared storage such as FSx, and send a well-composed email notifying users that the instance is stopping. Workspace saving and data copying can be achieved by invoking an AWS Systems Manager Run Command to run a local command on the user's compute instance, or by calling a third-party API if the software provides a RESTful interface.
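The Run Command step can be sketched as below. The save invocation (`app.exe /save`) and the paths are hypothetical placeholders, since the actual save command is application-specific:

```python
def build_pre_stop_commands(work_dir="C:\\work", share_path="Z:\\results"):
    """Assemble the pre-stop commands. Both the save invocation
    ('app.exe /save') and the paths are hypothetical placeholders."""
    return [
        "app.exe /save",  # vendor-specific workspace save call (placeholder)
        f"Copy-Item -Path {work_dir}\\* -Destination {share_path} -Recurse",
    ]

def run_pre_stop_tasks(instance_id, **kwargs):
    """Send the commands to the user's instance with SSM Run Command."""
    import boto3  # deferred so the command builder is testable without the AWS SDK
    ssm = boto3.client("ssm")
    resp = ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunPowerShellScript",
        Parameters={"commands": build_pre_stop_commands(**kwargs)},
    )
    return resp["Command"]["CommandId"]
```

The `AWS-RunPowerShellScript` document targets Windows instances; a Linux fleet would use `AWS-RunShellScript` with shell commands instead.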
CloudWatch alarms can't trigger Step Functions directly. To start a Step Functions workflow from a CloudWatch alarm, an Amazon EventBridge (previously known as CloudWatch Events) rule must be created on the alarm's state transition, with the state machine set as the rule's target. The alarm enters the ALARM state when the rule expression over the monitored metrics is satisfied, and Amazon SNS immediately sends a notification about the instance's prolonged idle status. The state machine then runs the tasks to safe-stop the user's compute instance when the event triggers it. The event-driven workflow is illustrated in the following architecture diagram.
Figure 3. Event-driven workflow for instance safe stop and notification
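The EventBridge wiring described above can be sketched as follows. The alarm name, rule name, and ARNs are hypothetical and come from your own deployment:

```python
import json

# Hypothetical name: the idle-detection alarm created for a user's instance.
ALARM_NAME = "eng-portal-idle-instance"

# Match the CloudWatch alarm's transition into the ALARM state.
EVENT_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {
        "alarmName": [ALARM_NAME],
        "state": {"value": ["ALARM"]},
    },
}

def create_idle_rule(events_client, state_machine_arn, role_arn):
    """Create an EventBridge rule that starts the safe-stop state machine
    when the idle alarm fires. events_client is boto3.client("events")."""
    rule_name = f"{ALARM_NAME}-rule"
    events_client.put_rule(Name=rule_name, EventPattern=json.dumps(EVENT_PATTERN))
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "safe-stop-sfn", "Arn": state_machine_arn,
                  "RoleArn": role_arn}],
    )
```

The role passed as `RoleArn` must grant EventBridge permission to call `states:StartExecution` on the target state machine.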
The state machine first enters a "Wait" state for a period of time before it proceeds to the pre-stop tasks. When the timer expires, it sends commands to the instance to save the workspace and copy the saved file to the mounted shared storage. Alternatively, it can call vendor-specific APIs, if available, to achieve the same goal. It then calls a Lambda function to stop the instance. ResultPath is used to combine the pre-stop task result with the Lambda function's response so that the next state in the workflow (a "Choice" state) can decide which actions to take (see this developer guide). If the preceding tasks succeed, the workflow sends a success email; otherwise, it sends a failure email notifying the user of the attempt to stop their instance because of prolonged idling, as shown in the following figure.
Figure 4. Instance Safe Stop Workflow in Step Functions
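An abridged Amazon States Language (ASL) definition of such a workflow might look like the following. The state names, timing, and resource ARNs are illustrative, not the exact definition used in this post:

```python
import json

# Abridged ASL sketch of the safe-stop workflow. State names and resource
# ARNs are illustrative placeholders.
SAFE_STOP_DEFINITION = {
    "StartAt": "WaitBeforeStop",
    "States": {
        "WaitBeforeStop": {"Type": "Wait", "Seconds": 300, "Next": "SaveWorkspace"},
        "SaveWorkspace": {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:ssm:sendCommand",
            # Keep the state input and attach this task's output under $.saveResult
            "ResultPath": "$.saveResult",
            "Next": "StopInstance",
        },
        "StopInstance": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:StopInstance",
            "ResultPath": "$.stopResult",
            "Next": "DidTasksSucceed",
        },
        "DidTasksSucceed": {
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.stopResult.statusCode",
                    "NumericEquals": 200,
                    "Next": "SendSuccessEmail",
                }
            ],
            "Default": "SendFailureEmail",
        },
        "SendSuccessEmail": {"Type": "Task",
                             "Resource": "arn:aws:states:::sns:publish",
                             "End": True},
        "SendFailureEmail": {"Type": "Task",
                             "Resource": "arn:aws:states:::sns:publish",
                             "End": True},
    },
}

# The definition is passed as a JSON string when creating the state machine.
definition_json = json.dumps(SAFE_STOP_DEFINITION)
```

Because each task writes its output under a distinct `ResultPath`, the "Choice" state can inspect both the save result and the stop result when deciding which notification to send.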
Implementation at ISO New England for Electromagnetic Transient simulation
ISO New England (ISO-NE) is an independent regional transmission organization that oversees the operation, planning, and wholesale electricity markets of New England's bulk power grid. New England is rapidly transitioning from traditional energy sources to renewables. There is almost 45 GW in the ISO-NE generation interconnection queue as of May 2024 (see the following figure), and over 98% of it comes from inverter-based resources (IBRs) consisting of wind, battery, and solar plants. In a previous paper, ISO-NE shared its experience running Electromagnetic Transient (EMT) simulations on AWS. EMT studies are essential for assessing the impact of IBRs on the bulk power system, but they are computationally intensive because of very small simulation time steps (at the microsecond level) and the complex modeling of IBR protection and control. ISO-NE built a cloud-based EMT simulation solution around Amazon AppStream 2.0, which allowed engineers to use virtual desktops to run parallel EMT simulations with PSCAD and E-Tran Plus for a system impact study of a wind plant in the New England system.
Figure 5. ISO-NE Generation Interconnection Queue As of May 2024
However, ISO-NE found several deficiencies in its previous solution built around AppStream 2.0. First, it couldn't run hyper-scale EMT simulations through vertical scaling because of the limited instance type options on AppStream 2.0, in particular the lack of HPC instances (horizontal scaling doesn't work for these applications). Furthermore, licensing the engineering software was cumbersome: porting licenses to the cloud and setting them up on compute instances could take significant effort, and it might also reduce license availability for studies performed on premises. After discussing with AWS, ISO-NE realized that the proposed solution could precisely address the need for on-premises licensing while unlocking a broader range of instance types for various power system engineering applications. ISO-NE is now implementing the serverless self-service portal to further speed up generation interconnection studies and operations technical studies, which can ultimately accelerate renewable integration and grid decarbonization.
Conclusion
This post presents a user-centric, self-service web portal built with an AWS Amplify UI and a serverless backend, aiming to help business users with little to no AWS experience self-manage their cloud resources for engineering analysis. The portal's backend controls an application stack that uses a hybrid cloud setup to address the need to run different types of engineering applications on AWS with on-premises licensing. An event-driven architecture and serverless computing services such as AWS Lambda and AWS Step Functions make the solution cost-effective and scalable, while Amplify streamlines and automates web building, deployment, and testing and reduces web hosting cost and effort. Users such as ISO New England are implementing this solution to improve the computing efficiency of EMT simulation studies for renewable resource integration.