AWS HPC Blog
Improve engineering productivity using AWS Engineering License Management
This post was contributed by Eran Brown, Principal Engagement Manager, Prototyping Team; Vedanth Srinivasan, Head of Solutions, Engineering & Design; Edmund Chute, Specialist SA, Solution Builder; and Priyanka Mahankali, Senior Specialist SA, Emerging Domains.
For engineering companies, the cost of Computer Aided Design and Engineering (CAD/CAE) tools can be as high as 20% of product development cost. This is why optimizing their usage is so important. If your company uses metered licenses, you may want to optimize for cost. If your licenses are paid for in advance, you may want to go beyond cost and look at engineering productivity. This becomes even more important when you realize that engineers represent up to 50% of product development costs.
Regardless of what you optimize for, better visibility leads to cost savings and faster time to market, both of which are competitive advantages.
In today’s post, we’ll look at a new approach to getting the telemetry and fidelity you need to get on top of this optimization curve: a new tool we developed called Engineering License Management (ELM) on AWS.
What’s missing?
There are many solutions in the market for viewing and analyzing your license usage. However, customers across industries – from aerospace and automotive to chip design – told us they’re missing multiple data points.
These data points may not all apply to your industry, but they’re the sum of the customer inputs that drove us to build ELM.
User granularity
Because existing tools collect only total usage, they ignore who is using the licenses. Without this, customers can’t identify the top users. They also can’t identify users who may be misusing the tools (keeping licenses so long they expire or just submitting a suspicious number of jobs).
Quantifiable productivity impact
When you ask for additional licenses, management will often want to understand the tradeoff: productivity loss vs. the cost of buying more licenses. If you can’t quantify the amount of time users are waiting for a specific license, it is hard to build the business case for the additional expense. The same applies to the question of which teams are impacted. Your management will also want to understand why you are asking for 20 licenses and not 5 or 50. Existing solutions do not provide a good visualization of the number of licenses required to resolve the current wait times.
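One way to reason about "why 20 and not 5 or 50" is to replay the observed sessions as if every request had been granted immediately, and take the peak concurrency. The sketch below is only an illustration of that idea, not how ELM computes it. The `Session` fields and the sample numbers are hypothetical, and it assumes a job's run time would not change if it started earlier.

```python
from dataclasses import dataclass


@dataclass
class Session:
    requested: float    # when the license was first requested (seconds)
    checked_out: float  # when the license was actually granted
    checked_in: float   # when the license was returned


def peak_unconstrained_demand(sessions):
    """Peak licenses in use if every request had been granted at once."""
    events = []
    for s in sessions:
        duration = s.checked_in - s.checked_out
        events.append((s.requested, 1))              # job starts at request time
        events.append((s.requested + duration, -1))  # and runs for the same duration
    # At equal timestamps, process releases (-1) before starts (+1).
    events.sort(key=lambda e: (e[0], e[1]))
    peak = current = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


# Hypothetical data: two licenses owned, the third job waited 40 seconds.
sessions = [
    Session(requested=0, checked_out=0, checked_in=100),
    Session(requested=10, checked_out=10, checked_in=60),
    Session(requested=20, checked_out=60, checked_in=110),
]
licenses_owned = 2
extra_needed = max(0, peak_unconstrained_demand(sessions) - licenses_owned)
```

With this sample data, one additional license would have removed the wait. On real data you would likely look at a high percentile of demand rather than the absolute peak, since sizing for the single worst moment tends to overbuy.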
Simulating shared versus dedicated licenses
Sharing licenses between geographies and subsidiaries can improve their utilization (and productivity) as the company pools all its resources. However, software vendors charge more for these shared licenses – up to 30%. Building the business case to centralize licenses requires data collection from distributed license servers into a single-pane-of-glass to analyze the potential savings.
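To make the tradeoff concrete, here is a minimal sketch comparing dedicated per-site pools with a single shared pool. The site names and usage samples are invented for illustration, cost is expressed in units of one dedicated license, and the 30% premium is the upper bound mentioned above rather than any specific vendor's price.

```python
def compare_pooling(site_usage, shared_premium=0.30):
    """Compare licenses needed with per-site pools versus one shared pool.

    site_usage maps a site name to concurrent-usage samples taken at the
    same timestamps for every site.
    """
    # Dedicated pools: each site must cover its own peak.
    dedicated = sum(max(series) for series in site_usage.values())
    # Shared pool: only the peak of the combined usage must be covered.
    pooled = max(sum(step) for step in zip(*site_usage.values()))
    return {
        "dedicated_licenses": dedicated,
        "pooled_licenses": pooled,
        "dedicated_cost": float(dedicated),
        "pooled_cost": pooled * (1 + shared_premium),
    }


# Hypothetical sites whose peaks fall at different times of day.
usage = {"emea": [8, 2, 1], "amer": [1, 2, 8]}
result = compare_pooling(usage)
```

In this example the sites peak at different times, so pooling needs 9 licenses instead of 16 and remains cheaper even with the premium. If the sites peaked at the same time, the pooled count would approach the dedicated count and the premium would make sharing more expensive.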
Custom reporting
Commercial solutions offer a lot of built-in reports. However, customers often raise gaps between their specific needs and these cookie-cutter reports. There’s a clear need for custom reports. For example, a customer asked for a report comparing license usage and wait times across teams. This is not a standard report, and was meant to evaluate if they need to create semi-separate license pools per team.
Why is it missing?
The limitations we face in managing engineering licenses are a result of using tools designed for enterprise licenses. For example, SQL Server, Oracle databases, or an operating system are enterprise licenses. They’re launched on a server and are used for months or years. It’s a “dedicated” license and can be tracked alongside the server.
Enterprise licenses can easily be tracked using AWS License Manager, but this isn’t the case for engineering licenses. Most existing solutions settle on collecting the total number of licenses used and available every few minutes. This is why they miss the user-level data and short jobs. And with short jobs representing up to 50% of the utilization in some cases, it’s a real problem. It’s also why they can’t report on license wait times – requests that are still waiting aren’t part of the collected data.
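The effect of periodic sampling on short jobs is easy to demonstrate. In this sketch, a session is visible to a poller only if at least one poll falls between its check-out and check-in. The session times and the 300-second interval are hypothetical, and polls are assumed to happen at exact multiples of the interval.

```python
import math


def missed_sessions(sessions, poll_interval):
    """Return sessions that start and end between two consecutive polls."""
    missed = []
    for checked_out, checked_in in sessions:
        # First poll at or after the check-out (polls at 0, interval, 2*interval, ...).
        first_poll = math.ceil(checked_out / poll_interval) * poll_interval
        if first_poll >= checked_in:
            missed.append((checked_out, checked_in))
    return missed


# Hypothetical (check-out, check-in) times in seconds.
sessions = [(10, 50), (290, 320), (400, 1000)]
missed = missed_sessions(sessions, poll_interval=300)
```

Here the 40-second session is never observed, while the 30-second one is seen only because it happens to straddle a poll. Log-based collection avoids this dependence on timing.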
CAD and CAE licenses have 3 properties that make them different.
First, they’re shared between users. This is why user-level reporting is needed. Users also iterate on their design work – they send a job, and then improve the design based on the results. These iterations may vary from a few seconds to multiple days, but they’re still much shorter than the years an enterprise license is used.
Second, they’re very expensive, so worth the optimization effort. If buying another license would cost $100 no company would limit their engineers. But engineering licenses can cost more than $1 million annually.
And third, these licenses are often checked out and back in with high frequency, making tracking hard and increasing the need for more granular data.
These differences between enterprise and engineering licenses are the reason the traditional approach of periodically collecting the license stats from each license server isn’t good enough. Engineering license management requires user-level statistics. It also requires us to see the specific sessions: license request + license check out + license check in as a single entity.
ELM initially supports the FlexLM license server, but it’s modular: we can process other license server logs with a small effort. The only prerequisite is that the logs include each license operation (request, check-out, and check-in) with timestamps.
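To illustrate what treating request, check-out, and check-in as a single entity can look like, here is a minimal sketch that stitches already-parsed events into sessions. It is not ELM's implementation. The event tuples are a hypothetical normalized form, not the actual FlexLM log format, and the sketch assumes each user holds at most one session per feature at a time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Session:
    user: str
    feature: str
    requested: float
    checked_out: Optional[float] = None
    checked_in: Optional[float] = None


def build_sessions(events):
    """Group (timestamp, op, user, feature) events into completed sessions."""
    open_sessions = {}
    completed = []
    for ts, op, user, feature in sorted(events, key=lambda e: e[0]):
        key = (user, feature)
        if op == "request":
            # Keep the earliest request so repeated retries count as one wait.
            open_sessions.setdefault(key, Session(user, feature, requested=ts))
        elif op == "checkout":
            # A check-out with no prior request means there was no wait.
            session = open_sessions.setdefault(key, Session(user, feature, requested=ts))
            session.checked_out = ts
        elif op == "checkin":
            session = open_sessions.pop(key, None)
            if session is not None and session.checked_out is not None:
                session.checked_in = ts
                completed.append(session)
    return completed


# Hypothetical events: "ana" retries once and waits 30 seconds for the license.
events = [
    (0, "request", "ana", "sim"),
    (5, "request", "ana", "sim"),
    (30, "checkout", "ana", "sim"),
    (40, "checkout", "bo", "sim"),
    (50, "checkin", "bo", "sim"),
    (90, "checkin", "ana", "sim"),
]
sessions = build_sessions(events)
waits = {s.user: s.checked_out - s.requested for s in sessions}
```

Supporting another license server would then mostly mean writing a parser that emits the same normalized events, which matches the modular approach described above.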
Better data, better visibility
To overcome these limitations ELM uses a more detailed dataset – the license server logs. These logs include every license check-in, check-out, license deny – and the reason the license was denied. This detailed dataset allows us better visibility into the usage patterns. It helps to work through a few scenarios.
When a license is at 100% utilization, a job using this license shouldn’t start. However, scheduler misconfigurations and users not declaring the licenses they use can allow these jobs to start anyway. This leads to failed jobs, wasted compute resources, and frustrated users. ELM aggregates data per session, including the declined requests that delayed the job’s start. This visibility helps customers find the root cause of these inefficiencies. We should note that this number will include only jobs actively trying to start, and not jobs you defined as low priority (with a “license-wait”). License-wait jobs don’t try to start until the license is available, so the FlexLM server won’t see them.
ELM can also analyze how “bursty” the license usage is for better granularity (Figure 2). This is important for understanding the cost of not buying licenses. A license showing high utilization variance can probably take on additional load without increasing the wait times too much. But a license varying only between 80% and 100% is close to saturation, and any growth will increase the average wait times.
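One simple way to put a number on "burstiness" is the coefficient of variation (standard deviation divided by mean) of the utilization samples, alongside the observed range. This is an illustrative metric we chose for the sketch, not necessarily the one ELM reports, and the sample values are made up.

```python
from statistics import mean, pstdev


def burstiness(utilization):
    """Summarize utilization samples, each a fraction between 0.0 and 1.0."""
    avg = mean(utilization)
    return {
        "mean": avg,
        "min": min(utilization),
        "max": max(utilization),
        # Coefficient of variation: higher means burstier usage.
        "cv": pstdev(utilization) / avg if avg else 0.0,
    }


# Hypothetical samples for two license features.
bursty = burstiness([0.1, 0.9, 0.2, 1.0, 0.1, 0.9])
steady = burstiness([0.8, 0.9, 1.0, 0.9, 0.8, 1.0])
```

Both features touch 100% at some point, but the second never drops below 80%, so it has little idle capacity left to absorb growth.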
Using the session-level data, ELM can detect which users are suffering the highest wait times, and how much they are impacted. This helps quantify the business impact of the license shortage. Without quantifiable data on which teams and projects are impacted, it’s harder to get your leadership to invest in additional licenses.
ELM can process data within minutes of it being uploaded. If you stream the data from your license servers, you get a near real-time view into your utilization. This can be used to troubleshoot license wait times. Exposing this to the engineering teams allows them to communicate directly when one team needs another team to release a few licenses.
Some licenses are used both through the scheduler (batch) and manually (for example from a remote desktop tool). Manual launches bypass the scheduler altogether, which means it is not “aware” of where these licenses are being used and can’t report on their usage.
Thinking more strategically, the data collected by ELM can be the basis for an observability data lake. By bringing in data from your schedulers and license servers you can analyze the end-to-end flow of a CAD/CAE job. Customers are starting to build data lakes like this to drive operational efficiency for their HPC environments. One example is the ability to report on both the batch-launched and the manually launched licenses we just discussed.
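As a rough illustration of ranking users by impact, the sketch below sums wait time per user from session records. The tuple layout and the sample values are hypothetical. A real report would likely also group by team or project and by license feature.

```python
from collections import defaultdict


def wait_by_user(sessions):
    """Total wait per user, highest first.

    Each session is (user, requested, checked_out), with times in seconds.
    """
    totals = defaultdict(float)
    for user, requested, checked_out in sessions:
        totals[user] += checked_out - requested
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sessions.
sessions = [
    ("ana", 0, 30),
    ("bo", 10, 10),
    ("ana", 100, 160),
    ("cy", 5, 50),
]
ranking = wait_by_user(sessions)
```

Multiplying each total by a loaded hourly engineering cost turns this ranking into a figure you can set against the price of additional licenses.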
How to get started?
This is the first version of ELM, and AWS is inviting early customers to test it. The code is available upon request through your AWS account team. We’re releasing ELM under the MIT license, to allow customers the flexibility to customize it and build solutions on top of it. You can deploy it in your AWS account, and the README file includes a list of the prerequisites you need to build and deploy the solution.
If you want help customizing the solution to your specific needs, AWS Professional Services and our HPC partners offer this type of service. We’re keen to get your feedback, too. Reach out to us at ask-hpc@amazon.com so we can make this a tighter fit to the problems you’re trying to solve.