Scaling chip design processes with high velocity in the cloud

Shift left – a path to faster and better SOC quality

A revolution is happening in the chip design industry: one that allows projects to finish earlier, with fewer bugs, while the budget stays under control.

In this article, I’ll share my personal view on a few related bottlenecks that directly affect time to market and the quality of the silicon. I’ll also touch on EDA license utilization and engineering headcount, and discuss the compute solution that is already within your grasp, a few clicks away.

But before diving into the cloud world, let’s explore a few real-life examples.

What are users saying about boosting chip-design productivity

You don’t have to read them all… feel free to jump to the next paragraph if you “got the idea”…

Arm was able to see “8x to 10x improvement turnaround time when running EDA workloads in the cloud”. Freeing up engineering teams to innovate was a key issue.

Anupama Asthana, Director of Grid Solution Engineering at Qualcomm, saw similar results. In this video, she talked about a performance boost of up to 27% in static timing analysis, along with hybrid flexibility.

Sanjay Gajendra, Chief Business Officer at Astera Labs, shared that the company completed their PCIe5 design in less than a year on AWS, compared to the years it would have taken using an on-premises HPC environment.

There are other semiconductor companies enjoying the fruits of this revolution. Not all of them are publicly talking about the edge they are gaining from using AWS.

Well, not every workload will be ~50% faster, as NXP’s circuit simulations were, but you’ll be able to experiment and find what works best for you.

How fast “new” becomes “old”

If you’re in R&D, compute may be IT’s responsibility. Right? After all, you’re “Just” the user.

However, a few important points directly affect everyday work and frame the picture:

On-premises servers (i.e. in-house IT infrastructure) are refreshed every 2-4 years on average. Hardware gets outdated. FAST!
Compute refresh cycles are affected by operations’ yearly budget. But the supply time for new servers is uncertain even when budget is available. If you need more, NOW, you may have an issue.
The variety of on-premises compute may not provide the best resource for the job (I’ll explain why shortly).

We all witness the pace of innovation. Process technology keeps shrinking, allowing more CPUs on a die and more dies packed together. Compute capabilities get better and faster every year. A server from last year, let alone one that is three years old (or older), cannot compete with the latest-generation compute that came out this year. Cutting runtime, getting results faster at scale, and fitting more cycles into each day (overnight runs with results ready to debug the next morning rather than a day later) will all have a huge impact on the quality of the silicon and on time to market.

But there’s another angle.

When your company decided which servers to acquire, what was the compute variety? Was Intel hardware selected? AMD? Arm? Other? Accelerators? How much memory was defined in the purchase order? How many CPUs per machine? Lots of decisions were made by guesstimating the needs a few months in advance.

The point is – if you limit yourself to a few types of machines/vendors, you’re missing the others. What if… a workload runs much faster on a different CPU type or architecture than the ones you have access to? Would you be able to measure the difference on premises? How easy would it be to experiment? (Spoiler Alert… it’s VERY EASY to do in the cloud… in minutes… reducing CAPEX and margin of error).

In my previous role, I experienced cases where I had business-critical place & route runs. Timing closure was a challenge and the team had to try multiple parallel runs. Each run took around three days. We had to close timing ASAP… or… lose the business. The discussion with IT went like this:

(Based on a true story… names have been changed to protect the innocent)

Me: We need more compute capacity ASAP (I gave a number…). This directly affects a time-sensitive, multi-million-dollar business opportunity.

IT: We already used this year’s budget. You can work with management to fund H2’s operation plan budget. We’ll be happy to look into that next year.

Me: But, we need it NOW!! The business wouldn’t wait for us!

IT: Sorry, no budget left. But even if we had the budget, there’s a shortage in the supply chain. We could have gotten some hardware installed four months from ordering. How about taking a couple of machines from another project and securing them for you now, would that do?

Me: I’ll take everything you’ve got (beggars can’t be choosers), but it is far… far… from what we really need.

IT: Sorry…

Me (Thinking to myself): “If there was just a simple way to get compute on demand in minutes…mmm”

Every design or verification task has different compute needs

A typical project requires a variety of compute tasks through the design cycle: validation, front-end, backend, etc. Each flow works best with different compute, specific storage I/O access and a different amount of memory.

Let’s talk again about the compute pool that your company has. Let’s assume, for the sake of the discussion, that its compute farm is new, and park the scenario of running on old/slow compute that hasn’t reached its refresh cycle.

Now, let’s analyze some chip design flows:

Verification

For verification you may want fewer CPUs, running as fast as possible. Regressions start slim while you’re building up the environment (less compute is required, so on-premises machines that are already paid for sit idle), and as the project progresses you need more and more compute (beyond what you actually have onsite). What do you do today? Has your company acquired compute for the maximum peak? The average peak? What happens if you need more compute resources? Imagine needing 20 additional high-memory servers for timing analysis, and having them available in seconds, funded by the money you saved by potentially shutting down idle compute nodes.

We talked about new HW vs old one. Another aspect is HW that is challenging to own for short durations. Think about special accelerators, or special servers with CPU overclocking that you can’t just buy in the open market. For verification, if the CPU is overclocked to run 30% faster, you’ll get the results faster.

Formal is another verification topic – proving multiple properties. Every property could be run with multiple engines in parallel and therefore, you can uncover more bugs upfront…
If you use more compute, you finish your project design and verification faster. And by “more” I mean a lot more: there are already examples of tens of thousands of jobs running in parallel for a short period of time and getting JasperGold results faster (tens or thousands of times faster…).
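To make the parallelism argument concrete, here is a rough back-of-the-envelope sketch. It assumes every property/engine pair is an independent job of equal length and that jobs are scheduled in waves across the available slots. The function name and all the numbers are hypothetical, chosen only for illustration; real formal runs vary widely in duration, and an engine can usually be stopped once another one has proven the property.

```python
import math


def wall_clock_hours(num_properties, engines_per_property,
                     hours_per_job, parallel_slots):
    """Estimate wall-clock time when equal-length jobs run in waves."""
    total_jobs = num_properties * engines_per_property
    waves = math.ceil(total_jobs / parallel_slots)
    return waves * hours_per_job


# Hypothetical workload: 2,000 properties, 4 engines each, 0.5 hour per job.
on_prem_hours = wall_clock_hours(2000, 4, 0.5, parallel_slots=200)
burst_hours = wall_clock_hours(2000, 4, 0.5, parallel_slots=8000)

print(f"200 slots:  {on_prem_hours} h")
print(f"8000 slots: {burst_hours} h")
print(f"speedup:    {on_prem_hours / burst_hours:.0f}x")
```

With these made-up inputs, the same 8,000 jobs take 40 waves on a 200-slot farm but a single wave when 8,000 slots are available for a short burst.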

Backend / Analog

For place & route you will want multicore machines with high memory per core and high throughput to storage.

EDA vendors now have AI features. Multiple automatic Place & Route runs would be invoked in order to get best PPA (Power/Performance/Area), but those runs require a lot of parallel high-end servers. What’s the impact of taking these servers from your project’s server pool? Would you need to sacrifice other runs for that?

Remember that these are not the same servers you will use for verification. You’ll want the runs to finish ASAP when those runs are on the critical path of the project. What about equivalence checking? Characterization? Analog runs? Different machines here again…

You get the idea…

Changing the way we think about hardware

We are traditionally taught to forecast the compute we need in advance, plan for the average plus, and cope with shortage. We’re taught that we only have the server pool on our compute farm (oh… and BTW you can’t take the stronger ones now because a peer project is more urgent and those are allocated to them…).

The cloud revolution turns the tables and changes the game entirely. You have virtually no compute limitations. Take as many machines as you need. Take new hardware. Take faster/stronger machines when you need those, or more mainstream ones when you’re not on the critical path of the project. Experiment to find what works best for you. Optimize for cost or performance – the options are nearly limitless.

Some semiconductor companies are operating ONLY in the cloud. Others that have already invested in on-premises compute are going hybrid – leveraging the cloud for peak usage or for special compute that’s inefficient to own throughout the entire year.

The trend is clear. The benefits are clear. There’s no reason to stay behind when it’s so simple to get actual compute needs fulfilled.

Hmm…, so is cloud more expensive than on-premises?

With all the benefits discussed, you would ask yourself – is cloud more expensive?

The real answer is “it depends”. It’s all in your control.

You can probably pay less or be on par if you match the same workloads, and reduce the time of idle licenses. Some customers started the journey to the cloud because of cost reduction. But… you also have the option to choose… do you want better compute for a specific workload today? Are you willing to pay a bit more for runs to end in a week rather than two months? You decide the cost/performance tradeoffs for the current stage.
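The tradeoff can be sketched with a few lines of arithmetic. The example below is only an illustration: the machine names, runtimes, hourly prices and the hourly license cost are all made-up assumptions, not AWS or EDA vendor pricing. It shows how a pricier machine can still win once the license time tied up by the run is counted.

```python
def run_cost(runtime_hours, compute_price_per_hour, license_price_per_hour=0.0):
    """Total cost of one run: compute plus the license held while it runs."""
    return runtime_hours * (compute_price_per_hour + license_price_per_hour)


# Hypothetical options for the same place & route run.
options = {
    "mainstream": {"runtime_hours": 72.0, "compute_price_per_hour": 2.0},
    "high_end": {"runtime_hours": 40.0, "compute_price_per_hour": 4.5},
}

compute_only = {name: run_cost(**opt) for name, opt in options.items()}
with_license = {
    name: run_cost(**opt, license_price_per_hour=3.0)  # assumed license cost
    for name, opt in options.items()
}

print("compute only:", compute_only)
print("with license:", with_license)
print("cheapest overall:", min(with_license, key=with_license.get))
```

Looking at compute alone, the slower machine is cheaper here; once the assumed license cost is included, the faster one comes out ahead and also returns results 32 hours earlier.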

Apples to apples

The correct way to look at cost is on the project level.

The real cost of taping out a chip is dominated by the cost of engineering (people) and licenses, followed by the cost of compute (including rented space, taxes per meter, electricity, and ongoing maintenance: backup/restore/SW updates, etc.).

A wrong way to run the calculation would be to take the price of an on-premises server over its refresh cycle period and compare it to on-demand pricing for the same period. This is far from the real cost.

By moving to the cloud, you are neutralizing rent, electricity, physical security and maintenance of the HW… All these are taken care of for you. You’re also not paying for compute when you don’t use it: you pay by the second, only when you’re running simulations (on-demand model).

Chip designers can experiment with multiple compute types in minutes to find the right compute for the job, without having to order physical HW, and without waiting months for installation just to find out it’s not what they need. The innovation speed increases dramatically.

The above chart is a graphical representation of predicted (guessed…) on-premises capacity vs. actual use. The blue line represents refresh cycle investments over time. As the project starts, compute may be underutilized and you pay for compute you do not need, but as the project progresses, you might not be able to handle peaks. That can be business critical, as discussed earlier.
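The gap between a fixed capacity and a growing demand can be put into numbers with a small sketch like the one below. The monthly demand profile and the capacity figure are hypothetical, invented only to illustrate the shape described above; the result is expressed in server-months.

```python
def capacity_gap(demand_by_month, fixed_capacity):
    """Return (idle, shortfall) in server-months for a fixed on-premises pool."""
    idle = sum(max(fixed_capacity - d, 0) for d in demand_by_month)
    shortfall = sum(max(d - fixed_capacity, 0) for d in demand_by_month)
    return idle, shortfall


# Hypothetical demand (servers needed per month) ramping up toward tape-out.
demand = [100, 200, 400, 700, 1000, 1500]

idle, shortfall = capacity_gap(demand, fixed_capacity=600)
print(f"paid for but idle: {idle} server-months")
print(f"needed but missing: {shortfall} server-months")
```

In this made-up profile the same farm is both over-provisioned early in the project and under-provisioned near the end, which is the situation an on-demand model is meant to avoid.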

“Hidden” project costs

We discussed a few ways to pull in the schedule using high-end compute at scale.
If you’re the first to market, it can make a vast difference. (or the other way around… if you’re second or third – you may be losing the market).

By pulling in your project by two months, with a chip-design team of 50 engineers (with SW it is probably more…), you are looking at an immediate cost saving of ~$1M (based on average chip design salary – Glassdoor 2022 report).
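The arithmetic behind that estimate is simple enough to sketch. Note that the per-engineer monthly cost below is not taken from the Glassdoor report; it is back-calculated from the ~$1M figure, 50 engineers and two months, so treat it as an assumption and plug in your own numbers.

```python
def schedule_saving(num_engineers, months_pulled_in, monthly_cost_per_engineer):
    """Engineering cost avoided by finishing the project earlier."""
    return num_engineers * months_pulled_in * monthly_cost_per_engineer


# Implied cost per engineer-month if 50 engineers x 2 months is worth ~$1M.
implied_monthly_cost = 1_000_000 / (50 * 2)

saving = schedule_saving(50, 2, implied_monthly_cost)
print(f"implied cost per engineer-month: ${implied_monthly_cost:,.0f}")
print(f"saving for a 2-month pull-in:    ${saving:,.0f}")
```

The same function makes it easy to see how the saving scales with a larger team (for example, when SW engineers are included) or a longer pull-in.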

You may also be able to save project cost by pulling in schedule (engineering headcount cost) and licenses, when those could be released for your next project earlier.

While some companies specifically discussed the schedule pull in, others claim they will not cut the project time even when it is technically possible. For those companies, the benefit of getting things done better and faster is quality: a cleaner tape-out database.

According to Semiengineering / IBS – taping out at 16nm costs $106M, 7nm costs $297M, 5nm $542M. The actual cost may be different for your chip. Whatever the number is, it is not negligible.
Major bugs might require another tape-out…

This has huge cost implications. The team will not work on the future generation, as debugging takes an enormous toll. Then come the FAB costs, waiting again for FAB production, packaging, … And beyond the engineering and manufacturing cost, you could lose the market.

Let’s say it all went well. You should still want to analyze wafers to improve yield. Improving yield is money on the table. AWS offers methods to analyze huge datasets with AI to get the insights for yield improvement.

How about security and licenses?

Take a look at the banking industry, defense industry, automotive industry, chip design industry,…

The list goes on… (look for the public use cases at the bottom of the linked pages). If those industries don’t have security issues working with the cloud, why would you think you do?

Security is a solved problem for EDA and other industries. A cloud alliance was formed so FABs and EDA companies could collaborate and allow smooth cloud-based tapeouts. When you provision the compute environment with AWS, you have full control over the data, where it is stored, how it is encrypted, and who has access to it.

EDA licenses are no different on premises or in the cloud. It doesn’t really matter where your license server is located. We do see EDA companies offering special cloud licensing models, such as Synopsys (FlexEDA) and Cadence (Cloud Passport), but that’s an article for a different time.

Arm, NXP, Qualcomm, Annapurna and many others already enjoy the increased productivity.

Summary

Starting with cloud compute doesn’t mean you have to throw away what you already have. Engineering doesn’t have to feel a change in their working environment. Hybrid models allow using local resources and expanding to the cloud when you need it.

This is a paradigm shift. You can start as small or as big as you wish. Staying behind shouldn’t be an option – If you want to get projects out faster and cleaner, potentially at a lower cost, drop us a note below in the comments section. Our team will be happy to assist.