AWS HPC Blog

Hyper Metal: Scaling AWS Instances Up with TidalScale

In computing, there are two approaches to scaling compute and applications: scale out or scale up. The ability of AWS to scale out on demand has unblocked millions of application and middleware developers, who have leveraged the virtually unlimited horizontal scale of the cloud. This builder community embraced explicit parallelism (scale out) in order to gain the full benefits of the cloud. The buyer and operator community, working primarily on-premises, has continued to rely on scale up because their applications and middleware generally do not support horizontal scaling.

What if there were a way to create a software-defined instance and enable scale up by aggregating multiple scale-out instances? Then we could avoid refactoring legacy single-system applications to scale out, yet still enjoy the pay-as-you-go pricing, elastic growth, and managed infrastructure of the cloud. In this blog, we will cover how the buyer community can enjoy the simplicity of scale up while leveraging all the advantages of the AWS cloud.

Hyper Metal Based Scale Up

Through a partnership with TidalScale announced on April 12, 2022, customers can now aggregate the CPUs, memory, network, interrupts, and storage of multiple AWS bare metal instances into a single system image capable of running unmodified operating systems, middleware, and applications. Each bare metal instance was designed by AWS with an optimal ratio of CPU performance to memory bandwidth, both of which are virtualized to form a software-defined server from several bare metal instances.

Background

In Amazon Elastic Compute Cloud (Amazon EC2), metal instances that support the Elastic Fabric Adapter (EFA) and are launched in a cluster placement group can form a low-latency cluster interconnect between instances, making it possible to “scale up” a number of metal instances into a single system image today. You can learn more about the history and development of the underlying TidalScale Non-Uniform Memory Access (NUMA) technology used to form the software-defined server here.
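As a concrete illustration of these building blocks, the boto3 sketch below creates a cluster placement group and launches EFA-enabled metal instances into it. The AMI, subnet, and security group IDs are placeholders and the placement group name is hypothetical; in practice, TidalScale WaveRunner handles worker node provisioning for you.

```python
import boto3

# A minimal sketch with placeholder resource IDs -- substitute values from
# your own account. TidalScale WaveRunner normally automates this step.
ec2 = boto3.client("ec2", region_name="us-east-1")

# A cluster placement group keeps the metal instances close together on the
# network, which EFA then uses for the low-latency interconnect.
ec2.create_placement_group(GroupName="tidalscale-workers", Strategy="cluster")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",      # placeholder AMI ID
    InstanceType="c5n.metal",             # an EFA-capable metal instance type
    MinCount=2,
    MaxCount=2,
    Placement={"GroupName": "tidalscale-workers"},
    NetworkInterfaces=[{
        "DeviceIndex": 0,
        "InterfaceType": "efa",           # attach an Elastic Fabric Adapter
        "SubnetId": "subnet-0123456789abcdef0",   # placeholder subnet
        "Groups": ["sg-0123456789abcdef0"],       # placeholder security group
    }],
)
print([i["InstanceId"] for i in response["Instances"]])
```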

The Mechanics of Software Defined Servers

To enable vertical scale up from horizontally-scaled AWS metal instances, TidalScale virtualizes processors, I/O, and memory to create scalable coherent shared memory. An efficient, scalable, coherent distributed memory has long been sought after, as this IEEE paper by C. Gordon Bell and Ike Nassi details, and can now be implemented as a software layer on top of existing memory and NUMA virtualization; the full realization can be managed by TidalScale WaveRunner.

Figure 1: TidalScale WaveRunner Management Interface showing virtualization of AWS metal instances

Optimizing Software Defined Servers

At the core of the TidalScale solution is a proprietary bare-metal hypervisor, called the hyperkernel, that currently leverages Intel® Virtualization Technology (Intel VT) on modern processors to virtualize all the physical resources of an instance. It is a software-based solution that runs on each metal instance (worker node), aggregating the memory, CPUs, and I/O of multiple AWS metal instances into a single software-defined instance.
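Because the hyperkernel depends on hardware virtualization support, a simple sanity check is to confirm that the Intel VT-x extensions are exposed on the metal instance. The sketch below uses only standard Linux interfaces (not any TidalScale tooling) to look for the vmx CPU flag.

```python
# A minimal sketch: verify that the processor exposes Intel VT-x (the 'vmx'
# CPU flag) by reading /proc/cpuinfo. Bare metal instances expose these
# hardware virtualization extensions directly to software such as a hypervisor.
def has_vmx(cpuinfo_path: str = "/proc/cpuinfo") -> bool:
    with open(cpuinfo_path) as f:
        for line in f:
            if line.startswith("flags"):
                return "vmx" in line.split()
    return False

print("Intel VT-x available:", has_vmx())
```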

Bare metal instances may be added or removed as the workload demands, increasing or decreasing the overall available resources of the software-defined instance and thus ensuring appropriate sizing. TidalScale can eliminate the need for costly and complex sizing, migration, and procurement exercises, giving users the flexibility to resize server capacity at any time based on current workload demand.

TidalScale communicates over a high-speed, low-latency EFA network between all the worker nodes running the hyperkernel and presents an aggregated, unified view of all the physical resources from every metal instance to the operating system and application stack. From the perspective of the operating system, it is running on a single large instance that sees the memory, CPU, and I/O capabilities as the sum of the individual server resources composing the software-defined instance.
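As an illustration of that single-instance view, the sketch below uses only standard Linux interfaces (no TidalScale-specific APIs) to report the CPU count, total memory, and NUMA node count the guest operating system observes; on a software-defined instance these totals reflect the aggregate of the underlying metal instances.

```python
# A minimal sketch of what the guest OS reports for aggregated resources,
# using only /proc, /sys, and the Python standard library.
import glob
import os

def mem_total_gib() -> float:
    """Return MemTotal from /proc/meminfo, converted from kB to GiB."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1]) / (1024 ** 2)
    return 0.0

logical_cpus = os.cpu_count()
numa_nodes = len(glob.glob("/sys/devices/system/node/node[0-9]*"))

print(f"Logical CPUs: {logical_cpus}")
print(f"Memory (GiB): {mem_total_gib():.1f}")
print(f"NUMA nodes:   {numa_nodes}")
```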

While the software-defined instance is in operation, TidalScale incorporates real-time machine learning (integrated into the hyperkernel) to continually update CPU and memory locations between physical servers to produce optimal performance regardless of shifts in the application workload. Continually adapting to workload changes, the software predicts optimal placement of memory and CPU thread locations and triggers migrations whenever necessary. TidalScale can migrate memory pages to the CPU thread that will need the memory, or migrate a CPU thread to the block of memory pages it will need, to ensure the best possible performance. Because everything in this system is built from software, the system is able to “self-adjust” to reduce latency to memory and to I/O devices.
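To make the trade-off concrete, the sketch below shows a deliberately simplified cost model (our own illustration, not TidalScale's actual machine learning) for choosing between moving hot pages toward a vCPU and moving the vCPU toward the memory it keeps touching.

```python
# An illustrative cost model only -- the real hyperkernel uses its own
# machine-learned policies. All costs and thresholds here are assumptions.

def choose_migration(distinct_remote_pages_touched: int,
                     page_move_cost: float = 1.0,
                     vcpu_move_cost: float = 50.0) -> str:
    """Pick the cheaper way to restore memory locality for one vCPU.

    If the vCPU touches many distinct remote pages, moving the vCPU once is
    cheaper than moving every page; if it hammers only a few pages, move
    those pages to the node where the vCPU runs instead.
    """
    projected_page_cost = distinct_remote_pages_touched * page_move_cost
    return "migrate vCPU" if projected_page_cost > vcpu_move_cost else "migrate pages"

print(choose_migration(distinct_remote_pages_touched=8))    # -> migrate pages
print(choose_migration(distinct_remote_pages_touched=500))  # -> migrate vCPU
```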

Ancillary Advantages

As mentioned earlier, once individual metal instances are virtualized and mobile via the hyperkernel, it is possible to add and remove instances on a running system image without disruption. This allows hardware to be removed from and replaced in the software-defined server to address any hardware-related issues, without the failover logic required in clustered environments. This reliability management capability is called TidalGuard.

It is also possible to exercise fine-grained control over the CPU count seen by the operating system, and thereby reduce costs for applications licensed per core. This allows customers to find the licensing sweet spot that meets their performance needs without paying to license unused cores simply because the operating system can see them.

Scaling Storage

In a previous blog post, we introduced a Virtualized Storage Array (VSA) that can deliver millions of sub-millisecond shared NVMe IOPS to any particular EC2 instance. With the latest revision, the VSA can also be connected to a TidalScale AWS software-defined instance, allowing storage to scale linearly with compute.

Conclusion

Utilizing TidalScale in AWS allows customers to aggregate a number of horizontally-deployed bare metal AWS instances into a single-system-image, software-defined server, which, in turn, enables applications to scale far beyond a single instance. The TidalScale solution is now available in the AWS Marketplace, and you can learn more about the unique features of TidalScale on AWS from a talk we jointly delivered with their team.

Bring us your most challenging application workloads – we would love to help.

Randy Seamans

Randy is an industry storage veteran and a Principal Storage Specialist and advocate at AWS, specializing in high performance storage, high performance computing (HPC), and disaster recovery. For more storage insights and fun, follow him at https://www.linkedin.com/in/storageperformance.

Justin Stanley

Justin is a Principal Solutions Architect focused on public sector healthcare organizations. He is an out-of-the-box thinker who is always ready to help customers find compelling solutions at AWS.