AWS Spatial Computing Blog
Availability and Disaster Recovery for NVIDIA Omniverse Enterprise Nucleus
NVIDIA Omniverse is a revolutionary platform that allows creators and organizations to collaborate in real time on 3D designs and simulations. It offers a wide range of integrations and tools that help teams work together and bring their ideas to life.
One of the fundamental aspects of NVIDIA Omniverse is the ability to author content in the applications you already use. Omniverse has connections to popular CAD tools such as Autodesk Revit and PTC Creo, as well as content creation tools such as Autodesk 3ds Max, Autodesk Maya, and Blender. For a full list, see Connecting to Omniverse. This breadth of support allows multi-functional teams working in incompatible data formats to collaborate in a common digital space in real time.
NVIDIA Omniverse Nucleus is the database and collaboration engine of the Omniverse platform. With Omniverse Nucleus, teams can have multiple live users connected using different applications at once. Nucleus enables efficient live synchronization between NVIDIA Omniverse applications. Changes to Universal Scene Description (USD) files, the core Omniverse data format, are transmitted in real-time between connected Omniverse clients.
As companies look to NVIDIA Omniverse to drive their digital innovation, it is important to consider where and how the Nucleus server is configured. With teams often spread across a country or around the globe, it helps to understand why deploying Nucleus in the cloud is ideal and how to ensure quick recovery in the event of a server failure.
Deploying Omniverse Enterprise Nucleus on AWS with SoftServe
As a member of the NVIDIA Service Delivery Partner – Professional Services (SDP-PS) program, SoftServe has an experienced team of AI, ML, and DevOps experts. Amazon Web Services (AWS) and SoftServe have developed this Nucleus reference architecture to help customers accelerate their digital transformation and reduce the time to deploy Nucleus on AWS.
The SoftServe professional services team works with customers to set up Nucleus cloud deployments by automating the provisioning of AWS resources such as Amazon Elastic Compute Cloud (Amazon EC2) instances, Amazon Simple Storage Service (Amazon S3) buckets, AWS Identity and Access Management (IAM) roles, networking, auto scaling, and load balancers. SoftServe delivers these AWS resources as a managed deployment, and customers receive the solution, documentation, and training. The solution can be customized and extended with further cloud resources as required.
Solution Overview
- End users of the Omniverse tools are supported by on-premises graphics workstations. These workstations have high-end NVIDIA GPUs, the Omniverse clients, and additional Digital Content Creation tools connected using Nucleus Connectors.
- Depending on network security requirements, the AWS component of this hybrid deployment can be privately connected to the on-premises network through a VPN connection or an AWS Direct Connect connection. A managed private certificate authority can be deployed alongside Amazon Route 53, which provides private DNS resolution. Amazon Virtual Private Cloud (Amazon VPC) endpoints keep communication private between the Amazon EC2 instances and services such as AWS Systems Manager, Amazon S3, and Amazon CloudWatch.
- An Application Load Balancer (ALB) is deployed in public subnets. It redirects client requests from HTTP to HTTPS and forwards them to the NGINX reverse proxy servers. The ALB also balances traffic across the reverse proxy servers if more than one has been provisioned.
- The reverse proxy is an NGINX server deployed in a highly available, multi-AZ Auto Scaling group. The reverse proxy routes requests to the specific Nucleus ports based on the request path.
- The Nucleus server consists of Docker containers orchestrated by a Docker Compose stack provided by NVIDIA. Nucleus data is stored on Amazon Elastic Block Store (Amazon EBS) volumes.
- When deployed, AWS Systems Manager Run Commands pull the Nucleus Docker container images from the NVIDIA Container Registry and configure the Nucleus instance on Amazon EC2.
- Access to the NVIDIA Container Registry is required for Docker to pull the appropriate images.
- Auto Scaling lifecycle hooks, backed by AWS Lambda, support runtime configuration of the NGINX proxy instances when they launch and when they terminate.
- Triggered by the Nucleus ASG On Terminate Lifecycle Hook, the Nucleus failover procedure uses AWS Step Functions to pull the Nucleus backup data from Amazon S3 and reconfigure the newly launched EC2 instance. A few minutes of downtime are expected while the new EC2 instance is launched and configured.
- Triggered periodically by Amazon EventBridge, the Nucleus backup procedure uses AWS Step Functions and the NVIDIA nucleus-tools to perform incremental backups of the Nucleus data to Amazon S3.
- CloudWatch aggregates logs from the Amazon EC2 instances and facilitates metric monitoring and alarms. The Nucleus stack also exposes metrics about its load characteristics (such as the number of requests per user and per request type) in a form that Prometheus can consume.
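The path-based routing performed by the reverse proxy can be illustrated with a small sketch. The path prefixes, port numbers, and host name below are hypothetical placeholders (the actual Nucleus service ports are defined by the NVIDIA Docker Compose stack and are not listed in this post). The sketch renders one NGINX location block per prefix and mimics prefix matching in plain Python.

```python
from typing import Dict

# Hypothetical path-to-port map, for illustration only. The real values
# come from the NVIDIA-provided Docker Compose stack.
ROUTES: Dict[str, int] = {
    "/api/": 8001,
    "/auth/": 8002,
    "/files/": 8003,
}


def render_locations(nucleus_host: str, routes: Dict[str, int]) -> str:
    """Render one NGINX location block per path prefix."""
    blocks = []
    for prefix, port in sorted(routes.items()):
        blocks.append(
            f"location {prefix} {{\n"
            f"    proxy_pass http://{nucleus_host}:{port}/;\n"
            f"}}"
        )
    return "\n".join(blocks)


def resolve_port(path: str, routes: Dict[str, int]) -> int:
    """Mimic NGINX prefix matching: the longest matching prefix wins."""
    matches = [prefix for prefix in routes if path.startswith(prefix)]
    if not matches:
        raise LookupError(f"no route for {path}")
    return routes[max(matches, key=len)]


if __name__ == "__main__":
    # "nucleus.internal" is a made-up private host name.
    print(render_locations("nucleus.internal", ROUTES))
```

Keeping the map in one place means the proxy configuration can be regenerated whenever the Nucleus instance is replaced, which is the kind of runtime configuration the lifecycle hooks are described as supporting.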
High Availability
Production teams expect reliable and consistent access to the data stored in Nucleus. To address this expectation, features of high availability have been implemented in this solution.
Client requests are sent to a single DNS host name managed in Route 53, and the ALB routes them dynamically across multiple Availability Zones (AZs). To ensure encrypted connections, an AWS Certificate Manager (ACM) SSL/TLS certificate is associated with the ALB, which terminates the front-end connection and decrypts the requests.
NGINX reverse proxy servers route traffic to specific ports on the Nucleus server. An Amazon EC2 Auto Scaling group ensures the reverse proxy instances are deployed across multiple AZs and that the number of instances scales out or in with the current request load. By default, this solution scales based on the CPU usage of the reverse proxy instances.
The maximum number of reverse proxy instances, the scaling mechanism, and the number of AZs to scale across are configurable to ensure high availability for each use case.
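To make the CPU-based scaling idea concrete, here is a minimal sketch of a proportional scaling rule. It assumes the group is resized so that average CPU lands near a target value and that the result is clamped to configured bounds. This is a simplification for illustration, not the exact algorithm used by Amazon EC2 Auto Scaling, and the function name and parameters are hypothetical.

```python
import math


def desired_capacity(
    current: int,
    cpu_percent: float,
    target_percent: float,
    min_size: int,
    max_size: int,
) -> int:
    """Return a group size that moves average CPU toward the target.

    Simplified proportional rule (an assumption, not the AWS algorithm):
    new size = ceil(current * observed CPU / target CPU), clamped to
    the [min_size, max_size] range.
    """
    if current < 1:
        raise ValueError("current must be at least 1")
    if target_percent <= 0:
        raise ValueError("target_percent must be positive")
    raw = math.ceil(current * cpu_percent / target_percent)
    return max(min_size, min(max_size, raw))


if __name__ == "__main__":
    # Two proxies at 90% CPU with a 60% target: scale out.
    print(desired_capacity(2, 90.0, 60.0, min_size=1, max_size=4))
```

The clamp is where the configurable maximum number of instances comes in: however high the load climbs, the group never grows beyond the configured ceiling.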
Backup and Restore
The Omniverse Nucleus on AWS solution implements backup procedures at different levels:
- Snapshots of Amazon EBS volumes
- Copy and transfer of the Nucleus data to an Amazon S3 Bucket
These backup features are configurable and automated by an AWS Step Functions state machine, which is triggered by a Lambda function on a configurable schedule. Using the NVIDIA nucleus-tools, incremental copies of the Nucleus data are synchronized with the Amazon S3 bucket. Because the backup is incremental, it is best to schedule frequent backups: each run then transfers less data, and less recent work is at risk if a restore is needed.
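The general idea behind an incremental backup can be sketched as follows. This is a generic illustration based on file checksums and does not describe how the NVIDIA nucleus-tools work internally. The function names are hypothetical, and the upload step itself is left out so the sketch needs only the Python standard library.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    manifest: Dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def changed_files(previous: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Return files that are new or modified since the previous manifest."""
    return sorted(
        name for name, digest in current.items() if previous.get(name) != digest
    )


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "scene.usd").write_text("v1")
        first = build_manifest(root)
        (root / "scene.usd").write_text("v2")
        # Only the modified file would need to be copied on the next run.
        print(changed_files(first, build_manifest(root)))
```

The shorter the interval between runs, the fewer files tend to differ between two manifests, which is why frequent backups keep each transfer small.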
Disaster Recovery
When managing centralized datastores such as the Omniverse Nucleus collaboration engine for digital assets, companies need to protect the continuity of the business and avoid work disruptions.
To maintain a Recovery Time Objective (RTO) of a few minutes, this solution implements incremental Nucleus data backups and automated configuration procedures. This includes periodic, incremental backups of the Nucleus data to an Amazon S3 bucket as well as serverless processes that use AWS Lambda, Auto Scaling groups, and AWS Step Functions to automatically launch and reconfigure Nucleus instances running on Amazon EC2.
When the Nucleus Auto Scaling group detects an instance failure, it automatically launches a new instance and the failover Step Functions procedure starts. The procedure pulls the Nucleus backup from Amazon S3 and, with AWS Systems Manager and the NVIDIA nucleus-tools, uploads the data into the new Nucleus instance.
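A failover workflow of this kind could be expressed in Amazon States Language roughly as shown below. The state names, the three-step breakdown, and the Lambda ARNs (with REGION and ACCOUNT_ID left as placeholders) are assumptions made for illustration; the actual state machine in the solution may be structured differently. The helper walks the definition to list the order in which the states would run.

```python
import json
from typing import Any, Dict, List

# Illustrative definition only. State names and ARNs are hypothetical.
FAILOVER_DEFINITION: Dict[str, Any] = {
    "Comment": "Sketch of a Nucleus failover flow (illustrative only)",
    "StartAt": "WaitForInstance",
    "States": {
        "WaitForInstance": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:wait-for-instance",
            "Next": "RestoreFromS3",
        },
        "RestoreFromS3": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:restore-backup",
            # Retry the restore a few times before failing the execution.
            "Retry": [
                {
                    "ErrorEquals": ["States.ALL"],
                    "IntervalSeconds": 10,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            "Next": "StartNucleus",
        },
        "StartNucleus": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:start-nucleus",
            "End": True,
        },
    },
}


def execution_order(definition: Dict[str, Any]) -> List[str]:
    """Follow StartAt/Next links and return the state names in order."""
    order: List[str] = []
    name = definition["StartAt"]
    while name is not None:
        if name in order:
            raise ValueError(f"cycle detected at state {name}")
        order.append(name)
        state = definition["States"][name]
        name = None if state.get("End") else state["Next"]
    return order


if __name__ == "__main__":
    print(json.dumps(execution_order(FAILOVER_DEFINITION)))
```

Expressing the recovery steps as a state machine keeps retries and ordering declarative, so a slow instance launch or a transient restore error does not require custom orchestration code.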
This approach allows customers to recover quickly from unexpected incidents that affect the availability of the Nucleus server. The recovery process is configurable and relies on a health check and Lambda functions to carry out the failover.
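The Lambda side of such a failover might begin by inspecting the Auto Scaling lifecycle event it receives. The sketch below assumes the event arrives in the EventBridge format for an instance-terminate lifecycle action, and the group name "nucleus-asg" and the output field names are hypothetical. A real handler would go on to start the Step Functions execution (for example with the boto3 Step Functions client), which is omitted here to keep the sketch dependency-free.

```python
import json
from typing import Any, Dict, Optional

TERMINATING = "autoscaling:EC2_INSTANCE_TERMINATING"


def failover_input(event: Dict[str, Any], nucleus_asg: str) -> Optional[str]:
    """Build the JSON input for the failover workflow.

    Returns None when the event is not a terminate lifecycle action for
    the Nucleus Auto Scaling group, so unrelated events are ignored.
    """
    detail = event.get("detail", {})
    if detail.get("LifecycleTransition") != TERMINATING:
        return None
    if detail.get("AutoScalingGroupName") != nucleus_asg:
        return None
    # Output field names are illustrative, not taken from the solution.
    return json.dumps(
        {
            "terminatedInstanceId": detail["EC2InstanceId"],
            "lifecycleHookName": detail["LifecycleHookName"],
        },
        sort_keys=True,
    )


if __name__ == "__main__":
    sample_event = {
        "source": "aws.autoscaling",
        "detail-type": "EC2 Instance-terminate Lifecycle Action",
        "detail": {
            "LifecycleTransition": TERMINATING,
            "AutoScalingGroupName": "nucleus-asg",  # hypothetical name
            "LifecycleHookName": "on-terminate",  # hypothetical name
            "EC2InstanceId": "i-0123456789abcdef0",
        },
    }
    print(failover_input(sample_event, "nucleus-asg"))
```

Filtering on the group name matters because the reverse proxy instances run in their own Auto Scaling group, and their terminations should not trigger a Nucleus restore.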
Infrastructure as Code
One of the key objectives of the Omniverse Nucleus on AWS reference architecture is to let customers provision the Nucleus server in an automated fashion by using the AWS Cloud Development Kit (AWS CDK). With Infrastructure as Code (IaC), customers receive the source code of the solution, which can be deployed in a repeatable way. Customers that require customizations can use AWS CDK to add AWS resources or modify the solution to fit their needs.
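AWS CDK applications are synthesized into AWS CloudFormation templates, and the repeatability of IaC comes from that template being a pure function of the source code. The sketch below illustrates the idea without the CDK library: it renders a minimal CloudFormation template for a versioned S3 backup bucket directly from a small configuration object. The class, function, and logical ID names are hypothetical and are not taken from the solution's source code.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class BackupConfig:
    """Hypothetical settings for the backup bucket."""

    bucket_logical_id: str = "NucleusBackupBucket"
    versioned: bool = True


def render_template(config: BackupConfig) -> Dict[str, Any]:
    """Render a minimal CloudFormation template for the backup bucket."""
    properties: Dict[str, Any] = {}
    if config.versioned:
        properties["VersioningConfiguration"] = {"Status": "Enabled"}
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            config.bucket_logical_id: {
                "Type": "AWS::S3::Bucket",
                "Properties": properties,
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(render_template(BackupConfig()), indent=2, sort_keys=True))
```

Because the same configuration always yields the same template, a customization becomes a reviewable code change rather than a manual step in the console.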
This solution also deploys an AWS CodeCommit repository and an AWS CodePipeline CI/CD pipeline that is used to automate modifications to the Nucleus deployment on AWS.
Conclusion
With AWS, customers can connect distributed users all over the globe to NVIDIA Omniverse Enterprise Nucleus. With the breadth and depth of AWS, high availability and disaster recovery techniques can be implemented for Nucleus deployed on AWS. These include load balancing, auto scaling, and backup and restore of the data in Nucleus. Together, they ensure teams can collaborate in real time with reliable access to their data.
Working alongside SoftServe professional services teams, customers can quickly deploy Nucleus in their AWS accounts and customize the solution for their business needs.
For a technical deep dive, please review this open-source solution from AWS and SoftServe:
NVIDIA Omniverse Nucleus on Amazon EC2
SoftServe – AWS Premier Partner
As an AWS Premier Tier Services Partner, SoftServe consistently helps customers to implement repeatable solutions in the AWS cloud through deep industry experience, innovation, and advanced technologies.
SoftServe can help you to transform your 3D workflows and enable your teams to achieve a new level of collaboration in 3D production quality with NVIDIA Omniverse Enterprise.
For more information about SoftServe and NVIDIA Omniverse Enterprise, please go to our website:
SoftServe – NVIDIA Omniverse Enterprise