AWS Thinkbox Deadline adds multi-regional support to Spot Event Plugin
Amazon Web Services (AWS) has announced AWS Thinkbox Deadline release 10.2.1 that includes the addition of AWS multi-regional support to the Spot Event Plugin, which allows Deadline customers to easily scale rendering by launching and managing Spot Fleets in multiple AWS regions from a single Spot Event Plugin.
In order to leverage the elasticity of the cloud, customers want the ability to provision computing resources to match the work they need to accomplish. Traditionally, this involved calculating how much work is needed, and manually invoking the required compute instances. In an earlier version of Deadline, the Spot Event Plugin (SEP) was added to solve this. It allows the scheduler to automatically invoke (and later turn off) compute resources as-and-when required, based on the work in the render queue. This feature works within the guidelines of specified EC2 Spot Fleet configurations, and can repeatedly grow or shrink spot fleet requests based on workload demands, simplifying the entire process of managing cloud resources. Studios with extraordinary compute demands or multi-office presence are turning to AWS’s multiple regions to solve their compute needs, something we’ve made easier by adding multi-region support to Deadline’s SEP.
This blog discusses how the SEP works and how it has evolved to support multi-region render workflows while remaining simple to use.
“We have artists based all over the world and the enhanced Spot Event Plugin enables us to use the global footprint of AWS to target render machines across different regions. This is particularly useful when we’re trying to render complex shots that require a lot of computing power and we are able to see all of this in real time through Deadline. This has allowed for a truly collaborative process, which increases productivity but, most importantly, elevates our creativity as a studio” – Sam Reid, Head of Technology | Untold Studios
How does the Spot Event Plugin work?
Figure 1: Deadline’s Spot Event Plugin provisions Amazon EC2 Spot compute in a single AWS region, which is supported in either (a) hybrid or (b) all-in architecture
The Spot Event Plugin (SEP) can scale Amazon Elastic Compute Cloud (Amazon EC2) Spot instances dynamically based on the number of queued jobs and tasks in the Deadline queue. Each provisioned Spot instance runs a single instance of the Deadline Worker application that controls the rendering process of the desired application. The SEP associates a JSON-based Spot Fleet Request (SFR) with a Deadline Group, allowing multiple SFRs with different hardware, software and AWS specifications to be tailored for different types of jobs, which Deadline allocates based on their Group assignment. The SEP controls the lifecycle of multiple SFRs by scaling the TargetCapacity value of each SFR up and down.
In order to maintain the SFRs, the SEP regularly determines the optimal number of Deadline Workers for each group, then creates or modifies the SFR appropriately. In this fashion, the SEP ensures the SFR always matches the amount of work for a specific group, either up to the maximum set by the user or down to zero instances if there are no tasks. When a Worker is either no longer rendering, or the Worker has received a Spot interruption notice, the Worker will mark itself as offline, remove itself from the Workers list in the queue and then self-terminate (something that Arnold Schwarzenegger struggled with as a T-800).
For every Group that is configured, the SEP scales the target number of Spot instances based on the current state of the farm during HouseCleaning (a background process that Deadline periodically initiates to perform a tidy of the Deadline ecosystem). Following are some of the factors taken into account:
- State: Only queued tasks are considered eligible to work on.
- Allow Lists: If Workers are listed on an allow list, then they are the only ones that will try to render jobs associated with this limit. If a job has an allow list, then no Spot instances will be started for it.
- Limits: The number of Limit stubs (Machine, Plugin, License) you have available will constrain the number of instances to start (Limits are used to control many types of resources in Deadline, including third-party ‘floating’ licenses).
- Concurrent Tasks: The number of tasks a Worker can dequeue at a time. If Concurrent Tasks are enabled on a job or its plugin, then they will be taken into account.
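To make the interaction of these factors concrete, here is a minimal Python sketch of how a per-Group target could be estimated. It is not Deadline's actual algorithm: the function name, the job dictionary fields (`queued_tasks`, `concurrent_tasks`, `has_allow_list`) and the simple "ceiling of tasks per Worker" rule are all assumptions made for illustration.

```python
import math

def target_capacity(jobs, available_limit_stubs, max_capacity):
    """Estimate how many Spot instances a Group needs (illustrative only)."""
    workers_needed = 0
    for job in jobs:
        # Jobs with an allow list never trigger new Spot instances.
        if job.get("has_allow_list", False):
            continue
        # A Worker can dequeue several tasks at once if Concurrent Tasks is set.
        per_worker = max(job.get("concurrent_tasks", 1), 1)
        workers_needed += math.ceil(job["queued_tasks"] / per_worker)
    # Never exceed the free Limit stubs or the user-defined maximum.
    return max(0, min(workers_needed, available_limit_stubs, max_capacity))

# Hypothetical queue state for one Deadline Group.
jobs = [
    {"queued_tasks": 90, "concurrent_tasks": 2},
    {"queued_tasks": 10, "has_allow_list": True},
    {"queued_tasks": 5},
]
print(target_capacity(jobs, available_limit_stubs=40, max_capacity=100))  # 40
```

In this example the queue alone would justify 50 Workers, but only 40 Limit stubs are free, so the target is capped at 40.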
Ensuring Spot Fleet health
The Resource Tracker (RT) is a regional service that monitors the health of your SFRs and Spot instances started by the SEP. If enabled in the SEP (true by default), the RT is automatically deployed to each AWS region the first time the SEP creates an SFR there. The RT is composed of AWS Lambda functions, Amazon SQS notifications and Amazon DynamoDB, and can be considered a fail-safe for your Deadline cloud setup.
The RT monitors the heartbeat reported by each Amazon EC2 instance running a Deadline Worker and terminates instances that are failing Deadline health checks, helping you avoid extra costs. The RT also monitors the overall health of your SFRs: when the proportion of instances that fail their health checks exceeds the fleet termination threshold (20% by default), it cancels that SFR and prevents new SFR launches until the underlying issue is solved and you can restore normal operation.
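The fleet termination rule can be pictured with a small Python sketch. This is only an illustration of the threshold check described above, not the Resource Tracker's code; the function name and the strict "greater than" comparison are assumptions.

```python
def fleet_should_be_cancelled(unhealthy_instances, total_instances, threshold=0.20):
    """Return True when the share of unhealthy instances exceeds the threshold."""
    if total_instances == 0:
        return False  # an empty fleet has nothing to fail
    return unhealthy_instances / total_instances > threshold

print(fleet_should_be_cancelled(2, 10))  # False: exactly 20% does not exceed it
print(fleet_should_be_cancelled(3, 10))  # True: 30% is above the default threshold
```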
The RT provides custom Amazon CloudWatch Events, Amazon SNS Notifications for email/Slack notification, and a built-in user-interface in Deadline Monitor to unblock SEP if it encounters an error state. The cost of running RT per month (24/7) typically equates to less than 0.028% of your total compute cost for 500 render nodes.
“Having automated support for Resource Trackers across all regions in a multi-region Spot request, makes it easy to confidently expand rendering across studio sites, without worrying about orphaned resources” – Daniel Marshall, CEO | Konsistent Consulting
How to configure the Spot Event Plugin
The following steps are required in order to use the SEP:
- Security: Create AWS Identity and Access Management (IAM) policies, IAM user and IAM roles. AWS provides specific IAM managed policies for Spot Event Plugin, SEP Worker, and Resource Tracker to make this easy as well as instructions to configure permissions, using the AWS principle of least-privilege.
- Image: Using the recommended steps, create an Amazon Machine Image (AMI) with Deadline client software configured to connect to your RCS in each AWS region you plan to use.
- Spot Fleet Request (JSON): Follow these steps to create a Spot Fleet Request (SFR) using the AMI from the previous step via AWS Management Console or API and save the configuration locally as a JSON file.
If you want to create a different AMI or additional Spot Fleets for different Deadline Groups in the SEP configuration, repeat the Image and/or Spot Fleet Request (JSON) steps. To simplify your configuration, use EC2 Launch Templates and reference them in your SFR. For each Launch Template you create, ensure you apply a tag with the key of ‘DeadlineTrackedAWSResource’ and the value of ‘SpotEventPlugin’, as that will ensure the Spot instances SEP spins up are properly tagged and tracked by Resource Tracker.
Figure 2: Launch Templates referenced in a Spot Fleet Request must contain the ‘DeadlineTrackedAWSResource: SpotEventPlugin’ key/value tag for Amazon EC2 instances
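As a sketch of what that tag looks like in Launch Template data, the Python below builds a dictionary using the EC2 `TagSpecifications` structure and checks that the required key/value pair is present. The AMI ID and instance type are placeholders, and the helper names are hypothetical; a dictionary of this shape is what you would typically pass as `LaunchTemplateData` when creating a Launch Template through the EC2 API.

```python
REQUIRED_TAG = {"Key": "DeadlineTrackedAWSResource", "Value": "SpotEventPlugin"}

def launch_template_data(ami_id, instance_type):
    """Build Launch Template data whose instances carry the tracking tag."""
    return {
        "ImageId": ami_id,              # placeholder AMI with the Deadline client
        "InstanceType": instance_type,
        "TagSpecifications": [
            {"ResourceType": "instance", "Tags": [REQUIRED_TAG]},
        ],
    }

def has_required_tag(template_data):
    """Check that instances launched from this data would be tagged for tracking."""
    for spec in template_data.get("TagSpecifications", []):
        if spec.get("ResourceType") == "instance" and REQUIRED_TAG in spec.get("Tags", []):
            return True
    return False

data = launch_template_data("ami-0123456789abcdef0", "c5.4xlarge")
print(has_required_tag(data))  # True
```

A check like `has_required_tag` could be run over every Launch Template before it is referenced in an SFR, so that untagged instances are caught before they launch.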
Worker naming considerations for multi-region support
Deadline Workers should always have unique names. The name applied to a Worker launched by the Spot Event Plugin will by default be the hostname of the Amazon EC2 instance (defaults to the EC2 instance’s private IP address). We highly recommend that you use one of the following options to ensure that all launched Deadline Workers have unique names:
- Use Launch Templates in your SFR configuration and set the Hostname type to Resource name. With this setting, all Amazon EC2 instances launched from the Launch Template use the resource name of the EC2 instance as the hostname instead of the private IP address. EC2 instance resource names are unique across AWS Regions. Example: ec2-instance-id.region.compute.internal
Figure 3: Amazon EC2 Launch Template configuration to ensure unique EC2 instance hostname
- Ensure you set unique, non-overlapping CIDR blocks across AWS regions for all subnets in which the SEP will launch EC2 instances. Unique CIDR blocks ensure that the private IP addresses (and therefore the hostnames and Deadline Worker names) of all launched EC2 instances are unique across AWS Regions. Please note that by default the CIDR blocks created by the default VPC wizard are identical in each AWS region, which will cause a clash in Deadline Worker names if ignored.
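If you choose the CIDR option, a quick overlap check before launching fleets can catch clashes early. The following Python sketch uses only the standard `ipaddress` module; the region-to-subnet mapping is made-up example data, and the function name is hypothetical.

```python
import ipaddress
from itertools import combinations

def overlapping_subnets(subnets_by_region):
    """Return pairs of (region, cidr) entries whose address ranges overlap."""
    entries = [
        (region, ipaddress.ip_network(cidr))
        for region, cidrs in subnets_by_region.items()
        for cidr in cidrs
    ]
    return [
        ((r1, str(n1)), (r2, str(n2)))
        for (r1, n1), (r2, n2) in combinations(entries, 2)
        if n1.overlaps(n2)
    ]

# Example data: two regions still use the same default-VPC style range.
subnets = {
    "us-west-2": ["172.31.0.0/20"],
    "eu-west-1": ["172.31.0.0/20"],
    "ap-southeast-2": ["10.20.0.0/20"],
}
for clash in overlapping_subnets(subnets):
    print("Overlap:", clash)
```

An empty result means every subnet range is distinct, so private IP based hostnames cannot collide between regions.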
The Spot Event Plugin settings can now be configured inside of Deadline Monitor. You will need to supply the IAM credentials you created as part of the Security setup. The Spot Fleet Request Configurations setting is a JSON dictionary. It represents one or more one-to-one mappings between a Deadline Group and a Spot Fleet Request which has the syntax:
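As a hedged illustration of that mapping, the Python sketch below assembles a dictionary keyed by Deadline Group name, with one Spot Fleet Request configuration per Group, and serializes it to JSON. The Group names, role ARN, Launch Template IDs and capacities are placeholders, and the SFR bodies are trimmed to a few standard Spot Fleet fields; a real configuration exported from the AWS Management Console will contain more.

```python
import json

def example_sfr(launch_template_id, target_capacity):
    """Return a trimmed-down Spot Fleet Request configuration (placeholder values)."""
    return {
        "IamFleetRole": "arn:aws:iam::123456789012:role/example-spot-fleet-role",
        "AllocationStrategy": "capacityOptimized",
        "TargetCapacity": target_capacity,
        "Type": "maintain",
        "LaunchTemplateConfigs": [
            {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateId": launch_template_id,
                    "Version": "$Latest",
                }
            }
        ],
    }

# One entry per Deadline Group: "<group name>": <Spot Fleet Request configuration>
spot_fleet_request_configurations = {
    "maya_cpu": example_sfr("lt-0123456789abcdef0", 50),
    "houdini_gpu": example_sfr("lt-0fedcba9876543210", 10),
}

config_json = json.dumps(spot_fleet_request_configurations, indent=4)
print(config_json)
```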
An additional Spot Event Configuration Utility can be used to edit existing SFR configurations, or create configurations from existing ones.
Figure 4: Deadline’s Spot Event Plugin configuration options dialog with typical settings applied
- When configuring the SEP, it is essential to decide how the EC2 Workers will connect to the Deadline Repository. You should either configure Deadline at AMI creation or configure Deadline at instance launch via an EC2 user data script.
- If one or more of your EC2 instances are running inside a private subnet, you can use Virtual Private Cloud (VPC) endpoints to privately connect to supported AWS services such as Amazon S3 or Amazon EC2. Because these endpoints are powered by AWS PrivateLink, Amazon VPC instances do not require public IP addresses to communicate with resources of the service. Traffic between an Amazon VPC and a service does not leave the Amazon network.
- EC2 Instance Metadata service (IMDS) v1 or v2 must be enabled for all EC2 instances when using SEP.
You are now ready to automatically scale your rendering into AWS via Deadline.
“The Spot Event Plugin brings native support for multi-region rendering, making it even easier for pixitmedia to leverage the reach of AWS and provide this global rendering resource wherever and whenever users need it when using the power of pixstor and ngenea, our software-defined storage platform and data orchestration and management system” – Ben Leaver, CEO | pixitmedia
Global render farm
Figure 5: Deadline’s Spot Event Plugin provisioning compute in multiple AWS regions
With the new multi-region option in SEP, Deadline 10.2.1 provides additional capacity and gives customers a single-pane-of-glass view of their entire studio-wide render farm via Deadline Monitor. For studios that already have infrastructure deployments in multiple regions, the feature allows them to render in their preferred AWS region or easily move jobs around if required, while adhering to typical resource limits such as floating license limitations. To enable it, switch the SEP to Multi-Region mode in Deadline Monitor, in the Configure Events plugins window, under ‘Spot’.
Figure 6: Deadline’s SEP configuration dialog, displaying the new ‘Multi-Region’ option
Using Multi-Region mode requires a slight change to the SFR configuration in JSON, to be able to identify the region where a request will be submitted for a specified Deadline group. Multi-Region mode can be used in a single AWS region. The Spot Fleet configuration is still a one-to-one mapping between a Deadline Group and Spot Fleet Request Configuration, with the additional requirement that Deadline Groups must be unique and specific to a single AWS region. A Multi-Region Spot Fleet configuration has the syntax:
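A sketch of that shape, assuming the region name is the outer key and each region holds its own Group-to-SFR mapping, is shown below in Python. The region names, Group names and abbreviated SFR bodies are placeholders, and the uniqueness check is a hypothetical helper rather than part of Deadline.

```python
import json

# "<region>": { "<group name>": <Spot Fleet Request configuration>, ... }
multi_region_config = {
    "us-west-2": {
        "maya_cpu_usw2": {"TargetCapacity": 50, "Type": "maintain"},
    },
    "eu-west-1": {
        "maya_cpu_euw1": {"TargetCapacity": 25, "Type": "maintain"},
    },
}

def group_regions(config):
    """Map each Deadline Group to its region, rejecting Groups used in two regions."""
    seen = {}
    for region, groups in config.items():
        for group in groups:
            if group in seen:
                raise ValueError(
                    f"Group {group!r} appears in both {seen[group]} and {region}"
                )
            seen[group] = region
    return seen

print(group_regions(multi_region_config))
print(json.dumps(multi_region_config, indent=4))
```

Running a check like this before pasting the JSON into Deadline Monitor is one way to confirm that every Group belongs to exactly one region.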
The order of the AWS regions listed in the SFR configuration JSON is the order in which they will be managed by the Spot Event Plugin. This allows you to prioritize the use of certain AWS regions. The SFR configuration allows for the use of wildcards as part of the Deadline Group name to associate multiple similarly named Groups with a single SFR. This helps to reduce overall JSON length.
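The effect of region ordering and wildcard Group names can be sketched in Python as follows. This is an assumption-laden illustration: it uses shell-style matching from the standard `fnmatch` module, which may differ from the wildcard rules Deadline actually applies, and the region names, patterns and helper function are hypothetical.

```python
from fnmatch import fnmatchcase

# Regions are listed in priority order; Python dicts preserve insertion order.
config = {
    "us-west-2": {"usw2_*": {"TargetCapacity": 50}},
    "eu-west-1": {"euw1_nuke": {"TargetCapacity": 10}},
}

def resolve_group(config, deadline_group):
    """Return (region, pattern) for the first entry matching the Group name."""
    for region, groups in config.items():
        for pattern in groups:
            if fnmatchcase(deadline_group, pattern):
                return region, pattern
    return None  # no SFR is associated with this Group

print(resolve_group(config, "usw2_maya"))   # ('us-west-2', 'usw2_*')
print(resolve_group(config, "euw1_nuke"))   # ('eu-west-1', 'euw1_nuke')
print(resolve_group(config, "apse2_maya"))  # None
```

Here a single `usw2_*` entry covers every Group whose name starts with `usw2_`, which is how a wildcard keeps the JSON short.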
That’s it. You now have a fully armed and operational global render farm (lightsabers not included).
“Native support for multiple regions in the Spot Event Plugin makes it trivial to leverage AWS’ global reach, and scale geographically for our rendering” – Daniel Marshall, CEO | Konsistent Consulting
Advanced monitoring for a global render farm
Deadline provides an SQS notifications feature to help you monitor Deadline events at a per-job and per-task level. It is a perfect companion to the SEP but operates independently. This feature requires all your Workers to connect via the Remote Connection Server (RCS). On AWS, an Amazon SQS queue, a serverless AWS Lambda function, and an Amazon DynamoDB table should be provisioned in your AWS account. State notifications are collected by RCS and sent every minute to the SQS queue, where they await processing and de-duplication by Lambda, and insertion into a DynamoDB store. This construct allows customers to build comprehensive telemetry pipelines for their queue outside of Deadline, and to track Amazon EC2 cost accurately within the context of a particular project (per-sequence, per-shot, or per-frame) with data that is only 60 seconds behind reality.
Figure 7: Deadline’s SQS Notification architecture. Deadline’s RCS sends notifications to a SQS queue, which in turn is processed by Lambda and duplicate events are removed before being stored in a DynamoDB database
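The de-duplication step in this architecture can be sketched as a Lambda-style handler in Python. This is not the example function from the Deadline documentation: it keys on the SQS `messageId` from the standard Lambda SQS event, uses an in-memory dictionary as a stand-in for the DynamoDB table, and the notification body fields (`jobId`, `state`) are invented for the example.

```python
import json

def handler(event, context=None, store=None):
    """Insert each SQS notification once, skipping redelivered duplicates."""
    store = {} if store is None else store  # stand-in for a DynamoDB table
    records = event.get("Records", [])
    inserted = 0
    for record in records:
        message_id = record["messageId"]
        if message_id in store:
            continue  # a Standard queue may deliver the same message twice
        store[message_id] = json.loads(record["body"])
        inserted += 1
    return {"inserted": inserted, "skipped": len(records) - inserted}

# Hypothetical event containing one duplicated delivery.
table = {}
event = {
    "Records": [
        {"messageId": "m-1", "body": json.dumps({"jobId": "j-42", "state": "Rendering"})},
        {"messageId": "m-1", "body": json.dumps({"jobId": "j-42", "state": "Rendering"})},
        {"messageId": "m-2", "body": json.dumps({"jobId": "j-42", "state": "Completed"})},
    ]
}
print(handler(event, store=table))  # {'inserted': 2, 'skipped': 1}
```

In a real deployment the dictionary would be replaced by a conditional write to DynamoDB, so that de-duplication also holds across separate Lambda invocations.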
To configure SQS notifications:
- Ensure all Deadline Workers connect to your repository via RCS.
- Create an SQS queue in your primary AWS region. Use a ‘Standard’ queue for its high throughput; Lambda will handle any message de-duplication.
- Ensure each machine running an RCS instance has an IAM role with an IAM policy that allows the sqs:SendMessage action on your SQS queue.
- In Deadline Monitor, enter the URL of your SQS queue.
Figure 8: Deadline’s SQS notification queue URL setting in Deadline Monitor UI
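For the IAM step in the list above, a least-privilege policy document might look like the one generated by this Python sketch. The queue ARN is a placeholder and the helper name is hypothetical; the document structure follows the standard IAM policy format.

```python
import json

def send_message_policy(queue_arn):
    """Build an IAM policy document that only allows sending to one SQS queue."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sqs:SendMessage",
                "Resource": queue_arn,
            }
        ],
    }

# Placeholder ARN for the notification queue created earlier.
policy = send_message_policy("arn:aws:sqs:us-west-2:123456789012:deadline-notifications")
print(json.dumps(policy, indent=4))
```

Scoping `Resource` to the single queue ARN, rather than `*`, keeps the RCS role limited to the one action it needs.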
The Deadline SQS notification documentation provides an example of notification processing that includes the IAM, DynamoDB, Lambda Python function, and CloudWatch log configuration, together with Deadline example notification messages.
In this post, we explained how to set up Deadline’s Spot Event Plugin (SEP) for AWS single-region use and, now, with multi-region support. This allows customers to create Spot Fleets in multiple AWS regions from a single SEP and to scale rendering to additional capacity, while providing a single-pane-of-glass view via Deadline’s Monitor application to an entire workforce for multi-office working, all from a single Deadline installation.
Additionally, we explained how the SQS notification feature in Deadline can help provide rich telemetry and accurate Amazon EC2 cost tracking at a per-job and per-task level for further downstream customer processing with only 60 seconds of latency.
Download: AWS Thinkbox download page on the AWS Management Console.
About Untold Studios
Untold Studios is a BAFTA, EMMY and GRAMMY nominated studio, shaping culture through music, TV and advertising. Untold Studios develops original programming, produces music and advertising content and crafts world-class VFX, all enabled by next-generation technology.
About Konsistent Consulting
Konsistent Consulting is an APN approved systems integrator, working with VFX and post production facilities globally providing consultancy and technical solutions.
About pixitmedia
Specialists in software-defined storage and data management solutions for media and entertainment, pixitmedia is a Kalray company (Euronext Growth Paris: ALKAL), a leading provider of hardware and software technologies and solutions for high-performance, data-centric applications markets from edge to cloud.