AWS for M&E Blog

Leveraging Customer-Managed Fleets with AWS Deadline Cloud

AWS Deadline Cloud is a scalable, fully managed render scheduler from Amazon Web Services (AWS) for teams that need a better way to create ground-breaking visual content. To power the elastic workloads creative teams require, Deadline Cloud organizes Workers into Fleets. The Workers run Jobs given to them by Queues, which are collections of Jobs submitted by users.

Deadline Cloud has two Fleet types: Service-Managed Fleets (SMF), where all the compute happens as part of the AWS Deadline Cloud managed service, and Customer-Managed Fleets (CMF), where the compute happens in your own AWS account, or even on premises. CMF requires more setup and places more responsibility on you, making it the right choice when you need more flexibility in your rendering configuration. Read on for best practices in CMF architecture, auto scaling, and security.

Queues and Fleets

In the diagram, the Farm has multiple Queues (queue_1, queue_2, queue_3). When a Job is submitted to queue_2, the Queue assigns part of the Job to a Worker in one of the Fleets associated with that Queue: fleet_2 or fleet_3. fleet_1 is not associated with queue_2, so Workers in fleet_1 cannot be assigned the Job. fleet_3 is associated, but its minimum specification is below the Job's requirements, so it cannot run the Job either, leaving fleet_2 to do the work.

Figure 1: How a Job goes from a user to being assigned a specific Worker in a Fleet.


Deadline Cloud runs Jobs defined according to the Open Job Description (OpenJD) standard. When a Job is submitted to Deadline Cloud, part of the Job submission process is to explicitly assign it a Queue. A Queue is the primary permissions boundary; you control who has access to what Queue, and that controls how much they see of your Farm.

Queues have a many-to-many relationship with Fleets. One or more Fleets can be associated with a Queue, and one or more Queues can be associated with each Fleet. When a Queue receives a Job, it determines which Fleets are capable of running each Step within that Job. Steps are an OpenJD concept that break up a Job into groupings of work, and each Step contains requirements and smaller units of work called Tasks.

When you define a Fleet, you specify the minimum specification for each host within the Fleet. A Step won’t be scheduled to a Fleet if its requirements exceed the Fleet’s minimum specification. Consider a Fleet set to scale elastically that currently has 0 hosts: because scheduling decisions are made against the minimum specification, a Step that requires more than that minimum won’t cause the Fleet to spin up a host, even if the Fleet could launch one capable of running the Step.

If a Job is submitted to a Deadline Cloud Queue and a Step’s requirements are greater than the minimum Worker specifications of all associated Fleets, the Step won’t run because there is no compatible Fleet.
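The scheduling rule above can be sketched as a simple compatibility check. This is an illustrative sketch, not Deadline Cloud’s actual scheduler logic, and the attribute names (vcpus, memory_gib) are hypothetical rather than real API fields:

```python
def step_fits_fleet(step_requirements, fleet_min_spec):
    """Return True if every Step requirement is satisfiable by the Fleet's
    minimum host specification. The scheduler only knows the Fleet's
    *minimum* spec, so a Step needing more than that cannot be placed."""
    return all(
        fleet_min_spec.get(name, 0) >= needed
        for name, needed in step_requirements.items()
    )

# A Step needing 64 vCPUs won't schedule to a Fleet whose minimum is 32 vCPUs,
# but will schedule to a Fleet whose minimum is 96 vCPUs:
step = {"vcpus": 64, "memory_gib": 128}
small_fleet = {"vcpus": 32, "memory_gib": 256}
big_fleet = {"vcpus": 96, "memory_gib": 256}
```

Checking each requirement against the Fleet minimum, rather than against any individual host, is what makes the 0-host elastic Fleet case predictable.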

Customer-Managed Fleet architecture

An architecture diagram which shows an Auto Scaling group with Amazon EC2 Spot instances deployed in a public subnet within a Virtual Private Cloud (VPC), all deployed to Availability Zone A. The public subnet also has an internet gateway.

Figure 2: The sample Customer-Managed Fleet architecture deployed by the CloudFormation Template in the AWS Deadline Cloud documentation.


The Deadline Cloud documentation contains a sample template for deploying a simple CMF with AWS CloudFormation. The template creates an Amazon EC2 Auto Scaling group (ASG) deployed into a public subnet. This is a quick, simple, and easy to debug setup that’s ideal for your first foray into Deadline Cloud CMF, but a production deployment could use several tweaks.

An architecture diagram showing a Virtual Private Cloud (VPC) spanning two Availability Zones (A and B). Each zone has a private subnet containing AWS PrivateLink interface endpoints, an Amazon S3 gateway endpoint, a floating license server, high performance storage, and Amazon EC2 Spot instances in an Auto Scaling group. Zone A also has a public subnet with a bastion host, NAT gateway, and internet gateway.

Figure 3: An ideal Customer-Managed Fleet architecture, showing the use of private subnets and multiple availability zones.


The ideal architecture uses private subnets, AWS PrivateLink endpoints, and optionally a NAT Gateway. Deploying the ASG across multiple availability zones increases the availability of instances, and a bastion host allows access to resources in your private subnets. Access to Deadline Cloud’s Usage Based Licensing (UBL) should be configured with the AWS Command Line Interface (AWS CLI) or CloudFormation. If you have a floating license server for your software, a Queue environment should be configured, allowing your Workers to use your floating license server first.

A private subnet is a subnet that doesn’t have a direct route to or from the internet. Deploying Amazon EC2 resources in a private subnet, such as the Amazon EC2 Spot Instances commonly used in the ASG for Deadline Cloud rendering, results in those instances receiving only a private IP address.

These instances can still access the internet if a NAT Gateway is deployed and set up with appropriate routing from the private subnets to the NAT Gateway for outbound internet traffic. This makes it easy to download resources and security patches to instances in the private subnets, but allowing those instances to access the internet might be undesirable.

This architecture uses two methods to route traffic to AWS services from your private subnets without requiring internet access. For Amazon S3 traffic, a gateway VPC endpoint uses IP prefix lists to route S3 traffic, and has no charge for use. For Amazon CloudWatch Logs traffic (required for the Workers to send their logs to CloudWatch) and Deadline Cloud traffic, best practice is to deploy interface VPC endpoints using AWS PrivateLink. If you need access to other AWS services from resources in a private subnet, add PrivateLink endpoints for those services as well. You are billed for each hour an interface endpoint remains provisioned, plus data processing fees for each gigabyte processed.

AWS Deadline Cloud endpoints

AWS Deadline Cloud offers two PrivateLink endpoints: deadline.management and deadline.scheduling. The scheduling endpoint is used for Worker resource API communication with Deadline Cloud, while the management endpoint is used for everything else. For Deadline Cloud Workers operating in an ASG, the scheduling endpoint is required. If your Workers run custom scripts that call the Deadline Cloud API, they might additionally require access to the management endpoint.
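As a sketch, the two interface endpoints can be created with boto3. The service-name pattern below is an assumption based on the standard com.amazonaws.&lt;region&gt;.&lt;service&gt; convention (verify it for your Region with aws ec2 describe-vpc-endpoint-services), and the VPC and subnet IDs are placeholders:

```python
def deadline_endpoint_service_names(region):
    # Assumption: Deadline Cloud exposes "scheduling" and "management"
    # endpoint services under the standard PrivateLink naming pattern.
    return [
        f"com.amazonaws.{region}.deadline.scheduling",
        f"com.amazonaws.{region}.deadline.management",
    ]

def create_deadline_endpoints(region, vpc_id, subnet_ids, ec2=None):
    """Create one interface endpoint per Deadline Cloud service in the
    given private subnets. An ec2 client can be injected for testing."""
    if ec2 is None:
        import boto3  # deferred so the helper above has no dependencies
        ec2 = boto3.client("ec2", region_name=region)
    for service in deadline_endpoint_service_names(region):
        ec2.create_vpc_endpoint(
            VpcEndpointType="Interface",
            VpcId=vpc_id,
            ServiceName=service,
            SubnetIds=subnet_ids,
            PrivateDnsEnabled=True,
        )
```

Enabling private DNS lets the Worker agent resolve the regular Deadline Cloud hostnames to the endpoint’s private IPs without any configuration change on the Workers.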

The AWS CLI allows you to set up a third type of endpoint, a Deadline Cloud UBL license endpoint. Once deployed, a Queue environment should be configured to set an environment variable to point to the DNS of the license endpoint. Queue environments use an OpenJD environment template to describe the environment, so a simple Queue environment for setting a Foundry Nuke license might look like the following:

specificationVersion: environment-2023-09
environment:
  name: NukeLicenseEndpoint
  variables:
    foundry_LICENSE: 6101@<Deadline_Cloud_License_Endpoint_DNS>

Where <Deadline_Cloud_License_Endpoint_DNS> is the DNS name of your Deadline Cloud license endpoint, as returned by aws deadline get-license-endpoint. You can typically combine multiple floating license servers and the UBL endpoint in that environment variable, depending on the software.
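Building that combined value can be sketched as below. The ":" separator and the floating-servers-first ordering are assumptions based on the common RLM-style license search path on Linux (Windows typically uses ";"); check your software’s licensing documentation:

```python
def foundry_license_value(ubl_endpoint_dns, floating_servers=(), port=6101):
    """Build a foundry_LICENSE value that tries on-premises floating
    license servers first, then falls back to the Deadline Cloud UBL
    endpoint. Separator and ordering are assumptions -- verify against
    your software's licensing documentation."""
    hosts = list(floating_servers) + [ubl_endpoint_dns]
    return ":".join(f"{port}@{host}" for host in hosts)
```

With a floating server listed first, UBL is only consumed when the on-premises pool is exhausted, which keeps per-minute licensing costs down.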

Auto scaling

An architecture diagram showing AWS Deadline Cloud emitting Fleet Size Recommendation Change Events to Amazon EventBridge. EventBridge has an Auto Scaling Event Rule which triggers the Auto Scaling Lambda AWS Lambda function.

Figure 4: The architecture that the CloudFormation Template in the AWS Deadline Cloud docs deploys to drive Amazon EC2 Auto Scaling in a Customer-Managed Fleet.


Your Fleet’s ASGs will use events emitted from Deadline Cloud to Amazon EventBridge to invoke an auto scaling AWS Lambda function that adjusts the desired Fleet size for your ASG. The Auto Scaling Event Rule and Auto Scaling Lambda are located within your account, and can be customized to suit your requirements. Deadline Cloud will continue to send fleet size recommendation change events until the Fleet matches the desired Worker count.
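A minimal version of that Lambda function might look like the following. The event detail field names (fleetId, newFleetSize) and the Fleet-to-ASG mapping are assumptions to verify against the sample template and the EventBridge event schema in the Deadline Cloud documentation:

```python
# Hypothetical mapping from Deadline Cloud Fleet IDs to ASG names; the
# sample template derives this from tags or environment variables instead.
FLEET_TO_ASG = {"fleet-example1234": "deadline-cmf-asg"}

def lambda_handler(event, context=None, autoscaling=None):
    """Apply a Deadline Cloud fleet size recommendation to the matching
    ASG. An autoscaling client can be injected for testing."""
    detail = event["detail"]
    # Assumption: the Fleet Size Recommendation Change event carries the
    # Fleet ID and the recommended worker count in its detail payload.
    asg_name = FLEET_TO_ASG[detail["fleetId"]]
    desired = detail["newFleetSize"]
    if autoscaling is None:
        import boto3  # deferred import; only needed when running in Lambda
        autoscaling = boto3.client("autoscaling")
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=asg_name,
        DesiredCapacity=desired,
        HonorCooldown=False,
    )
    return desired
```

Because Deadline Cloud keeps re-emitting recommendation events until the Fleet converges, the handler can stay idempotent and stateless: it simply sets the desired capacity to whatever the latest event recommends.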

The ASG should be configured with scale-in protection enabled, preventing the ASG from terminating instances on its own, as it has no information about which EC2 instances are working on which Jobs. When Deadline Cloud scaling recommendations are turned on, your Fleet’s Worker lifecycle changes: Workers will shut down the operating system when instructed to by Deadline Cloud. Without scaling recommendations, Workers remain running, and no attempt to shut down the operating system is made. Because Workers shut down the OS, the instances should be configured to terminate instead of stop, preventing them from lingering in a stopped state and continuing to incur costs.

Maintaining fleet health

An architecture diagram showing a custom health check for an AWS Deadline Cloud Customer-managed Fleet Auto Scaling group. An Amazon EventBridge Scheduler triggers an AWS Lambda Health Check Lambda function at a selected rate, such as every 10 minutes. The Health Check Lambda compares the instances in the Auto Scaling group to the Worker instances registered in the Deadline Cloud Fleet. Instances in the Auto Scaling group that are missing from the Fleet are removed and noted as unhealthy. The Lambda then reports all unhealthy instances to Amazon CloudWatch, where Health Check Metrics are tracked. If errors remain over subsequent Health Check Lambda runs, a Health Check Alarm is raised in CloudWatch.

Figure 5: An example architecture of a custom health check that monitors and maintains the health status of an ASG.


To keep your fleet healthy and free of stalled or misbehaving instances, build a custom health check for your ASG. This lowers the risk of an accidental change to your Amazon Machine Image, launch template, or network configuration going undetected.

An Amazon CloudWatch metrics view showing multiple graphed metrics over time. The graph displays three metrics plotted as lines rising and falling over time: WorkerHealthCheckFailureCount, WorkerHealthCheckCount, and RecommendedFleetSize. The y-axis shows the metric count values, while the x-axis shows the time range.

Figure 6: An example CloudWatch metrics view which could be driven by a custom health check for your ASG.

Using EventBridge Scheduler, you schedule the regular invocation of a custom health check Lambda function, which cross-checks the ASG instance list against the Deadline Cloud Worker list, removing anomalies and reporting metrics to CloudWatch. Those metrics then drive a CloudWatch alarm, which should be further configured to use Amazon Simple Notification Service (Amazon SNS) to email or page a user.
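The core of that health check Lambda is a set difference plus the metrics shown in Figure 6. This is a sketch of that logic only; listing ASG instances, listing Fleet Workers, and publishing to CloudWatch with put_metric_data are omitted, and the metric names simply mirror Figure 6:

```python
def find_unregistered_instances(asg_instance_ids, fleet_worker_instance_ids):
    """Instances running in the ASG that never registered as Deadline Cloud
    Workers are treated as unhealthy candidates for termination."""
    return sorted(set(asg_instance_ids) - set(fleet_worker_instance_ids))

def health_metrics(asg_instance_ids, fleet_worker_instance_ids):
    """Summarize the check as the metrics graphed in Figure 6; publishing
    them to CloudWatch is left out of this sketch."""
    unhealthy = find_unregistered_instances(
        asg_instance_ids, fleet_worker_instance_ids
    )
    return {
        "WorkerHealthCheckCount": len(asg_instance_ids),
        "WorkerHealthCheckFailureCount": len(unhealthy),
    }
```

Alarming on WorkerHealthCheckFailureCount staying above zero across several runs, rather than on a single spike, avoids paging on instances that are still mid-bootstrap.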

Users and groups

AWS Deadline Cloud is designed to integrate with AWS IAM Identity Center to offer fine-grained permission control for your users and Queues. To provide another line of defense against malicious users, the Queue’s permissions design and structure should be extended to how Deadline Cloud runs Jobs on your CMF infrastructure.

A diagram illustrating how the AWS Deadline Cloud Worker agent utilizes users and groups to extend the permissions boundaries of Queues to Customer-managed Fleets. When a Queue sends work to the Worker agent, the Worker agent creates a new process for the unit of work and runs it as the Queue's user. Any required files from mounted storage configured with storage profiles should share a group with the Queue user. The Worker agent adds all included submission files to a temporary folder and sets permissions so only itself and members of the shared group access those files.

Figure 7: How the AWS Deadline Cloud Worker agent utilizes users and groups to extend the Queue’s permission boundaries to Customer-Managed Fleets.


With this, when the Deadline Cloud Worker agent receives a Job to run, it sets up a temporary directory containing any required files. It sets permissions so that only the agent itself and members of the shared group can access those files. It then starts an OpenJD Session to run the Job as a separate process, under a user unique to the Queue that assigned the work.
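The permissions step can be sketched as below. This is an illustration of the owner-plus-group pattern, not the Worker agent’s actual implementation; changing the directory’s group to the Queue group (os.chown) requires elevated privileges and is omitted here:

```python
import os
import tempfile

def make_session_dir(parent):
    """Sketch: create a session working directory accessible only to its
    owner and group, mirroring how the Worker agent restricts submission
    files to itself and the shared Queue group. Assigning the Queue group
    as the directory's group (os.chown) needs root and is omitted."""
    path = tempfile.mkdtemp(dir=parent)
    os.chmod(path, 0o770)  # owner + group only; no access for others
    return path
```

A setgid bit (mode 0o2770) could additionally be set so that files created inside the directory inherit the Queue group automatically.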

A diagram illustrating that the Deadline Cloud Worker agent should be configured to have secondary group membership in each Queue's user group. The first Queue's user (queue_1_user) has primary membership in a shared Queue user group (queue_1_group). The second Queue's user (queue_2_user) has primary membership in the shared Queue user group (queue_2_group). The Deadline Cloud Worker agent user (worker_agent_user) has secondary membership in both queue_1_group and queue_2_group.

Figure 8: The Deadline Cloud Worker agent should be configured to have secondary group membership in each Queue’s user group.


This unique Queue user allows the Deadline Cloud Worker agent to extend the Queue permissions boundary. When different users and groups are specified for each Queue, a Job kicked off from one Queue wouldn’t normally be able to access the working files of a different Queue. The Deadline Cloud Worker agent user must share group membership with each Queue’s user to allow it to place files for that user. If you use attached storage with your CMF, the Queue user should share its primary group with the users who write files to that storage, allowing it to read dependencies owned by those users and to output files which your users have permission to read.

Maintaining a 1:1 relationship between Queues and Fleets is another strategy for using permissions to achieve greater security.


This post provided a comprehensive architecture walk-through for leveraging your own infrastructure with Customer-Managed Fleets in Deadline Cloud, as well as best practices for CMF auto scaling and security. Readers are encouraged to learn more about Deadline Cloud concepts, CMF setup and management, Deadline Cloud security best practices, and how environments should be configured with OpenJD.

Sean Wallitsch


Sean is a Senior Solutions Architect, Visual Computing at AWS.