AWS for M&E Blog

Accelerate Thinkbox Deadline by bursting to the cloud with Amazon File Cache

As the prevalence and complexity of computer-generated images (CGI) have increased in film, TV, and commercials over the years, so has the industry’s demand for massive compute in render farms to process and render CGI elements. Render farms dramatically reduce render times by dividing large renders into smaller tasks and distributing those tasks across a large number of compute instances called “render nodes” for parallel processing. The effectiveness of a render farm is directly tied to the number of nodes it has and their combined compute power. Building a render farm at the scale required by modern VFX and animation workloads can be very expensive and difficult to maintain on-premises. Therefore, many studios augment their compute resources by running renders in AWS.

AWS Thinkbox Deadline, one of the most popular tools for administering and orchestrating render farms, has built-in AWS integrations, like the Spot Event Plugin, which makes managing Amazon Elastic Compute Cloud (Amazon EC2) compute resources easy. Deadline can be further extended with custom scripts to work with services like Amazon File Cache, which a previous blog post showed can dramatically reduce render times by eliminating file transfer bottlenecks. In this blog post, we provide an overview of how to configure an existing Deadline-managed render farm to work seamlessly with an Amazon File Cache cache hydrated from an on-premises NFS file server. The cache makes sure that digital assets are available to render nodes in the cloud without disrupting existing workflows. This gives studios the flexibility to seamlessly burst into the cloud to meet their rendering needs.

Solution overview

The solution uses a Deadline Event Plugin and a Deadline Task Script to automate data hydration, data eviction, and data write-back. Data hydration is the syncing of data to the cache from a linked data repository, such as the on-premises NFS file server. Data eviction is the process of releasing stale data from the cache. Data write-back is the syncing of new or modified files from the cache back to on-premises storage. Using these operations correctly makes sure that renders include the latest revisions of the digital assets residing on-premises, and that the final rendered images are saved back on-premises.
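
Under the hood, all three operations are issued with the Lustre lfs tool on a machine that has the cache mounted, which is how the scripts later in this post perform them. The following is a minimal sketch, assuming a hypothetical mount point and file paths, of what each operation looks like when run by hand:

import subprocess

# Hypothetical paths; adjust to your cache mount point and data repository layout
asset = "/mnt/cache-mount-point/data-repo/digital-assets/tree.abc"
renderOutput = "/mnt/cache-mount-point/data-repo/renders/shot010.0001.exr"

# Data hydration: copy the file's contents from the linked data repository into the cache
subprocess.run(["sudo", "lfs", "hsm_restore", asset], check=True)

# Data eviction: release the cached copy; the data is fetched again on the next read
subprocess.run(["sudo", "lfs", "hsm_release", asset], check=True)

# Data write-back: export a new or modified file from the cache to on-premises storage
subprocess.run(["sudo", "lfs", "hsm_archive", renderOutput], check=True)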

Prerequisites

The following prerequisites are required before continuing with this post.

  1. Make sure that your NFS network share is accessible from your VPC using AWS Direct Connect or VPN.
  2. Set up your Deadline render farm to use the Spot Event Plugin to provision Amazon EC2 Spot render nodes. Refer to the Deadline Documentation for instructions.

1. Create an Amazon File Cache

For guidance, follow the instructions in this blog post on linking Amazon File Cache to on-premises file systems.

2. Connect your worker nodes to the cache

In this section, we edit the User Data of our Spot Instances to mount the cache on startup and configure Path Mapping in Deadline to automatically map file paths from the NFS file system to the cache.

  1. Navigate to the Caches section of the Amazon FSx Service in the AWS Management Console and select your cache.
  2. Select Attach.
  3. Note the prerequisites and copy the mount command.
  4. Edit your spot fleet’s user data script to include the mount command.
    • How you should edit the User Data depends on how the configuration is being managed. For example, if the configuration is being managed by an RFDK template, then the User Data should also be managed through RFDK.
  5. (Optional) Create a Deadline region to separate the Amazon EC2 spot nodes from the rest of the render farm.
    • Navigate to Tools > Configure Repository Options > Region Settings, and select the Add button.
    • Create a ruleset to apply the region to our Amazon EC2 spot nodes by navigating to Tools > Configure Repository Options > Auto Configuration, and selecting the Add button. Follow the documentation on rulesets to properly configure the ruleset.

The Configure Repository Options Window showing an example ruleset

To make sure that the render nodes look for the files on the cache instead of their original location on the cache’s data repository, we must create new path mapping rules in Deadline with the following steps:

  1. Open Deadline Monitor.
  2. Select Tools > Super User Mode to enable Super User Mode if it isn’t already on.
  3. Select Tools > Configure Repository Options.
  4. Navigate to the Path Mapping section.
  5. Create a new rule that maps the path to the NFS file server on the submitting client to the path of the cache on the worker nodes. Make sure to include the data repository association’s cache path.
  6. To support hybrid render farms, limit this path mapping to Amazon EC2 spot workers by specifying the correct region.

The Create Path Mapping Rule dialog showing an example mapping between Z:\digital-assets\ and /mnt/cache-mount-point/data-repo/digital-assets/
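
You can verify how a rule like the example above behaves by querying Deadline’s path mapping from any script running on a worker. The following is a minimal sketch, using a hypothetical asset file name, built around the same RepositoryUtils.CheckPathMapping call that the post task script later in this post relies on:

from Deadline.Scripting import RepositoryUtils

# Path as recorded at submission time on the artist's workstation (hypothetical file name),
# with separators normalized the same way the post task script does before mapping
submittedPath = "Z:\\digital-assets\\tree_v003.abc".replace("\\", "/")

# Applies the path mapping rules that match this worker, including its region
mappedPath = RepositoryUtils.CheckPathMapping(submittedPath)

# On an Amazon EC2 spot worker this should resolve under
# /mnt/cache-mount-point/data-repo/digital-assets/
print(mappedPath)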

3. Create a Deadline Event Plugin

Amazon File Cache automatically hydrates files missing from the cache and evicts old files as the cache fills up. If files in the data repository change between renders, then an event plugin can be used to evict and rehydrate stale files on demand.

To create a new Event Plugin to evict and rehydrate files before each render, follow these steps:

  1. Create a folder in the “events” folder of your Deadline Repository named “FileCacheHandler”.
  2. Create a parameters file inside the “FileCacheHandler” folder named “FileCacheHandler.param” that specifies the editable parameters of the event plugin like the following example:
[State]
Type=Enum
Items=Global Enabled;Opt-In;Disabled
Category=Options
CategoryOrder=0
CategoryIndex=0
Label=State
Default=Disabled
Description=<html><head/><body><p>How this event plug-in should respond to events. If Global, all jobs and workers will trigger the events for this plugin. If Disabled, no events are triggered for this plugin.</p></body></html>

[Paths]
Type=string
Label=Root Paths
Description=<html><head/><body><p>A semicolon separated list of paths. Whenever a job is submitted with an export path that is inside or matches one of these paths, the post task script will be added to it. If Cache Eviction is on, a new job will be created that evicts all the files in the matching path before the original render job starts.</p></body></html>
Category=Options
CategoryOrder=0
CategoryIndex=1
DisableIfBlank=false
Default=

[PostTaskScript]
Type=string
Label=Post Task Script
Description=<html><head/><body><p>The full path to the post task script. Must be accessible by the worker. Path mapping will be applied to this path.</p></body></html>
Category=Options
CategoryOrder=0
CategoryIndex=2
DisableIfBlank=false
Default=

[CacheEviction]
Type=boolean
Category=Options
CategoryOrder=0
CategoryIndex=3
Label=Cache Eviction
Default=True
Description=<html><head/><body><p>Whether or not we should evict from the cache all files in the root path that matches the output path of the job. A new job will be created to handle the eviction before the original render job starts.</p></body></html>

[CacheHydration]
Type=boolean
Category=Options
CategoryOrder=0
CategoryIndex=4
Label=Cache Hydration
Default=True
Description=<html><head/><body><p>Whether or not we should pre-hydrate the cache with all files in the root path that matches the output path of the job. A new job will be created to handle the hydration before the original render job starts.</p></body></html>
  3. Create a Python script named “FileCacheHandler.py” inside the “FileCacheHandler” folder, like the following example. The script identifies jobs that output to the cache, attaches a post task script to them, and creates file eviction and hydration jobs as needed.
# Imports

import tempfile
import os
from Deadline.Events import DeadlineEventListener
from Deadline.Scripting import ClientUtils, RepositoryUtils
from datetime import datetime

# Functions

def GetDeadlineEventListener():
    """Called automatically by Deadline to get an instance of the File Cache Handler Listener

    Returns:
        FileCacheHandlerListener: An instance of the File Cache Handler Listener
    """
    return FileCacheHandlerListener()


def CleanupDeadlineEventListener(eventListener):
    """Called automatically by Deadline to clean up the listener

    Args:
        eventListener (DeadlineEventListener): Should be the same instance created by
        GetDeadlineEventListener() earlier
    """
    eventListener.Cleanup()

# Classes

class FileCacheHandlerListener(DeadlineEventListener):
    """Create a new instance of the File Cache Handler Listener

    Args:
        DeadlineEventListener (DeadlineEventListener): The DeadlineEventListener base class that
        Deadline expects
    """

    def __init__(self):
        self.OnJobSubmittedCallback += self.OnJobSubmitted

    def Cleanup(self):
        del self.OnJobSubmittedCallback

    def OnJobSubmitted(self, job):
        self.LogInfo("File Cache Handler: OnJobSubmitted")
        paths = self.GetConfigEntryWithDefault("Paths", "")
        postTask = self.GetConfigEntryWithDefault("PostTaskScript", "")

        # Return early if we have empty paths or post scripts are already set
        if job.JobPostTaskScript or not paths or not postTask:
            return

        # Check for a match with one of our paths
        matchingPaths = []
        for p in paths.split(";"):
            path = os.path.abspath(p)
            if len(path) == 0:
                continue
            self.LogInfo("File Cache Handler: Checking Path " + path)
            for d in job.JobOutputDirectories:
                outputDir = os.path.abspath(d)
                try:
                    if os.path.commonpath([path, outputDir]) == os.path.commonpath(
                        [path]
                    ):
                        matchingPaths.append(path.replace("\\", "/"))
                except ValueError:
                    # os.path.commonpath() will raise error if input isn't
                    # part of the same root drive. We should quietly ignore
                    # in this case. We let other errors bubble up.
                    pass

        if not matchingPaths:
            return

        postTask = postTask.replace("\\", "/")

        if self.GetBooleanConfigEntryWithDefault(
            "CacheEviction", True
        ) or self.GetBooleanConfigEntryWithDefault("CacheHydration", True):
            # If the job is not already part of a batch,
            # we make it part of its own batch
            if job.JobBatchName == "":
                job.JobBatchName = "{name} Batch {date}".format(
                    name=job.JobName, date=datetime.now().isoformat()
                )

            self.LogInfo(
                "File Cache Handler: Cache Data Eviction or Cache Data "
                "Hydration Enabled. Creating Cache Handling Job"
            )
            oldJobInfoFilename = os.path.join(
                ClientUtils.GetDeadlineTempPath(), "old_job_info.job"
            )
            oldPluginInfoFilename = os.path.join(
                ClientUtils.GetDeadlineTempPath(), "old_plugin_info.job"
            )

            self.LogInfo("File Cache Handler: Creating Cache Job Files")
            RepositoryUtils.CreateJobSubmissionFiles(
                job, oldJobInfoFilename, oldPluginInfoFilename
            )
            jobInfoFilename = ""
            pluginInfoFilename = ""

            with tempfile.NamedTemporaryFile(
                mode="w", dir=ClientUtils.GetDeadlineTempPath(), delete=False
            ) as jobWriter:
                # Put plugin on first line
                jobInfoFilename = jobWriter.name
                jobWriter.write("Plugin=CommandLine\n")
                with open(oldJobInfoFilename) as oldJobInfo:
                    self.LogInfo("File Cache Handler: Reading old cache files")
                    for line in oldJobInfo:
                        key = line.split(sep="=", maxsplit=1)[0]
                        if key in [
                            "Plugin",
                            "Frames",
                            "LimitGroups",
                            "OverrideJobFailureDetection",
                            "FailureDetectionJobErrors",
                        ] or key.startswith("Output"):
                            continue
                        else:
                            jobWriter.write(line.rstrip("\n") + "\n")
                    jobWriter.write("Frames=0\n")
                    jobWriter.write("FailureDetectionJobErrors=1\n")
                    jobWriter.write("OverrideJobFailureDetection=True\n")

            with tempfile.NamedTemporaryFile(
                mode="w", dir=ClientUtils.GetDeadlineTempPath(), delete=False
            ) as pluginWriter:
                pluginInfoFilename = pluginWriter.name
                pluginWriter.write("Arguments=")

                for matchingPath in matchingPaths:
                    if self.GetBooleanConfigEntryWithDefault("CacheEviction", True):
                        pluginWriter.write(
                            'nohup find "{path}" -type f -print0 | xargs -0 -n 1 {cmd} ; '.format(
                                path=matchingPath, cmd="sudo lfs hsm_release"
                            )
                        )
                    if self.GetBooleanConfigEntryWithDefault("CacheHydration", True):
                        pluginWriter.write(
                            'nohup find "{path}" -type f -print0 | xargs -0 -n 1 {cmd} ; '.format(
                                path=matchingPath, cmd="sudo lfs hsm_restore"
                            )
                        )

                pluginWriter.write("\n")
                pluginWriter.write("Executable=bash\n")
                pluginWriter.write("Shell=bash\n")
                pluginWriter.write("ShellExecute=True\n")
                pluginWriter.write("SingleFramesOnly=False\n")
                pluginWriter.write("StartupDictionary=\n")

            evictionJob = RepositoryUtils.SubmitJob(
                [jobInfoFilename, pluginInfoFilename]
            )
            evictionJob.JobName = job.JobName + " Cache Handling"
            RepositoryUtils.SaveJob(evictionJob)

            job.SetJobDependencyIDs([evictionJob.JobId])
            job.JobResumeOnCompleteDependencies = True
            job.JobResumeOnDeletedDependencies = True
            job.JobResumeOnFailedDependencies = True
            RepositoryUtils.PendJob(job)
        else:
            self.LogInfo(
                "File Cache Handler: Cache Data Eviction and Hydration "
                "Disabled. Skipping to next step"
            )

        self.LogInfo(
            "File Cache Handler: Attaching Post Task Script {script} to {job}".format(
                script=postTask, job=job.JobName
            )
        )

        job.JobPostTaskScript = postTask
        RepositoryUtils.SaveJob(job)
  4. Create the post task script in a path accessible to the render workers, such as the Amazon File Cache. We named our example “WriteBackPostScript.py” and made it look like this:
# Imports

import subprocess
import os
from Deadline.Scripting import FrameUtils, RepositoryUtils

# Main

def __main__(deadlinePlugin, *args):
    task = deadlinePlugin.GetCurrentTask()
    job = deadlinePlugin.GetJob()

    outputDirectories = job.JobOutputDirectories
    outputFilenames = job.JobOutputFileNames
    for i in range(0, len(outputDirectories)):
        outputDirectory = outputDirectories[i]
        outputFilename = outputFilenames[i]
        for frameNum in task.TaskFrameList:
            outputPath = os.path.join(outputDirectory, outputFilename)
            outputPath = outputPath.replace("//", "/")
            outputPath = outputPath.replace("\\", "/")
            mappedOutputPath = RepositoryUtils.CheckPathMapping(outputPath)
            deadlinePlugin.LogInfo(
                "Mapping Path: {path} to {mappedPath}".format(
                    path=outputPath, mappedPath=mappedOutputPath
                )
            )
            deadlinePlugin.LogInfo("Frame: {frameNum}".format(frameNum=frameNum))
            mappedOutputPath = FrameUtils.ReplacePaddingWithFrameNumber(
                mappedOutputPath, frameNum
            )
            deadlinePlugin.LogInfo(
                "Writing back file: {path}".format(path=mappedOutputPath)
            )
            subprocess.run(["sudo", "lfs", "hsm_archive", mappedOutputPath])
  5. Select Tools > Synchronize Monitor Scripts and Plugins.
  6. Select Tools > Configure Events.
  7. Select the FileCacheHandler Event.
  8. Set the State to “Global Enabled”.
  9. Enter a semicolon separated list of directories in Root Paths. Cache eviction and hydration are applied to every file and sub-directory under a root path that matches the render job, so each root path should only encapsulate the files that could change between render submissions.
  10. Enable cache eviction and hydration by setting both Cache Eviction and Cache Hydration to True.

The Configure Event Plugins Window showing the FileCacheHandler settings.
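
With the event plugin enabled, you can spot-check that eviction and hydration are happening by inspecting a file’s HSM state on a worker after the cache handling job has run. The following is a minimal sketch, assuming a hypothetical asset path, that uses the same lfs hsm_state command the clean-up section below relies on:

import subprocess

# Hypothetical asset path on the cache mount
asset = "/mnt/cache-mount-point/data-repo/digital-assets/tree_v003.abc"

# hsm_state reports flags such as "exists archived" for a hydrated file,
# "released" for an evicted file, and "dirty" for a file not yet written back
result = subprocess.run(
    ["sudo", "lfs", "hsm_state", asset], capture_output=True, text=True, check=True
)
print(result.stdout)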

Clean up

In this section, we walk through the steps needed to delete the cache without losing any data.

  1. Connect to the terminal of a computer that has the Amazon File Cache cache mounted.
  2. Check the status of the files on the cache with the following command. This returns the number of files on the cache with changes that weren’t saved back to the data repository (drop the final | wc -l to list the individual hsm_state lines instead): nohup find <path/to/cache> -type f -print0 | xargs -0 -n 1 sudo lfs hsm_state | awk '!/\<archived\>/ || /\<dirty\>/' | wc -l
  3. If the command from Step 2 shows that some files are dirty or unarchived, then run the following command to archive them individually: sudo lfs hsm_archive <path/to/unarchived/file>
    • To archive an entire folder, run the following command: nohup find <path/to/folder> -type f -print0 | xargs -0 -n 1 sudo lfs hsm_archive &
  4. Repeat steps 2 and 3 until no more files have unsaved changes. This may take some time depending on how much data must be written back. A scripted version of this check is sketched after this list.
  5. Navigate to the Caches section of the Amazon FSx Service in the AWS Management Console.
  6. Select your cache.
  7. Select Actions > Delete cache.
  8. Type the name of your cache and select Delete.
  9. Remove the lines pertaining to the cache from the user data of your spot workers.
  10. Select Tools > Configure Events.
  11. Select the FileCacheHandler Event.
  12. Disable the Event Plugin by setting the State to “Disabled” and select OK.
  13. Select Tools > Configure Repository Options in the Deadline Monitor.
  14. Select Mapped Paths.
  15. Select any rules pertaining to your cache, and select Remove.

The Configure Repository Options window showing example path mapping rules
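
If a render produced a large amount of output, steps 2 through 4 above can be automated with a small script run on the same machine. The following is a minimal sketch, assuming a hypothetical cache mount point, that archives everything under the cache and then repeats the check from step 2 until no files are left waiting to be written back:

import subprocess
import time

CACHE_PATH = "/mnt/cache-mount-point"  # hypothetical mount point of the cache

# Same command as step 3: archive every file under the cache path
ARCHIVE_CMD = f"find {CACHE_PATH} -type f -print0 | xargs -0 -n 1 sudo lfs hsm_archive"

# Same check as step 2: count files that are dirty or not yet archived
CHECK_CMD = (
    f"find {CACHE_PATH} -type f -print0 | xargs -0 -n 1 sudo lfs hsm_state "
    "| awk '!/\\<archived\\>/ || /\\<dirty\\>/' | wc -l"
)

subprocess.run(ARCHIVE_CMD, shell=True, check=True)
while True:
    result = subprocess.run(CHECK_CMD, shell=True, capture_output=True, text=True)
    remaining = int(result.stdout.strip() or 0)
    if remaining == 0:
        break
    print(f"{remaining} files still waiting to be written back...")
    time.sleep(60)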

Conclusion

In this post, we explored a solution that uses Amazon File Cache and Deadline scripts to automatically sync digital assets between on-premises storage and a cloud cache. This frictionless integration between AWS and on-premises hardware lets studios preserve their current workflows while maintaining the ability to instantly tap into the power of the AWS Cloud when they need to accelerate rendering beyond the limits of their on-premises compute capacity. We hope that this solution and solutions like it empower our customers to continue to push the boundaries of what’s possible in the world of computer-generated graphics.

Michael Yuan

Michael Yuan is an Associate Solutions Architect with the Visual Compute Team. He managed servers for a post production studio before going back to school for his Masters degree and joining AWS. He lives in Los Angeles, but spends all his free time in virtual reality.