AWS HPC Blog

How we enabled uncompressed live video with CDI over EFA

In the field of high performance computing, a lot of the tools and techniques we develop flow into the rest of the industry, solving lots of problems and unlocking new possibilities for everyone. The notion of clustering computers together to solve ever greater problems than any single machine could handle has arguably had the most profound impact on the wider IT industry (and along the way led to a new industry – this thing called cloud).

In this blog post, we’re going to take you into the world of broadcast video, and explain how it led to us announcing today the general availability of EFA on smaller instance sizes. For a range of applications, this is going to save customers a lot of money because they no longer need to use the biggest instances in each instance family to get HPC-style network performance. But the story of how we got there involves our Elastic Fabric Adapter (EFA), some difficult problems presented to us by customers in the entertainment industry, and an invention called the Cloud Digital Interface (CDI). And it started not very far from Hollywood.

Latency is a problem

In 2018, Fox’s broadcast engineering team were making plans to migrate to the cloud. They’d made a lot of progress, but were getting stuck on a problem involving latency – specifically, the impact of all the small latencies throughout the video production pipeline. In the operations rooms, we overheard the same comment repeatedly regarding live workflows: “… this can’t move to the cloud – it’ll add at least 30 seconds of latency”. But the average ping time between Los Angeles and the US-West-2 AWS Region is 22.7 milliseconds, round trip. That means just over eleven milliseconds one way – less than the 16 milliseconds it takes for a single frame to flash up on the screen at 60 frames per second (fps) – the way many of us watch TV now. Where was this discrepancy coming from, and – most concerning to AWS – why was it an industry-wide perception?  Was it true for some reason?
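
As a quick sanity check (a simple calculation using the round-trip figure quoted above), the one-way network latency sits comfortably inside a single frame period at 60 fps:

# Rough sanity check: one-way network latency vs. one frame period at 60 fps.
# The 22.7 ms figure is the average round-trip time quoted above.
round_trip_ms = 22.7
one_way_ms = round_trip_ms / 2        # ~11.4 ms one way
frame_period_ms = 1000 / 60           # ~16.7 ms per frame at 60 fps
print(f"one-way: {one_way_ms:.2f} ms, frame period: {frame_period_ms:.2f} ms")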

After a lot of discussion with customers, it became clear to us that the cloud as a whole had taken on a generalized latency reputation in the broadcast industry, based on existing workflow patterns that use a process of segmented delivery. The idea is that video streams are broken up into a series of HTTP-based files or chunks which get shunted through the production chain, one after the other. In TV broadcast there are a lot of steps needed to get a video feed onto the screen – they’re all important, even if you’re unaware of them at home. The latency expectation came from the notion that in these multi-step workflows, encoding and decoding would have to happen between every hop, adding aggregate latency and generational loss. The most common encoding method in the cloud is H.264, with the output segmented using standards like HLS (HTTP Live Streaming) or DASH (Dynamic Adaptive Streaming over HTTP). Overlaid onto actual production broadcast workflows, this approach would definitely have added latency and a noticeable loss of visual fidelity at every hop in the chain.
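
To make that “30 seconds” intuition concrete, here is a purely illustrative sketch of how a segmented workflow accumulates latency. The segment duration, player buffer depth, hop count, and per-hop encode/decode times below are assumptions for illustration, not measurements from any real workflow:

# Hypothetical illustration of latency accumulation in a segmented (HLS/DASH) chain.
# Every number here is an assumption for illustration, not a measured value.
segment_duration_s = 6        # assumed segment length
player_buffer_segments = 3    # players commonly buffer a few segments (assumed)
hops = 3                      # assumed encode/decode hops in the chain
encode_decode_per_hop_s = 2   # assumed per-hop encode + decode delay

total_s = segment_duration_s * player_buffer_segments + hops * encode_decode_per_hop_s
print(f"end-to-end latency with these assumptions: ~{total_s} s")   # ~24 s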

Figure 1: Example of a live television playout and distribution workflow.  All instances, services, and applications run hot/hot in two Availability Zones to enable seamless failover. This workflow uses CDI between multiple hops in the production chain which leads to single digit frame latency and no encoding generational loss between hops.

Broadcasters in the process of creating live content need near-instantaneous interaction with their content and applications. This means the technologies they choose for live production and playout are quite different from those in the distribution path.

In their on-premises environments they keep the broadcast chains tight by connecting machines together using uncompressed video links built on standards like SDI (“Serial Digital Interface”) or SMPTE ST 2110. This means literally connecting the output from one device (maybe the raw camera feed) to the input of the next device in the chain. In some TV studios you can even see the workflow by just following the cables along the hallways. It allows totally uncompressed video to be transferred between machines in real-time. The latency between devices connected like this is so small that when an operator presses a button on a control surface, the visual response seems instant. For those not in the broadcast industry, the best analogy is the experience you have when playing a video game on a home console, or maybe flicking a light switch. It’s near instantaneous. If there were a measurable latency between your controller and the console, it would make the game almost unplayable.

It’s so important that this has become one of our tenets for live broadcasting using the cloud: when participating in the creation of real-time content, customers must be able to interact with the system in real-time and see a visual response in real-time. It’s hard.

Real-time Video

First, some terminology. The whole pattern of moving video through a production workflow, from system to system, using TCP/IP is called OTT. This refers to moving video “over the top” of non-dedicated networks (like the internet, when we’re talking about the distribution stages) which weren’t originally designed for video. And remember, there’s a big difference between production (flicking that light switch), and transmission (the final stretch to your TV set).

There are three elements that can lead to latency in these OTT workflows. First, there’s the time required to encode and package. Next, there’s the transmission latency. Finally, there’s the buffering of the segments and decoding on the devices. The bulk of the latency in legacy OTT workflows was introduced by these encoding, packaging, and decoding steps. The transmission latency, on the other hand, is roughly the speed of light plus some packet-processing time – the ~22 milliseconds round trip we mentioned earlier.

In order to resolve the latency and quality concerns introduced by compression, we needed to either remove compression entirely or use a different sub-frame, visually lossless compression mechanism. Our solution was ultimately to do both, but in this post we’ll focus on the uncompressed technique. We also had to understand that customers were initially thinking of cloud as only a place where OTT workflows existed, leading to that 30-second-latency reputation. By introducing tools for real-time transport instead of packaging workflows, all the concerns around latency and quality have been (mostly) eliminated.

As we mentioned before, there’s more than one method for transmitting uncompressed video encapsulated in IP. The leading standard on-premises is SMPTE ST 2110, which separates the video, audio, and metadata elements into “essences” and encapsulates them in RTP (Real-time Transport Protocol, an application-layer protocol over UDP) streams, generally using ST 2110-20 uncompressed video. The video bitrates can be daunting: the image size of a high definition 1080p frame is 1920×1080, which is 2,073,600 pixels per frame. The typical color format used in SDI and ST 2110-20 is 10-bit 4:2:2, which works out to an average of 20 bits per pixel (bpp), or 5,184,000 bytes per frame. At 60 fps, that’s 5,184,000 × 60 ≈ 311 MByte/s, or roughly 2.5 Gbit/s on the wire. 4K Ultra-high-definition (UHD) content is four times larger. These kinds of “elephant flows” are known to be lossy over shared networks, which isn’t an acceptable outcome for critical live video traffic. We needed another approach to ensure reliable delivery within the time budget.
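
Spelled out as code, the arithmetic above looks like this (a straightforward calculation from the frame geometry and pixel format described in the text):

# Uncompressed 1080p60 video in 10-bit 4:2:2, as described above.
width, height = 1920, 1080
bits_per_pixel = 20                                        # 10-bit 4:2:2 averages 20 bpp
fps = 60

pixels_per_frame = width * height                          # 2,073,600 pixels
bytes_per_frame = pixels_per_frame * bits_per_pixel // 8   # 5,184,000 bytes
bits_per_second = pixels_per_frame * bits_per_pixel * fps  # ~2.49e9 bits/s

print(f"{bytes_per_frame:,} bytes per frame, {bits_per_second / 1e9:.2f} Gbit/s")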

Remember though: the customer requirement is not to support a particular bitrate or choose an acceptable loss profile. It’s to consistently deliver a fixed amount of data (a frame) reliably, within a fixed amount of time. At 60 fps that’s 16.6 ms. So, the challenge came to be: can we transfer a frame of video between two EC2 instances in its entirety before the receiving instance needs it, and can we do that reliably for the entire duration the flow exists?
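
Framed as a deadline problem, the raw transfer time for one frame is a small fraction of the 16.6 ms budget at the network bandwidths discussed in this post; the hard part is hitting that deadline reliably, frame after frame, on a busy shared network. A rough check:

# How long does one uncompressed 1080p frame take to move at the link speeds
# mentioned in this post?
bytes_per_frame = 5_184_000
frame_budget_ms = 1000 / 60                               # ~16.67 ms at 60 fps

for gbit_per_s in (25, 50):
    transfer_ms = bytes_per_frame * 8 / (gbit_per_s * 1e9) * 1000
    print(f"{gbit_per_s} Gbit/s link: {transfer_ms:.2f} ms of a {frame_budget_ms:.1f} ms budget")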

One problem leads to another familiar story

When phrased this way – moving large blocks of data extremely quickly and reliably over busy, shared, real-world networks – the problem started to take on the dimensions of an HPC challenge.

Customers doing weather forecasting, simulating protein structures for drug design, or trying to scale machine learning applications also face this problem. Their “codes” (as HPC people mostly refer to them) use massive-parallelism techniques (lots of cores, and the more the better) to shrink the run time for solving hard math problems from decades to days. In traditional settings, on-premises supercomputers are designed around intricate network topologies (“fat tree”, “3D torus”, “hypercube”) to minimize latency at every single hop. They also have elaborate congestion management algorithms so network drivers have a plan B if congestion happens somewhere on the fabric and a packet gets dropped. The trouble, though, is that there are always packets getting dropped. Congestion on a shared network isn’t an exception, it’s just normal. That’s true on a supercomputer, and it’s no different in the cloud – we just have orders of magnitude more use cases, leading to all sorts of unpredictability if you’re following the life of individual packets.

Our approach to solving this problem was to worry less about individual packets and more about whole herds of them. We arrived at this idea by talking to HPC customers and asking them about their codes. The HPC industry had also grown to be quite fussy about latency, though at a scale a few orders of magnitude smaller. They fretted about how many microseconds it took for individual 32-byte packets to transit a fabric from one compute node to another. When we dug deeper, though, the programmers writing the code were trying to transfer kilobytes or megabytes of data – i.e. thousands of packets. Trying to optimize for a single packet in these circumstances was really misleading – it’s like thinking you can speed up the commute for all New Yorkers by giving them all Formula 1 cars.

Traditional network fabrics in supercomputing moved packets in-order over short hops on complex networks with multiple pathways. The solution that we developed for HPC customers took an unorthodox approach. Relaxing the in-order requirement allowed us to spray the packets over lots (and lots) of pathways all at once, like a swarm. If any single packet went missing, we accounted for it and resent it. But we didn’t stop any of the other streams of spraying packets while we did that. The net effect was much faster and more reliable delivery than the reliable datagram protocols themselves, so we decided to call it the Scalable Reliable Datagram (SRD).
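
To make the idea a little more concrete, here’s a toy sketch of the “spray the herd and account for stragglers” approach. This is purely conceptual – a simulation with assumed path counts and loss rates, not the actual SRD protocol or the EFA implementation: packets for a message fan out across many paths at once, only the specific sequence numbers that go missing are resent, and nothing else stalls while that happens.

import random

# Toy, conceptual sketch of multipath "spray" delivery with per-packet accounting.
# Path counts and loss rates are assumptions; this is not the real SRD protocol.
NUM_PACKETS = 10_000
NUM_PATHS = 64
congested_paths = set(random.sample(range(NUM_PATHS), 4))   # assume a few paths are congested

def spray(seq_numbers):
    """Fan packets out across all paths at once; return the sequence numbers that arrive."""
    arrived = set()
    for seq in seq_numbers:
        path = seq % NUM_PATHS
        loss = 0.05 if path in congested_paths else 0.0005  # assumed loss probabilities
        if random.random() > loss:
            arrived.add(seq)
    return arrived

outstanding = set(range(NUM_PACKETS))
rounds = 0
while outstanding:      # resend only what went missing; the rest of the swarm never stops
    outstanding -= spray(outstanding)
    rounds += 1

print(f"delivered all {NUM_PACKETS} packets in {rounds} send round(s)")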

When we added kernel bypass techniques to reduce latency even further, we had a programming interface (in HPC language, a “libfabric provider”) capable of moving some serious data in the tightest of practical timeframes. We surprised ourselves, because this newly born Elastic Fabric Adapter (EFA) became popular for codes with reputations for being the most latency-sensitive. And it scaled way beyond our expectations.

You can read more about that in our post from a few months ago, but the short version is that customers love it, and they’ve been deploying more and more workloads with EFA and using it for things we’d never thought about, which brings us back to uncompressed video.

Reinventing SDI for the cloud

In testing, we convinced ourselves that video frames could also be reliably moved in time between instances using EFA. When we wrapped this in a friendly ST 2110-like API, we’d invented something new: the AWS Cloud Digital Interface, or AWS CDI. This is the virtual version of all those SDI cables that used to snake their way through broadcast facilities, carving out a workflow in physical form. Only now the pathways are dynamic, which means production workflows can quickly adapt as their needs change. Customers can experiment with new techniques for processing video, or creating effects. Having their infrastructure as code opens up a lot of possibilities to evolve and improve the experience for their viewers.

When we designed EFA for HPC users though, our focus was on stitching together the largest-sized Amazon EC2 instances, and initially EFA was only deployed on the c5n.18xlarge. Our broadcast customers don’t always need that many cores, however. Part of their strategy for reducing their cost base was to right-size the infrastructure for each application.

Figure 2: CDI transmits the frame buffer using EFA. SRD is a multipath, self-healing transport. This creates a kernel bypass method that effectively enables a memory copy from one framebuffer to another.

While we always had a plan to deploy EFA on a greater variety of instance families, it was the broadcast customers who helped us understand the need to deliver it on smaller instance sizes, leading to today’s announcement. Now, instances one step down from the largest size in EFA-enabled families (with 50 Gbit/s or 25 Gbit/s of network bandwidth) can take advantage of EFA’s capabilities and use CDI.

This significantly lowers the price of running CDI-enabled applications on Amazon EC2. Prior to this announcement, customers wanting these uncompressed video capabilities on GPU-enabled instances would have used the g4dn.metal, which currently has an on-demand price of $7.824/hour (in US-East-1). They can now use the g4dn.8xlarge, which still has 50 Gbit/s networking but comes in at $2.176/hour – a 72% saving. For those not needing a GPU, the c5n.9xlarge is now EFA-enabled, which is a 50% price reduction compared to the c5n.18xlarge or c5n.metal. For a full list of instances that now support EFA, visit the EFA documentation on our website.
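
The 72% figure follows directly from the two on-demand prices quoted above:

# Savings from running CDI on a smaller EFA-enabled instance size,
# using the US-East-1 on-demand prices quoted above.
g4dn_metal_price = 7.824      # $/hour
g4dn_8xlarge_price = 2.176    # $/hour
saving = (1 - g4dn_8xlarge_price / g4dn_metal_price) * 100
print(f"g4dn.8xlarge vs g4dn.metal: {saving:.0f}% cheaper")   # ~72%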

Conclusion

We always learn a lot from our customers, and strive to deliver technology that solves actual customer problems. Often that means digging deeper than normal to understand where the problem comes from, and it takes effort not to get distracted by “the usual ways” some communities have solved these problems before. This often lets us find solutions others have missed. CDI, like EFA, is a great example of that. We hope the introduction of CDI on smaller instance sizes shows customers that truly live workflows – with latency, quality, and cost profiles similar to on-premises settings – are achievable.

AWS is committed to solving even the most demanding challenges in the broadcast industry. Although there are many video-over-IP tools available, CDI is the only tool available today that provides customers and vendors with real-time, live performance in the uncompressed domain, just like SDI and SMPTE ST 2110 have provided on premises.

CDI is an open-source project. You can find out more on our resources page, and the SDK is on GitHub.

Evan Statton

Evan has helped to invent the future of television for over a decade. In his previous roles in the industry, he paved the way for the broad adoption of IP video transport, which is now the industry norm. Since joining AWS he has worked behind the scenes to help a broad array of customers make their move to the cloud. Evan holds multiple patents in video processing and transmission and regularly contributes to SMPTE and Video Services Forum meetings, earning the trust of industry professionals worldwide.

Brendan Bouffler

Brendan Bouffler is the head of Developer Relations in HPC Engineering at AWS. He’s been responsible for designing and building hundreds of HPC systems in all kinds of environments, and joined AWS when it became clear to him that cloud would become the exceptional tool the global research & engineering community needed to bring about the discoveries that would change the world for us all. He holds a degree in Physics and an interest in testing several of its laws as they apply to bicycles. This has frequently resulted in hospitalization.