AWS for M&E Blog

Build a resilient cross-region live streaming architecture on AWS

Introduction

Customers with live streaming video use cases often express a desire for flexibility in how they deploy live streaming services. Not all live channels are alike, and the ability to balance resiliency against cost is crucial. AWS Media Services from Amazon Web Services (AWS) are designed with redundancy in mind. Stateful services for video transport and encoding, such as AWS Elemental MediaConnect and AWS Elemental MediaLive, can be deployed in multiple availability zones within an AWS Region. Stateless services for video transcoding, origination, and distribution, including AWS Elemental MediaConvert, AWS Elemental MediaPackage, and AWS Elemental MediaTailor, natively span availability zones without any additional user configuration.

Until now, there hasn’t been an easy option for customers to fail a live streaming service seamlessly from one region to another. This capability is particularly important for high-profile, high-value live streaming events. This blog post explains how this is now possible and what is involved in doing so.

An example of automatic seamless switching

Fig 1: Seamless switching between regions

Before diving into technical detail, let’s examine what switching between regions looks like from an end-user perspective. In this example, timecode and overlays show where video segments originate from.

  • The player is initially served segments from Ohio (us-east-2), which in this case is where the primary video pipeline is located.
  • At 01:54:38:00, segment delivery switches to the backup origin in Portland (us-west-2), following a failure in Ohio.
  • At 01:54:46:00, segments are once again available and therefore served from Ohio.

Architecture walkthrough

An architecture diagram depicting live video sources being sent to two separate AWS regions. In each region MediaLive and MediaPackage independently process video while maintaining time synchronization. Amazon CloudFront can pull content from both regions. User devices access content from a CloudFront endpoint.

Fig 2: Cross-region architecture example.

Let’s explore one potential architecture and highlight its key components. Figure 2 depicts the primary region (1) using MediaLive’s standard channel class. In this mode, MediaLive runs redundant pipelines in two different availability zones. In contrast, the secondary region (2) is based on MediaLive’s single pipeline channel class. Since MediaLive synchronization is performed at the channel level, you are free to choose the in-region redundancy model for your use case. Premium events might require standard channels in both regions. Regular event channels may be adequately covered with a standard channel in region 1 and a single pipeline channel in region 2, or even single pipeline channels in both regions.

The on-premises encoder, in this case AWS Elemental Live, encodes the live production feed. Redundant encoders each insert the same UTC timecode on corresponding video frames. Downstream, MediaLive uses this timecode to synchronize its outputs using Epoch Locking. This decoupled approach allows each processing path to function independently, even if its partner region were to fail. This is crucial for maintaining service continuity.

MediaLive creates outputs using the DASH-IF CMAF Live Ingest (Interface-1) protocol. This is key for cross-region failover. When using this format, MediaLive maintains a consistent output segmentation cadence. Aside from the protocol itself, MediaLive CMAF Ingest output behaviour differs from HLS outputs in three key areas:

  1. Ad markers do not trigger the generation of a new output segment.
  2. Outputs do not pause if the channel’s input is lost. MediaLive fills gaps and continues to send valid video segments (black frames and silence) to the origin.
  3. MediaLive signals input loss to downstream systems through MP4 segment level signalling.

MediaPackage has been enhanced to ingest DASH-IF CMAF Ingest streams. From this, it can produce HLS/TS, CMAF and DASH outputs that align in terms of segmentation, naming, and encryption. Because segment ingest is aligned to epoch time, multiple regions will start with the same DRM key at the same segment boundary when key rotation is enabled.
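To illustrate why epoch alignment lets independent regions agree on segment boundaries and key rotation periods, here is a minimal Python sketch. It is not the actual MediaLive or MediaPackage implementation. The function names, the 2-second segment duration, and the 300-second key rotation interval are assumptions chosen for illustration only.

```python
from datetime import datetime, timezone

SEGMENT_SECONDS = 2          # assumed segment duration
KEY_ROTATION_SECONDS = 300   # assumed key rotation interval


def segment_index(ts: datetime, segment_seconds: int = SEGMENT_SECONDS) -> int:
    """Number of whole segments elapsed since the Unix epoch."""
    return int(ts.timestamp()) // segment_seconds


def key_period(ts: datetime, rotation_seconds: int = KEY_ROTATION_SECONDS) -> int:
    """Index of the key rotation period that contains the timestamp."""
    return int(ts.timestamp()) // rotation_seconds


# Both regions derive their indexes from the UTC timecode carried on the frame,
# not from when each channel happened to start, so they agree without
# communicating with each other.
frame_time = datetime(2024, 1, 1, 1, 54, 38, tzinfo=timezone.utc)
primary_region = (segment_index(frame_time), key_period(frame_time))
secondary_region = (segment_index(frame_time), key_period(frame_time))
print(primary_region == secondary_region)  # True
```

Because the index depends only on the shared timecode, a channel restarted in one region rejoins the same numbering as its partner.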

If the contribution encoder, contribution network, or the MediaLive service is interrupted, MediaPackage considers the channel to be stale. In this state, it returns 404 error codes to requests made to the channel endpoints. Amazon CloudFront is configured to use its origin failover feature. If a player requests a non-cached object, CloudFront forwards the request to the primary origin. If it receives a 404 from that origin due to one of the aforementioned situations, or the origin itself is not available, CloudFront routes the request to the secondary origin. Since manifests and segments are aligned and identically named, CloudFront can return an object from the secondary origin to complete the request. The playback device continues playing the stream without any noticeable effect on the user experience. Traffic automatically returns to the primary origin once it is available again, or once MediaPackage stops returning 404s for object requests to the previously stale endpoints.
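The request flow described above can be sketched as a small Python simulation. This models the behavior only and is not how CloudFront is implemented. The origin functions, the segment path, and the exact set of failover status codes are hypothetical; in practice you choose the codes on the origin group.

```python
from typing import Callable, Tuple

# Assumed set of status codes that trigger failover (configured on the origin group).
FAILOVER_STATUS_CODES = {404, 500, 502, 503, 504}

Response = Tuple[int, bytes]
Origin = Callable[[str], Response]


def fetch_with_failover(path: str, primary: Origin, secondary: Origin) -> Response:
    """Try the primary origin; fall back to the secondary on failure."""
    try:
        status, body = primary(path)
    except ConnectionError:
        # The primary origin itself is unreachable.
        return secondary(path)
    if status in FAILOVER_STATUS_CODES:
        # For example, a stale MediaPackage endpoint returning 404.
        return secondary(path)
    return status, body


def stale_primary(path: str) -> Response:
    """Hypothetical primary origin whose endpoint is in the stale state."""
    return 404, b""


def healthy_secondary(path: str) -> Response:
    """Hypothetical secondary origin serving an identically named object."""
    return 200, b"secondary:" + path.encode()


status, body = fetch_with_failover("/live/segment_100.mp4", stale_primary, healthy_secondary)
print(status)  # 200
```

Failover only works because both origins expose the same object names, which is why the synchronized segmentation described earlier matters.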

Considerations and configuration

For a completely seamless playback experience, video frames should be time locked as early in the video chain as possible. If you are using Elemental Live as the source encoder, it can reference timecode embedded within the video signal or reference a local LTC source to synchronize outputs together. Output locking in this way ensures outputs across multiple encoders have the same presentation time stamp (PTS) and picture on each video frame. For further details, refer to the Output locking section of the AWS Elemental Live User Guide. If video signals are not aligned at the source, downstream synchronization and failover is still possible; however, users may see the picture jump forward or backwards in time when failing between regions.

When configuring your encoder / transcoder to push to MediaPackage for cross-regional failover, your configuration must adhere to the following restrictions:

  • All video framerates within the CMAF output group must be consistent. They can all be fractional framerates, or all integer framerates, but not a mix of the two. Combining framerates that are multiples of one another (like 25fps and 50fps, or 29.97fps and 59.94fps) is allowed.
  • The transition from fractional framerates to integer framerates (or the other way around) across two encoding sessions is forbidden. Framerates across encoding sessions can be multiples of one another: 25fps to 50fps or 50fps to 25fps are allowed transitions, but 25fps to 30fps or 30fps to 25fps, for example, are not.
  • Output segment sequence numbers must not repeat or go backwards across two encoding sessions.
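The restrictions above can be expressed as a few pre-flight checks. The following Python sketch is one interpretation of those rules, not an AWS-provided validator, and all function names are hypothetical. Framerates are modeled as exact fractions (29.97fps is 30000/1001) to avoid floating-point comparisons.

```python
from fractions import Fraction
from typing import Iterable


def is_fractional(rate: Fraction) -> bool:
    """True for rates such as 30000/1001 (29.97fps), False for integer rates."""
    return rate.denominator != 1


def group_is_consistent(rates: Iterable[Fraction]) -> bool:
    """All rates in one output group must be integer, or all fractional."""
    kinds = {is_fractional(rate) for rate in rates}
    return len(kinds) <= 1


def transition_allowed(old: Fraction, new: Fraction) -> bool:
    """Across encoding sessions, one rate must be a whole multiple of the other."""
    ratio = new / old if new >= old else old / new
    return ratio.denominator == 1


def sequence_is_increasing(numbers: Iterable[int]) -> bool:
    """Segment sequence numbers must never repeat or go backwards."""
    nums = list(numbers)
    return all(a < b for a, b in zip(nums, nums[1:]))


ntsc_30 = Fraction(30000, 1001)   # 29.97fps
ntsc_60 = Fraction(60000, 1001)   # 59.94fps

print(group_is_consistent([Fraction(25), Fraction(50)]))   # True
print(group_is_consistent([Fraction(25), ntsc_30]))        # False
print(transition_allowed(ntsc_30, ntsc_60))                # True
print(transition_allowed(Fraction(25), Fraction(30)))      # False
print(sequence_is_increasing([101, 102, 102]))             # False
```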

Configuring MediaLive

Output locking mode

Under the channel’s General Settings, ensure that Enable Global configuration is selected. This allows you to set the channel’s Output Locking Mode to EPOCH_LOCKING.

A screenshot showing the MediaLive Global Configuration console page. There are several settings, the relevant one to this post is the output locking mode, which is set to Epoch Locking.

Fig 3: MediaLive output locking mode.

Timecode configuration

Ensure the MediaLive channel is configured to use the timecode in the input source by selecting EMBEDDED. This is enabled by default, but if set incorrectly, will cause misalignment of video frames across regional channels. This assumes timecode is available in the input to MediaLive, which should be the case if you are using Elemental Live as the contribution encoder. If you are unable to embed timecode in the source, choose SYSTEMCLOCK. In this mode, MediaLive will signal its output timecode based on its NTP clock.

A screenshot showing the MediaLive Timecode Configuration console page. Timecode source is set to EMBEDDED.

Fig 4: MediaLive timecode configuration settings.
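If you manage channels through the API rather than the console, the two settings above map to fields in the channel's encoder settings. The fragment below is a sketch, built as a plain Python dictionary so it runs without AWS credentials. The field names reflect our understanding of the MediaLive API and should be verified against the current API reference. A real channel definition requires many more fields than are shown here.

```python
import json

# Partial EncoderSettings for a MediaLive channel: only the output locking
# and timecode settings discussed above. This is not a complete channel.
encoder_settings_fragment = {
    "GlobalConfiguration": {
        "OutputLockingMode": "EPOCH_LOCKING",
    },
    "TimecodeConfig": {
        # Use "SYSTEMCLOCK" instead if the source carries no embedded timecode.
        "Source": "EMBEDDED",
    },
}

print(json.dumps(encoder_settings_fragment, indent=2))
```

Applying the same fragment to the channels in both regions helps keep their locking and timecode behavior identical.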

Create an output group

You must use the CMAF Ingest output group to push streams to MediaPackage. Ensure you use a MediaPackage V2 ingest point configured with the CMAF input type. Refer to the instructions for creating a CMAF Ingest output group in the MediaLive User Guide for more details.

If you have prior experience managing live streams on MediaLive and MediaPackage, you may be familiar with MediaLive’s Input Loss Action setting. This setting controls the behavior of HLS outputs when the pipeline input disappears. In a redundant (standard channel) configuration, this should be set to PAUSE_OUTPUT so that MediaPackage fails over to its alternate input.

This approach changes with CMAF Ingest. We do not have an Input Loss setting for this output type. With CMAF Ingest, MediaLive continues to output valid video (black frames and silence) on input failure, and instead reports input loss via segment level signaling. MediaPackage optionally uses this signaling as a failover trigger, providing greater flexibility than was previously possible and ensuring that there is always a stream for players to receive, even if it is a degraded one.

Configuring MediaPackage

Your MediaPackage channel must be configured with the CMAF Ingest Type. This is defined at channel creation time and cannot be changed subsequently. The CMAF Ingest channel type is only available for MediaPackage V2 channels.

A screenshot of the MediaPackage channel creation page, highlighting the new CMAF input type, required for cross-regional failover.

Fig 5: Selecting CMAF Ingest from the MediaPackage channel creation page.

Endpoint error behaviour

MediaPackage provides several options to control which error scenarios trigger the stale state.

A screenshot of the four error scenarios available to configure for each MediaPackage endpoint. Stale manifest, Incomplete manifest, Missing DRM key, and Slate input are all selected.

Fig 6: Configuring MediaPackage endpoint failover

  • Stale manifests. MediaPackage stops receiving ingest streams on its input, indicating the encoder or network path has failed.
  • Incomplete manifests. MediaPackage is receiving ingest segments, but there are gaps in the timeline. This results in the ingested stream presenting an incomplete timeline for some renditions, potentially generating playback problems.
  • Missing DRM Keys. MediaPackage could not retrieve an encryption key on a key rotation operation. This results in use of the old encryption key after the key rotation time. While this will not affect playback, it could be problematic from a business or content rights entitlement perspective.
  • Slate Input. MediaPackage detects and flags that the ingest stream(s) include a significant proportion of slate content (black video frames, audio silence). While playback will continue, a degraded user experience results and may be a valid reason to fail over to a different video path and origin.

When MediaPackage is configured to react to these scenarios, it returns 404 responses to manifest requests until all problematic segments are evicted from the exposed manifest window. For media segment requests, MediaPackage stops returning 404 responses as soon as the endpoint failover conditions clear.

We generally recommend enabling the first, second, and fourth options (stale manifests, incomplete manifests, and slate input) on the primary MediaPackage origin and none on the backup origin. This ensures the backup origin endpoint remains available to CloudFront, and therefore to the player. Whether to activate the third option (missing DRM keys) depends on your business requirements, not solely on playback session continuity.
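The recommendation above can be captured in a small helper that produces the per-endpoint error configuration. This is a sketch: the helper name and the role labels are hypothetical. The `ForceEndpointErrorConfiguration` structure and condition names reflect our understanding of the MediaPackage V2 API and should be checked against the current API reference.

```python
from typing import Dict, List

# Conditions that affect playback continuity (assumed API condition names).
PLAYBACK_CONDITIONS = ["STALE_MANIFEST", "INCOMPLETE_MANIFEST", "SLATE_INPUT"]
DRM_CONDITION = "MISSING_DRM_KEY"


def force_endpoint_error_configuration(
    role: str, fail_on_missing_drm_key: bool = False
) -> Dict[str, List[str]]:
    """Build the endpoint error conditions for a 'primary' or 'backup' origin."""
    if role == "backup":
        # The backup endpoint should always answer, so no condition is enabled.
        conditions: List[str] = []
    elif role == "primary":
        conditions = list(PLAYBACK_CONDITIONS)
        if fail_on_missing_drm_key:
            # A business decision, not a playback continuity requirement.
            conditions.append(DRM_CONDITION)
    else:
        raise ValueError(f"unknown role: {role!r}")
    return {"EndpointErrorConditions": conditions}


print(force_endpoint_error_configuration("primary"))
print(force_endpoint_error_configuration("backup"))
```

The resulting dictionary is intended to be passed as the force-endpoint-error configuration when creating or updating each regional origin endpoint.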

CloudFront origin groups

Your CDN should be configured with two origins, one for each regional MediaPackage. CloudFront supports primary and secondary origins via its origin group feature. We also recommend enabling CloudFront Origin Shield.

Once you have configured your origins, create an origin group and add the two origins from the list, ensuring the primary one is first. You then define the error codes that, if received from the primary origin, will cause the request to be routed to the secondary origin. Refer to the Amazon CloudFront Developer Guide for more details.

A screenshot of the CloudFront origin group configuration. It has two MediaPackage origins configured, the primary is the first in the list. There is a free text box to define the group name and a list of 4xx and 5xx HTTP error codes, all of which have been enabled to trigger failover.

Fig 7: Setting up a CloudFront origin group across two MediaPackage origins.
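For those who define distributions programmatically, the origin group shown in Figure 7 corresponds to the `OriginGroups` section of a CloudFront distribution configuration. The following Python sketch builds that section as a dictionary. The helper name, origin IDs, and chosen status codes are hypothetical, and the fragment is only one part of a full distribution configuration.

```python
from typing import Dict, Iterable


def origin_group(
    group_id: str,
    primary_origin_id: str,
    secondary_origin_id: str,
    status_codes: Iterable[int],
) -> Dict:
    """Build one origin group; the primary origin must be listed first."""
    codes = sorted(set(status_codes))
    return {
        "Id": group_id,
        "FailoverCriteria": {
            "StatusCodes": {"Quantity": len(codes), "Items": codes},
        },
        "Members": {
            "Quantity": 2,
            "Items": [
                {"OriginId": primary_origin_id},
                {"OriginId": secondary_origin_id},
            ],
        },
    }


# Hypothetical origin IDs for the two regional MediaPackage origins.
origin_groups = {
    "Quantity": 1,
    "Items": [
        origin_group(
            "mediapackage-failover",
            "mediapackage-primary",
            "mediapackage-secondary",
            [404, 500, 502, 503, 504],
        )
    ],
}

print(origin_groups["Items"][0]["Members"]["Items"][0]["OriginId"])  # mediapackage-primary
```

The cache behavior for your live stream paths would then reference the origin group ID as its target origin instead of an individual origin.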

Comparing options

Customers have previously built architectures to enable cross-region failover. These options typically limited the features available or required custom development to build client-side logic to handle failover. The following table compares these options to the new approach using MediaLive CMAF Ingest in conjunction with MediaPackage V2.

Table 1: Comparing cross-regional failover options.

Conclusion

For many use cases, the redundancy options available within an AWS Region are sufficient to meet resiliency requirements. Premium services and events often demand even higher levels of reliability by moving or duplicating components. In this post, we explained how to synchronize live streaming channels across regions. This allows platforms to withstand a broader range of failure scenarios than was feasible in the past. Furthermore, switching from one region to another can occur automatically and without any noticeable disruption to the user experience. For further information, please refer to the MediaPackage User Guide.

Jamie Mullan

Jamie is a Solutions Architect for AWS Edge Services covering AWS Elemental Media Services, Amazon CloudFront & AWS WAF for customers in the UK&I. He has a Software Developer and DevOps Engineering background and now helps customers architect and deploy video solutions on AWS.

Andrew Fayle

Andrew is a Senior Solutions Architect specializing in AWS Media Services in Australia and New Zealand. He spent more than ten years building and operating large-scale OTT solutions in the telecommunications industry before joining AWS. Andrew now focuses on enabling customers to build innovative solutions making the most of AWS Elemental Media Services.

Christer Whitehorn

Christer is the Lead Solutions Architect specializing in AWS Media Services for the Asia Pacific region including Japan and Greater China. He has spent more than twenty years working with multiscreen video, broadcast playout and compression headend solutions.