AWS Partner Network (APN) Blog

Achieving Resilient Cloud Broadcasting with Amagi CLOUDPORT on AWS

By Arpit Malani, Engineering Manager – Amagi
By Bharath S, Sr. Partner Solutions Architect – AWS
By Raghu Mukund, Strategic Account Manager – AWS
By Girish B, Sr. Solutions Architect – AWS


Broadcasters must plan for events outside of their control, such as natural disasters and broad infrastructure outages, so mitigating the risk of degraded viewing experiences and revenue losses at high-profile events is critical.

Amagi has a customer-focused approach driven by business requirements to achieve different levels of playout service resiliency on Amazon Web Services (AWS).

Amagi is a market leader in cloud broadcast and streaming TV technology, with over 2,000 channels deployed in over 40 countries. The core of Amagi’s offerings is CLOUDPORT, a cloud-based software-as-a-service (SaaS) playout platform that delivers broadcast-grade quality while offering the flexibility of cloud deployment.

Amagi is an AWS Specialization Partner and AWS Marketplace Seller that offers cloud broadcast and targeted ad solutions to broadcast TV and streaming TV platforms.

Efficient Redundancy Deployments: Striking the Perfect Balance

Amagi meets diverse customer requirements for redundancy and recovery time by using AWS resiliency options that span multiple AWS Availability Zones (AZs) in one or more regions. The deployment options range from playout and automation in a single AZ, which minimizes cost and complexity, to a multi-region deployment that provides geo-level resilience for critical events.

For customers with the highest redundancy and resiliency requirements, Amagi provides a "beyond multi-region" option that adds on-premises automation and playout to the multi-AZ/multi-region deployments.


Figure 1 – Amagi’s tiered resiliency offerings.

CLOUDPORT’s Disaster Recovery Configuration Modes

In addition to playout and automation deployments spanning single or multiple AZs/regions, the configurations can be hot, warm, or cold depending on desired recovery times.

  • Hot configuration offers the shortest recovery time and instant switching between primary and secondary regions. Both regions remain active, minimizing recovery time and ensuring high availability. For over-the-top (OTT) delivery, the two active regions both serve as origins for the content delivery network (CDN).
  • Warm configuration keeps automation and internal data in sync. Playout processors are only activated in a failover, allowing rapid recovery in minutes. Costs are lower than the hot configuration, while retaining the same recovery point.
  • Cold configuration operates with minimal “rescue content” capability. In a disaster, full automation and playout restart from scratch can take up to an hour. This option provides degraded service continuity with the lowest operating cost.

Customers select a configuration based on their business requirements. If the focus is signal recovery at a highly optimized cost, Amagi offers a pre-built cached playout chain to provide quick service recovery in the event of a catastrophic event.
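The trade-off between recovery time and cost described above can be sketched as a small selection routine. This is an illustration only: the channel names, the relative cost ranks, and the recovery figures for the hot and warm modes are assumptions made for the example (the post gives "instant", "minutes", and "up to an hour"), not Amagi's published numbers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DrMode:
    name: str
    worst_case_recovery_s: int  # assumed, illustrative recovery time in seconds
    relative_cost: int          # assumed ordering only: higher means more expensive


# Cold uses the "up to an hour" figure from the text; hot and warm are assumptions.
MODES = (
    DrMode("cold", worst_case_recovery_s=3600, relative_cost=1),
    DrMode("warm", worst_case_recovery_s=600, relative_cost=2),
    DrMode("hot", worst_case_recovery_s=5, relative_cost=3),
)


def cheapest_mode_for(target_recovery_s: int) -> Optional[DrMode]:
    """Return the lowest-cost mode whose recovery time fits the target, if any."""
    candidates = [m for m in MODES if m.worst_case_recovery_s <= target_recovery_s]
    return min(candidates, key=lambda m: m.relative_cost) if candidates else None


# Hypothetical channels mapped to the recovery time each can tolerate (seconds).
channels = {"flagship-sports": 10, "regional-news": 900, "niche-archive": 7200}

for channel, target in channels.items():
    mode = cheapest_mode_for(target)
    print(channel, "->", mode.name if mode else "no mode meets the target")
```

With these assumed figures, the critical channel lands on hot while the niche channel falls back to cold, which mirrors the per-channel choice a customer would make from business requirements.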

Amagi’s distributed architecture recovers the control plane with a recovery time objective (RTO) of under one minute in the multi-AZ option, and an RTO of 10 minutes for regional failover in the multi-region option. This quick control plane recovery keeps playout running during outages.

Solution Architecture

Amagi CLOUDPORT is a versatile playout platform that supports most industry-standard audio, video, and subtitle formats. It provides an intuitive interface with asset management, analytics, and configurable rules.

Key features include automation for scheduling, quality control, and monitoring; multi-layered graphics support; and a wide range of input formats like MXF, MOV, MPG, TS, and output formats including MPEG-TS, HLS, SRT, Zixi, NDI, and RIST. The platform handles video codecs AVC and HEVC; audio codecs AAC, AC3, and EAC3; as well as captions in CC-608, CC-708, DVB subs, and DVB teletext.


Figure 2 – Amagi playout multi-region architecture.

In the diagram above, the secondary region does not generate the service around the clock. It remains in a warm (standby) state, ready to take over transmission if issues arise in the primary region. When a switch from the primary to the secondary region is necessary, Amazon CloudFront performs it automatically, which keeps the transition fast and the service available.
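The post does not say how CloudFront is configured for this switch. One mechanism CloudFront offers is an origin group, where requests go to a secondary origin when the primary returns selected error codes. The sketch below builds the `OriginGroups` portion of a distribution configuration as a plain dictionary. The origin IDs and the choice of status codes are assumptions for illustration.

```python
from typing import Any, Dict, Sequence


def build_origin_group(
    group_id: str,
    primary_origin_id: str,
    secondary_origin_id: str,
    failover_status_codes: Sequence[int] = (500, 502, 503, 504),
) -> Dict[str, Any]:
    """Build one CloudFront origin group entry with a primary and a secondary origin."""
    codes = list(failover_status_codes)
    if not codes:
        raise ValueError("at least one failover status code is required")
    return {
        "Id": group_id,
        # CloudFront retries against the secondary origin on these response codes.
        "FailoverCriteria": {"StatusCodes": {"Quantity": len(codes), "Items": codes}},
        # Order matters: the first member is the primary origin.
        "Members": {
            "Quantity": 2,
            "Items": [{"OriginId": primary_origin_id}, {"OriginId": secondary_origin_id}],
        },
    }


# Hypothetical origin IDs, one per region.
origin_groups = {
    "Quantity": 1,
    "Items": [build_origin_group("playout-origins", "origin-us-east-1", "origin-us-west-1")],
}
print(origin_groups)
```

The resulting dictionary has the shape expected under `OriginGroups` in a CloudFront distribution configuration. Applying it to a real distribution would also require the origins themselves to be defined with matching IDs.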

Design Aspects

Amagi’s CLOUDPORT SaaS platform is built on AWS services with built-in support for multi-AZ, allowing for quick and seamless switching between Availability Zones.

Video transport relies on AWS Elemental Media Services like AWS Elemental MediaLive, AWS Elemental MediaConnect, AWS Elemental MediaPackage, and AWS Elemental MediaStore. These services can withstand failures at the AZ level, and the business functionalities are based on a microservices architecture running on Amazon Elastic Kubernetes Service (Amazon EKS).

Further, the system’s components are configured as microservices to enable dynamic scaling to meet specific customer requirements. CLOUDPORT is deployed and managed on a self-managed EKS cluster, offering maximum configuration flexibility.

Various aspects (such as database, caching, media storage, and archival) rely on managed services like Amazon Aurora, Amazon ElastiCache, and Amazon Simple Storage Service (Amazon S3). Multi-region networking enables communication between EKS pods and services across regions through virtual private cloud (VPC) peering.

Amazon EKS’s use of a VPC container network interface (CNI) enables IP communication between pods in different regions. Amazon Route 53’s public and private hosted zones are key for domain name service (DNS)-based services and endpoints across regions. Route 53 readiness checks allow swift failover during region impairments, while failback strategies ensure service restoration post-disaster.
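As a simplified illustration of DNS-based failover, the sketch below builds a Route 53 change batch with a PRIMARY and a SECONDARY failover record. This is plain Route 53 failover routing, not the readiness checks mentioned above, and the post does not confirm that CLOUDPORT uses it. The domain names, set identifiers, and health check ID are hypothetical.

```python
from typing import Any, Dict, Optional


def failover_record(
    name: str,
    target: str,
    role: str,
    set_id: str,
    health_check_id: Optional[str] = None,
    ttl: int = 60,
) -> Dict[str, Any]:
    """Build one UPSERT change for a CNAME record that uses failover routing."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record: Dict[str, Any] = {
        "Name": name,
        "Type": "CNAME",
        "SetIdentifier": set_id,
        "Failover": role,
        "TTL": ttl,  # a short TTL lets resolvers pick up the switch quickly
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id:
        # Route 53 answers with the secondary record when this check is unhealthy.
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


# Hypothetical endpoints in the two regions.
change_batch = {
    "Comment": "Regional failover for the control plane endpoint (illustrative)",
    "Changes": [
        failover_record("api.example.com.", "api.us-east-1.example.com.", "PRIMARY",
                        "primary-us-east-1", health_check_id="hypothetical-check-id"),
        failover_record("api.example.com.", "api.us-west-1.example.com.", "SECONDARY",
                        "secondary-us-west-1"),
    ],
}
print(change_batch)
```

A dictionary of this shape is what the Route 53 `ChangeResourceRecordSets` operation accepts as its change batch, alongside the hosted zone ID.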

Seamless Cloud-Based Media Asset Management

CLOUDPORT playout receives media feeds from customers’ media asset management and source systems to offer playout services. Live content arrives through AWS Elemental MediaConnect, which provides secure, low-latency transmission over AWS Direct Connect.

Content integration systems upload content and schedules from on-premises storage to Amazon S3, which stores petabytes of video assets crucial for channel playout. S3 Multi-Region Access Points give access to input files from multiple regions for redundancy.

Playout services automatically retrieve and process data from S3 for transcoding. Inputs include various video files that undergo verification and completeness checks, with issues notified via Amazon Simple Notification Service (SNS). Processed media assets are stored in a “transmission-ready” S3 bucket.
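A minimal sketch of such a verification step follows. The specific checks (file extension and byte count) and the message layout are assumptions for illustration, since the post does not describe the actual rules. The `publish` argument is any callable that accepts `Subject` and `Message` keyword arguments.

```python
import json
from typing import Callable, List

# Containers taken from the input formats listed earlier in the post.
ACCEPTED_EXTENSIONS = (".mxf", ".mov", ".mpg", ".ts")


def check_asset(key: str, expected_bytes: int, actual_bytes: int) -> List[str]:
    """Return a list of problems found with an uploaded asset (empty if none)."""
    problems = []
    if not key.lower().endswith(ACCEPTED_EXTENSIONS):
        problems.append(f"unsupported container: {key}")
    if actual_bytes != expected_bytes:
        problems.append(
            f"incomplete upload: expected {expected_bytes} bytes, found {actual_bytes}"
        )
    return problems


def notify_if_invalid(key: str, problems: List[str], publish: Callable[..., object]) -> bool:
    """Send one notification when problems exist; return True if a message was sent."""
    if not problems:
        return False
    publish(
        Subject="Asset verification failed",
        Message=json.dumps({"key": key, "problems": problems}),
    )
    return True


# Stand-in publisher that records messages instead of calling a real service.
sent = []
issues = check_asset("promos/spot-001.mxf", expected_bytes=1_000, actual_bytes=640)
notify_if_invalid("promos/spot-001.mxf", issues, lambda **kw: sent.append(kw))
print(sent)
```

In a real deployment, the publisher could be the Amazon SNS `publish` call bound to a topic ARN, for example with `functools.partial`. The stand-in above keeps the sketch runnable without AWS credentials.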

High Availability and Data Redundancy

Amazon Aurora Global Database is utilized to ensure high availability and data redundancy across regions. It replicates data across multiple regions and enables global access and failover capabilities. If there’s an outage in the primary region, it can automatically promote a cross-region replica to become an independent cluster that serves as the new primary database.

When a failover occurs, a DNS switchover updates the endpoint so applications can seamlessly connect to the newly promoted primary cluster in the alternate region. This ensures uninterrupted data access and application functionality during an outage.

Once the primary region is stable again, a planned failover returns the architecture to its original state. Aurora Global Database also supports cross-region readers, so applications can read from replicas in different regions, improving read scalability and reducing load on the primary database.
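The two directions described here, an unplanned promotion during an outage and a planned return afterwards, map onto two operations in the Amazon RDS API. The sketch below only chooses the operation and its arguments, then calls it on a client object passed in by the caller. The cluster identifiers are hypothetical, and the post does not state how Amagi invokes these operations.

```python
from typing import Any, Dict, Tuple


def global_cluster_request(
    global_cluster_id: str, target_cluster_arn: str, planned: bool
) -> Tuple[str, Dict[str, Any]]:
    """Pick the RDS operation name and arguments for a planned or unplanned switch."""
    kwargs: Dict[str, Any] = {
        "GlobalClusterIdentifier": global_cluster_id,
        "TargetDbClusterIdentifier": target_cluster_arn,
    }
    if planned:
        # Planned switch (for example, failback): waits for replication to catch up.
        return "switchover_global_cluster", kwargs
    # Unplanned failover: accepts possible loss of unreplicated writes.
    return "failover_global_cluster", {**kwargs, "AllowDataLoss": True}


def switch_primary(rds_client: Any, global_cluster_id: str,
                   target_cluster_arn: str, planned: bool) -> Any:
    """Invoke the chosen operation on an RDS client (for example, boto3.client('rds'))."""
    operation, kwargs = global_cluster_request(global_cluster_id, target_cluster_arn, planned)
    return getattr(rds_client, operation)(**kwargs)


# Hypothetical identifiers, printed rather than sent to AWS.
print(global_cluster_request(
    "playout-global", "arn:aws:rds:us-west-1:111122223333:cluster:playout-secondary",
    planned=False,
))
```

Separating the decision from the call keeps the logic testable with a stand-in client, which is also useful for the DR drills discussed later in the post.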

Amazon ElastiCache Global Datastore provides a distributed, in-memory caching solution with automatic failover between regions for disaster recovery (DR). In the event of a disaster affecting the primary region, the secondary cluster becomes the primary, ensuring a seamless transition of workload and data access to minimize application disruption.

Once the primary region recovers and syncs, it returns to a secondary role as a read replica or cache node. This dynamic capability allows the architecture to adapt to changing circumstances and ensures highly available data access across regions, even in the face of unexpected disasters.

Multi-Region Redundancy with CLOUDPORT Switcher and Amagi LIVE

The CLOUDPORT Switcher microservice allows for seamless, frame-accurate video switching to AWS Elemental MediaLive, which writes content to MediaStore or S3. These services originate content for Amazon CloudFront, providing primary and failover CDN paths for continuous video delivery across regions.

Amagi LIVE uses redundant contribution streams, carried by MediaConnect over separate AWS Direct Connect connections, to ensure network fault tolerance. The MediaConnect streams are sent to multiple receive locations over these separate connections, which adds further redundancy. This architecture withstands regional failures, ensuring reliable video delivery.

Dual-Region Redundancy

After all components are integrated, pods connect to storage services in the primary us-east-1 region. For redundancy, these storage services are replicated to the secondary us-west-1 region.

In a regional failure event, the secondary storage services are promoted to take over, so the control and video paths continue without interruption despite the outage. Once the primary region recovers, failback reverts these changes and returns the system to its initial state. This swift recovery design maintains full operational capability, safeguarding service continuity and minimizing downtime.

Failover Strategies: Flexibility and Customization

In the event of a failover, the actions CLOUDPORT takes depend on the chosen failover solution. In a hot configuration with the no-traffic option, CLOUDPORT establishes the connection to the headend. In warm or cold modes, missing instances start automatically, and the service resumes once configuration completes.

Users can also initiate a manual failover using the WebUI module. This triggers the CLOUDPORT re-configuration process and activates the instances. Alternatively, the CLOUDPORT API can trigger failover after an integration with the customer’s end-to-end monitoring and control systems. This offers flexibility in implementing failover strategies that best suit the specific requirements and preferences of the customer.
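When a monitoring system drives failover, a common safeguard is to require several consecutive failed probes before triggering, so that a single dropped check does not cause a switch. The sketch below shows that idea only. The `trigger` callable stands in for whatever the integration would call, since the CLOUDPORT API itself is not described in this post, and the threshold of three is an assumption.

```python
from typing import Callable


class FailoverGuard:
    """Fire a failover trigger once after N consecutive unhealthy probes."""

    def __init__(self, threshold: int, trigger: Callable[[], None]) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.trigger = trigger
        self.consecutive_failures = 0
        self.triggered = False

    def observe(self, healthy: bool) -> bool:
        """Record one probe result; return True only when this probe fires the trigger."""
        if self.triggered:
            return False  # already failed over; failback is a separate, deliberate step
        self.consecutive_failures = 0 if healthy else self.consecutive_failures + 1
        if self.consecutive_failures >= self.threshold:
            self.triggered = True
            self.trigger()
            return True
        return False


# Hypothetical probe results from an end-to-end monitor (True means healthy).
events = []
guard = FailoverGuard(threshold=3, trigger=lambda: events.append("failover requested"))
for probe in [True, False, True, False, False, False, False]:
    guard.observe(probe)
print(events)
```

The isolated failure early in the sequence is ignored because a healthy probe resets the counter, and the trigger fires exactly once even though the outage continues.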

Tuning and Testing for Disaster Recovery

Before commissioning the playout solution, CLOUDPORT undergoes exhaustive tuning and testing. Tuning involves verifying settings in a realistic scenario, adapting input formats and downlinks to the available network capacity and quality of service, and optimizing Kubernetes clusters and Amazon Elastic Compute Cloud (Amazon EC2) instance types for the channel types and numbers.

Testing involves selecting and validating the failover options against the customer’s requirements. Testing includes end-to-end functionality and video quality tests, verification of content upload and verification mechanisms, schedule amendment tests, and verification of program reuse from cloud archives. Failover and failback scenarios are tested and validated against functional and non-functional requirements like RTO/RPO.

For a robust disaster recovery solution, defining regular testing plans is crucial to ensure the DR mechanism remains reliable over the life of the service. To achieve this, CLOUDPORT uses a separate test endpoint and a test version of the failure reconfiguration AWS Lambda function. It conducts regular, pre-defined DR drills to validate the strategy and to identify weaknesses and areas for improvement. Periodic testing of the failover mechanism is especially important after changes to the on-premises architecture.
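One simple way to score a drill is to compare the measured recovery time against the RTO target for the scope being tested. The sketch below uses the targets quoted earlier in this post (under one minute for multi-AZ, 10 minutes for multi-region). The drill timestamps are made up for the example, and treating a result exactly equal to the target as a pass is an assumption.

```python
from datetime import datetime, timedelta
from typing import Any, Dict

# RTO targets as stated earlier in the post.
RTO_TARGETS = {
    "multi-az": timedelta(minutes=1),
    "multi-region": timedelta(minutes=10),
}


def evaluate_drill(scope: str, outage_start: datetime, service_restored: datetime) -> Dict[str, Any]:
    """Compare a drill's measured recovery time with the RTO target for its scope."""
    measured = service_restored - outage_start
    if measured < timedelta(0):
        raise ValueError("service_restored is earlier than outage_start")
    target = RTO_TARGETS[scope]
    return {
        "scope": scope,
        "measured_s": measured.total_seconds(),
        "target_s": target.total_seconds(),
        "within_target": measured <= target,
    }


# Hypothetical drill: regional failover simulated at 12:00:00, service back at 12:07:30.
result = evaluate_drill(
    "multi-region",
    outage_start=datetime(2024, 1, 1, 12, 0, 0),
    service_restored=datetime(2024, 1, 1, 12, 7, 30),
)
print(result)
```

Recording results like this for every drill gives a history that shows whether recovery times drift as the deployment or the on-premises side changes.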

Information gathered from the tuning and testing phase helps define options and configurations based on program audience and business impact in the event of a failure. This can be a per-channel customization with the hot option used for critical channels, and warm or cold options configured for niche channels.

Ensuring alignment and compatibility with the source, on-premises system, and communication endpoints is crucial to a successful recovery strategy in case of a disaster.

Conclusion

The approach highlighted in this post presents a straightforward and efficient architecture for implementing a cloud-based disaster recovery solution for channel playout on AWS.

By using Amagi’s CLOUDPORT SaaS product and an integration system, this solution supports hybrid operations and offers various configuration options to strike the optimal balance between RTO/RPO and operational costs. Sound operational procedures and regular failover testing ensure that the solution is reliable and functional when it is needed.

You can learn more about Amagi CLOUDPORT on AWS Marketplace.


