Introduction to Top-Down Monitoring of Live Video Workflows with Amazon CloudWatch and Media Services Application Mapper
Live video workflows are collections of software and hardware resources that process and deliver live streams to consumers. Traditional live video workflows include encoding appliances, networking hardware, and other on-premises equipment to package and originate the streams. As content providers shift away from managing data centers and focus more on content, more live video workflow processing is migrating to the cloud. As a result, cloud-native capabilities such as serverless code execution, transmission redundancy, and physical durability of data are available to video workflows as a by-product of the environment.
For decades, viewers have been accustomed to uninterrupted live television. Awards shows like the Oscars and sporting events like the World Cup are broadcast to millions of viewers worldwide. Interruptions and blackouts during events like these are unacceptable to broadcasters and, most importantly, to viewers. The industry has responded by deploying redundancy throughout the infrastructure that processes and delivers live video. For example, multiple data connections are routed along different paths to their destination, and duplicate video processing equipment is deployed at separate locations to contend with the possibility of an outage at any one location. Even with redundant systems deployed, monitoring the infrastructure across the providers and participants of the workflow is critical to ensure that any outage or impact to redundant systems is resolved quickly.
As redundancy becomes pervasive throughout the entire video delivery workflow, both on-premises and in the cloud, a new monitoring problem becomes apparent: in a well-designed cloud workflow, redundant systems can hide problems from view. Because live video workflows are designed to keep producing output even when components within the workflow fail, successful playback of the live event no longer proves that everything upstream is working.
Amazon CloudWatch is a fundamental building block for monitoring video workflows in the cloud and on-premises. CloudWatch is the monitoring service for AWS cloud services, including AWS Media Services such as AWS Elemental MediaLive, AWS Elemental MediaPackage, and AWS Elemental MediaConnect. CloudWatch is responsible for collecting and filtering log data from executing code. It can aggregate metrics for analysis and display through dashboards. Most importantly, CloudWatch acts as a notification source for events and alarms generated by metrics or emitted directly from other AWS services or custom code.
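As an illustration of a metric-driven alarm, the following is a minimal sketch of the parameters one might pass to the CloudWatch PutMetricAlarm API through boto3. The alarm name, channel ID, threshold, and region are hypothetical, and the use of the AWS/MediaLive namespace with the ActiveAlerts metric and the ChannelId and Pipeline dimensions is an assumption that should be checked against the MediaLive documentation before use.

```python
from typing import Any, Dict


def build_alarm_params(channel_id: str, pipeline: str) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch's PutMetricAlarm API."""
    return {
        # Hypothetical naming scheme, chosen only for this example.
        "AlarmName": f"medialive-{channel_id}-pipeline-{pipeline}-active-alerts",
        # Assumed namespace, metric, and dimensions for a MediaLive channel.
        "Namespace": "AWS/MediaLive",
        "MetricName": "ActiveAlerts",
        "Dimensions": [
            {"Name": "ChannelId", "Value": channel_id},
            {"Name": "Pipeline", "Value": pipeline},
        ],
        # Alarm when any alert is active during a one-minute period.
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 0.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


def create_alarm(params: Dict[str, Any], region: str = "us-east-1") -> None:
    """Send the alarm definition to CloudWatch (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so build_alarm_params runs without the SDK

    boto3.client("cloudwatch", region_name=region).put_metric_alarm(**params)
```

Keeping the parameter builder separate from the API call makes the alarm definition easy to inspect or unit test without AWS credentials.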
The next level of visualization and monitoring is Media Services Application Mapper (MSAM). MSAM is an open source tool that was released on GitHub and AWS Answers in the second half of 2018. The primary use of MSAM is to provide a simplified, operational view of AWS Media Services workflows. Media Services Application Mapper automatically discovers configured cloud services and the logical data connections among them. The tool currently works with AWS Media Services and supporting services, like Amazon S3 and Amazon CloudFront. The result is a high-level graphical view of the configured services that make up live video workflows. This provides a full picture of all the building blocks of your workflows, so you can monitor encoding, packaging, origination, and distribution holistically and spot and react to issues quickly.
The following image shows an MSAM-generated diagram of a single streaming channel that packages video for HTTP Live Streaming and delivers the stream to the viewer through CloudFront. MSAM automatically discovers the logical connections among services, regardless of whether they are represented as a numeric ID, ARN, URL, or any other kind of identifier.
MSAM receives pipeline events from MediaLive and displays them by changing the visual representation of the affected MediaLive node on the diagram. The user can also connect a combination of CloudWatch alarms to nodes on the diagram. Any alarm for a service within any region can be associated with any number of nodes on the diagram. When an alarm is triggered, MSAM receives it, stores it, and displays it on the nodes with which the operator associated it. MSAM works with preconfigured metrics and alarms from existing services, as well as custom metrics and alarms defined by an operator.
The following image shows a slightly more complex workflow. This is a hybrid workflow with resources running in an on-premises data center and in the cloud. MSAM is showing a data throughput alarm affecting multiple services. A display like this, together with the specifics of the alarm, can quickly direct the operator to the area of the workflow experiencing a problem. In this example, the error triggers alarms across four nodes, but because video and data flow from left to right, it is easy to identify that the error most likely starts at the lower of the two firewall nodes.
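The many-to-many association between alarms and diagram nodes can be sketched with a small in-memory registry. This is only an illustration of the idea, not MSAM's actual data model or API: the class name, method names, node IDs, and alarm names below are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

# An alarm is identified here by (region, alarm name); this key is an assumption.
AlarmKey = Tuple[str, str]


class AlarmRegistry:
    """Hypothetical store linking alarms in any region to any number of nodes."""

    def __init__(self) -> None:
        self._nodes_by_alarm: Dict[AlarmKey, Set[str]] = defaultdict(set)
        self._state_by_alarm: Dict[AlarmKey, str] = {}

    def associate(self, region: str, alarm_name: str, node_id: str) -> None:
        """Attach an alarm to a node; one alarm may be attached to many nodes."""
        self._nodes_by_alarm[(region, alarm_name)].add(node_id)

    def record_state(self, region: str, alarm_name: str, state: str) -> None:
        """Store the latest state received for an alarm, e.g. 'ALARM' or 'OK'."""
        self._state_by_alarm[(region, alarm_name)] = state

    def alarming_nodes(self) -> Set[str]:
        """Return every node associated with at least one alarm in ALARM state."""
        nodes: Set[str] = set()
        for key, state in self._state_by_alarm.items():
            if state == "ALARM":
                nodes |= self._nodes_by_alarm.get(key, set())
        return nodes


registry = AlarmRegistry()
registry.associate("us-west-2", "low-throughput", "medialive-channel-1")
registry.associate("us-west-2", "low-throughput", "mediapackage-channel-1")
registry.record_state("us-west-2", "low-throughput", "ALARM")
```

With one alarm attached to two nodes, a single state change marks both nodes, which mirrors how one alarm can highlight several parts of a diagram.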
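The left-to-right reasoning above can be sketched as a small graph search: among the alarming nodes, keep only those with no alarming node upstream. The topology and node names below are hypothetical and do not reproduce the pictured workflow, and the sketch assumes the workflow graph has no cycles.

```python
from typing import Dict, List, Set


def likely_origins(edges: Dict[str, List[str]], alarming: Set[str]) -> Set[str]:
    """Return alarming nodes that have no alarming node upstream of them.

    `edges` maps each node to the nodes it sends video or data to.
    """
    # Collect every node reachable downstream from any alarming node.
    downstream_of_alarm: Set[str] = set()
    stack = [child for node in alarming for child in edges.get(node, [])]
    while stack:
        node = stack.pop()
        if node in downstream_of_alarm:
            continue
        downstream_of_alarm.add(node)
        stack.extend(edges.get(node, []))
    # An alarming node that sits downstream of another alarm is probably a symptom.
    return alarming - downstream_of_alarm


# Hypothetical hybrid workflow: two on-premises paths feeding one cloud channel.
workflow = {
    "encoder-a": ["firewall-a"],
    "encoder-b": ["firewall-b"],
    "firewall-a": ["medialive"],
    "firewall-b": ["medialive"],
    "medialive": ["mediapackage"],
    "mediapackage": ["cloudfront"],
}
in_alarm = {"firewall-b", "medialive", "mediapackage", "cloudfront"}
origins = likely_origins(workflow, in_alarm)
```

In this example, four nodes are in alarm but only `firewall-b` has no alarming node upstream, so it is the first place an operator would look.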
Even this level of detail can be too much when an operations team is tasked with monitoring hundreds or even thousands of streaming channels. The above workflow is created from a combination of seven on-premises and cloud resources. If each channel consists of at least five cloud services, then 500 streaming channels means 2,500 or more nodes and connections to display graphically. To help with this level of scale, MSAM also supports a high-level tile view. The concept behind a tile is that it aggregates all the services, events, and alarms into a single visual node on the screen. When an alarm assigned to a node aggregated by the tile is triggered, the tile indicates the alarm and provides a navigation path into a detailed view to help investigate the problem.
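The scale argument and the tile concept can be sketched in a few lines. The `Tile` class, its fields, and the node names are hypothetical and only illustrate aggregation; they are not MSAM's implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

# Scale from the text: 500 channels with at least five services each.
channels = 500
services_per_channel = 5
min_nodes_in_diagram = channels * services_per_channel  # 2,500 nodes
tiles_in_tile_view = channels  # assuming one tile per channel


@dataclass
class Tile:
    """Hypothetical tile that groups the nodes of one streaming channel."""

    name: str
    node_ids: Set[str] = field(default_factory=set)

    def alarm_count(self, alarms_by_node: Dict[str, int]) -> int:
        """Total triggered alarms across every node aggregated by this tile."""
        return sum(alarms_by_node.get(node, 0) for node in self.node_ids)

    def in_alarm(self, alarms_by_node: Dict[str, int]) -> bool:
        """A tile indicates an alarm when any of its nodes has one."""
        return self.alarm_count(alarms_by_node) > 0


sports = Tile("sports", {"medialive-1", "mediapackage-1", "cloudfront-1"})
news = Tile("news", {"medialive-2", "mediapackage-2", "cloudfront-2"})
triggered = {"mediapackage-1": 2, "cloudfront-1": 1}
```

Here the `sports` tile reports an alarm because two of its nodes have triggered alarms, while the `news` tile stays clear, so the operator has one visual element per channel to watch instead of several.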
The above image shows the MSAM tile view with the top-left tile in an alarm state. Tiles are sorted alphabetically by default. Tiles with alarms or events are pushed to the top-left of the display and sorted by their total count of alarms and events to indicate urgency.
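The ordering described above can be expressed as a single sort key. This is a sketch with hypothetical tile names and counts. It assumes the highest count comes first and that ties are broken alphabetically, neither of which the text specifies.

```python
from typing import Dict, List


def order_tiles(issue_counts: Dict[str, int]) -> List[str]:
    """Order tile names for display.

    `issue_counts` maps a tile name to its total number of alarms and events.
    Tiles with issues come first, highest count first (assumed); tiles with a
    count of zero keep the default alphabetical order after them.
    """
    # Negating the count sorts larger counts first; zero-count tiles all share
    # the same first key, so they fall back to alphabetical order by name.
    return sorted(issue_counts, key=lambda name: (-issue_counts[name], name))


display_order = order_tiles({"news": 0, "sports": 3, "awards": 0, "movies": 1})
```

With these counts, `sports` and `movies` move ahead of the tiles that have no alarms or events, and `awards` and `news` follow in alphabetical order.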
Media Services Application Mapper is designed to help operators understand connectivity among AWS Media Services at a large scale, and to help monitor and diagnose problems with video workflows quickly. MSAM is currently available from GitHub and AWS Answers. The tool is being actively developed and updated regularly by AWS Elemental with the help of third-party contributions to the open source project. See the links below to learn more, get involved, and start using MSAM today.