AWS for M&E Blog
Back to basics: HTTP video streaming
Consumption of video content has transformed dramatically over the last two decades. Just 10 to 15 years ago, viewing professionally produced content was inflexible and limited to a TV set connected to a satellite or terrestrial antenna or to a cable connection, generally through a set-top box (STB) receiver. In recent years, however, the ability to watch live content on any device has become the norm.
In this blog, we touch on a basic video delivery mechanism: HTTP streaming. HTTP video streaming is an underlying technology used to bring content to internet-connected devices.
Live video delivery pipeline
To start, let’s look at a typical video delivery pipeline for a standard TV channel:
TV channel production is quite complex even if we only focus on what is commonly referred to as ‘delivery’: the transportation of video from its source to its destination. Delivery in this context is divided into two parts: contribution and distribution.
Contribution covers all video delivery for the production and pre/post-production stages. This type of delivery is more sensitive to latency and must keep the video as pristine as possible for further processing.
Distribution covers delivery to external consumers. In turn, distribution is usually split into primary distribution, which is business to business (B2B), and secondary distribution, which is delivery to end customers/viewers, also known as direct-to-consumer (D2C). Distribution places greater demands on traffic optimization and scaling because delivery is usually one to many, which makes bitrate a critical consideration.
Managed and unmanaged networks
A managed network is a type of communication network that is built, operated, and secured by a service provider for contribution. The primary advantage of a managed network is the ability to manage and configure necessary quality of service (QoS) to allow reliable transmission of video from the source to destination. In most situations, this type of delivery is one to one, or one to few. The exception is IPTV, where a managed network is used for D2C delivery with UDP/RTP protocols (usually multicast).
An unmanaged network does not offer guaranteed QoS because no single entity owns or controls it end to end. The most common example of an unmanaged network is the public internet. A lack of support for QoS features means that the internet is not suitable for traditional live video delivery. However, transmission protocols with built-in QoS features, like SRT, Zixi, RIST, and RTMP, can be used, and they have become quite popular for unmanaged networks and delivery over the internet. These protocols resend data from the transmitter when packets are lost in transit. We mention these protocols to provide a more complete picture of current contribution and primary distribution models, but they are not used for D2C distribution due to scalability limitations.
Another mechanism for live video delivery is broadcast, which is used for satellite (DTH), terrestrial (DTV), and cable TV (CATV). TV operators that employ this type of delivery use a specific radio frequency range dedicated to TV signal distribution and transmit modulated signals over the air or over cable. With this type of delivery, where the signal is always available, a receiver (STB) needs to tune to the right frequency to demodulate and then decode the video signal. Before internet video delivery, this was the most common way to deliver video.
The paradigm has shifted to unmanaged networks in the secondary distribution space over the last ten years. This is because managed networks limit reach to customers directly connected to your infrastructure. In contrast, secondary distribution via unmanaged networks theoretically allows you to reach the entire world.
Over-the-top and adaptive bitrate streaming
Over-the-top (OTT) in the D2C space refers to video distribution over the internet, bypassing traditional linear or pay TV services to bring content directly to consumers on their internet-connected devices—their phones, smart TVs, tablets, computers, set-top boxes, gaming consoles, and digital media players.
This is a shift away from fixed lines and places dedicated to watching video, towards consuming video anywhere internet access is available.
Less dependence on managed networks and expensive broadcast systems makes video distribution infrastructure simpler, less expensive, and better aligned with readily available internet delivery networks. Delivery network capacity grows at an incredibly fast pace as the internet becomes more important in our daily lives.
One challenge of video delivery over the internet is dealing with varying and unstable network speeds. Additionally, a viewer’s connection may vary from 3G to 5G on a phone, or from 10 Mbit/s xDSL to 1 Gbit/s fiber to the home. One of the most commonly used technologies that ensures a smooth end-user experience across these varying connection types is adaptive bitrate (ABR) streaming.
ABR streaming stores video in small files at different quality levels that are sent via HTTP to a media player. The original source is encoded into multiple versions/profiles at different bitrates and resolutions to adapt to end-user network conditions and screen size. The different versions are listed in a file called a ‘manifest’ or ‘playlist’. The manifest contains a list of all the available versions of a particular stream.
Taking a closer look, we see that compressed streams are not single units, but sequences of segments, each usually several seconds long. Segmenting is the process of breaking video into small parts for network transport and playback, which gives the consuming device the ability to switch between bitrates at segmentation points. The viewing device requests a copy of the manifest and refreshes it every few seconds. When a device experiences a weak connection, or plays the video in a small window, it requests lower-bitrate segments. When the connection is stronger, the device can request higher-bitrate segments if the screen size warrants it, and so on. Players use different metrics to decide which quality to display (CPU utilization, network bandwidth, available memory, etc.).
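To make that decision process more concrete, here is a minimal sketch, in Python, of how a player might pick a variant based on measured throughput. The variant list, the 80 percent safety margin, and the function name are illustrative assumptions; real players such as hls.js or dash.js use more elaborate heuristics that also weigh buffer level, screen size, and device capability.

```python
# Illustrative only: a simplified view of how an ABR player might pick a variant.

# Variants as advertised in a manifest: (bitrate in bits per second, resolution)
VARIANTS = [
    (800_000, "640x360"),
    (2_000_000, "1280x720"),
    (5_000_000, "1920x1080"),
]

SAFETY_MARGIN = 0.8  # only spend ~80% of measured throughput, to absorb fluctuations


def select_variant(measured_throughput_bps: float) -> tuple[int, str]:
    """Return the highest-bitrate variant that fits within the measured throughput."""
    budget = measured_throughput_bps * SAFETY_MARGIN
    playable = [v for v in VARIANTS if v[0] <= budget]
    # If even the lowest bitrate exceeds the budget, fall back to the lowest one.
    return max(playable, default=min(VARIANTS))


if __name__ == "__main__":
    for throughput in (1_200_000, 3_500_000, 20_000_000):
        bitrate, resolution = select_variant(throughput)
        print(f"{throughput} bps available -> {resolution} @ {bitrate} bps")
```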
After an encoder/transcoder produces an ABR stack, the next step is packaging. Packaging prepares multiple bitrate streams for viewing over HTTP(S) by segmenting video streams into files and creating the appropriate manifest files depending on the streaming delivery format used. HTTP(S) is used for transport because it is one of the most reliable and suitable protocols for getting a file from a source to a recipient over the internet. HTTP(S) delivery is also well supported by content delivery networks (CDNs), making it one of the least expensive and most scalable ways to deliver content.
Following is a typical secondary distribution delivery chain, which omits monetization for simplicity.
Depending on the specific software or architecture, some components may perform multiple functions. For example, some ABR encoders also perform packaging, and some packaging systems are also origin servers.
Mainstream OTT streaming protocols
Broadly speaking, all ABR formats concern themselves with packaging and transport. A payload consisting of the encoded video, audio and subtitles is roughly the same for all formats, but the details of packaging and transport are slightly different. The four most commonly used types of adaptive bitrate formats and standards are as follows.
- HTTP Live Streaming (HLS) – developed by Apple Inc. and released in 2009, it is widely used for internet streaming, including on iOS and Android devices.
- Dynamic Adaptive Streaming over HTTP (MPEG-DASH) – developed under the Moving Picture Experts Group (MPEG) and published as a standard in April 2012.
- Common Media Application Format (CMAF) – a new format developed by MPEG in 2016-2018 to simplify internet streaming by combining aspects of HLS and MPEG-DASH. CMAF’s main driver is the optimization of HTTP(S) streaming infrastructure and reduced storage by using a common video file format for both HLS and DASH delivery.
- Microsoft Smooth Streaming (MSS) – developed by Microsoft in the late 2000s and first included with Internet Information Services (IIS) 7.0. It is still used in legacy workflows.
For completeness, we should also mention HTTP Dynamic Streaming (HDS) from Adobe, although it is no longer in use. Adobe Flash Player, which was required to play this type of ABR, reached end of life on December 31, 2020.
HTTP live streaming (HLS)
Support for the HLS protocol is widespread in media players, web browsers, mobile devices, and streaming media servers. It is the most popular streaming format.
HLS uses a two-layer playlist structure, combined with media segments. The very first file a player needs to start a stream is called the multi-variant playlist. HLS playlists can be identified by the file suffix “.m3u8”. The multi-variant playlist contains references to all the variant playlist files available to the player, and additional information about what those variant playlists contain. In most cases, there are different variants to cater to various network speeds or available bandwidth (high, medium, and low, with potentially intermediate steps). There may also be additional audio and/or caption/subtitle streams available. An example of a multi-variant playlist file follows:
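The sample below is illustrative rather than taken from a real stream; the bitrates, resolutions, codec strings, and URIs are placeholders, but the directives are standard HLS tags.

```
#EXTM3U
#EXT-X-VERSION:3

#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360,CODECS="avc1.4d401e,mp4a.40.2"
low/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2000000,RESOLUTION=1280x720,CODECS="avc1.4d401f,mp4a.40.2"
medium/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080,CODECS="avc1.640028,mp4a.40.2"
high/playlist.m3u8
```

And a shortened version of one of the variant (media) playlists it points to, listing the individual media segments:

```
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:6
#EXT-X-MEDIA-SEQUENCE:100
#EXTINF:6.000,
segment_100.ts
#EXTINF:6.000,
segment_101.ts
#EXTINF:6.000,
segment_102.ts
```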
Media segments can be carried in MPEG transport stream or fragmented MP4 containers. The previous example uses transport stream (.ts) segments. The consuming device or player uses the referenced files, requesting specific segments while determining which version of the segments (higher or lower bitrate) the detected network speed can support. Another convenient aspect of HLS playlists is that they remain reasonably human readable: you can establish things like how many variants there are and certain attributes of those variants, such as bitrate, resolution, codec, audio language, etc.
Playlists may also contain other types of information, such as locations for advertisements, captions, and language track options.
Note that the HLS specification is also sometimes referred to as the Pantos specification. This is a reference to Roger Pantos, an Apple software engineer who was one of the original co-contributors to the HLS specification.
Apple has created a website for HLS developers with guidance for HLS protocol implementation, where you can find a lot of useful information.
Dynamic adaptive streaming over HTTP (MPEG-DASH)
According to the DASH specification, the origin serves two types of files: MPD (Media Presentation Description) files and segments. MPD files are the manifests that reference the available content and inform the player about the available variants, the URLs where they can be found, and other characteristics. The segments contain the multimedia bitstream in chunks. MPEG-DASH is container-format agnostic, but most players and packagers use only the ISO BMFF profile, so you may see a single MP4 container, or multiple MP4 containers, holding several media components (such as audio, video, and captions).
To start playing, a DASH client/player first fetches the MPD file. It parses the MPD to learn about the timing, the media content, the media types, the resolutions, the minimum and maximum bandwidths, the existence of various encoded alternatives of the multimedia components, the location of each media component on the network, and other characteristics of the content. Once this has been established, the player selects the most suitable version and starts streaming the content by fetching segments.
The player is tasked with keeping sufficient buffer to prevent playback stalls (buffering). As with HLS, the client continues fetching segments while monitoring bandwidth fluctuations. Based on those measurements, it can decide to step up or step down bitrate versions if they are available.
Let’s look a bit more deeply at a DASH manifest file. It may consist of one or multiple periods, where a period is an interval of the program within the channel, usually used to split the stream for advertisement placement. Each period has a starting time and duration and consists of one or multiple adaptation sets.
An adaptation set provides the information about one or multiple media components and their various encoded alternatives. For instance, an adaptation set may contain the different bitrates of the video component of the same multimedia content. Another adaptation set may contain the different bitrates of the audio component (e.g. lower quality stereo and higher quality surround sound) of the same multimedia content. Each adaptation set usually includes multiple representations/profiles.
Each representation consists of one or multiple segments. Segments are the media stream chunks in temporal sequence. Each segment has a URL, i.e. an addressable location on a server, and can be downloaded using an HTTP(S) GET request, optionally with byte ranges. Segment addressing is defined by the SegmentTemplate element, and there are two options available in the MPD: “Number” or “Time”.
With Number-based addressing, segment URLs are generated from a simple incrementing index. With Time-based addressing, a SegmentTimeline lists the start time and duration of each segment, and those timestamps are substituted into the URL template.
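To make this concrete, here is a simplified, hand-written MPD fragment for each approach. The element and attribute names follow the DASH specification (ISO/IEC 23009-1), but the timescale, durations, bitrates, and URL templates are made-up values for this sketch.

```xml
<!-- Illustrative MPD fragment using $Number$-based addressing; values are made up. -->
<MPD xmlns="urn:mpeg:dash:schema:mpd:2011" type="dynamic"
     profiles="urn:mpeg:dash:profile:isoff-live:2011">
  <Period id="1" start="PT0S">
    <AdaptationSet mimeType="video/mp4" segmentAlignment="true">
      <!-- $RepresentationID$ and $Number$ are substituted by the player for each request -->
      <SegmentTemplate timescale="90000" duration="540000"
                       initialization="$RepresentationID$/init.mp4"
                       media="$RepresentationID$/segment_$Number$.m4s"
                       startNumber="1"/>
      <Representation id="video_720p" bandwidth="2000000" width="1280" height="720" codecs="avc1.4d401f"/>
      <Representation id="video_1080p" bandwidth="5000000" width="1920" height="1080" codecs="avc1.640028"/>
    </AdaptationSet>
  </Period>
</MPD>
```

The Time-based alternative replaces the fixed duration with an explicit SegmentTimeline:

```xml
<!-- Illustrative $Time$-based addressing: a SegmentTimeline lists each segment's
     start time (t) and duration (d) in timescale units; r repeats the entry. -->
<SegmentTemplate timescale="90000"
                 initialization="$RepresentationID$/init.mp4"
                 media="$RepresentationID$/segment_$Time$.m4s">
  <SegmentTimeline>
    <S t="0" d="540000" r="2"/> <!-- three 6-second segments starting at t=0 -->
    <S d="450000"/>             <!-- followed by one 5-second segment -->
  </SegmentTimeline>
</SegmentTemplate>
```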
Another element that is usually part of DASH streaming is the initialization segment. It contains information required to initialize the video decoder. The initialization segment is optional (refer to ISO/IEC 23009-1).
The DASH Industry Forum (DASH-IF) promotes the adoption of MPEG-DASH by providing guidelines that help packager and player vendors improve compatibility. Similar to the Apple HLS developers’ website, there is a website with guidance for DASH protocol implementation that contains a lot of useful information.
Common Media Application Format (CMAF)
The Common Media Application Format (CMAF) standard has quickly gained adoption across media technologies and playback devices. It was proposed by MPEG in 2016. Very much simplified, CMAF can be described as a hybrid of DASH and HLS. By enabling a single file format, based on fMP4 segments, to store, process, and deliver video streams to both HLS and DASH clients, CMAF increases efficiency and limits complexity, reducing processing, storage, and delivery costs for live and on-demand video workflows.
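As a simplified sketch of that idea (the file names and values below are illustrative), the same CMAF initialization and media segments could be referenced from both an HLS media playlist, using EXT-X-MAP, and a DASH SegmentTemplate:

```
#EXTM3U
#EXT-X-VERSION:7
#EXT-X-TARGETDURATION:6
#EXT-X-MAP:URI="video_720p/init.mp4"
#EXTINF:6.000,
video_720p/segment_1.m4s
#EXTINF:6.000,
video_720p/segment_2.m4s
```

```xml
<!-- Illustrative DASH SegmentTemplate pointing at the same CMAF segments -->
<SegmentTemplate initialization="video_720p/init.mp4"
                 media="video_720p/segment_$Number$.m4s"
                 timescale="90000" duration="540000" startNumber="1"/>
```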
Microsoft Smooth Streaming (MSS)
MSS is an HTTP-based protocol developed by Microsoft. Similar to the protocols already discussed, it allows a client player to adapt to changing conditions by dynamically optimizing content quality throughout the course of a streaming session. On the server side, content is encoded at multiple bitrates, with one MP4 file per bitrate.
The Smooth Streaming packager generates three file types as output formats:
- ISM — A manifest file that contains links to each rendition along with additional metadata.
- ISMC — A client file that contains information about each rendition and each segment within each rendition.
- ISMV/ISMA — One or more movie (PIFF) files (sometimes known as fragmented MP4).
From the player’s perspective, playback starts by fetching the manifest. This is followed by requests for segments and for refreshed versions of the manifest. Requested segments can contain only audio, only video, or both muxed (merged) together.
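As a rough, abbreviated sketch (the duration, bitrates, and chunk count are illustrative), a Smooth Streaming client manifest describes each bitrate as a QualityLevel and each segment as a c (chunk) entry, with times expressed in 100-nanosecond units:

```xml
<!-- Illustrative, abbreviated Smooth Streaming client manifest (.ismc); values are made up. -->
<SmoothStreamingMedia MajorVersion="2" MinorVersion="0" Duration="1200000000">
  <StreamIndex Type="video" Chunks="60" QualityLevels="2" TimeScale="10000000"
               Url="QualityLevels({bitrate})/Fragments(video={start time})">
    <QualityLevel Index="0" Bitrate="2000000" FourCC="H264" MaxWidth="1280" MaxHeight="720"/>
    <QualityLevel Index="1" Bitrate="5000000" FourCC="H264" MaxWidth="1920" MaxHeight="1080"/>
    <!-- one <c> entry per chunk; d is the chunk duration (here 2 seconds) -->
    <c d="20000000"/>
  </StreamIndex>
</SmoothStreamingMedia>
```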
Conclusion
With a better understanding of HTTP video streaming and the various protocols, consider building your own video workflows with Amazon Web Services (AWS). Purpose-built AWS Media Services like AWS Elemental MediaPackage, AWS Elemental MediaLive, and AWS Elemental MediaConvert are services that allow you to construct and enable video streaming.
Lastly, please check out this free one-hour foundational video that breaks down important video technology concepts and the processes involved in getting content from its source to viewers’ screens.