Using multiple content delivery networks for video streaming – part 2

If you are reading part two of this two-part blog series, it probably means that you operate a video streaming service for millions of viewers, with high sensitivity to performance, and you are considering multiple CDNs for your video delivery. In this part, I will guide you through important questions to consider when deploying a system to route traffic across multiple CDNs. More specifically, I will address some questions like how to score the performance of a CDN based on measurements, what criteria to consider for routing, and finally how to do the routing itself. To set the right expectations, you will not read about a magical solution that fits all business cases, but rather you will learn more about the topic, to help you buy the right commercial solution or build your own system for multi-CDN delivery.

Calculating performance score

In part 1, I discussed how to measure the performance of a CDN. With measurement data available to you now, how do you combine different metrics to score CDNs and rank them?

You can calculate CDN performance scores using mathematical formulas based on commonly accepted metrics like buffering, play failures, bitrates, and startup times. However, the optimal formula depends on the specifics of your business. For example, if you deliver short video clips with ads, you would give more weight to startup time in your formula. On the other hand, for long-form video on demand (VoD) like films, you would give more weight to buffering and bitrates. As an example of scoring viewer experience, this article explains the formula used by Mux, a video technology company that provides client-side quality of experience (QoE) analytics, as shown in the following screenshot. Regardless of whether you buy a commercial solution like Mux or build your own system, make sure that you understand the formula that is used, and then optimize it for your workload through multiple iterations and feedback loops based on the engagement of your audience.
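To make the idea of business-specific weighting concrete, here is a minimal sketch of a weighted session score. It is not the Mux formula or any vendor's formula: the `SessionMetrics` fields, the two weight sets, and the normalization limits (`max_bitrate_kbps`, `max_startup_s`) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class SessionMetrics:
    rebuffer_ratio: float      # fraction of watch time spent buffering, 0..1
    play_failure: bool         # True if playback never started
    avg_bitrate_kbps: float
    startup_time_s: float


# Hypothetical weights: short clips favor startup time,
# long-form VoD favors buffering and bitrate.
SHORT_CLIP_WEIGHTS = {"buffering": 0.2, "failure": 0.3, "bitrate": 0.1, "startup": 0.4}
LONG_FORM_WEIGHTS = {"buffering": 0.4, "failure": 0.2, "bitrate": 0.3, "startup": 0.1}


def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))


def session_score(m: SessionMetrics, weights: dict,
                  max_bitrate_kbps: float = 8000.0,
                  max_startup_s: float = 10.0) -> float:
    """Return a 0..100 score where higher means better viewer experience."""
    # Normalize every metric to 0..1 with 1 being the best outcome.
    parts = {
        "buffering": 1.0 - clamp01(m.rebuffer_ratio),
        "failure": 0.0 if m.play_failure else 1.0,
        "bitrate": clamp01(m.avg_bitrate_kbps / max_bitrate_kbps),
        "startup": 1.0 - clamp01(m.startup_time_s / max_startup_s),
    }
    total_weight = sum(weights.values())
    return 100.0 * sum(weights[k] * parts[k] for k in parts) / total_weight
```

With these assumed weights, the same slow-starting session is penalized much more under the short-clip profile than under the long-form profile, which is the behavior described above.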

Another case study is Amazon Prime Video, which operates a video on demand service (transactional and subscription based) with a very large catalog where 12% of titles account for 90% of playbacks. To calculate CDN performance scores, they mostly focus on the percentage of sessions without rebuffering, the percentage of sessions without fatal errors, bitrates, response times, and time to first frame. Watch this talk from re:Invent 2018, where Amazon Prime Video explains how they deliver their video using multiple CDNs, including CloudFront. You can also hear how CloudFront engineers work backwards with Amazon Prime Video to optimize performance and gain more traffic share.

Once you decide on the relevant scoring formula for your business, you can apply it to your measurement dataset, by segments, to surface dimensions that affect the performance of video streaming. The most important dimension is the internet service provider (ISP), which is represented by an autonomous system number (ASN). ISPs will typically have different performance characteristics because of their diverse capabilities in terms of network connectivity with various CDNs. Some customers like Amazon Prime Video go beyond a single dimension and add other dimensions like viewer location and device type for more granularity in evaluating the performance of CDNs. In fact, CDNs may optimize video delivery for some streaming formats or user devices. For example, when you use CloudFront’s native Smooth Streaming feature to stream VoD from S3, CloudFront prefetches the next segments into the cache, which improves performance for devices using this format, like smart TVs. With dimensions now defined, make sure that you have enough measurements in each segment (ideally hundreds of thousands per day) for statistical relevance. Otherwise, you can group segments. For example, in a country where you don’t have enough audience to produce relevant statistics per ASN, you can group data points in one country segment.
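The grouping rule at the end of this paragraph can be sketched as follows. This is only an illustration: the measurement record shape (`country`, `asn` keys) and the `segment_measurements` helper are assumptions, and the tiny `MIN_SAMPLES` value stands in for the hundreds of thousands of daily measurements mentioned above.

```python
from collections import defaultdict

# Illustrative threshold only; real systems would use a far larger value.
MIN_SAMPLES = 3


def segment_measurements(measurements, min_samples=MIN_SAMPLES):
    """Group measurements per (country, ASN); fold sparse ASNs into a
    country-level segment keyed as (country, None)."""
    by_asn = defaultdict(list)
    for m in measurements:
        by_asn[(m["country"], m["asn"])].append(m)

    segments = defaultdict(list)
    for (country, asn), rows in by_asn.items():
        key = (country, asn) if len(rows) >= min_samples else (country, None)
        segments[key].extend(rows)
    return dict(segments)
```

A routing lookup would then try the `(country, asn)` key first and fall back to `(country, None)` when the ASN has no segment of its own.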

The final step is to aggregate measurements in a segment to calculate its score. On this point, I recommend combining multiple aggregations. First, combine short aggregation periods (a couple of minutes) with long aggregation periods (tens of minutes) to be reactive to CDN fluctuations, while at the same time keeping in mind the bigger picture of CDN performance level. Second, combine multiple aggregation functions (average, median, P90, and so on) to take into consideration variance in data points. For example, in a single ASN, you might have broadband and mobile users, which sit in different places in the statistical histogram.
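As a rough sketch of combining aggregation periods and functions, the following computes mean, median, and P90 over a short and a long window and blends them. The window lengths, the `short_weight` blend factor, and the nearest-rank percentile method are assumptions made for the example, not a prescribed method.

```python
import math
from statistics import mean, median


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list (p in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def window_stats(samples, now_s, window_s):
    """samples: iterable of (timestamp_s, value). Returns None if the window is empty."""
    values = [v for t, v in samples if now_s - window_s < t <= now_s]
    if not values:
        return None
    return {"mean": mean(values), "median": median(values), "p90": percentile(values, 90)}


def blended_stats(samples, now_s, short_s=120, long_s=1800, short_weight=0.6):
    """Blend a reactive short window with a stabilizing long window."""
    long_stats = window_stats(samples, now_s, long_s)
    if long_stats is None:
        return None
    # With no recent samples, fall back to the long-window view.
    short_stats = window_stats(samples, now_s, short_s) or long_stats
    return {k: short_weight * short_stats[k] + (1 - short_weight) * long_stats[k]
            for k in long_stats}
```

Keeping the median and P90 alongside the mean helps when a segment mixes populations, such as broadband and mobile users, whose values cluster in different parts of the distribution.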

Allocating traffic to CDNs

At this stage, you have a performance score per segment and for each CDN. As a next step, you need to apply your traffic allocation logic to produce a routing table that will be consumed by your CDN switching system. If you do not want to build these components yourself, some commercial solutions like Cedexis or NS1 with Mux, SmartSwitch from Nice People At Work, and Conviva Precision provide them, but you can still go through the next two paragraphs to better evaluate the capabilities of your selected provider.

When using ASN as a single dimension to break down your measurement dataset, the routing table would look like this:

Input = ASN                        | Output = Allocation amongst CDNs
Orange ISP in France (AS3215)      | 40% CDN1, 20% CDN2, 40% CDN3
Telefonica ISP in Spain (AS3352)   | 60% CDN1, 40% CDN2, 0% CDN3
Comcast ISP in US (AS7922)         | 33% CDN1, 34% CDN2, 33% CDN3
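One possible in-memory representation of the routing table above, together with a weighted random pick, is sketched here. The `DEFAULT_ALLOCATION` fallback for unknown ASNs and the `pick_cdn` helper are assumptions added for the example.

```python
import random

# ASN -> percentage of traffic per CDN, taken from the table above.
ROUTING_TABLE = {
    3215: {"CDN1": 40, "CDN2": 20, "CDN3": 40},  # Orange, France
    3352: {"CDN1": 60, "CDN2": 40, "CDN3": 0},   # Telefonica, Spain
    7922: {"CDN1": 33, "CDN2": 34, "CDN3": 33},  # Comcast, US
}

# Hypothetical fallback for ASNs that have no entry.
DEFAULT_ALLOCATION = {"CDN1": 34, "CDN2": 33, "CDN3": 33}


def pick_cdn(asn, rng=random):
    """Pick a CDN for one session, honoring the allocation percentages."""
    allocation = ROUTING_TABLE.get(asn, DEFAULT_ALLOCATION)
    cdns = list(allocation)
    weights = [allocation[c] for c in cdns]
    return rng.choices(cdns, weights=weights, k=1)[0]
```

Over many sessions from the same ASN, the share of picks per CDN converges toward the configured percentages, and a CDN with a 0% allocation is never selected.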

Producing the routing table is a standard problem in analytics that you can solve using AWS services like Kinesis Analytics. I will not focus on this aspect, but rather on the allocation logic itself, for which you should consider the following questions:

  • How frequently should I update the routing table? In general, as real-time as your analytics pipeline can go to improve your reactivity to CDN fluctuations. I am talking here about tens of seconds.
  • When should I switch traffic from one CDN to another? Consider switching thresholds to compensate for statistical errors. For example, only switch traffic from CDN1 to CDN2 if the score is at least X% better. Additionally, make sure that you keep a minimal level of traffic on each CDN to keep its cache warm.
  • How incrementally should I switch traffic from one CDN to another? Consider switching a percentage of traffic that is proportional to the difference of CDN scores, but is capped to avoid undesired effects like saturating the target CDN, or creating traffic fluctuations between CDNs.
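The three questions above can be combined into a single update step. The following is a minimal sketch under stated assumptions: it only moves traffic from the worst-scoring CDN to the best-scoring one, and the parameter names and defaults (`threshold_pct`, `gain`, `max_step`, `floor`) are hypothetical values, not recommendations.

```python
def rebalance(allocation, scores, threshold_pct=5.0, gain=0.5,
              max_step=10.0, floor=5.0):
    """Return a new allocation (cdn -> percent) for one segment.

    allocation: current percentages, summing to 100
    scores:     cdn -> performance score, higher is better
    """
    best = max(scores, key=scores.get)
    worst = min(scores, key=scores.get)
    if best == worst or scores[worst] <= 0:
        return dict(allocation)

    # Switching threshold: ignore differences that may be statistical noise.
    diff_pct = (scores[best] - scores[worst]) / scores[worst] * 100
    if diff_pct < threshold_pct:
        return dict(allocation)

    # Step is proportional to the score gap, capped by max_step, and
    # limited so the losing CDN keeps at least `floor` percent (warm cache).
    step = min(gain * diff_pct, max_step, max(0.0, allocation[worst] - floor))
    updated = dict(allocation)
    updated[worst] -= step
    updated[best] += step
    return updated
```

Running this step on every routing table refresh moves traffic gradually, which limits the risk of saturating the target CDN or oscillating between CDNs.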

There are other factors you should consider as well. First, think about optimizing cost when two CDNs have similar performance. You have the obvious cost of delivered gigabyte (GB) by CDN (make sure you understand whether a CDN uses base 1000 or base 1024 to calculate GB), but also costs incurred by your origin. Another factor to consider is the content popularity, to decide how many CDNs you want to use for that specific content. Typically, for long tail content with low popularity, it is better to use a smaller number of CDNs to improve cache hit ratio.
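The base 1000 versus base 1024 point is easy to quantify. In this small sketch, the traffic volume and the price per GB are made-up numbers used only to show the size of the gap.

```python
def billed_gb(bytes_delivered, base):
    """Convert bytes to 'GB' using base 1000 (10^9 bytes) or base 1024 (2^30 bytes)."""
    return bytes_delivered / base ** 3


def delivery_cost(bytes_delivered, price_per_gb, base):
    return billed_gb(bytes_delivered, base) * price_per_gb


# Hypothetical month: 10^15 bytes delivered at an assumed 0.02 per GB.
traffic_bytes = 10 ** 15
cost_base_1000 = delivery_cost(traffic_bytes, 0.02, 1000)
cost_base_1024 = delivery_cost(traffic_bytes, 0.02, 1024)

# The same traffic is billed as about 7.4% more GB under base 1000.
ratio = cost_base_1000 / cost_base_1024
```

Because 1024^3 / 1000^3 is about 1.074, two CDNs quoting the same price per GB can differ by roughly 7% in effective cost depending on the unit they use.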

Regardless of your logic, your allocation system should allow static overriding of allocations for reasons such as:

  • Taking into account the agreed capacity of a CDN in a specific region.
  • Considering the consumption of the contracted traffic commitment with a CDN.
  • Executing a manual emergency change in situations where your system didn’t automatically catch an issue.
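A static override can be layered on top of the computed allocation. The sketch below assumes one simple policy, which is pinning some CDNs to fixed percentages and rescaling the remaining CDNs proportionally. The `apply_overrides` helper and its `pinned` argument are hypothetical names.

```python
def apply_overrides(allocation, pinned):
    """Pin some CDNs to fixed percentages and rescale the others.

    allocation: computed cdn -> percent for a segment
    pinned:     cdn -> percent set manually (capacity limit, commitment,
                or emergency change)
    """
    pinned_total = sum(pinned.values())
    if pinned_total > 100:
        raise ValueError("pinned allocations exceed 100%")

    free = {c: p for c, p in allocation.items() if c not in pinned}
    free_total = sum(free.values())
    remaining = 100 - pinned_total

    result = dict(pinned)
    for cdn, pct in free.items():
        # Keep the computed proportions among non-pinned CDNs; split evenly
        # if they all had 0% before the override.
        share = pct / free_total if free_total else 1 / len(free)
        result[cdn] = remaining * share
    return result
```

For example, pinning a CDN to 0% during an incident redistributes its traffic to the other CDNs in the same ratio they already had.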

Finally, you can go beyond reacting to CDN fluctuations and start predicting them. For example, when you notice a recurring pattern with a specific CDN, like performance degradation during evening peak hours because of network congestion, you can program a regular anticipated traffic allocation. Or even better, use machine learning to automate it.

Switching traffic among CDNs

Now that you have built your routing table, you can start switching traffic among CDNs. For this purpose, there are two common approaches. The first one uses DNS with short TTLs to steer traffic. It’s a fairly simple, fast, and scalable solution that companies usually start with. However, DNS-based switching has some drawbacks:

  • Some ISPs or devices do not honor DNS TTLs, which skews your traffic allocation.
  • It is only compatible with dimensions that can be inferred from an IP address, like ISP and country. Sophisticated dimensions like content type or device type are not possible.
  • It is based on the DNS resolver’s IP address to make decisions, which makes routing less precise in some cases, such as divergent resolvers that serve a wide set of users across many networks or geographies. This is a well-known challenge for DNS-based CDNs.
  • It doesn’t work with features like token protection solutions because each CDN has a different implementation.

The second approach is HTTP-based switching. With this method, CDN switching decisions are made when the video player requests the playback URL from your application. With HTTP-based switching, a streaming session will stick to one CDN and benefit from reusing the same TCP/TLS connection, which makes it easier to use some features like prefetching and token protection. However, you can still fail over to the second best CDN if an issue occurs with the primary CDN allocated to a streaming session, by sending the player a list of playback URLs that use different CDNs, ordered by their scores. You can implement failover logic on the client side by leveraging native failover mechanisms in streaming formats, such as multiple base URLs with MPEG-DASH, and redundant variant streams with HLS. For more sophisticated failover, you can implement the logic in the player code. Finally, to improve response times of returning playback URLs, you can use a CDN in front of your switching endpoint, and if your audience is spread across the world, you can replicate your switching endpoint to multiple geographies to route a user request to the nearest endpoint. A simple implementation using AWS services is illustrated in the following diagram.
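The ordered list of playback URLs described above could be built along these lines. This is a sketch only: the host names are placeholders, and the `playback_urls` helper and its arguments are assumptions rather than part of any AWS API.

```python
# Placeholder host names; each CDN would have its own distribution domain.
CDN_HOSTS = {
    "CDN1": "cdn1.example.com",
    "CDN2": "cdn2.example.com",
    "CDN3": "cdn3.example.com",
}


def playback_urls(path, primary, scores, hosts=CDN_HOSTS):
    """Return playback URLs: the allocated primary CDN first, then the
    remaining CDNs ordered from best to worst score as failover targets."""
    fallbacks = sorted((c for c in scores if c != primary),
                       key=scores.get, reverse=True)
    return [f"https://{hosts[c]}{path}" for c in [primary, *fallbacks]]
```

The player starts with the first URL and moves down the list only if the primary CDN fails, so a healthy session stays on one CDN for its whole duration.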

Using DynamoDB global tables, you can store the CDN allocation data and replicate it to multiple AWS regions. CloudFront terminates the connection and accelerates this dynamic content over the AWS global network. Finally, Lambda@Edge is used to query the nearest DynamoDB endpoint about the best CDN to use based on the viewer request characteristics, and return the playback URL to the player.

Conclusion

In this two-part blog series, I started by explaining why and when it is relevant to adopt a multi-CDN strategy. I also used this opportunity to talk about the benefits of CloudFront, AWS’s content delivery network, for media companies. Then I explained some best practices for implementing multi-CDN delivery through its main components:

  • Performance measurement
  • CDN scoring
  • Traffic allocation
  • Traffic switching

I know that there is much more knowledge to be explored on this topic. I would be very happy to hear your thoughts and your learnings, so please get in touch with me!