AWS for M&E Blog
Troubleshooting SRT and Zixi with AWS Elemental MediaConnect
Sending video long distances can be challenging without a clear understanding of how your packets get from point A to point B. It also requires robust networking, sufficient bandwidth, and efficient encoding and delivery hardware. In this post, we outline the best practices and troubleshooting techniques required to operate reliable transport using Secure Reliable Transport (SRT) and Zixi with AWS Elemental MediaConnect.
Contribution best practices
High-availability contribution design requires redundancy. Two output-locked encoders, egressing over different network circuits carried by different providers, provide both encoder and network diversity. These streams should go to separate MediaConnect flows in separate Availability Zones (AZs). This basic architecture principle can achieve five nines of availability.
Diagram 1 shows a typical resilient and redundant streaming media workflow consisting of two encoders, two separate network providers (Direct Connect), two MediaConnect receivers, and a standard-channel AWS Elemental MediaLive channel that runs in two AZs.
We recommend using AWS Direct Connect when possible. Direct Connect provides a consistent network experience, with lower latency and higher throughput than internet-based connections. If you have access to Direct Connect links, enable VPC endpoints on your MediaConnect flows to ensure that no traffic egresses through an internet gateway (IGW).
If you must send video over the public internet, be aware that your traffic is routed through multiple networks and service providers, increasing latency and the potential for packet loss. While Direct Connect and virtual private clouds (VPCs) are more reliable than the public internet, your feeds are not totally immune to issues in this configuration.
How the Zixi and SRT protocols work
Zixi and SRT have subtle differences in their technologies, which we will not get into here; however, they share many of the same approaches to delivery. SRT and Zixi are both built on the user datagram protocol (UDP) and are both purpose-built for reliable, low-latency transmission over public networks. Both protocols use automatic repeat request (ARQ), which retransmits dropped packets, and both allow you to set a latency value. This value adds delay to your stream, but provides more time for dropped packets to be retransmitted and recovered. The larger the value, the more delayed your video stream, but the better your recovery from dropped packets.
Diagram 2 shows how packets are sent, received by the receiver, and placed in a buffer for output. Packets lost in transit are “dropped packets”; the receiver sends a re-request for each one. Packets retransmitted and received in time are placed into the stream as “recovered” packets. Packets that are not sent back in time, or are lost again in transmission, are “non-recovered” packets.
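This receive-buffer behavior can be sketched in a few lines of Python. This is an illustrative model only, not the actual SRT or Zixi implementation, and the function name and timing model are our own assumptions: a lost packet costs one extra round trip to retransmit, and it is recovered only if that retransmission still fits inside the latency (buffer) window.

```python
# Illustrative sketch of ARQ receive-buffer behavior (not the actual
# SRT/Zixi implementation). A lost packet is re-requested and counts as
# "recovered" only if the retransmission arrives within the configured
# latency window; otherwise it is "non-recovered".

def classify_packets(arrivals, latency_ms, retransmit_rtt_ms):
    """arrivals maps sequence number -> one-way delay in ms, or None if lost.
    Retransmitting a lost packet costs one extra round trip; it is recovered
    only if that still fits inside the latency window."""
    results = {}
    for seq, delay in arrivals.items():
        if delay is not None and delay <= latency_ms:
            results[seq] = "delivered"
        elif delay is None and retransmit_rtt_ms <= latency_ms:
            results[seq] = "recovered"      # retransmission arrived in time
        else:
            results[seq] = "non-recovered"  # buffer window expired
    return results

# Packet 7 is lost; with a 120 ms buffer and a 90 ms retransmit round trip
# it is recovered, but with a 60 ms buffer it would not be.
arrivals = {5: 20, 6: 25, 7: None, 8: 22}
print(classify_packets(arrivals, latency_ms=120, retransmit_rtt_ms=90))
```

The same trade-off appears here as in the protocols themselves: a larger latency window recovers more lost packets, at the cost of a more delayed output.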
Provider sender recommendations
Latency settings should not be lower than 3x the peak round-trip time (RTT) measured in MediaConnect. If the latency setting is too low, packets may be dropped. In Diagram 2, packet number seven is lost due to network delay and a low receive buffer setting. With a larger buffer latency, more time is available to receive packets and place them into the delivery output.
Latency settings should match the receiver application to achieve predictable results. If a provider sends an SRT stream with a different latency value than the configured MediaConnect minimum latency, the higher of the two values is used. For example, if the provider sends a stream with 500 ms latency, but the MediaConnect minimum latency is set to 6000 ms, the effective latency will be 6000 ms. Senders that expect acknowledgement responses (“acks”) within their configured 500 ms latency can time out or pause sending packets. This behavior may vary by application.
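The two latency rules above can be captured as a small sketch. The helper names here are hypothetical, but the logic follows the recommendations directly: configured latency should be at least 3x the peak measured RTT, and when sender and receiver latencies differ, the higher value wins.

```python
# Sketch of the latency recommendations (hypothetical helper names):
# 1) configured latency should be at least 3x the peak measured RTT;
# 2) when sender and receiver latencies differ, the higher value is used.

def recommended_min_latency_ms(peak_rtt_ms):
    """Latency should not be lower than 3x the peak round-trip time."""
    return 3 * peak_rtt_ms

def effective_latency_ms(sender_latency_ms, receiver_latency_ms):
    """The higher of the sender's and receiver's configured latency wins."""
    return max(sender_latency_ms, receiver_latency_ms)

print(recommended_min_latency_ms(200))   # 600: minimum latency for 200 ms RTT
print(effective_latency_ms(500, 6000))   # 6000: the example above
```

A sender configured for 500 ms that expects acks inside that window may misbehave once the negotiated value jumps to 6000 ms, which is why matching the two values is the safer configuration.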
The recommended send buffer for SRT is 25,000 packets. If a sender has no buffer, packets are sent as soon as provided by the encoder, which can lead to inconsistent downstream timing. Increasing the buffer size may help address packet timing issues. The default SRT buffer size in some sender applications is 8,192 packets, which may not be adequate. If you see continuity or timing issues, increase this value to 25,000 packets, or more if needed.
SRT applications typically contain additional settings that can be changed; consult your application’s documentation for the available settings and what they do. For most streams, the default settings are acceptable.
MediaConnect receiver recommendations
Do not set a primary source for failover unless the secondary is known to be lower in quality or reliability. If the secondary is known to be less reliable (for example, the primary is on Direct Connect and the secondary is through the internet) then you should set a primary. Otherwise, do not set a primary source.
Unless the sources are properly configured for Merge, when a source fails over, the viewer will see content jump forward or backward depending on the relative timing of each source. If no primary is selected, this jump happens only once and the flow stays on the second source. If a primary is selected, the viewer may see two jumps: one during the initial failover, and another when the flow returns to the primary source as soon as the connection is re-established.
Set the Max Bitrate to 2x the expected bitrate. The Max Bitrate needs enough overhead to provide a working buffer for instances when packets arrive at the same time. By default, MediaConnect sets the Max Bitrate to 160 Mb/s, so senders that exceed 80 Mb/s need the Max Bitrate setting raised accordingly.
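This sizing rule can be sketched as a quick check. The helper names are hypothetical; the constants come from the recommendation above (2x headroom, 160 Mb/s default).

```python
# Sketch of the Max Bitrate sizing rule (hypothetical helper names): the
# flow's Max Bitrate should be roughly 2x the expected stream bitrate so
# bursty packet arrival has headroom. MediaConnect's default is 160 Mb/s.

DEFAULT_MAX_BITRATE_MBPS = 160

def required_max_bitrate_mbps(expected_mbps):
    """Max Bitrate should be twice the expected stream bitrate."""
    return 2 * expected_mbps

def needs_override(expected_mbps):
    """Senders above 80 Mb/s exceed the headroom of the 160 Mb/s default."""
    return required_max_bitrate_mbps(expected_mbps) > DEFAULT_MAX_BITRATE_MBPS

print(needs_override(50))   # False: 100 Mb/s fits under the 160 Mb/s default
print(needs_override(100))  # True: set Max Bitrate to at least 200 Mb/s
```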
CIDR rules should not be set to 0.0.0.0/0. In addition to the security implications, restricting CIDR rules to specific IPs (x.x.x.x/32) prevents accidental “double publishing” from misconfigured sources. Double publishing is when you send more than one stream to the same IP and port combination. This can lead to the wrong source showing in the output and incorrect metrics due to parsing multiple sets of incoming packets.
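A simple sanity check along these lines can be written with Python’s standard `ipaddress` module. This is an illustrative sketch with a hypothetical function name, not part of any MediaConnect tooling: it rejects the open 0.0.0.0/0 rule and flags any rule wider than a single host.

```python
# Sketch of a whitelist-CIDR sanity check (hypothetical helper) using the
# standard ipaddress module: reject the open 0.0.0.0/0 rule and prefer
# single-host /32 entries to prevent double publishing.

import ipaddress

def check_whitelist_cidr(cidr):
    net = ipaddress.ip_network(cidr, strict=True)
    if net.prefixlen == 0:
        return "reject: open to the entire internet"
    if net.num_addresses > 1:
        return "warn: allows multiple source IPs, double publishing possible"
    return "ok: single-host rule"

print(check_whitelist_cidr("0.0.0.0/0"))        # reject
print(check_whitelist_cidr("203.0.113.10/32"))  # ok
```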
Monitoring and troubleshooting
If your stream is experiencing issues, there are several things you can do to determine the root cause. MediaConnect offers Amazon CloudWatch metrics that are used to measure the reliability of the stream. MediaConnect “flow metrics” only show the health of the active source. If you are using multiple sources for failover, MediaConnect also provides “source metrics” for individual source inputs. In addition to MediaConnect metrics, it is advisable to investigate the sending device for any relevant log entries or metrics during any interruption to your video stream.
Next are six important metrics for monitoring the health of your flow. These will help you determine whether an issue originates from the source and how to troubleshoot it.
#1 – Disconnects: Source disconnects are typically caused by disruptions from the sender application or network path between the encoder and MediaConnect flow. For example, if an encoder is sending content to a MediaConnect flow as an SRT Caller that is rebooted, then a Disconnect metric with a value of one will be seen in the CloudWatch flow metrics.
If the network sending content is unstable or congested, leading to dropped packets, then the SRT or Zixi protocol will disconnect and a Disconnect will be registered in the flow metrics. This network level disruption is common in cases where the public internet is used to send the encoded video to MediaConnect.
When an SRT or Zixi protocol disconnection occurs, the MediaConnect flow automatically attempts to re-establish the connection. The Connection Attempts metric shows how many of these reconnection attempts the flow made. When coupled with encoder logs, it can provide additional insight into the cause of a disconnection. For example, if a disconnect is logged on the flow, followed by two connection attempts of which only one is received by the far-end encoder, it can be reasonably assumed that a network disruption prevented one of the connection attempts from reaching the encoder.
Other metrics such as Continuity Counter Errors, Jitter, and Round-Trip Time can be used to understand whether network connectivity is unstable enough to have contributed to these disconnections. MediaConnect only has visibility from the point where the source reaches the flow. If there are no logged issues at the sender, engage with your ISP to determine whether any network hops may be faulty and introducing issues.
Momentary disconnects will occur during scheduled maintenance. MediaConnect provides metrics to show when maintenance is scheduled and completes. Upcoming maintenance windows are listed on your AWS Health Dashboard, and notifications are sent to the email address listed on the account.
#2 – Round-trip time (RTT): This value represents the time it takes for a packet to be sent by an encoder and acknowledged by the flow. Verify that the MediaConnect latency setting is greater than three times the RTT. For example, if your flow’s Max Latency value is 2000 ms, confirm that the measured RTT metric is no more than about 650 ms. If this metric sits above, or spikes above, this recommended threshold, the stream is at risk of disconnecting due to timeouts or buffer overflows.
The MediaConnect flow will generate an alert and be displayed in the AWS Management Console when the configured latency value is too low for the detected RTT value. The following alert will be generated in these cases:
Stream Error: Latency may be too low for RoundTripTime. There is an increased risk of NotRecoveredPackets. Please investigate the flow source.
Packet round-trip time can be influenced by several factors: the distance travelled by the packet across the network between encoder and flow, the health of the network path, and the number of network hops (for example, routers) in the path. You can help keep RTT low by placing your encoder and flow as geographically close to each other as possible, so the logical network path taken by the video traffic is as short as possible. For example, if your encoder is located in Seattle (USA), confirm that the MediaConnect flow is in the us-west-2 AWS Region. Another way to reduce RTT is to use a managed network path such as AWS Direct Connect.
#3 – Not recovered packets: This metric indicates that the protocol was unable to recover packets that were lost in transit. Packet loss in general is expected in any network and especially over the public internet. SRT and Zixi protocols are designed with packet re-request mechanisms to recover packets that were dropped due to poor network conditions (see Diagram 2).
Ordinarily, minor levels of dropped packets do not have a detrimental impact on your video stream; however, in cases of extreme network loss or congestion, the SRT or Zixi protocols may not be able to recover all dropped packets. The Consecutive Not Recovered metric shows how many unrecovered packets the flow recorded in a row and can be used to determine the scope of impact.
If you see Not Recovered Packets, first check that the Min/Max Latency setting is correct for the measured RTT of your sources. If these are set correctly, confirm that your source encoder is not producing “behind real-time” alerts or log entries and has no output connection errors. If the encoder is healthy, work with your ISP to validate that routing is healthy between your sender and the MediaConnect receiver. Running a traceroute from your sender to the MediaConnect destination IPs can help you see whether there are delays at certain network hops.
#4 – Jitter: This metric represents the inter-arrival time of packets at the flow. Jitter is not just network related and can be introduced in many sections of the workflow: the encoder, the sender application, repeater senders/receivers, or the network. In general, this value should be no more than 40 ms, and ideally less than 10 ms.
When your sources’ jitter regularly spikes or is consistently high, packets arrive in bursts at the receiving device, which can result in packet loss and continuity errors. Jitter above 500 ms will cause disconnects and/or result in discontinuities and frame loss for downstream encoders such as AWS Elemental MediaLive.
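One simple way to reason about jitter is as variation in packet inter-arrival time. The sketch below uses a mean-absolute-deviation model with a hypothetical function name; MediaConnect’s exact metric computation may differ, but the intuition is the same: evenly paced packets show near-zero jitter, bursts show high jitter.

```python
# Sketch of measuring jitter as variation in packet inter-arrival time
# (a simplified model; MediaConnect's exact computation may differ).

def inter_arrival_jitter_ms(arrival_times_ms):
    """Mean absolute deviation of inter-arrival gaps from their average."""
    gaps = [b - a for a, b in zip(arrival_times_ms, arrival_times_ms[1:])]
    mean_gap = sum(gaps) / len(gaps)
    return sum(abs(g - mean_gap) for g in gaps) / len(gaps)

# Packets paced evenly every 10 ms show zero jitter; a burst after a long
# gap shows high jitter.
print(inter_arrival_jitter_ms([0, 10, 20, 30, 40]))   # 0.0
print(inter_arrival_jitter_ms([0, 10, 45, 46, 47]))   # bursty arrival
```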
It is recommended that the Maximum Bitrate and Latency values configured on the MediaConnect flow are large enough to accommodate the extra bursts of packets caused by jitter. Setting the Max Bitrate value to double the encoder’s configured bitrate ensures there is enough overhead buffer on the flow to absorb spikes in arriving packets. For SRT, verify the latency value is high enough to prevent buffer overflows, as this value represents the buffer available.
If all configurations are correct and you still see high jitter, there may be packet or frame timing issues upstream. In some cases, packet timing issues originate from the sender’s encoder. Check the sender encoder logs for dropped or delayed frames, and make sure your sender meets the hardware requirements and configuration needed to output frames in real time.
Packet captures can be helpful to isolate the source of the jitter. A packet capture (“pcap”) logs all packets on the network interface of the encoder and can help show whether jitter exists at the encoder level. If jitter cannot be detected at the encoder but is being reported by the flow, the network path is likely introducing it.
#5 – Continuity counter (CC) errors: Video packets within the transport stream are expected to be received in the correct order by the MediaConnect flow. CC errors are logged when a packet arrives out of the expected order or does not arrive at all. CC errors occur alongside high NotRecoveredPackets loss, or when the transport stream contains errors introduced during encoding or delivery by upstream applications. CC errors are classified as a TR 101 290 Priority 1 metric, which signifies a critical stream disruption and may manifest as pixelation in decoded video.
Having a sufficiently large buffer (Latency) for the SRT and Zixi protocols to recover these packets is critical. However, under extreme network conditions even these protocols cannot compensate for the disruption. A stable, uncontended, uncongested network link is vital for mitigating these errors.
As described previously, start at the encoder and look for errors that may indicate output errors, dropped frames, or behind real-time alerts. If there are no issues, work with your ISP to verify that the routes are not introducing delays. Additionally, you can try a different protocol: if you are using SRT and can send Zixi, try Zixi to see if there is improvement. Zixi and SRT are similar protocols; however, SRT has more customization and configuration complexity.
#6 – Failover switches: If two sources are configured on the flow and no content is received on the active source for more than 500 ms, a failover switch occurs. Depending on the frequency of these switches, this may lead to discontinuities and continuity errors on downstream encoders or receivers.
Normally, failover switches correlate to a source disconnect and you can confirm this by looking at the CloudWatch metrics for the flow with a resolution as low as possible (one second recommended). However, sometimes a failover switch can occur without a correlated disconnect. The network path may have been disrupted, but not long enough to cause an SRT or Zixi protocol level disconnect.
In these cases, the Source Bitrate or Source Total Packets metrics can be helpful in determining whether a momentary loss of traffic into the flow triggered the disconnect. For constant bitrate (CBR) sources this is straightforward to verify, but it may not be easily discernible for variable bitrate (VBR) sources.
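For a CBR source, this check amounts to scanning the bitrate time series for samples well below the nominal rate. The sketch below is illustrative, with a hypothetical function name and threshold; it assumes one sample per second, matching the one-second metric resolution recommended above.

```python
# Sketch of scanning a Source Bitrate time series (one sample per second)
# for momentary traffic loss on a CBR source: any sample far below the
# nominal bitrate marks a candidate failover trigger. The 0.5 threshold
# is an illustrative assumption, not a MediaConnect setting.

def find_traffic_gaps(bitrate_samples_mbps, nominal_mbps, threshold=0.5):
    """Return sample indices where bitrate fell below threshold x nominal."""
    floor = nominal_mbps * threshold
    return [i for i, b in enumerate(bitrate_samples_mbps) if b < floor]

# A 20 Mb/s CBR source with a one-second dip to near zero at index 3.
samples = [20.1, 19.9, 20.0, 0.4, 20.0, 20.2]
print(find_traffic_gaps(samples, nominal_mbps=20))  # [3]
```

For VBR sources the nominal rate varies, so a fixed floor like this produces false positives; correlating with Source Total Packets is the more reliable check there.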
Finally, check whether the failover is due to a jitter spike above ~500 ms. Short spikes above 500 ms will trigger a disconnect, but the CloudWatch metric resolution may not show them if they are brief enough or fall between sampling periods.
Conclusion
We’ve gone over configuring and troubleshooting your senders and AWS Elemental MediaConnect receivers. While not exhaustive, this covers the most frequent problems and their causes.
Understanding how the SRT and Zixi transport protocols work, following configuration best practices, and implementing solid monitoring and alerting can help you minimize stream downtime and maximize availability.
Contact an AWS Representative to learn how we can help accelerate your business.