AWS for Industries

Disaster Recovery 5G Core Network on AWS

Communication Service Providers (CSPs) in the telecom industry are looking for more use cases in which to leverage the public cloud, and 5G core network deployments on AWS are gaining attention through practical use cases such as private networks for enterprises and brand-new 5G network creation. As emphasized in the 5G Network Evolution with AWS white paper, the AWS global cloud infrastructure of AWS Regions, Availability Zones (AZs), AWS Local Zones, and AWS Outposts can provide an effective and elastic environment to host 5G core networks according to the characteristics of each network function (NF). For example, the user plane function (UPF) can run on AWS Local Zones or AWS Outposts for low-latency processing.

Among the various use cases for AWS to host 5G NFs, one of the strongest cases for CSPs that have already built a 5G core network is disaster recovery (DR), or more broadly, creating a disaster-resilient network using AWS. This 5G DR network is intended to provide scalable and immediate measures against a 5G NF failure, a complete data center outage, or a maintenance window. Because this DR network is an additional environment that is only needed in response to an unexpected failure, a disaster-driven outage, or planned maintenance, its design must minimize resource costs through fast scaling in and out. Compared to building such a redundant network in a traditional telco data center, AWS can help CSPs minimize costs and energy consumption during normal operation, while still allowing them to react promptly to network demand changes, such as a burst traffic surge or a maintenance event.

This post outlines how AWS can be leveraged as an additional virtual data center environment for the 5G network to achieve “disaster-resiliency” and “disaster-recovery” objectives. It focuses on utilizing 3GPP high-availability concepts on AWS along with related AWS services, such as autoscaling, automation tools, and the cost-optimization aspects of the network. Specifically, Amazon Elastic Compute Cloud (Amazon EC2) Auto Scaling, together with the horizontal pod autoscaling and cluster autoscaling features of Amazon Elastic Kubernetes Service (Amazon EKS), can minimize the footprint of a Container-based Network Function (CNF) in the DR VPC and then scale out quickly to cope with sudden traffic surges and spikes.

In addition, to maximize cost and energy savings while NFs running on AWS serve swing-over traffic (traffic migrated to the AWS Cloud that was previously destined for the original on-premises site), AWS Graviton instances can be considered to host 5G core NFs. This post first explains the DR models and strategies for general applications on AWS and how they apply to a 5G network. It then presents the key ideas: how the 3GPP architecture can be leveraged toward this DR objective, and how AWS services such as Amazon EC2 Auto Scaling and the Cluster Autoscaler can help with implementation, illustrated with some open-source examples.

DR model for 5G core network in AWS

As discussed in posts and white papers about DR in the cloud, there are two objectives in DR: the Recovery Time Objective (RTO) and the Recovery Point Objective (RPO). RTO is the acceptable delay between the interruption of service and the restoration of service, while RPO is the maximum acceptable amount of time since the last data recovery point. For general applications running on AWS, well-known DR services include AWS Elastic Disaster Recovery (AWS DRS) and Amazon Route 53 Application Recovery Controller (Route 53 ARC).

However, the 5G core network applications focused on in this post have stronger requirements for networking interfaces and protocols based on the 3GPP standard. Furthermore, those services are not always applicable to all components of the core network. Therefore, even though these services can be applied to specific components or elements of NFs, this post aims to take a more holistic view. We focus on how AWS services can aid in DR implementation within the context of the 3GPP standard architecture.

In the case of 5G NFs, the AMF, SMF, and UPF, like most core functional components, matter most for RTO, since these components play a key role in the fast recovery and restoration of 5G voice and data services. On the other hand, the UDM matters for both RPO and RTO because it holds subscriber profiles and information. Each NF has a different objective focus, so different DR strategies can be applied. The following figure shows the four DR strategies highlighted in the DR whitepaper. From left to right, the graphic shows how the DR strategies incur different RTO and RPO. Because telco 5G core NFs serve mission-critical services, the RTO must be tighter than what the following figure represents for general applications.

Figure 1: General non-Telco DR strategies

For example, as mentioned previously, UDM requires both RTO and RPO to be near real-time. Therefore, when you build a DR site for the UDM on AWS, you might have to keep the UDM on AWS always active, with synchronization to the UDM in the legacy data center. In this case, a Hot-standby (Active-Active) strategy is more appropriate.

Meanwhile, for the other NFs, you can utilize one of the following strategies based on the use case and characteristics of the NF: Warm-standby, Pilot Light, or Backup & Restore. Backup & Restore is applicable to non-mission-critical, lower-priority use cases without a tight RTO (< 1 hour). As long as you have pre-established AWS Direct Connect connectivity between your data center and AWS (otherwise, Site-to-Site VPN can be used as an alternative, with bandwidth limitations and stability caveats), you can leverage AWS tools such as AWS CloudFormation, the AWS Cloud Development Kit (AWS CDK), and AWS CodePipeline to achieve this “immediate instantiation of NFs” task, building on the benefits of Infrastructure as Code (IaC). For more detail on this DR strategy, refer to the post Disaster Recovery Architecture on AWS, Part II. In addition, for building a continuous integration/continuous delivery (CI/CD) pipeline for 5G NF deployment on AWS that helps quick recovery of service, refer to the AWS white paper on CI/CD for 5G Networks on AWS.
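
As a minimal sketch of this “immediate instantiation” idea, the following Python snippet launches a pre-staged CloudFormation stack with boto3 and waits for it to complete. The template URL, stack name, and parameters are hypothetical placeholders, not artifacts from this post.

```python
import boto3

# Hypothetical example: instantiate a pre-staged NF stack during a DR event.
# The template URL and parameter names are illustrative placeholders.
cfn = boto3.client("cloudformation", region_name="us-east-2")

response = cfn.create_stack(
    StackName="dr-5gc-smf",
    TemplateURL="https://example-bucket.s3.amazonaws.com/smf-dr-template.yaml",
    Parameters=[
        {"ParameterKey": "VpcId", "ParameterValue": "vpc-0123456789abcdef0"},
        {"ParameterKey": "InstanceType", "ParameterValue": "m5.8xlarge"},
    ],
    Capabilities=["CAPABILITY_NAMED_IAM"],
)

# Block until the stack is fully created before shifting traffic to it.
waiter = cfn.get_waiter("stack_create_complete")
waiter.wait(StackName=response["StackId"])
```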

Cold-standby can be another option to provide a cost-effective DR site for such a non-mission-critical 5G network use case. In this strategy, all EC2 instances are pre-created but kept in the stopped state, so the DR site serves no traffic during normal operation; this makes it not only faster to activate than Backup & Restore, but also more cost effective than Warm-standby. On the other hand, Warm-standby is the most practical way to build a DR 5G network on AWS considering the RTO of macro telecom voice and data services, which are mission-critical. In this strategy, the majority of 5G NFs in the DR site on AWS handle a minimal amount of traffic using a minimal deployment footprint; then, based on the configured scaling policies, the site can grow to serve more traffic during a traffic cutover. This helps make your 5G network disaster-resilient while preserving service continuity. When 5G NFs are implemented on Amazon EKS, generic Auto Scaling group functionality may not be enough to absorb an instantaneous traffic surge, because the required RTO is shorter than the typical response time of Kubernetes autoscaling actions. Therefore, the later sections of this post introduce some effective implementation techniques to achieve this faster Warm-standby in an Amazon EKS environment. Last but not least, for cost optimization, Graviton instances give a significant price-performance benefit and address sustainability concerns by providing increased energy savings.

3GPP-defined resilience mechanisms and use of AWS

3GPP has defined a concept for the 5G core network in TS 23.501 that helps build network resilience, called an NF Set. In this concept, an NF can be deployed as a single instance, or multiple instances of the same NF can be combined to form an NF Set. This enables higher redundancy and scalability, as an NF instance can be replaced by an alternative NF instance within the NF Set to provide the requested service in cases of failure or sudden bursts of service requests. In the 5G core network, many NFs support the NF Set concept, such as the AMF, SMF, and UDM. However, for the DR use case, the focus is on utilizing the AMF Set, because the AMF receives the N2 traffic from the gNodeB and subsequently invokes services from the other NFs of the DR site hosted in the Region. As illustrated in the following figure, the AMF Set configuration for a given Tracking Area Code (TAC) helps AMFs in different data centers share traffic load, as well as switch traffic over during failure recovery, across three different data centers. For the AMF Set, 3GPP also defines the concept of AMF Load Balancing, which uses a Weight Factor for each AMF to steer UE registrations based on the capacity of the AMF in each data center.

Figure 2: 3GPP NF Set deployment across multiple data centers

Although an NF Set such as the AMF Set enables a group of NFs to share traffic load across data centers, the Network Repository Function (NRF) plays the role of localizing traffic for the NFs other than the AMF. More specifically, per the 3GPP standards, the NRF is responsible for maintaining and providing information about the status of NFs that provide services (Producers) to NFs that request services (Consumers), using the Nnrf_NFManagement (nnrf-nfm) and Nnrf_NFDiscovery (nnrf-disc) services. The Nnrf_NFManagement service supports the “NFRegister” operation, which all 5G core network NFs use to register their respective profiles with the NRF. During the registration process, an NF sends a request to the NRF containing its NFProfile, which provides mandatory information such as nfInstanceID, nfType, nfStatus, and FQDN or IPv4/IPv6 address, and can optionally provide information related to the NF Set it belongs to, as well as its priority and locality.
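
To make the registration flow concrete, here is a minimal sketch of an NFRegister call following the Nnrf_NFManagement API defined in 3GPP TS 29.510. The NRF endpoint, addresses, NF Set ID, and the locality/priority values are illustrative assumptions, not values from this post.

```python
import uuid
import requests

NRF_API_ROOT = "https://nrf.example.net"  # placeholder apiRoot
nf_instance_id = str(uuid.uuid4())

nf_profile = {
    "nfInstanceId": nf_instance_id,
    "nfType": "SMF",
    "nfStatus": "REGISTERED",
    "fqdn": "smf1.dr-site.example.net",
    "ipv4Addresses": ["10.0.1.10"],
    # In TS 29.510, lower 'priority' values indicate higher priority, so a
    # DR-site NF registers with a higher numeric value than on-premises NFs.
    "priority": 100,
    "locality": "aws-dr-site",          # hypothetical locality label
    "nfSetIdList": ["set1.smfset.5gc.mnc012.mcc345"],
}

# NFRegister: PUT {apiRoot}/nnrf-nfm/v1/nf-instances/{nfInstanceID}
resp = requests.put(
    f"{NRF_API_ROOT}/nnrf-nfm/v1/nf-instances/{nf_instance_id}",
    json=nf_profile,
)
resp.raise_for_status()
```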

Figure 3: NF Registration to NRF and NF Discovery to NRF

The Nnrf_NFDiscovery service supports the “NFDiscover” operation, with which “Consumer” 5G NFs retrieve information about “Producers” that offer the required services. Typically, to achieve higher resilience in a production network, CSPs deploy multiple instances of producer NFs. 3GPP provides options for consumer NFs to discover the target producer NFs: the consumer can either request a full list of all producer NFs that match the criteria of the required services, or narrow the returned list by providing additional query parameters. One query parameter that a consumer NF can use to select the target producer NF is the priority information returned by the NRF for the different producer NFs; another is “preferred-locality”. These two key concepts, the AMF Set and NRF preferred-locality, can be exploited to create a virtual data center on AWS for the DR use case, especially for the Pilot Light, Warm-standby, or Hot-standby DR strategies. For example, if the Weight Factor of the AMFs is configured with identical values across the existing data center and the virtual DR data center on AWS, then the DR site works in Hot-standby mode.
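
The discovery side can be sketched in the same way, again following the Nnrf_NFDiscovery API of 3GPP TS 29.510: an AMF (consumer) queries for SMF (producer) instances and passes “preferred-locality” so the NRF favors producers in the DR site. The endpoint and the “aws-dr-site” locality value are the same hypothetical placeholders used above.

```python
import requests

NRF_API_ROOT = "https://nrf.example.net"  # placeholder apiRoot

# NFDiscover: GET {apiRoot}/nnrf-disc/v1/nf-instances with query parameters
resp = requests.get(
    f"{NRF_API_ROOT}/nnrf-disc/v1/nf-instances",
    params={
        "target-nf-type": "SMF",
        "requester-nf-type": "AMF",
        "preferred-locality": "aws-dr-site",
    },
)
resp.raise_for_status()

# The NRF returns matching NF profiles; the consumer can prefer instances in
# its own locality and then sort the remainder by 'priority'.
for profile in resp.json().get("nfInstances", []):
    print(profile["nfInstanceId"], profile.get("locality"), profile.get("priority"))
```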

Traffic shift from Telco data center to the VPC during the event

In a typical on-premises production deployment, a 5G core network is deployed with a minimum of one DR site. The number of DR sites in a network depends on the strategy adopted by the CSP; it can follow a 1+1, N+1, or N+K model. Regardless of the number of DR sites, the resources dimensioned for an on-premises DR site are always at least equal to those required for taking over the traffic of the failed active site(s).

The CSP can use the NF Set mechanisms to distribute the group of NF instances across the active sites (existing data centers) and DR sites (the virtual data center on AWS). For the DR site on AWS, the 5G NFs that support the 3GPP NF Set construct can become part of the already deployed on-premises NF Set. The 5G NFs that cannot be added to an NF Set can be deployed as new instances. Depending on the NRF deployment model, centralized or distributed, the DR site NFs can either register with the on-premises centralized NRF or with a local NRF on AWS. For either NRF deployment option, in the Warm-standby strategy, it is essential that these DR site NFs use a lower priority (and a lower Weight Factor in the case of the AMF) than the on-premises NFs during the registration operation. This enables the active on-premises site to continue handling traffic; only in the event of a disaster, an NF failure, or a maintenance window does the traffic shift to the DR site on AWS.

It is also important to leverage the scalability and elasticity of AWS to grow the capacity of the NFs once this DR site becomes active. In addition, using the “preferred-locality” information during the NF registration process makes it possible for communication to remain within the DR site 5G NFs, which reduces latency and results in better response times for service requests.

Figure 4: 5GC Network DR site on AWS

Considering that in the sunny-day scenario the majority of the traffic is handled by the on-premises active site, a CSP can start with a minimal DR site footprint on AWS; then, depending on the type of failure, single or multiple NFs can utilize the fast-autoscaling mechanisms described in the following section to acquire the right resources. This strategy covers not only complete on-premises site failures, but also partial failures and maintenance windows of on-premises deployments.

Fast autoscaling to cope with traffic surge

As explained in the previous sections, with the Warm-standby strategy the 5G core NFs are deployed with a minimal footprint on AWS and handle a fraction of the user traffic. However, in the case of a primary site fault, a large traffic surge can be expected as the traffic shifts from the primary site to the DR site on AWS. In these scenarios, it is critical to quickly autoscale both the 5G NFs and the underlying AWS compute capacity. 5G NF autoscaling is usually based on the Kubernetes Horizontal Pod Autoscaler (HPA); however, scaling out pods alone does not address worker node scale-out. Given that 5G NFs are deployed on AWS via Amazon EKS, the solution to this challenge is tied to Auto Scaling groups: Amazon EKS utilizes an Auto Scaling group to deploy and manage Kubernetes worker nodes, including scaling. However, Amazon EKS-driven scaling of the Auto Scaling group via the Cluster Autoscaler can be too slow for 5G scale-out needs, because the Cluster Autoscaler is reactive in nature and triggers scaling only after pods are discovered to be unschedulable. Instead, a cold-standby pattern built on the Auto Scaling group Standby state can be effectively utilized to quickly scale out Amazon EKS worker nodes as the traffic surge occurs.
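
As a rough sketch of how such a cold-standby pool might be prepared (the group name and instance IDs are placeholders, and this is one possible implementation of the pattern, not the repo's exact code): the primed worker nodes are moved into the Auto Scaling group Standby state so the group does not replace them, and are then stopped.

```python
import boto3

asg = boto3.client("autoscaling", region_name="us-east-2")
ec2 = boto3.client("ec2", region_name="us-east-2")

ASG_NAME = "eks-dr-workers"                      # placeholder group name
NODE_IDS = ["i-0123456789abcdef0", "i-0fedcba9876543210"]  # placeholder IDs

# Move the primed instances to Standby so the Auto Scaling group neither
# terminates nor replaces them while they are powered off. The group's
# minimum size must allow the desired-capacity decrement.
asg.enter_standby(
    InstanceIds=NODE_IDS,
    AutoScalingGroupName=ASG_NAME,
    ShouldDecrementDesiredCapacity=True,
)

# Power the nodes off to avoid paying for compute while they are unused.
ec2.stop_instances(InstanceIds=NODE_IDS)
```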

Figure 5: Amazon EKS Worker Node state change to support faster scale-out with cold-standby model

Moreover, the use of cold-standby allows Amazon EKS worker nodes not only to be pre-configured and “primed” for hosting the workloads, but also powered off to save on costs while not in use. This is especially useful for workloads that require special tuning at launch time, usually automated via user-data scripts, such as those that use Multus and DPDK. The “priming” of these worker nodes can go as far as pre-downloading pod container images from Amazon Elastic Container Registry (Amazon ECR) or elsewhere, so that at pod start-up the time for the container to reach the ready state is reduced. Automation of cold-standby worker nodes can be achieved via an API call to a custom AWS Lambda function, as presented in the GitHub repo.
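
A minimal sketch of what such a wake-up Lambda function could look like is shown below; the actual function in the referenced repo may differ, and the group name is an illustrative placeholder. It finds the group's Standby instances, starts them, waits until they are running, and returns them to service so they rejoin the EKS data plane (the Lambda timeout must be long enough for the wait).

```python
import boto3

asg = boto3.client("autoscaling")
ec2 = boto3.client("ec2")

ASG_NAME = "eks-dr-workers"  # placeholder group name

def handler(event, context):
    # Find this Auto Scaling group's instances currently in Standby.
    groups = asg.describe_auto_scaling_groups(AutoScalingGroupNames=[ASG_NAME])
    standby = [
        i["InstanceId"]
        for g in groups["AutoScalingGroups"]
        for i in g["Instances"]
        if i["LifecycleState"] == "Standby"
    ]
    if not standby:
        return {"resumed": []}

    # Power the primed nodes back on, wait until running, then return them
    # to service in the Auto Scaling group.
    ec2.start_instances(InstanceIds=standby)
    ec2.get_waiter("instance_running").wait(InstanceIds=standby)
    asg.exit_standby(InstanceIds=standby, AutoScalingGroupName=ASG_NAME)
    return {"resumed": standby}
```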

Figure 6: Automation of Cold-Standby Amazon EKS worker node wake-up process

Automating autoscaling based on custom metrics

With the Warm-standby strategy, as the primary site goes down for maintenance or due to a disaster or fault, the traffic surge is evident in application custom metrics; for example, the AMF metric for incoming subscriber registration attempts surges. A solution that scrapes application metrics via the AWS Distro for OpenTelemetry (ADOT) can be used to trigger the powering-on of the cold-standby nodes. One such solution is depicted in the following figure, where KEDA collects the custom metric from Amazon Managed Service for Prometheus and, based on a well-defined trigger and threshold, kicks off a Kubernetes job that invokes the API call to bring the worker nodes online and out of the standby state.
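
The body of that KEDA-triggered Kubernetes job can be as small as a script that invokes the wake-up Lambda function. A possible sketch is below; the function name and payload fields are illustrative assumptions.

```python
import json
import boto3

lambda_client = boto3.client("lambda", region_name="us-east-2")

# Invoke the wake-up function sketched earlier (name is a placeholder).
response = lambda_client.invoke(
    FunctionName="eks-dr-worker-wakeup",
    InvocationType="RequestResponse",
    Payload=json.dumps({"reason": "registration-attempt-surge"}),
)
print(json.loads(response["Payload"].read()))
```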

Figure 7: Fast auto-scaling using custom metrics and KEDA

Cost comparison and savings between Hot-standby and Warm-standby DR options

Although comparing the TCO of AWS-based and on-premises-based geo-redundancy solutions for the 5G core is a more complex calculation, you can examine the basic cost savings of combining a Warm-standby DR strategy with the aforementioned mechanism of powering on cold-standby EC2 instances on AWS. For example, consider a scenario in which running a Hot-standby (Active-Active) site that can handle the full load requires six (6) m5.8xlarge EC2 instances, while the same workload in Warm-standby mode requires two (2) m5.8xlarge EC2 instances, as the site handles only a fraction of the user traffic. Assume also that a 100% traffic shift to the DR site on AWS happens for an average of four hours every month. Therefore, for those four hours every month, the DR site must run four (4) additional m5.8xlarge EC2 instances to handle all of the user traffic.

Using the AWS Pricing Calculator for the Hot-standby option with EC2 Instance Savings Plan pricing in US East (Ohio), six always-on m5.8xlarge EC2 instances (each with 64 GB of EBS storage) cost $47,830.32 per year as of the date of writing this post. However, based on the assumptions above and using the AWS Pricing Calculator for the Warm-standby option, the cost would be $16,914.60 per year. This consists of $15,943.44 annually for two (2) always-on m5.8xlarge EC2 instances, $294.96 annually for running four (4) additional m5.8xlarge EC2 instances for four hours every month at on-demand pricing, and $676.20 annually for four (4) EBS volumes of 64 GB each.
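
To make the arithmetic explicit, the following quick check uses only the figures quoted above to reproduce the Warm-standby total and the resulting savings:

```python
# Annual costs as quoted above (US East Ohio, EC2 Instance Savings Plan
# pricing at the time of writing).
hot_standby_annual = 47830.32          # 6 always-on m5.8xlarge + EBS

warm_always_on = 15943.44              # 2 always-on m5.8xlarge
warm_on_demand = 294.96                # 4 m5.8xlarge, 4 hrs/month, on-demand
warm_ebs = 676.20                      # 4 x 64 GB EBS volumes
warm_standby_annual = warm_always_on + warm_on_demand + warm_ebs

savings = 1 - warm_standby_annual / hot_standby_annual
print(f"Warm-standby: ${warm_standby_annual:,.2f}/yr, savings {savings:.1%}")
# -> Warm-standby: $16,914.60/yr, savings 64.6%
```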

As can be seen from the following graph, the Warm-standby mode offers savings of nearly 65% compared to the Hot-standby (Active-Active) mode of operation.

Figure 8: Cost comparison of Hot-standby vs Warm-standby for Six (6) m5.8xlarge EC2 instances

For their 5GC network, CSPs can apply the above approach by inputting their specific quantities of EC2 instances into the AWS Pricing Calculator for EC2 to estimate the potential cost reduction from implementing the Warm-standby DR strategy.

Summary

Since the 5G mobile core network serves mission-critical services, such as voice calls and data streaming, you must ensure the disaster resiliency of the service and also have the capability for prompt disaster recovery of network components. More specifically, for better isolation from faults and disasters, it is reasonable to consider building a DR 5G network on the cloud rather than in legacy CSP data centers. In addition, if this DR 5G network is mainly used for a limited time period (during the recovery of service, or to absorb a spike or traffic burst), it is a good fit for the cloud's pay-as-you-go model. AWS can help CSP customers by providing not only an environment for building this DR virtual data center, but also various automation tools and scaling capabilities for the network, as demonstrated with the GitHub repo sample in this post. Using this fast scale-out capability along with the right type and size of instance, such as Graviton instances, maximizes the cost and energy savings of building a DR 5G network for CSP customers. For more information about telco 5G use cases on AWS, visit aws.amazon.com/telecom/contact-us.

Ashutosh Tulsi

Ashutosh Tulsi is a Principal Solutions Architect in the AWS Worldwide Telecom Business Unit working with Communication Service Providers (CSPs) and Telco ISV Partners. His goal is to enable CSPs to achieve operational and cost efficiencies by providing solutions that assist in migrating 4G/5G networks to the AWS Cloud.

Neb Miljanovic

Neb Miljanovic is an AWS Telco Partner Solutions Architect supporting the migration of telecommunication vendors into the public cloud space. He has extensive experience in 4G/5G/IMS core architecture, and his mission is to apply that experience to the migration of 4G/5G/IMS Network Functions to AWS using cloud-native principles.

Dr. Young Jung

Dr. Young Jung is a Principal Solutions Architect in the AWS Worldwide Telecom Business Unit. He specializes in the NFV space and works with various global telecom partners, helping them create cloud-native 4G/5G NFV solutions in the AWS environment.