AWS for Industries

Cloud-Native NFV Orchestration on AWS

With mobile traffic increasing almost 10% quarter over quarter (Ericsson Mobility Report) and technology evolving at a fast pace, communication service providers (CSPs) are under pressure along multiple dimensions. Some of the challenges are as follows:

  1. Cost Savings (TCO) and productivity: The increase in traffic corresponds to higher bandwidth and processing capacity requirements. With the proliferation of unlimited plans, this increase doesn’t necessarily translate into new revenue. Hence, to maintain profitability it is imperative to lower the cost of operations by avoiding complex interactions between network functions, and to increase revenue by enabling new enterprise use cases.
  2. Agility: Since traffic demand fluctuates, CSPs are considering elastic scaling to provide the required capacity while minimizing unused idle resources. This requires a versatile network orchestrator that can stand up, tear down, scale out, or scale in both network functions and infrastructure.
  3. Assurance and Resilience: With more and more safety- and business-critical services relying on connected infrastructure, the connectivity service must meet SLAs. When failures happen, an intelligent orchestration system should provide automated self-healing while keeping customer experience within the desired SLA bounds.

Due to the importance of orchestration, the European Telecommunications Standards Institute (ETSI) created the Network Functions Virtualization (NFV) Management and Orchestration (MANO) framework in 2013 (https://www.etsi.org/technologies/nfv). This workgroup examined the concepts and issues of managing network functions and infrastructure in a virtualized world of virtual machines (VMs) and produced a prescriptive method. This helped create a unified framework that has been widely adopted as the reference architecture for management and orchestration in the CSP virtualization context.

In the late 2000s and early 2010s, containers became popular for software development and deployment. Containers are units of software that package code and all of its dependencies so that an application runs reliably from one computing environment to another. Due to the additional benefits offered by containers over virtualization, there is an increasing desire to containerize virtualized network functions (VNFs). An application often consists of multiple containerized modules, and these modules need to be managed. Kubernetes, also known as K8s, is an open-source system for automating the deployment, scaling, and management of containerized applications. Network functions have additional requirements beyond those of typical cloud applications, and the ETSI MANO framework was specifically designed to address these specialized requirements in the context of virtualized network functions. As we move to containerization, it becomes important to map the management and orchestration of containerized network functions (CNFs) to the ETSI MANO framework so that no requirements are missed and containerized orchestration continues to work well without mandating changes in the business and operations support systems.

In this blog, I will discuss the NFV orchestration requirements and specifications as set by ETSI in the context of Kubernetes and cloud-native implementations. I will also evaluate the orchestration needs in the various phases of a network implementation, and provide a way to implement agile and resilient automation and orchestration solutions using AWS container and serverless constructs.

ETSI MANO and Kubernetes

The traditional ETSI MANO framework was developed in the context of virtual machines (VMs). Figure 1 shows the ETSI MANO architecture along with the major functions performed by each of its components. Network slice management and its associated functions, such as the Network Slice Subnet Management Function (NSSMF) and the Network Function Management Function (NFMF), are part of the 3GPP specifications and are beyond the scope of the MANO framework; however, they are shown in Figure 1 to provide a complete view of network and service management.


Figure 1: ETSI MANO Framework with 3GPP Management

When applications are containerized (instead of virtualized), they are referred to as containerized network functions (CNFs). In Kubernetes, containerized applications are executed as Pods. Kubernetes manages the lifecycle of Pods, and scales them up and down. The latter is done by changing the declared constraints in configuration, such as the minimum, maximum, and desired number of replicas in its Deployment constructs and the number of infrastructure nodes in a node group configuration.
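As a minimal sketch of this declarative model, the following Python snippet (using the open-source Kubernetes client) declares a new replica count for a hypothetical CNF Deployment and lets Kubernetes reconcile the running Pods toward it; the Deployment name and namespace are placeholders.

# Illustrative sketch: scaling a CNF Deployment with the Kubernetes Python client.
# The Deployment name "amf" and namespace "core" are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()   # reads the local kubeconfig (e.g., created by `aws eks update-kubeconfig`)
apps = client.AppsV1Api()

# Declare the new desired state; Kubernetes reconciles the running Pods to match it.
apps.patch_namespaced_deployment_scale(
    name="amf",
    namespace="core",
    body={"spec": {"replicas": 5}},
)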

In the Kubernetes context, the Virtual Infrastructure Manager (VIM) is responsible for placing Pods and containers on nodes, which can be bare-metal machines or a virtualized platform. Operation and management of the infrastructure and platform are intrinsically provided by common cloud tools, such as Amazon CloudWatch and Kubernetes ConfigMaps. When there are multiple VIMs/clusters, the NFV Orchestrator (NFVO) manages and coordinates the resources from the different VIMs. The NFVO also manages the creation of an end-to-end network function chain involving multiple CNFs that might be managed by different Virtual Network Function Managers (VNFMs).

Due to scale, organization, and security considerations, one Kubernetes cluster is generally insufficient for a network operator, and a multi-cluster infrastructure management solution is needed. There are many open-source and vendor-provided solutions for managing multiple Kubernetes clusters. In AWS, the control plane of each cluster is managed by Amazon Elastic Kubernetes Service (Amazon EKS), which makes sure that every cluster’s control plane stays healthy and scales as needed. The Amazon EKS console provides a single pane of glass to manage multiple clusters, with each cluster exposing its own API endpoint to which kubectl and helm commands are directed.
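For example, a thin orchestration layer could enumerate the clusters in an account and look up each cluster’s API endpoint before pointing kubectl or helm at it. The following boto3 sketch assumes default AWS credentials and Region configuration.

# Illustrative sketch: listing Amazon EKS clusters and their API endpoints with boto3.
import boto3

eks = boto3.client("eks")

for name in eks.list_clusters()["clusters"]:
    cluster = eks.describe_cluster(name=name)["cluster"]
    # Each cluster exposes its own API endpoint; kubectl/helm contexts point here.
    print(name, cluster["status"], cluster["endpoint"])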

In light of the above, one way to map Kubernetes and its associated constructs to the ETSI MANO framework is depicted in Figure 2.


Figure 2: Role of Kubernetes in ETSI MANO framework

From the above discussion and figures, it is clear that the ETSI MANO NFV framework doesn’t map one-to-one to the Kubernetes framework. This creates some difficulties in applying the understanding gained from ETSI MANO NFV to Kubernetes operational environments. Some of these difficulties are discussed next.

Framework mapping challenges

The main difficulty in mapping the ETSI MANO model to the Kubernetes framework is that the former takes an imperative approach to operations, whereas the latter takes a declarative, intent-based approach. In ETSI MANO, there are direct procedures between the NFVO, the VNFM, and the VIM, whereas in the Kubernetes framework all of these communications are expressed through artifacts, APIs, and manifests that describe the desired end state of operation.

The ETSI MANO architecture uses lifecycle operation granting, where the VNFM asks the NFVO for permission before initiating changes, even within defined parameters. The Kubernetes scheduler, along with constructs such as the Auto Scaling group in Amazon EKS, is well equipped to efficiently manage the demand on resources, and in the Kubernetes framework the NFVO should leave that management to them. Another complexity is the configuration of VNFs once they are instantiated. This is coupled in ETSI MANO: the VNFM, while instantiating a VNF, also takes responsibility for interacting with the element manager to make sure the VNF is configured. Kubernetes doesn’t have built-in mechanisms for application configuration. However, with capabilities such as lifecycle hooks, init containers, ConfigMaps, and operators, one can build an efficient GitOps or DevOps method for configuring CNFs during or after instantiation.
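As one hedged example of such a flow, the snippet below publishes configuration as a ConfigMap that an init container or lifecycle hook of a hypothetical CNF could consume at startup; the names and keys are placeholders rather than any specific vendor’s schema.

# Illustrative sketch: publishing CNF configuration as a ConfigMap with the Kubernetes Python client.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Hypothetical day-1 configuration for a CNF; in a GitOps flow this manifest
# would live in Git and be applied by a reconciler rather than by hand.
cnf_config = client.V1ConfigMap(
    metadata=client.V1ObjectMeta(name="upf-config", namespace="core"),
    data={"plmn": "310-410", "log-level": "info"},
)

core.create_namespaced_config_map(namespace="core", body=cnf_config)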

The fundamental differences between the MANO and Kubernetes frameworks also manifest in the way that the two systems respond to failures and performance impacts. With Kubernetes’ intent-based framework and its ability to take care of Pod failures, many failures and performance issues can be handled at the Kubernetes level, and the Operations Support System (OSS)/Business Support System (BSS) only need to take care of higher-level performance issues and failures that can’t be corrected by Kubernetes and associated constructs such as Auto Scaling groups.

Due to the different approaches of cloud-native operations versus virtualized application operations, the day-1 to day-2 operations also look different between them. In the following section, I will go over typical day-1 to day-2 operations, and explore how a cloud-native orchestration can be built from those requirements. To meet the business goals outlined earlier, it is important that the orchestration solution itself is resilient and agile. If the orchestrator is implemented using cloud-native and serverless best practices, then it will also be easier to evolve the orchestrator to meet the changing demands of OSS/BSS systems.

Architecture for day-1 to day-2 operations

The following guidelines help develop scalable automation steps while utilizing native AWS and Amazon EKS constructs for maximum flexibility.

Day-1: Planning and network prerequisites

Common tasks during this phase are:

  • Creation of Account Structure
  • Planning IP/subnets
  • Work with ISVs and CSP to populate CIQ and artifacts
  • Order Outposts if needed
  • Define naming convention, metadata and tags
  • Create IAM accounts and roles in the account structure
  • Create service and CNF Catalogs
  • Give appropriate permissions and set policies
  • Create Infrastructure: deploy AWS constructs such as VPC/TGW/DX, and subnets.

Most of these activities are covered by proper landing zone design and discussions among network function providers, the CSP network team, and the cloud provider to create well-structured accounts, permissions, naming conventions, and network design. AWS services, such as AWS Organizations and AWS Control Tower, can be quite useful for developing a proper account structure and management tasks such as new account creation. With AWS Identity and Access Management (IAM), one can specify who or what can access services and resources in AWS, centrally manage fine-grained permissions, and analyze access to refine permissions across AWS. AWS Key Management Service (AWS KMS) helps create, manage, and control cryptographic keys across applications and more than 100 AWS services, and it helps with secure access and management.
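While the account structure itself is a design decision, its implementation can be scripted. The following boto3 sketch creates a member account under AWS Organizations; the email address and account name are placeholders, and because account creation is asynchronous, the returned status would be polled in practice.

# Illustrative sketch: creating a member account in an AWS Organization with boto3.
import boto3

org = boto3.client("organizations")

# CreateAccount is asynchronous; poll describe_create_account_status() with the returned Id.
status = org.create_account(
    Email="network-team+prod@example.com",   # placeholder address
    AccountName="cnf-production",            # placeholder account name
    IamUserAccessToBilling="DENY",
)["CreateAccountStatus"]

print(status["Id"], status["State"])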

Amazon Virtual Private Cloud (Amazon VPC) gives full control over the virtual networking environment, including resource placement, connectivity, and security. One or more VPCs might be required depending on the VPC design and scale. AWS Direct Connect links the CSP internal network to a Direct Connect location over a standard Ethernet fiber-optic cable. AWS Transit Gateway connects Amazon VPCs and on-premises networks through a central routing hub. This simplifies the network and puts an end to complex peering relationships, as each new connection is only made once. Amazon Route 53 is a highly available and scalable Domain Name System (DNS) web service. Route 53 connects user requests to internet applications running on AWS or on-premises.
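Once the IP plan is fixed, many of these networking constructs can be created programmatically. The sketch below creates a VPC and attaches it to a transit gateway using boto3; the CIDRs and Availability Zone are placeholders, and in practice this would more likely be expressed as CloudFormation or AWS CDK templates.

# Illustrative sketch: creating a VPC and attaching it to an AWS Transit Gateway with boto3.
import boto3

ec2 = boto3.client("ec2")

vpc_id = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]   # placeholder CIDR
subnet_id = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.1.0/24",
                              AvailabilityZone="us-east-1a")["Subnet"]["SubnetId"]

tgw = ec2.create_transit_gateway(Description="CSP hub")["TransitGateway"]

# The transit gateway must reach the 'available' state before it can be attached.
ec2.create_transit_gateway_vpc_attachment(
    TransitGatewayId=tgw["TransitGatewayId"],
    VpcId=vpc_id,
    SubnetIds=[subnet_id],
)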

Amazon Elastic Container Registry (Amazon ECR) is an AWS-managed, OCI-compliant container image registry service that is secure, scalable, and reliable. Amazon ECR supports private repositories with resource-based permissions using IAM. This makes sure that only specified users or Amazon EC2 instances can access container repositories and images, thereby allowing separation across vendors. Amazon ECR supports multiple AWS Command Line Interface (AWS CLI) options to push, pull, and manage Docker images, Open Container Initiative (OCI) images, and OCI compatible artifacts.
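The per-vendor separation described above can be enforced with repository policies. The following boto3 sketch creates a private repository and attaches a resource-based policy granting pull access to a single hypothetical vendor role; the repository name and role ARN are placeholders.

# Illustrative sketch: creating an Amazon ECR repository and restricting pull access with boto3.
import boto3, json

ecr = boto3.client("ecr")

ecr.create_repository(
    repositoryName="vendor-a/upf",                       # placeholder repository name
    imageTagMutability="IMMUTABLE",
    imageScanningConfiguration={"scanOnPush": True},
)

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowVendorAPull",
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111122223333:role/vendor-a-ci"},  # placeholder role
        "Action": ["ecr:GetDownloadUrlForLayer", "ecr:BatchGetImage"],
    }],
}

ecr.set_repository_policy(repositoryName="vendor-a/upf", policyText=json.dumps(policy))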

Significant engineering effort and design consideration should be invested at this stage, as it lays the foundation for future automation and operations. Although proper planning needs human decisions, the implementation of design decisions can often be automated.

Day 0: Topology development

Some of the tasks in this phase of the deployment are as follows:

  • Activate hardware, such as AWS Outposts, if it’s deployed
  • Populate service/CNF catalog: this catalog contains services that upper layers can call as well as CNF images, helm charts, etc.
  • Deploy platforms: such as Amazon EKS clusters, CNIs/CSIs, vRouters, and observability infrastructure such as probes and clients
  • Bootup Infrastructure: such as node groups

Well-developed robotic process automation tools, or customized process automation tools, can be used to activate Outposts. Services such as AWS Service Catalog can be quite useful in creating the catalog and customizing it for the particular CSP/ISVs. Care should be taken to properly abstract configuration parameters to avoid bloating the catalog size. AWS CloudFormation and/or AWS Cloud Development Kit (AWS CDK) constructs are flexible and functionally rich tools for deploying infrastructure and many of the platform components, such as Amazon EKS clusters, CNIs, CSIs, etc. The invocation of CloudFormation/AWS CDK templates can be further customized using AWS CodePipeline with tools such as AWS CodeCommit, AWS CodeDeploy, etc. Some of the functions can also be automated using a purpose-built automation workflow built with AWS Lambda and AWS Step Functions. Furthermore, the remaining networking constructs, such as vRouters and database infrastructure, should also be deployed.
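As a simplified sketch of what such automation might do under the hood (a production setup would more likely use CloudFormation or AWS CDK templates as described above), the following boto3 calls stand up an Amazon EKS cluster and a managed node group; the cluster name, IAM role ARNs, and subnet IDs are placeholders.

# Illustrative sketch: creating an Amazon EKS cluster and a managed node group with boto3.
import boto3

eks = boto3.client("eks")

eks.create_cluster(
    name="telco-core",                                               # placeholder cluster name
    roleArn="arn:aws:iam::111122223333:role/eks-cluster",            # placeholder IAM role
    resourcesVpcConfig={"subnetIds": ["subnet-aaa", "subnet-bbb"]},  # placeholder subnets
)

# Wait for the control plane to become ACTIVE, then add worker capacity.
eks.get_waiter("cluster_active").wait(name="telco-core")

eks.create_nodegroup(
    clusterName="telco-core",
    nodegroupName="core-workers",
    subnets=["subnet-aaa", "subnet-bbb"],
    nodeRole="arn:aws:iam::111122223333:role/eks-node",              # placeholder IAM role
    scalingConfig={"minSize": 2, "maxSize": 6, "desiredSize": 3},
)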

Increased agility and resilience can be achieved via Amazon EKS, an AWS-managed Kubernetes service that makes it easy to run Kubernetes on AWS and on-premises. The Kubernetes control plane managed by Amazon EKS runs inside an Amazon EKS-managed VPC and runs components such as the Kubernetes API server nodes and the etcd cluster. This control plane, which runs the API server, scheduler, kube-controller-manager, and so on, is single-tenant and unique to each cluster, and runs on its own set of Amazon EC2 instances. API server nodes run in a minimum configuration of two distinct Availability Zones (AZs), while the etcd server nodes run in an Auto Scaling group that spans three AZs. This architecture makes sure that an event in a single AZ doesn’t affect the Amazon EKS cluster’s availability. Control plane backups (e.g., etcd backups) are performed periodically by AWS. Having a unified, managed Kubernetes control plane helps with operational agility and automation.

Day 1: Instantiation

This phase of deployment deals with the following tasks:

  • Instantiate CNFs
  • Update route tables

Most of these functions can be automated if properly designed. CloudFormation/AWS CDK are good constructs for this part. The automation pipeline could be built using AWS-provided continuous integration/continuous delivery (CI/CD) tools such as CodeCommit and CodeDeploy, or custom workflows using Lambda and Step Functions that invoke the appropriate CloudFormation/AWS CDK templates. Moreover, care must be taken to develop pipelines with appropriate parameterization so as not to bloat the number of pipelines or custom workflows. Constructs such as Amazon EKS Blueprints can be particularly beneficial in this part of the deployment.
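One possible shape for such a custom workflow is sketched below: a minimal Lambda handler, invoked for example by a Step Functions state machine, that launches a parameterized CloudFormation stack for a CNF. The event keys, template URL, and stack naming convention are hypothetical.

# Illustrative sketch: a Lambda handler that instantiates a CNF by launching a
# parameterized CloudFormation stack (e.g., as one state in a Step Functions workflow).
import boto3

cfn = boto3.client("cloudformation")

def handler(event, context):
    # 'event' is expected to carry the CNF name and its parameters; all keys are hypothetical.
    return cfn.create_stack(
        StackName=f"cnf-{event['cnfName']}",
        TemplateURL=event["templateUrl"],          # e.g., an S3 URL from the CNF catalog
        Parameters=[
            {"ParameterKey": k, "ParameterValue": v}
            for k, v in event.get("parameters", {}).items()
        ],
        Capabilities=["CAPABILITY_NAMED_IAM"],
    )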

One must also create proper databases to handle the vast and varied data generated by network functions, and to maintain the proper mapping between services, functions, and their instantiations. A graph database, such as Amazon Neptune, can be quite useful in this regard. Neptune is a fast, reliable, and fully managed graph database service that makes it easy to build and run applications over highly connected data. Amazon DynamoDB is a fully managed, serverless, key-value NoSQL database designed to run high-performance applications at any scale. DynamoDB offers built-in security, continuous backups, automated multi-Region replication, in-memory caching, and data import and export tools. Amazon Relational Database Service (Amazon RDS) is a managed relational database service for MySQL, PostgreSQL, MariaDB, Oracle BYOL, or SQL Server. Some AWS Partner solutions, such as Portworx, can be useful in architecting CNFs for high availability.
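As an illustration of such a mapping, the following sketch records a hypothetical service-to-CNF inventory item in a DynamoDB table; the table name and attributes are placeholders for whatever data model the operator defines.

# Illustrative sketch: recording a service-to-CNF mapping in an Amazon DynamoDB table with boto3.
import boto3

table = boto3.resource("dynamodb").Table("nf-inventory")   # placeholder table name

table.put_item(Item={
    "serviceId": "slice-embb-001",        # placeholder service identifier
    "cnf": "upf",                         # network function type
    "cluster": "telco-core",              # EKS cluster hosting the CNF
    "helmRelease": "upf-prod",            # release that instantiated it
    "status": "RUNNING",
})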

Day 2: Operation and management

This is arguably the hardest part of the automation lifecycle, and deals with day-to-day operation of the network. This phase deals with the following tasks:

  • Update and scale CNFs
  • Update and scale Network Services
  • Update EKS version
  • Update Configuration
  • Allow creation of new services
  • Monitor and Manage
  • Terminate service/CNFs when not needed

New infrastructure, networks, and functions could also be deployed in this phase to address increased network demand. Hence, it is important not to view this phase in isolation from the earlier phases, but to think of it as invoking all of the earlier phases as and when needed.

This phase is also the most difficult to handle with traditional CI/CD. However, newer GitOps-based approaches can be particularly beneficial in this context. GitOps enables configuration as code and, if properly implemented, can take care of drift from the desired configuration. This model is often utilized as an efficient strategy for provisioning cloud provider-specific managed resources, such as an Amazon Simple Storage Service (Amazon S3) bucket or an Amazon RDS instance, on which application workloads depend. Furthermore, AWS constructs such as AWS Auto Scaling can provide a cost-effective way to manage utilization and allow for traffic adaptation. Combining this approach with application configuration provides a useful method to manage the operational configuration.

Monitoring, observability, logging, and alarms for day-2 operations can be achieved using services such as CloudWatch and AWS CloudTrail, and managed services such as Amazon Managed Service for Prometheus and AWS Distro for OpenTelemetry. CloudWatch collects monitoring and operational data in the form of logs, metrics, and events so that the operations team can get a unified view of operational health and gain complete visibility into AWS resources, applications, and services running on AWS and on-premises. One can use CloudWatch to detect anomalous behavior in CSP environments, set alarms, visualize logs and metrics side-by-side, take automated actions, troubleshoot issues, and discover insights to keep the network running smoothly.
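As a small, hedged example, the following call creates a CloudWatch alarm on a hypothetical custom CNF KPI and routes it to an SNS topic that an assurance workflow could subscribe to; the namespace, metric name, and topic ARN are placeholders.

# Illustrative sketch: creating a CloudWatch alarm on a custom CNF KPI with boto3.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="upf-session-setup-failure-rate",
    Namespace="CSP/CoreNetwork",                     # placeholder custom namespace
    MetricName="SessionSetupFailureRate",            # placeholder KPI emitted by the CNF
    Dimensions=[{"Name": "Cluster", "Value": "telco-core"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=1.0,                                   # e.g., alarm above a 1% failure rate
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:assurance-events"],  # placeholder topic
)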

Since the network functions continuously emit performance data and KPIs, one needs a method for processing this streaming data. Amazon Kinesis makes it easy to collect, process, and analyze real-time streaming data so that an operator can get timely insights and react quickly to new information. Amazon SageMaker helps prepare, build, train, and deploy high-quality machine learning (ML) models quickly by bringing together a broad set of capabilities purpose-built for ML. This makes it easy to gain insights into the deployment and operations.
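For illustration, a CNF sidecar or exporter could publish KPI samples into a Kinesis data stream along the following lines; the stream name and record layout are assumptions, not a standard schema.

# Illustrative sketch: publishing a CNF KPI sample to an Amazon Kinesis data stream with boto3.
import boto3, json, time

kinesis = boto3.client("kinesis")

sample = {
    "timestamp": int(time.time()),
    "cnf": "upf",
    "cluster": "telco-core",
    "kpi": "throughput_mbps",
    "value": 9450.2,
}

kinesis.put_record(
    StreamName="cnf-kpi-stream",              # placeholder stream name
    Data=json.dumps(sample).encode("utf-8"),
    PartitionKey=sample["cnf"],
)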

In light of the processes and services discussed here, the following diagram represents a grouping of relevant AWS services for the implementation of a cloud-native network orchestrator. To maintain ease of reading, we have omitted some of the operational requirements, such as reliability, security, and recovery, from the following diagram; however, in a real implementation it is important to consider those requirements as well.


Figure 3: AWS constructs for a cloud-native CNF and infrastructure orchestrator

With the above architecture, one possible implementation on AWS is as follows.


Figure 4: Example implementation architecture of a cloud-native CNF and infrastructure orchestrator

This diagram depicts VPC constructs, Amazon EKS clusters, load balancers, repositories, network connections, and so on, in the context of a Region, Availability Zones, and on-premises data centers. For ease of representation, we haven’t depicted some of the functionality, such as account and user administration, the creation of a landing zone, security, and DNS, that was part of Figure 3, as many of those features will run in their own VPCs under the control of cross-account permissions.

Conclusion

In this post, I have discussed the essential role of agile and resilient cloud-native network orchestration in achieving the business and operational goals of service providers. I have examined and mapped the NFV orchestration requirements and specifications as set by ETSI in the context of Kubernetes and cloud-native implementations. I have also discussed the orchestration needs in the various phases of implementation, and provided a way to implement agile and resilient automation and orchestration solutions using AWS container and serverless constructs. These constructs are flexible and can easily be adapted to meet the automation requirements of CSPs and network function providers.

For further information on AWS telco offerings, and how some of these constructs have been used with service providers, please visit https://aws.amazon.com/telecom/.

Manjari Asawa

Dr. Manjari Asawa is a Senior Solutions Architect in the AWS Worldwide Telecom Business Unit. Her focus areas are telco orchestration, assurance, and use cases for autonomous operation of networks. She received her Ph.D. in Electrical Engineering from the University of Michigan at Ann Arbor.