AWS for Industries

Your telecom cloud journey on AWS: Part 3 – Optimizing cloud operations on AWS for telecom excellence

Introduction

In “Your Telecom Cloud Journey on AWS: Part 1 – Establishing a Foundation” and “Accelerating Your Telecom Cloud Journey: A Technical Roadmap with AWS” we covered the transformation journey that many Telcos are embarking on when running complex and critical Telco workloads on AWS. Once a cloud foundation is established and workloads are running on AWS, how can we make sure that they achieve or even exceed the same high operational standards as on-premises? By reading this post, you can understand the key operational considerations in designing critical workloads, establishing an operating model, and observing Telco workloads on AWS.

Design applications for critical workloads

IT services are awash with agreements and objectives, from contractual/external Service Level Agreements (SLAs) to internal Operational Level Agreements (OLAs) and Service Level Objectives (SLOs). These are used to drive good development (SLOs), operational practices and configuration (OLAs), and ultimately provide a service operation commitment and a commercial framework for when something goes wrong (SLAs).

In many cases, internal IT systems SLAs are typically best efforts, where the systems are architected to be resilient/reliable, but if they fail they are recovered as quickly as possible. This is not the case for some CSP network services. For example, Emergency Voice services (999 in the UK) are considered critical national infrastructure, and as such they must be highly available (99.999% for example) or the consequences could be catastrophic. If these types of services don’t meet their SLA, then the penalty for a CSP can be substantial. For example, in June 2017 a UK service provider was fined £1.9 million after it failed to make sure users could contact emergency services due to a weakness in its network handling emergency calls. CSPs often take a failure prevention approach with these critical services, building from the ground up and using redundant hardware (power supplies, network supervisors, line cards, etc.), dual control plane, and signaling infrastructure.

Amazon Web Services (AWS) provides SLAs for its services, for example a single Amazon Elastic Compute Cloud (Amazon EC2) instance has a commitment of 99.5% availability SLA, which equates to a downtime of 1.83 days in a year (or 7.2 minutes in a 24hr period). If this SLA is not met, then users are eligible for service credits. Therefore, the challenge for many CSPs, as they consider migrating critical workloads to AWS, is how to bridge the gap between the AWS SLA/commitment and their contractual/regulated SLA.

To support bridging this gap, AWS recommends the following:

  • Establishing a solid Landing Zone designed around key architectural constructs, such as AWS Regions and Availability Zones (AZs).
  • Cloud-native application design patterns, such as circuit breakers and bulkheads.
  • Strong operational practices around testing, deployment, observability, and financial controls.

Figure 1: Cell based architecture on AWS


The starting point for a reliability design is the Landing Zone, which is architected to use AWS Regions, AZs, and managed services, as shown in the preceding figure. At the simplest level, an application can be deployed across two AZs with multiple instances. Data can be replicated between zones using a standard active/passive model, and the application instances can be fronted by a cross-zone load balancer (not shown).

This typically supports three nines (99.9%) or four nines (99.99%) of availability. If a higher SLA is required, then this single AWS Region solution can be extended across multiple AWS Regions with Amazon Route 53 and cross-Region data replication to provide five nines (99.999%) of availability, and an example of this type of design can be found in this AWS Public Sector post. If a multi-Region design is not possible for data sovereignty, cost, or regulatory reasons, then a hybrid approach can be used through AWS Outposts or existing data centers.
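To make the availability arithmetic behind these patterns concrete, here is a minimal sketch in Python. It assumes instances fail independently (a simplification; correlated failures are exactly why AZ and Region isolation matters) and uses the 99.5% single-instance figure from the EC2 SLA discussion above:

```python
def downtime_minutes_per_year(availability: float) -> float:
    """Annual downtime implied by an availability target (525,600 min/year)."""
    return (1 - availability) * 365 * 24 * 60

def parallel_availability(a: float, n: int) -> float:
    """Availability of n independent redundant instances (any one suffices)."""
    return 1 - (1 - a) ** n

# A single 99.5% instance allows roughly 43.8 hours of downtime per year;
# two independently failing instances across AZs push that toward four nines.
single = 0.995
print(downtime_minutes_per_year(single) / 60)  # hours of downtime per year
print(parallel_availability(single, 2))        # combined availability of two instances
```

The same composition logic is how a multi-AZ or multi-Region design bridges the gap between a per-instance commitment and a five-nines service target.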

The final option is to implement a cell-based architecture across multiple AZs and AWS Regions, which is how many AWS services are architected under the covers. Adding shared-nothing application cells in an AZ can support high levels of resilience and large numbers of tenants: data is replicated between cells for resilience, and a cell routing layer routes requests to individual cells based on a partition key (a user ID, for example) or around a failed cell. Cell-based architectures also provide additional benefits, such as avoiding noisy neighbors. However, because they are more complex, they are typically only used for large-scale systems such as AWS itself.
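The cell routing layer can be sketched with a stable hash over the partition key. This is an illustrative sketch, not a production router; the cell names are hypothetical, and real implementations typically use consistent hashing or a mapping table so that adding a cell does not reshuffle every user:

```python
import hashlib

CELLS = ["cell-a", "cell-b", "cell-c"]  # hypothetical cell identifiers

def route_to_cell(partition_key: str) -> str:
    """Deterministically map a partition key (e.g. a user ID) to one cell.

    A stable hash keeps a given user pinned to the same cell, which bounds
    the blast radius of a cell failure to that cell's share of users."""
    digest = hashlib.sha256(partition_key.encode()).hexdigest()
    return CELLS[int(digest, 16) % len(CELLS)]

def route_with_failover(partition_key: str, healthy: set) -> str:
    """If the home cell is unhealthy, deterministically pick a healthy cell."""
    home = route_to_cell(partition_key)
    if home in healthy:
        return home
    candidates = [c for c in CELLS if c in healthy]
    if not candidates:
        raise RuntimeError("no healthy cells available")
    digest = hashlib.sha256(partition_key.encode()).hexdigest()
    return candidates[int(digest, 16) % len(candidates)]
```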

These infrastructure patterns can be created and delivered by individual development/operations teams, platform engineering (PE), or site reliability engineering (SRE) teams to effectively bridge the SLA gap. The challenge is that these patterns must be intentional in their design and highly reliable, with a focus on detecting and recovering from failure rather than preventing it.

Then, these patterns are used to provide a platform for the developers, who add the final layer of reliability and recovery by using cloud development best practices, such as microservices, tracing, and circuit breakers. This bridges the SLA gap and allows critical applications to be deployed on AWS.
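As an illustration of one such pattern, a circuit breaker can be sketched in a few lines of Python. This is a minimal sketch: production-grade implementations add half-open probe limits, metrics, and thread safety.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after N consecutive failures,
    fast-fail calls while open, and allow a trial call after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: fast-failing")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit again
        return result
```

Fast-failing while the circuit is open stops a struggling downstream dependency from being hammered, which is precisely the recovery-oriented behavior the platform patterns above aim for.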

Establish an operating model

Traditional CSP operating models are typically hierarchical and highly structured (ITSM/TMF), mainly due to the need to support and maintain critical, real-time workloads at a country level. Although most large CSPs have taken significant steps in adopting Agile at Scale (for example, the Scaled Agile Framework) and DevOps practices for IT workloads, changes in the operations space for network workloads (Cellular, Broadband, Content, etc.) have been slower. This has led to a division between the two domains, as shown in the following diagram. This division is typically bridged by the (global) operations teams and tends to lead to organizational friction and inefficiencies.

Figure 2: Transforming operating model


As CSPs move away from the traditional hardware and software vendors, embracing open source software and white box hardware solutions, there is the opportunity to adopt SRE/DevOps practices and gain commercial advantages through faster time-to-market, higher reliability, and faster time to restore. This transformation takes time, so the existing traditional networks operating model needs to continue, albeit in a smaller, more siloed way. Furthermore, the CSP must embrace the newer technical practices that come with managing real-time distributed systems that have a high reliability target.

Therefore, the conclusion must be that as CSPs modernize their network services, they should adopt and adapt good practices from the IT space. However, this alone is not enough. They must also begin developing new practices and learning from teams that operate large, real-time software systems today in order to deliver an operating model that can deliver and support higher SLAs.

Observability

As Telco operators accelerate their migration to the cloud, comprehensive observability is crucial for maintaining a high quality of service and meeting SLAs. Telco workloads have unique observability needs as compared to enterprise IT systems.

Telco environments generate massive volumes of telemetry that must be analyzed in real-time while complying with regulations on data sovereignty and privacy. Outages directly impact users, so observability platforms must help achieve stringent SLAs on availability and performance. CSPs also operate complex hybrid environments spanning cloud, edge, and on-premises. To gain end-to-end visibility, observability data must be aggregated across all domains, thus providing unified real-time health and performance insight.

Key observability capabilities
A robust observability platform needs specialized capabilities to meet the needs of Telco operators:

Unified monitoring across hybrid environments
Telco operators must collect and correlate metrics, logs, and traces from across physical network functions, virtual network functions, edge compute locations, data centers, public cloud, and on-premises environments. This provides the complete end-to-end visibility necessary to track services and network behavior. This means using scalable services such as Amazon CloudWatch, Amazon Kinesis, Amazon Managed Streaming for Apache Kafka (Amazon MSK), and Amazon Managed Service for Prometheus. Observability data must be stored in specialized data stores optimized for different analytic workflows. This includes Amazon Simple Storage Service (Amazon S3) for a durable data lake, and Amazon OpenSearch Service for log and trace analysis.

For hybrid environments, the observability platform needs to integrate on-premises telemetry using VPN or AWS Direct Connect to stream data into the Telco Cloud. AWS Distro for OpenTelemetry (ADOT) enables collecting telemetry from both on-premises systems and cloud-based applications. To ingest data from on-premises systems, the ADOT agent can be deployed to gather telemetry and securely forward it to the cloud platform. The processed data becomes available for analysis and observability through the unified monitoring interface.

For legacy on-premises systems that cannot run agents, logs and metrics can also be collected through syslog or SNMP forwarding to OpenSearch or CloudWatch. The key is providing easy integration methods to bring the critical on-premises telemetry into the centralized cloud observability system.
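One integration method for such legacy sources can be sketched with the AWS SDK for Python (boto3): shape forwarded syslog lines into CloudWatch Logs events and push them in batches. The log group and stream names here are hypothetical, and the sketch assumes they already exist and that AWS credentials are configured:

```python
import time

# Hypothetical names -- substitute your own log group and stream.
LOG_GROUP = "/telco/onprem/syslog"
LOG_STREAM = "edge-router-01"

def to_log_events(syslog_lines):
    """Shape raw syslog lines into the CloudWatch Logs event format
    (millisecond timestamps, one event per line)."""
    now_ms = int(time.time() * 1000)
    return [{"timestamp": now_ms, "message": line} for line in syslog_lines]

def forward_to_cloudwatch(syslog_lines):
    """Push a batch of syslog lines to CloudWatch Logs."""
    import boto3  # imported lazily so the shaping logic is testable offline
    logs = boto3.client("logs")
    logs.put_log_events(
        logGroupName=LOG_GROUP,
        logStreamName=LOG_STREAM,
        logEvents=to_log_events(syslog_lines),
    )
```

In practice this would run behind a syslog receiver on-premises, with the batch size and flush interval tuned to the CloudWatch Logs API limits.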

High-scale ingestion
To handle high-scale data ingestion from large Telco networks, we recommend using Amazon MSK to provide buffering at the edge to smoothly handle spikes in load and multiple destinations for storage or consumption. The observability platform should scale horizontally and use cloud-native architectures to distribute and process high volumes of streaming data.

Log parsing and filtering, compression, and aggregation of metrics can further optimize bandwidth and storage for high-volume timeseries data. By handling ingestion correctly, full fidelity observability data can be cost-effectively collected at any scale.
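The metric aggregation step can be illustrated with a small, self-contained sketch that downsamples raw samples into per-minute averages before they leave the edge, trading temporal resolution for bandwidth and storage:

```python
from collections import defaultdict

def aggregate_per_minute(samples):
    """Downsample raw (timestamp_seconds, metric_name, value) samples into
    per-minute averages, reducing volume before shipping off the edge.

    Returns {(metric_name, minute_start_seconds): average_value}."""
    buckets = defaultdict(list)
    for ts, name, value in samples:
        buckets[(name, ts // 60)].append(value)
    return {
        (name, minute * 60): sum(vals) / len(vals)
        for (name, minute), vals in buckets.items()
    }
```

Real pipelines usually emit several aggregates per bucket (min, max, count, sum) so percentiles and rates can still be derived downstream; the single average here keeps the sketch short.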

Customizable visualization and alerting
The observability platform must provide customizable dashboards aligned to specific operator roles (NOC technician, capacity planner, executive), network domains (RAN, core, transport) and services (VoLTE, IoT, CDN). Real-time and historical monitoring data is visualized through tailored graphs, topology maps, alerts, and reports built in Amazon Managed Grafana or CloudWatch.

Security and compliance
A comprehensive observability platform should include advanced analytics to automatically detect abnormal behaviors, suspicious access attempts, or violations of policy. Then, the platform should trigger alerts based on this detection. Additionally, the observability data collected should enable auditing across the Telco network.

Artificial Intelligence-Ops capabilities
Applying artificial intelligence (AI) and correlation engines to observability data enables intelligent alerting, anomaly detection, and automated insights into performance and issues. This further reduces the mean time to detect and resolve incidents.
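As a toy illustration of the statistical baselining such engines build on, a z-score check flags points that deviate sharply from a series' mean. Production AIOps systems use far richer models (seasonality, multivariate correlation), but the principle is the same:

```python
import statistics

def anomalies(series, threshold=3.0):
    """Return indices of points whose z-score against the series mean and
    standard deviation exceeds the threshold -- a minimal stand-in for the
    baselining an AIOps engine applies to observability data."""
    mean = statistics.fmean(series)
    stdev = statistics.pstdev(series)
    if stdev == 0:
        return []  # a perfectly flat series has no outliers
    return [i for i, x in enumerate(series) if abs(x - mean) / stdev > threshold]
```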

Service-centric monitoring
Observability platforms need to align monitoring to specific services and network slices. This provides real-time insight into service health and performance from the user perspective, based on business KPIs such as network latency, packet loss, throughput, and uptime.

When issues arise that impact services, operators can quickly understand the root cause by using visualizations of the underlying network components and flows supporting each service. This service-centric approach is mandatory for 5G network slicing observability.
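Service-centric monitoring often tracks these KPIs through an error budget: the number of failures a service may incur while still meeting its SLO. A minimal sketch of the calculation:

```python
def error_budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget left for a service-level objective.

    With a 99.9% SLO, 1 failure per 1,000 requests consumes the whole
    budget; a negative result means the SLO is already breached."""
    allowed_failures = (1 - slo) * total_requests
    if allowed_failures == 0:
        return 0.0
    return 1 - failed_requests / allowed_failures
```

Teams commonly alert on the burn rate of this budget rather than on individual failures, which keeps alerting aligned to the user-perceived service rather than to component noise.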

End-to-end automation
To enable closed-loop automation, observability platforms integrate with network orchestration and automation systems. Alerts can trigger the auto scaling of capacity, network healing, traffic steering, and more. This prevents issues from impacting users.

Observability best practices

Best practices when deploying observability for Telco Cloud:

  • Comprehensive data collection: Identify and capture diverse types of data, such as telemetry (logs, metrics, and traces) from various network components, applications, and supporting infrastructure.
  • Data correlation and analysis: Implement mechanisms to correlate and analyze data from different sources, enabling the identification of patterns and correlations.
  • Single platform observability: Consolidate network, application, infrastructure, and security observability onto a single platform to enhance correlation and analysis across the Telco Cloud.
  • Easy integration with the CSP landscape: The observability platform should act as an interconnected nervous system for the Telco Cloud environment, not a detached black-box module. An interoperable and flexible architecture allows the telemetry data to be integrated with other systems, data lakes, etc. This enables true observability across the Telco Cloud landscape, rather than just isolated monitoring of specific systems.
Additional information can be found in the observability best practices guide.
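The data correlation practice above hinges on a shared key across sources. A minimal sketch, assuming each log entry carries a `trace_id` field (as it would when instrumented with OpenTelemetry), groups entries from different components into one end-to-end view per request:

```python
from collections import defaultdict

def correlate_by_trace(log_entries):
    """Group log entries from different sources by trace identifier and
    order each group by timestamp, yielding an end-to-end view of one
    request as it crossed components."""
    by_trace = defaultdict(list)
    for entry in log_entries:
        trace_id = entry.get("trace_id")
        if trace_id:  # entries without a trace id cannot be correlated
            by_trace[trace_id].append(entry)
    for entries in by_trace.values():
        entries.sort(key=lambda e: e["timestamp"])
    return dict(by_trace)
```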

Reference observability architecture
The observability platform can be deployed in a dedicated account within the Telco Cloud Landing Zone, allowing it to be used by multiple accounts. The following diagram shows a typical observability architecture:

Figure 3: Reference observability architecture


Telco application on Amazon EKS and Amazon EKS Anywhere
The preceding reference architecture illustrates how infrastructure and applications running on Amazon Elastic Kubernetes Service (Amazon EKS) can be observed by using:

  • ADOT, running on Amazon EKS for Telco Cloud workloads and on Amazon EKS Anywhere for on-premises workloads, collects logs, metrics, and traces.
  • Logs and traces are forwarded to OpenSearch, where the correlation of data occurs using unique trace identifiers. The logs are further stored in Amazon S3 for archival and additional AI use cases. This includes:
    • Application logs: call flow, etc.
    • Infrastructure logs
  • Metrics at every level are stored in Amazon Managed Service for Prometheus for a limited time, and Amazon S3 for longer retention. This includes:
    • Network Metrics: N3 throughput, Packet loss, etc.
    • Application Metrics: UE registrations, UE de-registrations, etc.
    • Infrastructure Metrics: Node Failures, Cluster Health, etc.
  • Alerts configured in the alert manager in Amazon Managed Service for Prometheus trigger notifications and initiate corresponding actions using AWS Step Functions, calling orchestrator APIs and performing closed loop automation.
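The last step in the list can be sketched with boto3: a small handler receives an Alertmanager-style webhook, extracts the firing alerts, and starts one Step Functions execution per alert. The state machine ARN is hypothetical, and the sketch assumes AWS credentials and the workflow are already in place:

```python
import json

# Hypothetical ARN -- substitute your remediation workflow.
STATE_MACHINE_ARN = "arn:aws:states:eu-west-1:111122223333:stateMachine:remediate"

def alerts_to_inputs(webhook_payload):
    """Turn an Alertmanager-style webhook payload into one Step Functions
    input per firing alert (resolved alerts are ignored)."""
    return [
        {
            "alert": a.get("labels", {}).get("alertname", "unknown"),
            "labels": a.get("labels", {}),
        }
        for a in webhook_payload.get("alerts", [])
        if a.get("status") == "firing"
    ]

def trigger_remediation(webhook_payload):
    """Start one Step Functions execution per firing alert."""
    import boto3  # lazy import keeps the payload parsing testable offline
    sfn = boto3.client("stepfunctions")
    for item in alerts_to_inputs(webhook_payload):
        sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            input=json.dumps(item),
        )
```

The state machine can then call the orchestrator APIs mentioned above (scaling, healing, traffic steering), closing the loop from detection to remediation.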

AWS services

In addition to using Amazon EKS, the Telco Cloud has other services that need to be considered as part of an observability strategy, which can be addressed by:

  • Using CloudWatch to monitor metrics and logs for AWS managed services.
  • Enabling VPC Flow Logs and storing them in a log archive account along with other logs for retention.
  • Streaming metrics from CloudWatch to Amazon S3 through metric streams using Amazon Kinesis Data Firehose.
  • Using Amazon Managed Grafana, with OpenSearch, Amazon Managed Service for Prometheus, and CloudWatch configured as data sources, to visualize the end-to-end health of the Telco network.

By using AWS cloud-native observability platforms and following Telco-specific best practices, Telco operators gain the unified visibility and actionable insights across hybrid, multi-vendor environments needed to deliver outstanding quality of service in today’s highly complex networks.

Summary

In this post, we showed that CSPs manage critical workloads, and that to achieve the necessary SLAs, it's important to properly design the AWS infrastructure and applications using AWS Regions and AZs. During this transition to the cloud, you should also consider the operating model, with many CSPs looking to transform to DevOps and SRE practices. Some of the benefits include increased software quality, agility, time-to-market, and innovation, giving the CSP a competitive edge. An enabler for these operational practices is the proper implementation of observability on AWS, with logs, metrics, and traces being essential for assessing the health, performance, capacity planning, and troubleshooting of Telco workloads running on AWS.

Gouthami Gurram


Gouthami Gurram is a Senior Professional at Amazon Web Services (AWS) Telco, Media, Entertainment, Gaming and Sports, where she collaborates with global customers to build cutting-edge, data-driven AI solutions. With deep expertise in observability, generative AI, and data platforms, Gouthami empowers customers to drive innovation and mature in their data intelligence journey.

Amir Choudhri


Amir is a Senior Consultant in AWS, where he helps large Telcos transform their businesses through adoption of AWS services. With a long background in architecting, securing and operating complex data centers, networks, and infrastructure modernization, Amir enjoys finding innovative solutions to solve complex problems.

Malcolm Orr


Malcolm Orr is a principal engineer at AWS and has a long history of building platforms and distributed systems using AWS services for Telcos. He brings a structured, systems view to building secure and cost-effective solutions and platforms in the cloud.