Building Observability on Amazon Managed Grafana for 5G O-RAN sites built on EKS-Anywhere

An important step in the push for an open, virtualized, and intelligent 5G has been the proliferation of Open Radio Access Network (O-RAN), which seeks to improve Radio Access Network (RAN) flexibility and deployment velocity, while simultaneously reducing the capital and operating costs through the adoption of cloud architectures. O-RAN does this by disaggregating the RAN into multiple components, such as O-RU(O-RAN Radio Unit), O-DU(O-RAN Distributed Unit), O-CU(O-RAN Central Unit), RIC(RAN Intelligent Controller). These components can run as containerized software on commodity servers supplemented with a programmable Acceleration Abstraction Layer (AAL) when necessary. The underlying cloud computing layer that enables the hosting of containerized RAN network functions is also called O-Cloud in O-RAN standards.

This architecture can be represented as three open layers, as shown in Figure 1 below. The commodity hardware servers host a container runtime in order to run the containerized RAN functions. The most commonly used Container Runtime is Kubernetes, which we will use in this blog. For simplicity, only O-CU and O-DU software applications have been shown.

Figure 1: Layers of an O-RAN site

O-RAN technology inherently consists of disaggregation and allows Communication Service Providers (CSPs) to have multiple suppliers for each layer. This creates challenges in monitoring and observing the different types of hardware and software. Although usually every vendor provides their own Element Management System (EMS), it is inconvenient to use different systems and co-relate issues.

In this blog, we explain how we can use Amazon Managed Grafana with commonly used observability solutions such as Prometheus and OpenTelemetry, to monitor all three layers of O-RAN sites (server, Kubernetes cluster, and O-RAN applications), uniformly across suppliers. By aggregating this observability in the AWS Region, CSPs gain the reliability and operational ease of the cloud. This enables CSPs to pivot on the flexibility that O-RAN was designed to provide.

We use Amazon Managed Service for Prometheus as an example of a Prometheus server, AWS Distro for OpenTelemetry (ADOT) as an example of an OpenTelemetry collector, and Amazon EKS Anywhere as an example of a Kubernetes runtime which can be used with Commercial Off the Shelf (COTS) servers. However most concepts in this blog could apply to your choice of hardware and Kubernetes distribution. For general guidance on how you can build O-RAN on AWS please prefer to this whitepaper.

Monitoring the Edge Servers availability

For the server layer, we show examples of critical monitoring, and whether the server is reachable/available or not. The techniques shown for server availability can be used for detailed server observability as well.

Option 1: Use Redfish REST APIs to monitor the server status

Redfish is a cross-vendor industry standard that defines an easy-to-use and implement RESTful interface that lets users manage a wide range of devices and environments, such as stand-alone servers. In many cases, a baseboard management controller (BMC) implements Redfish protocols, resources, and functions to provide remote management capabilities of a system. Most of the major server suppliers for the O-RAN edge support the Redfish standard. HPE servers use the Redfish enabled iLO interface. Similarly Dell iDRAC and Supermicro also provide Redfish support. You can check here for the Redfish compliancy of your server supplier.

As long as your specific server is Redfish enabled (or can become Redfish enabled using additional software or licensing), we can use the Redfish REST specification to monitor it, as shown in the following figure.

Figure 2: Using Redfish APIs to monitor servers

Redfish enabled servers define an event subscription service at /redfish/v1/EventService/Subscriptions, where a client such as our Monitoring service can subscribe with the following info:

The listener URI where an event-receiver client expects events to be sent. When an event is triggered within the Redfish service, the service sends an event to that listener URI.
The type of events to send.

Refer to the ‘Eventing’ chapter in the Redfish Specification for more details.

Let’s use an example to monitor an HPE server with a Redfish enabled iLO interface.

1. Create a listener REST API in Amazon API Gateway using the steps provided in this API Gateway doc.

2. Subscribe to desired events with the server’s Redfish service as follows:

POST /redfish/v1/EventService/Subscriptions/

Request Body:

{
    "Destination": "https://{restapi-id}.execute-api.{region}.amazonaws.com/{stage}",
    "Context": "Some context",
    "RegistryPrefixes": [
        "StorageDevice",
        "NetworkDevice",
        "iLOEvents",
        "ResourceEvent"  
    ],
    "HttpHeaders": {
        "Content-Type": "Application/JSON",
        "Odata-Version": "4.0"
    },
}

Where:

Destination is the listener API we created in Step 1

Context can be any static information you want to provide, such as the site detail

RegistryPrefix is the list of server events to which we’re subscribing

HttpHeaders are arbitrary HTTP headers you need for the event POST operation

3. If an event to which we’ve subscribed occurs (iLOEvent ‘ServerPoweredOff’ in this example), Redfish sends a POST event such as the following to the listener API.

{
    "EventID": "myEventId",
    "EventTimestamp": "2023-02-13T14:49:20Z",
    "Severity": "Critical",
    "Message": "This is a test event message",
    "MessageId": "iLOEvents.2.1.ServerPoweredOff",
    "MessageArgs": [
        "NoAMS",
        "Busy",
        "Cached"
    ],
    "OriginOfCondition": "/redfish/v1/Systems/1/"
}

4. Configure your listener API’s POST backend as a Python function in AWS Lambda to process the server event and write to Amazon Timestream.

5. Add Amazon Timestream as the data-source for Amazon Managed Grafana.

6. Configure Amazon Managed Grafana to create Alerts using the Amazon Timestream data.

Option 2: Use the Prometheus ‘up’ metric to monitor server status

If the servers to be monitored are part of Kubernetes clusters, then you can use Prometheus for their monitoring. For each instance scrape Prometheus stores a sample for the up time series as follows.

up{job="<job-name>", instance="<instance-id>"}: 1 if the instance is healthy, meaning reachable, or 0 if the scrape failed.

The up metric can be used for server health monitoring as shown in the following figure.

Figure 3: Using node-exporter ‘up’ metric to monitor servers

1. Install utilities that can monitor nodes such as Node Exporter as daemonsets in your Kubernetes Cluster. As an example, Node exporter collects the up metric for all your nodes in this format:

up{instance="192.168.1.50:9100",job="node-exporter",nodename="hostname-121"}

2. Use the ‘Prometheus Receiver’ component in your OpenTelemetry Collector to scrape the Node Exporter metrics. You can use the following sample config for ADOT to scrape from Node Exporter. Note that the ADOT collector is provided as a curated package for EKS Anywhere, which makes EKS Anywhere a convenient choice for your container runtime.

      receivers:
        prometheus:
          config:
            global:
              scrape_interval: 30s
              scrape_timeout: 10s
            scrape_configs:
              - job_name: 'node-exporter'
                kubernetes_sd_configs:
                  - role: endpoints
                ec2_sd_configs:
                relabel_configs:
                  - source_labels: [ __address__ ]
                    action: keep
                    regex: '.*:9100$'
                  - action: replace
                    source_labels: [__meta_kubernetes_endpoint_node_name]
                    target_label: nodename

3. Use the ‘Prometheus Remote Write Exporter’ component in your OpenTelemetry Collector to send metrics to Prometheus remote write compatible backends, such as Amazon Managed Service for Prometheus.

exporters:
  prometheusremotewrite:
    endpoint: {{ prometheusRemoteWriteUrl }}

4. Create an Alert rule in your Prometheus backend.

{
    "alert": "NodeTargetMissing",
    "expr": "up{instance=~\".*9100.*\", job=\"node-exporter\", nodename!=\"\"} == 0",
    "labels": {
        "severity": "critical"
    },
    "annotations": {
        "summary": "Node target with name {{ $labels.nodename }} is missing",
        "description": "A Node target has disappeared.\n  LABELS = {{ $labels }} for more then 30 seconds"
    }
}

5. Configure Amazon Managed Grafana to use Amazon Managed Service for Prometheus (or any Prometheus) as the data source, as explained in this Grafana doc.

6. Visualize alerts at Amazon Managed Grafana by following steps in this Grafana doc to use the Prometheus Alertmanager as the data source.

Monitoring the Kubernetes Container Runtime

Use a similar setup as Option 2 and install utilities such as kube-state-metrics, Node Exporter, and Container Advisor on your clusters. These utilities monitor your clusters and generate metrics that can be consumed to create alerts and informative dashboards.

Follow Steps 2-6 from Option 2 to scrape metrics, create alerts, and visualize the alerts on Amazon Managed Grafana. You can find sample Alerting rules for Amazon EKS Anywhere at this GitHub post.

Once Kubernetes metrics are available in Amazon Managed Grafana, you can access a vast marketplace of preconfigured Grafana Kubernetes dashboards that you can import into Amazon Managed Grafana.

Monitoring the O-RAN applications

O-RAN applications, such as vDU, vCU, and AAL, run as microservices on the Kubernetes container runtime and must be able to generate and export metrics. It is up to the application vendor to create this capability. Then, we can use one of the following two commonly used methods to monitor the applications as shown in the following figure.

Figure 4: Monitoring O-RAN applications using metrics

Option 1: O-RAN application exposes metrics that can be scraped by Prometheus

Prometheus works by ‘scraping’ metrics from instrumented applications. The O-RAN application may generate Prometheus compliant metrics and make them available over HTTP. It is up to the O-RAN application vendor to provide this capability.

Use the ‘Prometheus Receiver’ component in your OpenTelemetry Collector to scrape the exposed metrics from the O-RAN application, and then export them to Prometheus backends as in the following example. Then, the O-RAN application metrics can be visualized in Amazon Managed Grafana as explained in the previous Kubernetes Container Runtime section.

receivers:
  prometheus:
    config:
      global:
        scrape_interval: 30s
      scrape_configs:
        - job_name: oran_app_metrics
          static_configs:
            - targets: ["1.2.3.4:9091"]
              
exporters:
  prometheusremotewrite:
    endpoint: {{ prometheusRemoteWriteUrl }}

Where 1.2.3.4:9091 is the IP and port of the metrics server on the O-RAN application.

Option 2: O-RAN application forwards metrics using OpenTelemetry Protocol

The O-RAN applications may also send Prometheus compliant metrics using the OpenTelemetry Protocol (OTLP) to your OpenTelemetry collector (such as ADOT collector) as suggested in the example. It is up to the application vendor to provide this capability.

Use the ‘OTLP Receiver’ component in your OpenTelemetry Collector to receive the metrics, and then export them to Prometheus backends as in the following example. Then, the O-RAN application metrics can be visualized in Amazon Managed Grafana as explained in the previous Kubernetes Container Runtime section.

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  prometheusremotewrite:
    endpoint: {{ prometheusRemoteWriteUrl }}

Conclusion

In this blog, you learned multiple methods to monitor all layers of your O-RAN edge sites (O-Cloud) in the AWS Region using Amazon Managed Grafana. This centralizes your observability of geographically spread O-RAN sites regardless of the hardware/software suppliers, in turn helping to achieve the O-RAN vision. By having the observability function in the AWS Region, you get built-in resiliency that comes with Amazon’s managed services. In addition, you can also use supplementary Amazon Artificial Intelligence/Machine Learning (AI/ML) and Amazon data analytics services to process your metrics (for example, to transform your metrics into insights). During the process, you also learnt about other AWS services that can help you with your O-RAN journey, such as Amazon EKS Anywhere and ADOT.

AWS for Industries