IBM & Red Hat on AWS

Enabling AI-Powered Observability for ROSA Clusters with AWS DevOps Agent

Many organizations run containerized workloads on Red Hat OpenShift Service on AWS (ROSA), using its enterprise-grade Kubernetes capabilities. As these deployments scale, teams need intelligent observability that goes beyond basic monitoring—they need AI-powered investigation and automated root cause analysis. While ROSA includes built-in monitoring, customers often want to integrate with AWS DevOps Agent to benefit from its AI-driven capabilities alongside their existing AWS observability infrastructure.

This post demonstrates a solution approach for enabling AWS DevOps Agent to monitor and investigate issues in your ROSA clusters by routing telemetry through AWS CloudWatch Container Insights. This pattern allows you to use AI-powered investigations while maintaining unified observability across your AWS environment—without requiring complex multi-source integrations.

The Power of AI-Driven Observability

Traditional observability approaches require DevOps teams to manually correlate metrics, logs, and cluster state across multiple tools when investigating incidents. This process is time-consuming, error-prone, and requires deep expertise in both Kubernetes and AWS services. The AWS DevOps Agent changes this paradigm by bringing AI-powered investigation capabilities directly into your observability workflow.

The AWS DevOps Agent is an intelligent assistant that can autonomously investigate issues in your infrastructure by querying multiple data sources, correlating findings, and providing root cause analysis with actionable recommendations. When integrated with ROSA and CloudWatch, the DevOps Agent can:

  • Detect anomalies proactively by continuously monitoring CloudWatch metrics for unusual patterns such as CPU spikes, memory pressure, or increased error rates
  • Correlate metrics with logs to understand not just that a problem occurred, but why it occurred, by analyzing application logs and system events in context
  • Assess cluster state by analyzing the pod, node, and container telemetry that Container Insights publishes to CloudWatch, such as pod status, restart counts, and resource usage (in this pattern the agent does not query the Kubernetes API directly)
  • Provide root cause analysis by synthesizing information from metrics, logs, and cluster state to identify the underlying cause of issues
  • Suggest remediation steps based on AWS and Kubernetes best practices, reducing mean time to resolution (MTTR)
  • Learn from patterns to improve detection and recommendations over time

With the DevOps Agent, teams can ask natural language questions such as “Why is the payment-service experiencing high latency?” instead of spending hours manually piecing together information from CloudWatch dashboards, log queries, and kubectl commands.

This integration transforms observability from a reactive, manual process into a proactive, AI-assisted workflow that scales with your organization.

Understanding the Solution Approach

ROSA clusters generate vast amounts of telemetry data—metrics from nodes, pods, and containers, along with application and system logs. While ROSA includes built-in Prometheus for metrics collection, customers deploying workloads across both ROSA and native AWS services often want to:

  • Centralize observability data across ROSA clusters and AWS services in CloudWatch for unified monitoring
  • Enable long-term retention of metrics and logs for compliance, capacity planning, and historical analysis
  • Automate incident investigation using AI-powered agents that can correlate metrics and logs
  • Integrate with existing AWS tooling for alerting, dashboards, and operational workflows

Prerequisites

Before implementing this solution, ensure you have:

  • AWS account with administrator access
  • Existing ROSA cluster (or ability to create one)
  • Basic understanding of Amazon CloudWatch, AWS Identity and Access Management (IAM), and ROSA and Kubernetes concepts
  • AWS CLI and `oc` (OpenShift CLI) installed
  • `kubectl` or `oc` access to your ROSA cluster
  • Helm 3.x installed
  • AWS DevOps Agent enabled in your AWS account

Solution Approach

This solution delivers a practical approach for enabling AWS DevOps Agent to monitor ROSA clusters using CloudWatch Container Insights as the integration layer. It is designed around four core principles:

AWS-Native Observability

Use AWS CloudWatch Container Insights, AWS’s official Kubernetes monitoring solution, instead of attempting to forward the cluster’s Prometheus metrics directly to Amazon Managed Service for Prometheus (AMP). This avoids metadata compatibility issues and provides a supported, production-ready path.

Comprehensive Data Collection

Deploy both metrics collection (CloudWatch Agent) and log aggregation (Fluent Bit) as DaemonSets to capture complete cluster telemetry, including node metrics, pod metrics, container logs, and system logs.

Centralized CloudWatch Access

Configure the AWS DevOps Agent with access to CloudWatch as the unified observability platform: CloudWatch Metrics for performance monitoring and anomaly detection, and CloudWatch Logs for detailed application and system log analysis with Kubernetes metadata (pod names, namespaces, labels). All cluster telemetry flows through CloudWatch, providing a single pane of glass for AI-powered investigations without requiring direct cluster access.

Cost-Conscious Design

Implement pause/resume capabilities, log filtering, and retention policies to keep CloudWatch costs predictable and manageable, especially during development and testing phases.

Bringing It All Together: Integration Architecture

The integration between AWS CloudWatch Container Insights and ROSA uses a Helm-based deployment model that automatically collects and forwards metrics and logs to CloudWatch. AWS DevOps Agent then queries this centralized observability data to perform AI-powered investigations.

Figure-1: ROSA integration with AWS DevOps Agent using CloudWatch Container Insights

When you deploy the CloudWatch observability stack to your ROSA cluster, it provisions two key components as DaemonSets: the CloudWatch Agent collects cluster and container metrics, while Fluent Bit aggregates logs from all pods and nodes. Both components automatically enrich telemetry with Kubernetes metadata—pod names, namespaces, labels, and container IDs—making logs and metrics instantly searchable and correlatable.

AWS DevOps Agent connects to CloudWatch (not directly to the ROSA cluster) and uses this enriched telemetry to perform intelligent investigations. When you ask the DevOps Agent to investigate an issue, it queries CloudWatch metrics for performance anomalies, searches CloudWatch logs for error patterns, and correlates the findings using the embedded Kubernetes metadata to identify root causes.

This architecture pairs comprehensive cluster visibility through AWS-native observability tools with AI-powered investigation that can replace hours of manual troubleshooting.

Getting Started: Step-by-Step Guide

Setting up CloudWatch Container Insights with AWS DevOps Agent for your ROSA cluster is straightforward. Here’s how to get started:

Step 1: Create IAM Role for CloudWatch Agents

The foundation of secure integration is IAM Roles for Service Accounts (IRSA), which allows Kubernetes service accounts to assume AWS IAM roles without storing credentials in the cluster.

First, create an IAM role with the necessary CloudWatch permissions:

# Create IAM role for CloudWatch agents
aws iam create-role \
  --role-name ROSACloudWatchRole \
  --assume-role-policy-document file://trust-policy.json

# Attach AWS managed policy for CloudWatch Agent
aws iam attach-role-policy \
  --role-name ROSACloudWatchRole \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy

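The trust policy, saved as trust-policy.json before you run create-role, must reference your ROSA cluster’s OIDC provider. Replace ACCOUNT_ID with your AWS account ID and ROSA_OIDC_PROVIDER with the cluster’s OIDC provider URL without the https:// prefix: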

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::ACCOUNT_ID:oidc-provider/ROSA_OIDC_PROVIDER"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "ROSA_OIDC_PROVIDER:sub": "system:serviceaccount:amazon-cloudwatch:cloudwatch-agent"
      }
    }
  }]
}
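The placeholders in the trust policy can also be filled in from the shell. The following is a minimal sketch, not an official tool: render_trust_policy is a hypothetical helper name, and the lookup commands in the trailing comments assume a logged-in aws and oc session on a ROSA cluster that uses STS.

```shell
# Render the trust policy from an account ID and an OIDC issuer URL.
# render_trust_policy is a hypothetical helper name, not part of any AWS or ROSA tool.
render_trust_policy() {
  account_id="$1"
  # IAM expects the provider host and path without the URL scheme.
  oidc_provider="${2#https://}"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::${account_id}:oidc-provider/${oidc_provider}"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "${oidc_provider}:sub": "system:serviceaccount:amazon-cloudwatch:cloudwatch-agent"
      }
    }
  }]
}
EOF
}

# Example usage (assumes logged-in aws and oc CLIs; not executed here):
#   ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
#   OIDC=$(oc get authentication.config.openshift.io cluster -o jsonpath='{.spec.serviceAccountIssuer}')
#   render_trust_policy "$ACCOUNT_ID" "$OIDC" > trust-policy.json
```

Generating the file this way keeps the Federated ARN and the condition key in sync, which is an easy place to make a copy-and-paste mistake.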

Step 2: Deploy CloudWatch Observability Helm Chart

AWS provides an official Helm chart that deploys both the CloudWatch Agent (for metrics) and Fluent Bit (for logs) as DaemonSets in your cluster.

# Add the AWS Observability Helm repository
helm repo add aws-observability https://aws-observability.github.io/helm-charts
helm repo update

# Install the CloudWatch observability stack
helm install cloudwatch-agent aws-observability/amazon-cloudwatch-observability \
  --namespace amazon-cloudwatch \
  --create-namespace \
  --set clusterName=my-rosa-cluster \
  --set region=us-east-1 \
  --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"="arn:aws:iam::ACCOUNT_ID:role/ROSACloudWatchRole"

This single command deploys:

  • CloudWatch Agent DaemonSet: Collects metrics from kubelet and node exporters
  • Fluent Bit DaemonSet: Collects and forwards container and system logs
  • Controller Manager: Manages the lifecycle of observability components
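After the install, it can help to confirm that the DaemonSets rolled out and that the service account carries the role annotation. The sketch below assumes the resource names used in this post (cloudwatch-agent, fluent-bit, and the cloudwatch-agent service account); check_observability_stack is a hypothetical helper name, and the names may differ between chart versions.

```shell
# Wait for both DaemonSets to finish rolling out, then print the role ARN
# annotation on the agent's service account.
# check_observability_stack is a hypothetical helper; resource names are
# assumed from this post and may differ in your chart version.
check_observability_stack() {
  ns="${1:-amazon-cloudwatch}"
  oc rollout status daemonset/cloudwatch-agent -n "$ns" --timeout=120s || return 1
  oc rollout status daemonset/fluent-bit -n "$ns" --timeout=120s || return 1
  oc get serviceaccount cloudwatch-agent -n "$ns" \
    -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'
}

# Example usage (requires a logged-in oc session; not executed here):
#   check_observability_stack amazon-cloudwatch
```

An empty result from the last command suggests that the annotation was not applied, in which case the pods cannot assume the IAM role.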

Step 3: Verify Metrics and Logs Flow

The CloudWatch Agent DaemonSet runs on every node in your ROSA cluster, collecting metrics from multiple sources:

  • Node-level metrics: CPU, memory, disk, and network utilization
  • Pod-level metrics: Resource usage, restart counts, and status
  • Container-level metrics: Per-container CPU and memory consumption
  • Cluster-level metrics: Aggregated health and capacity metrics

Verify metrics in CloudWatch:

# List metrics in ContainerInsights namespace
aws cloudwatch list-metrics \
  --namespace ContainerInsights \
  --dimensions Name=ClusterName,Value=my-rosa-cluster
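Listing metrics confirms that they exist, but not that fresh datapoints are arriving. As a rough sketch, the helper below pulls the last hour of average node CPU utilization. recent_node_cpu is a hypothetical name, node_cpu_utilization is assumed to appear in your list-metrics output, and the time window relies on the AWS CLI accepting Unix epoch seconds for timestamp parameters.

```shell
# Fetch average node CPU utilization for the last hour in 5-minute buckets.
# recent_node_cpu is a hypothetical helper name. Confirm that
# node_cpu_utilization appears in your list-metrics output before using it.
recent_node_cpu() {
  cluster="$1"
  end_time="$(date -u +%s)"           # now, as Unix epoch seconds
  start_time="$((end_time - 3600))"   # one hour earlier
  aws cloudwatch get-metric-statistics \
    --namespace ContainerInsights \
    --metric-name node_cpu_utilization \
    --dimensions "Name=ClusterName,Value=${cluster}" \
    --start-time "$start_time" \
    --end-time "$end_time" \
    --period 300 \
    --statistics Average
}

# Example usage (not executed here):
#   recent_node_cpu my-rosa-cluster
```

An empty Datapoints list usually means that the agent is running but cannot publish, which points back to the IAM role and trust policy from Step 1.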

Step 4: Configure Log Collection

Fluent Bit runs alongside the CloudWatch Agent and forwards logs to CloudWatch log groups. Together, the two components populate the following log groups:

  • /aws/containerinsights/CLUSTER_NAME/application: Application container logs
  • /aws/containerinsights/CLUSTER_NAME/dataplane: Kubernetes data plane logs, such as kubelet and container runtime logs
  • /aws/containerinsights/CLUSTER_NAME/host: Node-level system logs
  • /aws/containerinsights/CLUSTER_NAME/performance: Performance log events written by the CloudWatch Agent

Fluent Bit automatically enriches logs with Kubernetes metadata (namespace, pod name, container name, labels) making them easily searchable and filterable in CloudWatch Logs Insights.

Verify logs in CloudWatch:

# List log groups
aws logs describe-log-groups \
  --log-group-name-prefix /aws/containerinsights/my-rosa-cluster

# Query recent logs
aws logs tail /aws/containerinsights/my-rosa-cluster/application --follow
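Because the logs carry Kubernetes metadata, they can be filtered by namespace with CloudWatch Logs Insights. The sketch below is illustrative only: build_error_query and start_error_query are hypothetical helper names, and the field names (kubernetes.namespace_name, kubernetes.pod_name, log) are assumed from the Fluent Bit Kubernetes metadata, so check them against a sample log event first.

```shell
# Build a Logs Insights query that returns recent error lines for one namespace.
# Field names are assumptions; adjust them to match your log events.
build_error_query() {
  namespace="$1"
  printf 'fields @timestamp, kubernetes.pod_name, log | filter kubernetes.namespace_name = "%s" and log like /(?i)error/ | sort @timestamp desc | limit 20' "$namespace"
}

# Start the query over the last hour and print the query ID.
# start_error_query is a hypothetical helper name.
start_error_query() {
  cluster="$1"
  namespace="$2"
  end_time="$(date -u +%s)"
  aws logs start-query \
    --log-group-name "/aws/containerinsights/${cluster}/application" \
    --start-time "$((end_time - 3600))" \
    --end-time "$end_time" \
    --query-string "$(build_error_query "$namespace")" \
    --query queryId --output text
}

# Example usage (not executed here):
#   QUERY_ID=$(start_error_query my-rosa-cluster payment-service)
#   aws logs get-query-results --query-id "$QUERY_ID"
```

Logs Insights queries run asynchronously, so get-query-results may need to be repeated until the status is Complete.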

Step 5: Configure AWS DevOps Agent Access

To enable the AWS DevOps Agent to investigate issues in your ROSA cluster, configure IAM permissions for CloudWatch access. The DevOps Agent queries CloudWatch metrics and logs—it does not require direct access to the ROSA Kubernetes API.

Add CloudWatch permissions to DevOps Agent IAM role: The DevOps Agent’s IAM role needs permissions to query CloudWatch metrics and logs.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:GetMetricData",
        "cloudwatch:GetMetricStatistics",
        "cloudwatch:ListMetrics",
        "cloudwatch:DescribeAlarms"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:FilterLogEvents",
        "logs:GetLogEvents",
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams",
        "logs:StartQuery",
        "logs:GetQueryResults"
      ],
      "Resource": "arn:aws:logs:*:*:log-group:/aws/containerinsights/*"
    }
  ]
}

Apply the policy to your DevOps Agent role:

# Create the policy file
cat > devops-agent-cloudwatch-policy.json << 'EOF'
[paste the JSON policy above]
EOF

# Attach to DevOps Agent role
aws iam put-role-policy \
  --role-name DevOpsAgentRole-AgentSpace-XXXXX \
  --policy-name DevOpsAgentCloudWatchAccess \
  --policy-document file://devops-agent-cloudwatch-policy.json

Step 6: Start Investigating with AI

Once connected, you’re ready to investigate. Your AWS DevOps Agent now has read access to the CloudWatch metrics and Container Insights logs from your ROSA cluster. You can:

  • Ask natural language questions about cluster performance and errors
  • Investigate high resource usage with AI-powered root cause analysis
  • Search logs across namespaces for specific error patterns
  • Correlate metrics spikes with log events automatically
  • Receive actionable remediation recommendations based on AWS best practices

Example investigations:

  • “Show me CPU usage for pods in the payment-service namespace over the last hour”
  • “Find error logs from the payment-service namespace and identify the root cause”
  • “Which pods have restarted in the last 24 hours and why?”
  • “Investigate high memory usage in the payment-service namespace”

Step 7: CloudWatch Monitoring and Alerting

CloudWatch alarms provide proactive monitoring of your ROSA cluster health:

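# Create alarm for high pod CPU usage, averaged across the cluster
# (Container Insights metrics are published with dimensions, so the alarm must name them)
aws cloudwatch put-metric-alarm \
  --alarm-name rosa-high-cpu \
  --alarm-description "Alert when pod CPU exceeds 80%" \
  --metric-name pod_cpu_utilization \
  --namespace ContainerInsights \
  --dimensions Name=ClusterName,Value=my-rosa-cluster \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2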

These alarms can trigger SNS notifications, Lambda functions, or directly invoke the DevOps Agent for automated investigation.
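As one illustration of the SNS option, the sketch below creates a topic, subscribes an email address, and attaches the topic as an alarm action. The helper name create_cpu_alarm_with_sns, the topic name rosa-alerts, and the alarm name rosa-high-cpu-notify are all hypothetical, and the recipient has to confirm the email subscription before notifications arrive.

```shell
# Create an SNS topic, subscribe an email address, and attach the topic to a
# pod CPU alarm. All names here are hypothetical examples.
create_cpu_alarm_with_sns() {
  cluster="$1"
  email="$2"
  topic_arn="$(aws sns create-topic --name rosa-alerts --query TopicArn --output text)" || return 1
  aws sns subscribe --topic-arn "$topic_arn" --protocol email \
    --notification-endpoint "$email" || return 1
  aws cloudwatch put-metric-alarm \
    --alarm-name rosa-high-cpu-notify \
    --namespace ContainerInsights \
    --metric-name pod_cpu_utilization \
    --dimensions "Name=ClusterName,Value=${cluster}" \
    --statistic Average --period 300 --evaluation-periods 2 \
    --threshold 80 --comparison-operator GreaterThanThreshold \
    --alarm-actions "$topic_arn"
}

# Example usage (not executed here):
#   create_cpu_alarm_with_sns my-rosa-cluster oncall@example.com
```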

Step 8: Cost Management and Optimization

CloudWatch Container Insights costs scale with cluster size and log volume. Implement these strategies to control costs:

Configure log retention policies:

# Set 7-day retention for application logs
aws logs put-retention-policy \
  --log-group-name /aws/containerinsights/my-rosa-cluster/application \
  --retention-in-days 7
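The same retention period can be applied to every Container Insights log group in one pass. This is a small sketch with hypothetical helper names (container_insights_log_groups, set_retention_all); it assumes that all four log groups already exist, because put-retention-policy fails for a group that has not been created yet.

```shell
# Print the four Container Insights log group names for a cluster.
container_insights_log_groups() {
  cluster="$1"
  for suffix in application dataplane host performance; do
    printf '/aws/containerinsights/%s/%s\n' "$cluster" "$suffix"
  done
}

# Apply one retention period (in days) to every group.
set_retention_all() {
  cluster="$1"
  days="$2"
  container_insights_log_groups "$cluster" | while read -r group; do
    aws logs put-retention-policy \
      --log-group-name "$group" \
      --retention-in-days "$days"
  done
}

# Example usage (not executed here):
#   set_retention_all my-rosa-cluster 7
```

CloudWatch Logs accepts only specific retention values (for example 1, 3, 5, 7, 14, or 30 days), so an arbitrary number of days may be rejected.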

Implement pause/resume for non-production environments:

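# DaemonSets cannot be scaled with `oc scale`; a node selector that matches no nodes pauses them instead.
# The chart's controller manager may revert this patch, so confirm with: oc get pods -n amazon-cloudwatch

# Pause CloudWatch collection (the label name is arbitrary)
oc patch daemonset cloudwatch-agent -n amazon-cloudwatch \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"observability-paused":"true"}}}}}'
oc patch daemonset fluent-bit -n amazon-cloudwatch \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"observability-paused":"true"}}}}}'

# Resume CloudWatch collection
oc patch daemonset cloudwatch-agent -n amazon-cloudwatch --type json \
  -p '[{"op":"remove","path":"/spec/template/spec/nodeSelector/observability-paused"}]'
oc patch daemonset fluent-bit -n amazon-cloudwatch --type json \
  -p '[{"op":"remove","path":"/spec/template/spec/nodeSelector/observability-paused"}]'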

Filter logs to reduce volume:

Configure Fluent Bit to exclude verbose or unnecessary logs by updating the Helm values:

fluentBit:
  config:
    filters: |
      [FILTER]
          Name    grep
          Match   application.*
          Exclude log (healthcheck|readiness)

Use metric filters for cost-effective alerting:

Instead of storing all logs, create metric filters that extract specific patterns and alert on those metrics:

aws logs put-metric-filter \
  --log-group-name /aws/containerinsights/my-rosa-cluster/application \
  --filter-name ErrorCount \
  --filter-pattern "[time, stream, level=ERROR*, ...]" \
  --metric-transformations \
    metricName=ApplicationErrors,metricNamespace=ROSA/Custom,metricValue=1

Cost Estimates and Optimization

When deploying AWS DevOps Agent with CloudWatch Container Insights on ROSA, customers should budget for costs across four components:

  • ROSA Cluster — Base cost for running the ROSA cluster (nodes, control plane, management fees). Observability does not add to ROSA costs directly, but larger clusters generate more telemetry.
  • CloudWatch  — CloudWatch Agent (DaemonSet) collects metrics continuously; Fluent Bit (DaemonSet) collects logs. Charges apply per metric per month and per GB of logs ingested and stored. Cost scales with cluster size, application verbosity, and retention policies. CloudWatch Logs Insights queries are charged per GB of log data scanned when DevOps Agent performs investigations and queries. Cost scales with investigation frequency.
  • AWS DevOps Agent — You pay only for the time the agent spends on operational tasks, billed per second. Agent time is the cumulative time the agent spends actively working across all activities for a given task. There are no upfront commitments or charges for when the agent is idle or waiting for work.
  • Data Transfer — Typically minimal unless you replicate to other regions or export data outside AWS

Cost Optimization: Implement log retention policies, log filtering to exclude non-essential logs, pause/resume DaemonSets in non-production environments, and metric filters for alerting to reduce overall observability costs.
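For budgeting, the per-metric and per-GB charges described above can be combined into a rough linear estimate. The sketch below is deliberately simplified: estimate_monthly_cost is a hypothetical helper, every rate is passed in as an argument rather than hard-coded, and the model ignores free tiers, log storage, Logs Insights scans, and DevOps Agent time.

```shell
# Estimate a monthly CloudWatch bill from four inputs. All rates are arguments
# on purpose: look up current prices for your Region before relying on this.
# estimate_monthly_cost is a hypothetical helper, and the model is linear only.
estimate_monthly_cost() {
  metric_count="$1"       # number of billed metrics
  price_per_metric="$2"   # USD per metric per month
  log_gb="$3"             # GB of logs ingested per month
  price_per_gb="$4"       # USD per GB ingested
  awk -v m="$metric_count" -v pm="$price_per_metric" \
      -v g="$log_gb" -v pg="$price_per_gb" \
      'BEGIN { printf "%.2f\n", m * pm + g * pg }'
}

# Example with placeholder rates (not real prices):
#   estimate_monthly_cost 100 0.5 20 0.25
```

With the placeholder inputs shown, the result is 100 × 0.5 + 20 × 0.25 = 55.00. Substituting real rates and your own metric and log volumes gives a first-order figure to compare against the actual bill.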

Real-World Benefits

This integration delivers tangible value for operations teams, development teams, and the organization as a whole.

Consider a practical example. A production incident occurs at 2 AM: pods in your payment-service namespace are experiencing high memory usage and intermittent crashes. Instead of an on-call engineer manually digging through CloudWatch dashboards, querying logs, and correlating timestamps, the AWS DevOps Agent detects the anomaly, searches logs for OOMKilled events, identifies which pods are affected, correlates memory metrics with application errors, and produces a root cause analysis with remediation steps. The on-call engineer receives an investigation report with the relevant metrics, log analysis, and actionable recommendations rather than a generic alert.

For Operations Teams

Reduce mean time to resolution (MTTR) by automating the investigation workflow. Instead of spending hours manually correlating metrics and logs across multiple tools, let AI do the heavy lifting while your team focuses on implementing fixes.

For Development Teams

Gain faster feedback on application performance and errors without needing deep expertise in Kubernetes or CloudWatch query languages. Natural language questions get detailed technical answers.

For Organizations

Improve reliability and customer experience by detecting and resolving issues faster. Reduce operational costs by minimizing manual investigation time and using centralized observability that scales across multiple ROSA clusters.

Verification and Testing

After deployment, verify that all components are functioning correctly:

Check CloudWatch Agent and Fluent Bit status:

# Verify pods are running
oc get pods -n amazon-cloudwatch

# Expected output:
# NAME                                  READY   STATUS    RESTARTS   AGE
# cloudwatch-agent-xxxxx                1/1     Running   0          5m
# fluent-bit-xxxxx                      1/1     Running   0          5m
# amazon-cloudwatch-observability-…   1/1     Running   0          5m

Verify metrics and logs in CloudWatch:

Rerun the `aws cloudwatch list-metrics`, `aws logs describe-log-groups`, and `aws logs tail` commands from Steps 3 and 4, and confirm that they return data for my-rosa-cluster.

Test DevOps Agent integration:

Deploy a test application and ask the DevOps Agent to investigate:

# Deploy test application
oc create namespace test-app
oc run nginx --image=nginx --namespace=test-app

# Wait a few minutes for metrics and logs to flow to CloudWatch

# Query DevOps Agent (in AWS Console)
“Show me CPU and memory usage for pods in the test-app namespace over the last 10 minutes”

# Query logs
“Check CloudWatch logs for any errors from the test-app namespace in the last hour”

# Query metrics with log correlation
“Are there any pods in test-app namespace with high restart counts? Show me their logs.”

Note: The DevOps Agent accesses data through CloudWatch only. It cannot query the ROSA Kubernetes API directly, so questions about deployment configurations, service specs, or real-time cluster state may not be answerable. Focus investigations on metrics and logs.

Clean Up

To avoid ongoing costs after testing, follow these cleanup steps:

Delete test applications:

oc delete namespace test-app

Uninstall CloudWatch observability stack:

helm uninstall cloudwatch-agent --namespace amazon-cloudwatch
oc delete namespace amazon-cloudwatch
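Uninstalling the chart does not remove the AWS-side resources, and stored logs continue to incur charges. The sketch below removes the log groups, the example alarm, and the IAM role created in this walkthrough. cleanup_aws_resources is a hypothetical helper, and the resource names are taken from the examples in this post, so adjust them if yours differ.

```shell
# Remove the AWS-side resources created in this walkthrough.
# cleanup_aws_resources is a hypothetical helper; names match this post's examples.
cleanup_aws_resources() {
  cluster="$1"
  # Deleting a log group also removes its metric filters and stored events.
  for suffix in application dataplane host performance; do
    aws logs delete-log-group \
      --log-group-name "/aws/containerinsights/${cluster}/${suffix}"
  done
  aws cloudwatch delete-alarms --alarm-names rosa-high-cpu
  # A managed policy must be detached before the role can be deleted.
  aws iam detach-role-policy \
    --role-name ROSACloudWatchRole \
    --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
  aws iam delete-role --role-name ROSACloudWatchRole
}

# Example usage (not executed here):
#   cleanup_aws_resources my-rosa-cluster
```

The inline policy added to the DevOps Agent role in Step 5 can be removed separately with aws iam delete-role-policy, using the same role name and policy name that were passed to put-role-policy.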

Conclusion

By integrating AWS DevOps Agent with ROSA clusters through CloudWatch Container Insights, organizations gain AI-powered observability that transforms how teams investigate and resolve incidents. This approach provides a supported, scalable, and cost-manageable path to unified observability—allowing ROSA customers to leverage the full power of intelligent incident investigation without the complexity of custom integrations. Start small with a non-production cluster, validate the approach, and scale confidently to production workloads knowing you have enterprise-grade observability and AI-powered investigation at your fingertips.

Ryan Niksch

Ryan Niksch is a Partner Solutions Architect focusing on application platforms, hybrid application solutions, and modernization. Ryan has worn many hats in his life and has a passion for tinkering and a desire to leave everything he touches a little better than when he found it.

Raman Pujani

Raman Pujani is a Sr. Solutions Architect at AWS, where he helps customers to accelerate their business transformation journey with AWS. He builds simplified and sustainable solutions for complex business problems with innovative technology. Besides work, he enjoys spending time with family, vacationing in the mountains, hiking, and music.

Vijay Sivaji

Vijay Sivaji is a Sr. Technical Account Manager who provides strategic technical guidance and proven practices to help customers achieve their business objectives. He delivers technical leadership through architecture optimization, problem-solving, and implementation planning, while building strong customer relationships at all levels. Vijay drives business value through cost optimization, operational efficiency, and program management, serving as a customer advocate within the organization.