IBM & Red Hat on AWS

Intelligent Remediation for application workloads on ROSA with Instana and Ansible

Overview

Customers adopt Red Hat OpenShift Service on AWS (ROSA) because it is a managed service: Red Hat manages OpenShift and keeps the cluster available, so the customer can focus on what is critical to the business. The customer remains responsible, however, for the application workloads that run on OpenShift. These workloads can still experience performance degradation, pod failures, or resource constraints, and identifying and remediating such issues manually increases downtime and operational overhead.

In this blog post we shall explore how to gain improved observability and automate actions in response.

IBM Instana is an AI-powered observability solution that helps customers gain insight into the health, performance, and dependencies of their application workloads.

Watch this video: What is IBM Instana

Red Hat Ansible Automation Platform enables intelligent remediation by triggering automated corrective actions when issues are detected in real time.

Challenges in Monitoring and Remediation

  • Complexity of Application Workloads: Dynamic containerized environments introduce complexity in detecting and addressing performance issues.
  • Manual Intervention: Traditional troubleshooting and remediation require human intervention, leading to delays and inefficiencies.
  • Lack of Contextual Insights: Identifying the root cause of issues across distributed microservices can be time-consuming.
  • Scalability Concerns: Managing remediation at scale in large environments can be challenging without automation.

Solution: Instana + Ansible for Intelligent Remediation

How It Works

  1. Instana monitors application workloads running on ROSA in real time
    • Provides deep visibility into nodes, pods, services, and application dependencies.
    • Detects performance anomalies, pod failures, and resource constraints using AI-driven insights.
  2. Automated creation of alerts with Ansible Automation Platform
    • Ansible is used to configure and manage alerting rules in Instana for key application workload and OpenShift performance metrics.
    • Ensures that alerts are automatically created based on predefined thresholds (e.g., high CPU/memory usage, pod restarts, or network latency).
    • Eliminates manual configuration efforts, ensuring consistency across multiple ROSA clusters.
  3. Instana triggers automated remediation
    • When an issue is detected (e.g., high CPU usage, pod crash loops, or service unavailability), Instana generates an event or incident.
    • Instana’s AI engines discover the root cause of the issue and determine the most appropriate Ansible artifact to invoke.
    • Support cases and change control requests can be created through integrations with solutions such as Jira and ServiceNow.
    • The corrective Ansible automation is triggered in the Ansible Automation Platform.
    • Owning stakeholders, such as application or database owners and infrastructure teams, are notified.
  4. Ansible executes remediation playbooks
    • Based on predefined automation workflows, Ansible runs corrective actions such as:
      • Restarting failed pods
      • Scaling resources up or down
      • Rolling back faulty deployments
      • Clearing problematic caches or restarting dependent services
      • Notifying DevOps teams via Slack, Microsoft Teams, or email
  5. Continuous feedback loop
    • Post-remediation, Instana verifies the system’s stability and confirms that the corrective actions resolved the issue.
    • If further action is needed, additional automation workflows can be triggered.
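
To make the flow above more concrete, here is a minimal Python sketch of the dispatch step: taking a detected event and working out which Ansible Automation Platform job template to launch. In the product, Instana selects and triggers the automation itself, so this is only an illustration of the idea and not how either product is implemented. The event fields (`id`, `type`, `namespace`, `workload`), the `REMEDIATIONS` table, and the job template IDs are all hypothetical. The launch path follows the automation controller REST API (`/api/v2/job_templates/<id>/launch/`), and passing `extra_vars` assumes the job template is configured to prompt for variables on launch.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from an event type to the ID of the job template
# that remediates it. Real IDs come from your own controller.
REMEDIATIONS = {
    "pod_crash_loop": 11,
    "high_cpu": 12,
    "service_unavailable": 13,
}


@dataclass(frozen=True)
class LaunchRequest:
    """The URL and JSON body of a job template launch call."""
    url: str
    body: dict


def plan_remediation(event: dict, controller_url: str) -> Optional[LaunchRequest]:
    """Return the launch request for an event, or None if nothing is mapped.

    The event shape is a simplified assumption, not the Instana payload.
    """
    template_id = REMEDIATIONS.get(event.get("type"))
    if template_id is None:
        # No automation is defined for this event type; leave it to a human.
        return None

    base = controller_url.rstrip("/")
    url = f"{base}/api/v2/job_templates/{template_id}/launch/"
    body = {
        "extra_vars": {
            "namespace": event["namespace"],
            "workload": event["workload"],
            "event_id": event["id"],
        }
    }
    return LaunchRequest(url=url, body=body)


if __name__ == "__main__":
    sample = {"id": "e1", "type": "high_cpu", "namespace": "shop", "workload": "cart"}
    print(plan_remediation(sample, "https://aap.example.com"))
```

The function only builds the request. Sending it would be an authenticated HTTP POST to the returned URL, and returning `None` for unmapped event types keeps unknown problems on the manual path instead of guessing at a fix.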

Benefits of the Solution

  • Reduced Mean Time to Resolution (MTTR): Automating issue detection and remediation reduces downtime and operational delays.
  • Improved Reliability and Performance: Ensures ROSA workloads run optimally with minimal human intervention.
  • Scalable and Consistent Operations: Automated workflows scale across multiple clusters and environments.
  • Enhanced DevOps Efficiency: Frees up engineering teams from manual troubleshooting, allowing them to focus on innovation.

Use Case Example

Scenario: A microservices application running on ROSA experiences pod failures due to an out-of-memory (OOM) error.

Instana Detection: Instana detects the OOM event and identifies the affected service.

Ansible Automation Platform Remediation:

  • Ansible is triggered automatically.
  • The playbook increases the memory limits for the affected pod and redeploys it.
  • A notification is sent to the DevOps team with the remediation details.

Result: The application recovers without manual intervention, ensuring a seamless end-user experience.
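
The remediation step in this scenario can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the playbook used by the authors: the limit is raised by a fixed factor of 1.5, capped at a ceiling of 4Gi, and only `Mi` and `Gi` quantities are handled. The function names are hypothetical. The sketch also assumes the pod is owned by a Deployment, so the change is expressed as a patch to the Deployment's pod template, which causes the pods to be rolled out again with the new limit.

```python
import math

# Only the two binary suffixes used in this sketch are supported.
_UNITS_MIB = {"Mi": 1, "Gi": 1024}


def to_mib(quantity: str) -> int:
    """Convert a memory quantity such as '512Mi' or '2Gi' to MiB."""
    for suffix, factor in _UNITS_MIB.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    raise ValueError(f"unsupported memory quantity: {quantity!r}")


def next_memory_limit(current: str, factor: float = 1.5, ceiling: str = "4Gi") -> str:
    """Propose a larger memory limit, never exceeding the ceiling."""
    proposed = math.ceil(to_mib(current) * factor)
    return f"{min(proposed, to_mib(ceiling))}Mi"


def memory_patch(container: str, new_limit: str) -> dict:
    """Build a patch body that sets the memory limit of one container.

    Containers in a pod template are matched by name when the patch is
    applied as a strategic merge patch.
    """
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": container,
                            "resources": {"limits": {"memory": new_limit}},
                        }
                    ]
                }
            }
        }
    }


if __name__ == "__main__":
    limit = next_memory_limit("512Mi")  # -> "768Mi"
    print(memory_patch("cart", limit))
```

The ceiling is the important design choice here. Without it, a workload with a memory leak would be given more memory on every OOM event, so once the ceiling is reached the issue should be escalated to the owning team instead of being remediated again.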

Figure 1: Execution layers of event-driven automation

Conclusion

By integrating Instana’s AI-driven observability with Ansible’s automation capabilities, enterprises running workloads on ROSA can achieve intelligent, automated remediation. This approach minimizes downtime, enhances performance, and optimizes operational efficiency, making it a vital strategy for modern cloud-native applications.

Ryan Niksch

Ryan Niksch is a Partner Solutions Architect focusing on application platforms, hybrid application solutions, and modernization. Ryan has worn many hats in his life and has a passion for tinkering and a desire to leave everything he touches a little better than when he found it.

Hicham Mourad

Hicham is responsible for technical marketing of the Red Hat Ansible Automation Platform on Clouds. Hicham has been in the software industry for over 20 years, many of them focused on cloud management. Hicham has been a frequent presenter at events and conferences such as VMworld, vForum, VMUG, VMLive, Gartner, Dell Technology World, AWS re:Invent, HPE Discover, Cloud Field Day, Red Hat Summit, and AnsibleFest, as well as customer events.

Matthew Packer

Matthew Packer is a Principal Product Marketing Manager for Ansible Automation Platform and is responsible for cloud automation. Prior to joining Red Hat, he worked in product marketing specializing in retail payment technology at Vontier and product management at Cisco in cloud-based networking. Matthew also worked as a consultant at Honeywell in the manufacturing and utilities industries with a focus on the Internet of Things (IoT) and predictive analytics space.

Thanos Matzanas

Thanos Matzanas is a Staff Product Manager and the AWS Alliance Lead for IBM Instana. He has been in the monitoring and observability field for over a decade, helping clients achieve business goals through the use of observability solutions. In his current role, he leads the product’s integrations with AWS and focuses on increasing Instana’s visibility within the AWS ecosystem.