IBM & Red Hat on AWS
Intelligent Remediation for application workloads on ROSA with Instana and Ansible
Overview
Customers adopt Red Hat OpenShift Service on AWS (ROSA) because it is a managed service: Red Hat manages OpenShift and keeps the cluster available, so the customer can focus on what is critical to the business. The customer is still responsible, however, for the application workloads that run on OpenShift. These workloads can experience performance degradation, pod failures, or resource constraints, and identifying and remediating such issues manually can increase downtime and operational overhead.
In this blog post, we explore how to gain better observability of these workloads and how to automate the response when issues are detected.
IBM Instana is an AI-powered observability solution that helps customers gain insight into application workloads, health, performance, and dependencies.
Watch this video: What is IBM Instana
Red Hat Ansible Automation enables intelligent remediation by detecting issues in real time and triggering automated corrective actions.
Challenges in Monitoring and Remediation
- Complexity of Application Workloads: Dynamic containerized environments make performance issues harder to detect and address.
- Manual Intervention: Traditional troubleshooting and remediation require human intervention, leading to delays and inefficiencies.
- Lack of Contextual Insights: Identifying the root cause of issues across distributed microservices can be time-consuming.
- Scalability Concerns: Managing remediation at scale in large environments can be challenging without automation.
Solution: Instana + Ansible for Intelligent Remediation
How It Works
- Instana monitors application workloads running on ROSA in real time
  - Provides deep visibility into nodes, pods, services, and application dependencies.
  - Detects performance anomalies, pod failures, and resource constraints using AI-driven insights.
- Automated creation of alerts with Ansible Automation Platform
  - Ansible configures and manages alerting rules in Instana for key application workload and OpenShift performance metrics.
  - Alerts are created automatically based on predefined thresholds (e.g., high CPU/memory usage, pod restarts, or network latency).
  - This eliminates manual configuration and keeps alerting consistent across multiple ROSA clusters.
- Instana triggers automated remediation
  - When an issue is detected (e.g., high CPU usage, pod crash loops, or service unavailability), Instana generates an event or incident.
  - Instana’s AI engines discover the root cause of the issue and determine the most appropriate Ansible artifact to invoke.
  - Support cases and change control requests can be created through integrations with solutions such as Jira and ServiceNow.
  - The corrective Ansible automation is triggered in the Ansible Automation Platform.
  - Owning stakeholders, such as application or database owners and infrastructure teams, are notified.
- Ansible executes remediation playbooks
  - Based on predefined automation workflows, Ansible runs corrective actions such as:
    - Restarting failed pods
    - Scaling resources up or down
    - Rolling back faulty deployments
    - Clearing a problematic cache or restarting dependent services
    - Notifying DevOps teams via Slack, Microsoft Teams, or email
- Continuous feedback loop
  - After remediation, Instana verifies that the system is stable and that the corrective actions resolved the issue.
  - If further action is needed, additional automation workflows can be triggered.
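The flow above can be sketched in a few lines of Python. This is a self-contained simulation of the detect, remediate, and verify loop, not the Instana or Ansible Automation Platform API. The `Event` fields, the thresholds, the `PLAYBOOKS` mapping, the playbook file names, and the `launch` and `is_healthy` callbacks are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical mapping from an event type to the playbook that remediates it.
PLAYBOOKS = {
    "high_cpu": "scale_up.yml",
    "pod_crash_loop": "restart_pods.yml",
    "service_unavailable": "rollback_deployment.yml",
}


@dataclass
class Event:
    """Simplified stand-in for an event raised by the observability tool."""
    kind: str
    service: str


def detect(metrics: dict, cpu_threshold: float = 0.9,
           max_restarts: int = 3) -> Optional[Event]:
    """Raise an event when a predefined (assumed) threshold is crossed."""
    if metrics["restarts"] > max_restarts:
        return Event("pod_crash_loop", metrics["service"])
    if metrics["cpu"] > cpu_threshold:
        return Event("high_cpu", metrics["service"])
    return None


def remediate(event: Event,
              launch: Callable[[str, str], None],
              is_healthy: Callable[[str], bool],
              max_attempts: int = 2) -> List[str]:
    """Run the mapped playbook, verify the result, and retry or escalate."""
    log: List[str] = []
    playbook = PLAYBOOKS.get(event.kind)
    if playbook is None:
        # No known automation for this event: hand over to a human.
        log.append(f"escalate:{event.service}")
        return log
    for _ in range(max_attempts):
        launch(playbook, event.service)   # stand-in for starting an automation job
        log.append(f"run:{playbook}")
        if is_healthy(event.service):     # feedback loop: verify after remediation
            log.append("resolved")
            return log
    log.append(f"escalate:{event.service}")
    return log
```

In a real deployment, `launch` would start a job in Ansible Automation Platform and `is_healthy` would query the observability tool. Keeping both as injected callbacks makes the control flow easy to test without either system.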
Benefits of the Solution
- Reduced Mean Time to Resolution (MTTR): Automating issue detection and remediation reduces downtime and operational delays.
- Improved Reliability and Performance: Ensures ROSA workloads run optimally with minimal human intervention.
- Scalable and Consistent Operations: Automated workflows scale across multiple clusters and environments.
- Enhanced DevOps Efficiency: Frees up engineering teams from manual troubleshooting, allowing them to focus on innovation.
Use Case Example
Scenario: A microservices application running on ROSA experiences pod failures due to an out-of-memory (OOM) error.
Instana Detection: Instana detects the OOM event and identifies the affected service.
Ansible Automation Platform Remediation:
- Ansible is triggered automatically.
- A playbook increases the memory limit for the affected pod and redeploys it.
- Notification is sent to the DevOps team with remediation details.
Result: The application recovers without manual intervention, ensuring a seamless end-user experience.
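The post does not say how the playbook chooses the new memory limit. One possible approach, sketched below in Python, is to raise the current limit by a fixed factor and cap it at a ceiling so that repeated OOM events cannot grow the limit without bound. The function names, the 1.5 growth factor, and the `4Gi` ceiling are assumptions for illustration, and the parser handles only plain byte counts and the `Ki`, `Mi`, and `Gi` suffixes.

```python
import re

# Subset of the binary suffixes used in Kubernetes memory quantities.
_UNITS = {"": 1, "Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '512Mi' to a number of bytes."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi)?", quantity)
    if match is None:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    number, suffix = match.groups()
    return int(number) * _UNITS[suffix or ""]


def next_memory_limit(current: str, factor: float = 1.5,
                      ceiling: str = "4Gi") -> str:
    """Return the raised limit in Mi, rounded up and capped at the ceiling."""
    raised = int(parse_memory(current) * factor)
    capped = min(raised, parse_memory(ceiling))
    mebibytes = -(-capped // 2**20)  # ceiling division to whole Mi
    return f"{mebibytes}Mi"
```

The resulting value could then be passed to the remediation playbook as a variable, for example `next_memory_limit("512Mi")` for a container whose current limit is `512Mi`.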
Figure 1: Execution layers of event-driven automation
Conclusion
By integrating Instana’s AI-driven observability with Ansible’s automation capabilities, enterprises running workloads on ROSA can achieve intelligent, automated remediation. This approach minimizes downtime, enhances performance, and optimizes operational efficiency, making it a vital strategy for modern cloud-native applications.