Guidance for Troubleshooting Amazon EKS using Agentic AI workflow on AWS
Overview
This Guidance demonstrates how to reduce the complexity of troubleshooting Amazon EKS environments, where many metrics and logs must be correlated, by implementing an agentic AI workflow that uses generative AI with RAG-enabled knowledge bases and chat interfaces to accelerate problem diagnosis. The implementation uses Terraform to deploy a comprehensive EKS environment with managed node groups, an observability stack that includes Amazon Managed Service for Prometheus and Amazon Managed Grafana, and RBAC security mapped to IAM roles. A Slack-integrated AI agent system runs on the EKS cluster: an orchestrator agent receives troubleshooting requests and delegates tasks to specialized agents, which connect to Amazon S3 vector-based knowledge bases for semantic similarity matching against historical troubleshooting cases. This approach can significantly reduce your mean time to triage EKS issues while improving the accuracy of root cause analysis and remediation initiation across a range of infrastructure and application problems.
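The semantic similarity matching mentioned above can be illustrated with a minimal sketch. It is not the Guidance's implementation: the case titles, the three-dimensional embeddings, and the function names `cosine_similarity` and `top_matches` are hypothetical, and a real deployment would obtain embeddings from a model and query a vector store rather than an in-memory list.

```python
import math

# Hypothetical historical cases with toy 3-dimensional embeddings.
# Real embeddings come from an embedding model and have many more dimensions.
HISTORICAL_CASES = [
    {"title": "Pod stuck in CrashLoopBackOff after config change",
     "embedding": [0.9, 0.1, 0.0]},
    {"title": "Node NotReady due to disk pressure",
     "embedding": [0.1, 0.9, 0.1]},
    {"title": "Ingress returning 503 errors",
     "embedding": [0.0, 0.2, 0.9]},
]


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as matching nothing.
    return dot / (norm_a * norm_b)


def top_matches(query_embedding, cases, k=2):
    """Return the k (score, title) pairs most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, case["embedding"]), case["title"])
        for case in cases
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # A query embedding that points mostly along the first dimension.
    for score, title in top_matches([1.0, 0.0, 0.0], HISTORICAL_CASES):
        print(f"{score:.3f}  {title}")
```

The ranking step is the same idea a vector-based knowledge base applies at scale: the troubleshooting request is embedded, and the closest past cases are returned as context for the agent.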
Benefits
Deploy intelligent agents that analyze cluster data and historical cases to deliver actionable recommendations. Reduce mean time to resolution by automating diagnostic workflows through natural language interactions in Slack.
Implement infrastructure as code with Terraform to deploy secure, multi-AZ Amazon EKS environments with essential add-ons and observability. Ensure consistent configurations across development, staging, and production environments.
Empower DevOps teams, SREs, and developers to troubleshoot Kubernetes issues directly from Slack using ChatOps patterns. Leverage real-time cluster insights and semantic search of past incidents to resolve problems faster.
How it works
Provision EKS Cluster
This diagram shows how to provision an Amazon Elastic Kubernetes Service (Amazon EKS) cluster with a best-practice configuration and critical add-ons.
Agentic AI Workflow
This architecture diagram shows the agentic AI troubleshooting workflow, which works with real-time EKS cluster data and integrates with AWS AI services for analysis and with Slack for ChatOps.
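As a rough illustration of the orchestrator pattern in this workflow, the sketch below routes a request to one of several specialist agents. Everything in it is an assumption made for illustration: the agent names, the keyword table `ROUTES`, and the `orchestrate` function are hypothetical, and the keyword matching stands in for whatever model-driven delegation the actual agents perform.

```python
from typing import Callable, Dict


# Hypothetical specialist agents. Each returns a description of what it would
# do instead of calling a real cluster or AWS service.
def logs_agent(request: str) -> str:
    return f"[logs] would inspect pod logs for: {request}"


def metrics_agent(request: str) -> str:
    return f"[metrics] would query cluster metrics for: {request}"


def knowledge_agent(request: str) -> str:
    return f"[knowledge] would search historical cases for: {request}"


# Illustrative keyword-to-agent table; a real orchestrator would more likely
# let a language model choose the specialist.
ROUTES: Dict[str, Callable[[str], str]] = {
    "log": logs_agent,
    "crash": logs_agent,
    "cpu": metrics_agent,
    "memory": metrics_agent,
    "latency": metrics_agent,
}


def orchestrate(request: str) -> str:
    """Delegate the request to the first matching specialist agent."""
    lowered = request.lower()
    for keyword, agent in ROUTES.items():
        if keyword in lowered:
            return agent(request)
    # Fall back to searching past incidents when no keyword matches.
    return knowledge_agent(request)


if __name__ == "__main__":
    print(orchestrate("Pod keeps crashing after deploy"))
    print(orchestrate("High CPU on the payments node group"))
    print(orchestrate("DNS resolution failing intermittently"))
```

Keeping each specialist behind a plain function signature makes it straightforward to swap the routing rule without changing the agents themselves.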
Deploy with confidence
Everything you need to launch this AWS Solution in your account is right here.
We'll walk you through it
Get started fast. Read the implementation guide for deployment steps, architecture details, cost information, and customization options.
Let's make it happen
Ready to deploy? Review the sample code on GitHub for detailed deployment instructions, then deploy it as-is or customize it to fit your needs.
Disclaimer
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.