AWS DevOps Agent

AWS DevOps Agent features

Always on, autonomous incident response

AWS DevOps Agent integrates with ticketing and alarming systems like ServiceNow to automatically launch investigations from incident tickets, accelerating incident response within your existing workflows to reduce mean time to resolution (MTTR).

You can also initiate and guide investigations using interactive chat. AWS DevOps Agent acts as a member of your operations team, working directly within your collaboration tools like ServiceNow and Slack to share findings and coordinate response. When needed, create an AWS Support case directly from an investigation, giving AWS Support experts immediate context for faster resolution.

AWS DevOps Agent integrates with observability tools, code repositories, and CI/CD pipelines to correlate and analyze telemetry, code, and deployment data, and it shares the hypotheses it explored, its observations, and its root cause findings. Through systematic investigations, AWS DevOps Agent identifies the root cause of issues stemming from system changes, input anomalies, resource limits, component failures, and dependency issues across your entire environment.

Once AWS DevOps Agent has identified the root cause, it provides detailed mitigation plans, which include actions to resolve the incident, validate success, and revert a change if needed. AWS DevOps Agent also provides agent-ready instructions that another frontier agent can implement, such as code improvements that the Kiro autonomous agent can apply.

Through systematic investigation of alarms stemming from system changes, input anomalies, resource limits, component failures, and dependency issues across your entire stack, AWS DevOps Agent guides DevOps teams with targeted mitigation steps, reducing mean time to resolution (MTTR) from hours to minutes. For example:

  • System changes: If an incident is caused by Amazon DynamoDB throttling and high latency after a recent code change that uses the table inefficiently, AWS DevOps Agent may recommend rolling back the change as an immediate mitigation.
  • System changes: If an incident is caused by Amazon SNS subscription errors due to filter policy mismatch following a code deployment, AWS DevOps Agent may recommend rolling back the code change that altered the message structure as an immediate mitigation to restore message flow.
  • Input anomalies: If an incident is caused by AWS Lambda throttling on notifications due to high traffic exceeding limits, AWS DevOps Agent may recommend increasing concurrency limits as an immediate mitigation.
  • Input anomalies: If an incident is caused by Amazon SNS message publish failures due to message size issues, AWS DevOps Agent may recommend adding validation to Amazon SNS message publishing as an immediate mitigation.
  • Resource limits: If an incident is caused by API throttling due to exceeded rate limits, AWS DevOps Agent may recommend raising rate/burst limits as an immediate mitigation.
  • Resource limits: If an incident is caused by Amazon DynamoDB throttling due to exceeded write capacity, AWS DevOps Agent may recommend increasing write capacity as an immediate mitigation.
  • Component failures: If an incident is caused by performance degradation due to cold start latency, AWS DevOps Agent may recommend increasing provisioned concurrency as an immediate mitigation.
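
To make the Amazon SNS input-validation example above concrete, here is a minimal Python sketch of a pre-publish size check. It is an illustration only, not output from AWS DevOps Agent: the function name `validate_sns_message` and the constant name are hypothetical, and the sketch assumes the payload is sent as compact JSON and that the standard 256 KB SNS message size limit applies.

```python
import json

# Assumption: the standard Amazon SNS limit of 256 KB (262,144 bytes) per message.
SNS_MAX_MESSAGE_BYTES = 256 * 1024


def validate_sns_message(payload: dict) -> str:
    """Serialize the payload and raise ValueError if it exceeds the size limit.

    Hypothetical helper: call it before publishing so oversized messages
    fail fast in the application instead of failing at publish time.
    """
    body = json.dumps(payload, separators=(",", ":"))
    size = len(body.encode("utf-8"))  # the limit applies to bytes, not characters
    if size > SNS_MAX_MESSAGE_BYTES:
        raise ValueError(
            f"message is {size} bytes; limit is {SNS_MAX_MESSAGE_BYTES} bytes"
        )
    return body
```

The returned string could then be passed as the message body to whichever SNS client the application already uses.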

Proactively prevent future incidents

AWS DevOps Agent analyzes patterns across historical incidents to provide actionable recommendations that strengthen four key areas: observability, infrastructure optimization, deployment pipeline enhancement, and application resilience. For example, in the area of infrastructure optimization, AWS DevOps Agent recommends the Kubernetes Horizontal Pod Autoscaler (HPA) for EKS clusters to handle unexpected traffic spikes. 
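
For readers unfamiliar with how the Horizontal Pod Autoscaler reacts to a traffic spike, the sketch below reproduces the scaling rule described in the Kubernetes documentation: the desired replica count is the current count multiplied by the ratio of the observed metric to its target, rounded up. The function name, the default bounds, and the sample numbers are assumptions chosen for illustration.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA rule: ceil(current_replicas * current / target).

    Hypothetical helper. No scaling happens while the ratio stays within
    `tolerance` of 1.0, and the result is clamped to the configured bounds.
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# A single pod at 200% CPU against a 50% target scales out to 4 pods.
print(desired_replicas(1, 200, 50))  # 4
```

The clamp to `max_replicas` matters in practice: a severe spike would otherwise request more pods than the cluster is sized to run.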

AWS DevOps Agent identifies gaps in observability coverage and opportunities to fine-tune your alarms, reducing the mean time to detection (MTTD) so you can identify issues before they become larger problems. For example, after identifying that incident detection for recent failures took too long, AWS DevOps Agent may recommend implementing monitoring and anomaly detection closer to the error source to reduce detection time, preventing extended outages.

Using a learning loop, AWS DevOps Agent incorporates your team's feedback to refine its recommendations over time, aligning them with your operational priorities and making them increasingly relevant to your organization's needs.

AWS DevOps Agent analyzes patterns across historical incidents to provide targeted recommendations that prevent future outages and strengthen system resilience. By evaluating real incidents, it delivers specific, actionable improvements that reduce both frequency and impact of similar issues in four key areas: observability, infrastructure optimization, deployment pipeline enhancement, and application resilience.

  • Observability improvement: AWS DevOps Agent may recommend adjusting alarm thresholds from 15 failures over 20 minutes to 3 failures within 5 minutes for critical authentication systems to reduce detection time, preventing extended integration outages.
  • Observability improvement: AWS DevOps Agent may recommend implementing targeted CloudWatch metric filters to track anomalous "Access Denied" patterns for IAM role changes, enabling faster detection compared to a prior alarm.
  • Infrastructure improvement: After determining that the Amazon DynamoDB table schema doesn't match the service's main access pattern, forcing inefficient full table scans, AWS DevOps Agent may recommend creating a Global Secondary Index (GSI) with the frequently queried attribute as the partition key. This would transform operations from Scans to Queries, reducing latency from 2,500-3,500 ms to under 100 ms and preventing throttling.
  • Infrastructure improvement: AWS DevOps Agent's analysis shows the application has adequate resources but is constrained by a single-pod bottleneck where all requests queue to one instance during traffic spikes. AWS DevOps Agent may recommend adding a Horizontal Pod Autoscaler to the Kubernetes cluster, which automatically scales the service horizontally based on demand, distributing the load across multiple pods.
  • Deployment pipeline: After analyzing failed Amazon ECS deployments, AWS DevOps Agent may recommend enabling automatic rollbacks and monitoring deployment states with Amazon EventBridge. These changes will quickly detect and address task health check failures, preventing disruption of customer transactions.
  • Deployment pipeline: After analyzing deployment failures, AWS DevOps Agent may recommend mandatory pre-deployment validation of Amazon Managed Service for Prometheus connectivity for Amazon ECS task definitions. This recommendation would reduce failed deployments by detecting connectivity issues during the deployment process.
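
To see why the tighter alarm threshold in the first observability example shortens detection time, the sketch below replays a stream of failure timestamps against both settings using a simple sliding window. It is a simplified model, not how Amazon CloudWatch evaluates alarms; the function name and the one-failure-per-minute input are assumptions for illustration.

```python
from collections import deque


def first_alarm_time(failure_times, threshold, window_seconds):
    """Return the first timestamp (in seconds) at which `threshold` failures
    fall inside one sliding window, or None if that never happens."""
    window = deque()
    for t in sorted(failure_times):
        window.append(t)
        # Drop failures that have aged out of the window ending at t.
        while window[0] <= t - window_seconds:
            window.popleft()
        if len(window) >= threshold:
            return t
    return None


# Hypothetical incident: one authentication failure per minute for 30 minutes.
failures = [60 * i for i in range(30)]

old = first_alarm_time(failures, threshold=15, window_seconds=20 * 60)
new = first_alarm_time(failures, threshold=3, window_seconds=5 * 60)
print(old, new)  # 840 120
```

Under this assumed failure rate, the original threshold fires 14 minutes into the incident and the recommended threshold fires after 2 minutes.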

Get more from your DevOps tools

As AWS DevOps Agent learns about your environment, it identifies your application resources such as containers, network components, log groups, alarms, and CI/CD deployments, and maps how they connect to create an application resource map. It combines this resource topology with your telemetry, code, and deployment data to precisely pinpoint root causes of issues.
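
One way to picture an application resource map is as a directed graph in which each resource points to the resources it depends on. The sketch below is purely illustrative and does not describe the agent's internal representation: the resource names, the `RESOURCE_MAP` structure, and the traversal helper are all hypothetical.

```python
from collections import deque

# Hypothetical topology: each resource maps to the resources it depends on.
RESOURCE_MAP = {
    "alarm:checkout-5xx": ["service:checkout"],
    "service:checkout": ["table:orders", "queue:payments"],
    "queue:payments": ["function:charge-card"],
    "table:orders": [],
    "function:charge-card": [],
}


def reachable_resources(resource_map, start):
    """Breadth-first walk that lists every resource reachable from `start`,
    nearest dependencies first."""
    seen = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dep in resource_map.get(node, []):
            if dep not in seen:
                seen.append(dep)
                queue.append(dep)
    return seen[1:]  # exclude the starting resource itself


print(reachable_resources(RESOURCE_MAP, "alarm:checkout-5xx"))
# ['service:checkout', 'table:orders', 'queue:payments', 'function:charge-card']
```

Starting from a firing alarm, a walk like this narrows an investigation to the components that could plausibly contribute to the failure.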

AWS DevOps Agent offers built-in integrations with many observability tools (Amazon CloudWatch, Dynatrace, Datadog, New Relic, and Splunk), code repositories, and CI/CD pipelines (GitHub Actions and repositories, GitLab Workflows and repositories). 

You can extend AWS DevOps Agent beyond its built-in integrations by connecting it to your own MCP server, enabling integrations with additional tools such as your organization's custom tools, specialized platforms, or proprietary ticketing systems. For example, through your own MCP server you can integrate open-source observability signals, such as Grafana alarms and Prometheus metrics, as well as runbooks in Confluence.
