Investigation is the most time-intensive phase of an operational event. When things go wrong, it can be difficult to identify the root cause and to decide what is most important to fix. These resilience solutions help you find the root cause quickly, so you can remediate faster and reduce your mean time to recovery (MTTR).

AWS Services

Purpose-built cloud products

Amazon CloudWatch
Observe and monitor AWS resources and applications in the cloud and on premises
Amazon Managed Grafana
Scalable and secure data visualization for your operational metrics, logs, and traces
Amazon Managed Service for Prometheus
Highly available, secure, and managed monitoring for your containerized systems
AWS Distro for OpenTelemetry
Secure, production-ready open source distribution with predictable performance
AWS X-Ray
Analyze and debug production and distributed applications