Accelerate your Monitoring and Observability foundation through AWS Managed Services

To establish a strong foundation for efficiently and safely operating your workloads in the cloud, you must consider how you will monitor the health of your workloads. As described in the AWS Well-Architected Operational Excellence pillar, one of the cloud’s design principles for operational excellence is “Anticipate Failure.” Therefore, design your cloud operations with proactive monitoring and readiness for failures.

As organizations adopt AWS at scale, they face the challenge of hosting workloads that may have unique operational needs and end user SLAs, but they must balance that with the need for standardization of operational best practices. This can be time-consuming for Cloud Operations/Cloud Platform/IT teams that are new to adopting AWS at scale. The teams are further challenged to retain cloud talent and may lack AWS skills. In some cases, the IT teams are small in size and must focus on higher-order operations automation and product/feature innovation directly impacting business revenue. Partnering with AWS Managed Services (AMS) can help your teams focus on value-add services, while AMS offers ongoing support for proactive infrastructure monitoring, response, and recovery from incidents and failures.

In this post, we share how AWS Managed Services (AMS) can accelerate setting up monitoring and observability based on operational best practices, using AWS native tools in your AWS accounts.

What does Monitoring and Observability mean?

Operationalizing monitoring at scale involves three steps. First, ensure that all workloads are observable. This means that workloads emit health indicators that let you determine the health or performance of a workload. Second, collect and correlate the observability data (logs, metrics, and traces) and use it to determine the health or performance. Third, optimize your monitoring strategy. This includes assessing the collection of insightful logs, monitoring the right metrics, and responding at the right thresholds so that you can react to imminent infrastructure and application failures before they impact your business productivity. This step is typically iterative as your teams assess the optimal monitoring baselines for their workloads.

To anticipate failures, your teams must build and test incident response and remediation plans. A plan could be as simple as restarting a stopped Amazon Elastic Compute Cloud (Amazon EC2) instance or as complex as initiating a Disaster Recovery runbook that allows for timely recovery of critical applications and high availability. When you operationalize your monitoring strategy, it is important to work backwards from desired business outcomes like cost efficiency and service SLAs/SLOs. Where appropriate monitoring tools and processes have not been considered, business stakeholders are less able to react to operational incidents because they don’t have visibility into the health of infrastructure and applications.

AWS enables visibility into workload health from metrics, logs, and traces with Amazon CloudWatch. However, establishing monitoring and alerting thresholds and baselines for AWS services such as Amazon Relational Database Service (Amazon RDS), Elastic Load Balancing (ELB), Amazon EC2, and others requires iterations to arrive at an optimal state. You must correlate metrics, logs, and traces to ensure context-based health checks and response plans. For Amazon EC2-based workloads, you need to configure resources for observability, ensuring that CloudWatch agents are deployed on each instance in your environment. Before you start monitoring AWS services such as Amazon RDS, Amazon EC2, and ELB, you should also create a monitoring plan. This includes selecting the appropriate metrics to monitor, the conditions under which each metric triggers an alarm, and the associated action to respond and remediate. This monitoring plan is crucial for ensuring high availability in your workloads.
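To make the idea of a monitoring plan concrete, here is a minimal Python sketch of a plan expressed as data and translated into the request shape of the CloudWatch PutMetricAlarm API. The alarm names, thresholds, periods, and the `PlanEntry` and `to_alarm_kwargs` helpers are illustrative assumptions, not AMS baselines or recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanEntry:
    """One row of a monitoring plan: which metric to watch and when to alarm."""
    name: str
    namespace: str
    metric: str
    statistic: str
    comparison: str
    threshold: float
    period_seconds: int
    evaluation_periods: int


# Illustrative values only (assumptions); tune them for your own workloads.
MONITORING_PLAN = [
    PlanEntry("ec2-high-cpu", "AWS/EC2", "CPUUtilization", "Average",
              "GreaterThanThreshold", 90.0, 300, 3),
    PlanEntry("rds-low-storage", "AWS/RDS", "FreeStorageSpace", "Minimum",
              "LessThanThreshold", 10 * 1024 ** 3, 300, 2),  # bytes
    PlanEntry("alb-5xx", "AWS/ApplicationELB", "HTTPCode_ELB_5XX_Count", "Sum",
              "GreaterThanThreshold", 25.0, 60, 5),
]


def to_alarm_kwargs(entry, dimensions, action_arn):
    """Build keyword arguments in the shape of a PutMetricAlarm request."""
    return {
        "AlarmName": entry.name,
        "Namespace": entry.namespace,
        "MetricName": entry.metric,
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "Statistic": entry.statistic,
        "Period": entry.period_seconds,
        "EvaluationPeriods": entry.evaluation_periods,
        "Threshold": entry.threshold,
        "ComparisonOperator": entry.comparison,
        "AlarmActions": [action_arn],  # the response action, e.g. an SNS topic
    }
```

If boto3 is installed and credentials are configured, the resulting dictionary could be passed to `boto3.client("cloudwatch").put_metric_alarm(**kwargs)`. Keeping the plan as data makes it easier to review and iterate on thresholds as baselines mature.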

AMS as an enabler for Monitoring and Observability at scale

AMS enables your operations teams through automation and provides consistent monitoring and incident management across all account(s) and resources under AMS management. AMS leverages CloudWatch as the primary monitoring tool. Additionally, AMS collects logs, such as VPC Flow logs and CloudTrail logs, which help detect network traffic issues and non-conformance. Due to our extensive automation libraries and ability to pull diagnostic information ahead of an investigation, AMS detects and proactively notifies customers of 78% of performance-impacting incidents. AMS continues to iterate and learn from our operational experience, automate undifferentiated work, and improve our ability to detect an actual incident. Now let’s dive into what this offers you in your cloud adoption and migration journey:

Predefined and optimized monitoring baselines for AWS Account(s) – AMS simplifies the configuration of CloudWatch metrics and alarms. For AWS accounts onboarded to AMS, AMS deploys a default baseline of CloudWatch metrics and alarms that have been optimized to reduce noise and identify indications of potential failures, performance degradation, and security issues. AMS calibrates its baseline monitoring on a periodic basis. Furthermore, AMS monitors AWS resources such as Amazon RDS, ALB, Amazon Redshift, Amazon EC2, and many others, and deploys the associated monitoring baseline for each. For Amazon EC2 instances, AMS deploys the CloudWatch agent, which enables detailed metrics and OS logs for CloudWatch to collect. The agent also sends system-level logs to CloudWatch. These aspects help to achieve a comprehensive observability strategy.
Automated resource onboarding for health checks – AMS performs automated on-instance configuration, leveraging Resource Tagger. This action ensures that an instance emits the correct logs and metrics for AMS to manage the EC2 instance properly. As part of automated instance configuration, AMS adds the AWS Identity and Access Management (IAM) managed policies required to grant the instance permission to use the agents installed by AMS. For newly provisioned and existing Amazon EC2 instances, AMS also ensures that the SSM agent is running, which lets you run remote commands on the instance. To ensure that the required metrics and logs are emitted, AMS customizes the CloudWatch configuration for all resources supported by AMS.
Flexibility to customize the predefined baselines – AMS Alarm Manager lets you create specific alarm thresholds and apply alarms based on tags. You can customize the configuration of your AWS resources based on their type, platform, and other tags. AMS deploys Alarm Manager in your account during onboarding. For example, you can set a memory threshold of 90% on a fleet of Amazon EC2 instances but a memory threshold of 70% on other resources. Alarm Manager will automatically apply alarms when you provision a new resource and delete alarms when you delete an existing resource. This approach ensures that the instance is observable, and AMS can monitor your workload health.
Security Incident Response – AMS continuously monitors your managed accounts by leveraging native AWS services, such as Amazon GuardDuty and, optionally, Amazon Macie. GuardDuty is a continuous security monitoring service that uses threat intelligence feeds, such as lists of malicious IP addresses and domains, and machine learning (ML) to identify unexpected and potentially unauthorized and malicious activity within your AWS environment. Macie can be enabled in your AMS account to detect a large and comprehensive list of sensitive data, such as Personal Health Information (PHI), Personally Identifiable Information (PII), and financial data. AMS notifies you of any security findings and works with you to remediate as needed. Our response processes are based on the National Institute of Standards and Technology (NIST) Cloud Security Framework. AMS regularly tests its response processes using Security Incident Response Simulation with you to align your workflow with existing customer security response programs. When AMS detects any violation, or an imminent threat, of AWS or your security policies, we gather information, including impacted resources and any configuration-related changes.
24/7/365 Incident Management support – AMS provides 24/7/365 follow-the-sun support with dedicated, skilled AWS operators. In cases where critical severity incidents are impacting your critical workloads, AMS may recommend an infrastructure restore. See the list of alerts that AMS automatically remediates to ensure high availability of your workloads. AMS also correlates health indicators from various AWS services, such as CloudWatch, AWS Config, and AWS Personal Health Dashboard, to provide you with an aggregated view. AMS extends support for US Soil, UK Soil, and GovCloud. The following health check failure scenario illustrates how AMS engages proactive monitoring and responds to restore the availability of your workloads under AMS management.

Under the CloudWatch metrics dashboard, CloudWatch generates an alert because a Status Check failed for more than five minutes, three times, on an Amazon EC2 instance. Since the configured threshold is breached, the event triggers an alarm. AMS operators immediately categorize this event as an “incident” because this behavior may impact the availability of the workload hosted on the instance. This alarm indicates that the EC2 instance is running on degraded hardware or has entered a fault state. For these failures, AMS sends an auto-alert notification to you, and in the backend, we begin triage. AMS first validates instance accessibility. If accessibility is confirmed to be impacted, AMS stops the instance and starts it again so that it migrates to new underlying hardware.

Figure 1 – CloudWatch Metrics for an instance showing a Status Check failure.

In the CloudWatch dashboard, under details, the instance has now returned to a healthy state. Once AMS automation successfully implements the mitigation, we inform you through a follow-up alert notification. If AMS is unable to mitigate, our report recommends the next steps you must take to bring the instance back to an operational state. Your team can always respond to the alert notification and collaborate with AMS operations, which is available 24/7/365, to find the optimal solution.

Figure 2 – CloudWatch Status Check failure alarm for the instance is in a healthy state after proactive response and remediation by AMS.
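The alarm condition in this scenario can be approximated with a short Python sketch. It assumes one interpretation of the description above: the alarm fires when the EC2 `StatusCheckFailed` metric is above zero for three consecutive five-minute periods. The `alarm_state` function is a hypothetical illustration of that evaluation logic, not the AMS or CloudWatch implementation.

```python
def alarm_state(datapoints, threshold=0.0, evaluation_periods=3):
    """Evaluate an alarm over the most recent periods.

    Each datapoint is assumed to be the maximum of StatusCheckFailed over
    one five-minute period (0 = check passed, 1 = check failed).
    """
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        # Not enough history to decide either way.
        return "INSUFFICIENT_DATA"
    # Alarm only if every one of the most recent periods breached the threshold.
    return "ALARM" if all(value > threshold for value in recent) else "OK"


# Two healthy periods followed by three failed ones: the alarm fires.
print(alarm_state([0, 0, 1, 1, 1]))
# After a stop/start remediation the checks pass again: back to OK.
print(alarm_state([1, 1, 1, 0, 0, 0]))
```

Requiring several consecutive breaches, rather than a single failed datapoint, is one common way to reduce noise from transient blips.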

Refine Operational Procedures – AMS works with your teams to plan for operations and security responses using game days. This procedure ensures that your teams understand and practice how to respond to operational and security events jointly and avoids any workload availability impact in actual events.
Health checks for EKS-based workloads – For modernized workloads, AMS provides a set of EKS cluster health checks that leverage automation to detect, alert, and remediate issues that can impact clusters and workloads. AMS complements customer monitoring with foundational coverage. Furthermore, AMS manages the alert workflow from event to triage to notification. It acts to remediate as appropriate and escalates as needed.
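The tag-based threshold customization described for Alarm Manager above can be pictured with a small Python sketch. It mirrors the example of a 90% memory threshold on one fleet and 70% elsewhere. The rule format, the `fleet` tag key, and the `memory_threshold` function are hypothetical and do not represent the actual Alarm Manager configuration schema.

```python
# Hypothetical rules: each pairs a set of required tags with a memory
# threshold in percent. Rules are checked in order; the first match wins.
RULES = [
    ({"fleet": "batch"}, 90.0),
    ({}, 70.0),  # default: an empty tag filter matches every resource
]


def memory_threshold(resource_tags, rules=RULES):
    """Return the memory alarm threshold for a resource based on its tags."""
    for required_tags, threshold in rules:
        if all(resource_tags.get(key) == value
               for key, value in required_tags.items()):
            return threshold
    raise LookupError("no rule matched the resource tags")


# An instance tagged into the batch fleet gets the higher threshold.
print(memory_threshold({"fleet": "batch", "env": "prod"}))
# Any other resource falls through to the default.
print(memory_threshold({"env": "dev"}))
```

Selecting thresholds by tag rather than by resource ID means that newly provisioned resources pick up the right alarm as soon as they are tagged.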

Conclusion

Operational excellence in the cloud is a journey rather than a destination. Getting started can be overwhelming for your cloud operations team if they’re constrained by staff and skill gaps, or handling competing business priorities for cloud transformation. Lack of consistent monitoring and observability of your AWS account(s) and workloads can add risk to your business and impact your workloads’ availability and resiliency. It also impacts your ability to respond to and remediate failures in case of outages. If you need help, consider an AMS Operations Plan to bridge operational gaps and use automation to accelerate your path to production-ready applications in the cloud. Click here for a demo of how AMS supports your Cloud Operations needs in AWS. Explore the AMS Service Description Document to learn how to get started with AMS.