Reinventing automated operations (Part II)
The first post in this series, Reinventing automated operations (Part I), covered the importance of operations in the cloud and how deferring the creation of an operations plan can slow down your migration. In this post, I’ll share the primary mechanism of iterative improvement (aka flywheel) that AWS Managed Services (AMS) uses to increase operational efficiency and improve the operational experience of our customers. I will also cover the nine individual automation flywheels that we built to spin our primary flywheel faster.
Similar to the original Amazon flywheel, AMS also uses flywheels to iterate and improve our capabilities for customers. As we identify manual work, we create automations to reduce wait time, improve security, execute consistently, and more. Automation improves our operational efficiency by reducing our workload. The use of automation also means we need fewer engineers. Fewer engineers means lower costs and allows us to lower prices, which improves the operational experience for our customers without reducing quality (see figure).
Our automation flywheels convert our operational experience into libraries of operational content that are the artifacts we create within, run on top of, or use to extend standard AWS services. This operational content is our greatest value to customers since there is no compression algorithm for experience. Additionally, because there will never be enough qualified cloud engineers, operational content allows us to scale headcount in a nonlinear fashion and reach a global scale. Without our focus on automation, AMS would likely have four to five times the number of operational engineers we have today.
To date, AMS has identified nine automation flywheels to continually create and refine our operational content:
1. Security detection and remediation
Security is job zero. We use AWS Security Hub and Amazon GuardDuty for security-specific alerts, Amazon Inspector for host-based detection, and Amazon Macie for Amazon Simple Storage Service (Amazon S3) bucket monitoring. Automation and deduplication reduce the volume of alerts so that a real security incident is less likely to be overlooked. We have consolidated 54 Amazon GuardDuty alerts into four remediation run books. We use automation for known configuration issues, and our intuition and experience when we review IAM and security group configurations. We use aggregated logging to prevent tampering and to give customer security teams visibility for their own analysis.
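The consolidation described above can be sketched as a routing table that collapses many finding types into a handful of remediation run books, deduplicating repeats along the way. This is a conceptual sketch, not AMS code: the finding-type prefixes are real GuardDuty categories, but the run book names and routing logic are illustrative assumptions.

```python
# Hypothetical sketch: route GuardDuty finding types to a small set of
# remediation run books, deduplicating repeated findings on the same resource.
# Run book names and the routing table are assumptions for illustration.

RUNBOOK_FOR_PREFIX = {
    "UnauthorizedAccess": "runbook-credential-compromise",
    "Recon":              "runbook-network-probe",
    "CryptoCurrency":     "runbook-instance-compromise",
    "Policy":             "runbook-config-hardening",
}

def route_findings(findings):
    """Collapse raw findings into unique (run book, resource) work items."""
    work_items = set()
    for finding in findings:
        prefix = finding["type"].split(":")[0]
        runbook = RUNBOOK_FOR_PREFIX.get(prefix, "runbook-manual-triage")
        work_items.add((runbook, finding["resource"]))
    return sorted(work_items)

findings = [
    {"type": "Recon:EC2/PortProbeUnprotectedPort", "resource": "i-0abc"},
    {"type": "Recon:EC2/PortProbeUnprotectedPort", "resource": "i-0abc"},  # duplicate
    {"type": "CryptoCurrency:EC2/BitcoinTool.B",   "resource": "i-0def"},
]
print(route_findings(findings))
```

Three raw findings collapse into two work items, which is the effect that keeps real incidents from being buried.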
2. Detective guardrails and remediation
In addition to the standard detective controls provided by AWS Control Tower, we use AWS Config rules to create and deploy customized conformance packs. To enforce Center for Internet Security (CIS) and National Institute of Standards and Technology (NIST) controls, AMS deploys and maintains 98 configuration rules in custom conformance packs. Simply turning on alarms would overwhelm our operations engineers (and potentially obscure real incidents), so we have created automations to remediate each control. We deduplicate any existing customer controls to reduce cost and noise and can tailor responses for each customer.
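The deduplication step can be sketched as a simple set difference between the AMS conformance-pack rules and the controls a customer already deploys, so overlapping rules are not evaluated (and billed) twice. The rule identifiers below are illustrative, not real AWS Config rule names.

```python
# Illustrative sketch: deduplicate conformance-pack rules against controls
# the customer already runs. Rule identifiers are assumptions for illustration.

def rules_to_deploy(ams_rules, customer_rules):
    """Return AMS rules not already covered by the customer's own rules."""
    return sorted(set(ams_rules) - set(customer_rules))

ams_rules = [
    "cis-iam-password-policy",
    "nist-s3-bucket-public-read",
    "cis-cloudtrail-enabled",
]
customer_rules = ["cis-cloudtrail-enabled"]

print(rules_to_deploy(ams_rules, customer_rules))
# -> ['cis-iam-password-policy', 'nist-s3-bucket-public-read']
```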
3. Monitoring alerts and events
Our primary monitoring tools are Amazon CloudWatch (for example, alarms, events, logs) and AWS CloudTrail. Individual alarm metrics are seldom useful on their own; they can generate a lot of noise, which can increase costs. Alert management is both our biggest operational expense and our biggest opportunity for innovation. In addition to tuning, correlation, and deflection, we use AWS Security Hub and the AWS Personal Health Dashboard to improve the accuracy of findings. In 2019, we reduced alarm noise for high-severity alerts by 64 percent; in 2020, we reduced it by a further 41 percent. We also use automation to remediate known alerts and to pull diagnostic information ahead of investigation. Our focus on this flywheel removes low-value work from our engineers, reduces labor costs, and improves our ability to detect a true incident.
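One correlation step can be sketched in a few lines: group alarms that fire on the same resource within a short window, so an engineer sees one incident instead of several pages. This is a hedged sketch, not the AMS pipeline; the window size and alarm record shape are assumptions.

```python
# Conceptual sketch of alarm correlation: alarms on the same resource within
# a short window are merged into one incident. Window size is an assumption.

from collections import defaultdict

CORRELATION_WINDOW_SECONDS = 300

def correlate(alarms):
    """alarms: list of (epoch_seconds, resource_id, alarm_name) tuples."""
    by_resource = defaultdict(list)
    for ts, resource, name in sorted(alarms):
        by_resource[resource].append((ts, name))

    incidents = []
    for resource, events in by_resource.items():
        current = [events[0]]
        for ts, name in events[1:]:
            if ts - current[-1][0] <= CORRELATION_WINDOW_SECONDS:
                current.append((ts, name))          # same incident
            else:
                incidents.append((resource, [n for _, n in current]))
                current = [(ts, name)]              # start a new incident
        incidents.append((resource, [n for _, n in current]))
    return incidents

alarms = [
    (100, "i-0abc", "CPUUtilization"),
    (160, "i-0abc", "StatusCheckFailed"),
    (5000, "i-0abc", "CPUUtilization"),
]
print(correlate(alarms))
```

Here three alarms become two incidents: the first two fire 60 seconds apart and merge; the third is far outside the window.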
4. Deployment automation
This flywheel is critical for dependency and fleet management capabilities, and it allows our teams to focus on operational content. To make it easier for customers to resume operations in the future, we use native AWS services wherever possible to create canaries, mechanisms, and automations that evaluate, maintain, and remediate drift. For example, our scripts detect whether the AWS Systems Manager agent and Amazon CloudWatch agent are out of date, failed to start, or are not installed, and then correct the issue. Many customers are already using some form of deployment automation for their workloads, so we work with them to avoid conflicts, duplication, and false alarms caused by resetting each other's configurations.
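The agent drift check described above amounts to a small decision function per instance. The sketch below is illustrative only: the inventory record shape, field names, and desired version are assumptions, not the AMS implementation.

```python
# Minimal sketch of agent drift remediation: decide, per agent record,
# whether to install, restart, or update. Field names and the desired
# version are hypothetical.

DESIRED_VERSION = "3.2.0"

def remediation_action(agent):
    """Decide how to bring one agent back into compliance."""
    if not agent.get("installed"):
        return "install"
    if agent.get("status") != "running":
        return "restart"
    if agent.get("version") != DESIRED_VERSION:
        return "update"
    return "none"

fleet = [
    {"instance": "i-0a", "installed": True, "status": "running", "version": "3.2.0"},
    {"instance": "i-0b", "installed": True, "status": "stopped", "version": "3.2.0"},
    {"instance": "i-0c", "installed": False},
]
print([(a["instance"], remediation_action(a)) for a in fleet])
```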
5. Patch management
Most of our managed customer environments are 90-100 percent compliant with the latest vendor patches, and critical security patches are typically applied by the next business day. Amazon Machine Images (AMIs) are updated constantly and distributed for each customer, operating system, and account. Up-to-date AMIs help keep new launches and application releases current, but most enterprise workloads are not immutable; they are long running and patched in place. We use AWS Systems Manager Patch Manager to create maintenance windows, define patch baselines (from a few to dozens), and execute the patches. We use automations along with AWS Resource Groups and tags to target different applications and environments (dev, stage, prod) and to handle workload exceptions. Our engineers can now spend their time investigating failed executions, escalating noncompliant workloads, expediting critical security patches, and creating complex patch automations (for example, instances in Auto Scaling groups) and orchestrations (resources shared across multiple applications).
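Tag-based targeting of maintenance windows can be sketched as a filter over instance tags, including an opt-out for workload exceptions. The tag keys below (`Environment`, `PatchExempt`) are illustrative assumptions, not a documented AMS convention.

```python
# Illustrative sketch: select patch targets for a maintenance window by tag,
# honoring an exception tag. Tag keys are assumptions for illustration.

def patch_targets(instances, environment):
    """Select instances tagged for this environment, honoring opt-outs."""
    return [
        i["id"] for i in instances
        if i["tags"].get("Environment") == environment
        and i["tags"].get("PatchExempt") != "true"
    ]

instances = [
    {"id": "i-0a", "tags": {"Environment": "prod"}},
    {"id": "i-0b", "tags": {"Environment": "prod", "PatchExempt": "true"}},
    {"id": "i-0c", "tags": {"Environment": "dev"}},
]
print(patch_targets(instances, "prod"))
```

In practice the same tags would back an AWS Resource Groups query, so the grouping stays consistent across patching, monitoring, and cost reporting.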
6. Backup and disaster recovery
Workloads have varying backup requirements and intervals. AMS uses AWS Backup to create recovery points. Although Amazon Elastic Block Store (Amazon EBS) snapshots work for most applications, some workloads, such as database-backed applications running on Amazon Elastic Compute Cloud (Amazon EC2), require coordination. Recovering from a failure varies widely depending on the root cause and on whether the application follows the pillars of AWS Well-Architected. Most data center workloads won't be refactored before a migration (although we recommend it). This flywheel has helped us develop run books and applications for disaster recovery, and we routinely perform failover tests with our customers so that we are ready. Knowledge of the application is also critical: some individual resource failures might be a non-issue for an application, while others can cause an outage. AMS is expanding the use of AWS Service Catalog, AWS Service Catalog AppRegistry, AWS Systems Manager AppConfig, AWS Resource Groups, and even tags to increase awareness and expedite investigation and recovery.
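One small, generic step in any restore run book is choosing which recovery point to restore from: usually the most recent one taken before the incident began. The sketch below illustrates just that selection logic; the record shape and timestamps are assumptions.

```python
# Conceptual sketch: pick the latest recovery point created strictly before
# the incident started. Record fields and timestamps are illustrative.

def best_recovery_point(recovery_points, incident_start):
    """Latest recovery point taken before the incident began, or None."""
    candidates = [rp for rp in recovery_points if rp["created"] < incident_start]
    if not candidates:
        return None
    return max(candidates, key=lambda rp: rp["created"])

points = [
    {"id": "rp-1", "created": 1000},
    {"id": "rp-2", "created": 2000},
    {"id": "rp-3", "created": 3000},  # taken after the incident began
]
print(best_recovery_point(points, incident_start=2500))
```

For database-backed applications the choice is more involved (application-consistent snapshots, replication lag), which is exactly why those workloads need the coordination mentioned above.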
7. AWS service depth
As a baseline, AMS Accelerate provides monitoring and incident management (infrastructure and security) for every AWS service. AMS then goes further to develop a deeper operational understanding of the typical use cases, issues, and top failure and recovery scenarios for each service. We use this knowledge to create service-specific automations, configurations, monitoring, and integrations that improve operations. Because many of the next-gen AWS services are closer to applications, we've expanded our definition of operations to include service configuration and performance tuning. Customer solutions built from more than one AWS service also fall into this category (for example, data lakes, AI/ML, security services).
8. Operational tooling
We built AWS Systems Manager OpsCenter in partnership with the AWS Systems Manager team to create an extensible flywheel for operational tasks (OpsItems). OpsCenter is a major enabler of our scale, and we continue to invest in and contribute to this AWS Systems Manager capability. For AMS, OpsCenter serves two purposes:
- It brings together troubleshooting information from across multiple AWS services and sources. This aggregation speeds up investigation and reduces handling times.
- After we know the root cause, we can create an automation to remediate future occurrences. Inputs can come from Amazon CloudWatch, Amazon CloudWatch Application Insights, Amazon EventBridge, the AWS Command Line Interface (AWS CLI), or an application programming interface (API), or they can be entered manually. By using OpsCenter, we have automated the remediation of our top alarms and events. These remediations require no human intervention and can prevent an outage before it happens (for example, a boot volume that is 90% full).
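The "boot volume 90% full" example can be sketched as a threshold check that emits a remediation action instead of paging a human. This is a conceptual sketch only; the threshold and action names are assumptions, and a real implementation would start a Systems Manager automation run book.

```python
# Conceptual sketch of automated remediation for a disk-usage alarm:
# above the threshold, emit an action rather than an engineer page.
# Threshold and action names are assumptions for illustration.

DISK_USED_THRESHOLD_PERCENT = 90

def handle_disk_alarm(instance_id, used_percent):
    """Return the automated action for a boot-volume usage alarm."""
    if used_percent >= DISK_USED_THRESHOLD_PERCENT:
        # In practice this would trigger a run book that expands the
        # volume and grows the filesystem before the disk fills.
        return {"instance": instance_id, "action": "expand-boot-volume"}
    return {"instance": instance_id, "action": "none"}

print(handle_disk_alarm("i-0abc", 93))
```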
9. Financial and capacity management
We use AWS Cost Explorer, AWS Trusted Advisor, and tools like usage reports to help customers understand, predict, and reduce costs. Finding savings opportunities is only the first step; you must take action to realize those savings. AMS created a resource scheduler to shut down customer development environments during off hours, which saved customers 76 percent on those environments. There are cost-saving opportunities right after a migration, but we see a continuous need for oversight as new workloads are launched and usage patterns change. In some months, the cost savings we identify exceed our fee.
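The scheduler savings follow directly from the ratio of on-hours to a 24x7 week. The back-of-envelope sketch below uses an illustrative hourly rate and schedule; the exact percentage depends on the schedule chosen, which is why the figure cited above (76 percent) differs slightly from this example.

```python
# Back-of-envelope sketch of off-hours scheduling savings. The hourly rate
# and business-hours schedule are illustrative assumptions.

def monthly_cost(hourly_rate, hours_on_per_week):
    return hourly_rate * hours_on_per_week * 4.33  # average weeks per month

always_on = monthly_cost(0.10, hours_on_per_week=168)  # 24x7
scheduled = monthly_cost(0.10, hours_on_per_week=50)   # 10 hours x 5 days

savings = 1 - scheduled / always_on
print(f"savings: {savings:.0%}")
```

Because on-demand instances bill only while running, savings scale linearly with the hours removed from the schedule.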
Flywheels are an important part of how we stay agile and focused on our goal of automated operations and a better operational experience for our customers. I hope this two-part blog series has helped inform your own automated cloud operations strategy. It's critical to consider operations before, during, and after a migration. If you need help, I encourage you to consider using an AWS Managed Service Provider partner or AMS Accelerate to simplify and de-risk your journey to the cloud.