Maximize cloud investment value through operational excellence using AWS Managed Services

In this blog post, I share my observations as an AMS Solutions Architect on how achieving operational excellence can help organizations realize their cloud business objectives while migrating to AWS. I dive deep into the five design principles that AWS Managed Services (AMS) uses to achieve operational excellence.

Amazon is guided by four principles: customer obsession rather than competitor focus, passion for invention, commitment to operational excellence, and long-term thinking. When you focus on operational excellence, you see the following business benefits:

Improved quality and availability of service to end users.
Reduced time to recover from failure.
Reduced time to build new infrastructure and offer new services.
Reduced business risk and easier compliance with industry standards.

The importance of operational excellence in a cloud migration

The AWS Well-Architected Framework helps customers build secure, high-performing, resilient, and efficient infrastructure for their applications and workloads. When you’re planning a cloud migration, you place a lot of focus on the Security, Reliability, Performance, and Cost Optimization pillars. The Operational Excellence pillar is just as important. It focuses on running and monitoring systems to deliver business value and continuously improving processes and procedures.

You can accelerate your cloud migration by using AWS Control Tower for pre-built governance and best practices, and speed up infrastructure provisioning with AWS Service Catalog and AWS CloudFormation. You can also use services like Amazon GuardDuty and Amazon Macie for security management. The AWS Management and Governance services include AWS Organizations, Amazon CloudWatch, AWS CloudTrail, AWS Config, AWS Systems Manager (SSM), AWS Cost Explorer, and others.

While these services can help achieve a balance of business agility and governance controls, the heavy lifting is in the orchestration of these services to meet the needs of your cloud operating model. When it comes to scaling cloud operations, IT organizations struggle to isolate the root cause of failures between infrastructure and applications. It can be difficult to recognize which infrastructure metrics are appropriate for monitoring and which ones will result in noisy alarms. Another challenge that I’ve observed is proactively assessing the current state of security and compliance hygiene and driving remediation. After they’ve set up the operational configurations, IT organizations can find it burdensome to manage patching, backup, and host security across their growing landscape of AWS accounts.

By focusing on operational excellence, you can tackle challenges that have a direct impact on business objectives, including:

Security and data breaches.
Stalled application migration at scale due to lack of operational readiness.
Absence of end-to-end runbooks for application and infrastructure incidents and failures.
Infrastructure bill shock due to an absence of enterprise-wide cost management policies and procedures.
Heavy lifting to support audit and compliance reports.
Long lead times to complete new resource account setup for application teams.

With AMS, AWS has automated cloud infrastructure operations through the orchestration of the AWS Management and Governance services and security services like Amazon GuardDuty and Amazon Macie.

Implementing operational best practices to protect your cloud investment

When you integrate the design principles for operational excellence outlined by the Well-Architected Framework, you ensure that your cloud platform is production-ready. This, in turn, helps you realize business benefits like cost efficiencies and improved service KPIs (high availability, high reliability, high agility).

Here are the five design principles from the Well-Architected Operational Excellence pillar that AMS uses to help you securely operate in the cloud:

Perform operations as code

Performing OS-level patching and backup manually can be time-consuming, error-prone, and expensive. Cloud operations built on automation can identify failure events and trigger consistent remediation. This helps you proactively monitor for issues and improve overall service SLAs and uptime. It also offloads undifferentiated operational burdens from infrastructure operations teams.

Use AWS Systems Manager to automate patching based on pre-configured patch baselines and patch windows. Use AWS Backup to automate the running of backup plans. AWS recommends that after you configure your automated operational tasks, you validate the implementation by confirming that automated restoration from backup works and that patch failures are reported and remediated. Automated reporting should track patch and backup compliance across all accounts.

AMS uses a combination of scoped-down AWS Identity and Access Management (IAM) roles, AWS Backup, Amazon CloudWatch, and SSM documents to monitor backup and patch compliance and to auto-remediate failures. AMS has helped large financial organizations with strict industry-regulation requirements to improve patch compliance in the cloud. Some enterprises have been able to improve patch compliance by 30% and achieve 100% patch and backup compliance.
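To make the idea of cross-account patch compliance reporting concrete, here is a minimal sketch in Python. It is not how AMS implements its reporting. It assumes you have already collected, per account, records shaped like the `InstancePatchStates` entries that the Systems Manager `DescribeInstancePatchStates` API returns (only the `InstanceId`, `MissingCount`, and `FailedCount` fields are used). The function name, the account IDs, and the rule that an instance is compliant only when both counts are zero are illustrative assumptions.

```python
def summarize_patch_compliance(patch_states_by_account):
    """Summarize patch compliance per account.

    Assumption: an instance is compliant when it has no missing
    and no failed patches.
    """
    report = {}
    for account_id, states in patch_states_by_account.items():
        non_compliant = [
            state["InstanceId"]
            for state in states
            if state.get("MissingCount", 0) > 0 or state.get("FailedCount", 0) > 0
        ]
        total = len(states)
        compliant = total - len(non_compliant)
        report[account_id] = {
            "total": total,
            # None means the account reported no managed instances.
            "compliant_pct": round(100 * compliant / total, 1) if total else None,
            "non_compliant": non_compliant,
        }
    return report


# Hypothetical sample data for two accounts.
sample = {
    "111111111111": [
        {"InstanceId": "i-aaa", "MissingCount": 0, "FailedCount": 0},
        {"InstanceId": "i-bbb", "MissingCount": 2, "FailedCount": 0},
    ],
    "222222222222": [
        {"InstanceId": "i-ccc", "MissingCount": 0, "FailedCount": 0},
    ],
}

for account, summary in summarize_patch_compliance(sample).items():
    print(account, summary)
```

A report like this could be produced on a schedule, with the `non_compliant` list feeding whatever remediation workflow you already use.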

Make frequent, small, reversible changes

If you don’t consider change management when you design service operations, unchecked changes can compromise the security or health of the environment and affect end customers. Plan your change management process and tools in a way that facilitates suitable change governance and security. If changes are traceable and made in increments, they can easily be reversed as needed. AWS recommends that operators have scoped-down IAM roles so that they can change or access only those resources for which they’re authorized. AWS also recommends that you automate changes to infrastructure through AWS CloudFormation or AWS Service Catalog where possible. Enforce guardrails on changes that can potentially compromise the security of the environment, like creating new IAM roles.

AMS has helped healthcare and financial management companies that are seeking prescriptive infrastructure change management. When customers request changes to infrastructure or configurations, AMS uses automation to complete the change and logs the change in AWS CloudTrail so it’s auditable. The AMS security team has also partnered with customers to guide them through high-risk changes in the managed environment.

One AMS customer was able to roll out configuration changes to 3,100 Amazon Simple Storage Service (Amazon S3) buckets in 33 production accounts in less than 8 hours. Although the configuration update to the S3 buckets was small, the AMS change management system enabled us to develop custom, secure, and compliant automation to orchestrate and validate the changes at scale. AMS also included full rollback capabilities and logging, which makes the changes fully auditable.
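The pattern of small, validated, reversible changes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the AMS change management system: `apply_in_batches`, the callbacks, and the batch size are hypothetical, and the choice to roll back only the failing batch (keeping earlier validated batches) is one possible design.

```python
def apply_in_batches(targets, apply_change, rollback_change, validate, batch_size=50):
    """Apply a change in small batches and validate each batch.

    If any step in a batch fails, roll back that batch in reverse order
    and stop. Returns (succeeded, audit_log).
    """
    audit_log = []
    for start in range(0, len(targets), batch_size):
        batch = targets[start:start + batch_size]
        applied_in_batch = []
        try:
            for target in batch:
                apply_change(target)
                applied_in_batch.append(target)
                audit_log.append(("applied", target))
            for target in batch:
                if not validate(target):
                    raise RuntimeError(f"validation failed for {target}")
        except Exception as exc:
            audit_log.append(("failed", str(exc)))
            for target in reversed(applied_in_batch):
                rollback_change(target)
                audit_log.append(("rolled_back", target))
            return False, audit_log
    return True, audit_log


# Hypothetical in-memory stand-in for bucket configuration.
config = {f"bucket-{n}": "old" for n in range(1, 6)}

ok, log = apply_in_batches(
    targets=list(config),
    apply_change=lambda name: config.__setitem__(name, "new"),
    rollback_change=lambda name: config.__setitem__(name, "old"),
    validate=lambda name: name != "bucket-4",  # simulate one failing target
    batch_size=2,
)
print(ok, config)
```

Keeping the batches small limits how much has to be undone when validation fails, and the audit log records every step in order.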

Refine operations procedures frequently

After you set up your cloud operations, test and refine frequently to ensure readiness for failures. This helps validate that all procedures are effective and that your teams are familiar with them. It also helps ensure that operational due diligence is completed for any newly launched product features and there are no gaps in the current operating model. Engage with application owners and security and infrastructure teams on regular tabletop exercises, especially for application reliability and availability.

During onboarding to AMS, customers are guided through operational Game Day exercises to validate operational processes in the cloud with infrastructure and application teams. Any learnings are incorporated into operational runbooks. AMS has also supported customers in the regular implementation of a disaster recovery (DR) exercise by providing standardized and automated tooling and processes. An AMS enterprise customer was able to complete a DR exercise for an application stack in 1.5 hours (compared to the 24 hours they used to spend on-premises).

Anticipate failures

According to Murphy’s law, “Anything that can go wrong will go wrong.” You can reduce the impact of failures to your service if you test incident response procedures with all involved teams before you go into production. Design infrastructure operations with proactive monitoring and automated remediation in mind. AWS recommends that you iterate to identify appropriate metrics and performance indicators that should be monitored to detect failures and isolate failures at the OS and application levels. Running a defined plan will reduce errors and reduce downtime.

Using Amazon CloudWatch and Amazon GuardDuty, AMS deploys a baseline set of curated monitoring alerts that proactively monitor your AWS resources for failures, performance degradation, and security issues. The monitoring baseline is frequently calibrated to ensure the alerts are not too noisy and do not miss any newly known triggers for failure. In case of an incident, AMS operators correlate data from AWS tooling like VPC Flow Logs, AWS CloudTrail logs, Amazon CloudWatch, AWS Systems Manager OpsCenter, and AWS Health to triage for root cause. AMS proactive alerting and automated remediation have led to a 60% reduction in service-impacting incidents. AMS also proactively notifies customers about 70% of all potentially service-impacting, infrastructure-related incidents, which reduces the service impact time by 75%.
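One way to reason about noisy alarms is to replay historical datapoints against different alarm settings. The sketch below is a simplified model of the "M out of N" evaluation that CloudWatch alarms expose through the `EvaluationPeriods` and `DatapointsToAlarm` settings. It ignores missing-data handling, and the function names, threshold, and sample values are assumptions chosen for illustration, not AMS baseline values.

```python
def alarm_states(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Return 'ALARM' or 'OK' for each datapoint.

    Simplified model: the alarm is in ALARM when at least
    `datapoints_to_alarm` of the most recent `evaluation_periods`
    datapoints exceed the threshold.
    """
    states = []
    for i in range(len(datapoints)):
        window = datapoints[max(0, i - evaluation_periods + 1): i + 1]
        breaching = sum(1 for value in window if value > threshold)
        states.append("ALARM" if breaching >= datapoints_to_alarm else "OK")
    return states


def count_alarm_transitions(states):
    """Count OK -> ALARM transitions, that is, how often someone is notified."""
    previous = ["OK"] + states
    return sum(1 for prev, cur in zip(previous, states) if prev == "OK" and cur == "ALARM")


# Hypothetical CPU utilization samples (percent), one per period.
cpu = [40, 95, 42, 45, 96, 41, 92, 94, 97, 50]

noisy = count_alarm_transitions(alarm_states(cpu, 90, 1, 1))
calibrated = count_alarm_transitions(alarm_states(cpu, 90, 3, 3))
print(noisy, calibrated)
```

With these sample values, the single-datapoint alarm fires on every brief spike, while the 3-out-of-3 setting fires only on the sustained breach. The trade-off is slower detection, which is why calibration is an iterative exercise.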

Learn from all operational failures

When appropriate tooling and remediation are not in place, operational errors and failures can occur, such as misconfigurations, important indicators missed because of noisy alarms, and logging inadvertently disabled in accounts. These failures can compromise security monitoring. AWS recommends that you continuously audit for these types of errors and ensure that cloud operation runbooks are updated periodically to include those checks in the default configuration. AMS drives operational improvements for customers through nine automation flywheels. We continuously identify manual work and create automations around those tasks to reduce the risk of human error and lower operational costs.
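A continuous audit for one of the failures mentioned above, logging that was inadvertently disabled, might look like the following sketch. It assumes you have already gathered trail status per account, with each record carrying a `Name` and the `IsLogging` flag that the CloudTrail `GetTrailStatus` API returns. The function name, finding messages, and account IDs are hypothetical.

```python
def find_logging_gaps(trail_status_by_account):
    """Return (account_id, trail_name, issue) tuples for accounts whose
    audit logging is missing or switched off."""
    findings = []
    for account_id, trails in sorted(trail_status_by_account.items()):
        if not trails:
            findings.append((account_id, None, "no trail configured"))
            continue
        for trail in trails:
            if not trail.get("IsLogging", False):
                findings.append((account_id, trail["Name"], "logging disabled"))
    return findings


# Hypothetical status collected from three accounts.
statuses = {
    "111111111111": [{"Name": "org-trail", "IsLogging": True}],
    "222222222222": [{"Name": "org-trail", "IsLogging": False}],
    "333333333333": [],
}

for finding in find_logging_gaps(statuses):
    print(finding)
```

Each finding that such a check surfaces is a candidate for a new runbook step or a preventive guardrail, so the same gap does not recur.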

Conclusion

For organizations that are migrating workloads to the cloud, operational excellence is critical to achieving business objectives. You can use AWS services to implement the operational excellence design principles on your own, but the process can be time-consuming and it can force your IT team to deprioritize other higher-order cloud adoption initiatives. By using AMS operators to handle the undifferentiated operations work, you can free up your IT team’s time to focus on enterprise self-service deployments and business outcomes like accelerated cloud migration, application modernization, and long-term cost optimization.

If you need help, consider an AMS Operations Plan to bridge operational gaps and use automation to accelerate your path to production-ready applications in the cloud.

AWS Cloud Operations & Migrations Blog