Build a Cloud Automation Practice for Operational Excellence: Best Practices from AWS Managed Services
In today’s fast-paced business environment, organizations are actively pursuing operational excellence to maintain a competitive edge. Automation is a critical foundation for achieving better efficiency, reliability, and scalability in operations. However, integrating automation into a cloud practice entails more than simply implementing software or tools. Building a cloud automation practice requires a transformative journey that encompasses capabilities and practices across the organization’s People, Process, and Technology aspects. AWS Managed Services (AMS) provides operational excellence to AWS customers by managing their cloud infrastructure and security operations, with automation as a key guiding principle. Today, 95% of the actions performed by AMS operations are automated across all its functions. In this blog, we share guidelines and examples from the AMS experience in building and running a cloud automation practice, with the aim of helping you prepare your organization’s transformative journey toward operational excellence under a centralized operating model.
Integrating automation into cloud operational practice requires a cultural shift to adopt an “automation first” approach, with cross-collaboration between different organizational functions as an essential element for success. AMS, for example, has the following tenets to support and govern its cloud automation practice.
- Automate repetitive tasks: If a task is performed repetitively, assess the data points, prioritize it, and automate it.
- Return On Investment (ROI): Focus on tasks that have the most significant impact on customer outcomes and team effectiveness.
- Avoid reinventing the wheel: Leverage existing code/automation to deliver bigger outcomes, and always have a fallback strategy for scenarios where automation isn’t feasible or fails.
- Measure from Day 1: Focus on supporting operations data, areas to reduce risk, and continuously measure the outputs and the overall business outcomes to improve the defined processes over time.
To facilitate the above factors, organizations need to explore and leverage appropriate technology, ranging from DevOps tooling to operations management capabilities and analytics. Let’s explore scenarios where continuous improvement of automation practices helped AMS solve operational challenges and gain efficiency.
Cloud Automation Practice at AMS
AMS was designed to scale by leveraging AWS’ expertise, and by putting technology into action. Following the AWS Well-Architected Framework’s best practices, AMS built secure, high-performing, resilient, and efficient automations across different functions like incident response, proactive monitoring and remediation, and patching. For example, today AMS performs 1.35M AWS Systems Manager runbook activities per month, and up to 97% are automatically triggered in response to an event. However, changing customer needs require AMS to continuously develop new automations and iterate on existing ones. Handling customer requests involves a thorough review, adherence to security practices, and assessment of the risk profile before executing them. It was crucial for AMS to develop and follow an automation practice that supports scalability and agility while maintaining reliability. Let’s dive deeper into the key components of People, Process, and Technology that enable operational efficiency and scale.
From a people perspective, there are three fundamental roles in building a centralized cloud automation practice: Operations, Engineering, and Governance. These teams work in concert to meet automation outcomes:
- Platform Operations manages and performs operational activities on behalf of customers, with a primary goal to resolve customer issues and restore operations as quickly as possible to minimize downtime. The function identifies automation gaps from operational or customer needs, measures performance metrics (such as Time to Resolve and Service Level Objectives), and continually provides feedback to improve automations.
- Platform Engineering captures feedback and uses DevOps-based software development and automation standards to create, improve, and maintain the automation tooling used by Operations. AMS adopts an API-first culture to improve tooling adaptability and builds on native AWS services to allow for scale. Platform Engineering also works closely with AWS Service teams to drive standardization for different use cases, enabling customers to take advantage of AMS operational benefits across services.
- The Governance team continually prioritizes and manages roadmaps, tools, and programs, guided by shared tenets. The team consists of Technical Program and Engineering Management functions, focusing on the bigger picture to develop a strategy with multiple teams and ensure efforts are well justified by their ROI.
Each function works in alignment with a key focus on reducing manual effort through automation, continuously improving efficiency in a virtuous cycle (refer to Figure 2).
The Governance team establishes an intake process for requestors from the Operations team that requires relevant data and justification for the proposed automation. This data includes the number of past manual requests, time spent on such requests, and any identified security risks. Requests that do not provide adequate justification are put on hold until there is additional data to support automation. To prioritize tasks, the Governance team uses a prioritization matrix for proposed automations, providing a clear view of prioritized items, enabling better decision-making for requestors and business stakeholders.
The template below facilitates the intake process, and includes criteria such as urgency of the automation (U), impact on operations or business (I), and effort required for development (E). Scores are assigned and increased based on defined criteria, and the final automation score is calculated by adding the urgency and impact scores. In cases where multiple automations have the same final scoring, the decision-making process considers the required development effort (E) to make the final determination.
As an example, among different categories of incoming manual requests, AMS used this approach to prioritize automation for Identity and Access Management (IAM) resource management. Because IAM is an essential service that improves the overall security posture of the account, its Urgency (U) and Impact (I) scores in the above matrix place it at the top of the prioritization list.
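The scoring logic described above can be sketched in Python. The score ranges, thresholds, and backlog entries here are illustrative assumptions, not the actual AMS matrix:

```python
from dataclasses import dataclass

@dataclass
class AutomationRequest:
    name: str
    urgency: int   # U: assumed scale of 1 (low) to 5 (critical)
    impact: int    # I: assumed scale of 1 (low) to 5 (critical)
    effort: int    # E: assumed development effort in person-weeks

    @property
    def score(self) -> int:
        # Final automation score is the sum of urgency and impact.
        return self.urgency + self.impact

def prioritize(requests):
    # Highest U+I first; ties broken by lower development effort (E).
    return sorted(requests, key=lambda r: (-r.score, r.effort))

backlog = [
    AutomationRequest("EC2 tag cleanup", urgency=2, impact=3, effort=2),
    AutomationRequest("IAM resource management", urgency=5, impact=5, effort=4),
    AutomationRequest("Log rotation", urgency=3, impact=2, effort=1),
]
for req in prioritize(backlog):
    print(f"{req.name}: score={req.score}, effort={req.effort}")
```

With these sample scores, the high-urgency, high-impact IAM item sorts to the top regardless of its larger development effort, since effort only breaks ties.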
The development lifecycle is a critical process to enable automation. In AMS, the organizational components depicted in the diagram below play an essential role in the development, deployment, and improvement of an automation:
The Platform Engineering team initiates automation development, while consistently monitoring the process for relevance, safety, scalability, and efficiency. Representatives from the Operations, Engineering, and Governance teams assess automation backlog-to-developer ratio, ensuring timely outcomes and identifying challenges early in the development cycle. Following the DevOps model to scale development process, the Engineering and Operations team collaborate, forming a CI/CD-based contribution model involving operations team members in development. Having operations as part of automation development not only provides valuable insights into the success and failure rates, but also fosters knowledge sharing, team growth, and expedited delivery.
In applying the AWS Well-Architected best practices of continuously improving code quality (OPS05-BP07) and ensuring security awareness across the board (SEC11-BP01), the AMS Engineering team performs security reviews throughout the development lifecycle to ensure that all automation is cloud ready before it is released.
To extend the IAM example, we’ll next dive into this use case using the AMS automation practice approach.
- AMS service users created thousands of IAM requests for Operations, with each request taking approximately an hour from security review through implementation.
- AMS Operations had three major manual tasks: authoring IAM requests (roles, policies, and permission modifications); validating the IAM requests (adherence to best practices, security guidelines, and least-privilege access); and deploying the IAM resources into the respective accounts.
- Using operational data from Operations, the AMS Governance team validated the automation requests against the pre-defined prioritization matrix, analyzed the reported data, and engaged the Engineering team to work on the solution.
- AMS Engineering developed an automation comprising a centralized IAM repository for easy lookup and authoring, automated IAM request validations using pre-defined checks, and an automated mechanism to deploy the IAM resources.
- An existing Command Line Interface (CLI) was updated with additional options for bulk deployment of IAM resources across AWS accounts, which further reduced operational efforts.
The overall automation practice outlined above not only assisted the Operations team in handling a high volume of manual requests at scale, but also saved an estimated 6.7K Operations hours (equivalent to a 34% efficiency improvement). As part of continuous improvement, the automation went through multiple iterations of optimization by Engineering based upon the Operations team’s usage and feedback (refer to Figure 2: virtuous cycle to complete the feedback loop).
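To make the validation step concrete, here is a minimal sketch of one kind of pre-defined check the case study describes: flagging wildcard actions and resources in a proposed IAM policy, a common least-privilege screen. The function name and rules are illustrative assumptions, not the actual AMS checks:

```python
import json

def find_wildcard_violations(policy_json: str):
    """Flag Allow statements with '*' actions or resources (illustrative least-privilege check)."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):   # a single statement may be given as an object
        statements = [statements]
    violations = []
    for i, stmt in enumerate(statements):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        if any(a == "*" for a in actions):
            violations.append(f"Statement {i}: wildcard Action")
        if any(r == "*" for r in resources):
            violations.append(f"Statement {i}: wildcard Resource")
    return violations

policy = json.dumps({
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "*", "Resource": "*"}],
})
print(find_wildcard_violations(policy))
# ['Statement 0: wildcard Action', 'Statement 0: wildcard Resource']
```

In a real pipeline a battery of such checks would run automatically on every authored request, so Operations only reviews requests that fail a check.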
Technology plays a vital role in supporting and advancing the automation practice within the organization. AMS has divided the cloud automation practice into three operational areas, each leveraging technology in multiple ways to achieve the automation goals.
Centralized Operations Management
AMS utilizes a combination of AWS monitoring and orchestration services, such as Amazon CloudWatch, Amazon Simple Notification Service, AWS Lambda, and AWS Systems Manager OpsCenter, to execute repetitive tasks in cloud operations at scale. These services allow event orchestration and automation workflows, enabling effective work item management, prioritization, and assignment. Using purpose-built AWS services frees up the Platform Operations and Engineering teams to focus on business outcomes rather than writing custom code for each event.
For example, workloads are monitored with Amazon CloudWatch to detect anomalies, while Amazon Simple Notification Service triggers notifications for the action by the relevant stakeholders managing environments. The AWS Systems Manager OpsCenter dashboard helps prioritize notifications using a pre-defined matrix based on the impact of an anomaly (e.g., security event, performance event) and leverages services like AWS Lambda and AWS Systems Manager to automate actions. In addition, the Engineering team develops automation to trigger actions across all managed AWS accounts. For example, the Engineering team built a CLI script integrated with AWS Systems Manager runbooks to assist Operations with parallel bulk executions across multiple resources and accounts within AMS-managed accounts, saving hundreds of hours of work to manually log in and run each automation.
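As a sketch of the fan-out pattern described above, the snippet below builds the parameters for a multi-account AWS Systems Manager automation execution; the role name and account IDs are placeholders, and the actual AMS CLI script is not public:

```python
def build_bulk_runbook_request(document_name, accounts, regions,
                               role_name, max_concurrency="10"):
    """Build parameters for a Systems Manager automation execution
    fanned out across multiple accounts and Regions via TargetLocations."""
    return {
        "DocumentName": document_name,
        "TargetLocations": [{
            "Accounts": accounts,
            "Regions": regions,
            "ExecutionRoleName": role_name,          # assumed role in each target account
            "TargetLocationMaxConcurrency": max_concurrency,
        }],
    }

request = build_bulk_runbook_request(
    "AWS-RestartEC2Instance",                        # an AWS-managed runbook
    accounts=["111111111111", "222222222222"],       # placeholder account IDs
    regions=["us-east-1", "eu-west-1"],
    role_name="AutomationExecutionRole",
)
# In a real environment, this single call starts the parallel bulk execution:
# import boto3
# execution_id = boto3.client("ssm").start_automation_execution(**request)["AutomationExecutionId"]
print(request["DocumentName"])
```

One request replaces logging in to each account and Region to start the runbook by hand, which is where the hundreds of hours of savings come from.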
The Engineering team is responsible for developing tools that are crucial for building cloud automation. Various AWS services like AWS Lambda, AWS Systems Manager, and AWS CloudFormation help to automate the build and management of the AMS infrastructure. The Engineering team regularly rolls out new automations using canary deployments with a built-in circuit breaker mechanism, and the ability to rollback changes if needed.
Services like AWS CodeCommit provide segregated repositories to manage and share code for different features, while AWS CodePipeline automates security code scanning and testing and deploys features to each required environment via code. Quality is key to a successful automation practice: the bar for each feature is continually raised through monitoring before, during, and after deployment, enabling regular iteration and improvement of scaled automation. This also makes it possible to measure the success rate of each deployed feature. The same approach was used to build and improve the IAM automation practice (as mentioned in Figure 5).
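The canary-with-circuit-breaker rollout mentioned above can be sketched as follows. The fraction, failure threshold, and callbacks are illustrative assumptions, not the actual AMS deployment tooling:

```python
def canary_deploy(targets, deploy, rollback, healthy,
                  canary_fraction=0.1, max_failure_rate=0.05):
    """Deploy to a small canary slice first; trip the circuit breaker
    and roll back if the canary failure rate exceeds the threshold."""
    canary_n = max(1, int(len(targets) * canary_fraction))
    canary, rest = targets[:canary_n], targets[canary_n:]
    for t in canary:
        deploy(t)
    failure_rate = sum(not healthy(t) for t in canary) / canary_n
    if failure_rate > max_failure_rate:
        # Circuit breaker tripped: halt the rollout, roll back the canary slice.
        for t in canary:
            rollback(t)
        return "rolled_back"
    for t in rest:
        deploy(t)
    return "deployed"

fleet = [f"account-{i}" for i in range(20)]
deployed = set()
result = canary_deploy(fleet, deploy=deployed.add, rollback=deployed.discard,
                       healthy=lambda t: True)   # all canaries report healthy
print(result, len(deployed))  # deployed 20
```

The design choice is that a bad release only ever touches the canary slice; the rest of the fleet sees the change only after the canary clears the health check.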
Analysis and Reporting
Technical Program and Engineering Managers oversee business outcomes and work closely with business stakeholders to create scorecards that highlight the impact of automation. They utilize analysis and reporting tools to report on operational performance and measure outcomes of automation practice across the organization. AMS leverages AWS services like Amazon Redshift for structuring collected data from multiple sources, Amazon S3 for storage and reporting, and uses Amazon QuickSight dashboards for visualizing the data. Custom built Amazon QuickSight dashboards provide periodic views of automation success/failure rates and the current state of the business. This enables the team and various process stakeholders to plan efforts for improvements and develop new capabilities based on data-driven insights.
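As a sketch, the success/failure-rate view that feeds such a dashboard can be computed from execution records; the record shape here is an illustrative assumption, not the actual AMS schema:

```python
from collections import defaultdict

def success_rates(executions):
    """Aggregate per-automation success rates from execution records.

    Each record is assumed to look like:
    {"automation": "iam-deploy", "status": "Success" or "Failed"}
    """
    counts = defaultdict(lambda: {"total": 0, "success": 0})
    for ex in executions:
        c = counts[ex["automation"]]
        c["total"] += 1
        c["success"] += ex["status"] == "Success"
    return {name: c["success"] / c["total"] for name, c in counts.items()}

records = [
    {"automation": "iam-deploy", "status": "Success"},
    {"automation": "iam-deploy", "status": "Failed"},
    {"automation": "patching", "status": "Success"},
]
print(success_rates(records))  # {'iam-deploy': 0.5, 'patching': 1.0}
```

In practice this aggregation would run in Amazon Redshift over data landed in Amazon S3, with Amazon QuickSight rendering the resulting rates per period.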
In conclusion, adopting an automation-first mindset when establishing an operations practice can result in higher productivity and an improved security posture, and enables your team to scale best practices across many workloads. “Automating repetitive tasks or processes” is an inherent part of achieving Operational Excellence as described in the AWS Well-Architected Framework. Integrating automation into your cloud practice is a transformative journey that involves adopting new capabilities within your team, setting up new or revised practices, and defining processes and standard guidelines to adjust existing methods. In this blog post, you have learned how AWS Managed Services (AMS) implements its automation practice to scale and achieve operational excellence across many AWS customers. To learn more, see the AWS Well-Architected Framework and AWS Managed Services (AMS). For follow-up questions or comments, join our growing community on AWS re:Post.