AWS Cloud Enterprise Strategy Blog

The Three Phases of AI Ops: Steps to Intelligent Cloud Operations

by Sid Arora, Sr. Product Manager, AWS Managed Services and Vieng Soukhavong, Global Head of Operations for AWS Managed Services
introduction by Mark Schwartz, Enterprise Strategist

Our AWS Managed Services (AMS) group operates customer workloads in the cloud. It has learned a great deal from doing so and has devised innovative solutions for achieving high levels of reliability and security. We have asked them to share some of their learnings with our blog readers. This is the first in a series of posts from them. It presents the concept of AI Ops—the application of machine learning to operations in the cloud.

– Mark


Artificial intelligence (AI) and machine learning (ML) are revolutionizing many industries and functions. IT operations is no exception—leading to the practice of AI Ops, the use of artificial intelligence to simplify IT operations management. This approach reduces manual effort through automation, enabling you to focus on innovation rather than infrastructure and support.

Moving to the cloud helps accelerate results with ready access to ML and AI services, robust managed telemetry and monitoring services that can drive modeling and insights, and other prebuilt capabilities. You can take an incremental approach to AI Ops and begin getting value from it very quickly. Let’s look at the three phases of AI Ops and some tips for starting down the path.

Phase 1: Identify use cases, capture data, and discover insights

Simply introducing AI and ML into your operations won’t magically solve your problems. You need good data and a clear set of goals to be successful. The first phase begins with identifying core use cases where ML will help reduce effort. These are usually the areas that take the most time, have a finite combination of predictable steps to investigate and remediate, or have recognizable patterns.

Next, set up your environment for observability. This means capturing and storing data in ways that make it possible to discover insights. ML and AI are only as good as the data that feeds them; the cloud makes this easy and cost effective. In Amazon Web Services (AWS), for example, you can leverage services such as Amazon CloudWatch, AWS X-Ray, AWS Config, and AWS CloudTrail to collect monitoring and operational data in the form of logs, metrics, and events, providing you with a unified view of your cloud and on-premises resources. By setting up a data delivery pipeline, you can take advantage of collection, storage, and management solutions such as Amazon Redshift (a data warehouse), Amazon Simple Storage Service (Amazon S3), and Amazon OpenSearch Service (successor to Amazon Elasticsearch Service).
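As a sketch of what the collection step can look like in code, the Python below builds a request payload for the CloudWatch GetMetricData API to retrieve EC2 CPU utilization. The instance ID is a placeholder, and the commented-out boto3 call shows where the payload would be sent in a real pipeline.

```python
def build_cpu_query(instance_id: str, period_seconds: int = 300) -> dict:
    """Build a CloudWatch GetMetricData query for EC2 CPU utilization.

    The payload shape follows the CloudWatch GetMetricData API; the
    instance ID and period here are placeholders for illustration.
    """
    return {
        "Id": "cpu",
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/EC2",
                "MetricName": "CPUUtilization",
                "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
            },
            "Period": period_seconds,
            "Stat": "Average",
        },
        "ReturnData": True,
    }

query = build_cpu_query("i-0123456789abcdef0")

# With AWS credentials configured, this payload could be passed to
# boto3.client("cloudwatch").get_metric_data(
#     MetricDataQueries=[query], StartTime=..., EndTime=...)
# and the results written to Amazon S3 for downstream analytics.
```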

Once you can apply advanced analytics and visualization to your data, you can gain deeper insights from it. When disruptions occur, you can go beyond knowing what happened—such as CPU utilization going over a certain threshold—and start to understand why it happened. ML tools are great for discovering hidden patterns and correlations among events. This allows you to be more proactive about identifying conditions that could lead to problems and setting alerts that help you address them quickly. Basic automation can take care of simple runbook steps.
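To make "beyond a fixed threshold" concrete, here is a minimal Python sketch that flags points deviating sharply from a workload's own recent baseline rather than from a static limit. The metric values are invented; a production system would use a richer model, such as CloudWatch anomaly detection.

```python
from statistics import mean, stdev

def anomalies(samples, window=12, z_threshold=3.0):
    """Flag points that deviate sharply from the trailing window's mean.

    Unlike a fixed CPU threshold, this adapts to each workload's own
    baseline, so unusual behavior surfaces even at "normal" levels.
    """
    flagged = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(samples[i] - mu) / sigma > z_threshold:
            flagged.append(i)
    return flagged

# Twelve quiet readings, then a sudden spike at index 12.
cpu = [40, 41, 39, 40, 42, 41, 40, 39, 41, 40, 42, 41, 95]
print(anomalies(cpu))
```

The same idea extends to correlating deviations across multiple metrics, which is where ML models start earning their keep.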

Phase 2: Make actionable predictions

In the next phase, you have the data and analytics in place to start making predictions and automating certain actions. ML can be used to create key performance indicators (KPIs) that offer predictive and prescriptive insights rather than simply backward-looking measurements. These can be provided as suggestions or next-best actions to engineers. Rather than replacing human judgment, you’re augmenting it with relevant information unavailable through manual processes.
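As a toy illustration of a forward-looking KPI (not a specific AWS feature), the Python below fits a linear trend to daily disk-usage readings and projects the days remaining until the volume is full. The numbers are invented, and a real system would use a proper forecasting model.

```python
def days_until_full(daily_usage_pct, capacity_pct=100.0):
    """Fit a least-squares line to daily disk-usage readings and project
    how many days remain until capacity is reached.

    This turns a backward-looking measurement ("usage is 68%") into a
    predictive one ("the disk fills in ~16 days"). Returns None when
    usage is flat or shrinking.
    """
    n = len(daily_usage_pct)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(daily_usage_pct) / n
    slope = (
        sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, daily_usage_pct))
        / sum((x - x_mean) ** 2 for x in xs)
    )
    if slope <= 0:
        return None  # not trending toward full
    current = slope * (n - 1) + (y_mean - slope * x_mean)
    return (capacity_pct - current) / slope

usage = [60, 62, 64, 66, 68]  # growing about 2% per day
print(days_until_full(usage))
```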

ML models can also handle the data overload that is common in today’s increasingly complex cloud ecosystems. In environments with many virtual machines and microservices running at any given time, alerts can snowball into unsustainable volumes. Analytics can help to group them and discard trivial items such as duplicates.
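As a small sketch of that grouping step, the Python below collapses an alert flood into one entry per resource-and-condition fingerprint, keeping a count so nothing is silently lost. The alert shape and names are invented for the example.

```python
from collections import defaultdict

def group_alerts(alerts):
    """Collapse duplicate alerts into one entry per (resource, condition)
    fingerprint, with a count of how many raw alerts each group absorbed."""
    groups = defaultdict(int)
    for alert in alerts:
        fingerprint = (alert["resource"], alert["condition"])
        groups[fingerprint] += 1
    return [
        {"resource": r, "condition": c, "count": n}
        for (r, c), n in groups.items()
    ]

flood = [
    {"resource": "i-0abc", "condition": "HighCPU"},
    {"resource": "i-0abc", "condition": "HighCPU"},
    {"resource": "db-1", "condition": "SlowQuery"},
    {"resource": "i-0abc", "condition": "HighCPU"},
]
print(group_alerts(flood))  # two groups instead of four raw alerts
```

Real deduplication would also fold in time windows and cross-service correlation, but the fingerprint idea is the core of it.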

AI offers significant value on the support side of operations as well. Natural-language chatbots (such as Amazon Lex) enable user self-service and are easy to implement in the cloud. Whether users want to reset their password or order a new mouse, the AI-driven bot can understand what they’re looking for and deliver the right information at the right time.
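The sketch below is a deliberately naive keyword matcher, included only to illustrate the intent-routing idea behind such a bot; a service like Amazon Lex performs real natural-language understanding. The intents and keywords are made up.

```python
# Hypothetical intents for an IT self-service bot.
INTENTS = {
    "reset_password": {"reset", "password", "locked", "login"},
    "order_hardware": {"order", "mouse", "keyboard", "laptop"},
}

def classify(utterance):
    """Toy intent matcher: score each intent by keyword overlap with the
    user's words. Returns None when nothing matches, so the request can
    fall through to a human."""
    words = set(utterance.lower().split())
    scores = {name: len(words & keywords) for name, keywords in INTENTS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(classify("I need to reset my password"))
print(classify("please order a new mouse"))
```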

Phase 3: Digitize operations

In this phase, AI Ops outputs can be placed into production and can react to events in your AWS cloud environment automatically. For example, if there's a capacity issue with an EC2 instance, an automated response, based on data collected from various sources, can trigger the resolution. This is what AI Ops can ultimately achieve: eliminating the manual operational work of diagnosing and resolving issues.
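As a minimal sketch of event-driven remediation, the Python below maps event types to runbook actions and escalates anything it doesn't recognize. The event names and handlers are invented; a production version would invoke real automation, such as AWS Systems Manager runbooks, rather than returning strings.

```python
def scale_out(event):
    # Placeholder for launching replacement capacity.
    return f"launched replacement capacity for {event['resource']}"

def restart_service(event):
    # Placeholder for a service restart runbook.
    return f"restarted service on {event['resource']}"

# Map event types to automated runbook actions. Unknown events fall
# through to a human operator rather than guessing.
RUNBOOK = {
    "InsufficientCapacity": scale_out,
    "ServiceUnresponsive": restart_service,
}

def handle(event):
    action = RUNBOOK.get(event["type"])
    if action is None:
        return f"escalated {event['type']} to on-call engineer"
    return action(event)

print(handle({"type": "InsufficientCapacity", "resource": "i-0abc"}))
```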

Much of what made up a traditional runbook is automated: machines act on changing conditions without human intervention. This stage demands significant trust in your models and automation approaches, and reaching it takes a detailed plan, the right expertise and skill set, and the right instrumentation.

By starting small, you can explore how to resolve various types of scenarios. For example, an ML model could automatically route an alert to a specific team based on the conditions it identifies. This is a relatively low-risk use case that you can use to get accustomed to managing and driving automation until it provides demonstrated improvements in the chosen KPI—in this case, mean time to resolution (MTTR). Humans can double-check machines’ decisions before they go into production.
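A bare-bones version of that routing step might look like the Python sketch below, which learns from historical tickets which team most often resolved each alert condition. The conditions, team names, and history are invented; a real system would use a trained classifier, with humans double-checking its routing decisions before they go into production.

```python
from collections import Counter, defaultdict

def train_router(history):
    """For each alert condition, record which team most often resolved
    it historically. A stand-in for a real ML classifier."""
    votes = defaultdict(Counter)
    for ticket in history:
        votes[ticket["condition"]][ticket["resolved_by"]] += 1
    return {cond: teams.most_common(1)[0][0] for cond, teams in votes.items()}

def route(router, alert, default="triage-queue"):
    # Unseen conditions go to a human triage queue instead of a guess.
    return router.get(alert["condition"], default)

history = [
    {"condition": "HighCPU", "resolved_by": "compute-team"},
    {"condition": "HighCPU", "resolved_by": "compute-team"},
    {"condition": "SlowQuery", "resolved_by": "db-team"},
]
router = train_router(history)
print(route(router, {"condition": "HighCPU"}))   # routed automatically
print(route(router, {"condition": "DiskFull"}))  # falls back to humans
```

Measuring MTTR before and after enabling such routing is what tells you whether the automation has earned more responsibility.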

Accelerating your journey to AI Ops

With the tools and resources available in the cloud, you can start experimenting now with AI Ops approaches. Get specific on your use cases. Choose a few that are low-risk yet offer significant potential for measurable improvement and try them out.

AWS provides many building blocks of AI Ops as fully managed cloud resources. For example, Amazon EMR makes it easy, fast, and cost-effective to process vast amounts of data. Amazon SageMaker provides every developer and data scientist with the ability to build, train, and deploy ML models quickly. And Amazon Forecast uses ML to deliver highly accurate predictions.

If you want to jump-start your journey to more efficient operations, get in touch with us here at AWS Managed Services. We are investing heavily in this space, applying machine learning to a wide range of operational challenges to reduce costs and increase quality.


Note: This blog has been updated on September 8, 2021 for the renaming of Amazon Elasticsearch Service to Amazon OpenSearch Service.

Sid Arora, Sr. Product Manager, AWS Managed Services

Sid Arora leads product development on OpsCenter at AWS Managed Services, working cross-functionally with AWS Systems Manager. Sid has led the building of multiple products at Amazon Web Services over the last 7+ years. His passions include leveraging machine learning and artificial intelligence to simplify and personalize user experiences across consumer, enterprise, and cloud operations products.

Vieng Soukhavong, Global Head of Operations for AWS Managed Services

Vieng has over two decades of experience leading and working in operational environments. He is passionate about operational excellence and delivering positive experiences for his teams and customers.

Mark Schwartz

Mark Schwartz is an Enterprise Strategist at Amazon Web Services and the author of The Art of Business Value and A Seat at the Table: IT Leadership in the Age of Agility. Before joining AWS he was the CIO of US Citizenship and Immigration Service (part of the Department of Homeland Security), CIO of Intrax, and CEO of Auctiva. He has an MBA from Wharton, a BS in Computer Science from Yale, and an MA in Philosophy from Yale.