AWS Cloud Operations Blog

Automate AIOps for your microservices in AWS using Amazon DevOps Guru and AWS Systems Manager Incident Manager

Artificial intelligence operations (AIOps) is the process of using machine learning techniques to solve operational problems. The goal of AIOps is to reduce human intervention in IT operations processes. By using advanced machine learning techniques, you can reduce operational incidents and increase service quality, and AIOps can help you predict incidents before they happen.

Amazon DevOps Guru offers a fully managed AIOps platform service that enables developers and operators to improve application availability and resolve operational issues faster. It minimizes manual effort by leveraging machine learning (ML) powered recommendations. Incident Manager, a capability of AWS Systems Manager, speeds up incident resolution by notifying responders of impact, highlighting relevant troubleshooting data, and enabling responder team escalation.

Microservices are an architectural and organizational approach to software development where software is composed of small independent services that communicate over well-defined APIs. Using microservices in the cloud provides key benefits such as increased agility, flexible scalability, and lower costs. However, customers commonly report three challenges on their journey to successful microservices adoption: small code changes may take too long to release, the increasingly distributed nature of microservices increases the time needed to resolve issues, and code and application security reviews are difficult to perform.

The solution described in this blog post aims to address these challenges for microservices adoption in AWS by demonstrating:

  1. Automated software delivery for microservices using continuous integration and delivery (CI/CD) pipelines
  2. Prescriptive and standardized enforcement of AWS security best practices for microservices
  3. Proactively addressing operational issues before they happen by creating operational incidents based on predictive insights in an AWS application’s microservices stack

Solution Overview

In this solution, you check in and update your microservices stack, which consists of Amazon API Gateway, AWS Lambda, and Amazon DynamoDB, to an AWS CodeCommit repository. The microservices stack and the DevOps Guru operational insight scenario that I demonstrate here are based on this blog post. On check-in, an AWS CodePipeline based DevOps automation is triggered that deploys and updates the microservices stack in AWS using AWS CodeBuild stages. You use your DevOps pipeline to modify the microservices stack, and you then simulate a surge in traffic. The solution automatically enables DevOps Guru for your microservices stack and generates operational insights and recommendations about issues that could pose a risk to your application availability.

Many customers already have documented escalation plans with contacts identified to handle operational issues. The solution in this blog post provides a 1-click automation that enables customers to incorporate their escalation plans and contact lists as an Incident Manager response plan. The response plan is triggered automatically by this solution whenever DevOps Guru discovers code and operational insights.

In order to provide prescriptive and standardized enforcement of AWS security best practices, the solution provisions an AWS Service Catalog portfolio with an AWS Service Catalog product that contains configuration compliance rules with AWS Config and automated remediations with Systems Manager for common configuration issues in AWS.

The solution automatically provisions Incident Manager incidents, Incident Manager response plans, and OpsCenter (a capability of Systems Manager) OpsItems for all DevOps Guru insights as well as AWS Config compliance violations.

You can download the full solution from here. The following diagram illustrates the architecture of our solution:

Figure 1: Solution architecture for AIOps for your microservices in AWS using Amazon DevOps Guru and AWS Systems Manager Incident Manager

Prerequisites

You must first complete the following prerequisites:

  1. Enable AWS Config in your account as well as optionally in all of your managed accounts in the organization.
    1. Conduct step 1 from the Automate configuration compliance at scale blog post in order to use Systems Manager Quick Setup so you can accomplish it with just a few clicks from your console.
  2. Complete these prerequisite steps to enable CloudFormation StackSets in your AWS environment.
  3. An Amazon S3 bucket
    • Create an S3 bucket named s3-microservices-[AccountID]-[Region]. You will use this bucket to upload your compliance template and as the staging bucket for your DevOps pipeline.
  4. Integrate AWS Cloud9 as a local Git repository with AWS CodeCommit as the remote Git repository:
    1. Complete Step 1 from this AWS CodeCommit tutorial to create a CodeCommit repository. Provide a name for your CodeCommit repository (for example microservices).
    2. From your AWS Cloud9 environment, follow these steps to clone your CodeCommit repository into Cloud9 and use Cloud9 as your local Git repository integrated with a remote CodeCommit repository.
    3. In your Cloud9 environment, change to the newly created project folder (for example, microservices) where you cloned your CodeCommit repo. Download the following files from the solution GitHub repo and upload them to the project folder (for example, microservices) of your Cloud9 local Git repository.
      1. cfn-shops-monitoroper-code.yaml
      2. buildspec.yml
      3. buildspec-update.yml
  5. In your buildspec.yml file, replace accountid and region with comma-separated AWS account IDs and Regions where you want to deploy your microservices stack.
    1. Use standard git commands from your Cloud9 cloned repository’s root folder (where the buildspec.yml file resides) to check in your microservices stack to the remote CodeCommit repository:
      git add .
      git commit -m "initial commit"
      git push origin *your branch name*
  6. Incident Manager requires defining contacts and, optionally, an escalation plan. Follow these steps here to define a contact. Once you have defined contacts, follow these steps to define an escalation plan based on the contact list. Note down the ARN of the contact details from the console.
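
The S3 bucket naming convention in the prerequisites above (s3-microservices-[AccountID]-[Region]) can be generated rather than typed by hand, which helps avoid typos when the same name is reused later as a template parameter. The following is a minimal sketch, not part of the solution repo; the function name staging_bucket_name is hypothetical, and the account ID and Region shown are placeholder values.

```python
import re


# Hypothetical helper: not part of the solution repo.
def staging_bucket_name(account_id, region):
    """Build the staging bucket name s3-microservices-[AccountID]-[Region]."""
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError("AWS account IDs are 12 digits")
    name = f"s3-microservices-{account_id}-{region}".lower()
    # General purpose bucket names are 3-63 characters of lowercase letters,
    # digits, dots, and hyphens, and begin and end with a letter or digit.
    if not re.fullmatch(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]", name):
        raise ValueError(f"not a valid S3 bucket name: {name}")
    return name


# Placeholder account ID and Region, for illustration only.
print(staging_bucket_name("123456789012", "us-east-1"))
# prints s3-microservices-123456789012-us-east-1
```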

Setup

CI/CD for microservices

Navigate to the CloudFormation console and create a stack by launching the aws-microservices-codepipeline.yml CloudFormation template.

  1. This template provisions the CodePipeline based DevOps automation with CodeCommit and CodeBuild stages.
  2. The template takes the following parameters:
    1. RepositoryName: CodeCommit repository for the Conformance Pack templates (for example, microservices).
    2. BranchName: Branch in the CodeCommit repository for the microservices stack (for example, master).
    3. StagingBucket: Name of the S3 staging bucket that stages the microservices stack copied from CodeCommit. In our case, this is s3-microservices-[accountid]-[region].
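
This post launches the template from the CloudFormation console. If you prefer to script the launch, the three parameters above map onto the ParameterKey/ParameterValue list that CloudFormation APIs accept. The sketch below only builds that list; the helper name, the stack name, and the Capabilities value in the commented call are assumptions, and the parameter values are placeholders.

```python
# Hypothetical helper; the parameter names come from the template description above.
def pipeline_stack_parameters(repository_name, branch_name, staging_bucket):
    """Return the parameters in the list-of-dicts shape CloudFormation expects."""
    values = {
        "RepositoryName": repository_name,
        "BranchName": branch_name,
        "StagingBucket": staging_bucket,
    }
    return [{"ParameterKey": key, "ParameterValue": value} for key, value in values.items()]


# Placeholder values, for illustration only.
params = pipeline_stack_parameters(
    "microservices", "master", "s3-microservices-123456789012-us-east-1"
)

# With boto3 installed and credentials configured, the list could be passed as:
#   with open("aws-microservices-codepipeline.yml") as template:
#       boto3.client("cloudformation").create_stack(
#           StackName="microservices-pipeline",     # assumed stack name
#           TemplateBody=template.read(),
#           Parameters=params,
#           Capabilities=["CAPABILITY_NAMED_IAM"],  # assumed; depends on the template
#       )
```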

AIOps and centralized incident management

Navigate to the CloudFormation console and create a stack by launching the aws-microservices-systemsmanager.yml CloudFormation template.

  1. This template provisions Incident Manager incidents, Incident Manager response plans, and OpsCenter OpsItems for all DevOps Guru insights as well as AWS Config compliance violations.
  2. The template takes the following parameters:
    1. IncidentPlanContactDetailsArn: ARN of the contact details for the incident plan. This is the ARN that you noted down in step 6 of the prerequisites.

Prescriptive compliance

Navigate to the CloudFormation console and create a stack by launching the aws-servicecatalog-prescriptivecompliance.yml CloudFormation template.

  1. This template provisions custom Systems Manager automation documents to provide automated remediations for AWS Config.
  2. The template provisions an AWS Service Catalog Portfolio with an AWS Config Remediations Product. This Service Catalog product provides automated detection with AWS Config and remediations with AWS Systems Manager.
  3. The template takes the following parameters:
    1. S3StagingBucketURL: URL of the Amazon S3 staging bucket from step 3 in the prerequisites section.

Validate

Validate devops

  1. From the CodePipeline console, validate that the DevOps pipeline is initiated and that its CodeCommit and CodeBuild stages execute successfully.
    1. From the API Gateway console, validate that two APIs have been created for your microservices stack. For the ListRestApiMonitorOper API, get the ‘api-id’ and note down the prod stage URL (https://api-id.execute-api.region.amazonaws.com/prod), which you will use later.
    2. From the Amazon DynamoDB console, check that a table has been created and note down the name of the table.
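
The prod stage URL noted in the step above follows a fixed pattern, so it can be assembled from the api-id and Region. The sketch below assumes you have a response in the shape returned by API Gateway's GetRestApis operation (a dictionary with an items list of entries carrying id and name); the helper names and the sample IDs are made up for illustration.

```python
# Hypothetical helpers: not part of the solution repo.
def find_api_id(rest_apis_response, api_name):
    """Return the id of the REST API called api_name from a GetRestApis-style response."""
    for api in rest_apis_response.get("items", []):
        if api.get("name") == api_name:
            return api["id"]
    raise LookupError(f"no REST API named {api_name}")


def prod_stage_url(api_id, region, stage="prod"):
    """Build the invoke URL for a deployed stage of a REST API."""
    return f"https://{api_id}.execute-api.{region}.amazonaws.com/{stage}"


# Made-up sample response; real ids come from your own account.
sample_response = {
    "items": [
        {"id": "a1b2c3d4e5", "name": "ListRestApiMonitorOper"},
        {"id": "f6g7h8i9j0", "name": "SomeOtherApi"},
    ]
}
api_id = find_api_id(sample_response, "ListRestApiMonitorOper")
print(prod_stage_url(api_id, "us-east-1"))
# prints https://a1b2c3d4e5.execute-api.us-east-1.amazonaws.com/prod
```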

Validate AIOps and Centralized Incident Management

  1. Upload the populate-shops-dynamodb-table.json file to your AWS Cloud9 environment, substitute the ‘TABLENAME’ field in this file with the name of the DynamoDB table that you noted earlier, and run the following command from the Cloud9 terminal:
    aws dynamodb batch-write-item --request-items file://populate-shops-dynamodb-table.json
  2. Use your DevOps pipeline to modify the microservices stack and create operational insights. In your Cloud9 terminal, navigate to line 15 of the cfn-shops-monitoroper-code.yaml and modify the ReadCapacityUnits value from 5 to 1 to reduce the read capacity of your DynamoDB table.
  3. Rename the buildspec-update.yml file from the solution repo to buildspec.yml. Use standard git commands to push the new update.
  4. Simulate a surge in traffic. In the sendAPIRequest.py file, substitute the value of the url field with the API Gateway prod stage url that you noted earlier. Run the file multiple times:
    python sendAPIRequest.py & python sendAPIRequest.py & python sendAPIRequest.py & python sendAPIRequest.py
  5. Navigate to the Incident Manager console and check that an open incident has been created. Validate that the Incident Manager response plan has been activated by verifying that the contact channels in your engagement plan are engaged (for example, via email and/or voice).
  6. Navigate to the Systems Manager OpsCenter console and check that an OpsItem has been generated by the incident that provides details of the operational issues with your DynamoDB usage.
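
The traffic surge in step 4 works by running several copies of sendAPIRequest.py at the same time. The same effect can be sketched in a single script with a thread pool. This is an illustration only: the contents of sendAPIRequest.py are not shown in this post, so the use of plain GET requests, the function names, and the default request counts are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.error import HTTPError
from urllib.request import urlopen


def send_request(url):
    """Send one GET request and return the HTTP status code (GET is an assumption)."""
    with urlopen(url, timeout=10) as response:
        return response.status


def simulate_surge(url, total_requests=100, workers=4, send=send_request):
    """Call url total_requests times from `workers` concurrent threads.

    Returns one entry per request: the status code, or None if the request
    failed before any response arrived.
    """
    def attempt(_):
        try:
            return send(url)
        except HTTPError as error:
            return error.code  # 4xx/5xx responses, for example throttling errors
        except OSError:
            return None  # DNS failure, timeout, or refused connection

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(attempt, range(total_requests)))


# Example call, using the prod stage URL that you noted earlier:
#   statuses = simulate_surge("https://api-id.execute-api.region.amazonaws.com/prod")
#   print(statuses.count(200), "of", len(statuses), "requests succeeded")
```

The send argument is injectable so that the concurrency logic can be exercised without making network calls.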

Validate compliance

  1. The aws-servicecatalog-prescriptivecompliance.yml template creates an AWS Service Catalog portfolio in your account with an AWS Service Catalog product consisting of AWS Config rules and integrated remediation runbooks. It also creates an AWS IAM group (EndUserGroup), an AWS IAM role (EndUserRole) for the end user, and a launch constraint that allows only members of that IAM group or that role to launch the Service Catalog product. Navigate to the AWS IAM console, create an IAM user that is a member of the EndUserGroup, and then log out.
  2. Navigate to the AWS Service Catalog console now logged in as the IAM end user that you created, navigate to the left sidebar, and choose Products. Select the ‘AWS ConfigRemediations Compliance Product’ product, accept the defaults, and select Launch Product. The Service Catalog product screen will auto refresh until the product has been launched. Select Provisioned Products from the left sidebar in order to validate that the product has been launched and the status shows available. Log out as the IAM end user.
  3. Navigate to the AWS Config console again with your administrator login and validate that several AWS Config managed rules have been provisioned. Each Config rule has an AWS Systems Manager automation document associated with it as a remediation.
    1. From the AWS Systems Manager automation console, check that custom AWS Systems Manager Automation documents have been provisioned. Look under ‘Owned by me’ documents in Systems Manager automations.
    2. Test continuous compliance. From the Amazon EC2 console, select Security Groups and select the Security group ID of the default VPC. Select Edit inbound rules. Select Add rule. Select SSH as the Type and 0.0.0.0/0 as the Source.
    3. After a few minutes, validate continuous compliance:
      1. From the AWS Config console, check that the RestrictDefaultSecurityGroup Config rule is triggered from your deployed Conformance Pack.
      2. From the AWS Systems Manager console, select Automation. You should see a successful automation execution that corresponds to the Custom-RestrictSecurityGroup automation document.
      3. Validate continuous compliance of your AWS environment by selecting Security Groups in the Amazon EC2 console.
      4. Then select the Security group ID of the default VPC. Confirm that the SSH rule is removed from the Inbound rules tab.
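
The final check above (confirming that the SSH rule open to 0.0.0.0/0 is gone) can also be expressed as a small function over a security group description. The sketch below assumes the dictionary shape returned by the EC2 DescribeSecurityGroups operation (IpPermissions entries with IpProtocol, FromPort, ToPort, and IpRanges); the function name and sample data are hypothetical, and it is not the logic used by the solution's Config rule or automation document.

```python
# Hypothetical helper: not the logic used by the solution's Config rule.
def has_open_ssh_rule(security_group):
    """Return True if any inbound rule allows port 22 from 0.0.0.0/0."""
    for permission in security_group.get("IpPermissions", []):
        protocol = permission.get("IpProtocol")
        if protocol == "-1":
            covers_ssh = True  # "-1" means all protocols and ports
        elif protocol == "tcp":
            covers_ssh = permission.get("FromPort", 0) <= 22 <= permission.get("ToPort", 65535)
        else:
            covers_ssh = False
        open_to_world = any(
            ip_range.get("CidrIp") == "0.0.0.0/0"
            for ip_range in permission.get("IpRanges", [])
        )
        if covers_ssh and open_to_world:
            return True
    return False


# Made-up sample: the default security group right after adding the test rule.
noncompliant_group = {
    "GroupName": "default",
    "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
    ],
}
# The same group after the remediation removed the rule.
remediated_group = {"GroupName": "default", "IpPermissions": []}

print(has_open_ssh_rule(noncompliant_group), has_open_ssh_rule(remediated_group))
# prints True False
```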

Cleanup

To avoid recurring charges, and to clean up your account after trying the solution outlined in this post, perform the following:

  1. Delete the CloudFormation stacks in the following sequence for these templates from the solution:
    1. cfn-shops-monitoroper-code.yaml
    2. aws-microservices-codepipeline.yml
    3. aws-servicecatalog-prescriptivecompliance.yml
    4. aws-microservices-systemsmanager.yml
  2. Delete the s3-microservices-[AccountID]-[Region] Amazon S3 bucket that was created for this solution.
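
The ordered stack deletion in step 1 can be scripted so that each stack is fully deleted before the next deletion starts. This is a sketch under stated assumptions: it takes a CloudFormation client such as the one returned by boto3.client("cloudformation"), it assumes each template was launched as an ordinary stack in the current account and Region (stacks deployed through StackSets need a different procedure), and the stack names in the commented call are placeholders for whatever names you chose at launch.

```python
# Hypothetical helper: not part of the solution repo.
def delete_stacks_in_order(cloudformation, stack_names):
    """Delete each stack and wait for deletion to finish before moving on."""
    waiter = cloudformation.get_waiter("stack_delete_complete")
    deleted = []
    for name in stack_names:
        cloudformation.delete_stack(StackName=name)
        waiter.wait(StackName=name)  # blocks until the stack is gone
        deleted.append(name)
    return deleted


# Example call, with placeholder stack names in the sequence listed above:
#   import boto3
#   delete_stacks_in_order(
#       boto3.client("cloudformation"),
#       ["shops-monitoroper", "microservices-pipeline",
#        "prescriptive-compliance", "microservices-systemsmanager"],
#   )
```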

Conclusion

While the use of microservices provides several benefits in AWS such as increased agility, flexible scalability and lower costs, the distributed nature of microservices also presents challenges such as increasing time for issue resolution or making code and application security reviews difficult to perform. In this blog post, I’ve demonstrated a solution that enables you to address challenges for successful microservices adoption in AWS by automating the use of AIOps. The solution also demonstrates features such as built-in, automated software delivery for your microservices using continuous integration and delivery (CI/CD) pipelines and prescriptive and standardized enforcement of AWS security best practices.

About the author:

Kanishk Mahajan

Kanishk Mahajan is Principal, WW Cross-Service Solutions Architect at AWS, where he specializes in cloud operations, resilience, migrations and modernizations, and security and compliance.