How to automate capture and analysis of CI/CD metrics using the AWS DevOps Monitoring Dashboard solution

Across the world, organizations are investing in DevOps tools to improve productivity in their software delivery process. Customers tell us that they want to collect performance and operational metrics from their continuous integration/continuous delivery (CI/CD) pipelines to quantify the value of their DevOps automation investments and to identify opportunities to make software delivery more efficient. However, some customers find it challenging to identify the right metrics and to aggregate them from the various components of a CI/CD pipeline, because this process can be complex and time-consuming.

In this blog post, we show you how to save time and effort by using the AWS DevOps Monitoring Dashboard solution to automate the setup for collecting and visualizing DevOps metrics. This solution is a reference implementation that makes it easier for organizations of all sizes to collect, analyze, and visualize key operational metrics in their software delivery process.

This solution helps DevOps leaders measure the impact of their DevOps initiatives and make data-driven decisions to drive continuous improvement in their development teams.

The solution supports ingestion, analysis, and visualization of data from AWS Developer Tools (AWS CodeCommit and AWS CodeDeploy) and Amazon CloudWatch Synthetics to calculate key DevOps metrics such as mean time to recovery (MTTR), change failure rate, deployment frequency, and code change volume.

Solution overview

The following architectural diagram illustrates the workflow of the solution.

[Figure: AWS DevOps Monitoring Dashboard solution architecture. The workflow is described in the following steps.]

The workflow includes the following steps:

1. A developer initiates an activity in a CI/CD pipeline, such as pushing a code change to AWS CodeCommit or deploying an application using AWS CodeDeploy. These activities create events.
2. An Amazon EventBridge events rule detects the events based on predefined event patterns and sends the event data to an Amazon Kinesis Data Firehose delivery stream. One events rule is created per event source.
3. If you have set up an Amazon CloudWatch Synthetics canary and a CloudWatch alarm that monitors the canary's status in your account, another events rule captures the alarm's state-change events. These resources are required to gather the data for calculating MTTR metrics.
4. The Kinesis Data Firehose delivery stream uses an AWS Lambda function for data transformation. The Lambda function parses the raw event data and returns only the relevant fields to the delivery stream, which in turn delivers the data to an Amazon Simple Storage Service (Amazon S3) bucket for downstream processing.
5. An Amazon Athena database points to the data in Amazon S3. Athena runs queries against this data and returns the query results to Amazon QuickSight.
6. Amazon QuickSight uses the query results to build dashboard visualizations for the management team.
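To make step 4 more concrete, here is a minimal sketch of a Firehose data-transformation Lambda function, assuming the standard Kinesis Data Firehose transformation contract (each input record has a recordId and base64-encoded data, and each output record returns a result of Ok or ProcessingFailed). The KEPT_DETAIL_FIELDS whitelist and the output schema are hypothetical, and the detail field names are assumptions based on the CodeCommit, CodeDeploy, and CloudWatch alarm event formats; the solution's own Lambda function may keep different fields and write a different schema.

```python
import base64
import json

# Hypothetical whitelist of event detail fields to keep, per event source.
# Check these names against real events in your account before relying on them.
KEPT_DETAIL_FIELDS = {
    "aws.codecommit": ["repositoryName", "referenceName", "commitId", "callerUserArn"],
    "aws.codedeploy": ["application", "deploymentGroup", "deploymentId", "state"],
    "aws.cloudwatch": ["alarmName", "state"],
}


def transform_event(raw: dict) -> dict:
    """Flatten one EventBridge event into the fields needed for metrics."""
    source = raw.get("source", "")
    detail = raw.get("detail", {})
    kept = {name: detail.get(name) for name in KEPT_DETAIL_FIELDS.get(source, [])}
    return {
        "source": source,
        "detail_type": raw.get("detail-type"),
        "time": raw.get("time"),
        "account": raw.get("account"),
        "region": raw.get("region"),
        **kept,
    }


def lambda_handler(event, context):
    """Decode each Firehose record, transform it, and re-encode it."""
    output = []
    for record in event["records"]:
        try:
            raw = json.loads(base64.b64decode(record["data"]))
            # One JSON object per line so Athena can read each event as a row.
            line = json.dumps(transform_event(raw)) + "\n"
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(line.encode("utf-8")).decode("utf-8"),
            })
        except (ValueError, KeyError, AttributeError):
            # Unparseable input: report failure and pass the original data back.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record.get("data", ""),
            })
    return {"records": output}
```

Returning ProcessingFailed instead of raising lets Firehose handle a single bad record without failing the whole batch.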

Solution prerequisites

1. You need a CI/CD pipeline in your account that consists of AWS CodeCommit, AWS CodeBuild, AWS CodeDeploy, and AWS CodePipeline. If you do not currently have a pipeline set up on AWS, see Set Up a CI/CD Pipeline on AWS for instructions.

2. If you want to use the Amazon QuickSight visualization feature, you must subscribe to Amazon QuickSight Enterprise edition in the account where you deploy the solution. If you do not have a QuickSight Enterprise account set up, see Signing Up for an Amazon QuickSight Subscription in the Amazon QuickSight User Guide for instructions.

Make a note of the QuickSight Principal Amazon Resource Name (ARN). You will need it when you deploy the solution. To retrieve the Amazon QuickSight Principal ARN, you must have access to a shell or terminal with the AWS CLI installed. For installation instructions, see What is the AWS Command Line Interface in the AWS CLI User Guide. You can also use AWS CloudShell to run AWS CLI commands.

Run the list-users command to return the list of users and their QuickSight user ARNs (arn:aws:quicksight:<aws-region>:<account-id>:user/<namespace-name>/<quicksight-user-name>). The default namespace name is default.

aws quicksight list-users --region <aws-region> --aws-account-id <account-id> --namespace <namespace-name>

Here is an example ARN: arn:aws:quicksight:us-east-1:111122223333:user/default/myquicksightuser

Choose a user who has permissions to create QuickSight resources in the account and AWS Region, such as a QuickSight admin user.
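If the account has many QuickSight users, the list-users output can be filtered for candidates. This is a minimal sketch, assuming you saved the command's JSON output to a file (for example, by appending > users.json); admin_user_arns is a hypothetical helper, and treating an active user with the ADMIN role as a suitable principal is a simplifying assumption.

```python
import json
from typing import List


def admin_user_arns(list_users_output: str) -> List[str]:
    """Return the ARNs of active ADMIN users from `aws quicksight list-users` JSON."""
    users = json.loads(list_users_output).get("UserList", [])
    return [
        user["Arn"]
        for user in users
        # Assumption: an active ADMIN user can create QuickSight resources.
        if user.get("Role") == "ADMIN" and user.get("Active", True)
    ]


# Usage:
# with open("users.json") as f:
#     print(admin_user_arns(f.read()))
```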

Step 1: Launch the solution

Launch the provided AWS CloudFormation template to install the solution into your AWS account. The template has the following input parameters. You can modify them as needed.

Parameter	Default	Description
Athena Query Data Duration (Days)	90	Number of days of data that Athena queries retrieve. By default, Athena queries retrieve data from the last 90 days. We recommend that you limit the duration to maximize performance and minimize cost.
AWS CodeCommit Repository List	'ALL'	Names of the AWS CodeCommit repositories to monitor. Enclose each name in single quotation marks and separate the names with commas (for example, 'MyRepository1','MyRepository2'). To monitor all repositories, keep the default value 'ALL'.
S3 Transition Days	365	Number of days after which Amazon S3 objects transition to the Amazon S3 Glacier storage class. By default, objects transition to Amazon S3 Glacier 365 days (one year) after creation.
Amazon QuickSight Principal ARN	<Optional Input>	ARN of a QuickSight admin user (arn:aws:quicksight:<aws-region>:<account-id>:user/default/<quicksight-user-name>) used to create the QuickSight dashboards automatically. QuickSight Enterprise edition must be enabled for the account. To skip QuickSight dashboard creation, leave this blank.
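The quoting rule for the AWS CodeCommit Repository List parameter is easy to get wrong by hand. The following sketch builds the value from a list of names; format_repository_list is a hypothetical helper, and mapping an empty list to the default 'ALL' is a design choice of this sketch rather than documented solution behavior.

```python
from typing import List


def format_repository_list(repositories: List[str]) -> str:
    """Build the repository list value: quoted names separated by commas."""
    if not repositories:
        # Sketch's choice: no names means "monitor every repository".
        return "'ALL'"
    for name in repositories:
        # Quotes or commas inside a name would break the list format.
        if "'" in name or "," in name:
            raise ValueError(f"unsupported character in repository name: {name!r}")
    return ",".join(f"'{name}'" for name in repositories)
```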

The stack takes approximately 10 minutes to deploy. After it’s successfully deployed, you should see that its status is CREATE_COMPLETE in the AWS CloudFormation console.
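You can also check the deployment from the AWS CLI with aws cloudformation describe-stacks --stack-name <stack-name>, whose JSON output includes the stack status and its outputs (such as QSAnalysisURL, QSDashboardURL, and DevOpsMetricsS3Bucket, which Step 2 uses). The following sketch, assuming that JSON was saved to a string, verifies the status and collects the outputs into a dictionary; stack_outputs is a hypothetical helper.

```python
import json
from typing import Dict


def stack_outputs(describe_stacks_output: str) -> Dict[str, str]:
    """Return the stack Outputs as {OutputKey: OutputValue} once it is CREATE_COMPLETE."""
    stack = json.loads(describe_stacks_output)["Stacks"][0]
    status = stack["StackStatus"]
    if status != "CREATE_COMPLETE":
        # Still deploying (or failed): outputs may be missing or incomplete.
        raise RuntimeError(f"stack is not ready: {status}")
    return {item["OutputKey"]: item["OutputValue"] for item in stack.get("Outputs", [])}
```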

Note: If you provide an Amazon QuickSight Principal ARN, the solution launches a nested stack that creates the QuickSight resources in your account.

Step 2: Configure Amazon QuickSight

The solution uses Amazon QuickSight for data visualization. Follow these instructions to configure permissions and view datasets, analysis, and dashboards in Amazon QuickSight.

Note: You can set up your own visualization tool if needed. For more information, see the AWS DevOps Monitoring Dashboard Implementation Guide.

1. After the stack is successfully deployed, go to the Outputs tab of the stack and make a note of the values for QSAnalysisURL, QSDashboardURL, and DevOpsMetricsS3Bucket.
2. Sign in to the AWS Management Console and navigate to the Amazon QuickSight console.
3. Change the AWS Region in the URL to match the Region where you deployed the solution. For example, if you deployed the solution in the us-east-1 Region, the QuickSight URL looks like the following: https://us-east-1.quicksight.aws.amazon.com/sn.
4. In the upper right, choose your user name, and then choose Manage QuickSight.
5. From the left navigation pane, choose Security & permissions.
6. Under QuickSight access to AWS Services, choose Add or remove.
7. Select IAM, Amazon S3, and Amazon Athena. If these options are already selected, clear them and then select them again.
8. Choose Amazon S3, and then choose the Details link.
9. Choose Select S3 buckets.
10. Select the bucket named in the DevOpsMetricsS3Bucket output, and then select the check box under Write permission for Athena Workgroup.
11. Choose Finish, and then choose Update.
12. On the Outputs tab of the stack, choose QSAnalysisURL and QSDashboardURL to open the analysis and the dashboard. You can also navigate to them in the QuickSight console. The solution creates one analysis, one dashboard, and a few datasets. All the QuickSight resources created by the solution are prefixed with the stack name (for example, <stack-name>-dashboard).

Note: You might see a No Data message if the Amazon S3 metrics bucket is empty immediately after the solution is launched. Allow time for CI/CD activities to be sent to the solution. You can refresh the pages to view data and visuals after the solution finishes processing data and sends metrics to Amazon S3.

Step 3: Set up the canary and alarm

The solution uses an Amazon CloudWatch Synthetics canary and an Amazon CloudWatch alarm to collect the data required for calculating MTTR metrics. Canaries are configurable scripts that run on a schedule to monitor your endpoints and APIs. The alarm changes state when the canary job's status changes between failure and success. The solution uses the alarm data to calculate the time it takes to restore a service, based on the interval between failure and success events.
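The interval calculation described above can be sketched as follows, assuming the alarm's state changes have already been extracted as (timestamp, state) pairs, where ALARM marks a failure and OK marks recovery. This is an illustration of the idea, not the solution's actual Athena query; mttr_minutes and its input shape are hypothetical.

```python
from datetime import datetime
from typing import List, Optional, Tuple


def _parse(timestamp: str) -> datetime:
    # datetime.fromisoformat() accepts a trailing "Z" only from Python 3.11.
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def mttr_minutes(transitions: List[Tuple[str, str]]) -> Optional[float]:
    """Mean minutes from each ALARM transition to the next OK transition.

    `transitions` holds (ISO-8601 timestamp, new alarm state) pairs for one alarm.
    """
    restore_times = []
    outage_start = None
    for timestamp, state in sorted(transitions, key=lambda t: _parse(t[0])):
        if state == "ALARM" and outage_start is None:
            outage_start = _parse(timestamp)  # canary started failing
        elif state == "OK" and outage_start is not None:
            elapsed = _parse(timestamp) - outage_start
            restore_times.append(elapsed.total_seconds() / 60)
            outage_start = None  # service restored
    # An outage still open at the end of the data is not counted.
    return sum(restore_times) / len(restore_times) if restore_times else None
```

For example, outages lasting 30 and 10 minutes give an MTTR of 20 minutes. States other than ALARM and OK, such as INSUFFICIENT_DATA, are ignored in this sketch.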

Choose one of the following ways to set up the canary and its alarm:

Automated setup (Recommended)

For your convenience, the solution provides a canary-alarm.template that you can deploy to create an alarm and/or canary in your account.

Manual setup

If you don’t have a canary, follow these steps to create one; otherwise, go to the next step.
Follow these steps to create an alarm that monitors the state of the canary job. When you reach the step to select metrics, make sure you select CloudWatchSynthetics, your canary, and the SuccessPercent metric, as shown in the following two figures.

Select CloudWatchSynthetics metrics for the alarm.

[Figure: On the All metrics tab, CloudWatchSynthetics is selected for the alarm.]

On the All metrics tab, search by canary, and then select your canary with the metric name SuccessPercent.

[Figure: On the All metrics tab, the canary named mycanary with the metric name SuccessPercent is selected as the metric for the alarm.]

Use this pattern for the alarm name: SO0143-[my-application-name]-[my-repository-name]-MTTR (for example, SO0143-[MyDemoApplication]-[MyDemoRepo]-MTTR). SO0143 is the solution ID. The application name is the name of the application that your canary monitors. The repository name is the name of the repository where the source code for your application resides. The solution uses the alarm name to determine if an alarm is used for MTTR metrics and which application and repository are associated with the metrics.
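The naming convention can be checked before you create the alarm. This sketch assumes the square brackets are literal characters, as in the example name SO0143-[MyDemoApplication]-[MyDemoRepo]-MTTR; parse_mttr_alarm_name is a hypothetical helper, and the solution's own parsing logic may differ.

```python
import re
from typing import Optional, Tuple

# Assumption: brackets are literal, so names containing hyphens stay unambiguous.
ALARM_NAME = re.compile(
    r"^SO0143-\[(?P<application>[^\]]+)\]-\[(?P<repository>[^\]]+)\]-MTTR$"
)


def parse_mttr_alarm_name(name: str) -> Optional[Tuple[str, str]]:
    """Return (application, repository) for an MTTR alarm name, else None."""
    match = ALARM_NAME.match(name)
    if match is None:
        return None  # not an MTTR alarm for this solution
    return match.group("application"), match.group("repository")
```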

In the Conditions section, for Threshold type, choose Static. For Whenever SuccessPercent is, choose Lower, and enter a threshold value that fits your use case (for example, 100).

The following figure shows the alarm:

In this example, the alarm is triggered when the success percentage of the canary job drops below 100% within a 5-minute period.
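If you prefer to create the alarm programmatically instead of in the console, the settings above map to the keyword arguments of the boto3 CloudWatch put_metric_alarm call: the SuccessPercent metric in the CloudWatchSynthetics namespace, the CanaryName dimension, a static threshold of 100, and Lower (LessThanThreshold) over a 5-minute period. This is a sketch: mttr_alarm_request is a hypothetical helper, and the Average statistic and single evaluation period are assumptions you should adjust for your use case.

```python
def mttr_alarm_request(application: str, repository: str, canary_name: str,
                       threshold: float = 100.0, period_seconds: int = 300) -> dict:
    """Keyword arguments for boto3's CloudWatch put_metric_alarm call."""
    return {
        # Follows the SO0143-[application]-[repository]-MTTR naming pattern.
        "AlarmName": f"SO0143-[{application}]-[{repository}]-MTTR",
        "Namespace": "CloudWatchSynthetics",
        "MetricName": "SuccessPercent",
        "Dimensions": [{"Name": "CanaryName", "Value": canary_name}],
        "Statistic": "Average",  # assumption
        "Period": period_seconds,  # 300 seconds = 5 minutes
        "EvaluationPeriods": 1,  # assumption
        "Threshold": threshold,
        "ComparisonOperator": "LessThanThreshold",  # "Lower" in the console
    }


# Usage (requires boto3 and AWS credentials):
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(
#     **mttr_alarm_request("MyDemoApplication", "MyDemoRepo", "mycanary"))
```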

Amazon QuickSight dashboard visualizations

After the initial setup is done and the solution starts to process events data, you can view the metrics visualizations using Amazon QuickSight dashboards.

Let’s take a look at the dashboards that Amazon QuickSight builds from data queried with Amazon Athena. By default, Athena queries retrieve data from the last 90 days to analyze and visualize DevOps metrics. You can change this duration with the Athena Query Data Duration parameter of the CloudFormation template. We recommend that you limit the data duration to maximize performance and minimize cost.

Code change volume dashboards

These dashboards display the number of code changes made by author and repository. They provide a weekly, monthly, and aggregated view of the metrics by author and repository. You can use the custom filter to filter data by author, repository, or time period. DevOps leaders can use this dashboard to improve visibility into the coding activities of their development teams. They can answer questions such as who makes the most code changes and which repositories are the most active over time.

The dashboard includes graphs for Weekly Change Volume by Author, Weekly Change Volume by Repository, Monthly Change Volume by Author, Monthly Change Volume by Repository, Total Change Volume by Author, and Total Change Volume by Repository.
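The weekly views amount to counting changes per ISO week and per author or repository. This is a minimal sketch of that aggregation, assuming each code change has been reduced to a dictionary with date, author, and repository keys; the input shape and weekly_change_volume are hypothetical, since the solution computes these views with Athena and QuickSight.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable


def weekly_change_volume(changes: Iterable[Dict[str, str]],
                         group_by: str = "author") -> Counter:
    """Count code changes per (ISO year, ISO week, author or repository)."""
    counts = Counter()
    for change in changes:
        iso_year, iso_week, _ = date.fromisoformat(change["date"]).isocalendar()
        counts[(iso_year, iso_week, change[group_by])] += 1
    return counts
```

Passing group_by="repository" produces the Weekly Change Volume by Repository view from the same input.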

MTTR dashboards

These dashboards display outage minutes by application and the average time it takes to restore an application from a failure state to a success state. They provide a weekly, monthly, and aggregated view of the metrics by application. You can use the custom filter to filter data by application or time period. These dashboards help DevOps leaders correlate change activity with system stability and identify opportunities to improve the stability of applications.

The dashboard displays graphs for Weekly Average Outage Minutes by Application, Weekly Number of Outages by Application, Monthly Outage Minutes by Application, Monthly Number of Outages by Application, Total Outage Minutes by Application, and Total Number of Outages by Application.

Change failure rate dashboards

These dashboards display the frequency of deployment failures per application by measuring the ratio of unsuccessful to total deployments. They provide a weekly, monthly, and aggregated view of the metrics by application. You can use the custom filter to filter metrics by application or time period. These dashboards help DevOps leaders track the code quality of their development teams and drive improvements to reduce the change failure rate over time.

The dashboard displays graphs for Overall Change Failure Rate by Application, Weekly Change Failure Rate by Application, Weekly Change Failure Rate – Breakdown, Weekly Change Failure Rate Trend, Monthly Change Failure Rate Trend, Monthly Change Failure Rate by Application, and Monthly Change Failure Rate – Breakdown.
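The change failure rate is the ratio of unsuccessful to total deployments. Here is a minimal sketch of that ratio per application, assuming each deployment has been reduced to an (application, state) pair using CodeDeploy state values. Counting only SUCCESS and FAILURE as completed deployments (so that START events for the same deployment are not double-counted) is an assumption of this sketch, and change_failure_rate is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Assumption: only these terminal states count as completed deployments.
COMPLETED_STATES = {"SUCCESS", "FAILURE"}


def change_failure_rate(deployments: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Failed deployments divided by completed deployments, per application."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for application, state in deployments:
        if state not in COMPLETED_STATES:
            continue  # e.g. START events describe an in-progress deployment
        totals[application] += 1
        if state == "FAILURE":
            failures[application] += 1
    return {app: failures[app] / totals[app] for app in totals}
```

For example, one failure out of four completed deployments gives a change failure rate of 0.25 (25%).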

Deployment dashboards

These dashboards display the deployment frequency and state (success/failure) by application. They provide a weekly, monthly, and aggregated view of the metrics by application. You can use the custom filter to filter metrics by application or time period. These dashboards enable DevOps leaders to track the frequency and quality of their continuous software releases to end users.

The dashboard displays graphs for Total Deployments by Application, Weekly Deployment Trend, Monthly Deployment Trend, Weekly Deployment by Application, Monthly Deployment by Application, Deployment State by Application, and Tabular View of Deployments by Application and State.
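The monthly deployment views combine frequency and state. As a sketch, assuming each completed deployment has been reduced to a (YYYY-MM-DD date, application, state) tuple, the following counts deployments per month, application, and state; monthly_deployments and its input shape are hypothetical.

```python
from collections import Counter
from datetime import date
from typing import Iterable, Tuple


def monthly_deployments(deployments: Iterable[Tuple[str, str, str]]) -> Counter:
    """Count deployments per (year, month, application, state)."""
    counts = Counter()
    for day, application, state in deployments:
        d = date.fromisoformat(day)
        counts[(d.year, d.month, application, state)] += 1
    return counts
```

Summing the counts across states gives the monthly deployment frequency for an application, and keeping them separate gives the state breakdown.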

Cleanup

To avoid ongoing charges in your account, delete the stack and its resources after you finish testing the solution.

To use the console

In the AWS CloudFormation console, choose the solution’s root stack, and choose Delete.

To use the AWS CLI

Run the following command in your AWS CLI environment:

$ aws cloudformation delete-stack --stack-name <installation-stack-name>

Note: Deleting the solution stack removes all resources except the S3 bucket that the solution created, because the bucket’s deletion policy is set to Retain. You can delete this S3 bucket manually.

Conclusion

In this blog post, we showed you how to deploy the AWS DevOps Monitoring Dashboard solution using an AWS CloudFormation template to collect data from AWS Developer Tools and visualize it using Amazon QuickSight. The solution enables you to monitor your team’s CI/CD activities so that DevOps leaders can track their development team’s performance and drive business goals around continuous improvement in software delivery.

For more information about the solution, see the AWS DevOps Monitoring Dashboard Implementation Guide for a description of solution components, step-by-step directions, cost estimates, and more. Visit our GitHub repository to download the source code for this solution and to share your customizations with others. For more information, visit the AWS Solutions Library.

About the authors

Aijun Peng is a Solutions Builder and Data Analytics SME at Amazon Web Services. She is passionate about using AWS best practices to design and build cloud-based solutions that help customers solve common problems. In her spare time, she enjoys traveling, cooking and listening to music.

Rakshana Balakrishnan is a Technical Program Manager at Amazon Web Services. She is focused on delivering innovative turnkey solutions to solve customers’ business problems and help them stay ahead of the curve. Outside of work, Rakshana enjoys hiking, yoga, and art.

AWS Cloud Operations & Migrations Blog