Building Self-Healing Infrastructure-as-Code with Dynatrace, AWS Lambda, and AWS Service Catalog
By Kishore Vinjam, Partner Solutions Architect at AWS
By Andreas Grabner, Director of Strategic Partner Enablement & Evangelism at Dynatrace
Engineers working in cloud operations or system administration are often tasked with manually updating scripts to remediate issues detected by performance monitoring tools. These issues can involve process restarts, resource cleanup, out-of-memory conditions, and high CPU utilization. They occur frequently, and fixing them by hand takes up precious time.
In this post, we demonstrate how customers can use Dynatrace, AWS Lambda, and AWS Service Catalog to build a workflow that initiates the required incident response to problems detected by Dynatrace AI.
Dynatrace AI detects a problem and triggers a notification when underlying system resources affect the real user experience, service level agreements (SLAs), or service availability. Examples of such underlying issues include a full disk, a bad configuration change, increased load, or a problem in a dependent service.
AWS Service Catalog allows you to create and manage catalogs of services that are approved for use on AWS, while Lambda lets you run code without provisioning or managing servers.
Dynatrace is an AWS Partner Network (APN) Advanced Technology Partner with AWS Competencies in Migration, DevOps, and Containers. If you want to be successful in today’s complex IT environment, and remain that way tomorrow and into the future, teaming up with an AWS Competency Partner like Dynatrace is The Next Smart.
The AWS Competency Program verifies, validates, and vets top APN Partners that have demonstrated customer success and deep specialization in specific solution areas or segments.
Auto Tune AWS Service Catalog Artifacts
AWS Service Catalog administrators prepare AWS CloudFormation templates, configure constraints, and manage AWS Identity and Access Management (IAM) roles that are assigned to products and provide for advanced resource management. Customers use AWS Service Catalog to launch products to which they have been granted access.
Additionally, customers that have implemented Infrastructure-as-Code governed through AWS Service Catalog will be able to extend their infrastructure provisioning pipeline and perform Auto Remediation-as-Code, thereby eliminating manual steps required to roll out changes to the provisioning scripts.
As mentioned earlier, Dynatrace can identify various incidents and trigger problem notifications for them. The remediation action taken in response to these events, which we refer to as business logic, changes depending on the type of trigger and the actual impact.
While there is no limit to the business logic you can develop, it’s impossible to discuss all possibilities here. In this post, we explain one piece of business logic, which handles the CPU_SATURATION event that impacts SLAs. As the event name implies, it is triggered when CPU resource starvation occurs. Other event types are described on the Dynatrace website under Problem Detection and Analysis.
In the organizations we work with, we often see a cloud admin team create a catalog of allowed products that includes Amazon Elastic Compute Cloud (Amazon EC2) instances for data science workstations and a few other use cases. The ability to launch these products is given to various engineers within the organization, who self-provision instances and terminate them once done.
When Dynatrace triggers the CPU_SATURATION event repeatedly, that indicates a need for more CPU power. Typically, by the time this gets escalated to the cloud admin and corrective action is taken, users have already had a poor experience, if not service downtime.
In the sample solution below, we’ll show you how we can automate this remediation process with the help of Dynatrace, AWS Lambda, and AWS Service Catalog.
In this solution, you will see how to auto tune AWS Service Catalog products based on the event triggered by Dynatrace. You start with deploying an Amazon EC2 instance along with Dynatrace OneAgent using the AWS CloudFormation template provided. Then, you will configure the Dynatrace SaaS Instance to monitor your AWS environment.
OneAgent collects performance metrics from Amazon EC2 and sends them to the Dynatrace SaaS Instance. When Dynatrace identifies a problem that impacts end user experience, SLAs, or service availability of the underlying resources, a problem notification event is triggered. The Dynatrace SaaS Instance is configured to trigger a Lambda function through Amazon API Gateway. The Lambda function further validates the issue and auto tunes the AWS Service Catalog Artifact.
The following diagram shows the solution’s architecture:
Figure 1 – Dynatrace Auto Tune Service Catalog Artifacts.
- AWS Service Catalog is used to provision an Amazon EC2 instance product that also installs Dynatrace OneAgent through user data during the Amazon EC2 initialization. OneAgent collects performance metrics from the Amazon EC2 hosts.
- Dynatrace SaaS Instance is configured to monitor the AWS environment.
- The Amazon API Gateway, AWS Lambda, Amazon EC2, SSM Parameter Store, and AWS CloudFormation templates are used in this configuration.
How it Works
When you attach your AWS environment to a Dynatrace SaaS Instance, the Amazon EC2 instances with OneAgent configured are continuously monitored by Dynatrace. When Dynatrace identifies an issue that impacts the end user experience, SLAs, or service availability, it triggers a problem notification. The notification is configured to call an API on Amazon API Gateway, which validates the event and triggers a remediation call as needed.
When a real issue is identified, the AWS Service Catalog product is auto tuned and Dynatrace is notified.
Now, let’s see how the Dynatrace SaaS Instance integrates with AWS and then auto tunes AWS Service Catalog as needed:
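To make the validation step more concrete, here is a minimal sketch of how a Lambda function behind Amazon API Gateway might receive the notification and decide whether to act. It is not the code shipped with this post. It assumes the Dynatrace custom payload is JSON and that it carries keys named "State" and "PID" plus some text mentioning the event type. Those key names, and the rule that only open problems qualify, are illustrative assumptions.

```python
import json

# Assumption: the custom payload configured in Dynatrace is JSON. The key
# names used below ("State", "PID") are illustrative, not taken from the
# original solution.
QUALIFYING_EVENT = "CPU_SATURATION"


def parse_notification(event):
    """Extract the notification body from an API Gateway proxy event."""
    body = event.get("body") or "{}"
    return json.loads(body) if isinstance(body, str) else body


def is_qualifying(notification):
    """Return True only for open problems that mention CPU saturation."""
    is_open = notification.get("State") == "OPEN"
    # Search the whole payload so the check does not depend on one field name.
    mentions_event = QUALIFYING_EVENT in json.dumps(notification)
    return is_open and mentions_event


def lambda_handler(event, context):
    notification = parse_notification(event)
    if not is_qualifying(notification):
        return {"statusCode": 200, "body": json.dumps({"action": "ignored"})}
    # A real function would now query Dynatrace and update AWS Service Catalog.
    result = {"action": "remediate", "pid": notification.get("PID")}
    return {"statusCode": 200, "body": json.dumps(result)}
```

Returning a 200 response for ignored events keeps the webhook from being treated as failed while still skipping remediation.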
- First, enable AWS monitoring in Dynatrace to attach your AWS environment with a Dynatrace SaaS Instance.
- Using AWS Service Catalog, provision an Amazon EC2 instance that installs OneAgent during instance creation. Once the Amazon EC2 instance boots up, it will automatically register with Dynatrace SaaS.
- When Dynatrace identifies a CPU_SATURATION event that is impacting Host, Service, or Application Health, it triggers a problem notification and invokes an API on Amazon API Gateway with associated PID information.
This API, in turn, triggers a Lambda function that performs the following steps:
- Invokes a Dynatrace API ‘problem/details/’ to fetch more information about the problem.
- Validates whether the triggered event qualifies for a remediation call.
- For a qualifying event, invokes Dynatrace API ‘entity/infrastructure/hosts/’ to get additional information about the impacted host.
- Queries the Amazon EC2 instance using API ‘describe-instances’ and gets the current servicecatalog-product-id information.
- Verifies the pre-approved list of Amazon EC2 instance types from the SSM Parameter Store and updates the CloudFormation template with the next larger approved Amazon EC2 instance type. In this post, one of the CloudFormation templates provided creates the pre-approved list of Amazon EC2 instance types. To update the pre-approved values from the AWS Management Console, click AWS Systems Manager > Parameter Store > ‘/dynatrace/approvedinstancetypes/’ > Edit.
- Automatically creates a new version on AWS Service Catalog with the updated CloudFormation template. Any new products launched from this point on will use the updated Amazon EC2 instance type.
- On successful update of AWS Service Catalog, adds a comment about the corrective action to the Dynatrace problem.
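The core decision in the steps above, picking the next larger approved instance type and writing it into the template, can be sketched in a few lines. This is a simplified illustration, not the Lambda code provided with this post. It assumes the approved list is ordered from smallest to largest and that the template exposes the instance type as the default of a parameter named "InstanceType". Both are assumptions made for the example.

```python
import copy


def next_larger_instance_type(current_type, approved_types):
    """Return the entry after current_type in the approved list, or None.

    Assumes approved_types is ordered from smallest to largest.
    """
    if current_type not in approved_types:
        return None  # the running type is not on the approved list
    position = approved_types.index(current_type)
    if position + 1 >= len(approved_types):
        return None  # already at the largest approved size
    return approved_types[position + 1]


def bump_instance_type(template, approved_types, parameter_name="InstanceType"):
    """Return a copy of the template with the instance type default raised.

    Returns None when no larger approved type exists. The parameter name is
    a hypothetical default chosen for this sketch.
    """
    current_type = template["Parameters"][parameter_name]["Default"]
    larger_type = next_larger_instance_type(current_type, approved_types)
    if larger_type is None:
        return None
    updated = copy.deepcopy(template)  # leave the caller's template untouched
    updated["Parameters"][parameter_name]["Default"] = larger_type
    return updated
```

Returning None when there is nothing larger to move to lets the caller skip creating a new AWS Service Catalog version instead of publishing an unchanged template.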
Prerequisites
- You will need an AWS account.
- Sign up for Dynatrace SaaS Free Trial and attach your AWS account to start auto discovery and start monitoring your environment.
- Basic understanding of AWS Service Catalog. Please refer to the Getting Started section in our AWS Service Catalog Administrator Guide for additional information.
- Understanding of how to launch AWS CloudFormation stacks. Please refer to the Creating a Stack user guide for additional help.
In this section, we’ll see how to configure the Dynatrace AI and integrate with AWS Service Catalog to auto tune the AWS Service Catalog Artifacts based on the issues reported.
The CloudFormation templates are provided to create an AWS Service Catalog product and configure the AWS environment as needed for this blog. Configuring an AWS environment includes creating required SSM parameters, installing the Lambda functions, and setting up an Amazon API Gateway.
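As an illustration of how a Lambda function could consume one of these SSM parameters, the sketch below reads a comma-separated list of approved instance types. It assumes the parameter value is stored as a comma-separated string. The function name is hypothetical, and the parameter name is left to the caller because it depends on how the stack was launched. The client is passed in so the function can be exercised without AWS access.

```python
def load_approved_instance_types(ssm_client, parameter_name):
    """Read a comma-separated list of instance types from Parameter Store.

    ssm_client is expected to behave like boto3.client("ssm"); only its
    get_parameter call is used here.
    """
    response = ssm_client.get_parameter(Name=parameter_name)
    raw_value = response["Parameter"]["Value"]
    # Tolerate stray spaces and trailing commas in the stored value.
    return [item.strip() for item in raw_value.split(",") if item.strip()]
```

In a Lambda function, you would typically create the client once outside the handler with boto3.client("ssm") and pass it to this helper.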
Steps to configure the Dynatrace problem notification are:
- Use the provided CloudFormation template with AWS Service Catalog to provision an Amazon EC2 instance with OneAgent automatically installed:
- Download the CloudFormation template (DynatraceMonitoringAsAServiceCFStack.json).
- Create an AWS Service Catalog portfolio if it doesn’t exist.
- Create a product using the CloudFormation template downloaded earlier.
- Add the product to the portfolio.
- (Optional) Add template constraints or launch constraints as needed.
- Grant end users permission to access the portfolio.
- Log in as the permitted user and launch the product. Please follow the instructions in the parameter page under each parameter description while launching the product.
- Upon successful launch of the product, an Amazon EC2 instance is created with OneAgent installed.
Figure 2 shows what the AWS Service Catalog product named ‘EC2InstanceWithOneAgent’ looks like.
Figure 2 – AWS Service Catalog product details page.
- Next, download the CloudFormation template (DynatraceEnvironLaunch.yaml) and Lambda zip file (receiveDynaNotification.zip).
- Upload the ‘ReceiveDynaNotification.zip’ file to an Amazon Simple Storage Service (Amazon S3) bucket. Note down the bucket name, as you will use it as the value for “BucketName” in the next step.
- Launch a CloudFormation stack from AWS Management Console. Follow All Services > Management & Governance > CloudFormation > Create Stack. Under Choose a Template, click on Browse, point to ‘DynatraceEnvironLaunch.yaml’ that we downloaded earlier, and then choose Continue.
- Follow the instructions in the parameter page under each parameter description while launching the product. The CloudFormation template automatically performs the following steps:
- Configure the following SSM parameters used by AWS Lambda:
- Dynatrace SaaS Instance name
- Authentication information
- Amazon S3 bucket used as workspace
- List of approved instance types.
- These SSM parameters can be manually updated if needed from Systems Manager > Shared Resources > Parameter Store > /DynatraceApprovedInstanceTypes > Edit.
- Create a Lambda function ‘ReceiveDynatraceLambda’ with appropriate role ‘CommonLambdaRole’ for execution.
- Create an API ‘ReceiveDynatraceAPI-<stagename>-’ on Amazon API Gateway with API keys enabled. This API triggers the Lambda function ‘ReceiveDynatraceLambda.’
- After successful launch of the CloudFormation stack, connect to AWS Management Console > Network and Content Delivery > Amazon API Gateway, and then select API > Stages all the way to Post. Note down the ‘Invoke URL.’
Figure 3 – API invoke URL screen.
- In Amazon API Gateway, select API Keys > ‘ApiKeyToUseAtDynatrace’, click on Show, and note down the API Key.
Figure 4 – API Key for Dynatrace instance authentication.
- Next, create and assign an alerting profile in Dynatrace to report the errors, as needed, for your organization. A sample alerting profile is shown in Figure 5.
Figure 5 – Dynatrace alerting profile configuration.
- Create a custom problem notification in Dynatrace to assign an alerting profile. Paste the content of ‘Invoke URL’ and ‘API Key’ from Amazon API Gateway to the ‘Webhook URL’ and ‘X-API-KEY’ columns in Dynatrace (below). Select the appropriate alerting at the bottom, and click on ‘Send Test Notification’ to validate the authentication and Save.
- You may copy and paste the below text under Custom Payload:
Figure 6 – Dynatrace problem notification integration with AWS API.
- Make sure the newly-deployed Amazon EC2 instance is seen on the Dynatrace console.
Figure 7 – AWS resources monitored on Dynatrace SaaS Instance.
- Use the ‘stress’ tool on the Linux server to trigger a CPU_SATURATION event. This can be installed and run using the following command:
- Once a problem is identified by Dynatrace AI, it triggers an event.
Figure 8 – Dynatrace AI reports the impacted AWS instance.
- As this triggers a qualifying CPU_SATURATION event, the AWS Service Catalog product will be updated.
Figure 9 – Artifact auto updates for the AWS Service Catalog product.
- The CloudFormation template before and after looks like this:
Figure 10 – Updates to the CloudFormation template on qualified event.
- Once the AWS Service Catalog product is updated, a comment is added back to the Dynatrace problem.
Figure 11 – Response to Dynatrace after AWS Service Catalog artifact auto update.
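The last two steps of this walkthrough, publishing the updated template as a new product version and commenting on the Dynatrace problem, could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in the provided zip file. The function names are hypothetical, the updated template is assumed to be already uploaded to a URL that AWS Service Catalog can read, and the comment endpoint path and body reflect our understanding of the Dynatrace Problems API v1, so verify them against your environment before use. The Service Catalog client is passed in, and the Dynatrace request is only built, not sent.

```python
import json
import urllib.request
import uuid


def publish_new_version(sc_client, product_id, template_url, version_name):
    """Create a new Service Catalog version that points at the updated template.

    sc_client is expected to behave like boto3.client("servicecatalog").
    """
    response = sc_client.create_provisioning_artifact(
        ProductId=product_id,
        Parameters={
            "Name": version_name,
            "Description": "Auto tuned after a Dynatrace CPU_SATURATION problem",
            "Info": {"LoadTemplateFromURL": template_url},
            "Type": "CLOUD_FORMATION_TEMPLATE",
        },
        IdempotencyToken=str(uuid.uuid4()),
    )
    return response["ProvisioningArtifactDetail"]["Id"]


def build_comment_request(base_url, api_token, problem_id, message):
    """Build (but do not send) a request that comments on a Dynatrace problem.

    Assumption: the comment endpoint lives under problem/details/ in API v1.
    """
    url = f"{base_url.rstrip('/')}/api/v1/problem/details/{problem_id}/comments"
    payload = {
        "comment": message,
        "user": "aws-lambda",  # illustrative value
        "context": "AWS Service Catalog",  # illustrative value
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Api-Token {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send the comment, pass the returned request to urllib.request.urlopen. Keeping the request construction separate from the network call makes the function easy to test offline.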
In this post, we looked at triggering an auto-remediation action when Dynatrace reports a problem due to resource starvation on AWS.
This way, the infrastructure catalog is auto-updated when frequent resource starvation occurs, and any new resources launched are auto tuned. Automatic catalog updates improve the end user experience, because users don’t need to wait for an escalation and for the cloud admin team to take corrective action.
The cloud admin team could start with smaller resources and scale automatically based on the actual resource usage. If there are any provisioned long-running instances, they can be updated through the service catalog when it is appropriate.
Watch the Video (51:16)
For additional use cases on auto-remediation and self-healing infrastructure, check out the AWS re:Invent 2018 session from Dynatrace on Self-Healing with AWS Lambda.
Dynatrace – APN Partner Spotlight
Dynatrace is an AWS Competency Partner. Their AI-powered, full stack, and completely automated solution provides answers, not just data, based on deep insight into every user, every transaction, across every application.