AWS Cloud Operations Blog

Leveraging AWS CloudTrail Insights for Proactive API Monitoring and Cost Optimization

AWS CloudTrail Insights is a powerful feature within AWS CloudTrail that helps organizations identify and respond to unusual operational activity in their AWS accounts. This includes identifying spikes in resource provisioning, bursts of IAM actions, or gaps in periodic maintenance activity.

CloudTrail Insights continuously analyzes CloudTrail management events from trails and event data stores, establishing a baseline of normal API call volume and error rate patterns. When the service detects unusual activity that falls outside of this baseline, it raises an “Insights” event. These events are surfaced through the CloudTrail console, delivered to an Amazon S3 bucket, and sent to Amazon EventBridge. This allows organizations to create alerts, integrate with event management systems, and automate remediation efforts.

With this understanding of CloudTrail Insights’ capabilities and benefits, let’s move on to setting up the service and exploring how it can be leveraged to proactively monitor APIs, optimize cloud costs, and enhance security. By the end of this blog post, you will learn how to configure CloudTrail Insights to detect anomalies in API call rates and error rates, analyze the cost implications of API activity patterns, and set up automated alerts to identify and respond to suspicious security-related events. These insights and strategies can enable organizations to effectively manage and optimize their AWS infrastructure.

Setting up CloudTrail Insights

For CloudTrail Insights to analyze normal patterns of API call volumes and error rates to form a baseline, there are certain pre-requisites your CloudTrail trail or event data store needs to meet:

  1. To log Insights events on API call volume, the trail or event data store must log write management events.
  2. To log Insights events on API error rate, the trail or event data store must log read or write management events.

These pre-requisites can be verified from the CloudTrail console.
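The same check can also be scripted. The following is a minimal sketch, not a definitive implementation: it assumes the trail uses basic event selectors and that the input has the shape returned by the CloudTrail GetEventSelectors API. The function name `insights_prerequisites` is hypothetical, and trails configured with advanced event selectors are not handled here.

```python
from typing import Any, Dict


def insights_prerequisites(selectors_response: Dict[str, Any]) -> Dict[str, bool]:
    """Report which Insights types a trail's basic event selectors can support.

    `selectors_response` is assumed to look like a GetEventSelectors response
    for a trail that uses basic (not advanced) event selectors.
    """
    logs_read = False
    logs_write = False
    for selector in selectors_response.get("EventSelectors") or []:
        # Both fields are optional in the API; the documented defaults are
        # IncludeManagementEvents=True and ReadWriteType="All".
        if not selector.get("IncludeManagementEvents", True):
            continue
        rw_type = selector.get("ReadWriteType", "All")
        logs_read = logs_read or rw_type in ("All", "ReadOnly")
        logs_write = logs_write or rw_type in ("All", "WriteOnly")
    return {
        # API call rate Insights require write management events.
        "api_call_rate": logs_write,
        # API error rate Insights require read or write management events.
        "api_error_rate": logs_read or logs_write,
    }
```

With boto3 installed, the input could come from `boto3.client("cloudtrail").get_event_selectors(TrailName="my-trail")`, where `my-trail` is a placeholder name.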

Steps for CloudTrail Trails

CloudTrail trails capture a record of AWS activities, delivering and storing these events in an Amazon S3 bucket, with optional delivery to CloudWatch Logs and Amazon EventBridge.

To verify that your CloudTrail trail is logging management events

  1. Open the CloudTrail console.
  2. In the navigation pane, choose Trails.
  3. Click on the name of the trail for which you want to verify that management events are logged.
  4. In the Management events section, choose Edit.
  5. In the Events section, make sure the Management events checkbox is selected.
  6. In the Management events section, make sure the Read and Write checkboxes are selected so that both read and write management events are logged to the trail.
  7. Choose Save changes.
Figure 1: Setup CloudTrail Trails

Next, to enable Insights events for a CloudTrail trail

  1. Open the CloudTrail console.
  2. In the navigation pane, choose Trails.
  3. Click on the name of the trail for which you want to enable Insights events delivery.
  4. In the Insights events section, choose Edit.
  5. In the Events section, make sure the Insights events checkbox is selected.
  6. In the Insights events section, select both the API call rate and API error rate checkboxes to enable delivery of both types of Insights events.
  7. Choose Save changes.
Figure 2: Setup Insights Events for CloudTrail Insights
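The console steps above correspond to a single CloudTrail PutInsightSelectors call. The sketch below builds the request separately from sending it so the payload can be inspected first. The helper names are hypothetical, and `cloudtrail_client` is assumed to be a boto3 CloudTrail client created by the caller.

```python
from typing import Any, Dict, Iterable

# The two Insights types described in this post.
VALID_INSIGHT_TYPES = ("ApiCallRateInsight", "ApiErrorRateInsight")


def trail_insight_selectors_request(
    trail_name: str, insight_types: Iterable[str] = VALID_INSIGHT_TYPES
) -> Dict[str, Any]:
    """Build keyword arguments for PutInsightSelectors on a trail."""
    types = list(insight_types)
    unknown = [t for t in types if t not in VALID_INSIGHT_TYPES]
    if unknown:
        raise ValueError(f"Unsupported insight types: {unknown}")
    return {
        "TrailName": trail_name,
        "InsightSelectors": [{"InsightType": t} for t in types],
    }


def enable_trail_insights(cloudtrail_client: Any, trail_name: str) -> Any:
    """Send the request using a caller-supplied boto3 CloudTrail client."""
    request = trail_insight_selectors_request(trail_name)
    return cloudtrail_client.put_insight_selectors(**request)
```

For example, `enable_trail_insights(boto3.client("cloudtrail"), "my-trail")` would enable both Insights types on a trail named `my-trail` (a placeholder).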

Steps for CloudTrail Event Data Stores

CloudTrail Lake event data stores allow you to store CloudTrail management events and data events, CloudTrail Insights events, AWS Audit Manager evidence, AWS Config configuration items, or events from outside of AWS.

Note: You need to create a separate destination event data store that logs Insights events so that CloudTrail can deliver the Insights events. The steps to create an event data store that logs Insights events are outlined here.

To ensure your CloudTrail Lake event data store is logging management events

  1. Open the CloudTrail console.
  2. In the navigation pane, under the Lake section, choose Event data stores.
  3. Click on the name of the event data store for which you want to verify that management events are logged.
  4. In the Management events section, choose Edit.
  5. In the CloudTrail events section, verify that the Management events checkbox is selected.
  6. In the Management events section, select both the Read and Write checkboxes to enable delivery of read and write management events to the event data store.
  7. Choose Save changes.

Next, to enable delivery of Insights events to an event data store

  1. Open the CloudTrail console.
  2. In the navigation pane, under the Lake section, choose Event data stores.
  3. Click on the name of the event data store for which you want to enable delivery of Insights events.
  4. In the Management events section, choose Edit.
  5. In the Management events section, select the Enable Insights checkbox.
  6. In the Enable Insights section, select the destination Insights event data store from the dropdown. CloudTrail will deliver Insights events to the selected event data store.
  7. In the same section, select the API call rate and API error rate checkboxes to enable delivery of both types of Insights events.
  8. Choose Save changes.
Figure 3: Verify Event Data Store, Insights API call rate and error rate

Note: Alternatively, you can configure your trails and event data stores to generate CloudTrail Insights events using the AWS CLI and SDK.
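As an illustration of the SDK route for event data stores, the sketch below builds a PutInsightSelectors request that names both the source event data store and the separate destination store for Insights events. The function name is hypothetical and the ARNs in the usage note are placeholders. The destination store is assumed to already exist and to be configured to log Insights events.

```python
from typing import Any, Dict, Iterable


def event_data_store_insights_request(
    source_store: str,
    destination_store: str,
    insight_types: Iterable[str] = ("ApiCallRateInsight", "ApiErrorRateInsight"),
) -> Dict[str, Any]:
    """Build keyword arguments for PutInsightSelectors on an event data store.

    `source_store` is the ARN or ID of the store that logs management events;
    `destination_store` is the ARN or ID of the store that receives Insights events.
    """
    if source_store == destination_store:
        # The post calls for a separate destination event data store.
        raise ValueError("Source and destination event data stores must differ")
    return {
        "EventDataStore": source_store,
        "InsightsDestination": destination_store,
        "InsightSelectors": [{"InsightType": t} for t in insight_types],
    }
```

A caller with boto3 installed could pass the result to `boto3.client("cloudtrail").put_insight_selectors(**request)`.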

Monitoring API activity with CloudTrail Insights

CloudTrail detects unusual patterns of API activity that deviate from your normal API call volume and API error rates, also called a baseline, and generates Insights events when it does. These Insights events can be one of two types:

  1. API call rates Insight – This type of Insight is generated when the number of management API calls that occur per minute deviates from the baseline API call rate. Only management API calls that are writes are measured.
  2. API error rates Insight – This type of Insight is generated when the number of management API calls that are unsuccessful and return an error deviates from the baseline error rate. Management API calls that are both reads and writes are measured.

Insights events are different from other management events that CloudTrail generates because they are generated only when CloudTrail detects a significant deviation from the usual API activity pattern for that account.

CloudTrail delivers Insights events for trails to the /CloudTrail-Insight prefix in the destination S3 bucket for your trail. Due to the need to establish a baseline pattern, it can take up to 36 hours for CloudTrail to deliver the first Insights event after you enable CloudTrail Insights for a trail. If you disable and then re-enable Insights events, or stop and restart logging on a trail, it can take up to 36 hours for CloudTrail to restart delivery of Insights events.
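To locate delivered Insights files for a given day, the key prefix can be constructed programmatically. This sketch assumes the standard layout for a single-account (non-organization) trail, `[prefix/]AWSLogs/<account-id>/CloudTrail-Insight/<region>/YYYY/MM/DD/`. The function name and the sample values in the usage note are hypothetical, and the layout should be confirmed against your own bucket.

```python
from datetime import date


def insights_s3_prefix(account_id: str, region: str, day: date, key_prefix: str = "") -> str:
    """Build the assumed S3 key prefix under which Insights events for `day` are stored."""
    cleaned = key_prefix.strip("/")
    parts = [cleaned] if cleaned else []
    parts += [
        "AWSLogs",
        account_id,
        "CloudTrail-Insight",  # Insights events sit beside the regular CloudTrail folder
        region,
        f"{day:%Y/%m/%d}",
    ]
    return "/".join(parts) + "/"
```

For instance, `insights_s3_prefix("111122223333", "us-east-1", date(2024, 3, 5))` yields a prefix that could be passed to an S3 list operation.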

CloudTrail Insights events for event data stores are delivered to a destination event data store, which needs to be created as a pre-requisite.

Insights events can be viewed from the CloudTrail console or with the AWS CLI and SDK, using the LookupEvents API. The past 90 days of Insights events are viewable by either means. Older events can be retrieved from the delivery S3 bucket.
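A minimal sketch of the SDK path follows. It builds LookupEvents arguments with `EventCategory` set to `insight` and extracts the `insightDetails` block from each returned record. The helper names are hypothetical. The 90-day cap mirrors the lookup window described above.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List, Optional


def insights_lookup_kwargs(
    event_name: Optional[str] = None,
    lookback_days: int = 7,
    now: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for a LookupEvents call restricted to Insights events."""
    if not 1 <= lookback_days <= 90:
        raise ValueError("lookback_days must be between 1 and 90")
    end = now or datetime.now(timezone.utc)
    kwargs: Dict[str, Any] = {
        "EventCategory": "insight",
        "StartTime": end - timedelta(days=lookback_days),
        "EndTime": end,
    }
    if event_name:
        kwargs["LookupAttributes"] = [
            {"AttributeKey": "EventName", "AttributeValue": event_name}
        ]
    return kwargs


def insight_details(lookup_response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Extract the insightDetails block from each event in a LookupEvents response."""
    details = []
    for event in lookup_response.get("Events", []):
        # CloudTrailEvent is a JSON string containing the full event record.
        record = json.loads(event["CloudTrailEvent"])
        details.append(record.get("insightDetails", {}))
    return details
```

With boto3, the arguments could be passed as `client.lookup_events(**insights_lookup_kwargs("RunInstances"))`. Because results are paginated, a paginator obtained with `client.get_paginator("lookup_events")` is likely the better fit for longer time ranges.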

To view the events in the CloudTrail console, choose Dashboard in the navigation pane on the left to see the five most recent Insights events, or choose Insights to see all Insights events logged in your account in the past 90 days. On the Insights page, you can filter Insights events by criteria including event API source, event name, and event ID, and limit the events displayed to those occurring within a specific time range. You can also select a specific Insights event from the results page to view more details in a graph that shows the specific unusual API activity that happened. Hovering over the highlighted portion of the graph shows more information about when the activity started and how long it lasted. You can also pan and zoom the graph, and download it for inclusion in an email or report. For more information about how Insights events can be viewed in the CloudTrail console and using the AWS CLI, refer to the documentation.

Figure 4: Monitoring API Activity using CloudTrail Insights graph

Alerting and notification setup

Setting Up Alarms and Notifications

To fully leverage the capabilities of AWS CloudTrail Insights, you should consider setting up a comprehensive alerting and notification system. This will enable proactive response to any anomalous activity detected by the service, ensuring timely mitigation of potential security threats, cost overruns, or operational issues.

The recommended approach integrates CloudTrail Insights with two AWS services: Amazon EventBridge and Amazon Simple Notification Service (Amazon SNS). An EventBridge rule listens for new Insights events and forwards them to your SNS topic, which then distributes them to channels such as event management systems and email distribution lists.

Figure 5: Reference Architecture Diagram for setting up CloudTrail Insights alarms and notifications

Follow the instructions below to set up automated notifications as shown above. For more details, see Creating Amazon EventBridge rules that react to events.

  1. CloudTrail: Enable CloudTrail Insights events.
  2. Amazon SNS: Set up an Amazon SNS topic to publish events to the destination of your choice. Common subscribers include event management systems, Slack, and email distribution lists.
  3. Amazon EventBridge: Create an EventBridge rule that listens for CloudTrail Insights events. EventBridge has a native integration with CloudTrail Insights and can be set up out of the box as shown below. Set the SNS topic configured above as the target of the rule.
Figure 6: Amazon EventBridge Event Pattern Configuration for CloudTrail Insights
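The rule and its SNS target can also be created with the SDK. The sketch below is one possible arrangement: it assumes Insights events arrive on EventBridge with source `aws.cloudtrail` and detail-type `AWS Insight via CloudTrail`, and that the Insights record (including `insightDetails`) sits under `detail`. The function names, the rule name, and the target ID `insights-sns` are hypothetical, and `events_client` is assumed to be a boto3 EventBridge client supplied by the caller.

```python
import json
from typing import Any, Dict, Iterable, Optional


def insights_event_pattern(insight_types: Optional[Iterable[str]] = None) -> Dict[str, Any]:
    """Build an EventBridge event pattern that matches CloudTrail Insights events."""
    pattern: Dict[str, Any] = {
        "source": ["aws.cloudtrail"],
        "detail-type": ["AWS Insight via CloudTrail"],
    }
    if insight_types:
        # Optionally narrow the rule, e.g. to ["ApiErrorRateInsight"] only.
        pattern["detail"] = {"insightDetails": {"insightType": list(insight_types)}}
    return pattern


def create_insights_rule(events_client: Any, rule_name: str, topic_arn: str) -> None:
    """Create the rule and attach the SNS topic as its target."""
    events_client.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(insights_event_pattern()),
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "insights-sns", "Arn": topic_arn}],
    )
```

Note that the SNS topic's access policy must allow the `events.amazonaws.com` service principal to publish, otherwise the rule will match events but no notifications will be delivered.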

This integrated approach ensures relevant stakeholders are notified when CloudTrail Insights detects anomalies. This allows for timely investigation, root cause analysis, and implementation of mitigation measures.

With the alerting and notification setup in place, let’s explore real-world use cases that demonstrate the practical applications of CloudTrail Insights.

Optimizing Cloud Security and Costs with CloudTrail Insights

Using AWS CloudTrail Insights can help organizations optimize cloud security and costs. CloudTrail Insights establishes a baseline of normal API usage, and can detect anomalies like spikes in API calls and error rates that could lead to unexpected cost overruns. This allows organizations to investigate root causes and take immediate action. Leveraging CloudTrail Insights, organizations can implement preventive measures like setting alerts, reviewing access policies, and optimizing pricing models. CloudTrail Insights can also help identify best practices for managing AWS costs by understanding normal usage patterns and addressing inefficiencies. Overall, integrating CloudTrail Insights into cost optimization strategies can provide your teams with data-driven insights to maintain a secure and cost-efficient cloud environment.

Real-world use cases

Real-world Use Case #1: Identifying and Resolving Runaway Lambda Functions with CloudTrail Insights

Acme Corporation, a leading e-commerce company, heavily invested in automating operational processes, such as deploying new EC2 instances, by leveraging AWS Lambda functions to power various components of their application architecture. The company’s finance and DevOps teams work closely to monitor AWS costs and ensure efficient resource utilization.

One day, the Acme finance team notices a sudden and unexplained spike in the company’s AWS costs. After a thorough investigation, they discover that the root cause is a Lambda function that has seemingly “run away,” repeatedly invoking the AWS EC2 RunInstances API to create a burst of EC2 instances.

The issue was triggered by a human error during a routine configuration update. A member of the Acme DevOps team inadvertently misconfigured the trigger for a Lambda function, causing it to execute the EC2 RunInstances API call repeatedly, without any control or throttling mechanisms in place.

Without a proactive monitoring and alerting system, the Acme DevOps team would have been unaware of this issue until the cost overruns had already occurred, making it more challenging to quickly identify and resolve the problem.

However, Acme has integrated AWS CloudTrail Insights into their cost optimization strategy. By analyzing the CloudTrail logs, the Insights service detects the unusual spike in EC2 RunInstances API call volumes, which is directly correlated with the sudden increase in AWS costs.

CloudTrail Insights event for the spike in call volume for EC2 RunInstances API:

Figure 7: CloudTrail Insights Example of EC2 RunInstances API Activity Reporting

CloudTrail Insights event graph showing the spike in call volume for EC2 RunInstances API:

Figure 8: CloudTrail Insights Event Graph for Call Volume with EC2 RunInstances API

The CloudTrail Insights event lists the associated CloudTrail events for further investigation:

Figure 9: CloudTrail Insights event lists

Armed with this insight, the Acme DevOps team can quickly identify the malfunctioning Lambda function, investigate the root cause (the misconfigured trigger), and take immediate action to stop the resource creation and prevent further cost overruns.
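To make a notification like this easier to triage, the Insights record can be condensed into a single line that compares the observed call rate with the baseline. This is a sketch under the assumption that the record carries `insightDetails.insightContext.statistics` with `baseline.average` and `insight.average` expressed in calls per minute. The function name is hypothetical and the figures used in any example are illustrative rather than taken from the Acme scenario.

```python
from typing import Any, Dict


def summarize_call_rate_insight(record: Dict[str, Any]) -> str:
    """Return a one-line summary of an API call rate Insights record."""
    details = record["insightDetails"]
    stats = details["insightContext"]["statistics"]
    baseline = stats["baseline"]["average"]  # calls per minute during the baseline period
    observed = stats["insight"]["average"]   # calls per minute during the unusual activity
    summary = (
        f"{details['eventSource']} {details['eventName']}: "
        f"{observed:.2f} calls/min vs. baseline {baseline:.2f}"
    )
    if baseline > 0:
        # Express the spike as a multiple of the baseline when that is defined.
        summary += f" ({observed / baseline:.1f}x)"
    return summary
```

A function like this could run in whatever component consumes the SNS message, for example to build the subject line of an alert email.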

Furthermore, the team leverages the historical data and trend analysis provided by CloudTrail Insights to better understand the normal usage patterns of their Lambda functions and set appropriate cost thresholds and alerts. This enables them to proactively monitor for any future anomalies and address them before they result in significant financial impact.

By integrating CloudTrail Insights into their cost optimization strategy, Acme Corporation is able to respond quickly to unexpected spikes in AWS resource consumption, minimizing the financial impact and ensuring the efficient and cost-effective operation of their serverless infrastructure.

Real-world Use Case #2: Leveraging CloudTrail Insights to Detect Suspicious Role Assumption Attempts and Protect Critical S3 Buckets

Acme Corporation, a leading enterprise organization, has a robust security strategy in place to monitor and protect its AWS environment. As part of this strategy, the security team closely monitors various AWS APIs for any suspicious activity, including the AWS Security Token Service (STS) AssumeRole API and the Amazon S3 DeleteBucket API.

One day, the Acme security team receives an alert from their API monitoring system about a series of failed attempts to assume a high-privilege role. While the initial alert provided some basic information, the team knew they needed more detailed insights to fully understand the nature and scope of the issue.

This is where AWS CloudTrail Insights proved invaluable. By analyzing the CloudTrail logs, the Acme security team was able to gain a deeper understanding of the anomalous API activity.

CloudTrail Insights revealed that there had been a significant increase in the error rate for the STS AssumeRole API, with multiple failed attempts to assume a critical role that provided access to sensitive customer data and financial records. The Insights service also uncovered suspicious activity targeting the company’s Amazon S3 buckets, with someone trying to delete several of the buckets that hosted valuable business-critical data.

CloudTrail Insights event for the spike in error rate for STS AssumeRole API:

Figure 10: CloudTrail Insights Example of STS AssumeRole API Activity

CloudTrail Insights event graph showing the spike in error rate for STS AssumeRole API:

Figure 11: CloudTrail Insights Example of STS AssumeRole API Activity Graph

The CloudTrail Insights event lists the associated CloudTrail events for further investigation:

Figure 12: CloudTrail Insights Example of STS AssumeRole events list

Armed with this detailed information from CloudTrail Insights, the Acme security team was able to quickly investigate the issue and assess the potential risk. They were able to identify the source of the failed role assumption attempts, as well as the specific S3 buckets that were targeted, providing them with the necessary context to take appropriate action.
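One way to identify the source of the failed attempts programmatically is to read the attributions carried in the Insights record. The sketch below assumes the record includes an `insightContext.attributions` list (it typically appears on the event that ends the Insights period), where each entry names an `attribute` such as `userIdentityArn`, `userAgent`, or `errorCode` and lists values with their averages. The function name is hypothetical.

```python
from typing import Any, Dict, List, Tuple


def top_contributors(
    record: Dict[str, Any], attribute: str = "userIdentityArn", limit: int = 3
) -> List[Tuple[str, float]]:
    """Return the highest-average values for one attribution during the unusual activity."""
    context = record.get("insightDetails", {}).get("insightContext", {})
    for attribution in context.get("attributions", []):
        if attribution.get("attribute") != attribute:
            continue
        # "insight" holds values seen during the anomaly; "baseline" holds normal values.
        entries = sorted(
            attribution.get("insight", []),
            key=lambda entry: entry.get("average", 0),
            reverse=True,
        )
        return [(entry["value"], entry["average"]) for entry in entries[:limit]]
    return []
```

Calling it with `attribute="errorCode"` would instead surface the dominant error codes, which can help distinguish access-denied probing from ordinary misconfiguration.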

By leveraging the insights from CloudTrail, the Acme security team was able to revoke any compromised credentials, implement additional security controls to prevent similar incidents in the future, and ensure the continued protection of their mission-critical resources and data.

Moreover, the team was able to use the CloudTrail Insights data to quantify the potential impact of the suspicious activity and take proactive steps to mitigate any cost implications, such as adjusting API pricing tiers or implementing rate limiting mechanisms.

The ability to deeply analyze CloudTrail data and uncover anomalies in API usage patterns was a critical capability for the Acme security team, allowing them to respond effectively to potential threats and safeguard the organization’s AWS environment. CloudTrail Insights proved to be an invaluable tool in their efforts to maintain a robust and secure cloud infrastructure.

Summary

In this blog post, we’ve explored the powerful capabilities of CloudTrail Insights and how it can be leveraged to proactively monitor API usage, detect potential issues or misconfigurations, and set up alerts to notify you of any anomalies. Through two real-world use cases, we’ve demonstrated how CloudTrail Insights can help address challenges related to human error and security breaches, ultimately leading to significant cost savings.

If you’re looking to gain better visibility into your API usage, detect anomalies, and optimize your cloud costs, we encourage you to consider enabling CloudTrail Insights in your AWS environment. By leveraging the power of this feature, you can proactively monitor your API activity, set up alerts, and take immediate action to prevent potential cost overruns and ensure efficient resource utilization within your organization.


About the authors

Guy Bachar

Guy is a Senior Solutions Architect at AWS based in New York. He specializes in assisting Capital Markets customers with their cloud transformation journeys. His expertise encompasses identity management, security, and unified communication.

Noam Ouaknine

Noam is a Senior Technical Account Manager at AWS, and is based in Florida. He helps enterprise customers develop and achieve their long-term strategy through technical guidance and proactive planning.

Akshat Srivastava

Akshat is a Solutions Architecture Leader at Amazon Web Services (AWS), leading the FSI enterprise team of SAs based in New York City. He joined AWS in Jan 2020 as a Solutions Architect and is part of the Cloud Operations field community, for which he spoke at re:Invent in 2021. Prior to AWS, Akshat worked as a solutions engineer at AppDynamics, Cisco and Particle. Akshat started his career as a Java developer at Vision Service Plan and worked as a senior developer at IBM. Later, he worked as an IT consultant in New York City before transitioning into sales.

Sayan Chakraborty

Sayan is a Senior Solutions Architect at AWS. He helps large enterprises build secure, scalable, and performant solutions in the AWS Cloud. With a background of Enterprise and Technology Architecture, he has experience delivering large scale digital transformation programs across a wide range of industry verticals. He holds a B. Tech. degree in Computer Engineering from Manipal University, Sikkim, India.

Morgan Rankey

Morgan is a Solutions Architect based in New York City, specializing in Hedge Funds. He excels in assisting customers to build resilient workloads within the AWS ecosystem. Prior to joining AWS, Morgan led the Sales Engineering team at Riskified through its IPO. He began his career by focusing on AI/ML solutions for machine asset management, serving some of the largest automotive companies globally.