AWS Cloud Operations & Migrations Blog

Ten Ways to Improve Your AWS Operations

Introduction

When I take my car in for a simple oil change, the technician often reads off a litany of other services my car needs that I had put off since the previous service (and maybe the service before that, too). I tend to wait for the “check engine” light to come on before getting the specific service that I need. I don’t recommend this approach: preventive maintenance is the key to ensuring that the car runs safely, efficiently, and cost effectively.

Preventive maintenance of an AWS account is a lot like that of a car. Even for customers who are diligent about securing their AWS environments and implementing automation mechanisms, it may not be obvious how to detect when (or if) this automation fails or whether notifications are being sent when things aren’t working as expected. This can be especially true when organizations grow and their use of AWS evolves. In this blog post, I’ll show you ten inspection points to improve your operations with the goal of having things run safely, efficiently, and cost effectively, and I’ll show you how to look for failure conditions that you may not have considered.

Ten-Point Inspection Plan

1. Inspect your cost and usage

One of the best ways to see what’s going on in your AWS accounts is through AWS Cost Explorer. Not only is it a good way to see how much you’re spending across all of your AWS accounts, it’s an effective way to see what services are being used and whether their use is expected.

For example, in development (or “sandbox”) accounts, it’s not unusual during the course of innovation for developers to spin up resources and leave them running. It’s also not unusual for these accounts to incur a greater cost than production accounts, where there tends to be more rigor around usage governance.

Another use case is when you use services that use other services in ways that you didn’t anticipate. For example, creating an Amazon Bedrock Knowledge Base can create an Amazon OpenSearch Serverless collection as its vector store. The collection incurs a cost that you may not have anticipated or may have forgotten about. Cost Explorer is a good inspection mechanism to perform these types of point-in-time observations.

Additionally, some organizations require their security teams to vet an AWS service before it can be used in a production setting so that its security properties can be better understood. While there are preventive strategies that can restrict these services from being used (like using AWS service control policies), Cost Explorer can be a good way to verify that those mechanisms are working.

For example, if there’s a charge for a service that the security team hadn’t expected, then that could merit additional scrutiny.

Monthly maintenance check #1: Perform a review of Cost Explorer. Better yet, if you use AWS Organizations, you can view cost data for all member accounts along different dimensions, like account, service, and Region, to see which accounts are responsible for the most usage costs or which services are used in which accounts.

Monthly maintenance check #2: Configure and review the Cost and Usage reporting dashboard. With AWS Data Exports, you can take advantage of the pre-built Cost and Usage Dashboard powered by Amazon QuickSight.

Monthly maintenance check #3: Create cost budgets for your accounts. This provides three benefits. First, by setting different cost-related thresholds, you can send alerts to account owners to help mitigate potentially runaway costs. Second, rapid, excessive spending could be an indicator of compromise, so the budget alert could help detect this. Third, if your account owners are funded by research grants, then budget limits can help detect and mitigate potential budget overruns.
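
As a rough sketch of how the Cost Explorer review in check #1 could be scripted, the Python below totals cost per service from a Cost Explorer get_cost_and_usage response grouped by the SERVICE dimension and flags services that are not on an approved list. The approved list, the function names, and the min_cost threshold are illustrative assumptions rather than anything prescribed in this post, and the boto3 call needs credentials with Cost Explorer access.

```python
from collections import defaultdict


def cost_by_service(results_by_time):
    """Sum UnblendedCost per service across the periods of a
    get_cost_and_usage response grouped by the SERVICE dimension."""
    totals = defaultdict(float)
    for period in results_by_time:
        for group in period.get("Groups", []):
            service = group["Keys"][0]
            totals[service] += float(group["Metrics"]["UnblendedCost"]["Amount"])
    return dict(totals)


def unexpected_services(totals, approved, min_cost=0.0):
    """Return services with spend above min_cost that are not approved."""
    return sorted(
        service
        for service, cost in totals.items()
        if cost > min_cost and service not in approved
    )


def fetch_results(start, end):
    """Fetch monthly cost grouped by service (dates are YYYY-MM-DD strings).

    Only the first page is read here; a fuller version would follow
    NextPageToken when the response includes one."""
    import boto3  # imported here so the helpers above work without boto3

    response = boto3.client("ce").get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )
    return response["ResultsByTime"]
```

Calling unexpected_services(cost_by_service(fetch_results(start, end)), approved) with your own approved set gives a short list of services that may merit a closer look.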

2. Inspect GuardDuty configuration

Amazon GuardDuty is an intelligent threat detection service used to identify known malicious activity or anomalous behavior in your account. While enabling it is a best practice and a crucial step, it’s only valuable if you’re actively monitoring it for findings. Furthermore, if you are monitoring it for findings and perhaps sending notifications to your security operations center (SOC), will your engineers know how to disposition these findings? Have they been trained in AWS?

Monthly maintenance check #1: Enable and configure GuardDuty. If you’re using AWS Organizations, establish a delegated administrator account to do this, and ensure that you configure GuardDuty to send notifications when findings are generated.

Monthly maintenance check #2: Consider creating playbooks for potential incidents or findings. You can get started using the templates found in this GitHub repository of playbooks for common findings. Meet with your SOC team regularly to understand the frequency and nature of the findings to determine whether further mitigations might be required and whether automation might be a suitable remediation mechanism.

Monthly maintenance check #3: Build in continuous training opportunities for your staff and evaluate the training requirements and frequency according to your organization’s staff development plans. Consider conducting incident response game days as a way to exercise your technical controls, personnel, and procedures. You can find more information here and here.

3. Inspect unused CloudWatch logs

Amazon CloudWatch Logs is a service that allows you to store and persist logs from applications, Amazon EC2 instances, AWS CloudTrail trails, AWS Lambda functions, or other sources. Maintaining logs with CloudWatch Logs is useful for auditing, troubleshooting, querying, retention, or for meeting other business objectives.

Although you can set retention periods for CloudWatch log groups, they’re not set by default, which means that your logs will persist in perpetuity if you don’t alter the retention periods. If you don’t need the logs indefinitely, then they’re incurring unnecessary costs, and they could represent a potential data exposure risk.

Monthly maintenance check #1: Evaluate the CloudWatch log groups that don’t have a defined retention policy and determine whether they merit one, and consider whether they can be deleted altogether.

Monthly maintenance check #2: Delete empty CloudWatch log streams. Think of log streams as subgroups of logs within log groups. You don’t get charged for empty log streams, but a large collection of empty log streams or log groups (“log sprawl”) can be overwhelming. Consider following the guidance in this blog post for automating that process.
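
To make check #1 concrete, here is a minimal sketch in Python that lists the log groups with no retention policy, largest first. It relies on the describe_log_groups response omitting retentionInDays for groups that never expire; the function names are my own, and the fetch covers only the account and Region of the current credentials.

```python
def groups_without_retention(log_groups):
    """Return (name, storedBytes) for log groups that never expire,
    largest first. Groups with no retention policy omit retentionInDays."""
    never_expire = [
        (group["logGroupName"], group.get("storedBytes", 0))
        for group in log_groups
        if "retentionInDays" not in group
    ]
    return sorted(never_expire, key=lambda item: item[1], reverse=True)


def fetch_log_groups():
    """List every log group in the current account and Region."""
    import boto3  # imported here so the helper above works without boto3

    paginator = boto3.client("logs").get_paginator("describe_log_groups")
    return [group for page in paginator.paginate() for group in page["logGroups"]]
```

Sorting by stored bytes puts the groups most likely to matter for cost at the top of the review list.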

4. Inspect unused IAM roles

AWS Identity and Access Management (IAM) provides access controls that allow or restrict users and services from performing given actions. Roles are generally used by services to perform actions on your behalf, such as executing AWS Lambda functions, running AWS CodeBuild projects, or executing AWS Step Functions state machines. It’s a best practice to create unique IAM roles for specific purposes. However, as you innovate and build solutions, you may forget to delete the associated IAM roles when removing unused Lambda functions, CodeBuild projects, and Step Functions state machines. This can lead to an accumulation of roles that could potentially introduce risks by providing access to resources beyond their original purpose.

Monthly maintenance check #1: Use IAM Access Analyzer’s “Unused access findings” feature to identify dormant roles and consider deleting them if they’re no longer needed. Follow the guidance in this blog post for identifying and resolving findings.
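
If you want a quick cross-check alongside Access Analyzer, the sketch below takes a different and simpler approach: it reads the RoleLastUsed data that IAM returns from get_role and lists roles that have been idle longer than a threshold. This is not how Access Analyzer works; the 90-day default and the function names are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def dormant_roles(roles, now, max_idle_days=90):
    """Return names of roles not used within max_idle_days.

    Each role is shaped like the "Role" element of an IAM get_role
    response. Roles that were never used have no LastUsedDate, so their
    CreateDate is used as the reference point instead."""
    cutoff = now - timedelta(days=max_idle_days)
    dormant = []
    for role in roles:
        last_used = role.get("RoleLastUsed", {}).get("LastUsedDate")
        reference = last_used or role["CreateDate"]
        if reference < cutoff:
            dormant.append(role["RoleName"])
    return sorted(dormant)


def fetch_roles():
    """Fetch full role details; list_roles alone omits RoleLastUsed."""
    import boto3  # imported here so the helper above works without boto3

    iam = boto3.client("iam")
    paginator = iam.get_paginator("list_roles")
    names = [role["RoleName"] for page in paginator.paginate() for role in page["Roles"]]
    return [iam.get_role(RoleName=name)["Role"] for name in names]
```

A call such as dormant_roles(fetch_roles(), datetime.now(timezone.utc)) yields candidates to review, not a list that is safe to delete without checking what each role was created for.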

5. Inspect alerting mechanisms

A vital part of AWS Prescriptive Guidance is the use of responsive controls to react to security events or budget thresholds, to support serverless operations, or to help enforce continuous compliance for your operational and risk requirements. Two common approaches to support responsive architectures are using Amazon EventBridge rules and Amazon Simple Notification Service (SNS) topics. EventBridge rules allow you to create listeners for given events and then perform designated actions on those events. For example, you can create a rule that detects a state change to an AWS Config rule that then triggers an action to inspect and potentially remediate the change.

Monthly maintenance check #1: Ensure that your EventBridge rules have targets that are still valid. For example, ensure that Lambda function targets or SNS topic targets still exist and are valid. Follow the guidance here for how to monitor your rules.

Monthly maintenance check #2: Ensure that your SNS topics have targets that are still valid. For example, if you have topics that target email addresses, ensure that the email addresses are still valid and that they are receiving the SNS notifications as expected.
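
Both checks above can be partly scripted. The following sketch, with function names of my own choosing, flags EventBridge rules that have no targets at all and SNS subscriptions that were never confirmed (SNS reports those with the literal subscription ARN PendingConfirmation). It assumes the default event bus and the Region of the current credentials.

```python
def rules_without_targets(targets_by_rule):
    """Given {rule_name: [target, ...]}, return rules that have no targets."""
    return sorted(name for name, targets in targets_by_rule.items() if not targets)


def unconfirmed_subscriptions(subscriptions):
    """Return (topic ARN, endpoint) pairs for subscriptions that were
    never confirmed and therefore receive no notifications."""
    return [
        (sub["TopicArn"], sub["Endpoint"])
        for sub in subscriptions
        if sub["SubscriptionArn"] == "PendingConfirmation"
    ]


def fetch_targets_by_rule():
    """Map each rule on the default event bus to its list of targets."""
    import boto3  # imported here so the helpers above work without boto3

    events = boto3.client("events")
    targets_by_rule = {}
    for page in events.get_paginator("list_rules").paginate():
        for rule in page["Rules"]:
            response = events.list_targets_by_rule(Rule=rule["Name"])
            targets_by_rule[rule["Name"]] = response["Targets"]
    return targets_by_rule


def fetch_subscriptions():
    """List every SNS subscription in the current account and Region."""
    import boto3

    paginator = boto3.client("sns").get_paginator("list_subscriptions")
    return [sub for page in paginator.paginate() for sub in page["Subscriptions"]]
```

This only catches the empty and unconfirmed cases. Confirming that a target ARN still points to an existing Lambda function, or that a mailbox is actually read, still needs a separate check.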

6. Inspect Lambda execution failures

Lambda functions are used to perform a variety of tasks, from running production workloads to supporting other services in your environment. Unless you explicitly build failure-state monitoring and notifications into the code, it may not be obvious when a Lambda function fails. This is a critical insight to have in cases where, for example, the Lambda function is used for evaluating a particular security state of your environment.

Monthly maintenance check #1: View the CloudWatch metrics for Lambda execution failures. You can adjust the metric window based on your needs. Do this at least monthly using CloudWatch Metrics Explorer, selecting “Lambda” as the service template, using “Errors” as the metric, “all values” as the tag, “1 Day” as the graph period, and four weeks as the inspection window.

As you can see from the image below, there are function execution errors that could otherwise go unnoticed:

Use CloudWatch Metrics to view Lambda execution failure counts
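
The same numbers can be pulled programmatically. As a sketch under a few assumptions (function names are mine, and you supply the list of Lambda function names), the code below reads the Errors metric from the AWS/Lambda namespace with a one-day period over four weeks, mirroring the console settings described above, and ranks the functions that reported any errors.

```python
from datetime import datetime, timedelta, timezone


def total_errors(datapoints):
    """Add up the Sum statistic over get_metric_statistics datapoints."""
    return int(sum(point["Sum"] for point in datapoints))


def failing_functions(datapoints_by_function):
    """Return {function_name: error_count} for functions with at least
    one error, ordered from most to fewest errors."""
    counts = {name: total_errors(points) for name, points in datapoints_by_function.items()}
    failing = {name: count for name, count in counts.items() if count > 0}
    return dict(sorted(failing.items(), key=lambda item: item[1], reverse=True))


def fetch_error_datapoints(function_name, weeks=4):
    """Fetch daily error sums for one Lambda function."""
    import boto3  # imported here so the helpers above work without boto3

    end = datetime.now(timezone.utc)
    response = boto3.client("cloudwatch").get_metric_statistics(
        Namespace="AWS/Lambda",
        MetricName="Errors",
        Dimensions=[{"Name": "FunctionName", "Value": function_name}],
        StartTime=end - timedelta(weeks=weeks),
        EndTime=end,
        Period=86400,  # one day, in seconds
        Statistics=["Sum"],
    )
    return response["Datapoints"]
```

Building the input is a matter of calling fetch_error_datapoints for each function name you care about and passing the resulting dictionary to failing_functions.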

Monthly maintenance check #2: The maintenance check above is used to detect Lambda function execution errors. There are likely times, however, when functions catch and record logic errors and still execute successfully. An example of this could be cases where your code looks for a specific condition, logs an error if that condition isn’t met, and then continues its processing. Often, the log message will include words like “error” or “exception.” To detect these cases, use CloudWatch Logs Insights and select the log groups of all the Lambda functions you’d like to query (log group names for Lambda functions usually start with /aws/lambda). Using the Query generator feature, type: Show all log entries that indicate an error or exception condition in the Prompt box. This should generate a query that looks something like this:

fields @timestamp, @message  
| filter @message like /(?i)(error|exception)/

Clicking the Run query button will search the selected log group(s) for logs that contain case-insensitive variations of “error” or “exception”:

Use CloudWatch Logs Insights to view instances of log errors and exceptions

The query results show that I have several cases where functions have failed in ways that I wasn’t aware of.
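
To get a feel for what the pattern in that query matches, here is a small Python sketch that applies the same expression to a few made-up log lines. The sample messages are invented for illustration, and Python’s re module is used as a stand-in for the Logs Insights regex engine, which should treat a pattern this simple the same way.

```python
import re

# Same expression as in the Logs Insights filter: (?i) makes it case-insensitive.
ERROR_PATTERN = re.compile(r"(?i)(error|exception)")


def error_lines(messages):
    """Return the messages that contain "error" or "exception" in any case."""
    return [message for message in messages if ERROR_PATTERN.search(message)]


sample_messages = [
    "START RequestId: 1234 Version: $LATEST",
    "[ERROR] Could not find expected tag on resource",
    "Unhandled Exception while parsing payload",
    "Scan complete: 0 errors found",
    "END RequestId: 1234",
]
matches = error_lines(sample_messages)
```

Note that the line reporting “0 errors found” also matches, so expect some false positives and tighten the pattern if your functions log messages like that.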

The above maintenance checks are useful if you know what you’re looking for. If you’re looking for anomalies or other patterns in your logs, consider configuring anomaly detection and pattern analytics using the blog post here and evaluate which logs to analyze.

7. Inspect CloudTrail logs for access errors

Similar to detecting errors in Lambda functions, it’s also a good practice to detect access errors across CloudTrail activity. This could be useful as a security mechanism for detecting unauthorized attempts to access your resources, but it’s also useful to detect similar errors in your operational use of AWS services. For example, you might have an AWS Glue Crawler job that regularly runs against an Amazon S3 bucket to discover new data. If someone from your organization’s security team subsequently modifies the bucket policy in a way that prevents the crawler from accessing the bucket’s objects, then it may not be immediately obvious to you that the crawler job is failing.

Monthly maintenance check #1: Similar to the approach for detecting Lambda execution errors, use the CloudWatch Logs Insights feature to regularly check the CloudTrail logs. You’ll need to first select your CloudTrail log group (in an AWS Control Tower environment, the log group is typically named aws-controltower/CloudTrailLogs), select the query window (it’s four weeks in this example), and apply the following query:

filter (errorCode='AccessDenied' or errorCode='UnauthorizedOperation') 
| fields eventName, eventTime, errorCode, errorMessage

This will produce results similar to the following screenshot:

Use CloudWatch Logs Insights to view instances of unauthorized access attempts

The query results show that I have several cases where access has been denied in ways that could affect my operations.
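
If you would rather run this check on a schedule than in the console, the query can be submitted through the CloudWatch Logs API. The sketch below is one possible shape for that: it starts the same query with start_query, polls get_query_results until the query finishes, and flattens each result row into a dictionary. The function names and the polling interval are assumptions, and the log group name is whatever your CloudTrail trail writes to.

```python
import time

QUERY = (
    "filter (errorCode='AccessDenied' or errorCode='UnauthorizedOperation') "
    "| fields eventName, eventTime, errorCode, errorMessage"
)


def rows_to_dicts(results):
    """Flatten Logs Insights result rows into plain dicts, dropping the
    internal @ptr field that accompanies each row."""
    return [
        {cell["field"]: cell["value"] for cell in row if cell["field"] != "@ptr"}
        for row in results
    ]


def run_query(log_group, start_epoch, end_epoch, query=QUERY, poll_seconds=2):
    """Run a Logs Insights query and return its rows as dicts.

    start_epoch and end_epoch are Unix timestamps in seconds."""
    import boto3  # imported here so rows_to_dicts works without boto3

    logs = boto3.client("logs")
    query_id = logs.start_query(
        logGroupName=log_group,
        startTime=start_epoch,
        endTime=end_epoch,
        queryString=query,
    )["queryId"]
    while True:
        response = logs.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            break
        time.sleep(poll_seconds)
    if response["status"] != "Complete":
        raise RuntimeError(f"query ended with status {response['status']}")
    return rows_to_dicts(response["results"])
```

For a four-week window, end_epoch can be int(time.time()) and start_epoch that value minus 28 * 86400.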

8. Inspect “top talker” account activity

Automation is a key element to operating an AWS environment. Automation can be used for DevOps activities, security monitoring and remediation, normal workload operations, and so on. With the accumulation of automated activities, it can sometimes be surprising to see what’s actually happening in your environment and whether the activity reconciles with your expectations. To see what’s really going on in your environment, use CloudTrail activity to aggregate “top talker” actions.

Monthly maintenance check #1: As in the previous section’s maintenance checks, use the CloudWatch Logs Insights feature to run the following query against your CloudTrail log group:

fields userIdentity.invokedBy, eventName, eventTime, userIdentity.accountId 
| stats count(*) as eventCount by userIdentity.invokedBy, eventName, userIdentity.accountId 
| sort eventCount desc 
| limit 10

This will produce results similar to the following screenshot:

Use CloudWatch Logs to view "top talker" CloudTrail activity

In the example above, Amazon QuickSight was involved in a large number of AWS KMS decryption calls, which suggests looking at how QuickSight data ingestion could be optimized.
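
For readers less familiar with the stats, sort, and limit stages of that query, the following Python sketch performs the equivalent aggregation over a list of CloudTrail event records that have already been parsed into dictionaries. The sample records and account ID are invented for illustration.

```python
from collections import Counter


def top_talkers(events, limit=10):
    """Count events by (invokedBy, eventName, accountId) and return the
    most frequent combinations, highest count first."""
    counts = Counter(
        (
            event.get("userIdentity", {}).get("invokedBy"),
            event["eventName"],
            event.get("userIdentity", {}).get("accountId"),
        )
        for event in events
    )
    return counts.most_common(limit)


sample_events = (
    [{"eventName": "Decrypt",
      "userIdentity": {"invokedBy": "quicksight.amazonaws.com", "accountId": "111122223333"}}] * 3
    + [{"eventName": "AssumeRole",
        "userIdentity": {"invokedBy": "lambda.amazonaws.com", "accountId": "111122223333"}}] * 2
    + [{"eventName": "ListBuckets", "userIdentity": {"accountId": "111122223333"}}]
)
ranking = top_talkers(sample_events, limit=2)
```

Events that were not invoked by an AWS service have no invokedBy value, so they are grouped under None here.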

9. Inspect S3 usage and cost

Amazon S3 is an object storage service that provides enterprise-ready scalability, data availability, security, and performance. Customers use S3 for building data lakes, for backing up data, or for operational use with their workloads. Over time and across multiple accounts, the data can accumulate in ways that you hadn’t anticipated, which could increase your costs or affect your ability to effectively govern the use of the data. S3 Storage Lens is a feature that provides insight into bucket usage and can help tame bucket sprawl. See this “5 Ways to reduce data storage costs using Amazon S3 Storage Lens” blog post for ways to help wrangle your costs and use of S3.

Monthly maintenance check #1: Create an AWS Organizations-wide Storage Lens dashboard by following the guidance in this “Optimize Costs and Gain Visibility into Usage with Amazon S3 Storage Lens” tutorial. Review the Storage Lens dashboard monthly for insights on potentially reducing cost.

10. Inspect contact information

At re:Invent in 2019, Stephen Schmidt, AWS CISO at the time, laid out the top 10 most important cloud security tips when using AWS. The first item in the list is to ensure that your contact information is accurate. Doing so establishes contacts for AWS to provide notifications regarding account, billing, or security issues that may come up. This contact information designates appropriate personnel for AWS to reach out to about those matters. Specifically, these contacts should be distribution groups instead of relying on any one individual.

Monthly maintenance check #1: Make sure that each AWS account has correct contact information and that those email addresses are monitored. Follow the “Update the alternate contacts for your AWS account” guidance here.

Conclusion

In this blog post, I showed 10 ways to help improve AWS operations. Some of these involve manual checks or human oversight, but many of the checks can be automated. This, paradoxically, leads to the question of whether that automation is executing as expected. The point is that commodity operations should be automated, but some governance is still required to verify that the machinery is operating normally.

About the author

Rob Barnes

Rob Barnes is a principal consultant for AWS Professional Services. He works with our customers to address security and compliance requirements at scale in complex, multi-account AWS environments through automation.