AWS Cloud Operations & Migrations Blog

Understanding AWS Lambda behavior using Amazon CloudWatch Logs Insights

In this guest post, Alice Xiao, Data Analyst at the financial services company State Street Corporation, discusses how State Street built a framework using CloudWatch Logs Insights to support their innovation process of rapidly converting business ideas into client-centered applications.

Introduction

Customers want to prototype rapidly and resolve critical defects quickly, so understanding application behavior is essential. Amazon CloudWatch Logs Insights enables you to create ad hoc queries to understand AWS Lambda behavior based on the Lambda logs stored in Amazon CloudWatch Logs, and then to create quick visualizations from that data.

Overview

State Street uses serverless applications to support their innovation process, which focuses heavily on rapid prototyping. For serverless applications—and more specifically, Lambda functions—the application teams must ensure that they are configuring Lambda runtime parameters appropriately.

To do this, State Street uses CloudWatch Logs Insights to perform heuristic analysis of Lambda logs and gain insight into the sizing and billing of each of their Lambda functions. Additionally, they do further analysis by correlating runtime and memory usage.

“The goal was to have a solution that is concise, uses AWS native functionality, empowers the developer, and supports rapid prototyping,” said Nauman Noor, Managing Director of Public Cloud Platform at State Street Corporation.

Lambda automatically monitors Lambda functions on your behalf, reporting metrics through Amazon CloudWatch. To help you troubleshoot failures in a function, Lambda logs all requests handled by your function and also automatically stores logs generated by your code in CloudWatch Logs.

Each invocation of a Lambda function publishes a “REPORT” log entry at the end of the invocation. The entry provides details about the actual duration, billed duration, and amount of memory used, as shown in the following example:

REPORT RequestId: b68fd6c0-f38e-42d0-82f2-48b7484e5fe5 Duration: 172.63 ms Billed Duration: 200 ms Memory Size: 128 MB Max Memory Used: 35 MB
REPORT RequestId: f610598b-9adc-430e-96f4-58340016e6b9 Duration: 179.20 ms Billed Duration: 200 ms Memory Size: 128 MB Max Memory Used: 35 MB
REPORT RequestId: 0809c288-8b19-4685-9d1a-6b98852299ff Duration: 179.29 ms Billed Duration: 200 ms Memory Size: 128 MB Max Memory Used: 35 MB
REPORT RequestId: 8b4a32a4-7dc0-4342-a998-48b051d683e0 Duration: 182.29 ms Billed Duration: 200 ms Memory Size: 128 MB Max Memory Used: 35 MB

By using CloudWatch Logs Insights, you can extract this information from the Lambda REPORT logs to better understand your functions and ensure proper Lambda configuration. At a deeper level, this means determining the runtime duration characteristics of invocations and analyzing memory usage to mitigate out-of-memory occurrences. You can also tell whether there is a correlation between memory usage and runtime duration for any specific Lambda function.

Approach

Before CloudWatch Logs Insights, a common approach was to subscribe a Lambda function to the CloudWatch logs generated by Lambda. The function would then parse the log information (Duration, Billed Duration, and Memory Size) and store it in a data store, such as Amazon DynamoDB. Although this approach worked, it required additional effort from the developer to manage code and to orchestrate the deployment, configuration, and integration of such a solution. This did not meet State Street’s goal of empowering the developer.
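For context, the following is a minimal sketch of that earlier pattern, assuming a CloudWatch Logs subscription filter invokes the function and that a DynamoDB table named lambda-report-metrics exists (the table name and attribute names are hypothetical):

# Minimal sketch (hypothetical): a Lambda function subscribed to another function's
# log group parses each REPORT line and stores the metrics in DynamoDB.
import base64
import gzip
import json
import re

import boto3

table = boto3.resource("dynamodb").Table("lambda-report-metrics")  # hypothetical table

REPORT_PATTERN = re.compile(
    r"REPORT RequestId: (?P<request_id>\S+)\s+"
    r"Duration: (?P<duration>[\d.]+) ms\s+"
    r"Billed Duration: (?P<billed>[\d.]+) ms\s+"
    r"Memory Size: (?P<memory_size>\d+) MB\s+"
    r"Max Memory Used: (?P<memory_used>\d+) MB"
)

def handler(event, context):
    # CloudWatch Logs subscriptions deliver a base64-encoded, gzip-compressed payload.
    payload = json.loads(gzip.decompress(base64.b64decode(event["awslogs"]["data"])))
    for log_event in payload["logEvents"]:
        match = REPORT_PATTERN.search(log_event["message"])
        if match:
            table.put_item(Item={
                "request_id": match["request_id"],
                "duration_ms": match["duration"],
                "billed_duration_ms": match["billed"],
                "memory_size_mb": match["memory_size"],
                "max_memory_used_mb": match["memory_used"],
            })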

Our approach was to leverage CloudWatch Logs Insights to provide the heuristic analytic capabilities while empowering the developer and removing operational overhead.

The following screenshot shows the user interface for CloudWatch Logs Insights, including the query, the distribution of log events over time, and the log data.

With no setup needed, CloudWatch Logs Insights provides the developer with the ability to create ad hoc queries. They can use the queries to understand Lambda function behavior based on logs stored in CloudWatch Logs. They can also create quick visualizations.
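The same ad hoc queries can also be scripted against the CloudWatch Logs Insights API, which is convenient for repeating an analysis across functions. The following is a minimal sketch using boto3; the log group name and time window are placeholders:

# Minimal sketch: run a CloudWatch Logs Insights query programmatically and poll
# for the results. The log group name and time window are placeholders.
import time

import boto3

logs = boto3.client("logs")

query = """
filter @type = "REPORT"
| stats avg(@billedDuration) as mean_billed_duration,
        max(@maxMemoryUsed/1024/1024) as max_memory_used_mb
"""

now = int(time.time())
start = logs.start_query(
    logGroupName="/aws/lambda/my-function",  # placeholder log group
    startTime=now - 3600,                    # last hour, in epoch seconds
    endTime=now,
    queryString=query,
)

# Poll until the query finishes, then print each result row as a dict.
while True:
    response = logs.get_query_results(queryId=start["queryId"])
    if response["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in response.get("results", []):
    print({field["field"]: field["value"] for field in row})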

Scenario Illustration

To illustrate some of these concepts, I will walk you through a scenario focused on understanding the runtime characteristics of a Lambda function.

1.     In CloudWatch Logs Insights, select a Lambda log group.

2.     Execute the following queries:

filter @type="REPORT"
| stats avg(@billedDuration) as mean_billed_duration,
  min(@billedDuration) as min_billed_duration,
  max(@billedDuration) as max_billed_duration,
  percentile(@billedDuration, 95) as Percentile95

filter @type="REPORT"
| stats avg(@maxMemoryUsed/1024/1024) as mean_MemoryUsed,
  min(@maxMemoryUsed/1024/1024) as min_MemoryUsed,
  max(@maxMemoryUsed/1024/1024) as max_MemoryUsed,
  percentile(@maxMemoryUsed/1024/1024, 95) as Percentile95

The executed queries perform the following:

·       Filter for the “REPORT” log entries.
·       Select a key field (@duration, @billedDuration, or @maxMemoryUsed).
·       Get the statistics, such as average, minimum, maximum, and percentile.

This results in the following output:

# mean_billed_duration min_billed_duration max_billed_duration Percentile95
1 30 38 46 54
# mean_MemoryUsed min_MemoryUsed max_MemoryUsed Percentile95
1 33.7255 30.5176 69.6182 33.3786

3.     To gather more insight and understand the distribution of these fields, generate a frequency table for @maxMemoryUsed. First, determine the range of each bucket, as follows:

filter @type="REPORT"
| fields floor(@maxMemoryUsed/1024/1024) as floor_var,
  ceil(@maxMemoryUsed/1024/1024) as ceil_var
| stats min(floor_var) as min_Memory_Used,
  (min_Memory_Used + (max_Memory_Used - min_Memory_Used)/5) as bucket1,
  (min_Memory_Used + 2 * (max_Memory_Used - min_Memory_Used)/5) as bucket2,
  (min_Memory_Used + 3 * (max_Memory_Used - min_Memory_Used)/5) as bucket3,
  (min_Memory_Used + 4 * (max_Memory_Used - min_Memory_Used)/5) as bucket4,
  max(ceil_var) as max_Memory_Used

# min_Memory_Used bucket1 bucket2 bucket3 bucket4 max_Memory_Used
1 30 38 46 54 62 70
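The bucket boundaries above are simply the equal-width formula min + k * (max - min)/5 applied to the observed 30 MB floor and 70 MB ceiling. As a quick illustration of the same arithmetic:

# Equal-width bucket edges, mirroring the bucket1..bucket4 expressions in the query above.
min_memory_mb, max_memory_mb = 30, 70          # floor(min) and ceil(max) from the results
width = (max_memory_mb - min_memory_mb) / 5    # 8 MB per bucket
edges = [min_memory_mb + k * width for k in range(6)]
print(edges)  # [30.0, 38.0, 46.0, 54.0, 62.0, 70.0]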

4.     Count the number of invocations that fall within each @maxMemoryUsed bucket:

filter @type="REPORT"
| fields (@maxMemoryUsed/1024/1024) as MemoryMB,
  (MemoryMB>=30 and MemoryMB<38) as R30_38,
  (MemoryMB>=38 and MemoryMB<46) as R38_46,
  (MemoryMB>=46 and MemoryMB<54) as R46_54,
  (MemoryMB>=54 and MemoryMB<62) as R54_62,
  (MemoryMB>=62 and MemoryMB<=70) as R62_70
| stats sum(R30_38) as MemoryUsed30MB_38MB,
  sum(R38_46) as MemoryUsed38MB_46MB,
  sum(R46_54) as MemoryUsed46MB_54MB,
  sum(R54_62) as MemoryUsed54MB_62MB,
  sum(R62_70) as MemoryUsed62MB_70MB

# MemoryUsed30MB_38MB MemoryUsed38MB_46MB MemoryUsed46MB_54MB MemoryUsed54MB_62MB MemoryUsed62MB_70MB
1 1242 0 0 0 49

So far, most of the Lambda invocations have consumed between 30–38 MB of memory, while 49 invocations consumed between 62–70 MB. These insights raise the question of whether the high-memory invocations also correspond to longer runtimes.
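The step above is effectively a hand-rolled histogram. If you export the per-invocation memory values (for example, with the boto3 snippet shown earlier), the same counts can be reproduced locally; the values below are placeholders:

# Sketch: count invocations per memory bucket, mirroring the step-4 query.
from bisect import bisect_right

edges = [30, 38, 46, 54, 62, 70]              # bucket edges from step 3
memory_mb = [34.2, 35.0, 69.6, 33.1, 68.7]    # placeholder per-invocation values

counts = [0] * (len(edges) - 1)
for value in memory_mb:
    index = min(bisect_right(edges, value) - 1, len(counts) - 1)  # last bucket is inclusive
    counts[index] += 1

for low, high, count in zip(edges, edges[1:], counts):
    print(f"{low}-{high} MB: {count}")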

5.     Repeat this process for @billedDuration to get the following results:

filter @type="REPORT"
| stats min(@billedDuration) as min_billed_duration,
  (min_billed_duration + (max_billed_duration - min_billed_duration)/5) as bucket1,
  (min_billed_duration + 2 * (max_billed_duration - min_billed_duration)/5) as bucket2,
  (min_billed_duration + 3 * (max_billed_duration - min_billed_duration)/5) as bucket3,
  (min_billed_duration + 4 * (max_billed_duration - min_billed_duration)/5) as bucket4,
  max(@billedDuration) as max_billed_duration

# min_billed_duration bucket1 bucket2 bucket3 bucket4 max_billed_duration
1 200 360 520 680 840 1000

filter @type="REPORT"
| fields (@billedDuration>=200 and @billedDuration<360) as R200_360,
  (@billedDuration>=360 and @billedDuration<520) as R360_520,
  (@billedDuration>=520 and @billedDuration<680) as R520_680,
  (@billedDuration>=680 and @billedDuration<840) as R680_840,
  (@billedDuration>=840 and @billedDuration<=1000) as R840_1000
| stats sum(R200_360) as billedDuration200_360,
  sum(R360_520) as billedDuration360_520,
  sum(R520_680) as billedDuration520_680,
  sum(R680_840) as billedDuration680_840,
  sum(R840_1000) as billedDuration840_1000

# billedDuration200_360 billedDuration360_520 billedDuration520_680 billedDuration680_840 billedDuration840_1000
1 1121 161 4 3 2

As you can see from the results, most invocations are billed for 200–360 ms of runtime, with a few taking more than 520 ms.

At this point, I wondered whether the 49 invocations that consumed the most memory corresponded to the top @billedDuration ranges. In other words, is runtime duration correlated with memory usage, such that high memory usage implies a longer runtime?

The following query counts the invocations that used 62–70 MB of memory, grouped by memory used and billed duration.

fields (@maxMemoryUsed/1024/1024) as MemoryMB
| filter @type="REPORT" and (MemoryMB>=62 and MemoryMB<=70)
| stats count(@billedDuration) by MemoryMB, @billedDuration

# MemoryMB @billedDuration count(@billedDuration)
1 69.6182 300 7
2 69.6182 400 21
3 69.6182 500 4
4 69.6182 200 16
5 68.6646 1000 1

Analyzing the 49 invocations that used 62–70 MB of memory and their corresponding @billedDuration shows that only one corresponds to the 1,000-ms runtime; the majority ran for 400 ms or less. At this point, the relationship is not obvious and the correlation appears weak.
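One way to go beyond eyeballing the tables (not part of the original workflow) is to export the per-invocation pairs of @billedDuration and @maxMemoryUsed and compute a correlation coefficient locally. A minimal sketch, with placeholder values standing in for the exported query results:

# Sketch: Pearson correlation between billed duration and memory used, computed
# from per-invocation pairs. The pairs below are placeholders; in practice they
# would come from a query such as:
#   filter @type = "REPORT"
#   | fields @billedDuration, @maxMemoryUsed/1024/1024 as memory_mb
from statistics import correlation  # Python 3.10+

billed_duration_ms = [200, 200, 300, 400, 400, 1000]   # placeholder sample
max_memory_used_mb = [34, 35, 69, 69, 70, 69]          # placeholder sample

r = correlation(billed_duration_ms, max_memory_used_mb)
print(f"Pearson r = {r:.2f}")  # values near 0 indicate a weak linear relationship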

6.     Expand the scope of the study to a larger period (and a greater sample size) of 10 hours. Visualize @billedDuration (ms) against @maxMemoryUsed (scaled to 0.1-MB units so both series fit on one axis), averaged over five-minute intervals, as shown in the following code:

filter @type="REPORT"
| stats avg(@billedDuration) as BilledDuration,
avg(@maxMemoryUsed/1024/102.4) as MemoryUsedMB_scaled by bin(5m)

The following screenshot shows a line chart visualization of the billed duration and memory used (MB) on the y-axis against time on the x-axis.

Over this larger period, a spike in memory usage sometimes corresponds to a longer runtime duration. Overall, however, the correlation between @maxMemoryUsed and @billedDuration is still not clear.

This insight suggests that the application team should review the execution flow to determine whether the longer runtimes stem from exception handling that does not occur in typical invocations, or whether memory consumption is a function of the data being processed.

Ultimately, the information that CloudWatch Logs Insights provides helps the team narrow in on the specific invocations of interest from a runtime and performance-optimization perspective.

Conclusion

In this post, I demonstrated how CloudWatch Logs Insights provides a quick understanding of Lambda runtime behavior, such as duration and memory usage, and of the potential correlation between those two dimensions.

With CloudWatch Logs Insights, you can parse, filter, sort, and compute statistics over your logs in a simple and repeatable manner, without the application team having to instrument their applications to enable the analysis.

I really enjoy the ease of use of CloudWatch Logs Insights. And I look forward to future enhancements, such as additional robust query functions and the ability to set custom attributes within queries that aid more complex analysis while simplifying some of the proposed techniques.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.