AWS Machine Learning Blog

Investigating performance issues with Amazon CodeGuru Profiler

Amazon CodeGuru (Preview) analyzes your application’s performance characteristics and provides automatic recommendations on how to improve it. Amazon CodeGuru Profiler provides interactive visualizations to show you where your application spends its time. These flame graphs are a powerful tool to help you troubleshoot which code methods are causing delays or using too much CPU. This post shows you how to interpret flame graphs and use them in the following cases:

  1. Analyzing your application’s performance profile and understanding how it relates to the code you know
  2. Spotting an obvious anomaly in the profile
  3. Investigating a latency issue
  4. Searching for CPU utilization optimization and opportunities to reduce cost

Analyzing your application

Opening a profiling group

Before your start profiling your application, you can quickly get started with CodeGuru Profiler without any setup, thanks to the demo application. It is available for anyone with an AWS account. To access the demo application, on the Amazon CodeGuru console, choose Profiler, Profiling groups. This post uses DemoProfilingGroup-WithoutIssues.

If you choose it from the Profiling groups page, you see a flame graph. The following screenshot shows the flame graph for this specific demo application.

What are flame graphs?

Flame graphs show hierarchical methods similarly to how you see them in a stack trace. CodeGuru Profiler regularly takes snapshots of the stack traces from the code being run in your application over a given period (for example, 5 minutes) and shows them in a single visualization. A horizontal bar, called a frame, represents each method. The more frequently CodeGuru Profiler sees a method in your stack traces, the wider the frame is in the graph. With enough samples, you can see what percentage of time your application spends in each method.

Flame graphs have two concepts for time:

  • Active CPU time – Accounts for the time when the processor was actively working in the method. To view, use CPU view mode.
  • Wall clock time – Also takes into account the time when the process was waiting. To view, use Latency view mode.

What is the demo application doing?

To understand what the visualization shows you, you should know what the actual profiled application is doing.

This demo application runs two separate threads. One thread is for the task publisher, which adds tasks inside an Amazon SQS queue. The other thread is running the image processor, which performs the following actions:

  1. Get tasks from queue
  2. Download images to an Amazon S3 bucket
  3. Copy image to local disk
  4. Transform and upload images to an S3 bucket
  5. Delete image to local disk
  6. Delete tasks from the queue

The following diagram illustrates this architecture.

Identifying the main parts of the overview

After you open the profile for a profiling group, the default visualization type is the Overview in CPU view mode. If you move your mouse over any frame, it displays the full name of the function and additional information. For example, the following screenshot shows that 51.65% of the observed active CPU time in this period was spent in the PNGImageWriter.encodePass function.

The frame below a given frame is the function that called it, and the frames above are the functions it called. The following screenshot illustrates the different functions.

Going from bottom to top, you can see how the threads are started and find your ImageProcessor and TaskPublisher classes being used. Directly above the ImageProcessor.run function are the functions it called. The following table summarizes each action numbered in the preceding screenshot and what it does.

 

Action What it Does Frame Name in the Graph
1 Get tasks from SQS queue AmazonSQSClient.receiveMessage
2 Get image data from S3 bucket AmazonS3Client.getObject
3 Copy image data to file IOUtils.copy
4 Transform image data and upload to S3 bucket

ImageProcessor$BrightenImageProcessor.brightenAndUpload

ImageProcessor$BWImageProcessor.monochromeAndUpload

5 Delete the file N/A (this action is very fast and was not sampled)
6 Delete message on SQS queue AmazonSQSClient.deleteMessage

When you use this visualization, it is important to remember the following:

  • Different functions called from the same context are displayed horizontally in alphabetical order, not in execution order. For example, in the preceding screenshot, action 6 is found to the left of action 3.
  • All the steps from the stack traces appear in the visualization. Some can be confusing. If there is only one function above with the same width, it is simply an intermediate function, and you do not need to worry about each frame. For example, when ImageProcessor class calls private class BrightenImageProcessor.brightenAndUpload, Java creates a BrightenImageProcessor.access$300 method that appears in the graph.
  • The large flat functions are the ones that spend a lot of CPU. You should identify them and make sure your application is supposed to take that much time doing this. In this example, the large deflateBytes is understandable for an application that is an image processing tool, it is called when we write the PNG images.

What can I learn from seeing this view?

Most of the CPU is spent in ImageProcessor and, inside this function, most is spent in action 4, when it transforms the images. The PNRImageWriter is doing compression with Deflater, which takes about 50% of the active CPU usage.

The process does multiple calls to AWS services (Amazon SQS and Amazon S3). The Amazon S3 calls take very little CPU time (for example, step 2 in ImageProcessor is barely visible in the preceding screenshot). However, it can still be impactful for the latency.

How CodeGuru Profiler shows anomalies

The following use case illustrates how CodeGuru Profiler displays anomalies.

Introducing a bug

This use case is a simplified version of a real bug. Assume that you add the following code to track the messages that you have processed and call it for each SQS message:

   static final List<Message> existingMessagesList = new ArrayList<Message>();
    
   void trackMessages(Message message) {
        if (!existingMessagesList.contains(message)) {
            existingMessagesList.add(message);
        }
    }

Unfortunately, although you expected to have only a handful of possible messages, each Message object is different, so the list keeps growing as the application runs. Searching from a long list does not scale well.

Spotting the anomaly in the profile

The longer the application runs, the more CPU it consumes. This type of issue is sometimes hard to find before it runs in production because it needs to run for a long time for the problem to occur. After running for some time, the profile looks like the following screenshot.

There is now a large trackMessages frame that takes most of the width. It indicates how much CPU time this function is using.

The other frames are all smaller. You still spend as much CPU time doing those, but they are now smaller because trackMessages is using more CPU time than before. Remember that all functions widths are displayed relative to the others.

Seeing all the time spent on ArrayList.indexOf indicates that you are searching from a long list and should use a hashmap or make sure fewer items get into that list.

Improving latency

If you have a service-based application, you may be interested in improving the latency of one of your services by performing one of the following actions:

  • Zoom in on the function that you are trying to improve. If you are not CPU bound, you may not be interested in other parts of the visualization—they are probably functions that you use for other service calls or in separate threads and have no impact on the one you are interested in.
  • Switch to Latency view. As mentioned earlier, the default CPU view shows only active CPU time, not wall clock time. The Latency view, in contrast, shows both samples for active thread states and waiting thread states, so it tells you where the actual time is spent. In Latency view, the different colors show how time was distributed into different thread states (so you can tell what part was active processing, blocked, or waiting).

CPU vs. Latency view

The following screenshot zooms in on the ImageProcessor.run function in CPU view.

The CPU view shows that the call to get objects from Amazon S3 looks negligible (0.40% of the total time).

The following screenshot zooms in on the ImageProcessor.run function in Latency view.

In Latency view, you can see that it takes 5.31% of the total time because it takes time to wait for the responses.

Use case: External call is unusually slow

In this use case, you are having an issue in this call to get objects from Amazon S3. For example, you are calling in the wrong Region and the call is now much slower. In this use case, the CPU view is unchanged but the Latency view now looks like the following screenshot.

The call to AmazonS3Client.getObject is 36% of the total time instead of 5% and is the biggest part of the ImageProcess.run. If you are used to how your profile usually looks, you can spot this problem right away. If not, this view shows which part of the process takes longer and has room for improvement.

CPU utilization optimization and opportunities to reduce cost

The following use case illustrates how to optimize your CPU utilization and find opportunities to reduce cost.

Finding the low-hanging fruit with the Hotspots visualization

For this use case, this post uses the other demo application, the DemoProfilingGroup-WithIssues profiling group. It is the same image processing application but with a few inefficiencies. The goal is to find the biggest inefficiencies.

Instead of the default Overview visualization, you can use the Hotspots visualization. It shows at the top the functions with the most CPU usage throughout the whole graph, which is useful when you want to spot the low-level functions that add up to a lot of CPU usage.

For example, for this use case, the overview visualization has multiple stacks that end with ZipFile.entry. Those may not be noticeable separately because they are each around 4% of the total. However, in the Hotspot view, you can see that they add up to about 20% of the total. See the following screenshot.

Learning from this visualization

You still see the stack where PNG images are compressed and written, which is the same action 4, as seen previously. It should take about 50% of the active CPU usage if you did not have inefficiency issues.

The other two big stacks you can identify easily are A and B. Thanks to the highlighting in yellow (when you hover over the frame), when you look at the callers, you can see that A is mostly from logging and B is mostly about constructing AmazonSQSClient objects.

Inspecting a function

AWS clients are usually designed to be reusable, so spending a lot of time constructing them is not optimal. To investigate this a bit more, you can right-click on the AmazonSQSClient.<init> frame and choose Inspect. The following screenshot shows this third visualization, which is a mix of Overview and Hotspot but focuses on one function. Above the frame, you see the other functions it calls, and below are the different functions that called it.

Learning from this visualization

Across all the application code, you spent 36% of the active CPU time constructing AmazonSQSClient objects. It appears that you are creating the client each time you need to use it, instead of creating one static instance and reusing it.

From the top part of the visualization, you can see that it takes time mainly because of class loading.

From the bottom part, you can see the one place where you call this constructor. The first function that is part of your own code is probably the place where you can fix this.

To fix the code, find your sqsClient function. See the following code:

public class Main {
    static AmazonSQS sqsClient() {
       return AmazonSQSClientBuilder.standard().build();
    }
}

Change the code to the following:

public class Main {
    static AmazonSQS sharedSqsClient = 
AmazonSQSClientBuilder.standard().build();
    static AmazonSQS sqsClient() {
        return sharedSqsClient;
    }
}

That small code change could save up to 36% of your CPU usage.

Recommendations

There are many common efficiency issues in applications. To prevent every developer from having to go through the same investigation effort, CodeGuru Profiler tries to detect the issues automatically.

Before going through the investigation that this post walked you through, you should check if the CodeGuru Profiler has recommendations for you. On the CodeGuru console, under Profiling groups, choose your profiling group. If there are available recommendations, choose Recommendations.

The following screenshot shows that the demo profile has four recommendations.

The problem that you just fixed is one that CodeGuru Profiler detected. The following screenshot shows the first recommendation for this issue.

The recommendation explains where the problem is and how to fix it. In this use case, the first recommendation is precisely the one you found and fixed. Always check your recommendation report before starting an investigation; you may save some time.

Is the application idle?

If you are examining a profiling group for an application you do not know much about, looking for opportunities to reduce cost, keep in mind that the application may be mostly idle. (For example, a transaction-based application that receives very few requests is spending most of its time waiting for calls.) In this use case, there are very few samples processing code, so the flame graph may not be accurate, and opportunities to reduce cost are minimal. You should check if the size of your fleet is adapted to the low traffic. You might save more by simply releasing some Amazon EC2 instances.

The following indicators can help you quickly spot an idle application:

  • There are no very small frames. Because there are only a few samples, the width corresponding to one sample is larger than the usual.
  • Even large frames have a small amount of RUNNABLE or NATIVE time.
  • The frame that represents the CodeGuru Profiler overhead (ProfilingCommand.run) is larger than the usual because this is almost the only thing processing in this application.

Conclusion

This post discussed the main use cases in which CodeGuru Profiler can help you find performance issues and opportunities to reduce cost. Each use case showed how to use the different visualizations to either improve CPU usage or latency.

To dive deeper into the demo application profiles from this post, visit the Amazon CodeGuru console and look at the profiles yourself, or create a new profiling group and onboard your own application.


About the author

Pierre Marieu is a Software Development Engineer in the Amazon CodeGuru Profiler team in London. He loves building tools that help the day-to-day life of other software engineers. Previously, he worked at Amadeus IT, building software for the travel industry.