AWS DevOps Blog

Detecting concurrency bugs with Amazon CodeGuru

This post discusses the types of concurrency bugs Amazon CodeGuru detects and how developers can fix them.

CodeGuru automatically analyzes pull requests (created in supported repositories like CodeCommit, GitHub, GitHub Enterprise, and Bitbucket) and generates recommendations about how to improve your code quality.

Why use a tool to automatically detect concurrency bugs?

Stylize view of concurrency bugs in a concurrent condeConcurrency bugs are difficult to catch during unit and system testing. This is because triggering concurrency bugs is timing dependent: threads need to execute instructions in parallel in a particular order for the program to exhibit the buggy behavior (we provide examples later in this post). Additionally, triggering concurrency bugs is non-deterministic: many program executions during testing and production may not exhibit the buggy behavior, but occasionally, (for example, under a slightly different execution timing or system load), an execution may trigger the bug. Even after triggering a concurrency bug, reproducing it for debugging may be difficult due to non-determinism.

Overview of concurrency bugs found by CodeGuru

The concurrency bug detectors in CodeGuru found previously unknown concurrency bugs in 22 mature and widely used Amazon projects, 6 open-source projects (JDK 8 to 14, Elasticsearch, Hadoop, NetBeans, RabbitMQ, and XMage), and 18 internal Amazon projects. We reported these bugs to the developers, who fixed them based on our reports.

In addition to fixing the bugs, developers gave us strongly positive feedback. For example, after we reported bugs in three mature projects, the project developers asked us to run the concurrency bug detectors on other modules.

Developers say the bugs found by the concurrency bug detectors in CodeGuru would be difficult to find by a human though code inspection: “It’s a really neat catch. Bugs like that are super tricky to track down.”, “…it’s pretty subtle, not something easily spottable,” and, “Looks like it just picked up a flaw in the code that I don’t think I’ll ever find out myself.”

Other developers expressed enthusiasm for the concurrency bug detectors: “I’m also impressed by the inspection tool you guys are using,” “He gave us some fantastic deadlock finds that he found with CodeGuru,” “This seems like a very useful tool,” and “CodeGuru is really powerful.”

Types of concurrency bugs CodeGuru detects

CodeGuru detects four types of concurrency bugs:

  • Deadlocks
  • Data races on thread unsafe classes
  • Atomicity violations
  • Over-synchronization

In the following sections, we give real-world examples of bugs from each of the four bug types.

Deadlocks

Deadlocks are severe bugs in which the execution permanently blocks. Deadlocks can cause lost system utilization, long response time, low throughput, and negative customer experience. During a deadlock (also called a circular-wait), one thread waits for a synchronization object held by another thread, while the other thread waits for a synchronization object held by the first thread. The execution therefore can’t make forward progress, and the user needs to forcefully terminate the process. (When synchronization protects one or more code regions, those code regions cannot execute simultaneously in parallel and are forced to execute one at a time).

Example deadlock in JDK 8 to 14

The following code shows a previously unknown deadlock (JDK-8236873) that CodeGuru found in JDK 8 to 14. The JDK developers fixed this deadlock based on our report.

public void run() {
   ...
   synchronized (jobs) {
   ...
      isStopped();
   ...
}

private synchronized boolean isStopped() {
   return stopped;
}

public synchronized void stopWorker() {
   ...
   synchronized (jobs) {
      ...
}

In the preceding code, the deadlock occurs due to the actions two threads can take.

One thread performs the following actions:

  • Enters method run()
  • Acquires synchronization on object jobs
  • Calls method isStopped()
  • Tries to acquire synchronization on object this (because isStopped() is a synchronized method ) and blocks waiting for another thread (see next) to release the synchronization on object this

Another thread performs the following actions:

  • Enters method stopWorker()
  • Acquires synchronization on object this (because stopWorker() is a synchronized method)
  • Tries to acquire synchronization on object jobs and blocks waiting for the first thread to release the synchronization on object jobs

The two thread executions are deadlocked: each thread waits for the other thread to release synchronization, and therefore neither of the two threads can continue executing.

Fixing this deadlock

To break this deadlock cycle, developers made variable stopped a volatile variable and removed the synchronization on this. A general fix strategy for deadlocks (recommended by CodeGuru in its messages for deadlock detections) is to always acquire synchronization objects in the same order, which prevents cycles.

This deadlock may be difficult to detect during testing

Triggering this deadlock requires very particular timing: the first thread acquires jobs but doesn’t yet acquire this and at exactly the same time, the second thread acquires this but doesn’t yet acquire jobs. In a regular unit or system test, such particular timing may be unlikely to occur. However, during production and heavy loads, this particular timing may happen.

Data races on thread unsafe classes

Data races on thread unsafe classes are severe bugs that can cause the execution to crash or to produce wrong results. For example, if one thread iterates over a list while another thread adds to that list at the same time, the list’s internal data structures may become corrupt, potentially causing exceptions, infinite loops, or wrong or lost results. To avoid this, code typically protects accesses to a shared data structure by acquiring synchronization, which ensures thread mutual exclusion. A data race on a thread unsafe class happens when the code wrongly doesn’t acquire synchronization.

Example data race on a thread unsafe class in Hadoop

The following code shows a data race on a thread unsafe class (HDFS-14618) that CodeGuru found in Hadoop (we reported this bug anonymously because we had not yet launched CodeGuru). The Hadoop developers fixed this race based on our report.

public void clear() {
   synchronized (pendingReconstructions) {
      pendingReconstructions.clear();
      timedOutItems.clear();
      timedOutCount = 0L;
   }
}

The preceding code has a data race on timedOutItems. timedOutItems is a field of type ArrayList, a thread unsafe class. The Hadoop code typically (in six code locations) protects operations on timedOutItems with synchronization on timedOutItems. In these six locations, the operations on timedOutItems are add(), another clear(), toArray(), and size().

However, in the preceding code (the seventh location where timedOutItems is accessed), the timedOutItems.clear() call is protected by synchronization on a different object (pendingReconstructions). Two threads synchronizing on different objects can still execute at the same time. Therefore, timedOutItems.clear() can modify timedOutItems while another thread calls add(), the other clear(), or toArray() on timedOutItems.

The code would be buggy even if synchronized (pendingReconstructions) wasn’t present. However, the synchronized (pendingReconstructions) likely obscured for the developer the fact that timedOutItems.clear() isn’t correctly protected by synchronization.

Fixing this data race

The fix for this data race is to surround timedOutItems.clear() with a synchronized (timedOutItems) block. This is the type of fix recommended by CodeGuru in its messages for data race detections.

This race may be difficult to detect during testing

Triggering this data race requires very particular timing: one thread needs to execute timedOutItems.clear() and at exactly the same time, another thread needs to execute one of the other actions (add(), the other clear(), or toArray()) on timedOutItems. In a regular unit or system test, such special timing may be unlikely to happen, but during production and heavy loads, this particular timing may occur.

Atomicity violations

Atomicity violations are severe bugs that can cause the execution to crash or produce wrong results. An atomicity violation happens when two instructions are expected and assumed to execute together (atomically) but another thread executing on the same data breaks the atomicity expectation. As of this writing, CodeGuru focuses on atomicity violations involving concurrent collections. For example, when the code checks isPresent(key) on a ConcurrentHashMap and then the code calls get(key), the code implicitly assumes that no other thread removes the key from the map in between the calls to isPresent() and get(). In other words, the code assumes the isPresent() and get() are executed atomically. If another thread removes the key in the time between isPresent() and get() are executed, get() can return null and the program can crash (if the code wrongly uses the null value).

Example atomicity violation in Amazon code

The following code (anonymized sketch of the original code) shows an atomicity violation CodeGuru found in Amazon code. The Amazon developers fixed this atomicity violation based on our report.

public void methodA(KeyType param) { 
   ...
   if (aConcurrentHashMap.contains(param)) {
      ValueType value = aConcurrentHashMap.get(param);
      ...
      value.callAMethod();
}

public void methodB(KeyType param) {
   ...
   aConcurrentHashMap.remove(param);
}

In the preceding code, aConcurrentHashMap is a field of type ConcurrentHashMap. If a thread executes methodA() and another thread executes methodB(), the following can happen:

  • The first thread executes aConcurrentHashMap.contains(param), which returns true
  • The second thread executes aConcurrentHashMap.remove(param), which removes the key param and corresponding value
  • The first thread executes aConcurrentHashMap.get(param), which returns null (because now aConcurrentHashMap no longer contains the key param—assuming the two parameters for methodA() and methodB() are the same object)
  • The first thread executes value.callAMethod() and crashes with NullPointerException

Fixing this atomicity violation

To fix this atomicity violation, remove the call aConcurrentHashMap.contains(param) and replace it will a null check on the value returned by aConcurrentHashMap.get(param). The null check effectively checks whether aConcurrentHashMap contained the key or not. After the fix, methodA() looks like the following code:

public void methodA(KeyType param) { 
   ...
   ValueType value = aConcurrentHashMap.get(param);
   if (value != null) {
      ...
      value.callAMethod();
}

This is the type of fix recommended by CodeGuru in its messages for atomicity violation detections.

This atomicity violation may be difficult to detect during testing

Triggering this atomicity violation requires very particular timing: the second thread must execute aConcurrentHashMap.remove(param) right in between the calls aConcurrentHashMap.contains(param) and aConcurrentHashMap.get(param) executed by the other thread. In a regular unit or system test, such special timing may be unlikely to occur, but it may happen during production and heavy loads.

Over-synchronizations

Over-synchronizations are bugs that reduce program performance unnecessarily. In an over-synchronization bug, the developer adds coarse-grained synchronization and unnecessarily serializes the parallel execution. Over-synchronizations can exist in many scenarios, but CodeGuru focuses on over-synchronizations involving concurrent collections.

Example over-synchronization in Amazon code

The following code (anonymized sketch of the original code) shows an over-synchronization CodeGuru found in Amazon code. The Amazon developers fixed this over-synchronization based on our report.

synchronized (aConcurrentHashMap) {
   value = aConcurrentHashMap.get(key);
   if (value == null) {
      aConcurrentHashMap.put(key, newValue);
   }
}

In the preceding code, aConcurrentHashMap is a field of type ConcurrentHashMap. The code uses synchronized (aConcurrentHashMap) to guard against the following execution scenario (which may happen if synchronized (aConcurrentHashMap) is missing):

  • Two threads execute aConcurrentHashMap.get(key) on the same key.
  • Both threads get null values (because aConcurrentHashMap doesn’t contain the key).
  • Both threads execute the true branch of the if (value == null) condition.
  • The value put into aConcurrentHashMap by the first thread with aConcurrentHashMap.put(key, newValue) is overwritten by the value put by the second thread. The first thread uses the overwritten value to store data, and therefore that data is lost (the data becomes unreachable in Java) after the value is overwritten.

By using synchronized (aConcurrentHashMap), the code prevents this execution scenario, but the code is unnecessarily slow: synchronization makes execution sequential, which can become a bottleneck under heavy load. Developers can avoid this slowdown by using the putIfAbsent() method in ConcurrentHashMap. The putIfAbsent() method achieves the same behavior but uses fine-grained synchronization, which reduces the sequential bottleneck and increases performance under heavy load.

Fixing this over-synchronization

The fix for this over-synchronization is to simply call putIfAbsent(key, newVal) and remove the synchronized (aConcurrentHashMap). This is the type of fix recommended by CodeGuru in its messages for over-synchronization detections.

This over-synchronization may be difficult to detect during testing

This over-synchronized code may get executed during unit or system testing. However, the code may appear as slow only under heavy load. Additionally, many synchronized regions (that may be slow during heavy load) may be legitimately slow: the code may truly need mutual exclusion. In contrast, the preceding code may be slow without actually needing to be slow.

What CodeGuru brings to concurrency bug detection

The concurrency bug detectors in CodeGuru prioritize accuracy over coverage: we believe it’s better to lose some concurrency bugs than give developers reports that they may consider to be wrong. Consequently, the concurrency bug detectors in CodeGuru report high-confidence detections but can miss bugs in the four categories presented above.

The concurrency detectors in CodeGuru use static analysis algorithms and techniques (static analysis means analyzing code without executing it, which is the use case for CodeGuru during pull requests). For concurrency bugs, static analysis techniques are notoriously prone to false reports: a tool may report what it believes is a bug but the developer may decide the code is benign or works as intended (even though the code may be somewhat unusual). The concurrency detectors in CodeGuru identify likely false reports and filter them out.

For example, sometimes developers implement caches using ConcurrentHashMap, but such caches are typically resilient to slightly stale values and therefore to some forms of atomicity violations. In other cases, a data race may affect only logging or appear in methods that are likely sequential or known to the developer to be thread safe. In other cases, static analysis has trouble tracking objects though fields and complex heap objects.

Additionally, CodeGuru limits the types of concurrency bugs it detects (e.g., it focuses on data races involving thread unsafe classes and atomicity violations involving concurrent collections) to achieve high confidence. We’re working on expanding these bug types.

Conclusion

Concurrency bugs are difficult to detect during testing. The concurrency bug detectors in CodeGuru found previously unknown concurrency bugs in 22 mature and widely used Amazon projects, 6 open-source projects , and 18 internal Amazon projects. Thanks to our reports, developers have fixed these bugs.

CodeGuru has a 90 days free trial. You can easily get your code reviewed as described in Getting started with CodeGuru Reviewer.