Adopting Kotlin at Prime Video for higher developer satisfaction and less code

Choosing a programming language for a new project is a tough decision with long-lasting effects. It involves considering how well each candidate language integrates with the team’s existing technology stack, how mature it is, and how steep its learning curve is. For example, will there be sufficient time to learn the features of a hitherto unknown language?

In March 2020, Prime Video launched Prime Video profiles. This functionality lets Prime Video users access separate recommendations, season progress, and Watchlist, as these are based on individual profile activity. This new customer experience required the design and implementation of new microservices, and the team decided to use Kotlin (rather than Java) to develop these microservices. This decision also motivated the development of the AWSSSMChaosRunner library (an open source library for chaos engineering on Amazon Web Services [AWS]) in Kotlin.

In this article, we start with a brief overview of Prime Video’s software development culture, followed by a deep dive into the team’s adoption of Kotlin. Examples of Kotlin language features, such as coroutines and data classes, are also detailed. In conclusion, we describe how this adoption leads to code reduction, and we present the results of a developer satisfaction survey.

Kotlin adoption

Prime Video’s software development is organized around many different two-pizza teams—small teams that build, maintain, and own components of the call-stack (microservices). This approach enables fast delivery while maintaining a high degree of independence. The result is a culture of innovation, where software developers are empowered and encouraged to make informed choices about the technologies they build with. Choosing a programming language is one such choice, and this post details the experience of one specific team in choosing and adopting Kotlin.

Kotlin is a cross-platform, statically typed, general-purpose programming language developed by JetBrains and the open source community. It is open sourced under the Apache-2.0 license.

While designing new microservices, a Prime Video software development team decided to write these using Kotlin. The choice centered around Kotlin features (detailed below), which the team believed would lead to less code, fast delivery, and increased application throughput.

Phased approach

Although the team had previous experience with Kotlin, adopting a new language in a high-traffic (think hundreds of thousands of requests per second) and low-latency production environment can be daunting! Thus, a phased approach was chosen.

The first phase consisted of implementing the tests of an existing Java package in Kotlin. Kotlin’s Java interoperability allows the same package to contain code in both languages. This phase was successful, and the team soon started experimenting more with Kotlin. The most-used Kotlin features included:

Null safety: A common runtime error in Java is the NullPointerException, which is possible because any reference type can hold null. Kotlin, by contrast, builds nullability directly into the type system. Types are either nullable or non-nullable, which enables compile-time checks that help prevent this error.
Data classes: A common pattern in Java is the Plain Old Java Object (POJO), which is used to encapsulate data. It introduces a lot of boilerplate, such as getters and setters, toString(), equals(), hashCode(), and builder methods. Kotlin data classes make this quite concise (often one line of code).
First-class functions: Kotlin functions are first-class, which means they can be stored in variables and data structures, passed as arguments to, and returned from, other higher-order functions. You can operate with functions in any way that is possible for other non-function values.
Extension functions: Kotlin classes can be extended with new functionality without inheriting from them. For example, the String class can be extended with a user-defined function that can then be invoked on any String object.
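
The four features above can be sketched in a few lines. This is a minimal illustration rather than code from the team’s services: Viewer, initials(), and greet() are hypothetical names invented for the example.

```kotlin
// Hypothetical types and names, for illustration only.

// Data class: equals(), hashCode(), toString(), and copy() are generated.
data class Viewer(val name: String, val nickname: String?)

// Extension function: adds initials() to String without inheriting from it.
fun String.initials(): String =
    split(" ").filter { it.isNotEmpty() }.joinToString("") { it.first().uppercase() }

// Higher-order function: takes another function as a parameter.
fun greet(viewer: Viewer, format: (String) -> String): String =
    // Null safety: nickname is a String?, so the compiler requires the null case to be handled.
    format(viewer.nickname ?: viewer.name)

fun main() {
    val viewer = Viewer(name = "Ada Lovelace", nickname = null)
    val shout: (String) -> String = { it.uppercase() } // a function stored in a variable

    println(greet(viewer, shout))          // ADA LOVELACE
    println(viewer.name.initials())        // AL
    println(viewer.copy(nickname = "Ada")) // Viewer(name=Ada Lovelace, nickname=Ada)
}
```

Assigning null to the non-nullable name property would be rejected at compile time, which is the check the null safety item refers to.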

At this stage, the team was more confident using Kotlin and decided to proceed to the next phase and develop new microservices using it.

Tooling changes

During the second phase, the team adopted Kotlin-specific tooling:

Detekt: A static code analysis tool for Kotlin.
ktlint: A Kotlin linter with a built-in formatter.
MockK: A mocking library for Kotlin.
Koin: A dependency injection framework written in pure Kotlin.

AWSSSMChaosRunner library

Running failure injection tests (chaos testing) against the newly developed microservices was considered critical. This requirement led the team to develop the AWSSSMChaosRunner library (open sourced under the Apache-2.0 license). A deep dive into failure injection on AWS using this library can be found in the AWS open source blog post “Building resilient services at Prime Video with chaos engineering”.

Because the services under test were written in Kotlin, the AWSSSMChaosRunner library was implemented in Kotlin as well. Given Kotlin’s Java interoperability, this library can also be used in Java packages without any modifications.

Examples of Kotlin features used in the library are:

Data class

The failure injection configuration is specified as a Kotlin data class. Marking a class with the data keyword leads the Kotlin compiler to auto-generate toString(), equals(), and hashCode(), plus the additional copy() and componentN() functions, from the properties declared in the primary constructor. A Kotlin data class is compact compared to a Plain Old Java Object (POJO).

Usage in AWSSSMChaosRunner:

data class AttackConfiguration(
    val name: String,
    val duration: String,
    val timeoutSeconds: Int,
    val cloudWatchLogGroupName: String,
    val targets: List<Target>,
    val concurrencyPercentage: Int,
    val otherParameters: Map<String, String>
)
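
To show what the generated functions provide, here is a small sketch. AttackSettings is a hypothetical, trimmed-down stand-in for a configuration class like the one above, and the property values are made up for the example.

```kotlin
// Hypothetical, trimmed-down stand-in for a configuration data class.
data class AttackSettings(
    val name: String,
    val duration: String,
    val timeoutSeconds: Int
)

fun main() {
    // Illustrative values only.
    val base = AttackSettings(name = "CPUHogAttack", duration = "PT5M", timeoutSeconds = 60)

    // copy() creates a new instance with only the named properties changed.
    val longer = base.copy(timeoutSeconds = 120)

    // The componentN() functions enable destructuring declarations.
    val (name, duration, timeout) = longer

    // equals() compares property values rather than object references.
    println(base == base.copy()) // true
    println(longer)              // AttackSettings(name=CPUHogAttack, duration=PT5M, timeoutSeconds=120)
    println("$name runs for $duration with a $timeout second timeout")
}
```

An equivalent Java class would need hand-written or generated equals(), hashCode(), and toString() methods to behave the same way.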

When conditional

when defines a conditional expression with multiple branches. This is compact and slightly more readable compared to a Java switch statement.

Usage in AWSSSMChaosRunner:

fun getAttack(ssm: AWSSimpleSystemsManagement, configuration: AttackConfiguration): SSMAttack =
    when (configuration.name) {
        "NetworkInterfaceLatencyAttack" -> NetworkInterfaceLatencyAttack(ssm, configuration)
        "DependencyLatencyAttack" -> DependencyLatencyAttack(ssm, configuration)
        "DependencyPacketLossAttack" -> DependencyPacketLossAttack(ssm, configuration)
        "MultiIPAddressLatencyAttack" -> MultiIPAddressLatencyAttack(ssm, configuration)
        "MultiIPAddressPacketLossAttack" -> MultiIPAddressPacketLossAttack(ssm, configuration)
        "MemoryHogAttack" -> MemoryHogAttack(ssm, configuration)
        "CPUHogAttack" -> CPUHogAttack(ssm, configuration)
        "DiskHogAttack" -> DiskHogAttack(ssm, configuration)
        "AWSServiceLatencyAttack" -> AWSServiceLatencyAttack(ssm, configuration)
        "AWSServicePacketLossAttack" -> AWSServicePacketLossAttack(ssm, configuration)
        else -> throw NotImplementedError("${configuration.name} is not a valid SSMAttack")
    }
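
Because when is an expression, its result can be returned or assigned directly, and a branch can match several values or a range. The following self-contained sketch uses a hypothetical describe() function that is unrelated to the library.

```kotlin
// Hypothetical example: classify an HTTP status code.
fun describe(status: Int): String = when (status) {
    200, 201, 204 -> "success"    // several values in one branch
    in 300..399 -> "redirect"     // range check
    in 400..499 -> "client error"
    in 500..599 -> "server error"
    else -> "unknown"             // required here so the expression always yields a value
}

fun main() {
    println(describe(204)) // success
    println(describe(404)) // client error
    println(describe(42))  // unknown
}
```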

Kotlin in distributed systems

A key requirement for the new microservices (being designed to support Prime Video profiles) was supporting thousands of requests per second per host while maintaining high availability. The requests would be I/O heavy, so asynchronous (non-blocking) programming would improve application throughput by increasing parallelism and by keeping I/O waits from blocking subsequent execution.

This can be done in Java using Futures, but the code quickly becomes complex when there are multiple dependent calls and every failure scenario has to be handled. Kotlin coroutines improve on this. They can be thought of as lightweight threads, and coroutine-based code reads much like sequential code.

Let’s compare with an example. The following shows a block of synchronous code in Kotlin:

fun getProfile(id: String) : Profile {
    val avatar = avatarService.loadAvatar(id)
    val cacheEntry = cache.get(id)

    val profileInfo = cacheEntry.value ?: profileService.getProfileInfo(id)
    return Profile(profileInfo, avatar)
}

This code calls two different services synchronously. Thus, the cache call has to wait for loadAvatar to finish, although there are no dependencies between the two. This creates an unnecessary increase in latency.

This latency increase can be avoided using Futures:

fun getProfile(id: String): Profile {
    val avatarFuture = avatarService.loadAvatarFuture(id)
    val cacheFuture = cache.getFuture(id)
    val profileFuture = cacheFuture.thenApply { cacheEntry ->
        cacheEntry.value ?: profileService.getProfileInfo(id)
    }

    return profileFuture.thenCombine(avatarFuture) { profileInfo, avatar -> Profile(profileInfo, avatar) }.join()
}

The code above calls loadAvatar and the cache in parallel and, once the cache call finishes, checks whether the profile should be fetched from the data source. This approach avoids the latency increase, but the code is less readable, and it adds the cost of learning the CompletableFuture API.

Let’s take a look at the Kotlin coroutines implementation:

suspend fun getProfile(id: String) : Profile = coroutineScope {
    val avatarDeferred = async { avatarService.loadAvatar(id) }
    val cacheEntry = cache.get(id)

    val profileInfo = cacheEntry.value ?: profileService.getProfileInfo(id)
    Profile(profileInfo, avatarDeferred.await())
}

The above code is similar to the synchronous implementation. The differences are the suspend modifier on the function and the async coroutine builder used for the parallel call. The cache and loadAvatar calls run in parallel, and the avatar is retrieved when it is needed by calling await().
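
The coroutine version can be tried out with a self-contained sketch like the one below. It assumes the kotlinx-coroutines-core library is on the classpath. The service functions are hypothetical stand-ins that simulate I/O with delay(), and the delay values are arbitrary.

```kotlin
import kotlinx.coroutines.async
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking
import kotlin.system.measureTimeMillis

// Hypothetical stand-ins for the services in the example; delay() simulates I/O.
data class Profile(val info: String, val avatar: String)

suspend fun loadAvatar(id: String): String { delay(200); return "avatar-$id" }
suspend fun cacheGet(id: String): String? { delay(200); return null } // simulates a cache miss
suspend fun getProfileInfo(id: String): String { delay(100); return "info-$id" }

suspend fun getProfile(id: String): Profile = coroutineScope {
    val avatarDeferred = async { loadAvatar(id) }          // runs concurrently with the calls below
    val profileInfo = cacheGet(id) ?: getProfileInfo(id)   // cache first, data source on a miss
    Profile(profileInfo, avatarDeferred.await())           // suspends until the avatar is ready
}

fun main() = runBlocking {
    val elapsed = measureTimeMillis {
        println(getProfile("42")) // Profile(info=info-42, avatar=avatar-42)
    }
    // Run one after another, the three simulated calls would add up to about 500 ms.
    // Here the avatar call overlaps with the cache and data source calls.
    println("took about $elapsed ms")
}
```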

Comparing them side by side, the differences between both versions and the blocking version are clearer:

Screenshot highlighting the difference between blocking implementation versus futures and coroutines.

Another difference introduced in the coroutine example is coroutineScope. This function enables structured concurrency, which provides a hierarchical way of defining coroutine scopes. In the code sample above that uses Futures, if the avatar service (loadAvatar call) times out, the cache calls will still be performed, which wastes I/O resources. Structured concurrency ensures that when a coroutine in a scope fails, or the scope itself is cancelled, the remaining coroutines in that scope are cancelled, too. This approach makes the application more robust and decreases resource waste.
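
The cancellation behavior can be observed with a small sketch, again assuming kotlinx-coroutines-core is available. loadBoth() and the simulated calls are hypothetical. One call fails after a short delay while a slower sibling is still in flight.

```kotlin
import kotlinx.coroutines.CancellationException
import kotlinx.coroutines.async
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking

// Hypothetical scenario: the avatar call fails while the cache call is still running.
suspend fun loadBoth(events: MutableList<String>): String = coroutineScope {
    val cache = async {
        try {
            delay(1_000) // simulated slow cache call
            "cache entry"
        } catch (e: CancellationException) {
            events += "cache call cancelled" // reached when the sibling failure cancels this coroutine
            throw e                          // cancellation must be rethrown
        }
    }
    val avatar = async<String> {
        delay(100)
        throw IllegalStateException("avatar service timed out")
    }
    avatar.await() + cache.await()
}

fun main() = runBlocking {
    val events = mutableListOf<String>()
    try {
        loadBoth(events)
    } catch (e: IllegalStateException) {
        events += "caught: ${e.message}"
    }
    println(events)
}
```

In this sketch, coroutineScope waits for the cancelled cache coroutine to finish before rethrowing the failure, so the cancellation is expected to be recorded before the exception reaches the caller.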

Gotcha

As mentioned previously, coroutines are similar to lightweight threads, but ultimately they are executed on threads. We learned this the hard way. A coroutine scope lets you specify the Dispatcher on which its coroutines run; if none is specified, they run on the Dispatcher of the parent scope. If none is selected in the root scope either, they run on the default Dispatcher (Dispatchers.Default), which is backed by a pool of threads equal in size to the number of CPU cores (with a minimum of two).

Normally, this would not be an issue, but if any call in a coroutine is blocking, the underlying thread is not released while waiting, and all available threads are quickly consumed. The obvious solution is to use a non-blocking client; if that is not possible, run the blocking calls on Dispatchers.IO or on a purpose-built Dispatcher.
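
One common way to apply this workaround is to wrap the blocking call in withContext(Dispatchers.IO). The sketch below assumes kotlinx-coroutines-core is available. blockingFetch() is a hypothetical client call, with Thread.sleep() standing in for blocking I/O.

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.withContext

// Hypothetical blocking client call; Thread.sleep stands in for blocking I/O.
fun blockingFetch(id: Int): String {
    Thread.sleep(100)
    return "result-$id"
}

// Run the blocking call on the IO dispatcher's thread pool so that it does not
// occupy one of the few threads backing Dispatchers.Default.
suspend fun fetch(id: Int): String = withContext(Dispatchers.IO) {
    blockingFetch(id)
}

fun main() = runBlocking {
    val results = coroutineScope {
        (1..20).map { id -> async { fetch(id) } }.awaitAll()
    }
    println(results.size) // 20
}
```

This only moves the blocking work to a larger thread pool. A non-blocking client remains the preferable fix where one is available.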

These kinds of bugs are hard to spot without representative load, which in many cases is only found in the production environment. The AWSSSMChaosRunner library was used to validate the fix for this issue by simulating a latency increase that re-created the behavior (described in a previous blog post).

Conclusion

We presented a brief overview of the Kotlin adoption by a software development team at Prime Video, along with examples of the most-used language features.

The Kotlin adoption has been a positive experience with long-lasting benefits, such as code reduction, increased application throughput, and higher developer satisfaction. The team’s Kotlin code base is more readable (compared to a similar Java code base), makes NullPointerExceptions easier to avoid, and benefits from a more robust type system. For this team of Java developers, the adoption of Kotlin was smooth, and the main lessons learned centered around the increased use of Kotlin coroutines.

Code reduction

The team’s Kotlin code base is more concise than a similar Java code base, which is important because less code leaves less room for bugs. A like-for-like comparison has not been done, but the observed reductions are similar to the figure quoted in the Kotlin FAQ: approximately a 40 percent reduction in lines of code.

Kotlin satisfaction surveys

We conducted an anonymous satisfaction survey to gauge the team’s happiness with adopting and using Kotlin. Eight Amazon employees took part in the survey. The results were positive, as seen in the figures below. Two of the questions (Figure 2) received a mixed response because of the steeper learning curve for Kotlin coroutines and the gotcha detailed previously (specifying and configuring the Dispatcher correctly).

Kotlin team satisfaction survey: Q&A.

Kotlin team satisfaction survey: Overall results.