Anatomy of CVE-2019-5736: A runc container escape!
This post is courtesy of Samuel Karp, Senior Software Development Engineer — Amazon Container Services.
On Monday, February 11, CVE-2019-5736 was disclosed. This vulnerability is a flaw in runc, which can be exploited to escape Linux containers launched with Docker, containerd, CRI-O, or any other user of runc. But how does it work? Dive in!
This concern has already been addressed for AWS, and no customer action is required. For more information, see the security bulletin.
A review of Linux process management
Before I explain the vulnerability, here’s a review of some Linux basics.
- Processes and syscalls
- What is /proc?
- Dynamic linking
Processes and syscalls
Processes form the core unit of running programs on Linux. Every launched program is represented by one or more processes. Processes contain a variety of data about the running program, including a process ID (pid), a table tracking in-use memory, a pointer to the currently executing instruction, a set of descriptors for open files, and so forth.
Processes interact with the operating system to perform a variety of operations (for example, reading and writing files, taking input, and communicating on the network) via system calls, or syscalls. The ones I’m interested in today involve creating other processes (typically through clone(2)) and changing the currently running program into something else (execve(2)).
File descriptors are how a process interacts with files, as managed by the Linux kernel. File descriptors are short identifiers (numbers) that are passed to the appropriate syscalls for interacting with files: read(2), write(2), close(2), and so forth.
Sometimes a process wants to spawn another process. That might be a shell running a program you typed at the terminal, a daemon that needs a helper, or even concurrent processing without threads. When this happens, the process typically uses the fork(2) or clone(2) syscalls.
These syscalls have some differences, but they both operate by creating another copy of the currently executing process and sharing some state. That state can include things like the memory structures (either shared memory segments or copies of the memory) and file descriptors.
After the new process is started, it’s the responsibility of both processes to figure out which one they are (am I the parent? Am I the child?). Then, they take the appropriate action. In many cases, the appropriate action is for the child to do some setup, and then execute the execve(2) syscall.
The following example shows the use of fork(2), in pseudocode:
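A minimal Python sketch of the fork-then-exec pattern, assuming a Linux host. The helper name `run_child` is hypothetical, and Python's `os.fork` and `os.execvp` are thin wrappers over the syscalls described here:

```python
import os
import sys


def run_child(argv):
    """Fork, exec argv in the child, and return the child's exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: fork() returned 0, so replace this process with argv.
        try:
            os.execvp(argv[0], argv)
        except OSError:
            pass
        # Reached only if the exec failed.
        os._exit(127)
    # Parent: fork() returned the child's pid, so wait for it to finish.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1


if __name__ == "__main__":
    # The child becomes a fresh interpreter that exits with status 3.
    print(run_child([sys.executable, "-c", "import sys; sys.exit(3)"]))
```

Both processes run the same code after `fork()`; the return value is the only thing that tells them apart.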
The execve(2) syscall instructs the Linux kernel to replace the currently executing program with another program, in place. When called, the Linux kernel loads the new executable as specified and passes it the specified arguments. Because this is done in place, the pid is preserved and a variety of other contextual information is carried over, including environment variables, the current working directory, and any open files.
Wait…open files? By default, open files are passed across the execve(2) boundary. This is useful in cases where the new program can’t open the file, for example if there’s a new mount covering the existing path. This is also the mechanism by which the standard I/O streams (stdin, stdout, and stderr) are made available to the new program.
While convenient in some use cases, it’s not always desired to preserve open file descriptors in the new program. This behavior can be changed by passing the O_CLOEXEC flag to open(2) when opening the file or by setting the FD_CLOEXEC flag with fcntl(2). Setting O_CLOEXEC or FD_CLOEXEC (which are both short for close-on-exec) prevents the new program from having access to the file descriptor.
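To make the close-on-exec behavior concrete, here is a small sketch, assuming Linux and Python 3. The helper names (`is_cloexec`, `set_cloexec`, `survives_exec`) are hypothetical. Note that Python itself opens descriptors as close-on-exec by default, so the sketch clears the flag explicitly to show inheritance:

```python
import fcntl
import os
import sys


def is_cloexec(fd):
    """Return True if FD_CLOEXEC is set on fd."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)


def set_cloexec(fd, enabled):
    """Set or clear FD_CLOEXEC on fd using fcntl(2)."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    if enabled:
        flags |= fcntl.FD_CLOEXEC
    else:
        flags &= ~fcntl.FD_CLOEXEC
    fcntl.fcntl(fd, fcntl.F_SETFD, flags)


def survives_exec(fd):
    """Fork and exec a new interpreter, then report whether fd is open in it."""
    probe = (
        "import os, sys\n"
        "try:\n"
        "    os.fstat(int(sys.argv[1]))\n"
        "except OSError:\n"
        "    sys.exit(1)\n"
    )
    pid = os.fork()
    if pid == 0:
        try:
            os.execv(sys.executable, [sys.executable, "-c", probe, str(fd)])
        except OSError:
            pass
        os._exit(127)  # reached only if the exec failed
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0


if __name__ == "__main__":
    fd = os.open("/dev/null", os.O_RDONLY | os.O_CLOEXEC)
    print("close-on-exec set, visible after exec:", survives_exec(fd))
    set_cloexec(fd, False)
    print("close-on-exec cleared, visible after exec:", survives_exec(fd))
    os.close(fd)
```

The same descriptor number is either closed by the kernel during the exec or handed to the new program untouched, depending only on that one flag.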
What is /proc?
/proc (see proc(5)) is a pseudo-filesystem that provides access to a number of Linux kernel data structures. Every process in Linux has a directory available for it called /proc/[pid]. This directory stores a bunch of information about the process, including the arguments it was given when the program started, the environment variables visible to it, and the open file descriptors.
The special files inside /proc/[pid]/fd describe the file descriptors that the process has open. They look like symbolic links (symlinks), and you can see the original path of the file, but they aren’t exactly symlinks. You can pass them to open(2) even if the original path is inaccessible and get another working file descriptor.
Another file inside /proc/[pid] is called exe. This file is like the ones in /proc/[pid]/fd except that it points to the binary program that is executing inside that process.
/proc/[pid] also has a companion directory, /proc/self. This directory is always the same as /proc/[pid] of the process that is accessing it. That is, you can always read your own /proc data from /proc/self without knowing your pid.
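The behavior of these /proc entries can be checked from any process. The following is a sketch, assuming Linux with /proc mounted; the function names are hypothetical. It deletes a file's path and then reopens the file through /proc/self/fd, and it reads the exe link for the current process:

```python
import os
import tempfile


def reopen_after_unlink(payload):
    """Write payload to a temp file, delete its path, then reopen it via /proc."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        os.unlink(path)  # the original path no longer exists
        # The kernel still reports where the file came from
        # (typically with a " (deleted)" suffix).
        target = os.readlink(f"/proc/self/fd/{fd}")
        # open(2) on the /proc entry yields a second, independent descriptor.
        fd2 = os.open(f"/proc/self/fd/{fd}", os.O_RDONLY)
        try:
            data = os.read(fd2, len(payload))
        finally:
            os.close(fd2)
        return target, data
    finally:
        os.close(fd)


def own_exe():
    """Return the path of the binary executing in this process."""
    return os.readlink("/proc/self/exe")


if __name__ == "__main__":
    print(reopen_after_unlink(b"hello"))
    print(own_exe())
```

A plain symlink to a deleted path would fail to open; the /proc entry works because it refers to the open file itself rather than to its name.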
Dynamic linking
When writing programs, software developers typically use libraries: collections of previously written code intended to be reused. Libraries can cover all sorts of things, from high-level concerns like machine learning to lower-level concerns like basic data structures or interfaces with the operating system.
In the code example above, you can see the use of a library through a call to a function that the library defines.
Libraries are made available to programs through linking: a mechanism for resolving symbols (types, functions, variables, etc.) to their definition. On Linux, programs can be statically linked, in which case all the linking is done at compile time and all symbols are fully resolved. Or they can be dynamically linked, in which case at least some symbols are unresolved until a runtime linker makes them available.
Dynamic linking makes it possible to replace some parts of the resulting code without recompiling the whole application. This is typically used for upgrading libraries to fix bugs, enhance performance, or address security concerns. In contrast, static linking requires recompiling and relinking each program that uses a given library to effect the same change.
On Linux, runtime linking is typically performed by ld-linux.so(8), which is provided by the GNU project toolchain. Dynamically linked libraries are specified by a name embedded into the compiled binary. This dynamic linker reads those names and then performs a search across a standard set of paths to find the associated library file (a shared object file, or .so).
The dynamic linker’s search path can be influenced by the LD_LIBRARY_PATH environment variable. The LD_PRELOAD environment variable can tell the linker to load additional, user-specified libraries before all others. This is useful in debugging scenarios to allow selective overriding of symbols without having to rebuild a library entirely.
Now that the cast of characters is set (fork(2), execve(2), open(2), proc(5), file descriptors, and linking), I can start talking about the vulnerability in runc.
runc is a container runtime. Like a shell, its primary purpose is to launch other programs. However, it does so after manipulating Linux resources like cgroups, namespaces, mounts, seccomp, and capabilities to make what is referred to as a “container.”
The primary mechanism for setting up some of these resources, like namespaces, is through flags to the clone(2) syscall that take effect in the new process. The target of the final execve(2) call is the program the user requested. With a container, the target of the final execve(2) call can be specified in the container image or through explicit arguments.
The CVE announcement states:
“The vulnerability allows a malicious container to […] overwrite the host runc binary […]. The level of user interaction is being able to run any command […] as root within a container [when creating] a new container using an attacker-controlled image.”
The operative parts of this are: being able to overwrite the host runc binary (that seems bad) by running a command (that’s…what runc is supposed to do…). Note too that the vulnerability is as simple as running a command and does not require running a container with elevated privileges or running in a non-default configuration.
Don’t containers protect against this?
Containers are, in many ways, intended to isolate the host from a given workload or to isolate a given workload from the host. One of the main mechanisms for doing this is through a separate view of the filesystem. With a separate view, the container shouldn’t be able to access the host’s files and should only be able to see its own. runc accomplishes this using a mount namespace and mounting the container image’s root filesystem as /. This effectively hides the host’s filesystem.
Even with techniques like this, things can pass through the mount namespace. For example, the /proc/cmdline file contains the running Linux kernel’s command-line parameters. One of those parameters typically indicates the host’s root filesystem, and a container with enough access (like CAP_SYS_ADMIN) can remount the host’s root filesystem within the container’s mount namespace.
That’s not what I’m talking about today, as that requires non-default privileges to run. The interesting thing today is that the /proc filesystem exposes a path to the original program’s file, even if that file is not located in the current mount namespace.
What makes this troublesome is that interacting with Linux primitives like namespaces typically requires you to run as root, somewhere. In most installations involving runc (including the default configuration in Docker, Kubernetes, containerd, and CRI-O), the whole setup runs as root.
runc must be able to perform a number of operations that require elevated privileges, even if your container is limited to a much smaller set of privileges. For example, namespace creation and mounting both require the elevated capability CAP_SYS_ADMIN, and configuring the network requires the elevated capability CAP_NET_ADMIN. You might see a pattern here.
An alternative to running as root is to leverage a user namespace. User namespaces map a set of UIDs and GIDs inside the namespace (including ones that appear to be root) to a different set of UIDs and GIDs outside the namespace. Kernel operations that are user-namespace-aware can delineate privileged actions occurring inside the user namespace from those that occur outside.
However, user namespaces are not yet widely employed and are not enabled by default. The set of kernel operations that are user-namespace-aware is still growing, and not everyone runs the newest kernel or user-space software.
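The UID mapping of a user namespace is visible in /proc/[pid]/uid_map, where each line holds three numbers: the first ID inside the namespace, the first ID outside, and the length of the range. Here is a sketch of how such a mapping translates an ID, assuming that file format; the function names and the sample mapping (0 to 100000, length 65536) are hypothetical illustrations:

```python
def parse_uid_map(text):
    """Parse uid_map content into (inside, outside, count) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            inside, outside, count = map(int, fields)
            entries.append((inside, outside, count))
    return entries


def to_outside_uid(entries, uid):
    """Translate a UID seen inside the namespace to the UID outside it."""
    for inside, outside, count in entries:
        if inside <= uid < inside + count:
            return outside + (uid - inside)
    return None  # the UID has no mapping in this namespace


def current_uid_map():
    """Read the mapping for the calling process (requires user namespace support)."""
    with open("/proc/self/uid_map") as f:
        return parse_uid_map(f.read())


if __name__ == "__main__":
    sample = parse_uid_map("0 100000 65536\n")  # hypothetical mapping
    print(to_outside_uid(sample, 0))  # root inside maps to an unprivileged UID outside
```

With a mapping like the sample, a process that appears to be root inside the container holds only the permissions of an ordinary UID on the host.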
/proc exposes a path to the original program’s file, and the process that starts the container runs as root. What if that original program is something important that you knew would run again… like runc?
runc’s job is to run commands that you specify. What if you specified /proc/self/exe? It would cause runc to spawn a copy of itself, but running inside the context of the container, with the container’s namespaces, root filesystem, and so on. For example, you could run the following command:
docker run --rm amazonlinux:2 /proc/self/exe
This, by itself, doesn’t get you far—runc doesn’t hurt itself.
Generally, runc is dynamically linked against some libraries that provide implementations for seccomp(2), SELinux, or AppArmor. If you remember from earlier, ld-linux.so(8) searches a standard set of file paths to provide these implementations at runtime. If you start runc again inside the container’s context, with its separate filesystem, you have the opportunity to provide other files in place of the expected library files. These can run your own code instead of the standard library code.
There’s an easier way, though. Instead of having to make something that looks like (for example) libseccomp, you can take advantage of a different feature of the dynamic linker: LD_PRELOAD. And because runc lets you specify environment variables along with the path of the executable to run, you can specify this environment variable, too.
With LD_PRELOAD, you can specify your own libraries to load first, ahead of the other libraries that get loaded. Because the original libraries still get loaded, you don’t actually have to have a full implementation of their interface. Instead, you can selectively override some common functions that you might want and omit others that you don’t.
So now you can inject code through LD_PRELOAD and you have a target to inject it into: runc, by way of /proc/self/exe. For your code to get run, something must call it. You could search for a target function to override, but that means inspecting runc’s code to figure out what could get called and how. Again, there’s an easier way. Dynamic libraries can specify a “constructor” that is run immediately when the library is loaded.
Using the “constructor” along with LD_PRELOAD and specifying the command as /proc/self/exe, you now have a way to inject code and get it to run. That’s it, right? You can now write to /proc/self/exe and overwrite runc!
Not so fast.
The Linux kernel does have a bit of a protection mechanism to prevent you from overwriting the currently running executable. If you open /proc/self/exe for writing, you get -ETXTBSY. This error code indicates that the file is busy, where “TXT” refers to the text (code) section of the binary.
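This protection can be observed without modifying anything. The sketch below (hypothetical function name, assuming Linux) tries to open the running interpreter's own executable for writing and reports the error name. Run as root, the expected result is ETXTBSY. An unprivileged user will usually see EACCES instead, because the ordinary permission check fails first, and a read-only filesystem reports EROFS:

```python
import errno
import os


def try_open_own_exe_for_write():
    """Try to open the running executable for writing.

    Returns the errno name on failure, or None if the open succeeded.
    """
    try:
        fd = os.open("/proc/self/exe", os.O_WRONLY)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    # Opened without O_TRUNC and never written to, so nothing is modified.
    os.close(fd)
    return None


if __name__ == "__main__":
    print(try_open_own_exe_for_write())
```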
You know from earlier that execve(2) is a mechanism to replace the currently running executable with another, which means that the original executable isn’t in use anymore. So instead of just having a single library that you load with LD_PRELOAD, you also must have another executable that can do the dirty work for you, which you can start with execve(2).
Normally, doing this would still be unsuccessful due to file permissions. Executables are typically not world-writable. But because runc runs as root and does not change users, the new runc process that you started through /proc/self/exe and the helper program that you executed are also run as root.
After you gain write access to the runc file descriptor and you’ve replaced the currently executing program with execve(2), you can replace runc’s content with your own. The other software on the system continues to start runc as part of its normal operation (for example, creating new containers, stopping containers, or performing exec operations inside containers). Your code has the chance to operate instead of runc. When it gets run this way, your code runs as root, in the host’s context instead of in the container’s context.
Now you’re done! You’ve successfully escaped the container and have full root access.
Putting that all together, you get something like the following pseudocode:
Pseudocode for preload.so
Pseudocode for the rewrite program
How does the patch work?
If you try the same approach with a patched runc, you instead see that opening the file with O_RDWR is denied. This means that the patch is working!
The runc patch operates by taking advantage of some Linux kernel features introduced in kernel 3.17, specifically a syscall called memfd_create(2). This syscall creates a temporary memory-backed file and a file descriptor that can be used to access the file. This file descriptor has some special semantics: It is automatically removed when the last reference to it is dropped. It’s also in memory, so that just equates to freeing the memory. It supports another useful feature: file sealing. File sealing allows the file to be made immutable, even to processes that are running as root.
The runc patch changes the behavior of runc so that it creates a copy of itself in one of these temporary file descriptors, and then seals it. The next time a process launches (via fork(2)) or a process is replaced (via execve(2)), /proc/self/exe will be this sealed, memory-backed file descriptor. When your rewrite program attempts to modify it, the Linux kernel prevents it as it’s a sealed file.
Could I have avoided being vulnerable?
Yes, a few different mechanisms were available before the patch that provided mitigation for this vulnerability. The one that I mentioned earlier is user namespaces. Mapping to a different user namespace inside the container would mean that normal Linux file permissions would effectively prevent runc from becoming writable because the compromised process inside the container is not running as the real root user.
Another mechanism, which is used by Google Container-Optimized OS, is to have the host’s root filesystem mounted as read-only. A read-only mount of the runc binary itself would also prevent the runc binary from becoming writable.
SELinux, when correctly configured, may also prevent this vulnerability.
A different approach to preventing this vulnerability is to treat the Linux kernel as belonging to a single tenant, and spend your effort securing the kernel through another layer of separation. This is typically accomplished using a hypervisor or a virtual machine.
Amazon invests heavily in this type of boundary. Amazon EC2 instances, AWS Lambda functions, and AWS Fargate tasks are secured from each other using techniques like these. Amazon EC2 bare-metal instances leverage the next-generation Nitro platform that allows AWS to offer secure, bare-metal compute with a hardware root-of-trust. Along with traditional hypervisors, the Firecracker virtual machine manager can be used to implement this technique with function- and container-like workloads.
The original researchers who discovered this vulnerability have published their own post, CVE-2019-5736: Escape from Docker and Kubernetes containers to root on host. They describe how they discovered the vulnerability, as well as several other attempts.
I’d like to thank the original researchers who discovered the vulnerability for reporting responsibly and the OCI maintainers (and Aleksa Sarai specifically) who coordinated the disclosure. Thanks to Linux distribution maintainers and cloud providers who made updated packages available quickly. I’d also like to thank the Amazonians who made it possible for AWS customers to be protected:
- AWS Security who ran the whole process
- Clare Liguori, Onur Filiz, Noah Meyerhans, iliana weller, and Tom Kirchner who performed analysis and validation of the patch
- The Amazon Linux team (and iliana weller specifically) who backported the patch and built new Docker RPMs
- The Amazon ECS, Amazon EKS, and AWS Fargate teams for making patched infrastructure available quickly
- And all of the other AWS teams who put in extra effort to protect AWS customers