Bottlerocket: a special-purpose container operating system
On March 10, 2020, we introduced Bottlerocket, a new special-purpose operating system designed for hosting Linux containers. In this post, I want to take you through some of the goals we started with, engineering choices we made along the way, and our vision for how the OS will continue to evolve in the future.
In 2014, we launched Amazon Elastic Container Service (ECS), an orchestration service for Linux containers. Along with the service, we launched a pre-configured and ready-to-use operating system for hosting containers: the Amazon ECS-optimized AMI. This AMI was optimized for ECS in two ways. First, it had all the necessary software installed to run Docker containers with ECS, and would be ready to go as soon as it booted. And second, it was based on a somewhat stripped-down version of the Amazon Linux AMI, with the goals of reducing unnecessary software that had to be maintained and conserving disk space. However, this AMI was still based on a general-purpose operating system designed for running traditional software applications outside of containers.
In 2017, when we launched Amazon Elastic Kubernetes Service (EKS), we did the same thing, launching the Amazon EKS-optimized AMI as a pre-configured and ready-to-use operating system for hosting Kubernetes pods. Like the Amazon ECS-optimized AMI, the Amazon EKS-optimized AMI had all the necessary software installed to run pods with EKS. And like the Amazon ECS-optimized AMI, this AMI was still based on a general-purpose operating system designed for running traditional software applications outside of containers.
The Bottlerocket project started as the result of lessons we’ve learned over a long time running production services at scale in Amazon, and is colored by the lessons we’ve learned over the past six years about how to run containers. Along with internal experience and feedback from engineers at Amazon, customers gave us a broad set of container-specific feedback about the ECS-optimized AMI, the EKS-optimized AMI, and other container-focused operating systems. A few themes have stood out and led us to building what has become Bottlerocket: enhancing security, ensuring the instances in the cluster are identical, and having good operational behaviors and tooling. We believe that Bottlerocket improves each of these situations, and we’re looking to make it even better in the future!
It’s also important to recognize that Bottlerocket isn’t the first operating system to have made some of these choices; like many new software projects, Bottlerocket stands on the shoulders of those that came before. In designing and building Bottlerocket, we were inspired by traditional general-purpose Linux distributions as well as some container-focused operating systems like CoreOS Container Linux, Rancher OS, and Project Atomic. Some of the engineering choices we made have similarities to these operating systems, but we’ve tried to incorporate both what worked well and what could have worked better into our own designs.
Before we get too deep into technical details, I want to talk about how containers are typically used and why we see some consistent feedback about those themes. In any environment, booting a computer can take a while. But what’s harder than booting is deploying an arbitrary application to that computer, and doing so reliably. Containers make this process a lot easier. A container image provides a reliable and repeatable mechanism for packaging up the set of local dependencies for an application, including its dynamically linked libraries, other programs to invoke, and assets. The Linux kernel primitives that power containers, including cgroups and namespaces, provide some amount of resource and visibility isolation. Containers also start up much more quickly than a whole computer. These properties enable each application to pretend that it’s the only application running, enable subdividing larger computers into smaller parts so more of these applications can run together without conflict, and make it attractive to use one computer for running multiple applications or even a cluster of computers to run many copies of those applications.
The larger ecosystem of container orchestration enables some powerful properties for deploying and operating software systems. Container orchestrators provide tools and mechanisms for managing many copies of applications and many different applications on the same set of computers. Orchestrators also provide mechanisms and features like service discovery, network policy management, load balancing, application tracing, and more, all of which are popular pieces of a microservice-based architecture. However, running containers at a broader scale, across many computers, relies on those computers also being consistent, predictable, and secure. With Bottlerocket, we’re hoping to take the positive qualities of containers and drive those into the operating system that hosts those containers.
The container ecosystem has grown and thrived partly due to the larger open source community. Many of the core components for developing, running, and operating containers are open source, including Docker, containerd, Kubernetes, and Linux itself. We want Bottlerocket to fit well into the container ecosystem and are developing it as an open source project; check out the end of this post for how you can get involved!
I’d like to dig into some of the engineering choices we made to help support our goals around security, consistency, and operability. Many of the choices we made support multiple goals, so it’s not straightforward to categorize the choices by each goal. However, I am going to try to roughly order these choices around the primary goal they support.
I’ll start with security. The big concepts here are a reduced attack surface, verified software, and enforced permission boundaries.
Bottlerocket contains less software, and notably eliminates some components you might expect: Bottlerocket doesn’t have SSH, any interpreters like Python, or even a shell; we expect Bottlerocket to be “hands-off” most of the time, and we believe that removing components like this makes it harder for an attacker to gain a foothold in the system. (And there are mechanisms for troubleshooting and debugging covered below.) Beyond removal of software, Bottlerocket also reduces the attack surface of the operating system by applying software hardening techniques like building position-independent executables (PIE), using relocation read-only (RELRO) linking, and building all first-party software with memory-safe languages like Rust and Go.
Bottlerocket cryptographically verifies itself. The operating system is composed of a disk image that is verified on boot with dm-verity; unexpected changes to the contents of the disk image will cause the operating system to fail to boot. Bottlerocket uses its own software updater rather than a more common Linux package manager. Updates to Bottlerocket are vended from a repository that follows The Update Framework (TUF) specification; TUF mitigates common classes of attacks against software repositories present in traditional package manager systems.
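To make the idea of a verified disk image concrete, here is a minimal sketch of hash-tree verification in Python. It is only an illustration of the general technique: the block size, the binary tree shape, and the function names are assumptions made for this example, and dm-verity itself uses its own on-disk format and checks blocks as they are read rather than hashing the whole image up front.

```python
import hashlib
from typing import List

BLOCK_SIZE = 4096  # assumed block size for this sketch


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def root_hash(image: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    """Compute the root of a binary hash tree over the image's blocks."""
    blocks = [image[i:i + block_size] for i in range(0, len(image), block_size)]
    level: List[bytes] = [sha256(b) for b in blocks] or [sha256(b"")]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd-sized levels
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def verify(image: bytes, trusted_root: bytes) -> bool:
    """Return True only if the image still hashes to the trusted root."""
    return root_hash(image) == trusted_root
```

Changing any byte of the image changes the root hash, so a single trusted value is enough to detect tampering anywhere in the image.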
Bottlerocket uses SELinux in enforcing mode to restrict modifications to itself even from privileged containers. SELinux is an implementation of Mandatory Access Control (MAC) enforced by the Linux kernel, and limits the set of actions processes can take. Today, Bottlerocket’s SELinux policy is intended to restrict orchestrated containers from causing undesired and unexpected changes to the operating system. Going forward, we want to extend this policy to apply to all categories of persistent threats.
We want Bottlerocket to help enforce consistency in your environments; when you run a cluster of computers to run your containers, you should be able to run the same workloads on any of them. Bottlerocket primarily enforces consistency through three approaches: image-based updates, a read-only root filesystem, and API-driven configuration.
Most commonly used, general-purpose Linux distributions have an integrated package management system for installing and updating software. This makes the distributions very flexible; they can be used to run a variety of different workloads. However, when managing large fleets of hosts, this flexibility can be a downside: different packages and different versions of packages might be installed on each host, rendering them inconsistent with each other. The large variety of available packages in a package manager can also contribute to challenges; the combination of packages you install may have never been tested together. Bottlerocket is different here; there is no package manager with a wide selection of software to install. Instead, Bottlerocket uses a pre-constructed image that contains the software for the operating system, and it’s easy to run other software like diagnostic and observability tools in containers. When updates are available, Bottlerocket can download the entire new disk image and apply the update with a simple reboot. Image-based deployments ensure consistency: all the Bottlerocket hosts in your fleet can run the exact same software and you can be assured that the specific versions of each component included in a Bottlerocket image have been tested together.
Bottlerocket is designed to run containers and has an image-based deployment to ensure consistency. However, we recognize that there is not a one-size-fits-all set of software and configuration for every use-case of running containers. Today, Bottlerocket has support for running as nodes in a Kubernetes cluster on AWS. However, we want Bottlerocket to be able to run in different locations (like on a Raspberry Pi) and with different orchestrators (like Amazon ECS). Bottlerocket approaches this difference in requirements through a “variant” system, with a different image suited for different use-cases. The variant available at launch is published by AWS for use with Kubernetes 1.15 and is called aws-k8s-1.15. We plan to publish additional variants for other versions of Kubernetes as they become available in Amazon EKS as well as a variant for Amazon ECS. Bottlerocket also includes the tooling to build your own variant when you have your own needs.
Unlike traditional Linux distributions, the Bottlerocket operating system is configured with a read-only root filesystem. This is another mechanism to enforce consistency and reduce drift; applications are unable to modify the disk image and introduce changes from one host to another. When Bottlerocket downloads an update and is ready to install, the update is written to a secondary partition. On reboot, Bottlerocket’s bootloader understands how to boot into the correct partition, changing the primary and leaving the old version of the image available as a secondary. This same mechanism can be used for quickly rolling back, if you experience a problem with the update. Bottlerocket is also equipped with a separate, writable portion of the filesystem that is designed for persistent user data, like container images and volumes.
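The two-partition update and rollback flow described above can be sketched as a small model. The class and method names here are hypothetical and the version strings are made up; this does not reflect Bottlerocket's actual partition layout or bootloader logic, only the general active/inactive pattern.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PartitionSet:
    version: Optional[str]  # None means no image has been written to this slot


class TwoSlotDisk:
    """Hypothetical model of an active/inactive image layout."""

    def __init__(self, running_version: str) -> None:
        self.slots = [PartitionSet(running_version), PartitionSet(None)]
        self.active = 0  # index of the slot the bootloader will boot

    @property
    def inactive(self) -> int:
        return 1 - self.active

    def stage_update(self, new_version: str) -> None:
        """Write the new image to the inactive slot; the running slot is untouched."""
        self.slots[self.inactive].version = new_version

    def reboot_into_other_slot(self) -> str:
        """Swap the active and inactive slots, as would happen on the next boot."""
        target = self.slots[self.inactive].version
        if target is None:
            raise RuntimeError("no image in the inactive slot")
        self.active = self.inactive
        return target

    def rollback(self) -> str:
        """Return to the previous image, which is still on disk in the other slot."""
        return self.reboot_into_other_slot()
```

Because the previous image is never overwritten during an update, rolling back is the same slot swap in the other direction.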
It’s relatively common to store software configuration settings on Linux in the /etc directory. Bottlerocket has /etc for compatibility, but exposes it as a memory-backed temporary filesystem that is regenerated on every boot. Instead of persisting configuration there and potentially allowing applications to mutate the configuration of Bottlerocket, Bottlerocket exposes an API for configuration that supports rich semantics around structured settings, transactions, and automatic migrations. The API is accessible from the Bottlerocket “control” container via AWS Systems Manager for interactive changes, but can also be configured programmatically. If you’re using Bottlerocket on EC2, you can also set configuration using TOML-formatted user data.
There are also some settings that Bottlerocket knows how to generate on its own. Early in the boot process, Bottlerocket configures itself with data not known until boot like hostname and network configuration. When using the aws-k8s-1.15 variant of Bottlerocket, a helper program runs to configure Kubernetes-specific settings like the cluster DNS settings and the name of the pause container image. You can override these settings using the API, or if you’re using Bottlerocket on EC2, using TOML-formatted user data.
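To illustrate what transactional, structured settings can look like, here is a small in-memory sketch. The SettingsStore class is hypothetical and is not the Bottlerocket API; the dotted key names are used only as examples of structured settings. The point is that changes are staged first and then applied together, so live settings never reflect a half-applied set of changes.

```python
import copy
from typing import Any, Dict


class SettingsStore:
    """Hypothetical in-memory model of a transactional settings store."""

    def __init__(self, initial: Dict[str, Any]) -> None:
        self._live = copy.deepcopy(initial)
        self._pending: Dict[str, Any] = {}

    def set(self, key: str, value: Any) -> None:
        """Stage a change under a dotted key; live settings are not touched."""
        self._pending[key] = value

    def get(self, key: str) -> Any:
        """Read a live (committed) setting by dotted key."""
        node: Any = self._live
        for part in key.split("."):
            node = node[part]
        return node

    def commit(self) -> None:
        """Apply all staged changes together, then clear the transaction."""
        updated = copy.deepcopy(self._live)
        for key, value in self._pending.items():
            *parents, leaf = key.split(".")
            node = updated
            for part in parents:
                node = node.setdefault(part, {})
            node[leaf] = value
        self._live = updated
        self._pending.clear()

    def discard(self) -> None:
        """Drop staged changes without applying them."""
        self._pending.clear()
```

A generated default, such as a cluster DNS address computed at boot, could be overridden by staging a new value for the same key and committing it.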
The last goal I want to talk about today is operability. Bottlerocket is different from other Linux-based operating systems, but it does have facilities for regular operations like software updates and for troubleshooting. Bottlerocket behaves in well-defined ways and has settings for changing its behavior. It has mechanisms for performing automatic software updates, including integration with Kubernetes for reducing disruption with coordinated node cordoning and draining. It has tools for regular management tasks like changing settings and manually installing software updates, but it also has tools for emergency scenarios when you really want extra capabilities.
Bottlerocket’s update capability is facilitated by a few different components. First, there is a TUF-based repository that contains the updated image and signatures that cover the integrity of the image as well as the integrity of the repository itself. Second, there’s Bottlerocket’s on-host tool for interacting with the repository and retrieving updates, called updog. Updog has the ability to query for updates and apply updates to Bottlerocket immediately. However, updog defaults to using a “wave”-based update strategy; waves provide a mechanism for updates to become available to different hosts in your cluster at different times rather than every host seeing updates immediately. This reduces the chance of all your hosts attempting to update at the same time, causing disruption to your container-based workloads, and gives you the opportunity to stop updates if you find that they introduce a problem. Each host will assign itself to a random wave at boot, though this is configurable.
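As a rough sketch of how a wave-based rollout can work, the example below gives each host a random position in the rollout and makes an update visible only once a wave covering that position has started. The schedule, the use of a fraction between 0 and 1, and the function names are all assumptions for illustration; updog's actual wave format and seed range may differ.

```python
import random
from typing import List, Optional, Tuple

# Hypothetical schedule: (hours after release, fraction of the fleet eligible by then)
WAVES: List[Tuple[int, float]] = [(0, 0.05), (24, 0.25), (72, 1.0)]


def pick_seed(rng: Optional[random.Random] = None) -> float:
    """Pick a host's position in the rollout once, for example at first boot."""
    return (rng or random).random()  # float in [0.0, 1.0)


def update_visible(seed: float, hours_since_release: float) -> bool:
    """Return True once a wave covering this host's seed has started."""
    for start_hour, fraction in WAVES:
        if hours_since_release >= start_hour and seed < fraction:
            return True
    return False
```

With this schedule, a host whose seed is 0.10 sees nothing at release time and becomes eligible 24 hours later, which leaves a window to halt the rollout if the earliest hosts report problems.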
Bottlerocket’s update capability can also be integrated with container orchestrators. As part of the preview launch, Bottlerocket comes with a Kubernetes operator that you can deploy to your cluster to perform updates using updog. The operator will ensure that only one host in your cluster gets updated at a time, and will handle cordoning and draining the pods from the host before the update is applied. The updater is in a fairly early stage of development, and we welcome input into how its functionality should be expanded.
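The one-host-at-a-time behavior can be pictured as a simple plan. This is a hypothetical sketch, not the operator's code: the function name, the needs_update callback, and the final uncordon step are assumptions added for illustration.

```python
from typing import Callable, Iterable, List, Tuple

Step = Tuple[str, str]  # (action, node name)


def plan_rolling_update(
    nodes: Iterable[str],
    needs_update: Callable[[str], bool],
) -> List[Step]:
    """Plan a serial update: each node completes all steps before the next begins."""
    plan: List[Step] = []
    for node in nodes:
        if not needs_update(node):
            continue
        plan.append(("cordon", node))    # stop scheduling new pods onto the node
        plan.append(("drain", node))     # evict pods so they can run elsewhere
        plan.append(("update", node))    # apply the update and reboot
        plan.append(("uncordon", node))  # assumed final step: allow scheduling again
    return plan
```

Handling nodes strictly in sequence trades rollout speed for a bound on how much cluster capacity is unavailable at any moment.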
Because Bottlerocket does not have SSH installed, a different mechanism is needed to control the operating system, interact with the API, and “break-glass” into an administrative mode. Bottlerocket has two tools for this: a “control” container for typical expected maintenance tasks like changing settings, and an “admin” container for emergency use. The control container is launched on boot and contains the Amazon SSM agent; you can interact with it using the AWS Systems Manager API. This control container has a program called apiclient to facilitate interaction with the Bottlerocket API and a small helper program called enable-admin-container, which automates the API calls needed to start the emergency admin container.
The admin container is meant for emergency use. It is launched with full privileges and is unconstrained, except by the SELinux profile applied to it. The admin container is based on the Amazon Linux 2 container image and has tooling that you would expect in a general-purpose Linux distribution. It has SSH installed and running; you can connect to it over Bottlerocket’s primary network interface using the SSH key specified when the instance was launched. It also has a tool called sheltie to transition the working context (Linux namespaces) into that of the host, so you can operate on the host from within the admin container. The admin container is not enabled by default, and we recommend keeping it disabled in production deployments of Bottlerocket.
Bottlerocket runs containers managed by an orchestrator and containers for local operations that we call “host containers.” These host containers include the control and admin containers described above. Bottlerocket uses two separate container runtime instances to run these: two different copies of containerd. This is done for three reasons. First, the orchestrated containers and host containers can have separate security requirements enforced by separate SELinux profiles. Second, the orchestrated containers can be launched by a different runtime (like Docker or CRI-O) than the host containers. And third, the orchestrated containers and host containers can have separate fault domains for configuration changes or failures in the container runtime. The control container is included by default and the admin container can be added when needed, but you can also use the host container system to run your own diagnostic, operational, and administrative tools on Bottlerocket.
Bottlerocket is in a preview phase right now, and we’re continuing to work on a number of enhancements before we make it generally available. We have a public roadmap, but I want to highlight a few individual details here.
A major theme both before Bottlerocket is generally available and further into the future is security. We’re happy with what we’ve done in Bottlerocket so far, but there is always an opportunity to continue to improve. Before Bottlerocket is generally available, our SELinux policies will be completed. We’re exploring ways to reduce the level of filesystem access to regular orchestrated containers, including potentially running the orchestrator’s copy of containerd in a separate mount namespace. We’re also taking a look at alternative methods of running containerized workloads, including inside microVMs with Firecracker for use-cases that require high degrees of isolation.
Bottlerocket supports Kubernetes today, but Bottlerocket is not meant to be a Kubernetes-only operating system. It’s on our roadmap to add support for Amazon ECS on Bottlerocket and to integrate similar behaviors around non-disruptive updates into Amazon ECS clusters. If there are other orchestrators that you want to see in Bottlerocket, come and get involved!
Bottlerocket is a fully open-source operating system. The operating system consists of existing open-source components like the Linux kernel and around 50 packages as well as new components written specifically for Bottlerocket (primarily in Rust and Go). The existing open-source components that Bottlerocket uses are licensed under their own original licenses, while all the Bottlerocket-specific components are licensed similarly to the Rust language: under the Apache 2.0 license or the MIT license at your choice.
Bottlerocket’s components are open-source as is its roadmap. Our intent is for Bottlerocket to be a collaborative community project, so you have the ability to contribute directly and to make your own customized versions. We will produce a set of official images and updates for our supported integrations like Amazon EKS and (in the future) Amazon ECS. However, we expect that there will be needs we can’t anticipate or support in our official images, and we want you to be able to build your own images and updates with the same set of tooling that we use.
You are welcome to get involved with Bottlerocket! Check out our GitHub repository for discussion via issues and contribution via pull request. We also have the #bottlerocket channel for informal interaction in the AWS Developer Slack; you can sign up here.
Bottlerocket is a very different operating system from traditional general-purpose Linux distributions, but we think the changes lead to long-term improvements in security and operations, and we hope that the tools we’ve built into Bottlerocket (including “break-glass” mechanisms like the admin container) will ease the transition.
We hope you have the opportunity to play around with the preview of Bottlerocket today, and we’re always happy to hear your feedback!