Preventing Kubernetes misconfigurations using Datree

David Feldstein, Sr. Containers Specialist, AWS; co-authored with Shimon Tolts, AWS Community Hero, CEO & Co-founder


Kubernetes has taken the world by storm: according to the Cloud Native Computing Foundation's (CNCF) 2021 Annual Survey, 96% of organizations are either using or evaluating Kubernetes.

Kubernetes is a production-grade container orchestration platform that runs on most cloud vendors and on-prem. To be infrastructure agnostic, Kubernetes has taken the approach of being very flexible, which allows it to serve almost any workload on any infrastructure.

However, this approach comes with increased configuration complexity. You can configure Kubernetes however you’d like, and that means you can also easily misconfigure it. Furthermore, all operations are recommended to be performed via Infrastructure-as-code (IaC), which brings great benefits but adds another layer of complexity.

Organizations that adopt Kubernetes quickly become overwhelmed by the sheer volume of changes being made to workloads. When trying to delegate IaC responsibilities to engineers in the organization, they face knowledge gaps, because most engineers are neither infrastructure nor Kubernetes experts. Consequently, mistakes happen when implementing new technology. According to a recent Kubernetes security survey by Red Hat, 93% of respondents experienced an incident in their Kubernetes environments in the last 12 months, and 53% of these incidents were caused by a misconfiguration. Misconfiguration is the primary concern organizations report about their container and Kubernetes environments.

Kubernetes misconfiguration examples

The following examples are the most common types of misconfigurations, presented by area:

Resource management

As per the Amazon EKS best practices guide, it's recommended to configure central processing unit (CPU) and memory requests and limits for each container, both from a capacity-management and a security perspective. A pod without requests or limits can theoretically consume all the resources available on a host. As additional pods are scheduled onto a node, the node may experience CPU or memory pressure that can cause the kubelet to terminate or evict pods from the node. While you can't prevent this from happening entirely, setting requests and limits helps minimize resource contention and mitigates the risk from poorly written applications that consume excessive resources.

The following code snippet provides an example of setting resource limits and requests in the Kubernetes pod specification:

  - name: app
    image: nginx:1.19.8
    resources:
      requests:
        memory: "128Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "250m"

Not restricting which registries container images can be pulled from

Only allow pulling images from a pre-approved private container image registry, for example Amazon Elastic Container Registry (Amazon ECR), which can be configured to prevent images from being downloaded from untrusted registries.
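As a minimal sketch, a pod specification that pulls its image from a private Amazon ECR repository might look like the following (the account ID, region, repository name, and tag are illustrative placeholders, not values from this post):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    # Image pulled from a pre-approved private ECR registry.
    # Account ID, region, repository, and tag are placeholders.
    image: 111122223333.dkr.ecr.us-east-1.amazonaws.com/my-app:1.0.0
```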

Operational stability

Ensure each container has a configured readiness probe. The readiness probe signals to Kubernetes when a pod is ready to be placed behind the load balancer and serve traffic. If you put an application behind the load balancer before it's ready, a user can reach the pod but won't get the expected response of a healthy server.

Ensure each container has a configured liveness probe. The liveness probe lets Kubernetes know whether the pod is in a healthy state; if it isn't, Kubernetes should restart it.

The following code snippet provides an example of setting the liveness probe and readiness probe on the Kubernetes pod specification:

  - name: app
    image: nginx:1.19.8
    livenessProbe:
      httpGet:
        path: /health
        port: 8000
    readinessProbe:
      httpGet:
        path: /health
        port: 8000

Ensure each container image has a digest. To ensure the container always uses the same version of the image, you can specify its digest. The digest uniquely identifies a specific version of the image, so it will never be updated by Kubernetes unless you change the digest value.

The following code snippet provides an example of setting the image digest tag in the Kubernetes pod specification:

  - name: app
    image: nginx@sha256:79c77eb7ca32f9a117ef91bc6ac486014e0d0e75f2f06683ba24dc298f9f4dd4

Ensure each Deployment has more than one replica configured. Running multiple replica pods of an application using a Deployment helps it run in a highly available manner. If one replica fails, the remaining replicas still function, albeit at reduced capacity, until Kubernetes creates another pod to make up for the loss. Furthermore, you can use the Horizontal Pod Autoscaler to scale replicas automatically based on workload demand.

The following code snippet provides an example of setting the number of replicas for the Kubernetes Deployment specification:

kind: Deployment
spec:
  replicas: 2
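Building on the Horizontal Pod Autoscaler mentioned above, a minimal HorizontalPodAutoscaler sketch might look like the following (the Deployment name, replica bounds, and CPU threshold are illustrative assumptions):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app            # illustrative Deployment name
  minReplicas: 2         # never fall below two replicas
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # scale out above 70% average CPU
```

With this in place, Kubernetes adjusts the replica count between the stated bounds based on observed CPU utilization.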


Prevent containers from having root access capabilities. The National Security Agency (NSA) and the Cybersecurity and Infrastructure Security Agency (CISA) encourage developers to build container applications that execute as a non-root user. Having non-root execution integrated at build time provides better assurance that applications function correctly without root privileges. Therefore, it's recommended for containers to run with the least privileges possible.

The following code snippet provides an example of removing root capabilities in the Kubernetes pod specification:

kind: Deployment
spec:
  template:
    spec:
      containers:
      - name: myContainer
        securityContext:
          runAsNonRoot: true


Prevent the usage of deprecated APIs. As Kubernetes evolves, its APIs are periodically reorganized or modified; when that happens, the old API is deprecated and eventually removed. Using deprecated Kubernetes APIs adds additional maintenance effort when preparing for Kubernetes upgrades, and if a Kubernetes API is no longer available, this impacts deployments and workload availability.
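For example, the Deployment resource moved out of the extensions/v1beta1 API group, which was removed in Kubernetes 1.16; manifests must use apps/v1 instead:

```yaml
# Deprecated API group for Deployments, removed in Kubernetes 1.16.
# Clusters on newer versions will reject this apiVersion:
#   apiVersion: extensions/v1beta1

# Current, supported API group:
apiVersion: apps/v1
kind: Deployment
```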

How organizations are managing Kubernetes policies today

The Manual Approach:

Identifying misconfigurations that caused production outages, sending emails, and writing wikis with policy rules for people to follow. This approach involves:

  1. Identifying misconfigurations
  2. Finding best practices and solutions to past mistakes
  3. Distributing the desired policy amongst engineers
  4. Keeping the policy up to date
  5. Continuously checking – security is a continuous process!

The Managed approach:

Locking down the Kubernetes environment: engineers no longer have access and must rely on a centralized team to create new resources and change existing ones. This frustrates operations and engineering teams alike, because a single team becomes a bottleneck.

The Silos Approach:

Anyone can do anything and has access to all clusters. This approach brings chaos, because there are no guardrails or standards for engineers to follow, which causes misconfigurations that lead to production incidents and security issues.

Solution overview

A centralized policy enforcement approach: from Dev to Production

Datree is an open-source scanner that inspects Kubernetes configurations for misconfigurations.

Datree prevents misconfigurations by blocking resources that do not meet your policy. Datree comes with built-in rules, so you don't have to codify your policies yourself. Dozens of rules are ready across various areas: Container, Workload, CronJob, Network, Security, Deprecation, Argo, NSA-hardening-guidelines, and more.

Datree can be used on the command line, as an admission webhook, or even as a kubectl plugin to run policies against Kubernetes objects.

Solution Walkthrough

Installing Datree on a Kubernetes cluster:

Initially, after installation, Datree will NOT block any workloads; it will only provide insights. The webhook catches create, apply, and edit operations and initiates a policy check against the configurations associated with each operation. If any misconfigurations are found, the webhook can be configured to reject the operation and display detailed output with instructions on how to resolve each misconfiguration.

1. Add the Datree Helm repository​

Run the following command in your terminal:

helm repo add datree-webhook <DATREE_HELM_REPO_URL> && helm repo update

2. Install Datree on your cluster​

Replace <TOKEN> with the token from your dashboard, and run the following command in your terminal:

helm install -n datree datree-webhook datree-webhook/datree-admission-webhook --debug \
--create-namespace \
--set datree.token=<TOKEN>

This will create a new namespace (datree), where Datree’s services and application resources will reside. datree.token is used to connect your dashboard to your cluster. Note that the installation can take up to 5 minutes.

3. You’re all set! 🎉​

Datree will now run in the background, scanning your cluster for misconfigurations. A report will be sent to the email connected to your account.

Sample output of applying Datree to k8s cluster

Using the Datree CLI & CI integration

To get started:

  1. Choose your OS, for example macOS, and run the installation command:
curl <DATREE_INSTALL_SCRIPT_URL> | /bin/bash
  2. Run a policy check against a Kubernetes manifest:
datree test ~/.datree/k8s-demo.yaml

You’ll get the following output in your command line interface (CLI):

Sample output of a datree CLI run
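The same policy check can run in your CI pipeline. As an illustrative sketch (GitHub Actions syntax; the manifest path is a placeholder, not from this post), a pipeline step might look like:

```yaml
# Hypothetical CI step running the Datree policy check
# against the repository's Kubernetes manifests.
- name: Run Datree policy check
  run: datree test ./kubernetes/*.yaml
```

Because `datree test` exits with a non-zero status when a policy check fails, the CI job fails and the misconfigured manifest never reaches the cluster.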

Datree integrates in all development stages:

The following image represents Datree integration phases during the software development lifecycle.

Datree runs across your entire pipeline: locally, in CI/CD, and in production

Where does the policy check occur?

Datree’s policy evaluation process is entirely local. Your files and their contents are not sent to Datree backend, as the CLI performs the policy check on your local machine. Only metadata is sent to Datree backend, which is used to display your policy check history on your dashboard.

Infographic showing Datree’s workflow

How we make sure that everyone has the same policy

Centralized policy – configure once, propagate everywhere

By using a centralized policy enforcement solution, organizations enjoy the agility of the silo approach while gaining the confidence of the managed approach.

Datree is not only a policy enforcement solution, but also a policy management solution. It has a SaaS platform that allows Kubernetes administrators to customize their policy to fit their organizational needs. For example, many administrators create one policy for production and one for staging. Once a policy is updated in the SaaS, it automatically propagates to all the clients using it. That means that if you have numerous clusters and CI pipelines, you can make sure they all use the same policy.
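As an illustrative sketch of what a customized policy might look like when expressed as code (the policy name and rule identifiers below are examples of Datree's built-in rule naming, shown here as assumptions rather than an exhaustive or authoritative list):

```yaml
# Illustrative Datree policy-as-code sketch.
apiVersion: v1
policies:
  - name: production_policy      # example policy name
    isDefault: true
    rules:
      - identifier: CONTAINERS_MISSING_MEMORY_LIMIT_KEY
        messageOnFailure: "Missing property object `limits.memory`"
      - identifier: CONTAINERS_MISSING_LIVENESSPROBE_KEY
        messageOnFailure: "Missing property object `livenessProbe`"
```

Once such a policy is published centrally, every CLI run, CI job, and admission webhook connected to the account evaluates resources against the same rule set.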


In this post, we discussed the danger of Kubernetes misconfigurations and the two main approaches for dealing with them: managed and silos. Neither is a feasible way to delegate IaC responsibilities to engineers: the first cripples engineering teams, while the second puts the stability of the product in danger. Ultimately, the continued expansion of Kubernetes relies on finding a third approach that gives infrastructure teams the confidence that misconfigurations won't reach production, while giving engineers the means to deploy workloads independently.

This is what Datree offers. On the one hand, with the CLI tool, it gives engineers the means to validate their workloads before deploying them to production. On the other hand, with the Kubernetes admission webhook, it gives infrastructure teams the guardrails that ensure no misconfigurations enter their clusters. It empowers engineers, but does so responsibly, without giving up the governance of the infra teams.

You can learn more about Datree here.

David Feldstein

David Feldstein is a Senior Container Specialist at Amazon Web Services. David is a System & Software Architect with 17+ years of experience working with mission critical systems both in startups and large global corporates. David now focuses on Go To Market strategy for AWS Container Services and on helping customers build and develop highly scalable and resilient architectures in AWS environments.