AWS Open Source Blog

Centralized Container Logging with Fluent Bit

by Wesley Pettit and Michael Hausenblas

AWS is built for builders. Builders are always looking for ways to optimize, and this applies to application logging. Not all logs are of equal importance. Some require real-time analytics, others simply need to be stored long-term so that they can be analyzed if needed. It’s therefore critical to be able to easily route your logs to a wide variety of tools for storage and analytics provided by AWS and its partners.

That’s why we are supporting Fluent Bit to help create an easy extension point for streaming logs from containerized applications to AWS’ and partners’ solutions for log retention and analytics. With the newly launched Fluent Bit plugin for AWS container image, you can route logs to Amazon CloudWatch and Amazon Kinesis Data Firehose destinations (which include Amazon S3, Amazon Elasticsearch Service, and Amazon Redshift). In this post we will show you the Fluent Bit plugin in action on both Amazon ECS and EKS clusters. You might also want to check out the tutorial on the basics of Fluentd and the Kinesis Firehose, if you’re not familiar with the tooling itself, as well as review the relevant issues in the AWS containers roadmap, especially #10 and #66.

Introduction to log routing

Conceptually, log routing in a containerized setup such as Amazon ECS or EKS looks like this:

Log routing concept

On the left-hand side of the above diagram, the log sources are depicted (starting at the bottom):

  1. The host and control plane level is made up of EC2 instances, hosting your containers. These instances may or may not be accessible directly by you. For example, for containers running on Fargate, you will not see instances in your EC2 console. On this level you’d also expect logs originating from the EKS control plane, managed by AWS.
  2. The container runtime level commonly includes logs generated by the Docker engine, such as the agent logs in ECS. These logs are usually most useful to people in infrastructure admin roles, but can also assist developers in troubleshooting situations.
  3. The application level is where the user code runs. This level generates application-specific logs, such as a log entry on the outcome of an operation in your own app, or the app logs from off-the-shelf application components such as NGINX.

Next comes the routing component: this is Fluent Bit. It takes care of reading logs from all sources and routing log records to various destinations, also known as log sinks. This routing component needs to run somewhere, for example as a sidecar in a Kubernetes pod / ECS task, or as a host-level daemon set.

The downstream log sinks consume logs for different purposes and audiences. Use cases include log analysis, compliance (requiring that logs be stored for a given retention period), alerting when a human user needs to be notified of an event, and dashboards that provide a collection of (real-time) graphs to help human users absorb the overall state of the system at a glance.

With these basics out of the way, let’s now look at a concrete use case: centralized logging of a multi-cluster app using Fluent Bit. All the container definitions and configurations are available in the Amazon ECS Fluent Bit Daemon Service GitHub repo.

Centralized logging in action: multi-cluster log analysis

To show Fluent Bit in action, we will perform a multi-cluster log analysis across both an Amazon ECS and an Amazon EKS cluster, with Fluent Bit deployed and configured as daemon sets. The application-level logs generated by the NGINX apps running in each cluster are captured by Fluent Bit and streamed via Amazon Kinesis Data Firehose to Amazon S3, where we can query them using Amazon Athena:

Setup of the centralized logging demo app

Setup for Amazon ECS

Create an ECS on EC2 cluster with the following user data—in our case, in a file called enable-fluent-log-driver.sh (source)—to enable the Fluentd log driver in the ECS agent:

#!/bin/bash
echo "ECS_AVAILABLE_LOGGING_DRIVERS=[\"awslogs\",\"fluentd\"]" >> /etc/ecs/ecs.config

For example, we created the ECS on EC2 cluster like so; this step assumes that you have the ECS CLI installed:

$ ecs-cli up \
          --size 2 \
          --instance-type t2.medium \
          --extra-user-data enable-fluent-log-driver.sh \
          --keypair fluent-bit-demo-key \
          --capability-iam \
          --cluster-config fluent-bit-demo

Next, we need to build a container image containing the Fluent Bit configuration. We’ll do that by creating a Dockerfile (source) with the following content:

FROM amazon/aws-for-fluent-bit:1.2.0
ADD fluent-bit.conf /fluent-bit/etc/
ADD parsers.conf /fluent-bit/etc/

NOTE Counter to good security practice, no USER is defined, so the container runs as root. This is intentional: Fluent Bit currently requires root to run.

The above Dockerfile in turn depends on two configuration files:

  • the fluent-bit.conf file (source) defining the routing to the Firehose delivery stream, and
  • the parsers.conf file (source), defining the NGINX log parsing.
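
Neither configuration file is reproduced in this post, so as orientation here is a sketch of what fluent-bit.conf plausibly contains. It is an assumption pieced together from the rest of this post (the NGINX task definition points the fluentd log driver at unix:///var/run/fluent.sock, and the EKS config map shown later uses an analogous firehose output); consult the linked source files for the real contents:

```ini
[SERVICE]
    # Load the NGINX parser defined in parsers.conf
    Parsers_File  parsers.conf

[INPUT]
    # Receive records from the Docker fluentd log driver via a unix socket
    Name       forward
    unix_path  /var/run/fluent.sock

[FILTER]
    # Parse the raw NGINX access-log line stored under the "log" key
    Name      parser
    Match     **
    Parser    nginx
    Key_Name  log

[OUTPUT]
    # Stream parsed records to the Kinesis Data Firehose delivery stream
    Name             firehose
    Match            **
    delivery_stream  ecs-stream
    region           us-west-2
```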

Now, we’ll build our custom container image and push it to an ECR repository called fluent-bit-demo:

$ docker build --tag fluent-bit-demo:0.1 .
$ ecs-cli push fluent-bit-demo:0.1

Verify that your custom log routing image build and push was successful by visiting the ECR console; you should see something like this:

Amazon ECR repo with custom Fluent Bit container image

We’re now in a position to launch an ECS service with daemon scheduling strategy to deploy our custom-configured Fluent Bit into our cluster, using the above container image:

$ aws cloudformation deploy \
      --template-file ecs-fluent-bit-daemonset.yml \
      --stack-name ecs-fluent-bit-daemon-service \
      --parameter-overrides \
      EnvironmentName=fluentbit-daemon-service \
      DockerImage=XXXXXXXXXXXX.dkr.ecr.us-west-2.amazonaws.com/fluent-bit-demo:0.1 \
      Cluster=fluent-bit-demo \
      --region $(aws configure get region) \
      --capabilities CAPABILITY_NAMED_IAM

In the ECS console you should now see something like this:

Now we can launch an ECS service, running NGINX, based on the following task definition:

{
    "taskDefinition": {
        "taskDefinitionArn": "arn:aws:ecs:us-west-2:XXXXXXXXXXXX:task-definition/nginx:1",
        "containerDefinitions": [
            {
                "name": "nginx",
                "image": "nginx:1.17",
                "memory": 100,
                "essential": true,
                "portMappings": [
                    {
                        "hostPort": 80,
                        "protocol": "tcp",
                        "containerPort": 80
                    }
                ],
                "logConfiguration": {
                    "logDriver": "fluentd",
                    "options": {
                        "fluentd-address": "unix:///var/run/fluent.sock",
                        "tag": "logs-from-nginx"
                    }
                }
            }
        ],
        "family": "nginx"
    }
}

After creating the above task definition, you should now see the following in your ECS console:

And now we can launch the ECS service based on the above task definition:

$ aws ecs create-service \
      --cluster fluent-bit-demo \
      --service-name nginx-svc \
      --task-definition nginx:1 \
      --desired-count 1

If everything worked out, you should see something like the following in the ECS console:

Amazon ECS services

With this, we’ve set up the ECS part. Now we configure the same setup on our Kubernetes cluster running on Amazon EKS.

Setup for Amazon EKS

Create an Amazon EKS cluster named fluent-bit-demo using eksctl, as shown in the EKS docs, and then create a policy file called eks-fluent-bit-daemonset-policy.json (source) with the following content:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "firehose:PutRecordBatch"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "logs:PutLogEvents",
            "Resource": "arn:aws:logs:*:*:log-group:*:*:*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:DescribeLogStreams",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:*:*:log-group:*"
        },
        {
            "Effect": "Allow",
            "Action": "logs:CreateLogGroup",
            "Resource": "*"
        }
    ]
}

To attach this policy file to the EKS on EC2 worker nodes, execute the following sequence:

$ STACK_NAME=$(eksctl get nodegroup --cluster fluent-bit-demo -o json | jq -r '.[].StackName')

$ INSTANCE_PROFILE_ARN=$(aws cloudformation describe-stacks --stack-name $STACK_NAME | jq -r '.Stacks[].Outputs[] | select(.OutputKey=="InstanceProfileARN") | .OutputValue')

$ ROLE_NAME=$(aws cloudformation describe-stacks --stack-name $STACK_NAME | jq -r '.Stacks[].Outputs[] | select(.OutputKey=="InstanceRoleARN") | .OutputValue' | cut -f2 -d/)

$ aws iam put-role-policy \
    --role-name $ROLE_NAME \
    --policy-name FluentBit-DS \
    --policy-document file://eks-fluent-bit-daemonset-policy.json
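
The ROLE_NAME step works because an instance role ARN has the form arn:aws:iam::<account>:role/<name>, so splitting on "/" and taking the second field yields the bare role name. A self-contained illustration with a made-up ARN:

```shell
# Illustrate the `cut -f2 -d/` step above; the ARN here is made up,
# the real one comes from the CloudFormation stack outputs.
ARN='arn:aws:iam::123456789012:role/eksctl-fluent-bit-demo-NodeInstanceRole-1ABC'
ROLE_NAME=$(echo "$ARN" | cut -f2 -d/)
echo "$ROLE_NAME"
```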

And now we move on to defining the Kubernetes RBAC settings – that is, the service account the Fluent Bit pods will be using along with the role and role binding.

First create the service account fluent-bit (this is what we will later use in the daemon set) by executing kubectl create sa fluent-bit.

Next, define the role and binding in a file named eks-fluent-bit-daemonset-rbac.yaml (source):

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: pod-log-reader
rules:
- apiGroups: [""]
  resources:
  - namespaces
  - pods
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: pod-log-crb
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: pod-log-reader
subjects:
- kind: ServiceAccount
  name: fluent-bit
  namespace: default

Now, to make the access permissions for the Fluent Bit plugin effective, create the role and role binding defined above by executing kubectl apply -f eks-fluent-bit-daemonset-rbac.yaml.

In contrast to the ECS case, where we baked the configuration into a custom image, in our Kubernetes setup we’re using a config map to define the log parsing and routing for the Fluent Bit plugin. For this, use a file called eks-fluent-bit-configmap.yaml (source) with the following content:

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  labels:
    app.kubernetes.io/name: fluentbit
data:
  fluent-bit.conf: |
    [SERVICE]
        Parsers_File  parsers.conf
    [INPUT]
        Name              tail
        Tag               kube.*
        Path              /var/log/containers/*.log
        Parser            docker
        DB                /var/log/flb_kube.db
        Mem_Buf_Limit     5MB
        Skip_Long_Lines   On
        Refresh_Interval  10
    [FILTER]
        Name parser
        Match **
        Parser nginx
        Key_Name log
    [OUTPUT]
        Name firehose
        Match **
        delivery_stream eks-stream
        region us-west-2 
  parsers.conf: |
    [PARSER]
        Name   nginx
        Format regex
        Regex ^(?<remote>[^ ]*) (?<host>[^ ]*) (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")? \"-\"$
        Time_Key time
        Time_Format %d/%b/%Y:%H:%M:%S %z
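
If you want to sanity-check the NGINX regex before deploying, you can exercise it locally with grep in PCRE mode. The snippet below is our own illustration, not part of the setup; it uses a shortened version of the parser regex (referer/agent part omitted) against a sample combined-format log line:

```shell
# Shortened form of the nginx parser regex above, matched against a
# sample access-log line; grep prints the line if the regex matches.
regex='^(?<remote>[^ ]*) (?<host>[^ ]*) (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)'
sample='192.0.2.1 - - [08/Jul/2019:13:44:54 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "curl/7.61.1" "-"'
echo "$sample" | grep -P "$regex"
```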

Create this config map by executing the command kubectl apply -f eks-fluent-bit-configmap.yaml and then define the Kubernetes DaemonSet (using said config map) in a file called eks-fluent-bit-daemonset.yaml (source) with the following content:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentbit
  labels:
    app.kubernetes.io/name: fluentbit
spec:
  selector:
    matchLabels:
      name: fluentbit
  template:
    metadata:
      labels:
        name: fluentbit
    spec:
      serviceAccountName: fluent-bit
      containers:
      - name: aws-for-fluent-bit
        image: amazon/aws-for-fluent-bit:1.2.0
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: fluent-bit-config
          mountPath: /fluent-bit/etc/
        - name: mnt
          mountPath: /mnt
          readOnly: true
        resources:
          limits:
            memory: 500Mi
          requests:
            cpu: 500m
            memory: 100Mi
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: fluent-bit-config
        configMap:
          name: fluent-bit-config
      - name: mnt
        hostPath:
          path: /mnt

Finally, launch the Fluent Bit daemonset by executing kubectl apply -f eks-fluent-bit-daemonset.yaml and verify the Fluent Bit daemonset by peeking into the logs like so:

$ kubectl logs ds/fluentbit
Found 3 pods, using pod/fluentbit-9zszm
Fluent Bit v1.1.3
Copyright (C) Treasure Data

[2019/07/08 13:44:54] [ info] [storage] initializing...
[2019/07/08 13:44:54] [ info] [storage] in-memory
[2019/07/08 13:44:54] [ info] [storage] normal synchronization mode, checksum disabled
[2019/07/08 13:44:54] [ info] [engine] started (pid=1)
[2019/07/08 13:44:54] [ info] [in_fw] listening on unix:///var/run/fluent.sock
...
[2019/07/08 13:44:55] [ info] [sp] stream processor started

Next, deploy the following NGINX app via kubectl apply -f eks-nginx-app.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app.kubernetes.io/name: nginx
spec:
  replicas: 4 
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.17
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: nginx

With that, we’re done setting up the log sources and routing. Now let’s move on to actually doing something with all the log data we’re collecting from the NGINX containers running in ECS and EKS: we will perform a centralized analysis of the logs.

Log analysis across clusters

The goal is to do a log analysis of the NGINX containers running in the ECS and EKS clusters. For this, we’re using Amazon Athena, which allows us to interactively query the service log data from Amazon S3 using SQL. Before we can query the data in S3, however, we need to get the log data there.

Remember that in the Fluent Bit configurations for ECS and EKS (above) we set the output to delivery_stream xxx-stream. Each is an Amazon Kinesis Data Firehose delivery stream, which we first have to create: one for ECS and one for EKS.

First, set up the access control part by defining a policy that effectively allows Firehose to write to S3. To do this, we need to create a new IAM role with two policy files. First, firehose-policy.json (source):

{
  "Version": "2012-10-17",
  "Statement": {
      "Effect": "Allow",
      "Principal": {
        "Service": "firehose.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
  }
}

Second, in the firehose-delivery-policy.json policy file (source), replace the XXXXXXXXXXXX with your own account ID (if you’re unsure what it is, you can get the account ID by executing aws sts get-caller-identity --output text --query 'Account'). Also, in the S3 section, replace mh9-firelens-demo with your own bucket name.
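
The delivery policy file itself is not reproduced in this post. As a rough sketch of what it grants (action names and structure here are assumptions; use the linked source file as-is), it gives Firehose the usual S3 destination permissions on the bucket:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:AbortMultipartUpload",
        "s3:GetBucketLocation",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads",
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::mh9-firelens-demo",
        "arn:aws:s3:::mh9-firelens-demo/*"
      ]
    }
  ]
}
```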

Now we can create the firehose_delivery_role to use for both the ECS and the EKS delivery streams:

$ aws iam create-role \
        --role-name firehose_delivery_role \
        --assume-role-policy-document file://firehose-policy.json

From the resulting JSON output of the above command, note down the role ARN, which will be something in the form of arn:aws:iam::XXXXXXXXXXXXX:role/firehose_delivery_role. We will use this soon to create the delivery stream, but before that can happen we have to put in place the policy defined in the firehose-delivery-policy.json:

$ aws iam put-role-policy \
        --role-name firehose_delivery_role \
        --policy-name firehose-fluentbit-s3-streaming \
        --policy-document file://firehose-delivery-policy.json

Now create the ECS delivery stream:

$ aws firehose create-delivery-stream \
            --delivery-stream-name ecs-stream \
            --delivery-stream-type DirectPut \
            --s3-destination-configuration \
RoleARN=arn:aws:iam::XXXXXXXXXXXX:role/firehose_delivery_role,\
BucketARN="arn:aws:s3:::mh9-firelens-demo",\
Prefix=ecs

NOTE The spacing in above command matters: RoleARN etc. must be on one line without spaces.

Now we have to repeat the above for the EKS delivery stream, re-using the role created in the first step. (In other words, you only need to repeat the aws firehose create-delivery-stream command, replacing ecs-stream with eks-stream and Prefix=ecs with Prefix=eks.)

It will take a couple of minutes for the delivery streams to be created and active. When you see something like the following, you’re ready to move on to the next step:

Amazon Kinesis Firehose delivery streams

We now need to generate some load for the NGINX containers running in ECS and EKS. You can grab the load generator scripts for ECS and EKS and execute the commands below; this will curl the respective NGINX services every two seconds (executing in the background) until you kill the scripts:

$ ./load-gen-ecs.sh &
$ ./load-gen-eks.sh &

Now that we have some log data from the NGINX webservers, we can query the log entries in S3 from Athena. For this, we first have to create tables for ECS and EKS, telling Athena about the schema we’re using (shown here for the ECS log data; the same applies for EKS):

CREATE EXTERNAL TABLE fluentbit_ecs (
    agent string,
    code string,
    host string,
    method string,
    path string,
    referer string,
    remote string,
    size string,
    user string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://mh9-firelens-demo/ecs2019/'

NOTE Amazon Athena does not import or ingest data; it queries the data directly in S3. So, as log data arrives from the NGINX containers via Fluent Bit and the Firehose delivery stream in the S3 bucket, it is available for you to query using Athena.

Next create a consolidated view of both the ECS and EKS log entries with the following SQL statement:

CREATE OR REPLACE VIEW "fluentbit_consolidated" AS
SELECT * , 'ECS' as source
FROM fluentbit_ecs
UNION
SELECT * , 'EKS' as source
FROM fluentbit_eks

This allows us to merge the two tables (using the same schema) and add an additional column that flags the source, ECS or EKS. We can now perform a SQL query to figure out who the top 10 users of our NGINX services are, across the two clusters:

SELECT source,
         remote AS IP,
         count(remote) AS num_requests
FROM fluentbit_consolidated
GROUP BY  remote, source
ORDER BY  num_requests DESC LIMIT 10

This yields something like the following result:

That’s it! You’ve successfully set up the Fluent Bit plugin and used it across two different managed AWS container environments (ECS and EKS) to perform log analytics.

When you’re done, don’t forget to delete the respective workloads, including the Kubernetes NGINX service (which in turn removes the load balancer), and tear down the EKS and ECS clusters, destroying the containers with it. Last but not least, you will want to clean up the Kinesis delivery streams and the S3 bucket with the log data.

Looking ahead, we are also working on a feature to further simplify installing and configuring Fluent Bit plugins on AWS Fargate, Amazon ECS, and Amazon EKS. You can follow this feature via issue #10 of our AWS containers roadmap.

Notes on performance and next steps

To get a better sense of the performance, we ran a benchmark comparing the above Fluent Bit plugin with the Fluentd CloudWatch and Kinesis Firehose plugins. All our tests were performed on a c5.9xlarge EC2 instance. Here are the results:

CloudWatch Plugins: Fluentd vs Fluent Bit

Log lines per second   Data out   Fluentd CPU   Fluent Bit CPU   Fluentd memory   Fluent Bit memory
100                    25 KB/s    0.013 vCPU    0.003 vCPU       146 MB           27 MB
1000                   250 KB/s   0.103 vCPU    0.03 vCPU        303 MB           44 MB
10000                  2.5 MB/s   1.03 vCPU     0.19 vCPU        376 MB           65 MB

Our tests show that the Fluent Bit plugin is more resource-efficient than Fluentd. On average, Fluentd uses over four times the CPU and six times the memory of the Fluent Bit plugin.

Kinesis Firehose Plugins: Fluentd vs Fluent Bit

Log lines per second   Data out   Fluentd CPU   Fluent Bit CPU   Fluentd memory   Fluent Bit memory
100                    25 KB/s    0.006 vCPU    0.003 vCPU       84 MB            27 MB
1000                   250 KB/s   0.073 vCPU    0.033 vCPU       102 MB           37 MB
10000                  2.5 MB/s   0.86 vCPU     0.13 vCPU        438 MB           55 MB

In this benchmark, on average Fluentd uses over three times the CPU and four times the memory of the Fluent Bit plugin. Keep in mind that this data does not represent a guarantee; your footprint may differ. However, the above data points suggest that the Fluent Bit plugin is significantly more efficient than Fluentd.

Next Steps

We’re excited for you to try this out on your own clusters. Let us know if something doesn’t work the way you expect, and also please share your insights on performance/footprint as well as use cases. Please leave comments on the issue in GitHub, or open an issue on the AWS containers roadmap on GitHub.

Wesley Pettit

Software developer in the AWS container service team.

Michael Hausenblas

Michael is a Developer Advocate at AWS, part of the container service team, focusing on container security. Michael shares his experience around cloud native infrastructure and apps through demos, blog posts, books, and public speaking engagements, and contributes to open source software. Before AWS, Michael worked at Red Hat, Mesosphere, MapR, and two research institutions in Ireland and Austria. You can find him on Twitter at @mhausenblas.