Choosing container logging options to avoid backpressure

Logging provides an easy way to observe the behavior of your container-based application. At AWS, we offer customers several choices for configuring logging. Many customers will start with the simplest or default option. I recently helped diagnose a customer issue that we traced back to container logging configuration. In short, we discovered that an application can become blocked using the default awslogs driver, so it might be best to switch it to non-blocking mode. A more general solution would be to use FireLens for Amazon ECS and ensure it is configured to allow filesystem buffering of logs.

In this blog post, I’m going to explore choices for logging from containers and the implications for backpressure when the container’s connection to the logging service is interrupted. Backpressure is generally defined as resistance to the flow of a liquid through pipes. In data terms, it’s resistance to the flow of data through a system. In exceptional cases, backpressure can cause your service to fail if it’s not configured correctly.

In the diagram on the left, a container is working fine and sending its logs to Amazon CloudWatch. In the diagram on the right, the connection to CloudWatch is interrupted. This interruption might be due to a networking configuration change that was made by mistake or even a temporary event affecting the CloudWatch service. The standard logging configuration is used in this case, so the container keeps trying to send logs until it becomes blocked. This is backpressure. The container eventually freezes and the application it was hosting stops responding to requests. When the connection to CloudWatch resumes, the container starts running again.

Example setup

I created a very simple Python web application that responds to requests on port 8080 and logs a fair amount of information. I made it multi-threaded so that it can respond to several requests at once, but also so that if one thread gets blocked, the application can still operate.
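The exact code isn't important, but a minimal sketch of that kind of application, using only the Python standard library, might look something like this (an illustration only, not the exact code used in these tests):

import logging
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Log to STDERR so that the container log driver picks everything up.
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(threadName)s %(message)s")

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Emit some log data on every request so the log driver always has work to do.
        logging.info("Handled request from %s for path %s", self.client_address[0], self.path)
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"Hello caller\n")

if __name__ == "__main__":
    # ThreadingHTTPServer handles each request in its own thread.
    ThreadingHTTPServer(("0.0.0.0", 8080), Handler).serve_forever()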

To test the application, I sent lots of requests using ApacheBench (ab). I also used iptables to block outbound connections to simulate a loss of connectivity to the logging service. The combination of ab and iptables creates backpressure in the right circumstances.

This diagram shows the test setup:

Now we’re ready to do some testing!

awslogs log driver

The easiest way to handle container logs in AWS is to send them to CloudWatch Logs. There is a container log driver specifically for this purpose, awslogs. I created this Amazon ECS task definition for my application so I could run it.

    "family": "logtest-task",
    "containerDefinitions": [
        {
            "name": "logtest",
            "image": "<accountid>.dkr.ecr.eu-west-1.amazonaws.com/logtest:latest",
            "memoryReservation": 10,
            "portMappings": [
                {
                    "hostPort": 8080,
                    "containerPort": 8080,
                    "protocol": "tcp"
                }
            ],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "awslogs-logtest",
                    "awslogs-region": "eu-west-1",
                    "awslogs-create-group": "true",
                    "awslogs-stream-prefix": "awslogs-logtest"
                }
            },
            "essential": true
        }
    ]
}

The logConfiguration section sets up the ability to send logs to CloudWatch Logs. I registered this task definition using aws ecs register-task-definition and then ran it using aws ecs run-task. Within a few seconds the task was running, which I verified in the AWS Management Console.
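The CLI commands looked something like this (the file name, cluster name, and launch type are placeholders for my setup):

$ aws ecs register-task-definition --cli-input-json file://logtest-task.json
$ aws ecs run-task --cluster logtest-cluster --task-definition logtest-task --launch-type EC2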

Next, I tested the application using curl:

$ curl <ip address of container host>:8080

Hello caller

It works! In CloudWatch Logs, there is a new log group, awslogs-logtest, that contains the logs from the application. You can open and explore this log group and watch log entries arrive soon after the container sends them.
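If you prefer the command line, you can follow the same log group with the AWS CLI (version 2 includes the logs tail command):

$ aws logs tail awslogs-logtest --follow --region eu-west-1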

Then, using ab, I started giving my application more traffic. I used this command to send 1000 requests:

$ ab -n 1000 http://<ip address of container host>:8080/

So the application is working. Now, let’s try to break it.

Simulate a failure to create backpressure

To create backpressure, I deliberately simulated a failure to connect to CloudWatch Logs, using Linux iptables to block the traffic.
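The exact rule depends on your environment; something along these lines blocks outbound HTTPS traffic to the CloudWatch Logs endpoint. The address range shown is only a placeholder for whatever logs.eu-west-1.amazonaws.com resolves to at the time:

$ sudo iptables -I OUTPUT -p tcp -d 52.94.0.0/16 --dport 443 -j DROP

Deleting the same rule later restores connectivity:

$ sudo iptables -D OUTPUT -p tcp -d 52.94.0.0/16 --dport 443 -j DROP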

By looking at the logs in the CloudWatch console, I could see that logging had halted. The container kept running, but because the logs could not be flushed to CloudWatch, they were building up in an in-memory buffer. When I ran ab again, the container continued to serve requests for a while, but eventually connections started to time out. My application stopped working entirely and was blocked due to backpressure! When I removed the restriction, the application started working again. Why does logging behave like this?

Blocked application

According to the Docker documentation:

Docker provides two modes for delivering messages from the container to the log driver:

  • direct, blocking delivery from container to driver (default)
  • non-blocking delivery that stores log messages in an intermediate per-container ring buffer for consumption by driver

The non-blocking message delivery mode prevents applications from blocking due to logging backpressure. Applications are likely to fail in unexpected ways when STDERR or STDOUT streams block.

So the default logging mode is blocking, which means that when the memory buffer is full, writes to STDOUT or STDERR block until the driver can deliver the logs, and your application blocks with them! Fortunately, there is a non-blocking mode. Let’s give that a go and see what difference it makes.

awslogs log driver (non-blocking)

To switch to non-blocking mode, I just added "mode": "non-blocking" to the options in the logConfiguration section of the ECS task definition and reregistered it. Reregistering creates a new task definition revision with a new ARN, which you need to specify when you use run-task.
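With that change, the logConfiguration section looks like this:

            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "awslogs-logtest",
                    "awslogs-region": "eu-west-1",
                    "awslogs-create-group": "true",
                    "awslogs-stream-prefix": "awslogs-logtest",
                    "mode": "non-blocking"
                }
            },

In non-blocking mode, the awslogs driver also accepts a max-buffer-size option, which controls how much log data the in-memory buffer can hold before messages start to be dropped.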

Now I was able to relaunch the task, test it, and then put the iptables restriction in place. When I ran ab a second time, no matter how much load I sent to the application, it worked perfectly! While the iptables restriction was in place, I didn’t see any log updates in CloudWatch Logs. As soon as I removed the restriction, the log updates returned.

At first glance, the fix seems to be to just use non-blocking mode. But there is a problem: I lost a chunk of log messages. When the logging buffer is full, some messages will be discarded. Although this might not be a problem for everyone, for some critical applications (for example in regulated industries), keeping a complete set of logs is an absolute requirement. Fortunately, there is a better solution to the problem: FireLens.

Using FireLens to manage logging

FireLens is a tool that makes dealing with container logs a lot easier. For more information, see the Announcing Firelens: A New Way to Manage Container Logs blog post.

FireLens works with either Fluent Bit or Fluentd. Whichever you choose, it is run as a sidecar container in the same ECS task. Your container outputs logs as normal while FireLens forwards, filters, buffers, and streams your logs to whichever solution you choose.

I used Fluent Bit for my tests. To configure Fluent Bit, I followed the advice in the How to set Fluentd and Fluent Bit input parameters in FireLens blog post, which shows how you can customize the Fluent Bit config file.

Here is the Dockerfile for the FireLens container I created:

FROM amazon/aws-for-fluent-bit:latest
ADD fluent-bit.conf /fluent-bit/alt/fluent-bit.conf 
CMD ["/fluent-bit/bin/fluent-bit", "-e", "/fluent-bit/firehose.so", "-e", "/fluent-bit/cloudwatch.so", "-e", "/fluent-bit/kinesis.so", "-c", "/fluent-bit/alt/fluent-bit.conf"]

I supplied a fluent-bit.conf file. In this configuration, Fluent Bit uses filesystem buffering rather than just relying on a memory buffer. The total size of storage is limited to 100 MB, but you can adjust it if you like. For more information about these options, see Buffering and Storage in the Fluent Bit documentation.

[SERVICE]
    flush                     1
    log_Level                 info
    storage.path              /var/log/flb-storage/
    storage.sync              normal
    storage.checksum          off
    storage.backlog.mem_limit 5M
    
[INPUT]
    Name    forward
    unix_path /var/run/fluent.sock
    storage.type  filesystem

[OUTPUT]
    Name cloudwatch
    Match   **
    region eu-west-1
    log_group_name fluent-bit-cloudwatch
    log_stream_prefix from-fluent-bit-
    auto_create_group true
    storage.total_limit_size 100M

My ECS task definition is a bit bigger now because it has two containers. Here’s the first half, which includes the FireLens container. Its awslogs logConfiguration controls the logs from FireLens itself, which are useful for debugging. I set this log configuration to non-blocking mode to avoid any backpressure issues, although FireLens doesn’t appear to output many log entries anyway.

{
    "family": "logtest-task-firelens",
    "containerDefinitions": [
        {
            "essential": true,
            "image": "<account id>.dkr.ecr.eu-west-1.amazonaws.com/aws-for-fluent-bit:latest",
            "name": "log_router",
            "firelensConfiguration": {
                "type": "fluentbit"
            },
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "firelens-container",
                    "awslogs-region": "eu-west-1",
                    "awslogs-create-group": "true",
                    "awslogs-stream-prefix": "firelens",
                    "mode": "non-blocking"
                }
            },
            "memoryReservation": 50
         },

Here’s the second half, which shows the changes to the application container:

        {
            "name": "logtest",
            "image": "<account id>.dkr.ecr.eu-west-1.amazonaws.com/logtest:latest",
            "memoryReservation": 10,
            "portMappings": [
                {
                    "hostPort": 8080,
                    "containerPort": 8080,
                    "protocol": "tcp"
                }
            ],
            "logConfiguration": {
                "logDriver":"awsfirelens",
                 "options": {
                    "Name": "CloudWatch",
                    "region": "eu-west-1",
                    "log_group_name": "awslogs-logtest-firelens",
                    "auto_create_group": "true",
                    "log_stream_name": "logtest-firelens"
                }
            },
            "essential": true
        }
    ]
}

To deploy this, I first built, tagged, and pushed the FireLens container to ECR. I then registered the new task definition. I removed all the previous firewall rules and stopped the old running task. I then ran the new task. Using the Amazon ECS console, I could see that the new task was running and consisted of two containers.
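For reference, building and pushing the custom FireLens image looked something like this (the account ID and image tags are placeholders for my setup):

$ aws ecr get-login-password --region eu-west-1 | docker login --username AWS --password-stdin <account id>.dkr.ecr.eu-west-1.amazonaws.com
$ docker build -t aws-for-fluent-bit:custom .
$ docker tag aws-for-fluent-bit:custom <account id>.dkr.ecr.eu-west-1.amazonaws.com/aws-for-fluent-bit:latest
$ docker push <account id>.dkr.ecr.eu-west-1.amazonaws.com/aws-for-fluent-bit:latest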

Now I could start testing again. The task created two log groups in CloudWatch Logs. The first was the debug output from FireLens letting me know it was working. The second was the output from my application. FireLens had started routing logs for me! My next step was to try to create backpressure.

Simulate failure, again

I used iptables again to sever the connection to CloudWatch Logs. By watching the logs in the CloudWatch console, I could see that my application was no longer able to send its logs to CloudWatch. I tried running ab again to flood the application with requests and cause it to log a lot of data, but I was unable to cause it to freeze. Success!

To verify what was happening in FireLens, I did two things.

First, I launched a shell in the FireLens container as follows:

docker exec -it ecs-logtest-task-firelens-3-logrouter-a0c9d2b7eccdf6f69201 /bin/bash

From there, I could check that log chunks were building up in /var/log/flb-storage/. This means that FireLens is saving my logs to local storage, so as long as there is enough space, I’m not going to lose any logs this time.
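A quick listing inside the container is enough to confirm that buffered chunks are accumulating (the exact directory layout can vary):

$ ls -lR /var/log/flb-storage/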

Second, I was able to view the Fluent Bit logs because I hadn’t blocked them. The logs showed:

2020-12-10T11:32:59.118+00:00    time="2020-12-10T11:32:59Z" level=error msg="[cloudwatch 0] RequestError: send request failed\ncaused by: Post https://logs.eu-west-1.amazonaws.com/: dial tcp 52.94.220.120:443: i/o timeout\n"
2020-12-10T11:34:59.520+00:00    time="2020-12-10T11:34:59Z" level=error msg="[cloudwatch 0] RequestError: send request failed\ncaused by: Post https://logs.eu-west-1.amazonaws.com/: dial tcp 52.94.220.120:443: i/o timeout\n"

FireLens is failing to forward logs due to an IO timeout. This is what I expected. After I removed the iptables restriction, things improved:

2020-12-10T11:36:46.428+00:00    [2020/12/10 11:36:46] [ info] [engine] flush chunk '1-1607599882.826321417.flb' succeeded at retry 1: task_id=1, input=forward.0 > output=cloudwatch.0
2020-12-10T11:36:46.428+00:00    [2020/12/10 11:36:46] [ info] [engine] flush chunk '1-1607599882.772535957.flb' succeeded at retry 1: task_id=1, input=forward.0 > output=cloudwatch.0

The chunks of logs saved to disk were successfully sent to CloudWatch. This time, no logs were lost.

FireLens memory buffer configuration

I was curious about what would happen with the default FireLens configuration, which uses a memory buffer instead of the filesystem. You’ll remember that in the fluent-bit.conf file, I changed the configuration to storage.type filesystem. I repeated the test with the default configuration instead. Under load, the application still kept running and didn’t experience backpressure, but the memory buffer filled up, which means I did lose some log data. The FireLens logs reported:

2020-12-10T11:12:08.647+00:00    [2020/12/10 11:12:08] [ warn] [input] forward.0 paused (mem buf overlimit)

If you want to make sure you don’t lose logs, I recommend that you adjust the Fluent Bit configuration to support filesystem buffering.

Conclusion

In this post, I showed you a few ways to configure logging from your containers. The awslogs driver is the simplest way to get your container logs into CloudWatch, but to avoid issues with backpressure, you should consider the configuration carefully. FireLens offers a log driver that is a little bit more complex, but if configured correctly, it helps you avoid backpressure entirely. When choosing between these drivers, consider your use case, specifically whether all logs must be retained in all circumstances.

Rob Charlton

Rob is a Principal Solutions Architect for Global Financial Services at AWS. His primary focus is helping banks, insurance companies and other Financial Services Institutions be resilient in the cloud. Rob graduated in Computer Science from Cambridge University and has 28 years of industry experience ranging from small startups he founded to large enterprises like Nokia. You can find Rob at robcharlton on LinkedIn and in Hampshire, UK in real life.