AWS Open Source Blog

Improving zlib-cloudflare and comparing performance with other zlib forks

We worked with the maintainers of the Cloudflare fork of zlib (zlib-cloudflare) to improve the decompression performance on Arm and x86. With the changes, at level 6:
On Arm:

  • Compression performance: ~90 percent faster than zlib-madler (original zlib).
  • Decompression performance: ~52 percent faster than zlib-madler.

On x86:

  • Compression performance: ~113 percent faster than zlib-madler.
  • Decompression performance: ~44 percent faster than zlib-madler.

We recommend using zlib-cloudflare on Arm and x86 platforms. Please see the documentation on GitHub for more information on how to use it.

The zlib compression library is widely used to compress and decompress data. This library is utilized by popular programs like the Java Development Kit (JDK), Linux distributions, libpng, Git, and many others. Because zlib is widely adopted, the maintainer of the original version accepts only bug fixes with significant impact. This approach has resulted in the creation of several zlib forks, as there are multiple places in the code where performance improvements can be made.

zlib forks

Until recently, the Cloudflare fork of zlib (zlib-cloudflare) had the best performance for compression, and the Chromium fork (zlib-chromium) had the best performance for decompression among the four zlib forks considered:

  • zlib-madler (original zlib)
  • zlib-ng
  • zlib-cloudflare
  • zlib-chromium

As a result, no single zlib fork was fastest in both compression and decompression.

In this article, we’ll describe the work done in zlib-cloudflare to improve its decompression performance. Additionally, we compare the four zlib forks on both compression and decompression operations, using the Silesia Corpus (a collection of modern-day workloads) and a custom benchmark tool.

Motivation

The aim of improving zlib-cloudflare was to have a single version of zlib that could be used for both compression and decompression operations, regardless of the CPU architecture.

Between zlib-cloudflare and zlib-chromium, we chose zlib-cloudflare because its build system (make) is easier to use than the one in zlib-chromium (gn, ninja). Compiling zlib-chromium on both Arm and x86 requires a handcrafted build script.

Work done

We improved the performance of zlib-cloudflare by porting the decompression performance enhancement patches from zlib-chromium. With the patches merged, benchmarks on the Silesia Corpus show that zlib-cloudflare compresses better (smaller file size) and faster than the original zlib (zlib-madler) on both Arm and x86. At compression level 6 (the default) on Arm, we see:

  • zlib-cloudflare is on average 90 percent faster than zlib-madler in compression operations.
  • zlib-cloudflare is on average 52 percent faster than zlib-madler in decompression operations.

With these changes, zlib-cloudflare is now the best performing zlib fork for compression and decompression operations on both Arm and x86 systems.

Improvements

We selected the relevant patches from zlib-chromium and submitted them as a pull request to the zlib-cloudflare repo.

The improvements seen in zlib-cloudflare are primarily due to the usage of Arm NEON, x86 SSE intrinsics, and loads wider than 1 byte at a time. These are used in the Adler-32 checksum and when performing wider loads/stores in inflate_fast().
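
To make the checksum mentioned above concrete, here is a minimal scalar sketch of Adler-32. It is not the code from the ported patches; the function name `adler32_scalar` and the macro names are hypothetical. The NEON and SSE versions compute the same two running sums, but over many bytes per iteration instead of one.

```c
#include <stddef.h>
#include <stdint.h>

#define ADLER_MOD  65521u  /* largest prime below 2^16 */
#define ADLER_NMAX 5552u   /* bytes that can be summed before 32-bit overflow */

/* Scalar reference Adler-32 (hypothetical helper, for illustration only).
 * Pass 1 as `adler` to start a new checksum. */
uint32_t adler32_scalar(uint32_t adler, const unsigned char *buf, size_t len)
{
    uint32_t a = adler & 0xffffu;          /* running sum of bytes */
    uint32_t b = (adler >> 16) & 0xffffu;  /* running sum of the `a` values */

    while (len > 0) {
        /* Defer the expensive modulo: reduce once per chunk, not per byte. */
        size_t chunk = len < ADLER_NMAX ? len : ADLER_NMAX;
        len -= chunk;
        while (chunk--) {
            a += *buf++;
            b += a;
        }
        a %= ADLER_MOD;
        b %= ADLER_MOD;
    }
    return (b << 16) | a;
}
```

The inner loop handles one byte per iteration, which is the part a vectorized implementation replaces with wide loads and SIMD arithmetic.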

Benchmark

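We created a benchmark using an example implementation of zlib by Mark Adler (original author of zlib). It has been modified to keep compressing or decompressing until more than 100 MB of data has been processed in each run (see the main loop in the Appendix), and it was tested with a wide variety of workloads in the Silesia Corpus.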

We measure the throughput (MB/s) and compression ratio for levels 0 to 9 and report three numbers (compression level, throughput, compression ratio).
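
As a rough illustration of how the three reported numbers could be derived, the sketch below computes throughput and compression ratio from a run's totals. The struct and function names are hypothetical, and the definitions used here (1 MB = 1,000,000 bytes, ratio = uncompressed size divided by compressed size) are assumptions; the benchmark's own `set_metrics` is not shown in this article.

```c
#include <stddef.h>

/* Hypothetical container for the three reported numbers. */
struct bench_metrics {
    int    level;            /* compression level, 0 to 9 */
    double throughput_mb_s;  /* assumed: 1 MB = 1,000,000 bytes */
    double ratio;            /* assumed: uncompressed size / compressed size */
};

/* Derive metrics from one run. Returns zeros where a division
 * would otherwise be undefined. */
struct bench_metrics compute_metrics(double elapsed_s,
                                     size_t uncompressed_size,
                                     size_t compressed_size,
                                     long processed_bytes,
                                     int level)
{
    struct bench_metrics m;
    m.level = level;
    m.throughput_mb_s = elapsed_s > 0.0
        ? ((double)processed_bytes / 1e6) / elapsed_s
        : 0.0;
    m.ratio = compressed_size > 0
        ? (double)uncompressed_size / (double)compressed_size
        : 0.0;
    return m;
}
```

For instance, 200,000,000 bytes processed in 2 seconds gives 100 MB/s under these assumptions, and a 1,000-byte input that compresses to 250 bytes gives a ratio of 4.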

After porting the patches to zlib-cloudflare, we ran the benchmark using zlib-cloudflare and compared the results against zlib-madler on both M6g (Arm) and M5 (x86) instances.

Comparing compression performance between zlib-cloudflare and zlib-madler, we see that zlib-cloudflare compresses better and faster than zlib-madler for the corresponding compression level on M6g.

Dickens (text file, higher and to the right is better; compression level is the number above the graph points):

[Figure: Dickens throughput versus compression ratio.]

The other workloads in the Silesia Corpus show similar trends.

Arm throughput (MB/s, higher is better):

  • At level 6, zlib-cloudflare is, on average, 90 percent faster in compression operations than zlib-madler.
    [Figure: bar graph illustrating zlib-cloudflare compression operations.]
  • zlib-cloudflare is, on average, 52 percent faster in decompression operations than zlib-madler.
    [Figure: graph illustrating the zlib-cloudflare decompression operations.]
  • On M5 (x86), we see compression and decompression performance similar to M6g (Arm).

Conclusion

With these changes, we were able to make a platform-agnostic library that is easy to use and integrate. Overall, we see zlib-cloudflare compressing better and faster when compared to zlib-madler (original zlib). We recommend trying zlib-cloudflare for compression/decompression needs on Arm and x86 platforms.

Appendix

Benchmark tool main loop:

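/* Repeat the measurement `runs` times; each run reports its own metrics. */
for (i = 0; i < runs; i++) {
    int counter = 0;            /* loop iterations completed in this run */
    long processed = 0;         /* amount of data processed so far in this run */
    long process = 100000000L;  /* stop once `processed` exceeds this value */
    start = get_clock_time();
    while (1) {
        if (action & DEFLATE) {
            /* Compress the input buffer at the requested level. */
            run_deflate(inflated, deflated, comp_level_start);
            processed += in_size;
        }
        if (action & INFLATE) {
            /* Decompress the compressed buffer. */
            run_inflate(deflated, inflated);
            processed += out_size;
        }
        counter++;
        if (processed > process)
            break;
    }
    end = get_clock_time();
    set_metrics((end - start), in_size, out_size, processed, comp_level_start, &m);
    print_metrics(&m, counter);
}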
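
The loop above relies on a timing helper that is not shown. A minimal sketch of one possible implementation follows, assuming a POSIX system and a return value in seconds; the name `get_clock_time_sketch` is hypothetical, and the benchmark's actual `get_clock_time` may differ.

```c
#define _POSIX_C_SOURCE 199309L  /* expose clock_gettime() under strict C modes */
#include <time.h>

/* Hypothetical timing helper: seconds from a monotonic clock, so that the
 * difference between two calls is not affected by wall-clock adjustments.
 * Returns -1.0 if the clock cannot be read. */
double get_clock_time_sketch(void)
{
    struct timespec ts;
    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
        return -1.0;
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}
```

A monotonic clock is a common choice for benchmarks because only the elapsed interval matters, not the absolute time.
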
Janakarajan Natarajan

Janakarajan is a Systems Development Engineer at AWS (Annapurna Labs) working on Graviton processors in Austin, Texas. Prior to joining AWS, he worked at AMD for 4 years doing Linux kernel development for EPYC processors.

Volker Simonis

Volker Simonis is a Principal Software Engineer in the Corretto team at Amazon Web Services. He has worked on Java Virtual Machines since 2004 and has been an OpenJDK Member, Reviewer, and Committer right from the start. Before joining Amazon, he worked for SAP, Sun Microsystems, and the University of Tübingen, from which he holds master's and PhD degrees in Computer Science. He represented SAP in the Executive Committee of the JCP and was a member of the JCP Expert Groups for Java SE 9 to 13. He's a passionate and frequent speaker at conferences around the globe and can easily be contacted at @volker_simonis.