AWS Open Source Blog
Improving zlib-cloudflare and comparing performance with other zlib forks
We worked with the maintainers of the Cloudflare fork of zlib (zlib-cloudflare) to improve the decompression performance on Arm and x86. With the changes, at level 6:
On Arm:
- Compression performance: ~90 percent faster than zlib-madler (original zlib).
- Decompression performance: ~52 percent faster than zlib-madler.
On x86:
- Compression performance: ~113 percent faster than zlib-madler.
- Decompression performance: ~44 percent faster than zlib-madler.
We recommend using zlib-cloudflare on Arm and x86 platforms. See the documentation on GitHub for more information on how to use it.
The zlib compression library is widely used to compress and decompress data. This library is utilized by popular programs like the Java Development Kit (JDK), Linux distributions, libpng, Git, and many others. Because zlib is widely adopted, the maintainer of the original version accepts only bug fixes with significant impact. This approach has resulted in the creation of several zlib forks, as there are multiple places in the code where performance improvements can be made.
zlib forks
Until recently, the Cloudflare fork of zlib (zlib-cloudflare) had the best performance for compression, and the Chromium fork (zlib-chromium) had the best performance for decompression among the four zlib forks considered:
- zlib-madler (original zlib)
- zlib-ng
- zlib-cloudflare
- zlib-chromium
As a result, no single zlib fork was fastest in both compression and decompression.
In this article, we’ll describe the work done in zlib-cloudflare to improve its decompression performance. Additionally, we compare the four zlib forks on both compression and decompression operations, using the Silesia Corpus (a collection of modern-day workloads) and a custom benchmark tool.
Motivation
The aim of improving zlib-cloudflare was to have a single version of zlib that could be used for both compression and decompression operations, regardless of the CPU architecture.
Between zlib-cloudflare and zlib-chromium, we chose zlib-cloudflare because it has an easy-to-use build system (make) compared with zlib-chromium (gn, ninja). A handcrafted build script is required to compile zlib-chromium on both Arm and x86.
Work done
We improved the performance of zlib-cloudflare by porting the decompression performance enhancement patches from zlib-chromium. With the patches merged, benchmarking on the Silesia Corpus shows that zlib-cloudflare compresses better (smaller file size) and faster than the original zlib (zlib-madler) on both Arm and x86. At compression level 6 (default) on Arm, we see:
- zlib-cloudflare is on average 90 percent faster than zlib-madler in compression operations.
- zlib-cloudflare is on average 52 percent faster than zlib-madler in decompression operations.
With these changes, zlib-cloudflare is now the best performing zlib fork for compression and decompression operations on both Arm and x86 systems.
Improvements
We selected the relevant patches from zlib-chromium and submitted them as pull requests to the zlib-cloudflare repository. The patches are:
- Adler32 SIMD Arm support
- Arm inflate improvements
- Inflate using wider loads and stores
- Improve zlib inflate speed using chunk copy
- Increase inflate speed: read decoder input into a uint64_t
- Intel SIMD and inflate improvements
The improvements seen in zlib-cloudflare are primarily due to the usage of Arm NEON, x86 SSE intrinsics, and loads wider than 1 byte at a time. These are used in the Adler-32 checksum and when performing wider loads/stores in inflate_fast().
Benchmark
We created a benchmark based on an example implementation of zlib by Mark Adler (the original author of zlib). It has been modified to run for 100M streams and tested with a wide variety of workloads from the Silesia Corpus.
We measure the throughput (MB/s) and compression ratio for levels 0 to 9 and report three numbers (compression level, throughput, compression ratio).
After porting the patches to zlib-cloudflare, we ran the benchmark using zlib-cloudflare and compared the results against zlib-madler on both M6g (Arm) and M5 (Intel) instances.
Comparing compression performance between zlib-cloudflare and zlib-madler, we see that zlib-cloudflare compresses better and faster than zlib-madler for the corresponding compression level on M6g.
Dickens (text file, higher and to the right is better; compression level is the number above the graph points):
The other workloads in the Silesia Corpus show similar trends.
Arm throughput (MB/s, higher is better):
- At level 6, zlib-cloudflare is, on average, 90 percent faster in compression operations than zlib-madler.
- At level 6, zlib-cloudflare is, on average, 52 percent faster in decompression operations than zlib-madler.
- On M5 (x86), we see compression and decompression performance similar to M6g (Arm).
Conclusion
With these changes, we were able to make a platform-agnostic library that is easy to use and integrate. Overall, we see zlib-cloudflare compressing better and faster when compared to zlib-madler (original zlib). We recommend trying zlib-cloudflare for compression/decompression needs on Arm and x86 platforms.
Appendix
Benchmark tool main loop: