AWS Developer Blog

A Fast and Correct Base 64 Codec

by Hanson Char | in Java

At AWS, we always strive to make our tools and services better for our customers. One example is the recent improvement we made to the AWS Java SDK’s Base 64 encoding and decoding. In essence, we’ve replaced Jakarta Commons Codec 1.x with a different implementation throughout the entire SDK. Why, you may wonder? There are two reasons.


The first is about performance. Here is a graph that summarizes the situation:

[Figure: Base64 Performance Comparison]

This graph is the frequency distribution of a thousand data points captured for each of the two codecs, Jakarta Commons Codec 1.x and the AWS SDK for Java. Each data point represents the total number of milliseconds taken in one iteration to Base 64 encode and decode 2 MB of random binary data. The test was conducted in a Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode). On average, the Java SDK’s Base 64 codec is about 2.47x faster, and the standard deviation of its timings is about 42.81% of that of Commons Codec (2.89 ms vs. 6.75 ms). (For readers who are statistically inclined, details are provided in the Appendix below.)
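The original test harness is not shown in this post, so the following is only a minimal sketch of how such a measurement could be taken. It assumes java.util.Base64 (Java 8+) as a stand-in codec, and the class name, method names, and random seed are all hypothetical. It has no JIT warm-up phase, so the first few samples will be inflated.

```java
import java.util.Arrays;
import java.util.Base64;
import java.util.Random;

// Hypothetical harness; java.util.Base64 stands in for the codec under test.
class Base64RoundTripTimer {

    /** Times one encode + decode round trip of the payload, in milliseconds. */
    static long timeRoundTripMillis(byte[] payload) {
        long start = System.nanoTime();
        String encoded = Base64.getEncoder().encodeToString(payload);
        byte[] decoded = Base64.getDecoder().decode(encoded);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        // Sanity check outside the timed region: the round trip must be lossless.
        if (!Arrays.equals(payload, decoded)) {
            throw new IllegalStateException("round trip altered the data");
        }
        return elapsedMillis;
    }

    /** Collects one timing per iteration, each on a fresh random payload. */
    static long[] collectSamples(int iterations, int payloadBytes, long seed) {
        Random random = new Random(seed);
        long[] samples = new long[iterations];
        for (int i = 0; i < iterations; i++) {
            byte[] payload = new byte[payloadBytes];
            random.nextBytes(payload);
            samples[i] = timeRoundTripMillis(payload);
        }
        return samples;
    }

    public static void main(String[] args) {
        // 1000 iterations over 2 MB of random data, as in the test described above.
        long[] samples = collectSamples(1000, 2 * 1024 * 1024, 42L);
        System.out.println(Arrays.stream(samples).summaryStatistics());
    }
}
```

To compare two codecs, the two calls inside timeRoundTripMillis would be swapped for the codec under test, and the resulting sample arrays exported for plotting.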


The second reason is correctness. Here is a quick quiz:

What is the correct result of Base 64 decoding the string "ZE=="?

(Stop reading further in case the answer spoils the fun.)

The answer is: the decoding should fail. Why? Even though "ZE==" may look like a valid Base 64 encoded string, it is technically impossible to construct such a string via Base 64 encoding from any binary data in the first place! (Don’t take my word for it. Try it yourself!)
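One way to try it: a four-character string ending in "==" encodes exactly one byte, so it is enough to encode all 256 possible byte values and look for "ZE==". The sketch below does that with java.util.Base64 (Java 8+); the class and method names are hypothetical. The underlying reason is that the second character carries only two data bits followed by four zero bits, so it can only ever be A, Q, g, or w, never E.

```java
import java.util.Base64;

// Hypothetical helper for the quiz above.
class ImpossibleBase64 {

    /** Returns true if some single byte Base 64 encodes to the given text. */
    static boolean isEncodingOfSomeSingleByte(String candidate) {
        Base64.Encoder encoder = Base64.getEncoder();
        for (int value = 0; value < 256; value++) {
            String encoded = encoder.encodeToString(new byte[] { (byte) value });
            if (encoded.equals(candidate)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isEncodingOfSomeSingleByte("ZE==")); // false
        System.out.println(isEncodingOfSomeSingleByte("ZA==")); // true: the byte 0x64 ('d')
    }
}
```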

If such an invalid string is passed to the latest Java SDK’s Base 64 codec, the decoding routine correctly fails fast with an IllegalArgumentException. As far as I know, no other existing Base 64 codec handles such "illegal" input correctly. Most Base 64 decoders (including the latest in Java 8) simply and silently return some implementation-specific, arbitrary value that could never be Base 64 re-encoded back to the original input string. You can probably imagine how such "random" behavior could make security engineers quite uncomfortable. :)
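The fail-fast behavior can be approximated on top of any lenient decoder by re-encoding the result and comparing it with the input. The sketch below is not the SDK's implementation (which is not shown in this post); it is an illustration that assumes java.util.Base64 (Java 8+) and padded input, with hypothetical class and method names.

```java
import java.util.Base64;

// Hypothetical wrapper; illustrates strict decoding via a round-trip check.
class StrictBase64 {

    /**
     * Decodes padded Base 64 text, rejecting any input that is not the
     * canonical encoding of the bytes it decodes to.
     */
    static byte[] decodeStrict(String text) {
        // The underlying decoder may itself throw IllegalArgumentException.
        byte[] decoded = Base64.getDecoder().decode(text);
        String reencoded = Base64.getEncoder().encodeToString(decoded);
        if (!reencoded.equals(text)) {
            throw new IllegalArgumentException(
                    "Not a canonical Base 64 encoding: " + text);
        }
        return decoded;
    }
}
```

With this wrapper, decodeStrict("ZA==") returns the single byte 0x64, while decodeStrict("ZE==") throws an IllegalArgumentException. The round trip costs an extra encoding pass, so a decoder built for speed would more likely check the unused trailing bits inline.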


Under the hood, the latest Base 64 codec in the AWS SDK for Java is a hybrid implementation. For encoding from bytes to string, we directly use javax.xml.bind.DataTypeConverter available from the JDK (1.6+). For decoding, we use our own implementation for reasons of both speed and correctness as discussed above.


This fast and correct Base 64 codec is now available in the AWS SDK for Java 1.8.3 or later. You can of course directly and independently make use of it. For example:

import com.amazonaws.util.Base64;
byte[] bytes = ...
// Base 64 encode
String encoded = Base64.encodeAsString(bytes);
// Base 64 decode
byte[] decoded = Base64.decode(encoded);

Enjoy!


Appendix: performance statistics, generated via R, for the graph above.

        vars    n  mean   sd median trimmed  mad min max range skew kurtosis   se
Commons    1 1000 47.69 6.75     46    46.9 5.93  38  84    46 1.43     3.62 0.21
SDK        2 1000 19.51 2.89     19    19.1 2.97  16  46    30 1.84     7.90 0.09

    Commons            SDK
 Min.   :38.00    Min.   :16.00
 1st Qu.:42.00    1st Qu.:17.00
 Median :46.00    Median :19.00
 Mean   :47.69    Mean   :19.51
 3rd Qu.:51.00    3rd Qu.:21.00
 Max.   :84.00    Max.   :46.00