AWS Database Blog
Unlock cost savings using compression with Amazon DocumentDB
Amazon DocumentDB (with MongoDB compatibility) is a fully managed, native JSON document database that makes it straightforward and cost-effective to operate critical document workloads at virtually any scale. You can use the same application code, drivers, and tools written for MongoDB API versions 3.6, 4.0, and 5.0 to run, manage, and scale workloads on Amazon DocumentDB without worrying about the underlying infrastructure. As a document database, Amazon DocumentDB makes it straightforward to store, query, and index JSON data.
In the post Reduce cost and improve performance by migrating to Amazon DocumentDB 5.0, we discussed various ways to reduce costs by migrating your workload to Amazon DocumentDB. In this post, we demonstrate how the document compression feature in Amazon DocumentDB reduces storage usage and I/O costs.
Solution overview
Amazon DocumentDB now supports document compression using the LZ4 compression algorithm. Compressed documents in Amazon DocumentDB are up to seven times smaller than their uncompressed equivalents. The amount of compression achieved depends on your data; for example, text fields with repeating patterns are more compressible than numeric data. You can use the Amazon DocumentDB Compression Review Tool to get a sense of how compressible your data is before enabling compression.
Compressed documents require less storage space and fewer I/O operations during database reads and writes, leading to lower storage and I/O costs. Documents are also compressed in the buffer cache, allowing the cache to hold more of your working set. Compression and decompression require additional CPU and increase read and write latency, but the benefits can outweigh this overhead if you have collections with compressible data.
You can configure document compression for individual Amazon DocumentDB collections based on collection access patterns and storage requirements. Using existing APIs, you can monitor compression status and collection size after compression, as we demonstrate later in this post.
Keep in mind the following:
- The default compression setting for new collections on a cluster is determined by the cluster parameter default_collection_compression, which is set to disabled by default. You can leave the default value unless most of your collections will benefit from compression.
- You can apply document compression to existing collections. However, this only compresses documents that are inserted or updated after compression is turned on. To compress the documents that already exist in these collections, one strategy is to issue dummy updates (to a new field that is not used by the application) at a controlled, slow rate so that every document that existed before compression was enabled is touched; these updates apply compression as part of the operation (see the sketch after this list).
- Document compression is only supported on Amazon DocumentDB version 5.0. By default, Amazon DocumentDB only compresses documents of 2 KB and larger, but you can set the threshold to a value between 128–8,000 bytes by following the steps in Setting the compression thresholds. Our testing shows that the benefits of compression are not significant below 128 bytes, which is the smallest value you can select for the threshold. Depending on the compressibility of your documents, you can set the ideal threshold for your workload.
- As of the publication of this post, only collection data is compressed in Amazon DocumentDB, not indexes.
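The following mongo shell sketch shows how you might enable compression with a custom threshold on an existing collection and issue a dummy update to compress pre-existing documents. The collection name orders and the field _compressed_pass are placeholders, and the storageEngine option names follow the Amazon DocumentDB compression documentation; verify them against the documentation for your cluster version before use.

```
// Enable compression on an existing collection and lower the threshold to 512 bytes
// (valid values are 128-8,000 bytes).
db.runCommand({
  collMod: "orders",
  storageEngine: {
    documentDB: {
      compression: {
        enable: true,
        threshold: 512
      }
    }
  }
});

// Dummy update that touches documents created before compression was enabled,
// so they are rewritten (and therefore compressed). In production, run these
// updates in small, throttled batches to avoid load spikes.
db.orders.updateMany(
  { _compressed_pass: { $exists: false } },
  { $set: { _compressed_pass: true } }
);
```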
Sample datasets and compression results
To compare results between compressed and uncompressed collections, we load the same dataset into an uncompressed collection and a compressed collection separately. Because compression is turned off for all Amazon DocumentDB collections by default, the first step is to create a collection with compression turned on. To create a collection called compressed_collection with compression on, use a command like the following at the mongo shell command prompt (the storageEngine options follow the Amazon DocumentDB compression documentation):
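```
// Create the collection with LZ4 document compression enabled
db.createCollection("compressed_collection", {
  storageEngine: {
    documentDB: {
      compression: {
        enable: true
      }
    }
  }
});
```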
After you create the collection, you can load the data into the collection using the mongoimport command. After you load the data into both the uncompressed and compressed collections, you can use the stats() command from the mongo shell on individual collections to display collection statistics.
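For example, a run might look like the following; the cluster endpoint, credentials, CA bundle, database name, and file name are placeholders that you replace with your own values:

```
mongoimport --ssl \
  --host="mycluster.cluster-xxxxxxxxxxxx.us-east-1.docdb.amazonaws.com:27017" \
  --sslCAFile=global-bundle.pem \
  --username=sampleuser --password='<password>' \
  --db=testdb --collection=compressed_collection \
  --file=dataset.json
```

From the mongo shell, db.compressed_collection.stats() then reports metrics such as storageSize, which you can compare against the uncompressed collection.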
In the following sections, we explore a few sample datasets and the level of compression that you can achieve on these datasets.
FeTaQA: Free-form Table Question Answering dataset
For this test, we use the FeTaQA: Free-form Table Question Answering dataset. The dataset contains several JSON files.
The following code shows the trimmed structure of a sample document:
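The following trimmed document is an illustrative reconstruction based on the dataset's published schema (values elided), not an excerpt from the collection:

```
{
  "feta_id": "...",
  "page_wikipedia_url": "...",
  "table_page_title": "...",
  "table_section_title": "...",
  "table_array": [
    ["<header 1>", "<header 2>", "..."],
    ["<row value>", "<row value>", "..."]
  ],
  "highlighted_cell_ids": ["..."],
  "question": "...",
  "answer": "..."
}
```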
The following table compares the storageSize (in KB) from the collection stats command output.

Uncompressed | Compressed
17,480 KB | 12,568 KB
The storage space saved using compression in this dataset for 7,326 documents is (17,480 - 12,568) = 4,912 KB, or ((4,912 * 100) / 17,480) = 28%. In terms of compression ratio, this sample attains 12,568 / 17,480 ≈ 1:0.7 compression.
Tweets about news
For this test, we use the twitter-news dataset hosted on Kaggle related to news articles. The following code shows the trimmed structure of a sample document:
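The exact fields vary by dataset export; the following trimmed document is illustrative, with field names borrowed from the standard Twitter API tweet object rather than copied from the Kaggle dataset:

```
{
  "created_at": "...",
  "id_str": "...",
  "text": "...",
  "user": {
    "id_str": "...",
    "screen_name": "...",
    "followers_count": "..."
  },
  "entities": {
    "hashtags": ["..."],
    "urls": ["..."]
  },
  "retweet_count": "...",
  "favorite_count": "..."
}
```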
The following table compares the storageSize (in KB) from the collection stats command output.

Uncompressed | Compressed
8,791,784 KB | 5,826,144 KB
The storage space saved using compression in this collection for 2,877,354 documents is (8,791,784 - 5,826,144) = 2,965,640 KB, or ((2,965,640 * 100) / 8,791,784) = 34%. In terms of compression ratio, this sample attains 5,826,144 / 8,791,784 ≈ 1:0.7 compression.
In these examples, we achieved up to a 34% reduction in storage size as a result of using document compression. If your collection is large, this can translate into significant storage cost savings.
Compression has a slight overhead during read and write operations, so the best way to determine whether compression is the right fit for your workload is to perform a similar exercise with your dataset and weigh the benefits. You can always enable compression on a collection as your workload changes in the future. Cloning a volume for an Amazon DocumentDB cluster is a safe, fast, and cost-effective mechanism to perform tests on your production data without impacting production systems.
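For example, you could create a clone with the AWS CLI. This is a sketch with placeholder cluster identifiers; the clone shares the source cluster's storage volume and needs at least one instance (added separately with create-db-instance) before you can run tests against it:

```
aws docdb restore-db-cluster-to-point-in-time \
  --db-cluster-identifier my-clone-cluster \
  --source-db-cluster-identifier my-production-cluster \
  --restore-type copy-on-write \
  --use-latest-restorable-time
```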
Summary
In this post, we showed you how compression in Amazon DocumentDB can reduce your storage size. We used sample datasets, and the reduction in size varied based on the document content. For more information about recent launches and blog posts, see Amazon DocumentDB (with MongoDB compatibility) resources.
We welcome your feedback. Leave your thoughts or questions in the comments section.
About the Authors
Sourav Biswas is a Senior Amazon DocumentDB Specialist Solutions Architect at AWS. He has been helping Amazon DocumentDB customers successfully adopt the service and implement best practices around it. Before joining AWS, he worked extensively as an application developer and solutions architect for various NoSQL vendors.
Nikhil Goyal is a Customer Solutions Manager at AWS with over 20 years of experience in designing, implementing, and managing complex, mission-critical IT solutions across various industries. He is passionate about helping AWS customers maximize the value of their cloud services by facilitating the adoption of new technologies, optimizing existing infrastructure, and advocating for customer needs within AWS.