AWS Database Blog

Optimize data archival costs in Amazon DocumentDB using rolling collections

Amazon DocumentDB (with MongoDB compatibility) is a scalable, highly durable, and fully managed database service for operating mission-critical MongoDB workloads. Amazon DocumentDB emulates the responses that a client expects from a MongoDB server by implementing the Apache 2.0 open-source MongoDB 3.6, 4.0, or 5.0 APIs on a purpose-built, distributed, fault-tolerant, and self-healing storage system. This gives you the performance, scalability, and availability you need to operate mission-critical MongoDB workloads at scale. As a document database, Amazon DocumentDB makes it easy to store, query, and index JSON data.

It’s essential to implement an appropriate data archival strategy so that only operational data is stored in a document database, and archived data is moved to suitable cold storage like Amazon Simple Storage Service (Amazon S3). Without such a strategy, you can incur increased costs and poor performance due to inefficient use of system resources. In this post, I discuss common solutions for archiving data from Amazon DocumentDB. I also elaborate on design considerations and a sample implementation for archiving data using a rolling collections pattern. Implementing a data archival solution involves deleting data from the operational store and persisting it in cold storage. There are various solutions for deleting data, and it’s essential to select one with the performance and cost impact on your workload in mind.

Archiving documents using TTL indexes

Amazon DocumentDB organizes data across namespaces called collections within a database. Indexes in Amazon DocumentDB are created at a per-collection level, and one of the index types that Amazon DocumentDB supports is a Time to Live (TTL) index. This index allows you to set a timeout on each document. When a document reaches the TTL age limit, it’s deleted from the collection. You can implement solutions using change streams to store data in an appropriate cold storage and update the storage to indicate document deletion.
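
For example, the following command creates a TTL index on an updated_on date field that expires documents 30 days (2,592,000 seconds) after that timestamp; the profiles collection name here is an illustrative assumption:

db.profiles.createIndex({"updated_on": 1}, {expireAfterSeconds: 2592000})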

TTL works as a best-effort process, and documents aren’t guaranteed to be deleted within a specific period. Factors like instance size, resource utilization, document size, overall throughput, number of indexes, and whether the indexes and the working set fit in memory can all affect the timing of TTL deletes.

TTL indexes result in explicit delete operations. Document deletion incurs I/O, which is one of the pricing dimensions for Amazon DocumentDB, so as throughput and TTL deletes increase, your bill increases due to increased I/O usage. A TTL index is one solution for deleting documents from a collection. However, if your workload is write-heavy, TTL deletes can be suboptimal from a cost perspective. TTL deletes require the document to be present in the buffer cache, and read operations retrieve documents from the storage volume if they aren’t already in the buffer cache. These read operations, along with the delete operations, increase cost. Also, the cache updates caused by these reads preceding the deletes can evict data from the buffer cache, increasing latencies for regular read and write operations.

Archiving documents using rolling collections

An alternate solution to using a TTL index is rolling collections, where you segment documents into collections based on the retention period and drop these collections after the retention period expires. Dropping a collection doesn’t result in explicit delete operations, and you don’t incur any I/O costs. Dropping a collection is also efficient from a performance standpoint because the buffer cache isn’t updated and therefore regular read and write operations aren’t impacted.

Solution design

A best practice is to define the retention period for data in your operational data store. Let’s say we want to retain the most recent 30 days of data in Amazon DocumentDB and archive the data older than 30 days to Amazon S3. To implement this requirement using the rolling collections pattern, we must address the following design considerations regarding collections, document modeling, application design, and data archiving.

Collection design

The solution uses two 30-day collections:

  • One collection stores the most recent 30 days of data. Let’s call it current_month.
  • The other collection stores the previous 30 days of data. Let’s call it previous_month.

After 30 days, the solution drops the previous_month collection, renames the current_month collection to previous_month, and creates a new current_month collection.

In general, the number of collections follows an N + 1 approach, where N is the number of collections needed to host data for the required retention period. The additional collection is required to accommodate a rolling window across the retention period. For example, if the current_month collection contains data from January 1 to January 30, then on January 15 the application needs access to data from January 1 to January 15 in current_month and from December 16 to December 31 in previous_month.

Document modeling

You should include an updated_on date field in your document to track the last modified date.

Additionally, include an isArchived field in the document to indicate whether the document is soft deleted. This field is optional and helps keep track of documents updated within the retention window.

Application design

With the collections segmented as described earlier, the most recent 30 days of data spans the current_month and previous_month collections. The application should handle inserting the most recent data into the current_month collection. Read operations should query both collections to retrieve the most recent 30 days of data. To optimize performance for read operations, you can use asynchronous drivers to issue parallel queries. For update operations, the application writes the most recent data to the current_month collection. If the document was originally present in the previous_month collection, the application updates this collection to indicate soft deletion. Delete operations remove the document from the current_month and previous_month collections.
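
The following is a minimal sketch of such a parallel read using the MongoDB Node.js driver; the profiles database name, the findRecent helper, and the exact filter shape are illustrative assumptions:

// Sketch: query current_month and previous_month in parallel and
// merge the results for the most recent 30 days of data.
const { MongoClient } = require("mongodb");

async function findRecent(client, filter) {
    const db = client.db("profiles"); // assumed database name
    const cutoff = new Date(Date.now() - 30 * 86400 * 1000); // 30-day boundary
    const [current, previous] = await Promise.all([
        db.collection("current_month")
            .find({ ...filter, isArchived: false })
            .toArray(),
        db.collection("previous_month")
            .find({ ...filter, updated_on: { $gte: cutoff }, isArchived: false })
            .toArray(),
    ]);
    return current.concat(previous); // union of both result sets
}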

Archiving data

To archive data to cold storage, use change streams to stream changes from current_month to your S3 bucket in near-real time. You can find a reference implementation in the Amazon DocumentDB workshop Archiving data with Amazon DocumentDB change streams.
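
As a minimal sketch, the following opens a change stream on the current_month collection in the mongo shell (change streams must first be enabled on the collection, as shown later in this post). A production archiver, like the workshop reference, would write these events to your S3 bucket instead of printing them:

// Tail the change stream on current_month and print each event
const cursor = db.current_month.watch([], { fullDocument: "updateLookup" });
while (cursor.hasNext()) {
    printjson(cursor.next()); // replace with a write to Amazon S3
}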

Sample implementation

Now let’s look at a sample implementation of the rolling collections pattern. Consider the following user profile document:

{
        "_id" : ObjectId("618ef1347cd252d0493ec8bc"),
        "f_name" : "Mark",
        "l_name" : "Green",
        "email" : "Mark.Green@gmail.com",
        "phone" : "6623745143",
        "updated_on" : ISODate("2021-11-03T00:00:00Z"),
        "isArchived" : false
}

Sample queries on a collection of these documents go through the following steps.

Insert

The application writes the most recent data to the current_month collection:

db.current_month.insertOne({"f_name":"Richard", "l_name":"Roe", "email":"Richard.Roe@gmail.com", "phone":"5555560100", "updated_on":new Date(), "isArchived":false})

Read

Queries look for documents with isArchived set to false. The application determines the boundary value for the last modified date for both collections and performs a union of the results. To support these queries, the application creates an index on the updated_on and isArchived fields and uses them as predicates.
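
A minimal sketch of creating that index on both collections follows (the field order is an assumption; adjust it to match your query patterns):

db.current_month.createIndex({"updated_on": 1, "isArchived": 1})
db.previous_month.createIndex({"updated_on": 1, "isArchived": 1})

The read queries then look like the following: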

db.current_month.find({"l_name":"Green","updated_on":{"$lte":new Date()},"isArchived":false}).pretty()

db.previous_month.find({"l_name":"Green","updated_on":{"$gte":new Date(ISODate().getTime() - 1000 * 86400 * 30)},"isArchived":false}).pretty()

Update

The application writes the most recent data to the current_month collection. If the document was originally present in the previous_month collection, the application updates the isArchived attribute to true in the previous_month collection. See the following code:

db.current_month.updateOne(
{"email":"Alan.Green@gmail.com","updated_on":{"$lte":new Date()},"isArchived":false},
{$set:
{"f_name" : "Alan","l_name" : "Green","email" : "Alan.Green@gmail.com","phone" : "9249756242","updated_on":new Date() ,"isArchived":false}
},
{upsert:true}
)

The preceding query returns the following JSON:

{
        "acknowledged" : true,
        "matchedCount" : 0,
        "modifiedCount" : 0,
        "upsertedId" : ObjectId("618ef5c7e6c4f51d31339941")
}

If the updated document doesn’t exist in the current_month collection, the query inserts a new document and returns matchedCount as 0. Then the application updates the previous_month collection as follows:

db.previous_month.updateOne(
{"email":"Alan.Green@gmail.com","updated_on":{"$gte":new Date(ISODate().getTime() - 1000 * 86400 * 30)},"isArchived":false},
{$set:
{"updated_on":new Date() ,"isArchived":true}
},
{upsert:false}
)

If you don’t need to track lineage for the documents that were updated during the retention period, you can ignore this step and just delete the documents from the previous_month collection.

Delete

Delete operations remove documents from either or both collections, per the application’s requirements:

db.previous_month.deleteOne(
{ "email":"Alan.Green@gmail.com","isArchived":true }
)

db.current_month.deleteOne(
{ "email":"Alan.Green@gmail.com" }
)

At the end of the 30-day retention period, perform the following steps to roll the collections. This approach doesn’t require any code changes for your application. You can package these commands into a shell script or AWS Lambda function and schedule it to run at a frequency matching your retention period.

  1. Drop the previous_month collection:
    db.previous_month.drop()
  2. Rename the collection current_month to previous_month:
    db.adminCommand({modifyChangeStreams: 1, database: "profiles", collection: "current_month", enable: false});
    db.current_month.renameCollection("previous_month")

It’s a best practice to perform this step during off-peak hours to avoid disruption to in-flight queries. Renaming a collection creates an invalidate event for change streams opened on the current_month collection, which prevents you from resuming those change streams. To overcome this limitation, disable change streams on the current_month collection before renaming it.

  3. Create a collection named current_month and enable change streams on this collection:
    db.createCollection("current_month")
    db.adminCommand({modifyChangeStreams: 1, database: "profiles", collection: "current_month", enable: true});
  4. Create indexes in the current_month collection.
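
As a minimal sketch, you could combine these steps into a single script like the following and run it on your retention schedule with the mongo shell or from a Lambda function (the profiles database name and the index definition are assumptions to adjust for your workload):

// roll_collections.js - sketch of the rolling steps; adjust names to your workload
db = db.getSiblingDB("profiles"); // assumed database name
db.previous_month.drop(); // drop data older than the retention period
db.adminCommand({modifyChangeStreams: 1, database: "profiles", collection: "current_month", enable: false});
db.current_month.renameCollection("previous_month"); // rotate the collections
db.createCollection("current_month");
db.adminCommand({modifyChangeStreams: 1, database: "profiles", collection: "current_month", enable: true});
db.current_month.createIndex({"updated_on": 1, "isArchived": 1}); // assumed index definition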

Considerations

As discussed earlier, one advantage of the rolling collections pattern is reduced cost, because no additional I/O is generated. Also, performance for regular application queries isn’t degraded with this pattern, because the workload’s data retained in the buffer cache isn’t evicted. The downside of this pattern is increased application complexity due to the query considerations we discussed.

It’s important to analyze the pros and cons of the data archival solutions discussed in the post to identify the best fit for your workload. A best practice is to implement your workload considering data archival requirements because it helps you design your collections, document models, and queries appropriately and avoid rework.

Summary

In this post, I discussed a data archival strategy for Amazon DocumentDB and the possible solutions along with pros and cons. I explained design considerations and provided a sample implementation for a data archival strategy using the rolling collections pattern.

Do you have follow-up questions or feedback? Leave a comment. I’d love to hear your thoughts and suggestions. To get started with Amazon DocumentDB, refer to the Developer Guide.


About the Author

Karthik Vijayraghavan is a Senior DocumentDB Specialist Solutions Architect at AWS. He has been helping customers modernize their applications using NoSQL databases. He enjoys solving customer problems and is passionate about providing cost-effective solutions that perform at scale. Karthik started his career as a developer building web and REST services with a strong focus on integration with relational databases, and therefore can relate to customers that are in the process of migrating to NoSQL.