AWS Database Blog
Optimize data archival costs in Amazon DocumentDB using rolling collections
Amazon DocumentDB (with MongoDB compatibility) is a scalable, highly durable, and fully managed database service for operating mission-critical MongoDB workloads. Amazon DocumentDB emulates the responses that a client expects from a MongoDB server by implementing the Apache 2.0 open-source MongoDB 3.6, 4.0, or 5.0 APIs on a purpose-built, distributed, fault-tolerant, and self-healing storage system that gives you the performance, scalability, and availability you need when operating these workloads at scale. As a document database, Amazon DocumentDB makes it easy to store, query, and index JSON data.
It’s essential to implement an appropriate data archival strategy so that only operational data is stored in a document database, and archived data is stored in appropriate storage like Amazon Simple Storage Service (Amazon S3). Lack of a data archival strategy can result in increased cost and poor performance due to inefficient use of system resources. In this post, I discuss common solutions to archive data from Amazon DocumentDB. I also elaborate on design considerations and a sample implementation for archiving data using a rolling collections pattern. Implementing a data archival solution involves deleting data from the operational store and persisting it in cold storage. There are various solutions to delete data, and it’s essential to select one considering its performance and cost impact on your workload.
Archiving documents using TTL indexes
Amazon DocumentDB organizes data across namespaces called collections within a database. Indexes in Amazon DocumentDB are created at a per-collection level, and one of the index types that Amazon DocumentDB supports is a Time to Live (TTL) index. This index allows you to set a timeout on each document. When a document reaches the TTL age limit, it’s deleted from the collection. You can implement solutions using change streams to store data in an appropriate cold storage and update the storage to indicate document deletion.
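For illustration, creating a TTL index uses the standard createIndex call with an expireAfterSeconds option. The following is a minimal sketch; the collection and field names are hypothetical:

```
// Expire documents 30 days (2,592,000 seconds) after the value
// in their lastModifiedDate field
db.my_collection.createIndex(
  { "lastModifiedDate": 1 },
  { "expireAfterSeconds": 2592000 }
)
```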
TTL works as a best-effort process, and documents are not guaranteed to be deleted within a specific period. Factors like instance size, resource utilization, document size, overall throughput, number of indexes, and whether the indexes and the working set fit in memory can all affect how quickly TTL deletes happen.
TTL indexes result in explicit delete operations. Document deletion incurs I/O, which is one of the pricing dimensions for Amazon DocumentDB. An increase in throughput and TTL deletes therefore increases your bill due to increased I/O usage. Using a TTL index is a valid solution for deleting documents from a collection. However, if your workload is write-heavy, TTL deletes can be suboptimal from a cost perspective. TTL deletes require the document to be present in the buffer cache, so read operations retrieve documents from the storage volume if they aren’t already cached. These read operations, along with the delete operations, increase cost. Also, cache updates caused by the read operations that precede the deletes can evict data from the buffer cache, resulting in increased latencies for regular read and write operations.
Archiving documents using rolling collections
An alternate solution to using a TTL index is rolling collections, where you segment documents into collections based on the retention period and drop these collections after the retention period expires. Dropping a collection doesn’t result in explicit delete operations, and you don’t incur any I/O costs. Dropping a collection is also efficient from a performance standpoint because the buffer cache isn’t updated and therefore regular read and write operations aren’t impacted.
Solution design
A best practice is to define the retention period for data in your operational data store. Let’s say we want to retain the most recent 30 days of data in Amazon DocumentDB and archive the data older than 30 days to Amazon S3. To implement this requirement using the rolling collections pattern, we must make the following design considerations regarding collections, document modeling, applications, and data archiving.
Collection design
The solution uses two 30-day collections:
- One collection stores the most recent 30 days of data. Let’s call it current_month.
- The other collection stores the previous 30 days of data. Let’s call it previous_month.

After 30 days, the solution drops the previous_month collection, renames the current_month collection to previous_month, and creates a new current_month collection.
In general, the number of collections follows an N + 1 approach, where N is the number of collections needed to host data for the required retention period. The additional collection is required to accommodate a rolling window across the retention period. For example, if the current collection contains data from January 1 to January 30, then on January 15 the application needs the most recent 30 days of data: January 1 to January 15 from the current collection, and December 16 to December 31 from the previous one.
Document modeling
You should include a date field in your document to track the last modified date; the sample implementation in this post calls it lastModifiedDate. Additionally, include an isArchived field in the document to indicate whether the document is soft deleted. This field is optional and helps keep track of documents updated within the retention window.
Application design
With the segmentation of collections as described earlier, the most recent 30 days of data spans the current_month and previous_month collections. The application should handle inserting the most recent data into the current_month collection. Read operations should query both collections to retrieve the most recent 30 days of data. To optimize performance for read operations, you can use asynchronous drivers to issue parallel queries. For update operations, the application writes the most recent data to the current_month collection. If the document was originally present in the previous_month collection, the application updates this collection to indicate soft deletion. Delete operations remove the document from the current_month and previous_month collections.
Archiving data
To archive data to cold storage, use change streams to stream changes from current_month to your S3 bucket in near-real time. You can find a reference implementation in the Amazon DocumentDB workshop Archiving data with Amazon DocumentDB change streams.
Sample implementation
Now let’s look at a sample implementation for the rolling collections pattern. Let’s consider the following user profile document.
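The sketch below shows a hypothetical shape for that document; the only fields the pattern relies on are lastModifiedDate and isArchived:

```
{
  "_id": "u12345",
  "firstName": "Jane",
  "lastName": "Doe",
  "email": "jane.doe@example.com",
  "lastModifiedDate": ISODate("2024-01-15T10:30:00Z"),
  "isArchived": false
}
```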
Sample queries on a collection of these documents go through the following steps.
Insert
The application writes the most recent data to the current_month collection:
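A minimal sketch in the mongo shell, using the hypothetical profile fields from the sample document:

```
db.current_month.insertOne({
  "_id": "u12345",
  "firstName": "Jane",
  "lastName": "Doe",
  "email": "jane.doe@example.com",
  "lastModifiedDate": new Date(),
  "isArchived": false
})
```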
Read
Queries look for documents with isArchived set to false. The application determines the boundary value for the last modified date for both collections and performs a union of the results. It creates an index on the lastModifiedDate and isArchived fields and uses them as predicates in the following query:
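The following sketch shows one way to do this in the mongo shell; the index creation, boundary computation, and field names are based on the design described earlier:

```
// One-time setup: compound index on the query predicates, on both collections
db.current_month.createIndex({ "lastModifiedDate": 1, "isArchived": 1 })
db.previous_month.createIndex({ "lastModifiedDate": 1, "isArchived": 1 })

// Boundary value: only the most recent 30 days are operational data
var boundary = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000)

// Run the same predicate against both collections; the application
// (ideally using an asynchronous driver) unions the two result sets
var recentDocs = db.current_month.find(
  { "lastModifiedDate": { "$gte": boundary }, "isArchived": false })
var olderDocs = db.previous_month.find(
  { "lastModifiedDate": { "$gte": boundary }, "isArchived": false })
```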
Update
The application writes the most recent data to the current_month collection. If the document was originally present in the previous_month collection, the application updates the isArchived attribute to true in the previous_month collection. See the following code:
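A sketch of that upsert against current_month (the updated values are hypothetical):

```
db.current_month.updateOne(
  { "_id": "u12345" },
  { "$set": {
      "email": "jane.d@example.com",
      "lastModifiedDate": new Date()
  } },
  { "upsert": true } // insert the document if it isn't in current_month yet
)
```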
The preceding query returns the following JSON:
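When the document already exists in current_month, the output looks similar to the following (illustrative mongosh result):

```
{
  "acknowledged": true,
  "insertedId": null,
  "matchedCount": 1,
  "modifiedCount": 1,
  "upsertedCount": 0
}
```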
If the updated document doesn’t exist in the current_month collection, the query inserts a new document and returns matchedCount as 0. Then the application updates the previous_month collection as follows:
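A sketch of that soft-delete update:

```
// Mark the stale copy in previous_month as soft deleted
db.previous_month.updateOne(
  { "_id": "u12345" },
  { "$set": { "isArchived": true } }
)
```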
If you don’t need to track lineage for the documents that were updated during the retention period, you can ignore this step and just delete the documents from the previous_month collection.
Delete
Delete operations remove documents from either or both collections, as per the application requirements:
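For example, removing a document from both collections might look like this:

```
db.current_month.deleteOne({ "_id": "u12345" })
db.previous_month.deleteOne({ "_id": "u12345" })
```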
At the end of the 30-day retention period, perform the following steps to roll the collections. This approach doesn’t require any code changes for your application. You can package these commands into a shell script or AWS Lambda function and schedule it to run at a frequency matching your retention period.
- Drop the previous_month collection.
- Rename the collection current_month to previous_month (see the sketch after this list).
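A minimal sketch of both steps in the mongo shell:

```
// Step 1: drop the collection whose retention period has expired;
// this incurs no per-document delete I/O
db.previous_month.drop()

// Step 2: rename current_month to previous_month
db.current_month.renameCollection("previous_month")
```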
It’s a best practice to perform this step during non-peak hours to avoid disruption to in-flight queries. Renaming a collection creates an invalidate event for change streams opened on the current_month collection. This event prohibits you from resuming change streams. To overcome this limitation, disable change streams on the current_month collection before renaming it, as shown below.
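In Amazon DocumentDB, change streams are enabled and disabled per collection with the modifyChangeStreams admin command. A sketch, assuming a database named profiles:

```
// Disable change streams on current_month before renaming it
db.adminCommand({
  modifyChangeStreams: 1,
  database: "profiles",        // assumed database name
  collection: "current_month",
  enable: false
})
```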
- Create a collection named current_month and enable change streams on this collection.
- Create indexes on the current_month collection (see the sketch after this list).
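A sketch of these last two steps, again assuming a database named profiles:

```
// Create the new collection for the next 30-day window
db.createCollection("current_month")

// Re-enable change streams so archiving to Amazon S3 resumes
db.adminCommand({
  modifyChangeStreams: 1,
  database: "profiles",        // assumed database name
  collection: "current_month",
  enable: true
})

// Recreate the index used by the application's read queries
db.current_month.createIndex({ "lastModifiedDate": 1, "isArchived": 1 })
```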
Considerations
As discussed earlier, one advantage of the rolling collections pattern is reduced cost, because dropping a collection generates no additional I/O. Performance for regular application queries also isn’t degraded with this pattern, because the workload’s data in the buffer cache isn’t evicted. The downside of this pattern is increased application complexity due to the query considerations we discussed.
It’s important to analyze the pros and cons of the data archival solutions discussed in the post to identify the best fit for your workload. A best practice is to implement your workload considering data archival requirements because it helps you design your collections, document models, and queries appropriately and avoid rework.
Summary
In this post, I discussed a data archival strategy for Amazon DocumentDB and the possible solutions along with pros and cons. I explained design considerations and provided a sample implementation for a data archival strategy using the rolling collections pattern.
Do you have follow-up questions or feedback? Leave a comment. I’d love to hear your thoughts and suggestions. To get started with Amazon DocumentDB, refer to the Developer Guide.
About the Author
Karthik Vijayraghavan is a Senior DocumentDB Specialist Solutions Architect at AWS. He has been helping customers modernize their applications using NoSQL databases. He enjoys solving customer problems and is passionate about providing cost-effective solutions that perform at scale. Karthik started his career as a developer building web and REST services with a strong focus on integration with relational databases, and can therefore relate to customers who are in the process of migrating to NoSQL.