Migrate an application from using GridFS to using Amazon S3 and Amazon DocumentDB (with MongoDB compatibility)
In many database applications, there arises a need to store large objects, such as files, along with application data. A common approach is to store these files inside the database itself, even though a database isn’t architecturally the best choice for storing large objects. First, because file system APIs are relatively basic (list, get, put, and delete), a fully featured database management system, with its complex query operators, is overkill for this use case. Additionally, large objects compete for resources in an OLTP system, which can negatively impact query workloads. Moreover, purpose-built file systems are often far more cost-effective for this use case than a database, in terms of both storage costs and the computing costs to support the file system.
The natural alternative to storing files in a database is on a purpose-built file system or object store, such as Amazon Simple Storage Service (Amazon S3). You can use Amazon S3 as the location to store files or binary objects (such as PDF files, image files, and large XML documents) that are stored and retrieved as a whole. Amazon S3 provides a serverless service with built-in durability, scalability, and security. You can pair this with a database that stores the metadata for the object along with the Amazon S3 reference. This way, you can query the metadata via the database APIs, and retrieve the file via the Amazon S3 reference stored along with the metadata. Using Amazon S3 and Amazon DocumentDB (with MongoDB compatibility) in this fashion is a common pattern.
GridFS is a file system that has been implemented on top of the MongoDB NoSQL database. In this post, I demonstrate how to replace the GridFS file system with Amazon S3. GridFS provides some nonstandard extensions to the typical file system (such as adding searchable metadata for the files) with MongoDB-like APIs, and I further demonstrate how to use Amazon S3 and Amazon DocumentDB to handle these additional use cases.
For this post, I start with some basic operations against a GridFS file system set up on a MongoDB instance. I demonstrate operations using the Python driver, pymongo, but the same operations exist in other MongoDB client drivers. I use an Amazon Elastic Compute Cloud (Amazon EC2) instance that has MongoDB installed; I log in to this instance and use Python to connect locally.
To demonstrate how this can be done with AWS services, I use Amazon S3 and an Amazon DocumentDB cluster for the more advanced use cases. I also use AWS Secrets Manager to store the credentials for logging into Amazon DocumentDB. The CloudFormation template I use provisions the following resources:
- A VPC with three private subnets and one public subnet
- An Amazon DocumentDB cluster
- An EC2 instance with the MongoDB tools installed and running
- A secret in Secrets Manager to store the database credentials
- Security groups to allow the EC2 instance to communicate with the Amazon DocumentDB cluster
The only prerequisite for this template is an EC2 key pair for logging into the EC2 instance. For more information, see Create or import a key pair. The following diagram illustrates the components in the template. This CloudFormation template incurs costs, and you should consult the relevant pricing pages before launching it.
First, launch the CloudFormation stack using the template. For more information on how to do this via the AWS CloudFormation console or the AWS Command Line Interface (AWS CLI), see Working with stacks. Provide the following inputs for the CloudFormation template:
- Stack name
- Instance type for the Amazon DocumentDB cluster (default is db.r5.large)
- Master username for the Amazon DocumentDB cluster
- Master password for the Amazon DocumentDB cluster
- EC2 instance type for the MongoDB database and the machine to use for this example (default: m5.large)
- EC2 key pair to use to access the EC2 instance
- SSH location to allow access to the EC2 instance
- Username to use with MongoDB
- Password to use with MongoDB
After the stack has completed provisioning, I log in to the EC2 instance using my key pair. The hostname for the EC2 instance is reported in the ClientEC2InstancePublicDNS output from the CloudFormation stack. For more information, see Connect to your Linux instance.
I use a few simple files for these examples. After I log in to the EC2 instance, I create five sample files as follows:
Basic operations with GridFS
In this section, I walk through some basic operations using GridFS against the MongoDB database running on the EC2 instance. All the following commands for this demonstration are available in a single Python script. Before using it, make sure to replace the username and password to access the MongoDB database with the ones you provided when launching the CloudFormation stack. I use the Python shell, which you can start by running python3 from the command line.
Next, we import the packages we need and connect to the local MongoDB database to create the GridFS object. The CloudFormation template created a MongoDB username and password based on the parameters entered when launching the stack. For this example, I use labdb for the username and labdbpwd for the password, but you should replace those with the parameter values you provided. We use the gridfs database to store the GridFS data and metadata.
Now that we have connected to MongoDB, we create a few objects. The first, db, represents the MongoDB database we use for our GridFS, namely gridfs. Next, we create a GridFS file system object, fs, that we use to perform GridFS operations. This GridFS object takes as an argument the MongoDB database object that was just created.
Now that this setup is complete, we list the files in the GridFS file system and see that there are none yet. Next, we insert one of the files we created earlier.
This put command returns an ObjectId that identifies the file that was just inserted. I save this ObjectId in the variable h, whose value we can display.
Now when we list the files, we see the file we just inserted. Next, we insert another file that we created earlier and list the files again.
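The list and put operations above can be sketched as follows; the statements are wrapped in a function here so the snippet stands alone, but in the Python shell you would run them directly against the fs object created earlier:

```python
def put_and_list(fs):
    """Insert hello.txt and bye.txt into GridFS, listing file names as we go."""
    print(fs.list())                        # no files yet
    with open("hello.txt", "rb") as f:
        h = fs.put(f.read(), filename="hello.txt")
    print(h)                                # ObjectId identifying the new file
    print(fs.list())                        # ['hello.txt']
    with open("bye.txt", "rb") as f:
        fs.put(f.read(), filename="bye.txt")
    print(fs.list())                        # now includes 'bye.txt' and 'hello.txt'
    return h
```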
Read the first file you inserted. One way to read the file is by its ObjectId, using the get() method.
GridFS also allows searching for files, for example by filename. We can see one file with the name hello.txt. The result is a cursor to iterate over the files that were returned; to get the first file, call the next() method on the cursor.
Next, delete the hello.txt file. To do this, use the ObjectId of the res0 file object, which is accessible via its _id attribute.
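Those read, find, and delete operations can be sketched as follows, again wrapped in a function so the snippet stands alone:

```python
def read_find_delete(fs, h):
    """Read a file by ObjectId, find it by file name, then delete it."""
    print(fs.get(h).read())                           # read by ObjectId
    res0 = fs.find({"filename": "hello.txt"}).next()  # first file from the cursor
    fs.delete(res0._id)                               # delete via the file's _id
    print(fs.list())                                  # 'hello.txt' is gone
```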
Only one file is now in the file system.
Next, overwrite the bye.txt file with different data, in this case the goodbye.txt file contents.
This overwrite doesn’t actually delete the previous version. GridFS is a versioned file system and keeps older versions unless you specifically delete them. So, when we find the files with the file name bye.txt, we see two files.
GridFS allows us to get specific versions of the file, via the get_version() method. By default, this returns the most recent version. Versions are numbered sequentially, starting at 0, so we can access the original version by specifying version 0 and the second version by specifying version 1. We can also access the most recent version by specifying version -1, which gives the same result as not providing a version at all.
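A sketch of the overwrite and version retrieval using get_version():

```python
def overwrite_and_get_versions(fs):
    """Overwrite bye.txt, then fetch its versions with get_version()."""
    with open("goodbye.txt", "rb") as f:
        fs.put(f.read(), filename="bye.txt")                # new version; old one kept
    print(len(list(fs.find({"filename": "bye.txt"}))))      # 2 versions
    print(fs.get_version("bye.txt").read())                 # most recent (the default)
    print(fs.get_version("bye.txt", 0).read())              # original version
    print(fs.get_version("bye.txt", 1).read())              # second version
    print(fs.get_version("bye.txt", -1).read())             # latest, same as default
```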
An interesting feature of GridFS is the ability to attach metadata to the files. The API allows for adding any keys and values as part of the put() operation. For example, we can add a key-value pair with the key somekey and a value of our choosing.
We can access the custom metadata as a field of the file object. Now that we have the metadata attached to the file, we can search for files with specific metadata and retrieve the value for the key somekey from the result.
We can also return multiple documents via this approach. If we insert another file with the somekey attribute, we can then see that two files have the somekey attribute defined.
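A sketch of attaching and searching custom metadata; the values somevalue and othervalue are assumptions for illustration:

```python
def gridfs_metadata_demo(fs):
    """Attach custom metadata at put() time and search on it."""
    with open("file1.txt", "rb") as f:
        fs.put(f.read(), filename="file1.txt", somekey="somevalue")
    g = fs.find({"filename": "file1.txt"}).next()
    print(g.somekey)                          # custom metadata is a file attribute
    with open("file2.txt", "rb") as f:
        fs.put(f.read(), filename="file2.txt", somekey="othervalue")
    for g in fs.find({"somekey": {"$exists": True}}):
        print(g.filename, g.somekey)          # two files match
```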
Basic operations with Amazon S3
In this section, I show how to get the equivalent functionality of GridFS using Amazon S3. There are some subtle differences in terms of unique identifiers and the shape of the returned objects, so it’s not a drop-in replacement for GridFS. However, the major functionality of GridFS is covered by the Amazon S3 APIs. I walk through the same operations as in the previous section, except using Amazon S3 instead of GridFS.
First, we create an S3 bucket to store the files. For this example, I use a bucket named blog-gridfs; you need to choose a different name for your bucket, because bucket names are globally unique. For this demonstration, we also want to enable versioning for this bucket, which allows Amazon S3 to behave similarly to GridFS with respect to versioning files.
As with the previous section, the following commands are included in a single Python script, but I walk through these commands one by one. Before using the script, make sure to replace the secret name with the one created by the CloudFormation stack, as well as the Region you’re using, and the S3 bucket you created.
We start by importing the packages we need, then connect to Amazon S3 by creating the S3 client. It’s also convenient to store the name of the bucket we created in a variable; set the bucket variable appropriately.
Now that this setup is complete, we list the files in the S3 bucket. The output is more verbose than with GridFS, but we’re most interested in the Contents field, which is an array of objects; in this example it’s absent, denoting an empty bucket. Next, we insert one of the files we created earlier. The put_object command takes three parameters:
- Body – The bytes to write
- Bucket – The name of the bucket to upload to
- Key – The file name
The key can be more than just a file name; it can also include subdirectories, expressed as a path-like key (for example, mydir/hello.txt). The put_object command returns information acknowledging the successful insertion of the file, including the ETag and, because versioning is enabled, the VersionId of the new object.
Now if we list the files, we see the file we just inserted. Next, we insert the other file we created earlier and list the files again.
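The list and put_object operations can be sketched as follows, wrapped in a function so the snippet stands alone:

```python
def s3_put_and_list(s3, bucket):
    """Upload hello.txt and bye.txt, listing the bucket before and after."""
    resp = s3.list_objects_v2(Bucket=bucket)
    print(resp.get("Contents", []))               # no Contents field: empty bucket
    with open("hello.txt", "rb") as f:
        resp = s3.put_object(Body=f.read(), Bucket=bucket, Key="hello.txt")
    print(resp["ETag"], resp.get("VersionId"))    # acknowledgement of the insert
    with open("bye.txt", "rb") as f:
        s3.put_object(Body=f.read(), Bucket=bucket, Key="bye.txt")
    print([o["Key"] for o in s3.list_objects_v2(Bucket=bucket)["Contents"]])
```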
Read the first file. In Amazon S3, use the bucket and key to get the object. The Body field of the response is a streaming object that can be read to retrieve the contents of the object.
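Reading the object can be sketched as:

```python
def s3_read_object(s3, bucket, key="hello.txt"):
    """Fetch an object and read its streaming Body to get the contents."""
    resp = s3.get_object(Bucket=bucket, Key=key)
    data = resp["Body"].read()
    print(data)
    return data
```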
Similar to GridFS, Amazon S3 also allows you to search for files by file name. In the Amazon S3 API, you can specify a prefix that is matched against the keys of the objects, and we can see one file with the name hello.txt.
Next, delete the hello.txt file. To do this, we use the bucket and the file name, or key.
The bucket now only contains one file.
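The prefix search and delete can be sketched as:

```python
def s3_find_and_delete(s3, bucket):
    """Find objects whose key starts with 'hello', then delete hello.txt."""
    resp = s3.list_objects_v2(Bucket=bucket, Prefix="hello")
    print([o["Key"] for o in resp.get("Contents", [])])   # ['hello.txt']
    s3.delete_object(Bucket=bucket, Key="hello.txt")
    resp = s3.list_objects_v2(Bucket=bucket)
    print([o["Key"] for o in resp.get("Contents", [])])   # only 'bye.txt' remains
```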
Let’s overwrite the bye.txt file with different data, in this case the goodbye.txt file contents.
Similar to GridFS, with versioning turned on in Amazon S3, an overwrite doesn’t actually delete the previous version. Amazon S3 keeps older versions unless you specifically delete them, so when we list the versions of the bye.txt object, we see two files. As with GridFS, Amazon S3 allows us to get specific versions of the file, via the get_object() method. By default, this returns the most recent version. Unlike GridFS, versions in Amazon S3 are identified with a unique identifier, VersionId, not a counter. We can get the versions of the object and sort them by their LastModified field; we can then access the original version by specifying the VersionId of the first element in the sorted list, and the most recent version by not specifying a VersionId at all.
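The version listing and retrieval can be sketched as:

```python
def s3_version_demo(s3, bucket):
    """Overwrite bye.txt, list its versions, and fetch a specific VersionId."""
    with open("goodbye.txt", "rb") as f:
        s3.put_object(Body=f.read(), Bucket=bucket, Key="bye.txt")  # old version kept
    versions = s3.list_object_versions(Bucket=bucket, Prefix="bye.txt")["Versions"]
    versions.sort(key=lambda v: v["LastModified"])                  # oldest first
    old = s3.get_object(Bucket=bucket, Key="bye.txt",
                        VersionId=versions[0]["VersionId"])
    print(old["Body"].read())                                       # original contents
    latest = s3.get_object(Bucket=bucket, Key="bye.txt")            # no VersionId: newest
    print(latest["Body"].read())
```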
Similar to GridFS, Amazon S3 provides the ability to attach metadata to the files. The API allows for adding keys and values as part of the Metadata field in the put_object() operation. For example, we can add a key-value pair with the key somekey. We can access the custom metadata via the Metadata field of the response, and we can also print the contents of the file.
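A sketch of attaching metadata in Amazon S3 and reading it back; the value somevalue is an assumption for illustration:

```python
def s3_metadata_demo(s3, bucket):
    """Attach custom metadata at put_object() time and read it back."""
    with open("file1.txt", "rb") as f:
        s3.put_object(Body=f.read(), Bucket=bucket, Key="file1.txt",
                      Metadata={"somekey": "somevalue"})  # value is an assumption
    resp = s3.get_object(Bucket=bucket, Key="file1.txt")
    print(resp["Metadata"])       # the custom key-value pairs
    print(resp["Body"].read())    # the file contents
```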
One limitation with Amazon S3 versus GridFS is that you can’t search for objects based on the metadata. To accomplish this use case, we employ Amazon DocumentDB.
Use cases with Amazon S3 and Amazon DocumentDB
Some use cases may require you to find objects or files based on the metadata, beyond just the file name. For example, in an asset management use case, we may want to record the author or a list of keywords. To do this, we can use Amazon S3 and Amazon DocumentDB together to provide a very similar developer experience, while leveraging the power of a purpose-built document database and a purpose-built object store. In this section, I walk through how to use these two services to cover the additional use case of finding files based on their metadata.
First, we import a few packages. We use the credentials that we created when we launched the CloudFormation stack; these credentials were stored in Secrets Manager. The name of the secret is the stack name with -DocDBSecret appended (for this post, docdb-mongo-DocDBSecret). We assign this to a variable; use the appropriate Secrets Manager secret name for your stack.
Next, we create a Secrets Manager client and retrieve the secret. Make sure to set the Region variable with the Region in which you deployed the stack.
This secret contains the four pieces of information that we need to connect to the Amazon DocumentDB cluster:
- Cluster endpoint
- Port
- Username
- Password
Next, we connect to the Amazon DocumentDB cluster. We use the database fs and the collection files to store our file metadata.
Because we already have data in the S3 bucket, we create entries in the Amazon DocumentDB collection for those files. The information we store is analogous to the information in the GridFS fs.files collection, namely the following:
- bucket – The S3 bucket
- filename – The S3 key
- version – The S3 VersionId of the object
- length – The file length in bytes
- uploadDate – The S3 LastModified timestamp of the object
Additionally, any metadata that was stored with the objects in Amazon S3 is also added to the document in Amazon DocumentDB.
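The backfill described above can be sketched as follows, where files is the Amazon DocumentDB collection:

```python
def backfill_metadata(s3, bucket, files):
    """Create one DocumentDB metadata document per existing S3 object."""
    for obj in s3.list_objects_v2(Bucket=bucket).get("Contents", []):
        head = s3.head_object(Bucket=bucket, Key=obj["Key"])
        doc = {
            "bucket": bucket,
            "filename": obj["Key"],
            "version": head.get("VersionId"),    # present when versioning is enabled
            "length": head["ContentLength"],
            "uploadDate": head["LastModified"],
        }
        doc.update(head.get("Metadata", {}))     # carry over any custom S3 metadata
        files.insert_one(doc)
```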
Now we can find files by their metadata. To read the file itself, we use the bucket, file name, and version to retrieve the object from Amazon S3. We can also put another file with additional metadata: we write the file to Amazon S3 and insert the metadata document into Amazon DocumentDB.
Finally, we can search for files with somekey defined, as we did with GridFS, and see that two files match.
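The find, read, put, and search flow can be sketched as follows; the metadata value othervalue is an assumption for illustration:

```python
def find_read_put_search(s3, bucket, files):
    """Find a file by metadata, read it from S3, add another file, search again."""
    doc = files.find_one({"somekey": {"$exists": True}})
    body = s3.get_object(Bucket=doc["bucket"], Key=doc["filename"],
                         VersionId=doc["version"])["Body"].read()
    print(body)
    with open("file2.txt", "rb") as f:           # 'othervalue' is an assumption
        resp = s3.put_object(Body=f.read(), Bucket=bucket, Key="file2.txt",
                             Metadata={"somekey": "othervalue"})
    files.insert_one({"bucket": bucket, "filename": "file2.txt",
                      "version": resp.get("VersionId"), "somekey": "othervalue"})
    for d in files.find({"somekey": {"$exists": True}}):
        print(d["filename"], d["somekey"])       # two files match
```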
You can delete the resources created in this post by deleting the stack via the AWS CloudFormation console or the AWS CLI. Because the S3 bucket was created outside the stack, remember to empty and delete it separately.
Storing large objects inside a database is typically not the best architectural choice. Instead, coupling a distributed object store, such as Amazon S3, with the database provides a more architecturally sound solution. Storing the metadata in the database and a reference to the location of the object in the object store allows for efficient query and retrieval operations, while reducing the strain on the database for serving object storage operations.
In this post, I demonstrated how to use Amazon S3 and Amazon DocumentDB in place of MongoDB’s GridFS. I leveraged Amazon S3’s purpose-built object store and Amazon DocumentDB, a fast, scalable, highly available, and fully managed document database service that supports MongoDB workloads.
For more information about recent launches and blog posts, see Amazon DocumentDB (with MongoDB compatibility) resources.
About the author
Brian Hess is a Senior Solution Architect Specialist for Amazon DocumentDB (with MongoDB compatibility) at AWS. He has been in the data and analytics space for over 20 years and has extensive experience with relational and NoSQL databases.