AWS Database Blog

Migrate an application from using GridFS to using Amazon S3 and Amazon DocumentDB (with MongoDB compatibility)

Many database applications need to store large objects, such as files, along with application data. A common approach is to store these files inside the database itself, even though a database isn't architecturally the best choice for storing large objects. First, because file system APIs are relatively basic (list, get, put, and delete), a fully featured database management system, with its complex query operators, is overkill for this use case. Additionally, large objects compete for resources in an OLTP system, which can negatively impact query workloads. Moreover, purpose-built file systems are often far more cost-effective for this use case than a database, both in storage costs and in the computing costs to support the file system.

The natural alternative to storing files in a database is on a purpose-built file system or object store, such as Amazon Simple Storage Service (Amazon S3). You can use Amazon S3 as the location to store files or binary objects (such as PDF files, image files, and large XML documents) that are stored and retrieved as a whole. Amazon S3 provides a serverless service with built-in durability, scalability, and security. You can pair this with a database that stores the metadata for the object along with the Amazon S3 reference. This way, you can query the metadata via the database APIs, and retrieve the file via the Amazon S3 reference stored along with the metadata. Using Amazon S3 and Amazon DocumentDB (with MongoDB compatibility) in this fashion is a common pattern.
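As a compact illustration of this pattern, the following sketch stores an object in Amazon S3 and records its searchable metadata, together with the S3 reference, in a MongoDB-compatible database. The bucket name, collection name, file, and metadata fields are hypothetical placeholders, and the database endpoint is assumed to be reachable at localhost; the rest of this post builds the pattern out step by step.

import boto3
import pymongo

s3 = boto3.client("s3")
db = pymongo.MongoClient(host="localhost")["assets"]  # hypothetical database

# Store the object itself in Amazon S3.
s3.put_object(Bucket="example-bucket", Key="report.pdf",
              Body=open("report.pdf", "rb").read())

# Store the searchable metadata, plus the S3 reference, in the database.
db["files"].insert_one({"bucket": "example-bucket", "key": "report.pdf",
                        "author": "jdoe", "keywords": ["q3", "finance"]})

# Later: query by metadata, then fetch the bytes from Amazon S3.
doc = db["files"].find_one({"author": "jdoe"})
data = s3.get_object(Bucket=doc["bucket"], Key=doc["key"])["Body"].read()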

GridFS is a file system that has been implemented on top of the MongoDB NoSQL database. In this post, I demonstrate how to replace the GridFS file system with Amazon S3. GridFS provides some nonstandard extensions to the typical file system (such as adding searchable metadata for the files) with MongoDB-like APIs, and I further demonstrate how to use Amazon S3 and Amazon DocumentDB to handle these additional use cases.

Solution overview

For this post, I start with some basic operations against a GridFS file system set up on a MongoDB instance. I demonstrate operations using the Python driver, pymongo, but the same operations exist in other MongoDB client drivers. I use an Amazon Elastic Compute Cloud (Amazon EC2) instance that has MongoDB installed; I log in to this instance and use Python to connect locally.

To demonstrate how this can be done with AWS services, I use Amazon S3 and an Amazon DocumentDB cluster for the more advanced use cases. I also use AWS Secrets Manager to store the credentials for logging into Amazon DocumentDB.

An AWS CloudFormation template is provided to provision the necessary components. It deploys the following resources:

  • A VPC with three private subnets and one public subnet
  • An Amazon DocumentDB cluster
  • An EC2 instance with the MongoDB tools installed and running
  • A secret in Secrets Manager to store the database credentials
  • Security groups to allow the EC2 instance to communicate with the Amazon DocumentDB cluster

The only prerequisite for this template is an EC2 key pair for logging into the EC2 instance. For more information, see Create or import a key pair. The following diagram illustrates the components in the template. This CloudFormation template incurs costs, and you should consult the relevant pricing pages before launching it.

Architecture diagram of the solution. It shows an EC2 instance in a public subnet that interacts with the client, AWS Secrets Manager, and an Amazon DocumentDB cluster in a private subnet.

Initial setup

First, launch the CloudFormation stack using the template. For more information on how to do this via the AWS CloudFormation console or the AWS Command Line Interface (AWS CLI), see Working with stacks. Provide the following inputs for the CloudFormation template (a sample programmatic launch follows the list):

  • Stack name
  • Instance type for the Amazon DocumentDB cluster (default: db.r5.large)
  • Master username for the Amazon DocumentDB cluster
  • Master password for the Amazon DocumentDB cluster
  • EC2 instance type for the MongoDB database and the machine to use for this example (default: m5.large)
  • EC2 key pair to use to access the EC2 instance
  • SSH location to allow access to the EC2 instance
  • Username to use with MongoDB
  • Password to use with MongoDB
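If you prefer to launch the stack programmatically instead of through the console, the following sketch uses the boto3 CloudFormation API. The template file name and the parameter keys shown here are hypothetical placeholders; match them to the parameter names defined in the template you downloaded, and supply the remaining inputs from the list above in the same way.

import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")  # your Region
cfn.create_stack(
    StackName="docdb-mongo",
    TemplateBody=open("gridfs-to-s3.yaml").read(),  # hypothetical file name
    Parameters=[  # parameter keys are illustrative; use the template's actual names
        {"ParameterKey": "DocDBInstanceType", "ParameterValue": "db.r5.large"},
        {"ParameterKey": "EC2InstanceType", "ParameterValue": "m5.large"},
        {"ParameterKey": "KeyName", "ParameterValue": "my-key-pair"},
    ],
    # Add Capabilities=["CAPABILITY_IAM"] if the template creates IAM resources.
)
cfn.get_waiter("stack_create_complete").wait(StackName="docdb-mongo")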

After the stack has completed provisioning, I log in to the EC2 instance using my key pair. The hostname for the EC2 instance is reported in the ClientEC2InstancePublicDNS output from the CloudFormation stack. For more information, see Connect to your Linux instance.

I use a few simple files for these examples. After I log in to the EC2 instance, I create five sample files as follows:

cd /home/ec2-user
echo Hello World! > /home/ec2-user/hello.txt
echo Bye World! > /home/ec2-user/bye.txt
echo Goodbye World! > /home/ec2-user/goodbye.txt
echo Bye Bye World! > /home/ec2-user/byebye.txt
echo So Long World! > /home/ec2-user/solong.txt

Basic operations with GridFS

In this section, I walk through some basic operations using GridFS against the MongoDB database running on the EC2 instance. All the following commands for this demonstration are available in a single Python script. Before using it, make sure to replace the username and password to access the MongoDB database with the ones you provided when launching the CloudFormation stack. I use the Python shell. To start the Python shell, run the following code:

$ python3
Python 3.7.9 (default, Aug 27 2020, 21:59:41)
[GCC 7.3.1 20180712 (Red Hat 7.3.1-9)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

Next, we import a few packages we need:

>>> import pymongo
>>> import gridfs

Next, we connect to the local MongoDB database and create the GridFS object. The CloudFormation template created a MongoDB username and password based on the parameters entered when launching the stack. For this example, I use labdb for the username and labdbpwd for the password, but you should replace those with the parameter values you provided. We use the gridfs database to store the GridFS data and metadata:

>>> mongo_client = pymongo.MongoClient(host="localhost")
>>> mongo_client["admin"].authenticate(name="labdb", password="labdbpwd")

Now that we have connected to MongoDB, we create a few objects. The first, db, represents the MongoDB database we use for our GridFS, namely gridfs. Next, we create a GridFS file system object, fs, that we use to perform GridFS operations. This GridFS object takes as an argument the MongoDB database object that was just created.

>>> db = mongo_client.gridfs
>>> fs = gridfs.GridFS(db)

Now that this setup is complete, list the files in the GridFS file system:

>>> print(fs.list())
[]

We can see that there are no files in the file system. Next, insert one of the files we created earlier:

>>> h = fs.put(open("/home/ec2-user/hello.txt", "rb").read(), filename="hello.txt")

This put command returns an ObjectId that identifies the file that was just inserted. I save this ObjectId in the variable h. We can show the value of h as follows:

>>> h
ObjectId('601b1da5fd4a6815e34d65f5')

Now when you list the files, you see the file we just inserted:

>>> print(fs.list())
['hello.txt']

Insert another file that you created earlier and list the files:

>>> b = fs.put(open("/home/ec2-user/bye.txt", "rb").read(), filename="bye.txt")
>>> print(fs.list())
['bye.txt', 'hello.txt']

Read the first file you inserted. One way to read the file is by the ObjectId:

>>> print(fs.get(h).read())
b'Hello World!\n'

GridFS also allows searching for files, for example by filename:

>>> res = fs.find({"filename": "hello.txt"})
>>> print(res.count())
1

We can see one file with the name hello.txt. The result is a cursor to iterate over the files that were returned. To get the first file, call the next() method:

>>> res0 = res.next()
>>> res0.read()
b'Hello World!\n'

Next, delete the hello.txt file. To do this, use the ObjectId of the res0 file object, which is accessible via the _id field:

>>> fs.delete(res0._id)
>>> print(fs.list())
['bye.txt']

Only one file is now in the file system.

Next, overwrite the bye.txt file with different data, in this case the goodbye.txt file contents:

>>> hb = fs.put(open("/home/ec2-user/goodbye.txt", "rb").read(), filename="bye.txt")
>>> print(fs.list())
['bye.txt']

This overwrite doesn't actually delete the previous version. GridFS is a versioned file system and keeps older versions unless you specifically delete them. So, when we find files with the filename bye.txt, we see two files:

>>> res = fs.find({"filename": "bye.txt"})
>>> print(res.count())
2

GridFS allows us to get specific versions of the file, via the get_version() method. By default, this returns the most recent version. Versions are numbered sequentially, starting at 0, so we can access the original version by specifying version 0. We can also access the most recent version by specifying version -1. First, the default, most recent version:

>>> x = fs.get_version(filename="bye.txt")
>>> print(x.read())
b'Goodbye World!\n'

Next, the first version:

>>> x0 = fs.get_version(filename="bye.txt", version=0)
>>> print(x0.read())
b'Bye World!\n'

The following code retrieves the second version:

>>> x1 = fs.get_version(filename="bye.txt", version=1)
>>> print(x1.read())
b'Goodbye World!\n'

The following code retrieves the latest version, which is the same as not specifying a version, as we saw earlier:

>>> xlatest = fs.get_version(filename="bye.txt", version=-1)
>>> print(xlatest.read())
b'Goodbye World!\n'

An interesting feature of GridFS is the ability to attach metadata to the files. The API allows for adding any keys and values as part of the put() operation. In the following code, we add a key-value pair with the key somekey and the value somevalue:

>>> bb = fs.put(open("/home/ec2-user/byebye.txt", "rb").read(), filename="bye.txt", somekey="somevalue")
>>> c = fs.get_version(filename="bye.txt")
>>> print(c.read())
b'Bye Bye World!\n'

We can access the custom metadata as a field of the file:

>>> print(c.somekey)
somevalue

Now that we have the metadata attached to the file, we can search for files with specific metadata:

>>> sk0 = fs.find({"somekey": "somevalue"}).next()

We can retrieve the value for the key somekey from the following result:

>>> print(sk0.somekey)
somevalue

We can also return multiple documents via this approach. In the following code, we insert another file with the somekey attribute, and then we can see that two files have the somekey attribute defined:

>>> h = fs.put(open("/home/ec2-user/solong.txt", "rb").read(), filename="solong.txt", somekey="someothervalue", key2="value2")
>>> print(fs.find({"somekey": {"$exists": True}}).count())
2

Basic operations with Amazon S3

In this section, I show how to get the equivalent functionality of GridFS using Amazon S3. There are some subtle differences in terms of unique identifiers and the shape of the returned objects, so it's not a drop-in replacement for GridFS. However, the Amazon S3 APIs cover the major functionality of GridFS. I walk through the same operations as in the previous section, except using Amazon S3 instead of GridFS.

First, we create an S3 bucket to store the files. For this example, I use a bucket named blog-gridfs. You need to choose a different name for your bucket, because bucket names are globally unique. For this demonstration, we also want to enable versioning for the bucket, which allows Amazon S3 to behave similarly to GridFS with respect to versioning files.
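You can create the bucket and turn on versioning through the Amazon S3 console, the AWS CLI, or programmatically. The following is a rough boto3 sketch; the bucket name is the one I chose, and us-east-1 is an assumption (other Regions require a CreateBucketConfiguration with a LocationConstraint).

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Create the bucket (bucket names are globally unique; use your own).
s3.create_bucket(Bucket="blog-gridfs")

# Enable versioning so overwrites keep prior versions, like GridFS does.
s3.put_bucket_versioning(
    Bucket="blog-gridfs",
    VersioningConfiguration={"Status": "Enabled"},
)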

As with the previous section, the following commands are included in a single Python script, but I walk through these commands one by one. Before using the script, make sure to replace the secret name with the one created by the CloudFormation stack, as well as the Region you’re using, and the S3 bucket you created.

First, we import the package we need:

>>> import boto3

Next, we connect to Amazon S3 and create the S3 client:

>>> session = boto3.Session()
>>> s3_client = session.client('s3')

It’s convenient to store the name of the bucket we created in a variable. Set the bucket variable appropriately:

>>> bucket = "blog-gridfs"

Now that this setup is complete, we list the files in the S3 bucket:

>>> s3_client.list_objects(Bucket=bucket)
{'ResponseMetadata': {'RequestId': '031B62AE7E916762', 'HostId': 'UO/3dOVHYUVYxyrEPfWgVYyc3us4+0NRQICA/mix//ZAshlAwDK5hCnZ+/wA736x5k80gVcyZ/w=', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amz-id-2': 'UO/3dOVHYUVYxyrEPfWgVYyc3us4+0NRQICA/mix//ZAshlAwDK5hCnZ+/wA736x5k80gVcyZ/w=', 'x-amz-request-id': '031B62AE7E916762', 'date': 'Wed, 03 Feb 2021 22:37:12 GMT', 'x-amz-bucket-region': 'us-east-1', 'content-type': 'application/xml', 'transfer-encoding': 'chunked', 'server': 'AmazonS3'}, 'RetryAttempts': 0}, 'IsTruncated': False, 'Marker': '', 'Name': 'blog-gridfs', 'Prefix': '', 'MaxKeys': 1000, 'EncodingType': 'url'}

The output is more verbose than the GridFS listing, but we're most interested in the Contents field, which is an array of objects. In this example, it's absent, which indicates an empty bucket. Next, insert one of the files we created earlier:

>>> h = s3_client.put_object(Body=open("/home/ec2-user/hello.txt", "rb").read(), Bucket=bucket, Key="hello.txt")

This put_object command takes three parameters:

  • Body – The bytes to write
  • Bucket – The name of the bucket to upload to
  • Key – The file name

The key can be more than just a file name; it can also include subdirectories, such as subdir/hello.txt. The put_object command returns information acknowledging the successful insertion of the file, including the VersionId:

>>> h
{'ResponseMetadata': {'RequestId': 'EDFD20568177DD45', 'HostId': 'sg8q9KNxa0J+4eQUMVe6Qg2XsLiTANjcA3ElYeUiJ9KGyjsOe3QWJgTwr7T3GsUHi3jmskbnw9E=', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amz-id-2': 'sg8q9KNxa0J+4eQUMVe6Qg2XsLiTANjcA3ElYeUiJ9KGyjsOe3QWJgTwr7T3GsUHi3jmskbnw9E=', 'x-amz-request-id': 'EDFD20568177DD45', 'date': 'Wed, 03 Feb 2021 22:39:19 GMT', 'x-amz-version-id': 'ADuqSQDju6BJHkw86XvBgIPKWalQMDab', 'etag': '"8ddd8be4b179a529afa5f2ffae4b9858"', 'content-length': '0', 'server': 'AmazonS3'}, 'RetryAttempts': 0}, 'ETag': '"8ddd8be4b179a529afa5f2ffae4b9858"', 'VersionId': 'ADuqSQDju6BJHkw86XvBgIPKWalQMDab'}

Now if we list the files, we see the file we just inserted:

>>> list = s3_client.list_objects(Bucket=bucket)
>>> print([i["Key"] for i in list["Contents"]])
['hello.txt']
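As a side note, because list_objects omits the Contents field when the bucket is empty (as we saw earlier), code that indexes into it directly raises a KeyError on an empty bucket. A more defensive variant of the listing above falls back to an empty list:

>>> objects = s3_client.list_objects(Bucket=bucket).get("Contents", [])
>>> print([obj["Key"] for obj in objects])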

Next, insert the other file we created earlier and list the files:

>>> b = s3_client.put_object(Body=open("/home/ec2-user/bye.txt", "rb").read(), Bucket=bucket, Key="bye.txt")
>>> print([i["Key"] for i in s3_client.list_objects(Bucket=bucket)["Contents"]])
['bye.txt', 'hello.txt']

Read the first file. In Amazon S3, use the bucket and key to get the object. The Body field is a streaming object that can be read to retrieve the contents of the object:

>>> s3_client.get_object(Bucket=bucket, Key="hello.txt")["Body"].read()
b'Hello World!\n'

Similar to GridFS, Amazon S3 also allows you to search for files by file name. In the Amazon S3 API, you can specify a prefix that is used to match against the key for the objects:

>>> print([i["Key"] for i in s3_client.list_objects(Bucket=bucket, Prefix="hello.txt")["Contents"]])
['hello.txt']

We can see one file with the name hello.txt.

Next, delete the hello.txt file. To do this, we use the bucket and file name, or key:

>>> s3_client.delete_object(Bucket=bucket, Key="hello.txt")
{'ResponseMetadata': {'RequestId': '56C082A6A85F5036', 'HostId': '3fXy+s1ZP7Slw5LF7oju5dl7NQZ1uXnl2lUo1xHywrhdB3tJhOaPTWNGP+hZq5571c3H02RZ8To=', 'HTTPStatusCode': 204, 'HTTPHeaders': {'x-amz-id-2': '3fXy+s1ZP7Slw5LF7oju5dl7NQZ1uXnl2lUo1xHywrhdB3tJhOaPTWNGP+hZq5571c3H02RZ8To=', 'x-amz-request-id': '56C082A6A85F5036', 'date': 'Wed, 03 Feb 2021 22:45:57 GMT', 'x-amz-version-id': 'rVpCtGLillMIc.I1Qz0PC9pomMrhEBGd', 'x-amz-delete-marker': 'true', 'server': 'AmazonS3'}, 'RetryAttempts': 0}, 'DeleteMarker': True, 'VersionId': 'rVpCtGLillMIc.I1Qz0PC9pomMrhEBGd'}
>>> print([i["Key"] for i in s3_client.list_objects(Bucket=bucket)["Contents"]])
['bye.txt']

The bucket now only contains one file.

Let’s overwrite the bye.txt file with different data, in this case the goodbye.txt file contents:

>>> hb = s3_client.put_object(Body=open("/home/ec2-user/goodbye.txt", "rb").read(), Bucket=bucket, Key="bye.txt")
>>> print([i["Key"] for i in s3_client.list_objects(Bucket=bucket)["Contents"]])
['bye.txt']

Similar to GridFS, with versioning turned on in Amazon S3, an overwrite doesn’t actually delete the previous version. Amazon S3 keeps older versions unless you specifically delete them. So, when we list the versions of the bye.txt object, we see two files:

>>> y = s3_client.list_object_versions(Bucket=bucket, Prefix="bye.txt")
>>> versions = sorted([(i["Key"],i["VersionId"],i["LastModified"]) for i in y["Versions"]], key=lambda y: y[2])
>>> print(len(versions))
2

As with GridFS, Amazon S3 allows us to get specific versions of the file, via the get_object() method. By default, this returns the most recent version. Unlike GridFS, versions in Amazon S3 are identified with a unique identifier, VersionId, not a counter. We can get the versions of the object and sort them based on their LastModified field. We can access the original version by specifying the VersionId of the first element in the sorted list. We can also access the most recent version by not specifying a VersionId:

>>> x0 = s3_client.get_object(Bucket=bucket, Key="bye.txt", VersionId=versions[0][1])
>>> print(x0["Body"].read())
b'Bye World!\n'
>>> x1 = s3_client.get_object(Bucket=bucket, Key="bye.txt", VersionId=versions[1][1])
>>> print(x1["Body"].read())
b'Goodbye World!\n'
>>> xlatest = s3_client.get_object(Bucket=bucket, Key="bye.txt")
>>> print(xlatest["Body"].read())
b'Goodbye World!\n'

Similar to GridFS, Amazon S3 provides the ability to attach metadata to the files. The API allows for adding any keys and values as part of the Metadata field in the put_object() operation. In the following code, we add a key-value pair with the key somekey and the value somevalue:

>>> bb = s3_client.put_object(Body=open("/home/ec2-user/byebye.txt", "rb").read(), Bucket=bucket, Key="bye.txt", Metadata={"somekey": "somevalue"})
>>> c = s3_client.get_object(Bucket=bucket, Key="bye.txt")

We can access the custom metadata via the Metadata field:

>>> print(c["Metadata"]["somekey"])
somevalue

We can also print the contents of the file:

>>> print(c["Body"].read())
b'Bye Bye World!\n'

One limitation with Amazon S3 versus GridFS is that you can’t search for objects based on the metadata. To accomplish this use case, we employ Amazon DocumentDB.

Use cases with Amazon S3 and Amazon DocumentDB

Some use cases may require you to find objects or files based on the metadata, beyond just the file name. For example, in an asset management use case, we may want to record the author or a list of keywords. To do this, we can use Amazon S3 and Amazon DocumentDB together to provide a very similar developer experience while leveraging the power of a purpose-built document database and a purpose-built object store. In this section, I walk through how to use these two services to cover the additional use case of finding files based on their metadata.

First, we import a few packages:

>>> import json
>>> import pymongo
>>> import boto3

We use the credentials that we created when we launched the CloudFormation stack. These credentials were stored in Secrets Manager. The name of the secret is the stack name you used when you created the stack (for this post, docdb-mongo) with -DocDBSecret appended, that is, docdb-mongo-DocDBSecret. We assign this to a variable. Use the Secrets Manager secret name appropriate for your stack:

>>> secret_name = 'docdb-mongo-DocDBSecret'

Next, we create a Secrets Manager client and retrieve the secret. Make sure to set the region variable to the Region in which you deployed the stack:

>>> region = "us-east-1"  # replace with the Region where you deployed the stack
>>> secret_client = session.client(service_name='secretsmanager', region_name=region)
>>> secret = json.loads(secret_client.get_secret_value(SecretId=secret_name)['SecretString'])

This secret contains the four pieces of information that we need to connect to the Amazon DocumentDB cluster:

  • Cluster endpoint
  • Port
  • Username
  • Password

Next, we connect to the Amazon DocumentDB cluster:

>>> docdb_client = pymongo.MongoClient(host=secret["host"], port=secret["port"], ssl=True, ssl_ca_certs="/home/ec2-user/rds-combined-ca-bundle.pem", replicaSet='rs0', connect = True)
>>> docdb_client["admin"].authenticate(name=secret["username"], password=secret["password"])
True

We use the database fs and the collection files to store our file metadata:

>>> docdb_db = docdb_client["fs"]
>>> docdb_coll = docdb_db["files"]

Because we already have data in the S3 bucket, we create entries in the Amazon DocumentDB collection for those files. The information we store is analogous to the information in the GridFS fs.files collection, namely the following:

  • bucket – The S3 bucket
  • filename – The S3 key
  • version – The S3 VersionId
  • length – The file length in bytes
  • uploadDate – The S3 LastModified date

Additionally, any metadata that was stored with the objects in Amazon S3 is also added to the document in Amazon DocumentDB:

>>> for ver in s3_client.list_object_versions(Bucket=bucket)["Versions"]:
...   obj = s3_client.get_object(Bucket=bucket, Key=ver["Key"], VersionId=ver["VersionId"])
...   to_insert = {"bucket": bucket, "filename": ver["Key"], "version": ver["VersionId"], "length": obj["ContentLength"], "uploadDate": obj["LastModified"]}
...   to_insert.update(obj["Metadata"])
...   docdb_coll.insert_one(to_insert)
...
<pymongo.results.InsertOneResult object at 0x7f452ce88cd0>
<pymongo.results.InsertOneResult object at 0x7f452ce8bf00>
<pymongo.results.InsertOneResult object at 0x7f452ce84eb0>
<pymongo.results.InsertOneResult object at 0x7f452ce840f0>

Now we can find files by their metadata:

>>> sk0 = docdb_coll.find({"somekey": "somevalue"}).next()
>>> print(sk0["somekey"])
somevalue

To read the file itself, we can use the bucket, file name, and version to retrieve the object from Amazon S3:

>>> print(s3_client.get_object(Bucket=sk0["bucket"], Key=sk0["filename"], VersionId=sk0["version"])["Body"].read())
b'Bye Bye World!\n'

Now we can put another file with additional metadata. To do this, we write the file to Amazon S3 and insert the metadata into Amazon DocumentDB:

>>> h = s3_client.put_object(Body=open("/home/ec2-user/solong.txt", "rb").read(), Bucket=bucket, Key="solong.txt")
>>> docdb_coll.insert_one({"bucket": bucket, "filename": "solong.txt", "version": h["VersionId"], "somekey": "someothervalue", "key2": "value2"})
<pymongo.results.InsertOneResult object at 0x7f452dcd01e0>

Finally, we can search for files with somekey defined, as we did with GridFS, and see that two files match:

>>> print(docdb_coll.find({"somekey": {"$exists": True}}).count())
2

Clean up

You can delete the resources created by the CloudFormation stack by deleting the stack via the AWS CloudFormation console or the AWS CLI. The S3 bucket in this post was created outside the stack, so remember to empty and delete it separately to avoid ongoing storage charges.
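If you prefer to clean up programmatically, the following sketch deletes the stack with boto3 and then empties and removes the manually created S3 bucket. The stack and bucket names are the ones used in this post, and us-east-1 is an assumption; substitute your own values.

import boto3

# Delete the CloudFormation stack and wait for the deletion to finish.
cfn = boto3.client("cloudformation", region_name="us-east-1")  # your Region
cfn.delete_stack(StackName="docdb-mongo")
cfn.get_waiter("stack_delete_complete").wait(StackName="docdb-mongo")

# The S3 bucket was created outside the stack. A versioned bucket must be
# emptied of all object versions and delete markers before it can be deleted.
bucket = boto3.resource("s3").Bucket("blog-gridfs")
bucket.object_versions.delete()
bucket.delete()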

Conclusion

Storing large objects inside a database is typically not the best architectural choice. Instead, coupling a distributed object store, such as Amazon S3, with the database provides a more architecturally sound solution. Storing the metadata in the database and a reference to the location of the object in the object store allows for efficient query and retrieval operations, while reducing the strain on the database for serving object storage operations.

In this post, I demonstrated how to use Amazon S3 and Amazon DocumentDB in place of MongoDB's GridFS. I leveraged Amazon S3, a purpose-built object store, and Amazon DocumentDB, a fast, scalable, highly available, and fully managed document database service that supports MongoDB workloads.

For more information about recent launches and blog posts, see Amazon DocumentDB (with MongoDB compatibility) resources.


About the author

Brian Hess is a Senior Solution Architect Specialist for Amazon DocumentDB (with MongoDB compatibility) at AWS. He has been in the data and analytics space for over 20 years and has extensive experience with relational and NoSQL databases.