AWS Big Data Blog

Building AWS Glue Spark ETL jobs using Amazon DocumentDB (with MongoDB compatibility) and MongoDB

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load your data for analytics. AWS Glue has native connectors for supported data sources, both on AWS and elsewhere, using JDBC drivers. Additionally, AWS Glue now supports reading and writing to Amazon DocumentDB (with MongoDB compatibility) and MongoDB collections using AWS Glue Spark ETL jobs. This feature enables you to read data from and write data to Amazon DocumentDB and MongoDB collections, and to load that data into services such as Amazon Simple Storage Service (Amazon S3) and Amazon Redshift for downstream analytics. For more information, see Connection Types and Options for ETL in AWS Glue.

This post shows how to build AWS Glue Spark ETL jobs and set up connections with Amazon DocumentDB or MongoDB to read and load data using ConnectionType.

The following diagram illustrates the three components of the solution architecture:

Prerequisites

Before getting started, you must complete the following prerequisites:

  1. Create an AWS Identity and Access Management (IAM) user with sufficient permissions to interact with the AWS Management Console. Your IAM permissions must also allow you to create the IAM roles and policies that the AWS CloudFormation template provided in this post creates.
  2. Create an IAM policy for AWS Glue.
  3. Save the following code as DocumentDB-Glue-ETL.py in your S3 bucket.
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    import time
    
    ## @params: [JOB_NAME]
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)
    
    output_path = "s3://<bucket>/<folder>/" + str(time.time()) + "/"
    documentdb_uri = "mongodb://<host name>:27017"
    documentdb_write_uri = "mongodb://<host name>:27017"
    
    read_docdb_options = {
        "uri": documentdb_uri,
        "database": "test",
        "collection": "profiles",
        "username": "<username>",
        "password": "<password>",
        "ssl": "true",
        "ssl.domain_match": "false",
        "partitioner": "MongoSamplePartitioner",
        "partitionerOptions.partitionSizeMB": "10",
        "partitionerOptions.partitionKey": "_id"
    }
    
    write_documentdb_options = {
        "uri": documentdb_write_uri,
        "database": "test",
        "collection": "collection1",
        "username": "<username>",
        "password": "<password>",
        "ssl": "true",
        "ssl.domain_match": "false",
        "partitioner": "MongoSamplePartitioner",
        "partitionerOptions.partitionSizeMB": "10",
        "partitionerOptions.partitionKey": "_id"
    }
    
    # Get DynamicFrame from DocumentDB
    dynamic_frame2 = glueContext.create_dynamic_frame.from_options(connection_type="documentdb",
                                                                   connection_options=read_docdb_options)
    
    # Write DynamicFrame to DocumentDB
    glueContext.write_dynamic_frame.from_options(dynamic_frame2, connection_type="documentdb",
                                                 connection_options=write_documentdb_options)
    
    job.commit()
  4. Save the following code as MongoDB-Glue-ETL.py in your S3 bucket.
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    import time
    
    ## @params: [JOB_NAME]
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)
    
    output_path = "s3://<bucket>/<folder>/" + str(time.time()) + "/"
    mongo_uri = "mongodb://<host name or IP>:27017"
    write_uri = "mongodb://<host name or IP>:27017"
    
    read_mongo_options = {
        "uri": mongo_uri,
        "database": "test",
        "collection": "profiles",
        "username": "<username>",
        "password": "<password>",
        "partitioner": "MongoSamplePartitioner",
        "partitionerOptions.partitionSizeMB": "10",
        "partitionerOptions.partitionKey": "_id"}
    
    write_mongo_options = {
        "uri": write_uri,
        "database": "test",
        "collection": "collection1",
        "username": "<username>",
        "password": "<password>"
    }
    
    
    # Get DynamicFrame from MongoDB
    dynamic_frame = glueContext.create_dynamic_frame.from_options(connection_type="mongodb",
                                                                  connection_options=read_mongo_options)
    # Write DynamicFrame to MongoDB 
    glueContext.write_dynamic_frame.from_options(dynamic_frame, connection_type="mongodb", connection_options=write_mongo_options)
    
    job.commit()
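
Note that both scripts define output_path but, as written, only copy data between collections. If you also want to load the data into Amazon S3 for downstream analytics, a minimal sketch (reusing the output_path variable and the DynamicFrame from the scripts above) could look like the following:

# Hedged sketch: additionally write the DynamicFrame to the S3 path defined
# earlier, as JSON (use dynamic_frame2 in the DocumentDB script)
glueContext.write_dynamic_frame.from_options(
    dynamic_frame,
    connection_type="s3",
    connection_options={"path": output_path},
    format="json")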

Provisioning resources with AWS CloudFormation

For this post, we provide CloudFormation templates for you to review and customize to your needs. Some of the resources this stack deploys, such as the Amazon DocumentDB cluster and Amazon EC2 instances, incur costs as long as they remain in use.

For instructions on launching your stacks, see Launching an Amazon DocumentDB AWS CloudFormation Stack and MongoDB on the AWS Cloud: Quick Start Reference Deployment.

The Amazon DocumentDB stack creation can take up to 15 minutes, and MongoDB stack creation can take up to 60 minutes.

When stack creation is complete, go to the Outputs tab for the stack on the AWS CloudFormation console and note down the following values (you use these in later steps):

  • DocumentDB CloudFormation – ClusterEndpoint and ClusterPort
  • MongoDB CloudFormation – PrimaryReplicaNodeIp

Preparing your collection

When the CloudFormation stack is complete, use an EC2 instance to connect to your Amazon DocumentDB cluster. For instructions, see Install the mongo shell, Connect to your Amazon DocumentDB cluster, and Insert and query data.

For instructions on accessing Amazon DocumentDB from Amazon EC2 in the same VPC, see Connect Using Amazon EC2.

For more information about MongoDB, see Connect to MongoDB nodes and Testing MongoDB.

Before creating your AWS Glue ETL job, use the mongo shell to insert a few entries into a collection titled profiles. See the following code:

s0:PRIMARY> use test
s0:PRIMARY> db.profiles.insertMany([
            { "_id" : 1, "name" : "Matt", "status": "active", "level": 12, "score":202},
            { "_id" : 2, "name" : "Frank", "status": "inactive", "level": 2, "score":9},
            { "_id" : 3, "name" : "Karen", "status": "active", "level": 7, "score":87},
            { "_id" : 4, "name" : "Katie", "status": "active", "level": 3, "score":27}
            ])

You’re now ready to configure AWS Glue ETL jobs using Amazon DocumentDB and MongoDB ConnectionType.

Setting up AWS Glue connections

You set up two separate connections for Amazon DocumentDB and MongoDB when the databases are in two different VPCs (or if you deployed the databases using the provided CloudFormation template). Complete the following steps for both connections. We first walk you through the Amazon DocumentDB connection.

  1. On the AWS Glue console, under Databases, choose Connections.
  2. Choose Add connection.
  3. For Connection name, enter a name for your connection.
  4. If you have SSL enabled on your Amazon DocumentDB cluster (the CloudFormation template in this post enables it), select Require SSL connection.
  5. For Connection Type, choose Amazon DocumentDB.
  6. Choose Next.


  7. For Amazon DocumentDB URL, enter a URL using the output from the CloudFormation stack, such as mongodb://host:port/databasename (use the default port, 27017).
  8. For Username and Password, enter the credentials you entered as parameters when creating the CloudFormation stack.
  9. For VPC, choose the VPC in which you created the databases (Amazon DocumentDB and MongoDB).
  10. For Subnet, choose the subnet within your VPC.
  11. For Security groups, select your security group.
  12. Choose Next.


  13. Review the connection details and choose Finish.


Similarly, add the connection for MongoDB with the following changes to the steps:

  • If you used the CloudFormation template in this post, don’t select Require SSL connection for MongoDB
  • For Connection Type, choose MongoDB
  • For MongoDB URL, enter a URL using the output from the CloudFormation stack, such as mongodb://host:port/databasename (use the default port, 27017)
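
You can also create these connections programmatically. The following is a minimal boto3 sketch, assuming placeholder names, URL, and network IDs; in the AWS Glue API, both Amazon DocumentDB and MongoDB connections use the MONGODB connection type:

import boto3

glue = boto3.client("glue")

# Hedged sketch: all values below are placeholders
glue.create_connection(
    ConnectionInput={
        "Name": "docdb-connection",
        "ConnectionType": "MONGODB",
        "ConnectionProperties": {
            "CONNECTION_URL": "mongodb://<host>:27017/<databasename>",
            "USERNAME": "<username>",
            "PASSWORD": "<password>",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "<subnet-id>",
            "SecurityGroupIdList": ["<security-group-id>"],
            "AvailabilityZone": "<availability-zone>",
        },
    }
)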

Creating an AWS Glue endpoint, S3 endpoint, and security group

Before testing the connections, make sure you create an AWS Glue endpoint and S3 endpoint in the VPC in which the databases are created. Complete the following steps for both Amazon DocumentDB and MongoDB instances separately:

  1. To create your AWS Glue endpoint, on the Amazon VPC console, choose Endpoints.
  2. Choose Create endpoint.
  3. For Service Name, choose AWS Glue.
  4. Search for and select com.amazonaws.<region>.glue (for example, com.amazonaws.us-west-2.glue), using the Region where the database instance was created.
  5. For VPC, choose the VPC of the Amazon DocumentDB cluster.


  6. For Security group, select the security groups of the Amazon DocumentDB cluster.
  7. Choose Create endpoint.


  8. To create your S3 endpoint, on the Amazon VPC console, choose Endpoints.
  9. Choose Create endpoint.
  10. For Service Name, choose Amazon S3.
  11. Search for and select com.amazonaws.<region>.s3 (for example, com.amazonaws.us-west-2.s3), using the appropriate Region.
  12. For VPC, choose the VPC of the Amazon DocumentDB cluster.
  13. For Configure route tables, select the route table ID of the associated subnet of the database.


  14. Choose Create endpoint.


Similarly, add an AWS Glue endpoint and S3 endpoint for MongoDB with the following changes:

  • Choose the VPC of the MongoDB instance
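
If you prefer to script this step, the following is a minimal boto3 sketch for creating both endpoints (the VPC, subnet, route table, and security group IDs are placeholders):

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# Interface endpoint for AWS Glue (placeholder IDs)
ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="<vpc-id>",
    ServiceName="com.amazonaws.us-west-2.glue",
    SubnetIds=["<subnet-id>"],
    SecurityGroupIds=["<security-group-id>"],
)

# Gateway endpoint for Amazon S3, attached to the subnet's route table
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="<vpc-id>",
    ServiceName="com.amazonaws.us-west-2.s3",
    RouteTableIds=["<route-table-id>"],
)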

The security group attached to each database instance must include itself as a source in its inbound rules. Complete the following steps for both the Amazon DocumentDB and MongoDB instances separately:

  1. On the Security Groups page, choose Edit Inbound Rules.
  2. Choose Add rule.
  3. For Type, choose All traffic.
  4. For Source, choose the same security group.
  5. Choose Save rules.

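The same rule can be added with boto3; the following is a minimal sketch (the group ID is a placeholder) that allows all traffic from the security group to itself:

import boto3

ec2 = boto3.client("ec2")

# Hedged sketch: allow all traffic from the security group to itself
ec2.authorize_security_group_ingress(
    GroupId="<security-group-id>",
    IpPermissions=[{
        "IpProtocol": "-1",  # all protocols, all ports
        "UserIdGroupPairs": [{"GroupId": "<security-group-id>"}],
    }],
)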

The objective of this setup is to establish private connectivity between AWS Glue and the Amazon DocumentDB and MongoDB instances in the VPC, via the S3 endpoint, AWS Glue endpoint, and security group. You don't need to test the connections, because the AWS Glue job establishes them when it runs. At the time of writing, testing an AWS Glue connection isn't supported for Amazon DocumentDB connections.

Code for building the AWS Glue ETL job

The following sample code sets up a read connection with Amazon DocumentDB for your AWS Glue ETL job (PySpark):

read_docdb_options = {
    "uri": documentdb_uri,
    "database": "test",
    "collection": "profiles",
    "username": "<username>",
    "password": "<password>",
    "ssl": "true",
    "ssl.domain_match": "false",
    "partitioner": "MongoSamplePartitioner",
    "partitionerOptions.partitionSizeMB": "10",
    "partitionerOptions.partitionKey": "_id"
}

The following sample code sets up a write connection with Amazon DocumentDB for your AWS Glue ETL job (PySpark):

write_documentdb_options = {
    "uri": documentdb_write_uri,
    "database": "test",
    "collection": "collection1",
    "username": "<username>",
    "password": "<password>",
    "ssl": "true",
    "ssl.domain_match": "false",
    "partitioner": "MongoSamplePartitioner",
    "partitionerOptions.partitionSizeMB": "10",
    "partitionerOptions.partitionKey": "_id"
}

The following sample code creates an AWS Glue DynamicFrame by using the read and write connections for your AWS Glue ETL job (PySpark):

# Get DynamicFrame from DocumentDB
dynamic_frame2 = glueContext.create_dynamic_frame.from_options(connection_type="documentdb",
                                                               connection_options=read_docdb_options)

# Write DynamicFrame to DocumentDB
glueContext.write_dynamic_frame.from_options(dynamic_frame2, connection_type="documentdb",
                                             connection_options=write_documentdb_options)

Setting up AWS Glue ETL jobs

You’re now ready to set up your ETL job in AWS Glue. Complete the following steps for both Amazon DocumentDB and MongoDB instances separately:

  1. On the AWS Glue console, under ETL, choose Jobs.
  2. Choose Add job.
  3. For Job Name, enter a name.
  4. For IAM role, choose the IAM role you created as a prerequisite.
  5. For Type, choose Spark.
  6. For Glue version, choose the latest Glue version that supports Spark with Python 3.
  7. For This job runs, choose An existing script that you provide.
  8. Choose the Amazon S3 path where the script (DocumentDB-Glue-ETL.py) is stored.
  9. Under Advanced properties, enable Job bookmark.

Job bookmarks help AWS Glue maintain state information and prevent the reprocessing of old data. (A programmatic sketch for enabling bookmarks follows these steps.)

  10. Keep the remaining settings at their defaults and choose Next.
  11. For Connections, choose the Amazon DocumentDB connection you created.
  12. Choose Save job and edit scripts.
  13. Edit the following parameters:
    1. documentdb_uri or mongo_uri
    2. documentdb_write_uri or write_uri
    3. username
    4. password
    5. output_path
  14. Choose Run job.
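
If you create the job programmatically instead of on the console, the following minimal boto3 sketch (job name, role, script location, and connection name are placeholders) enables job bookmarks through DefaultArguments:

import boto3

glue = boto3.client("glue")

# Hedged sketch: create the ETL job with job bookmarks enabled (placeholder values)
glue.create_job(
    Name="documentdb-etl-job",
    Role="<glue-iam-role>",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://<bucket>/<folder>/DocumentDB-Glue-ETL.py",
        "PythonVersion": "3",
    },
    Connections={"Connections": ["<connection-name>"]},
    DefaultArguments={"--job-bookmark-option": "job-bookmark-enable"},
    GlueVersion="2.0",
)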

When the job is finished, validate the data loaded in the collection.
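
One way to validate is with a quick document count, assuming pymongo is installed on the EC2 instance you used earlier (for Amazon DocumentDB with SSL, the rds-combined-ca-bundle.pem CA file is also required; drop the TLS options for the MongoDB instance in this post):

from pymongo import MongoClient

# Hedged sketch: count the documents the job wrote to collection1 (placeholder values)
client = MongoClient(
    "mongodb://<host name>:27017",
    username="<username>",
    password="<password>",
    tls=True,
    tlsCAFile="rds-combined-ca-bundle.pem",  # Amazon DocumentDB CA bundle
)
print(client["test"]["collection1"].count_documents({}))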

Similarly, add the job for MongoDB with the following changes:

  • Choose the Amazon S3 path where the script (MongoDB-Glue-ETL.py) is stored
  • For Connections, choose the MongoDB connection you created
  • Change the parameters applicable to MongoDB (mongo_uri and write_uri)

Cleaning up

After you finish, delete the CloudFormation stack; some of the AWS resources deployed by the stack in this post incur a cost as long as they remain in use. Deleting the stack removes all the AWS resources it created. To do so, complete the following steps:

  1. On the AWS CloudFormation console, on the Stacks page, select the stack to delete. The stack must be currently running.
  2. On the stack details page, choose Delete.
  3. Choose Delete stack when prompted.

Additionally, delete the AWS Glue endpoint, S3 endpoint, AWS Glue connections, and AWS Glue ETL jobs.
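
The following is a minimal boto3 sketch for that cleanup (all names and IDs are placeholders):

import boto3

# Hedged sketch: remove the AWS Glue job and connection, then the VPC endpoints
glue = boto3.client("glue")
glue.delete_job(JobName="<job-name>")
glue.delete_connection(ConnectionName="<connection-name>")

ec2 = boto3.client("ec2")
ec2.delete_vpc_endpoints(VpcEndpointIds=["<glue-endpoint-id>", "<s3-endpoint-id>"])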

Summary

In this post, we showed you how to build AWS Glue Spark ETL jobs and set up connections using ConnectionType for Amazon DocumentDB and MongoDB databases deployed with AWS CloudFormation. You can use this solution to read data from Amazon DocumentDB or MongoDB, transform it, and write it to Amazon DocumentDB, MongoDB, or other targets such as Amazon S3 (queried with Amazon Athena), Amazon Redshift, Amazon DynamoDB, and Amazon OpenSearch Service.

If you have any questions or suggestions, please leave a comment.


About the Authors

Naresh Gautam is a Sr. Analytics Specialist Solutions Architect at AWS. His role is helping customers architect highly available, high-performance, and cost-effective data analytics solutions to empower customers with data-driven decision-making. In his free time, he enjoys meditation and cooking.

Srikanth Sopirala is a Sr. Analytics Specialist Solutions Architect at AWS. He is a seasoned leader with over 20 years of experience, who is passionate about helping customers build scalable data and analytics solutions to gain timely insights and make critical business decisions. In his spare time, he enjoys reading, spending time with his family and road biking.