Use AWS Glue to run ETL jobs against non-native JDBC data sources

March 2023: This post was reviewed and updated for accuracy.

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easier to prepare and load your data for analytics. You can create and run an ETL job with a few clicks on the AWS Management Console. Just point AWS Glue to your data store. AWS Glue discovers your data and stores the associated metadata (for example, a table definition and schema) in the AWS Glue Data Catalog.

AWS Glue has native connectors to data sources using JDBC drivers, either on AWS or elsewhere, as long as there is IP connectivity. In this post, we demonstrate how to connect to data sources that are not natively supported in AWS Glue today. We walk through connecting to and running ETL jobs against two such data sources, IBM DB2 and SAP Sybase. However, you can use the same process with any other JDBC-accessible database.

AWS Glue data sources

AWS Glue can connect to the following data stores through a JDBC connection:

For more information, see Adding a Connection to Your Data Store in the AWS Glue Developer Guide.

One of the fastest growing architectures deployed on AWS is the data lake. The ETL processes that are used to ingest, clean, transform, and structure data are critically important for this architecture. Having the flexibility to interoperate with a broader range of database engines allows for a quicker adoption of the data lake architecture.

For data sources that AWS Glue doesn’t natively support, such as IBM DB2, Pivotal Greenplum, SAP Sybase, or any other relational database management system (RDBMS), you can import custom database connectors from Amazon S3 into AWS Glue jobs. In this case, the connection to the data source must be made from the AWS Glue script to extract the data, rather than using AWS Glue connections. To learn more, see Providing Your Own Custom Scripts in the AWS Glue Developer Guide.

Setting up an ETL job for an IBM DB2 data source

The first example demonstrates how to connect the AWS Glue ETL job to an IBM DB2 instance, transform the data from the source, and store it in Apache Parquet format in Amazon S3. To successfully create the ETL job using an external JDBC driver, you must define the following:

The S3 location of the job script
The S3 location of the temporary directory
The S3 location of the JDBC driver
The S3 location of the Parquet data (output)
The IAM role for the job

By default, AWS Glue suggests bucket names for the scripts and the temporary directory using the following format:

s3://aws-glue-scripts-<ACCOUNT_ID>-<REGION>/<USER>
s3://aws-glue-temporary-<ACCOUNT_ID>-<REGION>/<USER>

For the output Parquet data, you can create a similar location:

s3://aws-glue-data-output-<ACCOUNT_ID>-<REGION>/<USER>

You can create another S3 location for the JDBC driver:

s3://aws-glue-jdbc-drivers-<ACCOUNT_ID>-<REGION>/<USER>

Now it’s time to upload the JDBC driver to the defined location in Amazon S3. For this example, we will use the DB2 driver, which is available on the IBM Support site. Download the driver tar.gz file, and extract the db2jcc4.jar file into the S3 folder you just created.

Creating the IAM role

The next step is to set up the IAM role that the ETL job will use:

Create a policy to allow to access database credentials that are stored in AWS Secrets Manager. For that, click on Policies on the left side pane and then Create policy.

Click on the JSON tab and paste the following JSON snippet.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AccessSecretValue",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetSecretValue",
                "secretsmanager:DescribeSecret"
            ],
            "Resource": [
                "arn:aws:secretsmanager:<REGION>:<ACCOUNT_ID>:secret:DB2_Database_Connection_Info*",
                "arn:aws:secretsmanager:<REGION>:<ACCOUNT_ID>:secret:Sybase_Database_Connection_Info*"
            ]
        }
    ]
}

Replace the placeholders for <REGION> and <ACCOUNT_ID>, for example:
Click Review policy.
Give your policy a name, for example, GlueAccessSecretValue.
On the IAM console, choose Roles in the left navigation pane.
Choose Create role. The role type of trusted entity must be an AWS Service, specifically AWS Glue.

Choose Next: Permissions.
Search for the AWSGlueServiceRole policy, and select it.

Search again, now for the GlueAccessSecreateValue policy created before.

Select this policy, and choose Next: Tags and then, Next: Review.
Give your role a name, for example, GluePermissions, and confirm that both policies were selected.
Choose Create role.

Now that you have created the IAM role, it’s time to upload the JDBC driver to the defined location in Amazon S3. For this example, we will use the Db2 driver, which is available on the IBM Support site.

Storing database credentials

It is a best practice to store database credentials in a safe store. In this case, we use AWS Secrets Manager to securely store credentials. Follow these steps to create those credentials:

Open the console, and search for Secrets Manager.
In the AWS Secrets Manager console, choose Store a new secret.
Under Select a secret type, choose Other type of secrets.
In the Secret key/value, set one row for each of the following parameters:
- db_username
- db_password
- db_url (for example, jdbc:db2://10.10.12.12:50000/SAMPLE)
- db_table
- driver_name (com.ibm.db2.jcc.DB2Driver)
- output_bucket: (for example, aws-glue-data-output-1234567890-us-east-2/User)
Choose Next.
For Secret name, use DB2_Database_Connection_Info.
Choose Next.
Keep the Disable automatic rotation check box selected.
Choose Next.
Choose Store.

Adding a job in AWS Glue

The next step is to author the AWS Glue job, following these steps:

In the AWS Management Console, search for AWS Glue.
In the navigation pane on the left, choose Jobs under the Data Integration and ETL
In the Create Job selection window, choose Spark script editor. Click on the Create button.

In the editor, replace the existing code with the following script.

import sys
import boto3
import json
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job


## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Getting DB credentials from Secrets Manager
client = boto3.client("secretsmanager", region_name="us-east-2")

get_secret_value_response = client.get_secret_value(
        SecretId="DB2_Database_Connection_Info"
)

secret = get_secret_value_response['SecretString']
secret = json.loads(secret)

db_username = secret.get('db_username')
db_password = secret.get('db_password')
db_url = secret.get('db_url')
table_name = secret.get('db_table')
jdbc_driver_name = secret.get('driver_name')
s3_output = "s3://" + secret.get('output_bucket') + "/" + table_name

# Connecting to the source
df = glueContext.read.format("jdbc").option("driver", jdbc_driver_name).option("url", db_url).option("dbtable", table_name).option("user", db_username).option("password", db_password).load()

df.printSchema()
print(df.count())

datasource0 = DynamicFrame.fromDF(df, glueContext, "datasource0")

# Defining mapping for the transformation
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("EMPNO", "string", "EMPNO", "string"), ("FIRSTNME", "string", "FIRSTNME", "string"), ("MIDINIT", "string", "MIDINIT", "string"), ("LASTNAME", "string", "LASTNAME", "string"), ("WORKDEPT", "string", "WORKDEPT", "string"), ("PHONENO", "string", "PHONENO", "string"), ("HIREDATE", "date", "HIREDATE", "date"), ("JOB", "string", "JOB", "string"), ("EDLEVEL", "integer", "EDLEVEL", "integer"), ("SEX", "string", "SEX", "string"), ("BIRTHDATE", "date", "BIRTHDATE", "date"), ("SALARY", "double", "SALARY", "double"), ("BONUS", "double", "BONUS", "double"), ("COMM", "double", "COMM", "double")], transformation_ctx = "applymapping1")
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")

# Writing to destination
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": s3_output}, format = "parquet", transformation_ctx = "datasink4")

job.commit()

Select the Job Details tab. Give the job a name (for example, db2-job). Choose the IAM role that you created previously (GluePermissions).
Open up the Advanced Properties list. In the Libraries section, enter the location of your JDBC driver for Dependent JARs path.
Click on Save.

Running the ETL job

Now that you have created the job, you can immediately run the job by clicking on the Run button on the job page. You can also execute it as follows:

On the Jobs page, in the Your jobs section, select your new job. Click the Run job button.
Navigate to the Job page and choose Run tab. Wait until the Run status shows Succeeded.
Now if you go to your S3 output bucket, you can see the newly created parquet file:
You can then run a Glue crawler to pull this parquet file into Glue data catalog, and view the data from Athena.

Setting up an ETL job for an SAP Sybase data source

In this section, we describe how to create an AWS Glue ETL job against an SAP Sybase data source. The process mentioned in the previous section works for a Sybase data source with a few changes required in the job:

While creating the job, choose the correct jar for the JDBC dependency.

In the script, change the reference to the secret to be used from AWS Secrets Manager:

get_secret_value_response = client.get_secret_value(
        SecretId="Sybase_Database_Connection_Info"
)

And change the mapping of the fields in line 47, as follows:

applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("au_id", "string", "au_id", "string"), ("au_lname", "string", "au_lname", "string"), ("au_fname", "string", "au_fname", "string"), ("phone", "string", "phone", "string"), ("address", "string", "address", "string"), ("city", "string", "city", "string"), ("state", "string", "state", "string"), ("country", "string", "country", "string"), ("postalcode", "string", "postalcode", "string")], transformation_ctx = "applymapping1")

After you successfully execute the new ETL job, the output contains the same type of information that was generated with the DB2 data source.

Note that each of these JDBC drivers has its own nuances and different licensing terms that you should be aware of before using them.

Maximizing JDBC read parallelism

Something to keep in mind while working with big data sources is the memory consumption. In some cases, “Out of Memory” errors are generated when all the data is read into a single executor. One approach to optimize this is to rely on the parallelism on read that you can implement with Apache Spark and AWS Glue. To learn more, see the Apache Spark SQL module.

You can use the following options:

partitionColumn: The name of an integer column that is used for partitioning.
lowerBound: The minimum value of partitionColumn that is used to decide partition stride.
upperBound: The maximum value of partitionColumn that is used to decide partition stride.
numPartitions: The number of partitions. This, along with lowerBound (inclusive) and upperBound (exclusive), form partition strides for generated WHERE clause expressions used to split the partitionColumn When unset, this defaults to SparkContext.defaultParallelism.
Those options specify the parallelism of the table read. lowerBound and upperBound decide the partition stride, but they don’t filter the rows in the table. Therefore, Spark partitions and returns all rows in the table. For example:

df = glueContext.read.format("jdbc").option("driver", jdbc_driver_name).option("url", db_url).option("dbtable", table_name).option("user", db_username).option("password", db_password).option("partitionColumn", <column_name>).option("lowerBound", 1).option("upperBound", 100).option("numPartitions", 2).load()

It’s important to be careful with the number of partitions because too many partitions could also result in Spark crashing your external database systems.

Conclusion

Using the process described in this post, you can connect to and run AWS Glue ETL jobs against any data source that can be reached using a JDBC driver. This includes new generations of common analytical databases like Greenplum and others.

You can improve the query efficiency of these datasets by using partitioning and pushdown predicates. For more information, see Managing Partitions for ETL Output in AWS Glue. This technique opens the door to moving data and feeding data lakes in hybrid environments.

Additional Reading

If you found this post useful, be sure to check out Work with partitioned data in AWS Glue.

About the Authors

Kapil Shardha is a Technical Account Manager and supports enterprise customers with their AWS adoption. He has background in infrastructure automation and DevOps.

William Torrealba is an AWS Solutions Architect supporting customers with their AWS adoption. He has background in Application Development, High Available Distributed Systems, Automation, and DevOps.

Jun Shan is a Senior Solutions Architect at AWS. He works with customers to leverage the power of AWS to release their potentials to the maximum degree, enabling them to be more efficient, secure, reliable, and optimized for cost. Jun also teaches SQL and Relational Database in several New York City colleges and is the author of “SQL for Data Analytics, 3rd Edition”.