AWS Big Data Blog

Automate ETL jobs between Amazon RDS for SQL Server and Azure Managed SQL using AWS Glue Studio

Nowadays, many customers follow a multi-cloud strategy. They might choose to use various cloud-managed services, such as Amazon Relational Database Service (Amazon RDS) for SQL Server and Azure SQL Managed Instances, to perform data analytics tasks, but still use traditional extract, transform, and load (ETL) tools to integrate and process the data. However, traditional ETL tools may require you to develop custom scripts, which makes ETL automation difficult.

In this post, I show you how to automate ETL jobs between Amazon RDS for SQL Server and Azure SQL Managed Instances using AWS Glue Studio, which is part of AWS Glue, a fully managed, serverless data integration service. AWS Glue Studio has a graphical interface that makes it easy to create, run, and monitor ETL jobs, and you can create a schedule to run your jobs at specific times.

Solution overview

To move data from one database to another, different services are available, either on premises or in the cloud. They vary in bandwidth limits, support for ongoing changes (change data capture, or CDC), handling of schema and table modifications, and other features. Beyond that, we need to apply advanced data transformations, and monitor and automate the ETL jobs. This is where AWS Glue Studio can help us facilitate these activities.

As shown in the following diagram, we use AWS Glue Studio as the middleware to pull data from the source database (in this case an Azure SQL Managed Instance), then create and automate the ETL job using one of the pre-built transformations in AWS Glue Studio. Finally, we load the data to the target database (in this case an RDS for SQL Server instance).

The solution workflow consists of the following steps:

  1. Create connections for the source and target databases.
  2. Create and run AWS Glue crawlers.
  3. Create and run an ETL job that transforms the data and loads it from source to target.
  4. Schedule the ETL job to run automatically.
  5. Monitor the ETL job.

Prerequisites

Complete the following prerequisite steps:

  1. Install SQL Server Management Studio (SSMS) or an equivalent client tool.
  2. Set up a VPN connection between Amazon Virtual Private Cloud (Amazon VPC) and the Azure private subnet.
  3. Create a security group for AWS Glue ENI in your VPC.
  4. Create an AWS Identity and Access Management (IAM) role for AWS Glue. For instructions, refer to Setting up IAM permissions for AWS Glue.
  5. Open the appropriate firewall ports in the Azure private subnet.
  6. Create a source database table (Azure SQL Managed Instance). You can deploy the Azure database instance using the following QuickStart. For testing purposes, I import the public AdventureWorks sample database and use the dbo.Employee table. See the following code:
    --Query table
    SELECT * FROM [AdventureWorksLT2019].[dbo].[Employee]

  7. Create the target database table (Amazon RDS for SQL Server). To deploy the RDS instance, refer to Create and Connect to a Microsoft SQL Server Database with Amazon RDS. You can create an empty database and table with the following statements. This is the table where the data coming from Azure will be stored.
--Create database
CREATE DATABASE AdventureWorksonRDS;
GO

--Create table
USE AdventureWorksonRDS;
GO
CREATE TABLE Employee
(EmpID INT NOT NULL,
EmpName VARCHAR(50) NOT NULL,
Designation VARCHAR(50) NULL,
Department VARCHAR(50) NULL,
JoiningDate DATETIME NULL,
CONSTRAINT [PK_Employee] PRIMARY KEY CLUSTERED (EmpID));
GO

--Query table
SELECT * FROM [AdventureWorksonRDS].[dbo].[Employee]

Create connections

The first step is to populate our AWS Glue Data Catalog with the schema information coming from our source and target data sources.

To do that, we first create connections. A connection is a Data Catalog object that stores connection information for a particular data store. Connections store login credentials, URI strings, VPC information, and more. Creating connections in the Data Catalog saves the effort of having to specify the connection details every time you create a crawler or job.
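
Connections can also be created programmatically, which is useful if you script your environment. The following is a minimal boto3 sketch of the kind of JDBC connection we create in the next section; the placeholder host, port, database, and credentials are assumptions you replace with your own values, and a VPC-attached connection in practice also needs PhysicalConnectionRequirements (subnet, security group):

# Minimal sketch: create a JDBC connection in the Data Catalog with boto3.
# Placeholders (<...>) are assumptions; replace them with your values.
import boto3

glue = boto3.client("glue")

glue.create_connection(
    ConnectionInput={
        "Name": "AzureSQLManaged",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:sqlserver://<host>:<port>;database=<db_name>",
            "USERNAME": "<user>",
            "PASSWORD": "<password>",
        },
        # For private connectivity, also set PhysicalConnectionRequirements
        # (SubnetId, SecurityGroupIdList, AvailabilityZone).
    }
)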

Create a connection for Azure SQL Managed Instance

To create the connection to our source database, complete the following steps:

  1. On the AWS Glue console, choose AWS Glue Studio.
  2. In the navigation pane of the AWS Glue Studio console, choose Connectors.
  3. Choose Create connection.
  4. For Name, enter AzureSQLManaged.
  5. For Connection type, choose JDBC.
  6. For JDBC URL, use the SQL Server syntax jdbc:protocol://host:port;database=db_name.

You can find the host and database name on the Azure SQL Managed Instance service console, on the Overview page. For this specific example, we use the following information for our Azure SQL Managed Instance:

    • Protocol – sqlserver
    • Host – adi-qa-sql-managed-instance-test.public.xxxxxxxxxxxx.database.windows.net
    • Port – 3342
    • Database name – AdventureWorksLT2019
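
Putting these values together, the JDBC URL for this example looks like the following (the host is partially masked, as it appears in the console):

    jdbc:sqlserver://adi-qa-sql-managed-instance-test.public.xxxxxxxxxxxx.database.windows.net:3342;database=AdventureWorksLT2019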

  7. Enter your user name and password.
  8. Choose Create connection.

Create a connection for Amazon RDS for SQL Server

To create a connection for our target data source, complete the following steps:

  1. On the AWS Glue Studio console, choose Connectors in the navigation pane.
  2. Choose Create connection.
  3. For Name, enter AWSRDSforSQL.
  4. For Connection type, choose Amazon RDS.
  5. For Database engine, choose Microsoft SQL Server.
  6. For Database instances, choose your RDS DB instance.
  7. For Database name, enter AdventureWorksonRDS.
  8. Enter your user name and password.
  9. Choose Create connection.

You can now see the two connections created in the Connections section.

Create and run AWS Glue crawlers

You can use a crawler to populate the AWS Glue Data Catalog with tables. This is the most common method used by AWS Glue users. A crawler can crawl multiple data stores in a single run. Upon completion, it updates the Data Catalog with the tables it found. The ETL jobs that you define in AWS Glue use these Data Catalog tables as sources and targets.

Create a crawler for Azure SQL Managed Instance

To create a crawler for our source database, complete the following steps:

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Choose Create crawler.
  3. If the data hasn’t been mapped into an AWS Glue table, select Not yet and choose Add a data source.
  4. For Data source, choose JDBC.
  5. For Connection, choose AzureSQLManaged.
  6. For Include path, specify the path of the database including the schema: AdventureWorksLT2019/dbo/%.
  7. Choose Add a JDBC data source.
  8. Choose Next.
  9. Choose the IAM role created as part of the prerequisites and choose Next.
  10. Choose Add database to create the target database in the AWS Glue Data Catalog.
  11. For Name, enter azuresqlmanaged_db.
  12. Choose Create database.
  13. For Target database, choose azuresqlmanaged_db.
  14. Choose Next.
  15. Review if everything looks correct and choose Create crawler.

Create a crawler for Amazon RDS for SQL Server

Repeat the crawler creation steps to create the crawler for the target RDS for SQL Server database, using the following information:

  • Crawler name – AmazonRDSSQL_Crawler
  • Data source – JDBC
  • Connection – AWSRDSforSQL
  • Include path – AdventureWorksonRDS/dbo/%
  • IAM role – AWSGlueServiceRoleDefault
  • Database name – amazonrdssql_db
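
If you prefer to automate this step, the following boto3 sketch creates the Data Catalog database and an equivalent crawler, assuming the connection and IAM role from the previous steps already exist:

# Minimal sketch: create the Data Catalog database and the RDS crawler.
# Assumes the AWSRDSforSQL connection and AWSGlueServiceRoleDefault role exist.
import boto3

glue = boto3.client("glue")

glue.create_database(DatabaseInput={"Name": "amazonrdssql_db"})

glue.create_crawler(
    Name="AmazonRDSSQL_Crawler",
    Role="AWSGlueServiceRoleDefault",
    DatabaseName="amazonrdssql_db",
    Targets={
        "JdbcTargets": [
            {
                "ConnectionName": "AWSRDSforSQL",
                "Path": "AdventureWorksonRDS/dbo/%",
            }
        ]
    },
)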

Run the crawlers

Now it’s time to run the crawlers.

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Select the crawlers you created and choose Run.
  3. When the crawlers are complete, choose Databases in the navigation pane. Here you can find the databases discovered by the crawlers.
  4. Choose Tables in the navigation pane and explore the tables the crawlers discovered. Note that the crawlers correctly identified the data type as SQL Server.
  5. Choose the table adventureworkslt2019_dbo_employee and review the schema created for the data source.
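
You can also start a crawler and wait for it to finish from code; a brief sketch:

# Minimal sketch: start a crawler and poll until it returns to READY.
# Add a timeout and error handling for production use.
import time
import boto3

glue = boto3.client("glue")

glue.start_crawler(Name="AmazonRDSSQL_Crawler")
while glue.get_crawler(Name="AmazonRDSSQL_Crawler")["Crawler"]["State"] != "READY":
    time.sleep(30)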

Create and run an ETL job

Now that we have crawled our source and target databases, and we have the data in the AWS Glue Data Catalog, we can create an ETL job to load and transform this data.

  1. On the AWS Glue Studio console, choose Jobs in the navigation pane.
  2. Select Visual with a blank canvas to use a visual interface to create our ETL jobs.
  3. Choose Create.
  4. On the Source menu, choose AWS Glue Data Catalog.
  5. On the Data source properties tab, specify the database and table (for this post, azuresqlmanaged_db and adventureworkslt2019_dbo_employee).
  6. On the Transform menu, choose Apply mapping to map the source fields to the target database.
  7. On the Transform tab, you can see the data fields to be loaded, and you can even drop some of them if needed.
  8. On the Target menu, choose AWS Glue Data Catalog.
  9. On the Data target properties tab, choose the database and table where you want to load the transformed data (for this post, amazonrdssql_db and adventureworksrds_dbo_employee).
  10. On the Job details tab, for Name, enter ETL_Azure_to_AWS.
  11. For IAM Role, choose the appropriate role.
  12. Choose Save.
  13. Choose Run to run the job.
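
Behind the scenes, AWS Glue Studio generates a PySpark script for this visual job. The following is a minimal sketch of an equivalent script; the mapping tuples assume the Employee columns from the prerequisites, and your generated script may differ in details such as transformation contexts:

# Minimal sketch of the ETL job as a Glue PySpark script (a simplified,
# assumed version of what Glue Studio generates for this visual job).
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the source table cataloged by the Azure SQL crawler
source = glueContext.create_dynamic_frame.from_catalog(
    database="azuresqlmanaged_db",
    table_name="adventureworkslt2019_dbo_employee",
)

# Map source fields to the target schema (columns assumed from the
# Employee table created in the prerequisites)
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("EmpID", "int", "EmpID", "int"),
        ("EmpName", "string", "EmpName", "string"),
        ("Designation", "string", "Designation", "string"),
        ("Department", "string", "Department", "string"),
        ("JoiningDate", "timestamp", "JoiningDate", "timestamp"),
    ],
)

# Write to the target table cataloged by the RDS crawler
glueContext.write_dynamic_frame.from_catalog(
    frame=mapped,
    database="amazonrdssql_db",
    table_name="adventureworksrds_dbo_employee",
)

job.commit()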

If the ETL job ran successfully, the data from the source database (Azure SQL Managed Instance) has been loaded into the target database (Amazon RDS for SQL Server). To confirm, connect to the target database using SSMS and query the previously empty table AdventureWorksonRDS/dbo.Employee. It should now contain the data coming from the Azure SQL Managed Instance.

--Query table
SELECT * FROM [AdventureWorksonRDS].[dbo].[Employee]

Schedule your ETL job

In AWS Glue Studio, you can create a schedule to have your jobs run at specific times. Note that a scheduled run reimports the full dataset each time; to load only new or changed data, you can enable AWS Glue job bookmarks for incremental loads. You can schedule your ETL jobs on an hourly, daily, weekly, monthly, or custom basis, depending on your needs. To schedule a job, complete the following steps:

  1. On the AWS Glue Studio console, navigate to the job you created.
  2. On the Schedules tab, choose Create schedule.
  3. For Name, enter a name (for example, dbo_employee_daily_load).
  4. Choose your preferred frequency, start hour, and minute of the hour. For this post, we schedule it daily at 3:00 UTC.
  5. For Description, enter an optional description.
  6. Choose Create schedule.

Confirm on the Schedules tab that the schedule was created and activated successfully.

You have now automated your ETL job to run at your desired frequency.
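
The same schedule can also be expressed as a Glue trigger with a cron expression; the following sketch recreates the daily 3:00 UTC run programmatically:

# Minimal sketch: a scheduled trigger equivalent to the console schedule.
import boto3

glue = boto3.client("glue")

glue.create_trigger(
    Name="dbo_employee_daily_load",
    Type="SCHEDULED",
    Schedule="cron(0 3 * * ? *)",  # every day at 03:00 UTC
    Actions=[{"JobName": "ETL_Azure_to_AWS"}],
    StartOnCreation=True,
)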

Monitor your ETL job

The job monitoring dashboard provides an overall summary of the job runs, with totals for the jobs with a status of Running, Canceled, Success, or Failed.

The Runs tab shows the jobs for the specified date range and filters. You can filter the jobs on additional criteria, such as status, worker type, job type, and job name.
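
If you want to feed run status into your own alerting or reporting, you can also retrieve it programmatically; a brief sketch:

# Minimal sketch: list recent runs of the job and print their status.
import boto3

glue = boto3.client("glue")

for run in glue.get_job_runs(JobName="ETL_Azure_to_AWS")["JobRuns"]:
    print(run["Id"], run["JobRunState"], run.get("ErrorMessage", ""))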

Conclusion

In this post, I went through the steps to automate ETL jobs using AWS Glue Studio, which is a user-friendly graphical interface to perform data integration tasks such as discovering and extracting data from various sources; enriching, cleaning, normalizing, and combining data; and loading and organizing data in databases, data warehouses, and data lakes. You can easily find and access this data using the AWS Glue Data Catalog. Data engineers and ETL developers can visually create, run, and monitor ETL workflows with a few clicks in AWS Glue Studio.


About the author

Daniel Maldonado is an AWS Solutions Architect, specializing in Microsoft workloads and big data technologies, and focused on helping customers migrate their applications and data to AWS. Daniel has over 13 years of experience working with information technologies and enjoys helping clients reap the benefits of running their workloads in the cloud.