AWS Big Data Blog

Synchronize your AWS Glue Studio Visual Jobs to different environments 

June 2023: This post was reviewed and updated for accuracy.

AWS Glue has become a popular option for integrating data from disparate data sources due to its ability to integrate large volumes of data using distributed data processing frameworks. Many customers use AWS Glue to build data lakes and data warehouses. Data engineers who prefer to develop data processing pipelines visually using AWS Glue Studio to create data integration jobs. This post introduces Glue Visual Job API to author the Glue Studio Visual Jobs programmatically, and Glue Job Sync utility that uses the API to easily synchronize Glue jobs to different environments without losing the visual representation.

Glue Job Visual API

AWS Glue Studio has a graphical interface called Visual Editor that makes it easy to author extract, transform, and load (ETL) jobs in AWS Glue. The Glue jobs created in the Visual Editor contain its visual representation that composes data transformation. In this post, we call the jobs Glue Studio Visual Jobs.

For example, it’s common to develop and test AWS Glue jobs in a dev account, and then promote the jobs to a prod account. Previously, when you copied the AWS Glue Studio Visual jobs to a different environment, there was no mechanism to copy the visual representation together. This means that the visual representation of the job was lost and you could only copy the code produced with Glue Studio. It can be time consuming and tedious to either copy the code or recreate the job.

AWS Glue Job Visual API lets you programmatically create and update Glue Studio Visual Jobs by providing a JSON object that indicates visual representation, and also retrieve the visual representation from existing Glue Studio Visual Jobs. A Glue Studio Visual Job consists of data source nodes for reading the data, transform nodes for modifying the data, and data target nodes for writing the data.

There are some typical use cases for Glue Visual Job API:

  • Automate creation of Glue Visual Jobs.
  • Migrate your ETL jobs from third-party or on-premises ETL tools to AWS Glue. Many AWS partners, such as Bitwise, Bladebridge, and others have built convertors from the third-party ETL tools to AWS Glue.
  • Synchronize AWS Glue Studio Visual jobs from one environment to another without losing visual representation.

In this post, we focus on a utility that uses Glue Job Visual APIs to achieve the mass synchronization of your Glue Studio Visual Jobs without losing the visual representation.

Glue Job Sync Utility

There are common requirements to synchronize the Glue Visual Jobs between different environments.

  • Promote Glue Visual Jobs from a dev account to a prod account.
  • Transfer ownership of Glue Visual Jobs between different AWS accounts.
  • Replicate Glue Visual Job configurations from one region to another for disaster recovery purpose.

Glue Job Sync Utility is built on top of Glue Visual Job API, and the utility lets you synchronize the jobs to different accounts without losing the visual representation. The Glue Job Sync Utility is a python application that enables you to synchronize your AWS Glue Studio Visual jobs to different environments using the new Glue Job Visual API. This utility requires that you provide source and target AWS environment profiles. Optionally, you can provide a list of jobs that you want to synchronize, and specify how the utility should replace your environment-specific objects using a mapping file. For example, Amazon Simple Storage Service (Amazon S3) locations in your development environment and role can be different than your production environment. The mapping config file will be used to replace the environment specific objects.

How to use Glue Job Sync Utility

In this example, we’re synchronizing two AWS Glue Studio Visual jobs, test1 and test2, from the development environment to the production environment in a different account.

  • Source environment (dev environment)
    • AWS Account ID: 123456789012
    • AWS Region: eu-west-3 (Paris)
    • AWS Glue Studio Visual jobs: test1, test2
    • AWS Identity and Access Management (IAM) Role ARN for Glue job execution role: arn:aws:iam::123456789012:role/GlueServiceRole
    • Amazon S3 bucket for Glue job script and other asset location: s3://aws-glue-assets-123456789012-eu-west-3/
    • Amazon S3 bucket for data location: s3://dev-environment/
  • Destination environment (prod environment)
    • AWS Account ID: 234567890123
    • AWS Region: eu-west-3 (Paris)
    • IAM Role ARN for Glue job execution role: arn:aws:iam::234567890123:role/GlueServiceRole
    • Amazon S3 bucket for Glue job script and other asset location: s3://aws-glue-assets-234567890123-eu-west-3/
    • Amazon S3 bucket for data location: s3://prod-environment/

Set up the utility in your local environment

You will need the following prerequisites for this utility:

  • Python 3.6 or later.
  • boto3 1.24.39 or later.
  • Create two AWS named profiles, dev and prod, with the corresponding credentials in your environment. Follow this instruction.

Download the Glue Job Sync Utility

Download the sync utility from the GitHub repository to your local machine.

Create AWS Glue Studio Visual Jobs

  1. Create two AWS Glue Studio Visual jobs, test1, and test2, in the source account.
    • If you don’t have any AWS Glue Studio Visual jobs, then follow this instruction to create the Glue Studio Visual jobs.

  2. Open AWS Glue Studio in the destination account and verify that the test1 and test2 jobs aren’t present.

Run the Job Sync Utility

  1. Create a new file named mapping.json, and enter the following JSON code. With the configuration in line 1, the sync utility will replace all of the Amazon S3 references within the job (in this case s3://aws-glue-assets-123456789012-eu-west-3) to the mapped location (in this case s3://aws-glue-assets-234567890123-eu-west-3). Then, the utility will create the job to the destination environment. Along these lines, line 2 and line 3 will trigger appropriate substitutions in the job. Note that these are example values and you’ll need to substitute the right values that match your environment.

    {
        "s3://aws-glue-assets-123456789012-eu-west-3": "s3://aws-glue-assets-234567890123-eu-west-3",
        "arn:aws:iam::123456789012:role/GlueServiceRole": "arn:aws:iam::234567890123:role/GlueServiceRole",
        "s3://dev-environment": "s3://prod-environment"
    }
  2. Execute the utility by running the following command:
    $ python3 sync.py --src-profile dev --src-region eu-west-3 --dst-profile prod --dst-region eu-west-3 --src-job-names test1,test2 --config-path mapping.json
  3. Verify successful synchronization by opening AWS Glue Studio in the destination account:
  4. Open the Glue Studio Visual jobs, test1, and test2, and verify the visual representation of the DAG.

The screenshot above shows that you were able to copy the jobs test1 and test2 while keeping DAG into the destination account.

Conclusion

AWS Glue Job Visual API and the AWS Glue Sync Utility simplify how you synchronize your jobs to different environments. These are designed to easily integrate into your Continuous Integration pipelines while retaining the visual representation that improves the readability of the ETL pipeline.


About the Authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for designing AWS features, implementing software artifacts, and helping customer architectures. In his spare time, he enjoys watching anime in Prime Video.

Aaron Meltzer is a Software Engineer on the AWS Glue Studio team. He leads the design and implementation of features to simplify the management of AWS Glue jobs. Outside of work, Aaron likes to read and learn new recipes.

Mohamed Kiswani is the Software Development Manager on the AWS Glue Team

Shiv Narayanan is a Senior Technical Product Manager on the AWS Glue team.