Streamlining AWS Glue Studio visual jobs: Building an integrated CI/CD pipeline for seamless environment synchronization

Many Amazon Web Services (AWS) customers have integrated their data across multiple sources using AWS Glue, a serverless data integration service. By providing seamless integration throughout the development lifecycle, AWS Glue enables organizations to make data-driven business decisions.

AWS Glue Studio visual jobs provide a graphical interface called the visual editor that you can use to author extract, transform, and load (ETL) jobs in AWS Glue visually. The visual editor maintains a visual representation of a job as a graph of data sources, transformations, and data sinks. With its intuitive interface, you can create large-scale data integration jobs without coding expertise, which simplifies workflows and removes the need to write ETL scripts by hand.

As data engineers increasingly rely on the AWS Glue Studio visual editor to create data integration jobs, the need for a streamlined development lifecycle and seamless synchronization between environments has become paramount. Additionally, managing versions of visual directed acyclic graphs (DAGs) is crucial for tracking changes, collaboration, and maintaining consistency across environments.

This post introduces an end-to-end solution that addresses these needs by combining the AWS Glue Visual Job API, a custom AWS Glue Resource Sync Utility, and an AWS Cloud Development Kit (AWS CDK) based continuous integration and continuous deployment (CI/CD) pipeline.

A few common questions from our customers include:

What are the best practices for moving our workloads from a pre-production environment to production?
What are the recommended best practices for provisioning data integration components?
How can I build AWS Glue visual jobs in the development environment and automatically propagate them to the production account using the CI/CD pipeline?
How can I version control and track changes to my AWS Glue Studio visual jobs?

End-to-end development lifecycle for data integration pipeline

The software development lifecycle on AWS has six phases: plan, design, implement, test, deploy, and maintain, as shown in the following diagram.

SDLC

For more information regarding each component, check out End-to-end development lifecycle for data engineers to build a data integration pipeline using AWS Glue.

AWS Glue Resource Sync Utility

Synchronizing AWS Glue visual jobs across different environments typically involves the following requirements:

Manage version control of visual DAGs by tracking changes to AWS Glue Studio visual jobs using version control systems such as Git
Promote AWS Glue visual jobs from a pre-production environment to a production environment
Transfer ownership of AWS Glue visual jobs between different AWS accounts
Replicate AWS Glue visual jobs from one AWS Region to another as part of a disaster recovery strategy

The AWS Glue Resource Sync Utility is a Python application developed on top of the AWS Glue Visual Job API, designed to synchronize AWS Glue Studio visual jobs across different accounts without losing the visual representation. It operates by using source and target AWS environment profiles. Optionally, a list of jobs for synchronization can be provided along with a mapping file to replace environment-specific resources.

For more information on the AWS Glue Resource Sync Utility, refer to Synchronize your AWS Glue Studio Visual Jobs to different environments.

Solution overview

As shown in the following diagram, this solution uses three separate AWS accounts. One account is designated for the development environment, another for the production environment, and a third to host the CI/CD infrastructure and pipeline.

Solution Overview

The solution emphasizes version controlling AWS Glue Studio visual jobs by serializing them into JSON files and storing them in a Git repository. As a result, you can:

Track changes to your visual DAGs over time.
Collaborate with team members.
Restore and deploy visual DAGs in different environments seamlessly.
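To make the version-control benefit concrete, the following minimal sketch shows how two revisions of a serialized job could be compared once they are plain JSON. It assumes a job definition is available as a Python dictionary; the keys in the sample data (`Name`, `Nodes`) are illustrative placeholders and do not reflect the actual format written by the sync utility.

```python
import difflib
import json


def to_stable_json(definition: dict) -> str:
    """Serialize with sorted keys and fixed indentation so diffs stay small."""
    return json.dumps(definition, indent=2, sort_keys=True)


def diff_revisions(old: dict, new: dict) -> list[str]:
    """Return unified-diff lines between two revisions of a definition."""
    old_lines = to_stable_json(old).splitlines()
    new_lines = to_stable_json(new).splitlines()
    return list(difflib.unified_diff(old_lines, new_lines, "old", "new", lineterm=""))


# Hypothetical job definitions; the keys are placeholders, not the real schema.
revision_1 = {"Name": "orders-etl", "Nodes": ["source", "sink"]}
revision_2 = {"Name": "orders-etl", "Nodes": ["source", "filter", "sink"]}

changes = diff_revisions(revision_1, revision_2)
print("\n".join(changes))
```

In practice Git produces this kind of diff for you on every commit; the point of the sketch is only that a deterministic JSON layout keeps those diffs readable.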

The AWS account that hosts the CI/CD pipeline provides three key components:

Managing AWS Glue Job updates – Provides smooth updates and maintenance of AWS Glue jobs.
Cross-Account Access Management – Enables secure promotion of updates from the development environment to the production environment.
Version Control Integration – Incorporates serialized visual DAGs into the CI/CD pipeline for deployment to target environments.

You can create AWS Glue Studio visual jobs using the visual editor in your development account. After these jobs are configured, you can serialize the visual DAGs into JSON files and commit them to a Git repository. The CI/CD pipeline detects changes to the repository and automatically triggers the deployment process.

The pipeline includes a step where the AWS Glue Resource Sync Utility deserializes the visual DAGs from the committed JSON files and deploys them to the production environment. This approach promotes consistent deployment of jobs while maintaining their visual representation.

The solution uses the AWS Glue Visual Job API, AWS Glue Resource Sync Utility, and AWS CDK to streamline deployment across environments. It enables seamless synchronization and consistent versioning of AWS Glue jobs between development and production, preserving visual workflows and reducing manual tasks. The solution consists of two main parts:

Initial steps (one-time setup) – Setting up the development environment, bootstrapping AWS environments, deploying the CI/CD pipeline, and integrating the AWS Glue Resource Sync Utility
Day-to-day development (repeated) – Ongoing activities such as creating visual jobs, serializing them, committing changes to the repository, deploying to production through the pipeline, and verifying the jobs

The solution follows these high-level steps for the initial setup:

Set up the development environment
Bootstrap your AWS environments
Deploy the CI/CD pipeline
Configure AWS developer tools connection to GitHub
Integrate the CI/CD pipeline with the AWS Glue Resource Sync Utility

The solution follows these high-level steps for the day-to-day development:

Create visual jobs in the development account
Serialize visual jobs
Commit changes to Git repository
Deploy visual jobs to production
Verify visual jobs in production

Prerequisites

Before you begin, make sure you have the following:

GitHub account
Git (git command)
Python 3.9 or later
Package installer for Python (pip command)
AWS CDK Toolkit (cdk command) 2.155.0 or later
AWS CLI configured with appropriate credentials for your accounts
Three AWS accounts:
- Development account
- Production account
- Pipeline account (for hosting the CI/CD pipeline)
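If you want to check the local tooling before you start, a small preflight script along these lines may help. This is a sketch only: it checks that the commands are on your PATH and that the interpreter is recent enough, and it does not verify the AWS CDK Toolkit version or your AWS CLI credentials.

```python
import shutil
import sys

# Commands named in the prerequisites above.
REQUIRED_COMMANDS = ["git", "pip", "cdk", "aws"]
MINIMUM_PYTHON = (3, 9)


def missing_commands(commands, which=shutil.which):
    """Return the commands that cannot be found on PATH."""
    return [name for name in commands if which(name) is None]


def python_is_supported(version_info=sys.version_info, minimum=MINIMUM_PYTHON):
    """True when the given interpreter version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    missing = missing_commands(REQUIRED_COMMANDS)
    if missing:
        print("Missing commands:", ", ".join(missing))
    if not python_is_supported():
        print("Python 3.9 or later is required")
    if not missing and python_is_supported():
        print("Local tooling looks ready")
```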

Technical solution walkthrough

This section provides a detailed guide to setting up and using an automated CI/CD pipeline for AWS Glue Studio visual jobs.

Initial steps (one-time setup)

In this section, we walk through the foundational steps required to establish the CI/CD pipeline for AWS Glue Studio visual jobs. These initial steps set up the necessary infrastructure and configurations, providing a smooth and automated deployment process across your development and production environments.

Set up the development environment

To set up the development environment, follow these steps:

Fork the aws-glue-cdk-baseline repository
Clone the forked repository:

git clone https://github.com/<YOUR-GITHUB-USERNAME>/aws-glue-cdk-baseline.git

cd aws-glue-cdk-baseline

Create and activate a Python virtual environment:

python3 -m venv .venv

# On Windows, use .venv\Scripts\activate.bat
source .venv/bin/activate

Install required dependencies:

pip install -r requirements.txt

pip install -r requirements-dev.txt

To configure the default settings, edit the default-config.yaml file and replace the placeholders with your AWS account details:
Pipeline account: awsAccountId and awsRegion.
Development account: awsAccountId and awsRegion.
Production account: awsAccountId and awsRegion.
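A quick sanity check on these values can catch typos before anything is deployed. The following sketch assumes the configuration has already been loaded into a dictionary (for example with yaml.safe_load, as the scripts later in this post do) and uses the pipelineAccount, devAccount, and prodAccount keys that the pipeline code reads. The validation rules themselves are assumptions for illustration, not part of the baseline repository.

```python
import re

ACCOUNT_KEYS = ("pipelineAccount", "devAccount", "prodAccount")


def validate_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the values look plausible."""
    problems = []
    for key in ACCOUNT_KEYS:
        account = config.get(key) or {}
        # YAML may parse an unquoted account ID as a number, so convert first.
        account_id = str(account.get("awsAccountId", ""))
        region = str(account.get("awsRegion", ""))
        if not re.fullmatch(r"\d{12}", account_id):
            problems.append(f"{key}.awsAccountId must be a 12-digit account ID")
        if not re.fullmatch(r"[a-z]{2}(-[a-z]+)+-\d", region):
            problems.append(f"{key}.awsRegion does not look like a Region code")
    return problems


# Placeholder values for illustration only.
sample = {key: {"awsAccountId": 123456789012, "awsRegion": "us-east-1"} for key in ACCOUNT_KEYS}
print(validate_config(sample) or "config looks plausible")
```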

Bootstrap your AWS environments

Bootstrapping prepares your AWS accounts for AWS CDK deployments. To bootstrap your AWS environments, run the following commands, replacing placeholders with your account numbers, Regions, and AWS CLI profiles:

# Bootstrap the pipeline account
cdk bootstrap aws://<PIPELINE-ACCOUNT-NUMBER>/<REGION> --profile <PIPELINE-PROFILE>

# Bootstrap the development account, trusting the pipeline account
cdk bootstrap aws://<DEV-ACCOUNT-NUMBER>/<REGION> --profile <DEV-PROFILE> --trust <PIPELINE-ACCOUNT-NUMBER>

# Bootstrap the production account, trusting the pipeline account
cdk bootstrap aws://<PROD-ACCOUNT-NUMBER>/<REGION> --profile <PROD-PROFILE> --trust <PIPELINE-ACCOUNT-NUMBER>
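If you prefer to derive these commands from default-config.yaml instead of typing them, a helper like the following could print them for review. It is a sketch that assumes the configuration is already loaded into a dictionary; the profile names are hypothetical and should match your own AWS CLI profiles.

```python
def bootstrap_commands(config: dict, profiles: dict) -> list[str]:
    """Build the three cdk bootstrap commands shown above from config values."""
    pipeline = config["pipelineAccount"]
    commands = [
        f"cdk bootstrap aws://{pipeline['awsAccountId']}/{pipeline['awsRegion']} "
        f"--profile {profiles['pipelineAccount']}"
    ]
    # Development and production accounts must trust the pipeline account.
    for key in ("devAccount", "prodAccount"):
        account = config[key]
        commands.append(
            f"cdk bootstrap aws://{account['awsAccountId']}/{account['awsRegion']} "
            f"--profile {profiles[key]} --trust {pipeline['awsAccountId']}"
        )
    return commands


# Placeholder account IDs and hypothetical profile names.
sample_config = {
    "pipelineAccount": {"awsAccountId": "111111111111", "awsRegion": "us-east-1"},
    "devAccount": {"awsAccountId": "222222222222", "awsRegion": "us-east-1"},
    "prodAccount": {"awsAccountId": "333333333333", "awsRegion": "us-east-1"},
}
sample_profiles = {"pipelineAccount": "pipeline", "devAccount": "dev", "prodAccount": "prod"}

for command in bootstrap_commands(sample_config, sample_profiles):
    print(command)
```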

Deploy the CI/CD pipeline

Deploy the pipeline stack to your pipeline account:

cdk deploy --profile <PIPELINE-PROFILE>

This command creates:

The pipeline stack in the pipeline account
The AWS Glue app stack in the development account

Configure AWS developer tools connection to GitHub

To establish a connection between AWS CodePipeline and your GitHub repository, follow these steps:

Create a GitHub connection
- In the AWS Management Console for your pipeline account, navigate to AWS CodePipeline
- In the navigation pane, choose Connections
- Choose Create connection
- Select GitHub as the source provider
Authorize the connection
- Provide a connection name (such as MyGitHubConnection)
- Choose Connect to GitHub
- Follow the prompts to authorize AWS CodePipeline to access your GitHub account
- Make sure that the connection has access to your forked aws-glue-cdk-baseline repository
Note the connection Amazon Resource Name (ARN)
- After the connection is established, note the connection ARN because you’ll need it when configuring the pipeline

Integrate the CI/CD pipeline with the AWS Glue Resource Sync Utility

To integrate the AWS Glue Resource Sync Utility into the pipeline to automate the synchronization of AWS Glue visual jobs, follow these steps:

Download the sync.py script from the AWS Glue Samples repository:

wget https://raw.githubusercontent.com/aws-samples/aws-glue-samples/master/utilities/resource_sync/sync.py \
-O aws_glue_cdk_baseline/job_scripts/sync.py

Create a new file aws_glue_cdk_baseline/job_scripts/generate_mapping.py with the following content:

import yaml
import json
 
def generate_mapping():
    with open('default-config.yaml', 'r') as config_file:
        config = yaml.safe_load(config_file)
    mapping = {
        f"s3://aws-glue-assets-{config['devAccount']['awsAccountId']}-{config['devAccount']['awsRegion']}": f"s3://aws-glue-assets-{config['prodAccount']['awsAccountId']}-{config['prodAccount']['awsRegion']}",
        f"arn:aws:iam::{config['devAccount']['awsAccountId']}:role/service-role/AWSGlueServiceRole": f"arn:aws:iam::{config['prodAccount']['awsAccountId']}:role/service-role/AWSGlueServiceRole",
        # The development bucket name uses the development Region
        f"s3://dev-glue-data-{config['devAccount']['awsAccountId']}-{config['devAccount']['awsRegion']}": f"s3://prod-glue-data-{config['prodAccount']['awsAccountId']}-{config['prodAccount']['awsRegion']}"
    }
    with open('mapping.json', 'w') as mapping_file:
        json.dump(mapping, mapping_file, indent=2)
 
if __name__ == "__main__":
    generate_mapping()

This script generates a mapping.json file that the sync.py script will use to synchronize the jobs between the development and production environments. The mapping.json file contains the mapping of the development environment assets to the production environment assets:

The s3://aws-glue-assets-* Amazon Simple Storage Service (Amazon S3) bucket contains the AWS Glue Studio visual job definitions
The arn:aws:iam::*:role/service-role/AWSGlueServiceRole AWS Identity and Access Management (IAM) role is used by the AWS Glue Studio jobs to access AWS resources
The s3://dev-glue-data-* and s3://prod-glue-data-* S3 buckets contain scripts and data used by the AWS Glue Studio jobs
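To illustrate what such a mapping does, the following sketch applies it to a job definition by plain substring replacement. The exact replacement logic inside sync.py may differ, and the sample job dictionary is a simplified, hypothetical stand-in for a serialized job.

```python
import json


def apply_mapping(value, mapping: dict):
    """Recursively replace every mapped source string found inside value."""
    if isinstance(value, str):
        for source, target in mapping.items():
            value = value.replace(source, target)
        return value
    if isinstance(value, list):
        return [apply_mapping(item, mapping) for item in value]
    if isinstance(value, dict):
        return {key: apply_mapping(item, mapping) for key, item in value.items()}
    return value  # numbers, booleans, and None pass through unchanged


# Placeholder account IDs; 1111... stands for development, 2222... for production.
mapping = {
    "s3://aws-glue-assets-111111111111-us-east-1": "s3://aws-glue-assets-222222222222-us-east-1",
    "arn:aws:iam::111111111111:role/service-role/AWSGlueServiceRole":
        "arn:aws:iam::222222222222:role/service-role/AWSGlueServiceRole",
}

# Hypothetical, simplified job definition for illustration only.
dev_job = {
    "Role": "arn:aws:iam::111111111111:role/service-role/AWSGlueServiceRole",
    "Command": {"ScriptLocation": "s3://aws-glue-assets-111111111111-us-east-1/scripts/orders.py"},
    "MaxRetries": 0,
}

prod_job = apply_mapping(dev_job, mapping)
print(json.dumps(prod_job, indent=2))
```

The function returns a new structure and leaves the development definition untouched, which makes it easy to compare the two side by side.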

Update the aws_glue_cdk_baseline/pipeline_stack.py file to include a step that deserializes the JSON file and deploys the AWS Glue jobs to the production environment:

from typing import Dict
import aws_cdk as cdk
from aws_cdk import (
    Stack,
    aws_iam as iam
)
from constructs import Construct
from aws_cdk.pipelines import CodePipeline, CodePipelineSource, CodeBuildStep
from aws_glue_cdk_baseline.glue_app_stage import GlueAppStage
 
GITHUB_REPO = "YOUR-GITHUB-USERNAME/aws-glue-cdk-baseline"
GITHUB_BRANCH = "main"
GITHUB_CONNECTION_ARN = "YOUR-GITHUB-CONNECTION-ARN"
 
class PipelineStack(Stack):
 
    def __init__(self, scope: Construct, construct_id: str, config: Dict, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
 
        source = CodePipelineSource.connection(
            GITHUB_REPO,
            GITHUB_BRANCH,
            connection_arn=GITHUB_CONNECTION_ARN
        )
 
        pipeline = CodePipeline(self, "GluePipeline",
            pipeline_name="GluePipeline",
            cross_account_keys=True,
            docker_enabled_for_synth=True,
            synth=CodeBuildStep("CdkSynth",
                input=source,
                install_commands=[
                    "pip install -r requirements.txt",
                    "pip install -r requirements-dev.txt",
                    "npm install -g aws-cdk",
                ],
                commands=[
                    "cdk synth",
                ]
            )
        )
 
        # Add development stage
        dev_stage = GlueAppStage(self, "DevStage", config=config, stage="dev", 
            env=cdk.Environment(
                account=str(config['devAccount']['awsAccountId']),
                region=config['devAccount']['awsRegion']
            ))
        pipeline.add_stage(dev_stage)

        # Add production stage
        prod_stage = GlueAppStage(self, "ProdStage", config=config, stage="prod", 
            env=cdk.Environment(
                account=str(config['prodAccount']['awsAccountId']),
                region=config['prodAccount']['awsRegion']
            ))
        pipeline.add_stage(prod_stage)
 
        # Glue Resource Sync as a separate step in the pipeline
        pipeline.add_wave("GlueJobSync").add_post(CodeBuildStep("GlueJobSync",
            input=source,
            commands=[
                "python $(pwd)/aws_glue_cdk_baseline/job_scripts/generate_mapping.py",
                "python aws_glue_cdk_baseline/job_scripts/sync.py "
                   "--dst-role-arn arn:aws:iam::{0}:role/GlueCrossAccountRole-prod "
                   "--dst-region {1} "
                   "--deserialize-from-file aws_glue_cdk_baseline/resources/resources.json "
                   "--config-path mapping.json "
                   "--targets job,catalog "
                   "--skip-prompt".format(
                       config['prodAccount']['awsAccountId'],
                       config['prodAccount']['awsRegion']
                   ),
            ],
            role_policy_statements=[
                iam.PolicyStatement(
                    actions=[
                        "sts:AssumeRole",
                    ],
                    resources=["*"]
                )
            ]
        ))
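The sync command in the GlueJobSync step is built from adjacent string literals followed by a single .format() call, which can be hard to read. As a sketch, the same command could be assembled from a list of parts. The helper names below are hypothetical and not part of the baseline repository.

```python
def prod_sync_role_arn(account_id) -> str:
    """ARN of the cross-account role that the sync step assumes in production."""
    return f"arn:aws:iam::{account_id}:role/GlueCrossAccountRole-prod"


def build_sync_command(account_id, region: str) -> str:
    """Assemble the same sync.py invocation used in the GlueJobSync step."""
    parts = [
        "python aws_glue_cdk_baseline/job_scripts/sync.py",
        f"--dst-role-arn {prod_sync_role_arn(account_id)}",
        f"--dst-region {region}",
        "--deserialize-from-file aws_glue_cdk_baseline/resources/resources.json",
        "--config-path mapping.json",
        "--targets job,catalog",
        "--skip-prompt",
    ]
    return " ".join(parts)


# Placeholder account ID and Region.
print(build_sync_command("222222222222", "us-east-1"))
```

The same ARN helper could also be used to narrow the sts:AssumeRole statement from "*" to the one role the step needs, if you want the CodeBuild role to be more tightly scoped.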

Replace the placeholders in the pipeline_stack.py file with your values:

GITHUB_REPO with the name of your GitHub repository
GITHUB_BRANCH with the name of the branch you want to use for the pipeline
GITHUB_CONNECTION_ARN with the ARN of the GitHub connection you created in the Configure AWS developer tools connection to GitHub step

Update the aws_glue_cdk_baseline/glue_app_stack.py file to create a cross-account role with the necessary permissions to access the development environment resources:

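    # In the stack's __init__ (aws_glue_cdk_baseline/glue_app_stack.py):
    # create the role that the pipeline account is allowed to assume.
    self.cross_account_role = self.create_cross_account_role(
        f"GlueCrossAccountRole-{stage}",
        str(config['pipelineAccount']['awsAccountId'])
    )
 
    # Helper method on the same stack class.
    def create_cross_account_role(self, role_name: str, trusted_account_id: str):
        return iam.Role(self, f"{role_name}CrossAccountRole",
            role_name=role_name,
            assumed_by=iam.AccountPrincipal(trusted_account_id),
            managed_policies=[iam.ManagedPolicy.from_aws_managed_policy_name("AdministratorAccess")]
        )
 
    # Property on the stack class that exposes the role ARN.
    @property
    def cross_account_role_arn(self):
        return self.cross_account_role.role_arn

    # Counterpart property on the class that holds glue_app_stack
    # (presumably GlueAppStage); it forwards the ARN from the stack.
    @property
    def cross_account_role_arn(self):
        return self.glue_app_stack.cross_account_role_arn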
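For readers less familiar with cross-account roles, the trust relationship that iam.AccountPrincipal sets up is roughly equivalent to the policy document built in the following sketch. It is an approximation for the standard aws partition, not the exact template that AWS CDK synthesizes, and the account ID is a placeholder.

```python
import json


def trust_policy_for_account(trusted_account_id: str) -> dict:
    """Trust policy that lets principals in the given account assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Placeholder pipeline account ID.
policy = trust_policy_for_account("111111111111")
print(json.dumps(policy, indent=2))
```

Trusting the account root means any principal in the pipeline account that has sts:AssumeRole permission on this role can assume it, which is why the GlueJobSync step is granted that permission.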

Check the andreimaksimov/aws-glue-cdk-baseline repository for a complete diff.

Commit your changes to the repository:

git add aws_glue_cdk_baseline/job_scripts/sync.py
git add aws_glue_cdk_baseline/job_scripts/generate_mapping.py
git add aws_glue_cdk_baseline/pipeline_stack.py
git add aws_glue_cdk_baseline/glue_app_stack.py

git commit -m "Integrate Glue Resource Sync Utility into the pipeline"

git push

Day-to-day development (repeated)

With the initial setup complete, you can now proceed with your regular development activities. This section outlines the steps you’ll repeat during your day-to-day work to develop, version control, and deploy AWS Glue visual jobs.

Create visual jobs in the development account

In this step, you’ll use AWS Glue Studio to create and configure your visual jobs within the development environment.

In your development account, open the AWS Glue Studio console
To create a new visual job, choose Create job
Choose Visual with a blank canvas and use the visual editor to design your ETL job
Configure the job settings:
- Job name: Provide a meaningful name
- IAM role: Select an IAM role with necessary permissions
- Other configurations: Adjust as needed
To save the job, choose Save

Repeat these steps to create additional jobs as required.

Serialize visual jobs

To serialize your visual jobs to enable version control and preparation for deployment, follow these steps:

Run the AWS Glue Resource Sync Utility from the repository root:

python aws_glue_cdk_baseline/job_scripts/sync.py \
  --src-role-arn arn:aws:iam::<DEV-ACCOUNT-NUMBER>:role/GlueCrossAccountRole-dev \
  --src-region <DEV-REGION> \
  --serialize-to-file aws_glue_cdk_baseline/resources/resources.json \
  --targets job,catalog \
  --skip-prompt

Replace <DEV-ACCOUNT-NUMBER> with your development account number
Replace <DEV-REGION> with your development Region (for example, us-east-1)
Verify the serialized file:
- Locate JSON in aws_glue_cdk_baseline/resources/
- Make sure it contains the definitions of your visual jobs
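A short script can handle the mechanical part of this check before you commit. The sketch below only confirms that the file is valid, non-empty JSON and reports its top-level shape. It makes no assumption about the internal schema that sync.py writes, so you still need to inspect the content for your job definitions.

```python
import json
from pathlib import Path


def summarize_serialized_file(path) -> dict:
    """Load a serialized resources file and report its size and top-level shape."""
    text = Path(path).read_text(encoding="utf-8")
    data = json.loads(text)  # raises json.JSONDecodeError if the file is not valid JSON
    if not data:
        raise ValueError(f"{path} parsed successfully but is empty")
    top_level = sorted(data) if isinstance(data, dict) else None
    return {
        "bytes": len(text.encode("utf-8")),
        "type": type(data).__name__,
        "top_level_keys": top_level,
    }


if __name__ == "__main__":
    target = Path("aws_glue_cdk_baseline/resources/resources.json")
    if target.exists():
        print(summarize_serialized_file(target))
    else:
        print(f"{target} not found; run the serialization step first")
```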

Commit changes to Git repository

To commit changes to the Git repository, follow these steps:

Add the serialized resources to Git:

git add aws_glue_cdk_baseline/resources/resources.json

Commit your changes:

git commit -m "Add serialized Glue Visual Jobs"

Push to GitHub:

git push

This action triggers the CI/CD pipeline.

Deploy visual jobs to production

The CI/CD pipeline automatically performs the following steps:

Synthesize the AWS CDK application
Deploy to the development environment
Deploy to the production environment
Execute the AWS Glue Resource Sync Utility

The following screenshot shows the CI/CD pipeline.

CICD Pipeline

Verify visual jobs in production

After the pipeline has completed the deployment, it’s important to verify that the visual jobs are correctly reflected in the production environment. To do so, follow these steps:

In the production account, open the AWS Glue Studio console
Verify the deployed jobs:
- Make sure that the visual jobs are present
- Open each job to confirm that the visual DAGs are preserved
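The presence check can also be scripted. The following sketch works with a boto3 AWS Glue client (or any object exposing a compatible get_jobs method) and assumes that visual jobs carry a non-empty CodeGenConfigurationNodes field in the get_jobs response. The job names are hypothetical, and the script does not replace opening a job in the visual editor to confirm the DAG renders.

```python
def list_all_jobs(glue_client) -> list:
    """Collect every job returned by get_jobs, following NextToken pagination."""
    jobs, token = [], None
    while True:
        kwargs = {"NextToken": token} if token else {}
        page = glue_client.get_jobs(**kwargs)
        jobs.extend(page.get("Jobs", []))
        token = page.get("NextToken")
        if not token:
            return jobs


def check_visual_jobs(glue_client, expected_names) -> dict:
    """Sort expected job names into ok, no_visual_dag, and missing."""
    by_name = {job["Name"]: job for job in list_all_jobs(glue_client)}
    report = {"ok": [], "no_visual_dag": [], "missing": []}
    for name in expected_names:
        job = by_name.get(name)
        if job is None:
            report["missing"].append(name)
        elif job.get("CodeGenConfigurationNodes"):
            report["ok"].append(name)
        else:
            report["no_visual_dag"].append(name)
    return report


# Example usage (requires boto3 and production credentials; names are placeholders):
#   import boto3
#   glue = boto3.Session(profile_name="<PROD-PROFILE>").client("glue", region_name="<REGION>")
#   print(check_visual_jobs(glue, ["orders-etl", "customers-etl"]))
```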

By following these steps in your day-to-day workflow, you make sure that your AWS Glue visual jobs are version-controlled, consistent across environments, and that your production environment reflects the latest tested changes.

Version control for AWS Glue visual jobs

By serializing AWS Glue Studio visual jobs to JSON files and committing them to a Git repository, you enable version control for your data integration workflows. By following this approach you can:

Track changes – Monitor modifications to your AWS Glue jobs over time
Collaborate – Work with team members on developing and refining jobs
Restore and deploy – Easily restore jobs in other accounts or environments

The serialization and deserialization steps are integral to your development and deployment process, making sure that all changes are captured and seamlessly propagated.

Conclusion

By combining the AWS Glue Visual Job API, AWS Glue Resource Sync Utility, and an AWS CDK based CI/CD pipeline, we’ve crafted a comprehensive solution for managing AWS Glue Studio visual jobs across different environments. This integrated approach offers several benefits:

Version control integration – Manage and track changes to your AWS Glue visual jobs using Git, enabling collaboration and change tracking
Streamlined development – Easily develop and test AWS Glue jobs using the visual editor in the development environment
Automated deployment – Use a CI/CD pipeline to automatically deploy serialized visual DAGs to the production environment
Environment consistency – Promote consistency across development and production environments by using the same job definitions
Visual representation preservation – Maintain the visual DAG representation when synchronizing jobs between environments

This solution empowers data engineers to focus on building robust data integration pipelines while automating the complexities of managing and deploying AWS Glue Studio visual jobs across multiple environments.

We encourage you to try this solution and adapt it to your needs. As always, we welcome your feedback and suggestions for further improvements.

About the Authors

Andrei Maksimov is an AWS Senior Cloud Infrastructure Architect specializing in cloud infrastructure, software development, and DevOps. He designs and implements scalable, secure, and efficient cloud solutions and helps customers optimize their cloud environments. Outside of work, Andrei enjoys participating in hackathons, contributing to open source projects, and exploring the latest advancements in AI. You can connect with him on LinkedIn.

David Zhang is an AWS Data Architect specializing in designing and implementing analytics infrastructure, data management, ETL, and extensive data systems. He helps customers modernize their AWS data platforms. David is also an active speaker at AWS conferences and a contributor to technical content and open source initiatives. He enjoys volleyball, tennis, and weightlifting in his free time. Feel free to connect with him on LinkedIn.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for designing AWS features, implementing software artifacts, and helping with customer architectures. In his spare time, he enjoys watching anime on Prime Video. You can connect with him on LinkedIn.

AWS Big Data Blog