AWS Big Data Blog
Streamlining AWS Glue Studio visual jobs: Building an integrated CI/CD pipeline for seamless environment synchronization
Many Amazon Web Services (AWS) customers have integrated their data across multiple sources using AWS Glue, a serverless data integration service. By providing seamless integration throughout the development lifecycle, AWS Glue enables organizations to make data-driven business decisions.
AWS Glue Studio visual jobs provide a graphical interface called the visual editor that you can use to author extract, transform, and load (ETL) jobs in AWS Glue visually. The visual editor maintains a visual representation of a variety of data sources, transformations, and data sinks. With its intuitive interface, you can create large-scale data integration jobs without coding expertise, which simplifies workflows and removes the need for manual ETL script programming.
As data engineers increasingly rely on the AWS Glue Studio visual editor to create data integration jobs, the need for a streamlined development lifecycle and seamless synchronization between environments has become paramount. Additionally, managing versions of visual directed acyclic graphs (DAGs) is crucial for tracking changes, collaboration, and maintaining consistency across environments.
This post introduces an end-to-end solution that addresses these needs by combining the AWS Glue Visual Job API, a custom AWS Glue Resource Sync Utility, and an AWS Cloud Development Kit (AWS CDK) based continuous integration and continuous deployment (CI/CD) pipeline.
A few common questions from our customers include:
- What are the best practices for moving our workloads from a pre-production environment to production?
- What are the recommended best practices for provisioning data integration components?
- How can I build AWS Glue visual jobs in the development environment and automatically propagate them to the production account using the CI/CD pipeline?
- How can I version control and track changes to my AWS Glue Studio visual jobs?
End-to-end development lifecycle for data integration pipeline
The software development lifecycle on AWS has six phases: plan, design, implement, test, deploy, and maintain, as shown in the following diagram.
For more information regarding each component, check out End-to-end development lifecycle for data engineers to build a data integration pipeline using AWS Glue.
AWS Glue Resource Sync Utility
As part of synchronizing AWS Glue visual jobs across different environments, requirements include:
- Manage version control of visual DAGs by tracking changes to AWS Glue Studio visual jobs using version control systems such as Git
- Promote AWS Glue visual jobs from a pre-production environment to a production environment
- Transfer ownership of AWS Glue visual jobs between different AWS accounts
- Replicate AWS Glue visual jobs from one AWS Region to another as part of a disaster recovery strategy
The AWS Glue Resource Sync Utility is a Python application developed on top of the AWS Glue Visual Job API, designed to synchronize AWS Glue Studio visual jobs across different accounts without losing the visual representation. It operates by using source and target AWS environment profiles. Optionally, a list of jobs for synchronization can be provided along with a mapping file to replace environment-specific resources.
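To make the role of the mapping file concrete, the following is a minimal sketch of how environment-specific strings in a job definition could be rewritten. It is not the utility's own code: the function name apply_mapping, the sample job fields, and the account IDs are all illustrative assumptions, and the real utility may apply the mapping differently.

```python
def apply_mapping(value, mapping):
    """Return a copy of value with every mapped source string replaced.

    Walks dicts and lists recursively and rewrites string leaves only.
    """
    if isinstance(value, dict):
        return {key: apply_mapping(item, mapping) for key, item in value.items()}
    if isinstance(value, list):
        return [apply_mapping(item, mapping) for item in value]
    if isinstance(value, str):
        # Longer sources first, so a short prefix never shadows a longer match.
        for source in sorted(mapping, key=len, reverse=True):
            value = value.replace(source, mapping[source])
    return value


# Hypothetical development job definition and mapping (placeholder account IDs).
dev_job = {
    "Name": "orders-etl",
    "Role": "arn:aws:iam::111111111111:role/service-role/AWSGlueServiceRole",
    "Command": {"ScriptLocation": "s3://dev-glue-data-111111111111/scripts/orders.py"},
}
mapping = {
    "arn:aws:iam::111111111111:role": "arn:aws:iam::222222222222:role",
    "s3://dev-glue-data-111111111111": "s3://prod-glue-data-222222222222",
}

prod_job = apply_mapping(dev_job, mapping)
print(prod_job["Role"])
print(prod_job["Command"]["ScriptLocation"])
```

Because the function returns a new structure, the development definition stays untouched and can still be committed as-is.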
For more information on the AWS Glue Resource Sync Utility, refer to Synchronize your AWS Glue Studio Visual Jobs to different environments.
Solution overview
As shown in the following diagram, this solution uses three separate AWS accounts. One account is designated for the development environment, another for the production environment, and a third to host the CI/CD infrastructure and pipeline.
The solution emphasizes version controlling AWS Glue Studio visual jobs by serializing them into JSON files and storing them in a Git repository. As a result, you can:
- Track changes to your visual DAGs over time.
- Collaborate with team members.
- Restore and deploy visual DAGs in different environments seamlessly.
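As an illustration of why JSON files work well for change tracking, the sketch below writes a job definition with sorted keys and compares two versions line by line. The helper names and the sample job fields are assumptions made for this example; the layout of the file that the AWS Glue Resource Sync Utility produces may differ.

```python
import difflib
import json


def to_stable_json(job_definition):
    """Serialize with sorted keys and fixed indentation so diffs stay small."""
    return json.dumps(job_definition, indent=2, sort_keys=True) + "\n"


def diff_job_versions(old, new):
    """Return unified-diff lines between two versions of a job definition."""
    return list(
        difflib.unified_diff(
            to_stable_json(old).splitlines(),
            to_stable_json(new).splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",
        )
    )


# Hypothetical versions of the same job: only the worker count changed.
previous = {"Name": "orders-etl", "NumberOfWorkers": 2}
current = {"NumberOfWorkers": 10, "Name": "orders-etl"}

for line in diff_job_versions(previous, current):
    print(line)
```

Sorting keys means that a change in key order alone produces no diff, so a Git history of such files shows only real edits to the job.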
The AWS account responsible for hosting the CI/CD pipeline is composed of three key components:
- Managing AWS Glue Job updates – Provides smooth updates and maintenance of AWS Glue jobs.
- Cross-Account Access Management – Enables secure promotion of updates from the development environment to the production environment.
- Version Control Integration – Incorporates serialized visual DAGs into the CI/CD pipeline for deployment to target environments.
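Cross-account promotion of this kind usually relies on an IAM role in the target account whose trust policy names the pipeline account. The sketch below builds such a trust policy document. The function name is hypothetical and the account ID is a placeholder; the roles that this solution creates are defined in the repository's AWS CDK stacks and may be scoped differently.

```python
import json


def cross_account_trust_policy(trusted_account_id):
    """Build an IAM trust policy that lets principals in one account assume a role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Placeholder pipeline account ID.
policy = cross_account_trust_policy("111111111111")
print(json.dumps(policy, indent=2))
```

Trusting the account root delegates the decision to IAM policies in the pipeline account; in practice you would typically narrow the principal to the specific pipeline role.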
You can create AWS Glue Studio visual jobs using the intuitive visual editor in your development account. After these jobs are configured, you can serialize the visual DAGs into JSON files and commit them to a Git repository. The CI/CD pipeline detects changes to the repository and automatically triggers the deployment process.
The pipeline includes a step where the AWS Glue Resource Sync Utility deserializes the visual DAGs from the committed JSON files and deploys them to the production environment. This approach promotes consistent deployment of jobs while maintaining their visual representation.
The solution uses the AWS Glue Visual Job API, AWS Glue Resource Sync Utility, and AWS CDK to streamline deployment across environments. It enables seamless synchronization and consistent versioning of AWS Glue jobs between development and production, preserving visual workflows and reducing manual tasks. The solution consists of two main parts:
- Initial steps (one-time setup) – Setting up the development environment, bootstrapping AWS environments, deploying the CI/CD pipeline, and integrating the AWS Glue Resource Sync Utility
- Day-to-day development (repeated) – Ongoing activities such as creating visual jobs, serializing them, committing changes to the repository, deploying to production through the pipeline, and verifying the jobs
The solution follows these high-level steps for the initial setup:
- Set up the development environment
- Bootstrap your AWS environments
- Deploy the CI/CD pipeline
- Configure AWS developer tools connection to GitHub
- Integrate the CI/CD pipeline with the AWS Glue Resource Sync Utility
The solution follows these high-level steps for the day-to-day development:
- Create visual jobs in the development account
- Serialize visual jobs
- Commit changes to Git repository
- Deploy visual jobs to production
- Verify visual jobs in production
Prerequisites
Before you begin, make sure you have the following:
- GitHub account
- Git (git command)
- Python 3.9 or later
- Package installer for Python (pip command)
- AWS CDK Toolkit (cdk command) 2.155.0 or later
- AWS CLI configured with appropriate credentials for your accounts
- Three AWS accounts:
- Development account
- Production account
- Pipeline account (for hosting the CI/CD pipeline)
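The command-line prerequisites above can be checked with a short script such as the following sketch. It only looks for the commands on your PATH and checks the Python version. The helper names are hypothetical, and the assumption that cdk --version prints a string beginning with the version number (for example, 2.155.0 followed by a build identifier) should be confirmed on your machine.

```python
import shutil
import sys

REQUIRED_COMMANDS = ("git", "pip", "cdk", "aws")
MIN_PYTHON = (3, 9)
MIN_CDK = (2, 155, 0)


def parse_version(text):
    """Parse the leading 'X.Y.Z' of a version string such as '2.155.0 (build abc1234)'."""
    return tuple(int(part) for part in text.split()[0].split("."))


def check_prerequisites():
    """Report which required commands are on PATH and whether Python is recent enough."""
    results = {command: shutil.which(command) is not None for command in REQUIRED_COMMANDS}
    results["python"] = sys.version_info >= MIN_PYTHON
    return results


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{'OK     ' if ok else 'MISSING'} {name}")
    # Compare a version string you obtained from `cdk --version` (assumed format).
    print(parse_version("2.155.0 (build abc1234)") >= MIN_CDK)
```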
Technical solution walkthrough
This section provides a detailed guide to setting up and using an automated CI/CD pipeline for AWS Glue Studio visual jobs.
Initial steps (one-time setup)
In this section, we walk through the foundational steps required to establish the CI/CD pipeline for AWS Glue Studio visual jobs. These initial steps set up the necessary infrastructure and configurations, providing a smooth and automated deployment process across your development and production environments.
Set up the development environment
To set up the development environment, follow these steps:
- Fork the aws-glue-cdk-baseline repository
- Clone the forked repository:
- Create and activate a Python virtual environment:
- Install required dependencies:
- To configure the default settings, edit the default-config.yaml file and replace the placeholders with your AWS account details:
- Pipeline account: awsAccountId and awsRegion.
- Development account: awsAccountId and awsRegion.
- Production account: awsAccountId and awsRegion.
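Before moving on, it can help to confirm that no placeholder was left behind in default-config.yaml. The sketch below validates an already-loaded configuration. Only the awsAccountId and awsRegion keys come from the file described above; the section names (pipelineAccount, devAccount, prodAccount), the function name, and the Region pattern are assumptions for illustration.

```python
import re

ACCOUNT_ID = re.compile(r"^\d{12}$")
# Loose shape check for Region names such as us-east-1 or ap-southeast-2.
REGION = re.compile(r"^[a-z]{2}(?:-[a-z]+)+-\d+$")


def validate_config(config):
    """Return a list of human-readable problems found in the configuration."""
    errors = []
    for section, values in config.items():
        account_id = values.get("awsAccountId")
        region = values.get("awsRegion")
        # Account IDs are expected as strings so that leading zeros survive.
        if not isinstance(account_id, str) or not ACCOUNT_ID.match(account_id):
            errors.append(f"{section}: awsAccountId must be a 12-digit string, got {account_id!r}")
        if not isinstance(region, str) or not REGION.match(region):
            errors.append(f"{section}: awsRegion looks invalid, got {region!r}")
    return errors


# Hypothetical configuration with one placeholder that was never replaced.
config = {
    "pipelineAccount": {"awsAccountId": "111111111111", "awsRegion": "us-east-1"},
    "devAccount": {"awsAccountId": "<DEV-ACCOUNT-NUMBER>", "awsRegion": "us-east-1"},
    "prodAccount": {"awsAccountId": "333333333333", "awsRegion": "us-east-1"},
}

for problem in validate_config(config):
    print(problem)
```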
Bootstrap your AWS environments
Bootstrapping prepares your AWS accounts for AWS CDK deployments. To bootstrap your AWS environments, run the following commands, replacing placeholders with your account numbers, Regions, and AWS CLI profiles:
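For cross-account AWS CDK pipelines, bootstrap commands typically take the shape assembled by the following sketch, in which the target accounts trust the pipeline account. The sketch only builds and prints the commands; it does not run them. Account numbers, Regions, and profile names are placeholders, the helper name is hypothetical, and the AdministratorAccess execution policy is an assumption that you may want to narrow. Treat the repository's README as the authority for the exact commands.

```python
import shlex

ADMIN_POLICY = "arn:aws:iam::aws:policy/AdministratorAccess"


def bootstrap_command(account, region, profile, trust_account=None, execution_policy=ADMIN_POLICY):
    """Build a `cdk bootstrap` command line as a list of arguments."""
    command = [
        "cdk", "bootstrap", f"aws://{account}/{region}",
        "--profile", profile,
        "--cloudformation-execution-policies", execution_policy,
    ]
    if trust_account is not None:
        # Lets the pipeline account deploy into this environment.
        command += ["--trust", trust_account]
    return command


# Placeholder accounts and AWS CLI profile names.
PIPELINE = "111111111111"
commands = [
    bootstrap_command(PIPELINE, "us-east-1", "pipeline-profile"),
    bootstrap_command("222222222222", "us-east-1", "dev-profile", trust_account=PIPELINE),
    bootstrap_command("333333333333", "us-east-1", "prod-profile", trust_account=PIPELINE),
]

for command in commands:
    print(shlex.join(command))
```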
Deploy the CI/CD pipeline
Deploy the pipeline stack to your pipeline account:
This command creates:
- The pipeline stack in the pipeline account
- The AWS Glue app stack in the development account
Configure AWS developer tools connection to GitHub
To establish a connection between AWS CodePipeline and your GitHub repository, follow these steps:
- Create a GitHub connection
- In the AWS Management Console for your pipeline account, navigate to AWS CodePipeline
- In the navigation pane, choose Connections
- Choose Create connection
- Select GitHub as the source provider
- Authorize the connection
- Provide a connection name (such as MyGitHubConnection)
- Choose Connect to GitHub
- Follow the prompts to authorize AWS CodePipeline to access your GitHub account
- Make sure that the connection has access to your forked aws-glue-cdk-baseline repository
- Note the connection Amazon Resource Name (ARN)
- After the connection is established, note the Connection ARN because you’ll need it when configuring the pipeline
Integrate the CI/CD pipeline with the AWS Glue Resource Sync Utility
To integrate the AWS Glue Resource Sync Utility into the pipeline to automate the synchronization of AWS Glue visual jobs, follow these steps:
- Download the
sync.py
script from the AWS Glue Samples repository:
- Create a new file
aws_glue_cdk_baseline/job_scripts/generate_mapping.py
with the following content:
This script generates a mapping.json file that the sync.py script will use to synchronize the jobs between the development and production environments. The mapping.json file contains the mapping of the development environment assets to the production environment assets:
- The s3://aws-glue-assets-* Amazon Simple Storage Service (Amazon S3) bucket contains the AWS Glue Studio visual job definitions
- The arn:aws:iam::*:role/service-role/AWSGlueServiceRole AWS Identity and Access Management (IAM) role is used by the AWS Glue Studio jobs to access AWS resources
- The s3://dev-glue-data-* and s3://prod-glue-data-* S3 buckets contain scripts and data used by the AWS Glue Studio jobs
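A mapping of this shape could be produced by a sketch like the following. It is an illustration rather than the repository's generate_mapping.py: the function name is hypothetical, and the assumption that each bucket name ends with the account ID and Region should be checked against the buckets in your own accounts.

```python
import json


def build_mapping(dev_account, dev_region, prod_account, prod_region):
    """Map development resource identifiers to their production counterparts."""
    return {
        # Assets bucket that holds the visual job definitions (assumed suffix).
        f"s3://aws-glue-assets-{dev_account}-{dev_region}":
            f"s3://aws-glue-assets-{prod_account}-{prod_region}",
        # Service role that the jobs run as.
        f"arn:aws:iam::{dev_account}:role/service-role/AWSGlueServiceRole":
            f"arn:aws:iam::{prod_account}:role/service-role/AWSGlueServiceRole",
        # Buckets that hold scripts and data (assumed suffix).
        f"s3://dev-glue-data-{dev_account}-{dev_region}":
            f"s3://prod-glue-data-{prod_account}-{prod_region}",
    }


# Placeholder account IDs and Regions.
mapping = build_mapping("222222222222", "us-east-1", "333333333333", "us-east-1")
print(json.dumps(mapping, indent=2))
```

Generating the file from account IDs and Regions, instead of editing it by hand, keeps the mapping consistent with the values in default-config.yaml.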
- Update the
aws_glue_cdk_baseline/pipeline_stack.py
file to include a step that deserializes the JSON file and deploys the AWS Glue jobs to the production environment:
Replace the placeholders in the pipeline_stack.py file with your values:
- GITHUB_REPO with the name of your GitHub repository
- GITHUB_BRANCH with the name of the branch you want to use for the pipeline
- GITHUB_CONNECTION_ARN with the ARN of the GitHub connection you created when you configured the AWS developer tools connection to GitHub (step 4 of the initial setup)
- Update the
aws_glue_cdk_baseline/glue_app_stack.py
file to create a cross-account role with the necessary permissions to access the development environment resources:
Check the andreimaksimov/aws-glue-cdk-baseline repository for a complete diff.
- Commit your changes to the repository:
Day-to-day development (repeated)
With the initial setup complete, you can now proceed with your regular development activities. This section outlines the steps you’ll repeat during your day-to-day work to develop, version control, and deploy AWS Glue visual jobs.
Create visual jobs in the development account
In this step, you’ll use AWS Glue Studio to create and configure your visual jobs within the development environment.
- In your development account, open AWS Glue Studio
- To create a new visual job, choose Create job
- Choose Visual with a blank canvas and use the visual editor to design your ETL job
- Configure the job settings:
- Job name: Provide a meaningful name
- IAM role: Select an IAM role with necessary permissions
- Other configurations: Adjust as needed
- To save the job, choose Save
Repeat these steps to create additional jobs as required.
Serialize visual jobs
To serialize your visual jobs to enable version control and preparation for deployment, follow these steps:
- Run the AWS Glue Resource Sync Utility:
- Replace <DEV-ACCOUNT-NUMBER> with your development account number
- Replace <DEV-REGION> with your development Region (for example, us-east-1)
- Verify the serialized file:
- Locate the JSON file in aws_glue_cdk_baseline/resources/
- Make sure it contains the definitions of your visual jobs
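The verification step can be partly automated. The following sketch parses every JSON file in a directory and flags files that are malformed or empty. It makes no assumption about the internal layout of the serialized file; the function name is hypothetical, and the demonstration uses a temporary directory in place of aws_glue_cdk_baseline/resources/.

```python
import json
import tempfile
from pathlib import Path


def check_serialized_resources(resources_dir):
    """Map each JSON file name to a problem description, or None if it looks fine."""
    report = {}
    for path in sorted(Path(resources_dir).glob("*.json")):
        try:
            content = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as error:
            report[path.name] = f"invalid JSON: {error.msg}"
            continue
        report[path.name] = None if content else "file parsed but is empty"
    return report


# Demonstration with throwaway files standing in for the resources directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "jobs.json").write_text('{"orders-etl": {"Name": "orders-etl"}}', encoding="utf-8")
    Path(tmp, "empty.json").write_text("{}", encoding="utf-8")
    Path(tmp, "broken.json").write_text("{not json", encoding="utf-8")
    report = check_serialized_resources(tmp)

for name, problem in report.items():
    print(name, "->", problem or "ok")
```

A check like this catches truncated or empty files before they are committed, but you still need to open the file to confirm that the expected jobs are present.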
Commit changes to Git repository
To commit changes to the Git repository, follow these steps:
- Add the serialized resources to Git:
- Commit your changes:
- Push to GitHub:
This action triggers the CI/CD pipeline.
Deploy visual jobs to production
The CI/CD pipeline automatically performs the following steps:
- Synthesize the AWS CDK application
- Deploy to the development environment
- Deploy to the production environment
- Execute the AWS Glue Resource Sync Utility
The following screenshot shows the CI/CD pipeline.
Verify visual jobs in production
After the pipeline has completed the deployment, it’s important to verify that the visual jobs are correctly reflected in the production environment. To do so, follow these steps:
- In the production account, open AWS Glue Studio
- Verify the deployed jobs:
- Make sure that the visual jobs are present
- Open each job to confirm that the visual DAGs are preserved
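The console check above can be complemented with a scripted one. The AWS Glue GetJob response includes a CodeGenConfigurationNodes field for jobs that carry a visual DAG, so a job without it has likely lost its visual representation. The sketch below takes any client object that exposes get_job and is demonstrated with a stub; the function name, the stub, and the job names are assumptions for illustration.

```python
def jobs_missing_visual_dag(glue_client, job_names):
    """Return the names of jobs whose definition has no visual DAG nodes."""
    missing = []
    for name in job_names:
        job = glue_client.get_job(JobName=name)["Job"]
        if not job.get("CodeGenConfigurationNodes"):
            missing.append(name)
    return missing


class StubGlueClient:
    """Offline stand-in for a Glue client, used only for this demonstration."""

    def __init__(self, jobs):
        self._jobs = jobs

    def get_job(self, JobName):
        return {"Job": self._jobs[JobName]}


# Hypothetical jobs: one keeps its DAG, one was deployed as script-only.
stub = StubGlueClient({
    "orders-etl": {"Name": "orders-etl", "CodeGenConfigurationNodes": {"node-1": {}}},
    "legacy-job": {"Name": "legacy-job"},
})
print(jobs_missing_visual_dag(stub, ["orders-etl", "legacy-job"]))

# With real credentials you could pass a boto3 client instead, for example:
#   import boto3
#   client = boto3.Session(profile_name="prod-profile").client("glue")
```

With a real client, a job name that does not exist in the account raises an error instead of being reported as missing, which also tells you that the deployment did not create it.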
By following these steps in your day-to-day workflow, you make sure that your AWS Glue visual jobs are version-controlled, consistent across environments, and that your production environment reflects the latest tested changes.
Version control for AWS Glue visual jobs
By serializing AWS Glue Studio visual jobs to JSON files and committing them to a Git repository, you enable version control for your data integration workflows. By following this approach you can:
- Track Changes – Monitor modifications to your AWS Glue jobs over time
- Collaborate – Work with team members on developing and refining jobs
- Restore and deploy – Easily restore jobs in other accounts or environments
The serialization and deserialization steps are integral to your development and deployment process, making sure that all changes are captured and seamlessly propagated.
Conclusion
By combining the AWS Glue Visual Job API, AWS Glue Resource Sync Utility, and an AWS CDK based CI/CD pipeline, we’ve crafted a comprehensive solution for managing AWS Glue Studio visual jobs across different environments. This integrated approach offers several benefits:
- Version control integration – Manage and track changes to your AWS Glue visual jobs using Git, enabling collaboration and change tracking
- Streamlined development – Easily develop and test AWS Glue jobs using the visual editor in the development environment
- Automated deployment – Use a CI/CD pipeline to automatically deploy serialized visual DAGs to the production environment
- Environment consistency – Promote consistency across development and production environments by using the same job definitions
- Visual representation preservation – Maintain the visual DAG representation when synchronizing jobs between environments
This solution empowers data engineers to focus on building robust data integration pipelines while automating the complexities of managing and deploying AWS Glue Studio visual jobs across multiple environments.
We encourage you to try this solution and adapt it to your needs. As always, we welcome your feedback and suggestions for further improvements.
About the Authors
Andrei Maksimov is an AWS Senior Cloud Infrastructure Architect specializing in cloud infrastructure, software development, and DevOps. He designs and implements scalable, secure, and efficient cloud solutions and helps customers optimize their cloud environments. Outside of work, Andrei enjoys participating in hackathons, contributing to open source projects, and exploring the latest advancements in AI. You can connect with him on LinkedIn.
David Zhang is an AWS Data Architect specializing in designing and implementing analytics infrastructure, data management, ETL, and extensive data systems. He helps customers modernize their AWS data platforms. David is also an active speaker at AWS conferences and a contributor to technical content and open source initiatives. He enjoys playing volleyball, tennis, and weightlifting in his free time. Feel free to connect with him on LinkedIn.
Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for designing AWS features, implementing software artifacts, and helping with customer architectures. In his spare time, he enjoys watching anime on Prime Video. You can connect with him on LinkedIn.