Microsoft Workloads on AWS

Use Azure DevOps to deploy AWS Glue jobs in a CI/CD pipeline

In this blog post, we will walk you through an example using AWS Toolkit for Azure DevOps to deploy your AWS Glue jobs across multiple Amazon Web Services (AWS) accounts to simulate development and production environments.

Introduction

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning (ML), and application development. Since it is serverless, there’s no infrastructure to set up or manage. Many customers use AWS Glue for ETL (Extract, Transform and Load) jobs, but they encounter some challenges with the DevOps tools they are allowed to use or prefer to use. For example, customers want a way to integrate AWS Glue with Microsoft Azure DevOps. In this blog post, we address this challenge and show you how to use AWS Toolkit for Azure DevOps.

This toolkit is a free-to-use extension for Microsoft Azure DevOps Services and Azure DevOps Server that makes it easy to manage and deploy applications using AWS. It provides tasks that enable integration with many AWS services. It can also run commands using the AWS Tools for Windows PowerShell module, as well as the AWS Command Line Interface (AWS CLI).

Solution overview

This solution constructs a CI/CD pipeline with multiple stages in Azure DevOps. Using the AWS Toolkit for Azure DevOps, the first pipeline pulls AWS Glue job information from an AWS account (referred to in this post as the Dev AWS account) and stores it in Azure Repos for versioning.

A second pipeline then deploys this AWS Glue job to another AWS account (referred to as the Prod AWS account), also by using the AWS Toolkit for Azure DevOps.

The solution is designed as follows (Figure 1):

  1. A data engineer sets up the example ETL solution in the Dev AWS account using the AWS Management Console. The example ETL solution involves an AWS Glue crawler, table, database, and an AWS Glue job. The data engineer defines the data transformation and makes changes to the ETL process as required. The example ETL solution is explained in more detail later in this post.
  2. The data engineer runs the first pipeline to retrieve the AWS Glue crawler, jobs, and other configuration data from the Dev AWS account and store them in Azure Repos. The code is then reviewed and promoted to the main branch.
  3. Once the changes are approved, the data engineer runs the second pipeline to retrieve the AWS Glue data from Azure Repos and deploy it into the Prod AWS account.

Figure 1: AWS Glue Jobs DevOps Process Overview with Azure DevOps for Versioning

Prerequisites

This post assumes you have met the following prerequisites:

Walkthrough

Step 1. First, set up an AWS service connection in Azure DevOps (Figure 2). Follow these instructions to create a new service connection in Azure Pipelines.

  1. Use the “AWS Service Connection” type to create the Dev AWS account connection.
  2. Repeat the process to create the Prod AWS account connection.
  3. Once the service connections are created successfully, continue to the next step.

Figure 2: Adding Azure DevOps Service Connection Sample

Step 2. Create a pipeline in Azure Pipelines to retrieve the AWS Glue template and configuration from the Dev AWS account (Figure 3).

  1. Create a new pipeline by following these instructions.
  2. Add a new task within the Azure Pipelines of the type “AWS Tools for Windows PowerShell Script.”
  3. Choose the “Dev AWS Service Connection” that you previously created and the AWS Region you are using.

Step 3. Once you have added the task, replace the script inside the task with this script. Verify that you have replaced all placeholders in brackets, such as “[REPLACE AWS GLUE JOB NAME]”.


Figure 3: Adding AWS Tools for Windows PowerShell Script
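
As a rough illustration of what an export step like this involves, the following Python sketch (not the PowerShell script used in the task) turns a job description, shaped like the response of the AWS Glue GetJob API, into a template that can be stored in a repository. The function names and the exact set of fields to drop are our assumptions: the sketch removes the CreatedOn and LastModifiedOn timestamps and the deprecated AllocatedCapacity, and drops MaxCapacity when WorkerType or NumberOfWorkers is set, because the two sizing styles should not be combined.

```python
import json
from copy import deepcopy

# Assumption: fields that describe an existing job rather than configure a
# new one, plus the deprecated AllocatedCapacity. Adjust for your own jobs.
FIELDS_TO_DROP = {"CreatedOn", "LastModifiedOn", "AllocatedCapacity"}


def job_to_template(job):
    """Return a copy of a job description that is suitable for re-creation."""
    template = {
        key: deepcopy(value)
        for key, value in job.items()
        if key not in FIELDS_TO_DROP
    }
    # MaxCapacity should not be set together with WorkerType/NumberOfWorkers.
    if "WorkerType" in template or "NumberOfWorkers" in template:
        template.pop("MaxCapacity", None)
    return template


def render_template(job):
    """Serialize the template as stable, diff-friendly JSON text."""
    # default=str converts any value json cannot encode (for example a
    # datetime) into its string form instead of raising an error.
    return json.dumps(job_to_template(job), indent=2, sort_keys=True, default=str)
```

Sorting the keys keeps the stored file stable between runs, so a pull request shows only real changes to the job.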

Step 4. Create the mapping file in your repository:

  1. Navigate to the Azure Repos repository you used previously.
  2. Create a new file named “mapping.json” in the same directory.
  3. Copy and paste the following code into the file and replace all placeholders in brackets with the respective values for your environment. “DEV S3 BUCKET NAME WHERE GLUE SCRIPT EXISTS” can be found in the Job details section of AWS Glue in the Script path field (Figure 4).

{
   "s3://[REPLACE WITH DEV S3 RAW DATA BUCKET]": "s3://[REPLACE WITH PROD S3 RAW DATA BUCKET]",
   "[REPLACE WITH DEV GLUE SERVICE ROLE ARN]": "[REPLACE WITH PROD GLUE SERVICE ROLE ARN]",
   "s3://[REPLACE WITH DEV S3 TRANSFORM DATA BUCKET]": "s3://[REPLACE WITH PROD S3 TRANSFORM DATA BUCKET]",
   "[REPLACE WITH DEV S3 BUCKET NAME WHERE GLUE SCRIPT EXISTS]": "[REPLACE WITH PROD S3 BUCKET NAME WHERE GLUE SCRIPT EXISTS]"
}

Figure 4: Example AWS Glue Job Script S3 path
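
To make the role of the mapping file concrete, here is a minimal Python sketch of how a deployment step might apply it, assuming the mapping is treated as plain text substitution from Dev values to Prod values. The function name apply_mapping and the sample values in the usage note are hypothetical; the deployment script may apply the mapping differently.

```python
import re


def apply_mapping(text, mapping):
    """Replace every Dev value found in text with its Prod counterpart."""
    if not mapping:
        return text
    # Try longer keys first so a short key (such as a bucket name) cannot
    # match inside a longer key (such as an s3:// URI that contains it).
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    # A single pass over the text means a Prod value that happens to contain
    # another Dev key is never substituted a second time.
    return pattern.sub(lambda match: mapping[match.group(0)], text)
```

For example, with a mapping of "s3://dev-raw" to "s3://prod-raw", the text "s3://dev-raw/input/" becomes "s3://prod-raw/input/", while text that contains no Dev values is returned unchanged.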

Step 5. Create a pipeline in Azure Pipelines to deploy the AWS Glue template and configuration into the Prod AWS account.

  1. Create a new pipeline by following these instructions.
  2. Add a new task within the Azure Pipelines of the type “AWS Tools for Windows PowerShell Script.”
  3. Choose the “Prod AWS Service Connection” that you previously created and the AWS Region you are using.
  4. Once you have added the task, replace the script inside the task with this script.

Verifying the pipelines

Once you have created the two pipelines from the preceding walkthrough, run the first pipeline. You should then find the following files in your designated Azure Repos repository:

  • job-script.py
  • job-template.json
  • mapping.json

Refer to Figure 5.


Figure 5: Example AWS Glue Job in Azure Repos
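
If you prefer to automate this check, for instance as an extra pipeline step, a small Python sketch like the following could confirm that the three files are present in a local checkout of the repository. The function name and the idea of running it against a checkout directory are our assumptions, not part of the walkthrough.

```python
from pathlib import Path

# The files the first pipeline is expected to leave in the repository.
EXPECTED_FILES = ("job-script.py", "job-template.json", "mapping.json")


def missing_files(repo_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in repo_dir."""
    root = Path(repo_dir)
    return [name for name in expected if not (root / name).is_file()]
```

An empty list means that all three files were found; otherwise the list names the files that still need to be produced or committed.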

Then run the second pipeline. You should be able to find the AWS Glue job in your Prod AWS account. Refer to Figure 6.


Figure 6: Example AWS Glue job was deployed in Prod AWS account via Azure Pipelines

Troubleshooting and considerations

When deploying an AWS Glue ETL pipeline across different AWS accounts and environments managed through Azure DevOps, there are several important considerations to ensure smooth, reliable operations. As the pipeline extracts data, runs transformations, and loads results in a distributed fashion, any inconsistencies or issues can impact the overall workflow. This section outlines key points to take into account around environment consistency, monitoring, access controls, versioning, dependencies and other operational aspects of the distributed Glue solution. Addressing these considerations during planning and implementation can help reduce troubleshooting efforts later.

  • Environment and Artifact Consistency: Ensure extraction sources, data schemas, and AWS Glue database configurations are consistently defined and synchronized across all AWS accounts and environments in Azure DevOps. Validate mappings during deployments to catch drift.
  • Monitoring and Alerting: Use Amazon CloudWatch to monitor AWS Glue job runs for failures. Integrate with Azure Pipelines for end-to-end visibility, and set up alarms to notify stakeholders promptly.
  • Access Control and Security: Assign least-privilege AWS Identity and Access Management (IAM) roles for the AWS resources accessed by Azure Pipelines.
  • Version Control and Conflict Avoidance: Isolate changes using Git feature branches and pull requests. Verify that there are no conflicts before merging releases to main branches that trigger deployments.
  • Dependency Management: Package and version dependencies like libraries, JARs, and configurations together. Automatically deploy packaged dependencies across environments using Azure Pipelines.
  • Testing and Validation: Rigorously test full end-to-end deployment workflows from development through non-production stages before production release. Validate each deployment after it completes as well.
  • Performance and Scalability: Design AWS resources such as AWS Glue jobs, databases, and storage to scale elastically with load. Tune Azure Pipelines job timeouts so that deployments can handle increasing workloads and data volumes over time. Monitor AWS Glue performance metrics to proactively address bottlenecks.
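
One way to validate mappings during deployments is a pre-deployment check that fails the pipeline when a file still contains an unreplaced “[REPLACE ...]” placeholder, or when a Dev value survives in text that is meant for Prod. The Python sketch below assumes both checks work on plain text; the function names are hypothetical.

```python
import re

# Matches bracketed placeholders such as "[REPLACE WITH PROD S3 RAW DATA BUCKET]".
PLACEHOLDER = re.compile(r"\[REPLACE[^\]]*\]")


def find_placeholders(text):
    """Return every unreplaced placeholder found in text."""
    return PLACEHOLDER.findall(text)


def find_leftover_dev_values(prod_text, mapping):
    """Return the Dev values (mapping keys) that still occur in prod_text."""
    # Caveat: a Prod value that contains a Dev value as a substring is
    # reported as well, so review hits before treating them as drift.
    return [dev_value for dev_value in mapping if dev_value in prod_text]
```

A pipeline step could run both functions and stop the deployment if either returns a non-empty list.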

Cleanup

After you have tested and verified your pipelines, remove any unused resources created for this example to avoid incurring charges. Use the AWS Management Console to delete the AWS Glue crawler, jobs, and Amazon S3 buckets. Then remove the unused pipelines from the Azure DevOps portal.

Conclusion

In this blog post, we provided a step-by-step guide to define, provision, and manage changes to an AWS Glue ETL solution using the AWS Toolkit for Azure DevOps to integrate version control and deploy resources to another AWS account. This example showcases how you can build an ETL solution in multiple development/production AWS accounts by using CI/CD pipelines with Azure DevOps services.

You can also learn more about deploying to AWS using AWS CloudFormation stacks directly from your existing Azure DevOps build pipeline in this blog post.



Chan Nyein Zaw

Chan, a Solution Architect at the AWS UK Financial Services team, has an extensive background in the industry. His journey has taken him through diverse roles, including Developer, IT Administrator, Development Manager, Development Consultant, and Cloud Architect. Chan is enthusiastic about assisting customers in creating globally accessible solutions and addressing everyday challenges through technology.

Brijesh Pati

Brijesh Pati is an Enterprise Solutions Architect at AWS. His primary focus is helping enterprise customers adopt cloud technologies for their workloads. He has a background in application development and enterprise architecture and has worked with customers from various industries such as sports, finance, energy and professional services. His interests include serverless architectures and AI/ML.