Building a serverless Jenkins environment on AWS Fargate

Jenkins is a popular open-source automation server that enables developers around the world to reliably build, test, and deploy their software. Jenkins uses a controller-agent architecture: the controller serves the web UI, stores configuration and related data on disk, and delegates jobs to worker agents, whose primary responsibility is to run them.

Amazon Elastic Container Service (Amazon ECS) is a fully managed container orchestration service, and has been a popular choice to deploy a highly available, fault tolerant, and scalable Jenkins environment. Traditionally, the Amazon Elastic Compute Cloud (Amazon EC2) launch type was the preferred choice, with Amazon Elastic Block Store (Amazon EBS) as the storage for the configurations and data. That changed with the introduction of AWS Fargate support for Amazon EFS.

The objective of this post is to walk you through how to set up a completely serverless Jenkins environment on AWS Fargate using Terraform.

Overview of the solution

The following diagram illustrates the solution architecture.

This diagram depicts the deployment architecture for Jenkins on AWS Fargate

The objective is to create a fully automated deployment of highly available, production-ready Jenkins in a serverless environment on AWS. We use the following components and services:

The Jenkins controller URL is backed by an Application Load Balancer. We recommend using AWS Certificate Manager (ACM) to provision an SSL certificate to associate with the load balancer. The SSL termination happens at the load balancer.
We use a VPC with at least two public and two private subnets.
The Jenkins controller runs as a service in Amazon ECS using Fargate as the launch type. We use Amazon Elastic File System (Amazon EFS) as the persistent backing store for the Jenkins controller task. The Jenkins controller and Amazon EFS are launched in private subnets.
Jenkins uses the Amazon ECS Fargate plugin to delegate to Amazon ECS to run the builds on Docker-based agents. Each Jenkins build is run on a dedicated Docker container that is wiped out at the end of the build. Jenkins agents discover the Jenkins controller task using AWS Cloud Map for service discovery.
You can use this plugin to create multiple definitions of the Amazon EC2 Container Service Cloud configuration. This allows you to use an ECS cluster with a Fargate Spot capacity provider to run Jenkins jobs that can tolerate interruptions, which lowers the cost of running, for example, large test suites and build jobs.
The Jenkins controller can assume AWS Identity and Access Management (IAM) roles based on the ARN provided with the Fargate task definition. You can define multiple ECS agent templates within the ECS plugin, each with a different permission scope, to follow security best practices.
AWS Backup ensures that you can restore all important data from the Jenkins controller in case of an accident.
You can additionally configure user notifications using Amazon Simple Notification Service (Amazon SNS) to alert users of any failures in the build jobs or in the environment.

Prerequisites

For this post, we developed a Terraform module to perform the deployment. The module and deployment example scripts are available in the GitHub repo. Refer to the README for a list of all module variables. The following are required to deploy this Terraform module:

Terraform 0.13 or later.
A VPC with at least two public and two private subnets.
An SSL certificate to associate with the Application Load Balancer. We recommend an ACM certificate. The main Terraform module does not create the certificate. However, the example in the example directory uses the public ACM module to create an ACM certificate and passes it to the serverless Jenkins module. You can do the same or explicitly pass the ARN of a certificate that you previously created or imported into ACM.
An admin password for Jenkins must be stored in AWS Systems Manager Parameter Store. This parameter must be of type SecureString and have the name jenkins-pwd.
Terraform must be bootstrapped. This means that an Amazon Simple Storage Service (Amazon S3) bucket for the Terraform state and an Amazon DynamoDB table for state locking must be initialized.
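
The admin password prerequisite can be met with the AWS CLI. The following is a minimal sketch, assuming the AWS CLI is installed and configured for the Region you deploy to. The function name store_jenkins_password is hypothetical; only the parameter name jenkins-pwd and the SecureString type come from the requirements above.

```shell
# Hypothetical helper: stores the Jenkins admin password in
# AWS Systems Manager Parameter Store as a SecureString named jenkins-pwd.
# Assumes the AWS CLI is configured with credentials and a default Region.
store_jenkins_password() {
  local password="$1"
  if [ -z "$password" ]; then
    echo "usage: store_jenkins_password <password>" >&2
    return 1
  fi
  aws ssm put-parameter \
    --name "jenkins-pwd" \
    --type "SecureString" \
    --value "$password"
}
```

To keep the password out of your shell history, you could read it from a prompt (for example, with the bash read -s builtin) and pass the resulting variable to the function.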

Deployment

All the required resources and configurations are packaged as a Terraform module, which means it’s not directly deployable. However, an example deployment is in the example directory. To deploy the example, complete the following steps:

Clone the GitHub repo.
Change your working directory to the example/bootstrap directory.

Included in this directory is sample Terraform code to bootstrap the initial Terraform state management resources. A best practice is to use a state backend such as Amazon S3 and a locking mechanism such as DynamoDB when using Terraform. For more information, see State Storage and Locking. Because this bootstrap code creates Terraform state management resources, special care must be taken to save the resultant Terraform state file. Be aware that this state is only saved to a local file named terraform.tfstate. Make sure to save this state file if you want to maintain the S3 state bucket and DynamoDB lock table using Terraform.

Replace my-state-bucket and my-lock-table with your preferred names:

terraform init
terraform apply \
    -var="state_bucket_name=my-state-bucket" \
    -var="state_lock_table_name=my-lock-table"

To deploy the module, change your working directory to example.
Copy vars.sh.example to vars.sh.
Edit the variables in vars.sh as necessary, providing the details specific to your environment (VPC, subnets, Route 53 zone ID and domain name, state bucket, state locking table, and the Region in which you intend to deploy).
Run deploy_example.sh.

After you deploy all the resources, open a browser and enter the Route 53 alias record name. This opens the Jenkins web UI. Sign in with the user ecsuser and the password stored in Parameter Store. All the plugins listed in plugins.txt are installed, and the Fargate container service cloud configurations are applied. Two clouds are created: one for the Fargate cluster and the other for the Fargate Spot cluster. The following screenshot shows the Fargate cloud.

This picture depicts the Amazon EC2 container service cloud configuration for the Amazon Elastic Container Service (ECS) / Fargate plugin.

You’re presented with two preconfigured jobs.

This picture depicts the two pre-provisioned Jenkins jobs. One of them would schedule jobs on a standard FARGATE cluster and the other would schedule jobs on a cluster with FARGATE_SPOT capacity provider.

Before we dive into those jobs, let’s take a look at capacity providers.

Amazon ECS on Fargate capacity providers enable you to use both Fargate and Fargate Spot capacity with your Amazon ECS tasks. For more information about capacity providers, see Amazon ECS capacity providers. With Fargate Spot, you can run interruption-tolerant Amazon ECS tasks at a discounted rate compared to the Fargate price. Fargate Spot runs tasks on spare compute capacity. When AWS needs the capacity back, your tasks are interrupted with a 2-minute warning.

As part of the module that we just deployed, two ECS Fargate clusters are provisioned: one with a FARGATE capacity provider and the other with a FARGATE_SPOT capacity provider. The Jenkins controller service is configured to run on the cluster with the FARGATE capacity provider.
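
To check which capacity provider each cluster uses, you can query Amazon ECS from the command line. The following is a sketch, assuming a configured AWS CLI. The function name show_capacity_providers is hypothetical, and the cluster names are arguments that you supply, because the names created by the module are not listed in this post.

```shell
# Hypothetical helper: prints the capacity providers attached to the
# given ECS clusters. Arguments are cluster names or ARNs from your
# own deployment (placeholders here, not values defined by this post).
show_capacity_providers() {
  if [ "$#" -eq 0 ]; then
    echo "usage: show_capacity_providers <cluster> [<cluster> ...]" >&2
    return 1
  fi
  aws ecs describe-clusters \
    --clusters "$@" \
    --query 'clusters[].{name:clusterName,providers:capacityProviders}' \
    --output table
}
```

Calling the function with both cluster names should list FARGATE for one cluster and FARGATE_SPOT for the other if the deployment matches the description above.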

The two ECS container service cloud configurations created earlier determine which agent the jobs run on. For example, you can configure all the workloads that can tolerate interruptions (such as build jobs and test suites) to run on fargate-cloud-spot, which is backed by an ECS Fargate cluster with the FARGATE_SPOT capacity provider, and all the other workloads (such as Terraform apply runs) to run on fargate-cloud, which is backed by an ECS Fargate cluster with the FARGATE capacity provider. The Fargate Spot capacity provider helps optimize costs by using the spare compute capacity in the AWS Cloud.

Because of the nature of those jobs, we call the jobs scheduled on the Fargate Spot capacity provider non-critical tasks and those scheduled on the Fargate capacity provider critical tasks. The Jenkins declarative pipeline for the critical task job looks like the following code:

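pipeline {
    // Run the build on an ECS agent that inherits its settings
    // from the agent template labeled 'build-example'.
    agent {
        ecs {
            inheritFrom 'build-example'
        }
    }
    stages {
        stage('Test') {
            steps {
                sh 'echo this was executed on non spot instance'
                // Pause for 120 seconds before finishing.
                sh 'sleep 120'
                sh 'echo sleep is done'
            }
        }
    }
}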

Test the solution

To test the solution, let’s first run the Simple Job Critical Task by choosing Build Now. This creates a task in the cluster with the FARGATE capacity provider.

The picture depicts that the Jenkins job run creates a task in the cluster with FARGATE capacity provider.

When the task is in Running state, it registers itself as an agent and the Jenkins job runs on this agent.

This picture depicts that the newly created ECS task registers itself as a Jenkins agent.

You can see the console output in Jenkins.

This picture depicts the Jenkins job console output.

When the job is successful, the task is stopped.

This picture depicts that the ECS task is stopped once the Jenkins job is finished.
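
You can confirm the same result from a terminal instead of the console. The following is a sketch, assuming a configured AWS CLI. The function name list_stopped_agent_tasks is hypothetical, and the cluster name is an argument that you supply from your own deployment.

```shell
# Hypothetical helper: lists the ARNs of recently stopped tasks in the
# given ECS cluster. The cluster name is a placeholder argument.
list_stopped_agent_tasks() {
  local cluster="$1"
  if [ -z "$cluster" ]; then
    echo "usage: list_stopped_agent_tasks <cluster>" >&2
    return 1
  fi
  aws ecs list-tasks \
    --cluster "$cluster" \
    --desired-status STOPPED
}
```

Amazon ECS keeps stopped tasks visible only for a limited time, so run the check soon after the build finishes.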

Cleanup

To clean up all the resources that were created to build the serverless Jenkins environment, navigate to the example directory and run the following command:

terraform destroy -auto-approve

To clean up the DynamoDB lock table and the S3 state bucket, navigate to the example/bootstrap directory and run the following command:

terraform destroy -auto-approve

Limitations

Because Amazon EFS is network-attached storage, in its default configuration it may not be performant enough for some Jenkins use cases. Proper research and testing is advised before using Amazon EFS for production workloads. For more information, see Amazon EFS performance.

If necessary, set Amazon EFS to Max I/O performance mode. You can do this in the Terraform module by setting the variable efs_performance_mode to maxIO. For more information, see What are the differences between General Purpose and Max I/O performance modes in Amazon EFS?
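
Before changing the setting, it can help to see which mode your file systems currently use. The following is a sketch, assuming a configured AWS CLI. The function name show_efs_performance_modes is hypothetical. Note that, to our knowledge, Amazon EFS fixes the performance mode when a file system is created, so changing efs_performance_mode on an existing deployment would likely replace the file system. Test this on a non-production environment and confirm that you have a backup first.

```shell
# Hypothetical helper: prints the ID, performance mode, and throughput
# mode of every EFS file system in the Region the AWS CLI is set to.
show_efs_performance_modes() {
  aws efs describe-file-systems \
    --query 'FileSystems[].{id:FileSystemId,mode:PerformanceMode,throughput:ThroughputMode}' \
    --output table
}
```

A file system in the default configuration reports generalPurpose as its mode, and one created with the variable set as described above reports maxIO.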

Conclusion

In this post, we went through how to deploy a highly available, scalable, fault tolerant, production-ready Jenkins environment in a completely serverless environment on Fargate. We also showed how to use the Terraform module to automate the deployment of all the necessary resources and associated configurations. The post also introduced how to use FARGATE and FARGATE_SPOT capacity providers.

When we deploy each job in its own Docker container, we can control the CPU and memory requirements for each job, the base container that defines the environment, and the tooling necessary for the job. More importantly, we avoid loading the controller with job runs.

AWS DevOps & Developer Productivity Blog