Choosing the right storage for cloud native CI/CD on Amazon Elastic Kubernetes Service

Building and testing software is a resource-intensive operation that usually involves a fleet of very powerful servers waiting in the wings for build jobs.

With the rise of cloud native continuous integration/continuous delivery (CI/CD) systems on Kubernetes (e.g., Tekton, Jenkins X), we’re seeing a shift from the large (and often over-provisioned) static fleet of build servers to a more dynamic build fleet, provisioned on demand and orchestrated with Kubernetes.

Cloud native CI/CD approaches enable provisioning of the right architectures (AMD, Intel, and ARM) at the right time (10 worker instances on the weekend vs 100s on a weekday) for the right cost (reserved pricing, on-demand, and Spot fleets).

This movement from a large, statically provisioned server fleet with directly attached storage to the Kubernetes world of ephemeral and network-attached storage can take some adjustment.

AWS scaling configurations for EKS

In this blog post, I share my journey searching for the right storage for a Cloud Native CI/CD system powered by Amazon Elastic Kubernetes Service (Amazon EKS) on AWS. I will define the build project and how I set up the workspace. I will then benchmark various AWS storage services to identify which is best for this particular workload. The benefits of choosing the right storage service include effortless scale, increased performance, and an improved user experience.

The build project

Let’s select an example project to build and test, one that helps demonstrate the impact of the storage choices made.

I chose the ‘Landing Zone Accelerator on AWS’ project from AWS Labs on GitHub. It represents a typical large TypeScript-based project and includes a reasonable number of test cases that you can execute.

Setting up a workspace

Cloud Native CI/CD uses the concept of a workspace where your work occurs. Let’s consider a few ideal properties:

Speed with small files:
- Source code builds typically read and write 100,000s of very small files.
- My example project starts with ~1,600 files, but ends up with ~85,000 after the install and build are finished!
Persistence between stages:
- If I’ve already ‘built’ my software, and now want to ‘test’ it, then I should be able to use the workspace created during ‘build’ for my ‘test’.
- If my ‘build’ is built on one worker instance, and my ‘test’ ends up scheduled on a different one, then we shouldn’t need to start over.
Size efficient:
- Source code is inherently ‘compressible’. I like to be efficient with my workspace storage.
- My example project, after building, consumes 1.6GB of uncompressed space. However, it compresses down to 300MB, a compression ratio of a little over 5:1!
Accessible beyond our build cluster:
- I have developers and other downstream consumers that want to view test results and use the built outputs as inputs to other project builds.
- Ideally these items land on an Amazon Simple Storage Service (Amazon S3) bucket which provides maximum accessibility via techniques like cross-account policies, pre-signed URLs, etc.
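
To see how your own project compares against the first property, it helps to count the files in a workspace and look at how small they are. The following is a minimal sketch, not the tooling used for this post; the names `WorkspaceProfile` and `profile_workspace` are hypothetical and exist only for illustration.

```python
import os
import statistics
from dataclasses import dataclass


@dataclass
class WorkspaceProfile:
    file_count: int      # number of regular files found
    total_bytes: int     # sum of their sizes
    median_bytes: float  # median file size, a rough "how small are they" signal


def profile_workspace(root: str) -> WorkspaceProfile:
    """Walk `root` and summarize how many files it holds and how big they are."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks (common under node_modules/.bin) so that
            # link targets are not counted a second time.
            if os.path.islink(path):
                continue
            sizes.append(os.path.getsize(path))
    median = statistics.median(sizes) if sizes else 0.0
    return WorkspaceProfile(len(sizes), sum(sizes), median)
```

Running `profile_workspace` on a checkout before and after the install and build steps gives a quick view of how much the file count grows.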

Let’s find the right solution to the first property. I measure the time it takes to perform a yarn install and yarn build while changing the storage approach that underlies our workspace.
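
The measurement itself can be as simple as timing each command's wall-clock duration. The sketch below is one way to do that, assuming `yarn` is on the PATH of the build container; the helper names `time_command` and `time_steps` are hypothetical, and this is not necessarily how the timings in this post were collected.

```python
import subprocess
import time


def time_command(cmd, cwd=None):
    """Run `cmd` to completion and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    # check=True raises CalledProcessError on a non-zero exit code,
    # so a failed step is never reported as a (misleadingly fast) timing.
    subprocess.run(cmd, cwd=cwd, check=True)
    return time.perf_counter() - start


def time_steps(steps, cwd=None):
    """Time each (name, command) pair in order and return {name: seconds}."""
    return {name: time_command(cmd, cwd=cwd) for name, cmd in steps}
```

For example, `time_steps([("install", ["yarn", "install"]), ("build", ["yarn", "build"])], cwd="/workspace")` would time both steps, where `/workspace` is an assumed mount path for the workspace volume.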

Amazon Elastic Block Store (Amazon EBS)

First, I’ll measure using the Amazon Elastic Block Store (Amazon EBS) volume attached to our worker instance, exposed to the Pod through an emptyDir Kubernetes volume, such as:

  volumes:
    - name: workspace
      emptyDir:
        sizeLimit: 5Gi

The Pod that I execute and measure uses an Init container to ‘clone’ our project’s repository from GitHub, then executes yarn install followed by yarn build.
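
To make that Pod layout concrete, here is a rough sketch that builds an equivalent manifest as a Python dictionary and can emit it as JSON, which kubectl accepts as well as YAML. Several details are assumptions rather than the exact configuration used for this post: the container images, the `/workspace` mount path, the working directory, the clone URL, and the function name `build_pod_manifest`.

```python
import json

# Assumed clone URL for the example project; adjust to your own repository.
REPO_URL = "https://github.com/awslabs/landing-zone-accelerator-on-aws.git"


def build_pod_manifest(name, empty_dir=None):
    """Return a Pod manifest: an init container clones, the main container builds."""
    if empty_dir is None:
        empty_dir = {"sizeLimit": "5Gi"}
    mount = {"name": "workspace", "mountPath": "/workspace"}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "initContainers": [{
                "name": "clone",
                "image": "alpine/git",  # hypothetical image; its entrypoint is git
                "args": ["clone", "--depth", "1", REPO_URL, "/workspace/src"],
                "volumeMounts": [mount],
            }],
            "containers": [{
                "name": "build",
                "image": "node:18",  # hypothetical image choice
                # Assumed working directory; point it at the folder
                # that contains the project's package.json.
                "workingDir": "/workspace/src",
                "command": ["sh", "-c", "yarn install && yarn build"],
                "volumeMounts": [mount],
            }],
            # Both containers share the same emptyDir-backed workspace.
            "volumes": [{"name": "workspace", "emptyDir": empty_dir}],
        },
    }
```

Printing `json.dumps(build_pod_manifest("build-1"), indent=2)` and piping the result to `kubectl apply -f -` would create one such Pod; generating 15 names in a loop would reproduce the parallel test.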

For a single pod, this takes 1 minute 30 seconds:

Single pod

Not bad! But does this scale? In my experience, creating, reading, and writing millions of small files is difficult for any storage system. I’ll try 15 parallel builds, which represents handling 1,275,000 files:

15 pod builds

What happened? Scaling has added 4 minutes and 30 seconds to our times! How much of this is due to Disk I/O and managing all those small files?

I’ve chosen a ‘memory optimized’ (r6.8xlarge) Amazon Elastic Compute Cloud (Amazon EC2) instance for my worker instance on purpose. I’ve found that nothing beats a RAM Disk for software build performance! Although most software projects end up with a lot of files, they usually aren’t very big. I can get away with a 5GB RAM Disk and still have plenty of memory left for our build processes.
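
Whether a RAM Disk leaves enough memory for the builds themselves is simple arithmetic. The sketch below assumes a node with 256 GiB of memory, which is my assumption for illustration only (check the specification of your own instance type), and it ignores memory reserved for the kubelet and the operating system. The function name `ram_disk_budget` is hypothetical.

```python
def ram_disk_budget(node_memory_gib, parallel_builds, ram_disk_gib):
    """Return (GiB reserved for RAM disks, GiB left per build for processes).

    Worst case: every build fills its RAM disk completely.
    """
    reserved = parallel_builds * ram_disk_gib
    if reserved >= node_memory_gib:
        raise ValueError("RAM disks alone would exhaust node memory")
    remaining_per_build = (node_memory_gib - reserved) / parallel_builds
    return reserved, remaining_per_build


# Assumed numbers: 256 GiB node, 15 parallel builds, 5 GiB RAM disk each.
reserved, per_build = ram_disk_budget(256, 15, 5)
print(f"{reserved} GiB reserved, {per_build:.1f} GiB left per build")
```

Keep in mind that files written to a memory-backed volume count against the memory limit of the container that wrote them, so any container memory limits need to include the RAM Disk size.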

I’ll re-run our baseline using a memory-backed ephemeral disk, such as:

  volumes:
    - name: workspace
      emptyDir:
        sizeLimit: 5Gi
        medium: Memory

baseline

For the baseline, the times are basically the same as when we used Amazon EBS-backed storage. This is not surprising, as a single build won’t hit disk I/O constraints. What if I run 15 parallel builds?

15 parallel builds

That’s better! You can see that Disk I/O was responsible for an increase of more than 2 minutes per build.

One way I could improve Amazon EBS performance is by attaching multiple volumes to the instance until I saturate the instance’s Amazon EBS bandwidth or reach the maximum number of attached volumes allowed. In my experience, you hit volume attachment or Amazon EBS bandwidth limits before saturating the available CPU and memory. This leads back to memory as the most performant option.

Where use of memory for workspace storage doesn’t provide enough room, you can consider instance families with directly attached NVMe storage. Now that I’ve established ‘memory backed ephemeral’ as the right approach for our first property, let’s look at the rest. Memory-backed won’t satisfy properties 2, 3, or 4, as it only exists for the lifecycle of the pod – then it’s reclaimed and destroyed automatically.

Amazon S3

Let’s consider ‘restoring’ and ‘saving’ our workspace directly to Amazon S3. Here you can handle restoration and backup of the workspace in pre-build and post-build Pod steps.

What does this do to build-times? After the build is finished, I’ll create a compressed archive of our workspace, and then ship it to an S3 bucket.

Direct to an S3 bucket

Persisting the workspace to Amazon S3 has added about 1 minute to our build times. I’ll break down where this extra time has been spent: about 50 seconds creating the compressed archive, and about 10 seconds transferring the object to Amazon S3.

I’ve used up CPU cycles and ephemeral storage space creating our temporary archive before sending it to Amazon S3. That CPU and space would be better spent building software!
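
A breakdown like this can be gathered by timing the archive and transfer steps separately. The following is a minimal sketch of that idea, not the exact post-build step used here: it writes a gzip-compressed tar of the workspace, then hands the archive to an `upload` callable supplied by the caller. The name `persist_workspace` and the shape of the returned dictionary are assumptions for illustration.

```python
import os
import tarfile
import time


def persist_workspace(workspace, archive_path, upload):
    """Archive `workspace`, call `upload(archive_path)`, and time both steps."""
    raw_bytes = 0

    def count_bytes(info):
        # Called once per archive member; tally regular-file sizes
        # so the compression ratio can be computed afterwards.
        nonlocal raw_bytes
        if info.isfile():
            raw_bytes += info.size
        return info

    start = time.perf_counter()
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(workspace, arcname=".", filter=count_bytes)
    archived = time.perf_counter()

    upload(archive_path)
    uploaded = time.perf_counter()

    return {
        "archive_seconds": archived - start,
        "upload_seconds": uploaded - archived,
        "raw_bytes": raw_bytes,
        "archive_bytes": os.path.getsize(archive_path),
    }
```

With boto3 installed, `upload` could be something like `lambda path: boto3.client("s3").upload_file(path, "my-bucket", "workspace.tar.gz")`, where the bucket and key names are placeholders. Dividing `raw_bytes` by `archive_bytes` gives the compression ratio achieved.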

Amazon FSx for Lustre

Are there storage systems that are a better fit? I’ll consider Amazon FSx for Lustre.

FSx for Lustre fits well with the properties I want:

- Using a data repository association (DRA), FSx for Lustre can synchronize files automatically from the filesystem to Amazon S3, and from Amazon S3 back to the filesystem.
- The DRA would let me selectively include certain directory patterns to send to Amazon S3. I could keep the full workspace contents in the FSx for Lustre filesystem, and selectively send the useful build outputs to Amazon S3.
- FSx for Lustre can be configured to apply compression automatically. Given the 5:1 compression ratio on the source code, this should result in excellent space efficiency.
- A Kubernetes CSI Driver is available for FSx for Lustre. This means that I can mount a filesystem directly to our Pods as a Kubernetes Persistent Volume (PV). CSI and PV are standard interfaces, familiar to anyone operating Kubernetes.

I’ll re-run the 15 build pods using FSx for Lustre and see if I can improve on the minute it took to persist the workspace.

15 build pods to FSx for Lustre

That’s better! Persisting to FSx for Lustre takes around 45 seconds less than persisting directly to Amazon S3. I’ve offloaded compression and replication of build outputs to FSx for Lustre!

Note that it is 45 seconds quicker to ‘save’ the workspace, and by extension 45 seconds quicker to ‘restore’ the workspace. Over 100s or 1,000s of builds, those time and resource savings add up to significant numbers.
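
To put a number on that, here is a small back-of-the-envelope calculation. It assumes each build performs one save and one restore, and that the restore saving equals the measured 45-second save saving; the second point is an extrapolation rather than a measurement. The function name `persistence_savings` is hypothetical.

```python
from datetime import timedelta


def persistence_savings(builds, seconds_saved_per_transfer=45, transfers_per_build=2):
    """Total time saved across `builds` (default: one save plus one restore each)."""
    return timedelta(seconds=builds * seconds_saved_per_transfer * transfers_per_build)


# 1,000 builds x 45 seconds x 2 transfers = 90,000 seconds
print(persistence_savings(1000))
```

Under these assumptions, 1,000 builds save 90,000 seconds, or 25 hours of node time that can go back to building software.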

I chose to provision a ‘middle of the road’ FSx for Lustre filesystem for my tests (500 MB/s of throughput per TiB of storage). You can adjust the filesystem properties to achieve nearly any combination of cost and performance that you desire.

Build objects into Amazon S3

I’ve persisted our workspaces to FSx for Lustre in a /workspaces/ directory. I’ve configured the DRA to only replicate the /outputs/ directory to Amazon S3. Right now our S3 bucket is empty.

I will execute our ‘test stage’ using our persisted workspace. This executes yarn test and saves a test report in our /outputs/ directory. This should replicate to Amazon S3 when our build is done.

The ‘test stage’ successfully restored the workspace created during the ‘build stage’ and ran all tests. Looking in Amazon S3, I can see my test report!

aws s3 ls s3://fsxblogpoststack-fsxlfilesystemfsxreplicationbuck-1f95qosmn3u5i/e561a021-9a19-4e32-8325-43702b9f3697/
2023-02-28 12:13:02          0 
2023-02-28 12:56:19     250525 test-results.txt

When I download and look at the file, I can see the outputs from yarn test.

aws s3 cp s3://fsxblogpoststack-fsxlfilesystemfsxreplicationbuck-1f95qosmn3u5i/e561a021-9a19-4e32-8325-43702b9f3697/test-results.txt .
download: s3://fsxblogpoststack-fsxlfilesystemfsxreplicationbuck-1f95qosmn3u5i/e561a021-9a19-4e32-8325-43702b9f3697/test-results.txt to ./test-results.txt

tail -5 test-results.txt
@aws-accelerator/installer:  installer-stack.ts  |     100 |      100 |     100 |     100 |                   
@aws-accelerator/installer:  solutions-helper.ts |     100 |      100 |     100 |     100 |                   
@aws-accelerator/installer: ---------------------|---------|----------|---------|---------|-------------------
 >  Lerna (powered by Nx)   Successfully ran target test for 9 projects
Done in 474.57s.

Conclusion

In this blog post, I shared my approach for selecting the right storage for a Cloud Native CI/CD system powered by Amazon Elastic Kubernetes Service (Amazon EKS) on AWS. I defined the build project and how I set up the workspace. I then demonstrated how to benchmark various AWS storage services to identify which is best for this particular workload.

Scheduling as many software build jobs as the hardware can handle is our ultimate goal. I’ve demonstrated that storage choices for the build workspace can have a big impact. The ultimate at-scale performer – memory backed ephemeral – is great for building software, but it does not work well when we need persistence.

I also showed that workspace persistence is possible, but you must be careful not to rob resources from the node that should be used for building software. FSx for Lustre emerged as the clear winner for persisting our workspaces with the best performance, lowest consumption of node resources, flexibility to share build artifacts, and efficient use of storage via compression. In conclusion, the benefits of choosing the right storage service include effortless scale, increased performance, and an improved user experience.

Try it for yourself

All the Infrastructure as Code (IaC) and Kubernetes API calls to build and run this environment are available in /awslabs/ on GitHub.

Success with performance tuning demands experimentation and exploration. Don’t be afraid to experiment, taking into account the unique properties of your own environment. Use the methods and examples I’ve discussed here to come up with the right approach for you!