AWS Open Source Blog

Migrating Cortex CI/CD workflows to GitHub Actions

In this blog post, intern engineers Azfaar Qureshi and Shovnik Bhattacharya talk about their experience working with Cortex, a popular open source observability project. They share the challenges they faced and how they applied lessons learned to improve the development experience for other contributors in the Cortex Project.

The rise of open source has completely transformed the software industry. From Kubernetes to Linux to Git, open source projects have become massively popular and are often the industry standard within their respective fields. As such, it has become increasingly important—even necessary—for developers to be able to contribute to these projects. However, making new contributions is not always a developer-friendly experience.

Throughout our internship, we worked on various open source projects. We quickly found that contributing to these projects is different from working on a project that you or your organization already own. Typically, you would branch off the main repository to work on your changes locally (periodically pushing them if need be) and then create a pull request (PR) once you are done. However, this workflow is not always possible in large open source projects. For example, Cloud Native Computing Foundation (CNCF) projects follow a strict governance structure. Unlike maintainers or approvers, contributors cannot assign themselves issues or PRs. Contributors cannot create branches in the main repository or push to an existing branch either. Instead, they must fork the project and create PRs from there. However, these limitations cause an often overlooked problem: Most CI/CD providers, such as TravisCI or CircleCI, do not support having their CI workflows run out of the box on forked repositories.

At a glance, this process does not seem like much of an issue. After all, you can just set up an account with the CI provider and manually configure your workflows to run on the fork. However, as an individual contributor who is not part of a larger organization, you are often not eligible for special pricing plans. Your limited free-tier minutes might run out, which can become a barrier for frequent, long-term open source contributions.

On the other hand, if you do not set up CI on your fork, you are prone to creating unstable upstream PRs, and community effort will be wasted sifting through them. Another problem we noticed is that details on test results, logs, and artifacts are only visible by navigating to the CI provider’s website. Additionally, some CI providers, such as CircleCI, require OAuth access to your GitHub account, which can be a concern for security-conscious individuals who use GitHub single sign-on (SSO) because it introduces a potential security vulnerability.

We decided to address these issues in Cortex in particular because our teams at AWS work with Cortex frequently. Cortex is a popular open source project that provides scalable, highly available, and multi-tenant storage of time-series metrics. We aimed to improve the project’s CI infrastructure in terms of security, ease of use, and minimizing the barrier to entry for all developers.

Proposed solution

To address the pain points listed above, we evaluated alternative CI providers, ultimately deciding on GitHub Actions, which is GitHub’s first-party CI/CD platform. It not only addressed our concerns but also provided more advantages for the open source community than CircleCI, the existing CI provider for Cortex.

GitHub Actions addressed our concerns about ease of use. CI workflows written in GitHub Actions work out of the box on forked repositories. As mentioned above, ease of use has been a key feature lacking from CircleCI. Contributors are unable to run the CI pipeline until after an upstream pull request is made. Adopting GitHub Actions will ensure more stable and higher quality PRs before community effort is spent reviewing them. This approach helps increase developer velocity by reducing feedback cycles.

Furthermore, GitHub Actions helps projects align with the core open source value of transparency. Test results, logs, and artifacts can be viewed freely from within the GitHub repository rather than having this information withheld behind a third-party service.

Migrating to GitHub Actions also offers the advantage of future proofing. Cortex provides Arm releases of the project, but there is a lack of CI infrastructure to test this environment because CircleCI does not support Arm containers. GitHub Actions, however, does. This support is increasingly relevant to users of Cortex because most major cloud providers, such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP), provide Arm compute instances on which their deployments could be running. MacBooks are also switching to Arm, so having a CI pipeline that supports tests in this environment is important in the long term.

Another important benefit is potentially lower costs. GitHub Actions is completely free for open source repositories. CircleCI provides only an annual credit stipend, which may not serve the needs of larger open source projects as the community grows or workflows become more intensive. Running out of CircleCI credits was an actual problem for Cortex as captured in issue #2340.

Finally, GitHub Actions has a much richer feature set and supports a wider variety of workflows than CircleCI. GitHub Actions has 30 unique workflow triggers, whereas CircleCI only has three: push, manual, and schedule. Workflows such as pull request templates, closing stale issues, and running custom jobs on new releases are all natively supported with GitHub Actions.
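As a rough illustration of these triggers, the sketch below combines a scheduled stale-issue job with a job that runs on new releases. It is not taken from the Cortex repository: the workflow name, the cron schedule, the message, and the day counts are all assumptions chosen for the example.

```yaml
# Illustrative only: names, schedule, and thresholds are assumptions.
name: housekeeping
on:
  schedule:
    - cron: '0 3 * * *'      # once a day at 03:00 UTC
  release:
    types: [published]       # fires when a release is published

jobs:
  stale:
    # Only run the stale check for the scheduled trigger.
    if: github.event_name == 'schedule'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/stale@v3
        with:
          repo-token: ${{ secrets.GITHUB_TOKEN }}
          stale-issue-message: 'This issue has been inactive and will be closed soon.'
          days-before-stale: 60
          days-before-close: 15

  on-release:
    # Placeholder for a custom job that runs on new releases.
    if: github.event_name == 'release'
    runs-on: ubuntu-latest
    steps:
      - run: echo "Released ${{ github.event.release.tag_name }}"
```

Keeping both triggers in one file is only a convenience for the example; separate workflow files per trigger would work equally well.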

Technical evaluation

Our next step was to make sure that a migration to GitHub Actions was feasible from a technical point of view. We accomplished this by doing a deep dive into the feature offerings of GitHub Actions and CircleCI and comparing them with the features currently in use by the Cortex CI pipeline.

Comparison of features

[Table: feature offerings of GitHub Actions and CircleCI compared with the features currently in use by the Cortex CI pipeline]

This analysis allowed us to confirm that GitHub Actions had all the necessary features to move forward with the migration.

Cortex CircleCI workflow

Once we established that the migration was feasible, we took a closer look at how we would execute the shift. The Cortex CircleCI config file contained seven jobs that needed to be migrated, all part of one cohesive test-build-deploy workflow:

Job                     Description
lint                    Runs linting and ensures the vendor directory, protos, and generated documentation are consistent.
test                    Runs unit tests on the Cassandra testing framework.
integration-configs-db  Runs integration tests for database configurations.
integration             Runs integration tests after upgrading Go, pulling the necessary Docker images, and downloading the necessary module dependencies.
build                   Builds and saves an up-to-date Cortex image and website.
deploy_website          Deploys the latest version of the Cortex website. Triggered within the workflow.
deploy                  Deploys the latest Cortex image.

We captured the interdependencies between these jobs by making a dependency graph. This graph influenced our migration plan as we split our PRs into manageable chunks such that each part could be functional when merged upstream. In the first PR, we planned to include the test, build, and lint jobs as they had no dependencies. For the next PR, we planned to add the integration jobs as they depended on the build job. Finally, for the last PR, we would include the remaining deploy jobs.
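In GitHub Actions, this kind of dependency graph is expressed with the needs keyword. The following is a minimal sketch rather than the actual Cortex workflow: the job names follow the table above, the steps are placeholders, and the exact dependencies of the deploy jobs are an assumption, since only the dependency of the integration jobs on build is stated here.

```yaml
# Sketch only: steps are placeholders, and the "needs" lists for the
# deploy jobs are assumptions made for illustration.
name: ci
on: [push, pull_request]

jobs:
  # First PR: jobs with no dependencies.
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: echo "lint placeholder"
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: echo "test placeholder"
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: echo "build placeholder"

  # Second PR: integration jobs wait for build.
  integration:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - run: echo "integration placeholder"
  integration-configs-db:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - run: echo "integration-configs-db placeholder"

  # Third PR: deploy jobs run last (assumed dependencies).
  deploy_website:
    needs: [build, test, lint]
    runs-on: ubuntu-latest
    steps:
      - run: echo "deploy_website placeholder"
  deploy:
    needs: [build, test, lint, integration, integration-configs-db]
    runs-on: ubuntu-latest
    steps:
      - run: echo "deploy placeholder"
```

Jobs without a needs entry start in parallel, which is why the first PR could contain lint, test, and build on their own.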

Migration plan

Once we submitted our proposal upstream (#3274) and got community buy-in, our next step was to start our migration. We had a three-step migration plan:

1. Merge jobs upstream over multiple PRs (#3302, #3341, and #3368).

2. Have CircleCI and GitHub Actions jobs run in parallel. Once all our PRs were merged, both sets of jobs ran side by side on the same commits.

3. Phase out CircleCI jobs once successful. In the 2020-11-15 Cortex community meeting, it was decided to finish the last stage of the migration and remove CircleCI completely.

Technical challenges

At first, we thought this migration would be straightforward because we could translate the CircleCI config file to the GitHub Actions format line by line. However, when we started the migration, we found it was a little more involved because of the difference in maturity between the two platforms. CircleCI has been around since 2011, whereas CI/CD pipeline support was added to GitHub Actions in late 2019. As a result, some features that were simple one-liners in CircleCI required extra work to build out in GitHub Actions.

SSH key verification

Some of the jobs required communicating with the GitHub repository over SSH, namely the web-deploy job, which generates the code for https://cortexmetrics.io/ and pushes it to the upstream gh-pages branch. In CircleCI, setting up your container to communicate over SSH is very easy: you just provide the fingerprint and CircleCI will automatically set up the public and private key in your container.

# CircleCI config.yml
steps:
- add_ssh_keys:
    fingerprints:
    - "72:f2:e3:39:18:1f:95:17:90:b3:37:5e:49:ed:7e:a3"

With GitHub Actions, however, we needed to do this work manually. The first step was to clone the repository over SSH. Luckily, GitHub Actions provides a way to clone a repository with SSH using the official checkout action.

# GitHub Actions
steps:
  - name: Checkout Repo
    uses: actions/checkout@v2
    with:
      ssh-key: ${{ secrets.WEBSITE_DEPLOY_SSH_PRIVATE_KEY }}

Unfortunately, the official action would change the SSH key agent and overwrite settings, such as StrictHostKeyChecking, which were needed for security reasons. We did not have the permissions to change those settings back, and subsequent attempts to communicate with the repository over SSH would throw errors. To work around this issue, we created our own SSH agent so we could pass in the socket when using SSH commands.

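- name: Setup SSH Keys and known_hosts for GitHub Authentication to Deploy Website
  run: |
    # Trust GitHub's host key so later SSH commands do not prompt.
    mkdir -p ~/.ssh
    ssh-keyscan github.com >> ~/.ssh/known_hosts
    # Start a dedicated agent on a fixed socket, then load the deploy key from stdin.
    ssh-agent -a "$SSH_AUTH_SOCK" > /dev/null
    ssh-add - <<< "${{ secrets.WEBSITE_DEPLOY_SSH_PRIVATE_KEY }}"
  env:
    SSH_AUTH_SOCK: /tmp/ssh_agent.sock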

Once the SSH agent was set up, we could use it by simply providing the SSH_AUTH_SOCK as follows:

- name: Deploy Website
  run: make BUILD_IN_CONTAINER=false web-deploy
  env:
    SSH_AUTH_SOCK: /tmp/ssh_agent.sock

Hard-coded filepaths

Another issue we faced was incorrect protocol buffer (protobuf) generation. Practically the entire Cortex CI was done through GNU makefiles and shell scripts that had hard-coded filepaths. Unfortunately, those filepaths were only compatible with the CircleCI container structure. In GitHub Actions containers, you cannot reshape the file structure to match. Instead, you are restricted to operating within $GITHUB_WORKSPACE, the location of which changes from workflow to workflow.

The protobuf-generating command, protoc -I $(GOPATH)/src:/.../...:/.../, couldn’t find the files it was looking for under $GOPATH and consequently used an alternative filepath with outdated proto files as fallback. This caused the protobuf files to be incorrectly generated, resulting in many opaque errors in CI.

We solved this issue by symbolically linking the actual workspace to the expected workspace and updating corresponding environment variables. Although the fix was relatively simple, identifying the root cause with unrelated error messages required a lot of investigation.
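A symbolic-link workaround of this kind might look like the following step. This is a sketch under stated assumptions, not the change that was merged: the expected directory (example-org/example-project) and the GOPATH value are hypothetical stand-ins for whatever the makefiles hard-code.

```yaml
- name: Link the checkout into the path the makefiles expect
  run: |
    # Hypothetical location that hard-coded scripts look for under GOPATH.
    expected_dir="$HOME/go/src/github.com/example-org/example-project"
    mkdir -p "$(dirname "$expected_dir")"
    # Point the expected path at the real checkout.
    ln -s "$GITHUB_WORKSPACE" "$expected_dir"
    # Export GOPATH (assumed value) for all later steps in this job.
    echo "GOPATH=$HOME/go" >> "$GITHUB_ENV"
```

Writing to the file named by GITHUB_ENV makes the variable visible to subsequent steps, so tools such as protoc resolve paths under $GOPATH/src as they did in the old container layout.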

Release tag matching

Some of the jobs in the old workflow only ran on certain release tags. This restriction was enforced through the use of regular expressions in the CircleCI workflow. GitHub Actions, however, does not have complete regex support. We had to approximate the regular expressions using GitHub Actions Filter Patterns, which, although not as restrictive, was sufficient for our use case.

CircleCI regex: /^v[0-9]+(\.[0-9]+){2}(-.+|[^-.]*)$/

  • v12.0.1-rc.0 (matches)
  • v1.0.0 (matches)
  • v12.0.1-rc (matches)
  • v.1.1.1 (does not match)

GHA filter pattern: v[0-9]+.[0-9]+.[0-9]+**

  • v12.0.1-rc.0 (matches)
  • v1.0.0 (matches)
  • v12.0.1-rc (matches)
  • v.1.1.1 (does not match)
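For context, a filter pattern like this one is attached to the push trigger of a workflow. The snippet below is a minimal sketch of that wiring; the workflow name, the job name, and the placeholder step are assumptions, and only the pattern itself comes from the comparison above.

```yaml
# Sketch only: workflow name, job name, and step are illustrative.
name: release
on:
  push:
    tags:
      # A dot is literal in filter patterns; the trailing ** accepts any
      # suffix, which makes this looser than the original regex
      # (for example, it would also accept v1.0.0.1).
      - 'v[0-9]+.[0-9]+.[0-9]+**'

jobs:
  deploy:
    # Extra guard so the job only runs for tag refs.
    if: startsWith(github.ref, 'refs/tags/')
    runs-on: ubuntu-latest
    steps:
      - run: echo "Deploying ${GITHUB_REF#refs/tags/}"
```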

Migration success

Once we worked around these technical challenges, the next step was to open the pull requests upstream. The Cortex maintainers were great to work with and were prompt and thorough in their reviews. Our PRs were merged quickly, and then we started monitoring how our CI performed in the live environment. We navigated to the Actions tab on the repository and followed the different GitHub Actions workflows. We saw our CI successfully running on new releases, forks, and commits to master.

We also ensured that every job was running successfully.

Finally, the only thing left was for the upstream maintainers to remove CircleCI completely, which they plan to do shortly.

Conclusion

During the course of this project, we learned many lessons about working with a large open source community. We found that coding is only a small part of the work. The majority of the work is communicating, validating ideas, soliciting feedback, and proposing our own solutions to problems. In previous internships we had written code for problems that were already well defined and split into manageable chunks for us. However, this project taught us how to take charge of a larger task, break it down into manageable pieces, and see it through to completion.

Furthermore, we learned about the intricacies of working on an open source project and interacting with upstream maintainers as well as the rest of the community. Luckily for us, the Cortex community is great to work with. There was active discussion on our proposal, regular reviews on our pull requests, and detailed feedback from the maintainers. Overall, working on Cortex was a great first step into the open source community, and we hope to contribute to many such open source projects in the future.

Azfaar Qureshi

Azfaar Qureshi is a third-year computer engineering student at the University of Waterloo. He is currently interning at AWS and is interested in infrastructure and SRE.

Shovnik Bhattacharya

Shovnik Bhattacharya is a rising third-year computer engineering student at the University of Waterloo. He is currently interning at AWS and is interested in telemetry and machine learning.

The content and opinions in this post are those of the third-party authors and AWS is not responsible for the content or accuracy of this post.

Alolita Sharma

Alolita is a senior manager at AWS where she leads open source observability engineering and collaboration for OpenTelemetry, Prometheus, Cortex, and Grafana. Alolita is co-chair of the CNCF Technical Advisory Group for Observability, a member of the OpenTelemetry Governance Committee, and a board director of the Unicode Consortium. She contributes to open standards at OpenTelemetry, Unicode, and W3C. She has served on the boards of the OSI and SFLC.in. Alolita has led engineering teams at Wikipedia, Twitter, PayPal, and IBM. Two decades of doing open source continue to inspire her. You can find her on Twitter @alolita.