AWS Public Sector Blog
Building a secure and low-code bioinformatics workbench on AWS HealthOmics
The completion of the Human Genome Project by an international group of researchers in 2003 marked a pivotal milestone in biology, ushering in the post-genomic era. Thanks to technological advances, obtaining an individual’s genetic profile can now be achieved in a week and at a fraction of the original project’s cost. This ability to rapidly and affordably sequence genomes has given rise to a new age of multi-omics, where researchers unravel the molecular complexities underlying human health and disease. Accompanying the increased accessibility of sequencing is an increase in the volume of data generated and the need for bioinformatics expertise.
However, the large volume of data presents several challenges for traditional on-premises environments. First, processing these large genomic datasets involves a series of standardized steps with varying hardware and software requirements. The on-premises server responsible for core compute therefore needs a mixture of hardware and software configurations, which increases installation cost and maintenance complexity. Second, sequencing experiments are intermittent, because time is needed for data analysis to drive hypothesis generation and the execution of follow-up sequencing experiments. This results in spiky usage patterns, where the on-premises server is underutilized for most of the year with short bursts of demand for high computational resources. Third, the sensitive nature of genomic data mandates a secure environment for processing and storage. Finally, given the shortage of bioinformaticians, enabling clinicians and experts in other fields to perform their own analyses can accelerate the time-to-insight from the collected data.
To address these challenges, Singapore General Hospital (SGH), SingHealth Office of Academic Informatics (OAI) and Amazon Web Services (AWS) collaborated to develop a cost-effective, scalable cloud infrastructure that enables researchers to perform their own analyses on a centrally secured and compliant cloud platform.
AWS HealthOmics offers a suite of services that help bioinformaticians, researchers, and scientists to store, query, analyze, and generate insights from genomic and other biological data. It comprises three primary components: HealthOmics storage, which helps store and share petabytes of genomic data efficiently and at a low cost per billion bases (gigabase); HealthOmics analytics, which simplifies the preparation of genomic data for multi-omics and multimodal analyses; and HealthOmics Workflows, which automatically provisions and scales the underlying infrastructure for bioinformatics computations. HealthOmics Workflows also provides a graphical user interface for the execution of published pipelines, simplifying the process of performing bioinformatics analyses and optimizing the end-to-end execution process. HealthOmics Workflows is the main focus of the solution described in this post.
With HealthOmics Workflows, you can process and analyze your genomics data using either Ready2Run workflows or private workflows. Ready2Run workflows are preconfigured workflows published by third-party publishers while private workflows are user defined using Workflow Definition Language (WDL), Common Workflow Language (CWL) or Nextflow. To create a private workflow, in addition to providing the workflow definition, you need to containerize your workflow tools and store them in a private repository within Amazon Elastic Container Registry (Amazon ECR), a fully managed container registry. For detailed prerequisites and instructions on creating a private workflow, please refer to this user guide.
However, for bioinformatics workflows involving multiple tools, pulling each image individually can be time-consuming and challenging to manage. In this post, we showcase how SGH used DevOps tools from AWS to streamline this process by pulling the necessary containers, scanning them for vulnerabilities, and publishing them to Amazon ECR. We also describe an approach for publishing private workflows to HealthOmics Workflows whenever changes are made to the workflow files, enabling bioinformaticians to automate the deployment of bioinformatics workflows on HealthOmics.
Solution overview
The main objective of this solution is to implement version control for workflow files and establish a robust security posture by scanning for vulnerabilities in containers pulled from public repositories. To further simplify the deployment of the workflow, the solution offers a templatized approach of pulling containerized tools for common use cases such as single-cell RNA sequencing and whole-exome sequencing (WES) analysis.
To address these objectives, the solution involves two pipelines:
- Workflow dependencies pipeline – This pipeline streamlines the process of pulling required containers for the private workflow. It performs vulnerability scanning on the container images using Aqua Security’s Trivy before pushing them to a private container repository for use by the private workflow. This proactive security measure ensures that potential vulnerabilities are identified and addressed before deploying the workflow.
- AWS HealthOmics private workflow pipeline – This pipeline automates the process of deploying the private workflow to AWS HealthOmics.
To simplify repository management, a monorepo strategy is adopted, which allows various pipelines to reside within a single repository. This approach triggers the appropriate pipeline only when changes are made to the specific workflow being worked on, eliminating the need to manage multiple repositories and streamlining the development process.
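The routing idea behind the monorepo strategy can be sketched in a few lines of Python. This is a minimal illustration, not the solution's actual trigger configuration: the folder layout `workflows/<workflow-name>/...` and the pipeline labels `dependencies` and `private-workflow` are assumptions made for this example.

```python
from pathlib import PurePosixPath

# Assumed (hypothetical) monorepo layout: workflows/<workflow-name>/<files>
DEPENDENCY_FILE = "container-requirements.txt"


def pipelines_to_trigger(changed_paths):
    """Map changed file paths to sorted (workflow, pipeline) pairs."""
    triggered = set()
    for raw in changed_paths:
        parts = PurePosixPath(raw).parts
        if len(parts) < 3 or parts[0] != "workflows":
            continue  # change is outside any workflow folder
        workflow = parts[1]
        if parts[-1] == DEPENDENCY_FILE:
            # The image list changed: run the dependencies pipeline.
            triggered.add((workflow, "dependencies"))
        else:
            # Any other workflow file: run the private workflow pipeline.
            triggered.add((workflow, "private-workflow"))
    return sorted(triggered)
```

For example, a push that touches `workflows/rnaseq/container-requirements.txt` and `README.md` would select only the dependencies pipeline for the `rnaseq` workflow.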
The solution embraces a shift-left approach to address the extensive security requirements, where testing is done early prior to deployment. This proactive step ensures that potential security issues are identified and addressed early in the development lifecycle, reducing the risk of vulnerabilities propagating to production environments.
Architecture overview
In this post, you will learn how to use AWS developer tools such as AWS CodePipeline and AWS CodeBuild to streamline the development and deployment process of bioinformatics workflows to AWS HealthOmics.
Workflow dependencies pipeline
Figure 1 shows the workflow dependencies pipeline described in this post.
We use AWS CodePipeline, a managed continuous delivery service, to automate the process of pulling the necessary containers for the private workflow. Release cycles for the monorepo were implemented using the V2 type pipeline.
The pipeline consists of the following stages:
1. Source – The source of the pipeline is a GitHub repository. The pipeline is triggered when a developer pushes the container-requirements.txt file within the workflow folder, which contains a list of public image URIs.
Sample content of the container-requirements.txt file:
quay.io/biocontainers/kallisto:0.50.1--hc877fd6_1
quay.io/biocontainers/ensembl-vep:106.1--pl5321h4a94de4_0
Note that the sample above only illustrates the format of the container-requirements.txt file. The specific container images listed may not be accurate or up to date.
2. Build – In this stage, AWS CodeBuild, a managed continuous integration service, pulls the images listed in the container-requirements.txt file into the AWS CodeBuild environment. AWS CodeBuild then scans the images for vulnerabilities using Trivy. The scan results are uploaded to an Amazon Simple Storage Service (Amazon S3) bucket.
3. Notify – After the vulnerability results are uploaded to the S3 bucket, an AWS Lambda function is triggered. This function summarizes the vulnerabilities found for each image, listing the number of critical and high findings. It then sends this summary to developers for review through Amazon Simple Notification Service (Amazon SNS), a managed messaging service.
4. Approval – We use Amazon SNS to send a notification to developers that the pipeline is pending their approval.
5. Deploy – With the approval granted, AWS CodeBuild pulls the container images based on the container-requirements.txt file, creates private Amazon ECR repositories, and pushes the images to the created private repositories. The images are tagged with the latest tag and a tag representing the current date.
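To make the Notify and Deploy stages more concrete, the following Python sketch shows one way the supporting logic could look. It is an illustration under stated assumptions rather than the code used in the solution: the function names are hypothetical, the `#` comment handling and the `YYYY-MM-DD` date tag format are assumptions, and the report is assumed to be Trivy's JSON output, in which findings appear under `Results` and `Vulnerabilities` with a `Severity` field.

```python
from datetime import date


def parse_requirements(text):
    """Return image URIs, skipping blank lines and '#' comments (assumed)."""
    lines = (ln.strip() for ln in text.splitlines())
    return [ln for ln in lines if ln and not ln.startswith("#")]


def summarize_trivy_report(report):
    """Count critical and high findings in a parsed Trivy JSON report."""
    counts = {"CRITICAL": 0, "HIGH": 0}
    for result in report.get("Results") or []:
        # 'Vulnerabilities' can be missing or null when a target is clean.
        for vuln in result.get("Vulnerabilities") or []:
            severity = vuln.get("Severity")
            if severity in counts:
                counts[severity] += 1
    return counts


def plan_ecr_tags(image_uri, today):
    """Derive a private repository name and the two tags for an image.

    Assumes the URI has the form <registry-host>/<repo-path>:<tag>.
    """
    repo_with_tag = image_uri.split("/", 1)[1]  # drop the registry host
    repo_name = repo_with_tag.rsplit(":", 1)[0]  # drop the public tag
    # The date format below is an assumption; the post does not specify it.
    return repo_name, ["latest", today.strftime("%Y-%m-%d")]
```

With the first sample image above, `plan_ecr_tags` would propose the repository name `biocontainers/kallisto`, tagged with `latest` and the current date.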
If you would like to further automate container image compliance and enhance security measures by incorporating a solution that integrates with AWS Security Hub, you can refer to Automating image compliance for Amazon ECS and Amazon EKS using Amazon Elastic Container Registry (ECR) and AWS Security Hub. The solution pushes vulnerability findings from the container scans to Security Hub, enabling centralized visibility and management of security risks. Additionally, it uses Security Hub remediation actions to restrict access to Amazon ECR container images when vulnerabilities are detected during an image scan. This added layer of security enforcement ensures that only compliant and approved container images are accessible, reducing the risk of deploying vulnerable applications.
AWS HealthOmics private workflow pipeline
Figure 2 shows the HealthOmics private workflow pipeline described in this post.
For this pipeline, we use the V2 type pipeline in AWS CodePipeline to automate the process of deploying private workflows to AWS HealthOmics.
The pipeline consists of the following stages:
- Source – The source of the pipeline is a GitHub repository. The pipeline is triggered when a developer pushes changes to the workflow files (such as workflow definition files or parameters.json files).
- PullAndDeploy – In this stage, AWS CodeBuild is used to automate the process. It compresses the workflow files and uploads them to an S3 bucket. Subsequently, AWS CodeBuild creates a new private workflow on AWS HealthOmics using the compressed files from the S3 bucket.
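The PullAndDeploy stage can be approximated with the Python sketch below. It is a simplified illustration, not the build script from the solution: the helper names are hypothetical, and the client is passed in so that the logic can be exercised without AWS credentials. The `create_workflow` call and its `name`, `engine`, `definitionUri`, and `main` parameters belong to the HealthOmics (`omics`) client in the AWS SDK for Python (Boto3).

```python
import io
import zipfile
from pathlib import Path


def zip_workflow_dir(workflow_dir):
    """Compress every file under workflow_dir into an in-memory zip archive."""
    root = Path(workflow_dir)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                # Store paths relative to the workflow folder.
                archive.write(path, arcname=path.relative_to(root).as_posix())
    return buffer.getvalue()


def create_private_workflow(omics_client, name, engine, definition_uri, main=None):
    """Create a private workflow from a zip archive already uploaded to S3.

    omics_client is expected to behave like boto3.client("omics").
    engine is one of "WDL", "NEXTFLOW", or "CWL".
    """
    kwargs = {"name": name, "engine": engine, "definitionUri": definition_uri}
    if main:
        kwargs["main"] = main  # entry-point definition file inside the archive
    response = omics_client.create_workflow(**kwargs)
    return response["id"]
```

In a build environment, the archive returned by `zip_workflow_dir` would first be uploaded, for example with the S3 client's `put_object`, and the resulting `s3://` URI passed as `definition_uri` together with a client created by `boto3.client("omics")`. The bucket and key are left to the reader.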
To further streamline the process of launching bioinformatics workflows on AWS HealthOmics, SGH incorporates an event-driven architecture to achieve end-to-end automation and real-time notifications. It uses Amazon EventBridge, a service that delivers real-time event streams from various AWS services, to capture and react to the successful completion or failure of AWS HealthOmics workflows. Additionally, Amazon SNS is used to notify developers about the status of the workflows in real time. This event-driven architecture keeps developers informed about the progress and status of their bioinformatics workflows on AWS HealthOmics. For more detailed information and implementation specifics, you can refer to Designing an event-driven architecture for Bioinformatics workflows using AWS HealthOmics and Amazon EventBridge.
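As a rough illustration of this event-driven notification, the Python sketch below pairs an EventBridge event pattern with a small formatting function of the kind an AWS Lambda function might use before publishing to Amazon SNS. The event source `aws.omics`, the `Run Status Change` detail type, and the `status` and `arn` fields under `detail` are assumptions about the event shape and should be checked against the HealthOmics documentation. The function name is hypothetical.

```python
# Assumed event pattern: match HealthOmics runs that completed or failed.
RUN_STATUS_PATTERN = {
    "source": ["aws.omics"],
    "detail-type": ["Run Status Change"],
    "detail": {"status": ["COMPLETED", "FAILED"]},
}


def format_run_notification(event):
    """Build an SNS subject and message from a run status event."""
    detail = event.get("detail", {})
    status = detail.get("status", "UNKNOWN")
    # Prefer the ARN in 'detail'; fall back to the first listed resource.
    run_arn = detail.get("arn") or next(
        iter(event.get("resources", [])), "unknown run"
    )
    subject = f"HealthOmics run {status}"
    message = (
        f"Run {run_arn} reported status {status} "
        f"at {event.get('time', 'unknown time')}."
    )
    return subject, message
```

The returned pair could then be passed to the SNS client's `publish` call as `Subject` and `Message`, with the topic ARN supplied by the deployment.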
Conclusion
The collaboration between SGH, OAI, and AWS has resulted in a robust solution that simplifies the deployment of bioinformatics workflows in the cloud. This solution demonstrates how organizations can streamline and standardize the process of creating private workflows, while simultaneously ensuring rigorous security measures through vulnerability scanning of containers. Additionally, it provides version control capabilities for both workflows and containers, enabling effective tracking and management of changes over time. By automating and simplifying these processes, the solution frees up valuable time and resources, allowing organizations to concentrate their efforts on innovating and enhancing their genomic pipelines.
To explore this solution, the source code is available in this repository on GitHub. The repository contains detailed instructions on how to set up and configure the solution within your AWS account.