AWS Public Sector Blog

Building a secure and low-code bioinformatics workbench on AWS HealthOmics

The completion of the Human Genome Project by an international group of researchers in 2003 marked a pivotal milestone in biology, ushering in the post-genomic era. Thanks to technological advances, obtaining an individual’s genetic profile can now be achieved in a week and at a fraction of the original project’s cost. This ability to rapidly and affordably sequence genomes has given rise to a new age of multi-omics, where researchers unravel the molecular complexities underlying human health and disease. Accompanying the increased accessibility of sequencing is an increase in the volume of data generated and the need for bioinformatics expertise.

However, the large volume of data presents several challenges for traditional on-premises environments. First, processing these large genomic datasets involves a series of standardized steps with varying hardware and software requirements. The on-premises server responsible for core compute therefore needs a mixture of hardware and software configurations, which increases installation cost and maintenance complexity. Second, sequencing experiments are intermittent because of the time needed for data analysis to drive hypothesis generation and follow-up sequencing experiments. This results in spiky usage patterns, where the on-premises server sits underutilized for most of the year punctuated by short bursts of demand for high computational resources. Furthermore, the sensitive nature of genomic data mandates a secure environment for processing and storage. Finally, given a shortage of bioinformaticians, enabling clinicians and experts in other fields to perform their own analyses can accelerate the time-to-insight from the collected data.

To address these challenges, Singapore General Hospital (SGH), SingHealth Office of Academic Informatics (OAI) and Amazon Web Services (AWS) collaborated to develop a cost-effective, scalable cloud infrastructure that enables researchers to perform their own analyses on a centrally secured and compliant cloud platform.

AWS HealthOmics offers a suite of services that help bioinformaticians, researchers, and scientists to store, query, analyze, and generate insights from genomic and other biological data. It comprises three primary components: HealthOmics storage, which helps store and share petabytes of genomic data efficiently and at a low cost per billion bases (gigabase); HealthOmics analytics, which simplifies the preparation of genomic data for multi-omics and multimodal analyses; and HealthOmics Workflows, which automatically provisions and scales the underlying infrastructure for bioinformatics computations. HealthOmics Workflows also provides a graphical user interface for the execution of published pipelines, simplifying the process of performing bioinformatics analyses and optimizing the end-to-end execution process. HealthOmics Workflows is the main focus of the solution described in this post.

With HealthOmics Workflows, you can process and analyze your genomics data using either Ready2Run workflows or private workflows. Ready2Run workflows are preconfigured workflows published by third-party publishers, while private workflows are user-defined using Workflow Description Language (WDL), Common Workflow Language (CWL), or Nextflow. To create a private workflow, in addition to providing the workflow definition, you need to containerize your workflow tools and store them in a private repository within Amazon Elastic Container Registry (Amazon ECR), a fully managed container registry. For detailed prerequisites and instructions on creating a private workflow, please refer to this user guide.

However, for bioinformatics workflows involving multiple tools, the process of pulling each image can be time-consuming and challenging to manage. In this post, we showcase how SGH used DevOps tools from AWS to streamline this process by pulling the necessary containers, scanning for vulnerabilities, and publishing to Amazon ECR. We also describe an approach to publishing private workflows to HealthOmics Workflows whenever changes are made to the workflow files, enabling bioinformaticians to streamline and automate the deployment of bioinformatics workflows on HealthOmics.

Solution overview

The main objective of this solution is to implement version control for workflow files and establish a robust security posture by scanning for vulnerabilities in containers pulled from public repositories. To further simplify the deployment of the workflow, the solution offers a templatized approach to pulling containerized tools for common use cases such as single-cell RNA sequencing and whole-exome sequencing (WES) analysis.

To address these objectives, the solution involves two pipelines:

  • Workflow dependencies pipeline – This pipeline streamlines the process of pulling required containers for the private workflow. It performs vulnerability scanning on the container images using Aqua Security’s Trivy before pushing them to a private container repository for use by the private workflow. This proactive security measure ensures that potential vulnerabilities are identified and addressed before deploying the workflow.
  • AWS HealthOmics private workflow pipeline – This pipeline automates the process of deploying the private workflow to AWS HealthOmics.

To simplify repository management, a monorepo strategy is adopted, which allows various pipelines to reside within a single repository. This approach triggers the appropriate pipeline only when changes are made to the specific workflow being worked on, eliminating the need to manage multiple repositories and streamlining the development process.

The solution embraces a shift-left approach to meet the extensive security requirements: testing is done early, prior to deployment. Identifying and addressing potential security issues early in the development lifecycle reduces the risk of vulnerabilities propagating to production environments.

Architecture overview

In this post, you will learn how to use AWS developer tools such as AWS CodePipeline and AWS CodeBuild to streamline the development and deployment process of bioinformatics workflows to AWS HealthOmics.

Workflow dependencies pipeline

Figure 1 shows the workflow dependencies pipeline described in this post.

Figure 1. Architecture for the workflow dependencies pipeline for pulling required containers and storing the container images in a private repository within Amazon ECR.

We use AWS CodePipeline, a managed continuous delivery service, to automate the process of pulling the necessary containers for the private workflow. Release cycles for the monorepo were implemented using the V2 type pipeline.

The pipeline consists of the following stages:

1. Source – The source of the pipeline is a GitHub repository. The pipeline is triggered when a developer pushes changes to the container-requirements.txt file within the workflow folder, which contains a list of public image URIs.

Sample content of the container-requirements.txt file:

quay.io/biocontainers/kallisto:0.50.1--hc877fd6_1
quay.io/biocontainers/ensembl-vep:106.1--pl5321h4a94de4_0

Please note that the above sample is solely intended to illustrate the format of the container-requirements.txt. The specific container images listed in the sample may not be accurate or up-to-date.
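To make the format concrete, the following is a minimal sketch (not code from the published pipeline) of how a build step might parse a container-requirements.txt body into image URIs and split each URI into a repository path and tag. The helper names are hypothetical.

```python
def parse_requirements(text):
    """Parse a container-requirements.txt body into a list of image URIs,
    skipping blank lines and comment lines."""
    images = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            images.append(line)
    return images


def split_image(uri):
    """Split an image URI into (repository path, tag); default tag is 'latest'."""
    last_segment = uri.rsplit("/", 1)[-1]
    if ":" in last_segment:
        path, tag = uri.rsplit(":", 1)
        return path, tag
    return uri, "latest"
```

Keeping this parsing in one place means the build, scan, and deploy stages all agree on which images are in scope for a given commit.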

2. Build – In this stage, AWS CodeBuild, a managed continuous integration service, is used to automate the process of pulling the images based on the container-requirements.txt file to the AWS CodeBuild environment. AWS CodeBuild then scans the images for vulnerabilities using Trivy. The scan results are uploaded to an Amazon Simple Storage Service (Amazon S3) bucket.
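A pull-and-scan step like this one can be sketched as generating shell commands per image. The Trivy flags below (`--severity`, `--format`, `--output`) are real Trivy options, but the report directory layout and the helper itself are assumptions, not the pipeline's actual buildspec.

```python
def scan_commands(image_uri, report_dir="reports"):
    """Build the shell commands a CodeBuild phase could run for one image:
    pull it, then scan it with Trivy, writing a JSON report per image."""
    # Derive a file-safe report name from the image URI.
    safe_name = image_uri.replace("/", "_").replace(":", "_")
    return [
        f"docker pull {image_uri}",
        f"trivy image --severity CRITICAL,HIGH --format json "
        f"--output {report_dir}/{safe_name}.json {image_uri}",
    ]
```

Writing one JSON report per image keeps the later summarization step simple: each report maps back to exactly one entry in container-requirements.txt.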

3. Notify – After the vulnerability results are uploaded to the S3 bucket, an AWS Lambda function is triggered. This function summarizes the vulnerabilities found for each image, listing the number of critical and high findings. It then sends this summary to developers for review through Amazon Simple Notification Service (Amazon SNS), a managed messaging service.
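The summarization logic in such a Lambda function can be sketched as follows. The nesting (`Results[*].Vulnerabilities[*].Severity`) matches Trivy's JSON report format, where the `Vulnerabilities` field may be absent or null for clean targets; the function name is hypothetical.

```python
from collections import Counter


def summarize_trivy_report(report):
    """Count findings by severity in a parsed Trivy JSON report."""
    counts = Counter()
    for result in report.get("Results", []):
        # "Vulnerabilities" can be missing or null when a target is clean.
        for vuln in result.get("Vulnerabilities") or []:
            counts[vuln.get("Severity", "UNKNOWN")] += 1
    return counts
```

The critical and high counts from this summary are what the Notify stage would publish to the Amazon SNS topic for developer review.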

4. Approval – We use Amazon SNS to send a notification to developers that the pipeline is pending their approval.

5. Deploy – With the approval granted, AWS CodeBuild pulls the container images based on the container-requirements.txt file, creates private Amazon ECR repositories, and pushes the images to the created private repositories. The images are tagged with the latest tag and a tag representing the current date.
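The retagging in the Deploy stage can be sketched as deriving the target ECR image URIs from each public source URI. This is a minimal illustration, not the pipeline's actual code; the choice to keep the source's namespace/name as the ECR repository name is an assumption.

```python
from datetime import date


def ecr_tags(source_uri, account_id, region, tag_date=None):
    """Derive the target ECR image URIs ('latest' plus a date tag)
    for a public container image URI."""
    tag_date = tag_date or date.today().isoformat()
    # Strip the tag, if present, from the last path segment.
    last_segment = source_uri.rsplit("/", 1)[-1]
    path = source_uri.rsplit(":", 1)[0] if ":" in last_segment else source_uri
    # Drop the source registry host; keep namespace/name as the ECR repo name.
    repo_name = path.split("/", 1)[1] if "/" in path else path
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    return [
        f"{registry}/{repo_name}:latest",
        f"{registry}/{repo_name}:{tag_date}",
    ]
```

Tagging each push with both `latest` and the date gives the workflow definitions a stable reference while preserving a point-in-time history of the images used.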

If you would like to further automate container image compliance and enhance security measures by incorporating a solution that integrates with AWS Security Hub, you can refer to Automating image compliance for Amazon ECS and Amazon EKS using Amazon Elastic Container Registry (ECR) and AWS Security Hub. The solution pushes vulnerability findings from the container scans to Security Hub, enabling centralized visibility and management of security risks. Additionally, it uses Security Hub remediation actions to restrict access to Amazon ECR container images when vulnerabilities are detected during an image scan. This added layer of security enforcement ensures that only compliant and approved container images are accessible, reducing the risk of deploying vulnerable applications.

AWS HealthOmics private workflow pipeline

Figure 2 shows the HealthOmics private workflow pipeline described in this post.

Figure 2. Architecture for AWS HealthOmics private workflow pipeline to automate the process of private workflow deployments.

For this pipeline, we use the V2 type pipeline in AWS CodePipeline to automate the process of deploying private workflows to AWS HealthOmics.

The pipeline consists of the following stages:

  1. Source – The source of the pipeline is a GitHub repository. The pipeline is triggered when a developer pushes changes to the workflow files (such as workflow definition files or parameters.json files).
  2. PullAndDeploy – In this stage, AWS CodeBuild is used to automate the process. It compresses the workflow files and uploads them to an S3 bucket. Subsequently, AWS CodeBuild creates a new private workflow on AWS HealthOmics using the compressed files from the S3 bucket.
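The compress-and-create step can be sketched with the AWS SDK for Python (boto3), whose `omics` client exposes a `create_workflow` API that accepts an S3 definition URI. The bucket, key, and engine choice below are placeholders, and this is an illustrative sketch rather than the pipeline's actual build script.

```python
import io
import zipfile


def zip_workflow(files):
    """Bundle workflow files (mapping of archive path -> text content)
    into an in-memory zip archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, content in files.items():
            zf.writestr(name, content)
    return buf.getvalue()


def deploy_workflow(name, bucket, key, files, region):
    import boto3  # assumed available in the CodeBuild image

    # Upload the zipped definition, then register it as a private workflow.
    body = zip_workflow(files)
    boto3.client("s3", region_name=region).put_object(
        Bucket=bucket, Key=key, Body=body
    )
    return boto3.client("omics", region_name=region).create_workflow(
        name=name,
        engine="NEXTFLOW",  # or "WDL" / "CWL", matching your definition
        definitionUri=f"s3://{bucket}/{key}",
    )
```

Because the pipeline always creates the workflow from the zip it just uploaded, every HealthOmics workflow version corresponds to a specific commit in the repository.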

To further streamline the process of launching bioinformatics workflows on AWS HealthOmics, SGH incorporates an event-driven architecture to achieve end-to-end automation and real-time notifications. It uses Amazon EventBridge, a service that delivers real-time event streams from various AWS services, to capture and react to the successful completion or failure of AWS HealthOmics workflows. Additionally, Amazon SNS is used to notify developers about the status of the workflows in real time. This event-driven architecture keeps developers informed about the progress and status of their bioinformatics workflows on AWS HealthOmics. For more detailed information and implementation specifics, you can refer to Designing an event-driven architecture for Bioinformatics workflows using AWS HealthOmics and Amazon EventBridge.
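Wiring an EventBridge rule to an SNS topic for run status changes can be sketched as follows. The `source` and `detail-type` strings in the pattern are assumptions based on HealthOmics emitting events as `aws.omics`; confirm the exact values against the HealthOmics EventBridge documentation before relying on them.

```python
import json

# Assumed event pattern for HealthOmics run status changes; verify the
# source and detail-type strings against the HealthOmics documentation.
RUN_STATUS_PATTERN = {
    "source": ["aws.omics"],
    "detail-type": ["Run Status Change"],
    "detail": {"status": ["COMPLETED", "FAILED"]},
}


def create_run_notification_rule(rule_name, topic_arn, region):
    import boto3  # assumed available where this runs

    events = boto3.client("events", region_name=region)
    # Match only terminal run states, then fan out to the SNS topic.
    events.put_rule(Name=rule_name, EventPattern=json.dumps(RUN_STATUS_PATTERN))
    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "sns-topic", "Arn": topic_arn}],
    )
```

Filtering on terminal states (completed or failed) keeps the SNS topic quiet during intermediate run transitions, so developers are notified only when action or review is needed.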

Conclusion

The collaboration between SGH, OAI, and AWS has resulted in a robust solution that simplifies the deployment of bioinformatics workflows in the cloud. This solution demonstrates how organizations can streamline and standardize the process of creating private workflows, while simultaneously ensuring rigorous security measures through vulnerability scanning of containers. Additionally, it provides version control capabilities for both workflows and containers, enabling effective tracking and management of changes over time. By automating and simplifying these processes, the solution frees up valuable time and resources, allowing organizations to concentrate their efforts on innovating and enhancing their genomic pipelines.

To explore this solution, the source code is available in this repository on GitHub. The repository contains detailed instructions on how to set up and configure the solution within your AWS account.

Jeremy Ng

Dr. Jeremy Ng is a senior bioinformatician in the Division of Pathology at Singapore General Hospital (SGH). He completed his doctoral training at the National University of Singapore, where he studied the determinants of transforming growth factor-beta signaling (TGFB) outcome. He has been with SGH since September 2020, working in the oncology space.

Eugene Ng

Eugene is a solutions architect at Amazon Web Services (AWS), with a primary focus on the healthcare industry. He enjoys exploring and implementing new technologies, aiming to support AWS customers in their innovation journey within the healthcare domain.

Kok Leong

Kok Leong is the CIO and director of academic informatics at SingHealth. He is responsible for enabling the advancements of SingHealth’s research and education capabilities. Kok played a pivotal role in the development and implementation of the On-premises Research Data Science and Systems Explorer (ODySSEy) platform, which streamlines the management, utilization, compliance, and governance of research data.

Qu Xiaohua

Qu Xiaohua is a certified Amazon Web Services (AWS) Solution Architect Professional, PMP, and CISSP. He has 20-plus years of experience in software development, system solutioning, IT project management, and governance. He is currently the research lead for SingHealth's CIO Office, supporting the SingHealth research community in its digital transformation and innovation journey.

Seow Eng

Seow Eng is a manager in SingHealth's CIO Office and a certified solutions architect with more than 20 years of experience in software engineering, bioinformatics, and healthcare research and development. In his current role, he manages the SingHealth Research Control Tower, where he is responsible for ensuring compliance with SingHealth's security policy and maintaining a secure and compliant environment for all users and services.