AWS HPC Blog

Leveraging Seqera Platform on AWS Batch for machine learning workflows – Part 1 of 2

This post was contributed by Dr Ben Sherman (Seqera), Dr Olivia Choudhury (AWS), Paolo Di Tommaso and Gord Sissons from Seqera, and Aniket Deshpande and Abhijit Roy from AWS.

Machine learning (ML) is used for multiple healthcare and life sciences (HCLS) applications, including medical imaging, protein folding, drug discovery, and gene editing. While Nextflow pipelines (like those in nf-core) are commonly used for genomics, they are also being adopted for machine learning workloads.

Nextflow is an excellent solution for many ML-based scenarios. Sometimes you need to continuously (and automatically) retrain your models based on rapidly-changing datasets from external sources such as sequencers. Sometimes you need training and inference resources sporadically but you face constraints getting GPUs or FPGAs – even for short periods. And often pipelines like nf-core/proteinfold have compute and data-intensive inference steps where many samples need to be processed in parallel.

In this post and the next, we’ll show you how these kinds of challenges can be addressed using Nextflow and the Seqera Platform integrated with AWS.

In part one of this two-part blog series, we explain how to build an example Nextflow pipeline that performs ML model training and inference for image analysis, illustrating how Nextflow supports custom ML-based workflows. We also discuss how healthcare and life sciences customers are using this today.

In part two, we’ll provide a step-by-step guide explaining how users new to the Seqera Platform can rapidly get started with AWS, maximizing the use of AWS Batch, Amazon Simple Storage Service (Amazon S3), and other AWS services.

Seqera on AWS

Seqera Platform (previously Nextflow Tower) is a comprehensive bioinformatics data analysis platform deeply integrated with AWS. Seqera is used by leading biotechnology and pharmaceutical companies globally, including 10 of the top 20 global BioPharmas and roughly 10,000 bioinformaticians across hundreds of organizations.

Seqera Platform has several key features:

  • It is explicitly designed for Nextflow pipelines.
  • It is cross-platform – Seqera works with AWS and other cloud and HPC providers, including on-premises systems. It supports customers’ preferred container runtimes, registries, and source code managers, and it uses multiple AWS services, including AWS Batch, Amazon S3, Amazon FSx for Lustre, Amazon Elastic File System (EFS), Amazon Elastic Kubernetes Service (EKS), AWS Secrets Manager, and others.
  • It has a large, active user and developer community that provides high-quality, curated nf-core pipelines and modules.

Seqera Platform can be deployed in two different ways.

Seqera Cloud is a fully managed service hosted exclusively on AWS infrastructure. Presently, there are 8,000+ corporate and research Seqera Cloud users. Researchers can use Seqera Cloud for free and progress to paid plans as their needs evolve.

Seqera Enterprise is a customer-managed version of the Seqera Platform that is deployable on-premises or on a customer’s preferred cloud. Some customers install Seqera on-premises, while others deploy Seqera on AWS using Docker Compose or Amazon Elastic Kubernetes Service (EKS).

Seqera employs a “bring your own credentials” model. As illustrated in the architecture diagram in Figure 1, users log into Seqera and add compute environments by supplying credentials for their preferred cloud or HPC workload manager.

Figure 1: High-level architecture of Seqera on the AWS Cloud.


Seqera Platform users have a private workspace and can be assigned to various shared workspaces, each with its own pipelines, datasets, and compute environments. Seqera sidesteps the complexity of running in different cloud or HPC environments by providing a consistent user experience regardless of the underlying infrastructure.

While most workloads run on-premises or on AWS infrastructure, this ability to deploy to different compute environments is useful for several reasons:

  • Customers can leverage on-premises HPC clusters and tap cloud capacity when their own resources are fully utilized.
  • Research frequently involves datasets hosted on third-party clouds, making it more cost-effective to bring the compute to the data rather than transferring large datasets to a local execution environment.
  • Academic and research efforts frequently involve collaboration among institutions using different infrastructure. Seqera allows these users to seamlessly share pipelines, datasets, computing infrastructure, and research results without exposing private cloud credentials.

Seqera Forge

While users can choose to run pipelines on pre-existing AWS Batch environments, Seqera Forge fully automates creating and configuring AWS Batch compute environments and queues for Nextflow pipelines, following best practices. Seqera can also dispose of cloud resources when they’re not in use, helping reduce costs.

By leveraging AWS APIs, Forge dramatically simplifies the deployment, configuration, and teardown of AWS infrastructure, making it possible for researchers with minimal knowledge of “CloudOps” to deploy cloud infrastructure themselves.

A sample training dataset

To illustrate how machine learning and inference workloads can be run on AWS using Seqera Platform, we used the Wisconsin Diagnostic Breast Cancer (WDBC) dataset. This is a well-known dataset, often used as an example for learning or comparing different ML techniques, specifically for image classification. It consists of 569 samples, each with a set of 30 features computed from an image of breast tissue. The diagnosis column in the data indicates whether the sample was benign (B) or malignant (M), as illustrated in Figure 2.

Figure 2: Images from Breast Cancer Wisconsin (Diagnostic) Dataset aligning with tabular data showing that samples are either malignant or benign.


In the sample pipeline, we will train and evaluate multiple models to classify these breast samples as benign or malignant. In a real-world scenario, we can also use k-fold cross-validation to evaluate each model on several randomized partitions of the dataset and use multiple performance metrics with minimum requirements to determine whether a model is “good enough” to be used in production.

For our purposes here, we will simply evaluate each model on a single 80/20 train/test split and select the model with the highest test accuracy.
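The selection rule described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the pipeline's own code (the pipeline's scripts use scikit-learn). The function names (`train_test_split`, `fit_majority`, `fit_threshold`, `select_best`), the two toy "models", and the synthetic one-feature data are all assumptions made up for this example; they stand in for the real models and the WDBC data.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and return (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predict, rows):
    """Fraction of (features, label) rows that predict() labels correctly."""
    return sum(predict(x) == y for x, y in rows) / len(rows)


def fit_majority(train):
    """Hypothetical baseline: always predict the most common training label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority


def fit_threshold(train):
    """Hypothetical toy model: cut the first feature midway between class means.

    Assumes both labels ("B" and "M") appear in the training rows.
    """
    m = [x[0] for x, y in train if y == "M"]
    b = [x[0] for x, y in train if y == "B"]
    cut = (sum(m) / len(m) + sum(b) / len(b)) / 2
    return lambda x: "M" if x[0] > cut else "B"


def select_best(trainers, rows, test_fraction=0.2, seed=0):
    """Train every model on one split and pick the highest test accuracy."""
    train, test = train_test_split(rows, test_fraction, seed)
    scores = {name: accuracy(fit(train), test) for name, fit in trainers.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Synthetic stand-in data (NOT the WDBC dataset): one feature, labels "B"/"M".
rows = [([10.0 + i % 5], "B") for i in range(60)]
rows += [([20.0 + i % 5], "M") for i in range(40)]

best, scores = select_best(
    {"majority": fit_majority, "threshold": fit_threshold}, rows
)
print(f"best model: {best}, test accuracy = {scores[best]:.3f}")
```

Because every model is scored on the same held-out 20%, the accuracies are directly comparable. A single split is still a noisy estimate, which is why cross-validation is the better choice for production decisions.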

A sample pipeline

For illustration purposes, we use a simple proof-of-concept pipeline called Hyperopt developed by Seqera Labs. The pipeline takes any tabular dataset as input (or the name of a dataset on OpenML). It then trains and evaluates a set of ML models on the dataset, reporting the model that achieved the highest test accuracy. You can learn more about this pipeline in the article Nextflow and Tower for Machine Learning. The pipeline code is available on GitHub. Figure 3 shows a Mermaid diagram, automatically generated by Nextflow, of the overall pipeline.

Figure 3: The pipeline steps are implemented as Python scripts that use several common packages for ML, including numpy, pandas, scikit-learn, and matplotlib. These dependencies are defined in a Conda environment file called conda.yml.


By default, the pipeline uses the WDBC dataset described above and evaluates five different classification models.

When you run the pipeline, you should see something like this:

$ nextflow run hyperopt -profile wave
[...]
The best model for ‘wdbc’ was ‘mlp’, with accuracy = 0.991

This shows that Nextflow ran a pipeline that trained different ML models on the WDBC dataset and evaluated their performance during model inference. In this run, the multi-layer perceptron (mlp) was the most accurate at classifying breast tumor samples as benign or malignant. For further details of the pipeline and its deployment, refer to part two of this blog series.

While the hyperopt pipeline implements a simple classification model, it provides all the building blocks you need to create your own ML pipelines with Nextflow.

Seqera is also an excellent solution for deploying GPU-based workloads in the AWS cloud. For a hands-on tutorial, see the article Running AI workloads in the cloud with Nextflow Tower — a step-by-step guide.

It’s the customers that matter most

Seqera is used by hundreds of pharmaceutical, healthcare, and biotech companies to run data analysis pipelines in the AWS Cloud. According to the latest 2023 State of the Workflow Survey, AWS is the most popular cloud environment among Nextflow users, with 49% of all Nextflow users surveyed already using or planning to use AWS within the next two years and 35.1% of survey respondents using AWS Batch [1,2]. The survey results showed strong cloud adoption, with the percentage of Nextflow users running in the cloud up 20% over 2022 [2].

Among the customers running Seqera and Nextflow on AWS are:

  • Arcus Biosciences—Arcus Biosciences is at the forefront of designing combination therapies, with best-in-class potential, in the pursuit of cures for cancer. By using Seqera Platform on AWS, Arcus was able to improve productivity, ensure pipeline traceability, and use cloud resources more efficiently. They were also able to prepare for future growth by scaling capacity for research and clinical trials while providing an intuitive, collaborative user experience to researchers and clinicians. Read the case study here.
  • Gritstone Bio—Gritstone Bio is developing targeted immunotherapies for cancer and infectious disease. Gritstone’s approach seeks to generate a therapeutic immune response by leveraging insights into the immune system’s ability to recognize and destroy diseased cells by targeting select antigens. Their workloads involve massive compute requirements for the analysis of individual biopsies and make extensive use of machine learning for tumor classification models. Gritstone uses Seqera and multiple AWS cloud services to manage its bioinformatics pipelines. Read the case study here.
  • Tessera Therapeutics—Tessera Therapeutics are pioneers in a new category of genetic medicine and rely heavily on genomic analysis pipelines to identify promising new treatments. By using Seqera Platform to manage analysis pipelines on AWS, Tessera increased its analysis throughput and research productivity while simultaneously containing cloud spending by using resources more efficiently. You can read the case study here.

Conclusion

For organizations collaborating on large-scale data analysis and ML workloads, Seqera on AWS is an excellent solution. You can easily deploy powerful AWS compute and storage resources at scale, reduce costs through optimized resource usage, and manage spending across projects and teams.

In part two of this blog series, we will provide a step-by-step guide, explaining how you can easily deploy a Seqera environment on AWS to run ML pipelines like the one above, and other Nextflow pipelines.

References

[1] The State of the Workflow 2023: Community Survey Results.

[2] The State of the Workflow 2022: Community Survey Results.

The content and opinions in this blog are those of the third-party author and AWS is not responsible for the content or accuracy of this blog.

Ben Sherman


Ben Sherman is a software engineer at Seqera Labs. Ben earned his PhD at Clemson University, where he received advanced training in machine learning and high-performance computing (HPC). He is the developer of multiple scientific applications and workflows in computational sciences (particularly bioinformatics), enabling data-intensive scientific research on HPC platforms.

Abhijit Roy


Abhijit Roy is an Enterprise Solution Architect with more than 20 years of experience in software development and digital transformation. He supports AWS Healthcare & Life Sciences customers in their cloud journey focusing on complex technical problems. Outside of AWS, Abhijit enjoys scuba diving around the world and is a board member of a non-profit supporting human trafficking survivors.

Aniket Deshpande


Aniket Deshpande is a senior GTM specialist for HPC in Healthcare and Life Sciences at AWS. Aniket has more than a decade of experience in the biopharma and clinical informatics space, where he has developed and commercialized clinical-grade software solutions and services for genomics, molecular diagnostics, and translational research. Prior to AWS, Aniket worked in various technical roles at DNAnexus, Qiagen, Knome, Pacific Biosciences, and Novartis.

Gordon Sissons


Gordon Sissons is a consulting engineer and principal at StoryTek Consulting Inc. He is a fan of Seqera and has 30 years of experience in pre-sales engineering and consulting services. Prior to working with Seqera, Gord was the founder of NeatWorx Web Solutions Inc. and VP of technology and client services for Sun Microsystems of Canada. Gord is a graduate of Carleton University in Ottawa, Canada.

Olivia Choudhury


Olivia Choudhury, PhD, is a Senior Partner Solutions Architect at AWS. She helps partners in the Healthcare and Life Sciences domain design, develop, and scale state-of-the-art solutions leveraging AWS. She has a background in genomics, healthcare analytics, federated learning, and privacy-preserving machine learning. Outside of work, she plays board games, paints landscapes, and collects manga.

Paolo Di Tommaso


Paolo Di Tommaso is the CTO and co-founder of Seqera Labs. He is a computer scientist with a strong interest in high-throughput scientific computing, data-intensive applications, parallel programming, cloud computing, and containerization technologies. He has broad experience as a software engineer and software architect in life science and healthcare applications. He is an open-source advocate and the creator and maintainer of the Nextflow workflow system.