
Guidance for Development, Automation, Implementation, and Monitoring of Bioinformatics Workflows on AWS

Overview

IMPORTANT: This Guidance requires the use of AWS CodeCommit, which is no longer available to new customers. Existing customers of AWS CodeCommit can continue using and deploying this Guidance as normal.

This Guidance shows how you can build and run production-grade bioinformatics workflows at scale. Using AWS services for automation, workflow analysis, storage, and operational and cost observability, you can follow DevOps best practices to manage the lifecycle of your bioinformatics workflows. You can use this architecture as the foundation for your own infrastructure and update certain aspects as needed to integrate it with your environment and meet your needs.

How it works

This architecture diagram highlights key considerations and best practices for implementing bioinformatics workflows at scale.

Well-Architected Pillars

The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.

This Guidance uses AWS CodeCommit, AWS CodeBuild, and AWS CodePipeline to version-control your bioinformatics workflow’s source code and to automate its build and deployment. Additionally, Amazon DynamoDB lets you track AWS HealthOmics output files and run metadata. Because this Guidance applies DevOps best practices to your workflow code and gives you visibility into workflow run metadata, you can make incremental changes and verify that each one still produces accurate results. By tracking workflow run metadata, you can quickly find the status and output files of a given workflow run for downstream reporting or scientific analysis.
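As a rough illustration of the run-metadata tracking described above, the following Python sketch builds a DynamoDB item for a single workflow run. The attribute names (`run_id`, `workflow_id`, `status`, `output_uri`, `finished_at`), the table name, and the choice of `run_id` as partition key are assumptions made for this example. They are not taken from the Guidance's actual table schema.

```python
from datetime import datetime, timezone


def build_run_item(run_id, workflow_id, status, output_uri, finished_at=None):
    """Return a DynamoDB item (low-level attribute-value format) for one run.

    All attribute names here are hypothetical; adapt them to your own table.
    """
    finished_at = finished_at or datetime.now(timezone.utc)
    return {
        "run_id": {"S": run_id},  # assumed partition key
        "workflow_id": {"S": workflow_id},
        "status": {"S": status},
        "output_uri": {"S": output_uri},
        "finished_at": {"S": finished_at.isoformat()},
    }


item = build_run_item(
    run_id="1234567",
    workflow_id="7654321",
    status="COMPLETED",
    output_uri="s3://example-bucket/outputs/1234567/",
    finished_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
)

# With boto3 installed and credentials configured, the item could be stored with:
#   boto3.client("dynamodb").put_item(TableName="workflow-runs", Item=item)
# ("workflow-runs" is a placeholder table name.)
```

Keeping the item builder separate from the write call makes the record shape easy to unit test without AWS access.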

Read the Operational Excellence whitepaper

This Guidance provides encryption at rest using AWS Key Management Service (AWS KMS) and encryption in transit for all network traffic using DataSync. Additionally, AWS Identity and Access Management (IAM) provides fine-grained access control over potentially sensitive data, so that only authorized users can perform specific actions to process and analyze it.
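To make the idea of fine-grained access control more concrete, here is a minimal sketch of a read-only IAM policy document built in Python. It assumes an analyst who may look up HealthOmics runs and read their output files but not start or delete anything. The bucket name, prefix, and statement IDs are placeholders, and the policy is an illustration rather than the policy this Guidance deploys.

```python
import json


def read_only_run_policy(output_bucket, output_prefix):
    """Build an IAM policy allowing read-only access to runs and their outputs.

    The bucket and prefix are hypothetical inputs; scope them to your own data.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadRunMetadata",
                "Effect": "Allow",
                "Action": ["omics:GetRun", "omics:ListRuns"],
                "Resource": "*",
            },
            {
                "Sid": "ReadRunOutputs",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                # Only objects under the given output prefix are readable.
                "Resource": f"arn:aws:s3:::{output_bucket}/{output_prefix}*",
            },
        ],
    }


policy = read_only_run_policy("example-bucket", "outputs/")
policy_json = json.dumps(policy, indent=2)
```

In practice you would likely narrow `"Resource": "*"` in the first statement to specific run or workflow ARNs in your account.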

Read the Security whitepaper

This Guidance lets you orchestrate computationally intensive bioinformatics workflows at scale by using HealthOmics. The service enforces service quotas, such as the number of virtual CPUs, to prevent accidental overprovisioning. Additionally, Amazon S3 and DynamoDB provide high availability with built-in backup. This Guidance also uses Amazon EventBridge to capture events, such as run failures, and Amazon SNS to send real-time notifications in response so that you can take appropriate action. You can quickly investigate events using Amazon CloudWatch, which provides detailed logs that give you visibility into your HealthOmics workflows and underlying tools.
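The failure-capture step described above could be expressed as an EventBridge event pattern along the following lines. This sketch assumes that HealthOmics run events arrive with source `aws.omics`, detail type `Run Status Change`, and a `status` field in the event detail; check these values against the events in your own account before relying on them. The `matches` helper is a simplified local stand-in for EventBridge's matching logic, included only so the pattern can be exercised without AWS.

```python
import json

# Assumed shape of a pattern that selects failed HealthOmics runs.
FAILED_RUN_PATTERN = {
    "source": ["aws.omics"],
    "detail-type": ["Run Status Change"],
    "detail": {"status": ["FAILED"]},
}


def matches(pattern, event):
    """Simplified matcher: each pattern key must exist in the event, and the
    event value must be one of the listed values (nested dicts recurse).
    This covers only exact-value matching, not the full EventBridge syntax."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


failed_event = {
    "source": "aws.omics",
    "detail-type": "Run Status Change",
    "detail": {"status": "FAILED", "runId": "1234567"},
}
completed_event = {
    "source": "aws.omics",
    "detail-type": "Run Status Change",
    "detail": {"status": "COMPLETED", "runId": "1234567"},
}

# JSON form, as it would be supplied when creating an EventBridge rule.
pattern_json = json.dumps(FAILED_RUN_PATTERN)
```

A rule using this pattern would typically target an Amazon SNS topic, so that subscribers are notified as soon as a run fails.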

Read the Reliability whitepaper

This Guidance lets you run concurrent workflows with different CPU and memory configurations for specific tasks. You can request resources by specifying the CPUs, memory, and storage you need, and HealthOmics provisions the appropriate infrastructure. This helps you scale based on your business needs with the right resources.
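As a hedged sketch of what requesting resources can look like, the function below assembles the keyword arguments for the boto3 HealthOmics `start_run` call. The workflow ID, role ARN, bucket, and parameter names are placeholders. Note the assumption that per-task CPU and memory are declared in the workflow definition itself (for example, in a WDL `runtime` block), while run-level storage is passed as `storageCapacity` in GiB.

```python
def build_start_run_request(workflow_id, role_arn, output_uri, parameters,
                            storage_gib, name=None):
    """Assemble keyword arguments for the boto3 omics `start_run` call.

    Only builds the request; it does not contact AWS.
    """
    if storage_gib <= 0:
        raise ValueError("storage_gib must be a positive number of GiB")
    request = {
        "workflowId": workflow_id,
        "roleArn": role_arn,
        "outputUri": output_uri,
        "parameters": parameters,        # inputs defined by the workflow
        "storageCapacity": storage_gib,  # run storage, in GiB
    }
    if name:
        request["name"] = name
    return request


request = build_start_run_request(
    workflow_id="7654321",  # placeholder workflow ID
    role_arn="arn:aws:iam::111122223333:role/ExampleOmicsRunRole",
    output_uri="s3://example-bucket/outputs/",
    parameters={"sample_name": "NA12878"},  # hypothetical workflow input
    storage_gib=1200,
    name="example-run",
)

# With boto3 installed and credentials configured:
#   boto3.client("omics").start_run(**request)
```

Building the request separately from the API call lets you validate and log it before any resources are provisioned.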

Read the Performance Efficiency whitepaper

This Guidance uses a HealthOmics sequence store, which lets you store and share petabyte-scale genomics data files efficiently and at a low cost per gigabase, providing additional cost savings over Amazon S3. Additionally, you can use AWS Cost and Usage Reports (AWS CUR) to access the most detailed information about your AWS costs and usage, identify areas for optimization, and understand your business’s trends based on attributes such as projects, departments, or users.

Read the Cost Optimization whitepaper

This Guidance uses managed and serverless services, so you avoid provisioning and managing your own infrastructure, which helps minimize the environmental impact of your projects. HealthOmics provisions resources only when you request a workflow run and tears them down when the run completes. Similarly, AWS Lambda lets you run smaller tasks as functions without provisioning your own servers.

Read the Sustainability whitepaper

Disclaimer

The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.