What does this AWS Solutions Implementation do?

The Genomics Tertiary Analysis and Machine Learning Using Amazon SageMaker solution creates a platform in the AWS Cloud for building machine learning models on genomic datasets using AWS managed services. We define tertiary analysis as the interpretation of genomic variants and the assignment of meaning to them. The solution provides a broad platform for genomic machine learning on AWS and uses variant classification as an example of a scientifically meaningful problem that the platform can solve. The example addresses the specific challenge of competing clinical definitions when examining genomic variants. It is based on a Kaggle challenge: we create a model that predicts whether a variant annotated in ClinVar has a conflicting classification. Such a model can save researchers the time they would otherwise spend looking for these conflicts.
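
To make the prediction target concrete, the sketch below derives a conflict label from the clinical significance terms submitted for a single variant. The three-bucket grouping and the names SIGNIFICANCE_BUCKETS and has_conflict are assumptions made for illustration; the labeling used by the solution's example dataset may differ.

```python
# Hypothetical grouping of ClinVar-style significance terms into three buckets.
# This grouping is an assumption for illustration, not taken from the solution.
SIGNIFICANCE_BUCKETS = {
    "benign": "benign",
    "likely benign": "benign",
    "uncertain significance": "uncertain",
    "likely pathogenic": "pathogenic",
    "pathogenic": "pathogenic",
}


def has_conflict(submissions):
    """Return 1 if the submitted terms span more than one bucket, else 0."""
    buckets = set()
    for term in submissions:
        key = term.strip().lower()
        # Terms outside the three buckets are ignored in this sketch.
        if key in SIGNIFICANCE_BUCKETS:
            buckets.add(SIGNIFICANCE_BUCKETS[key])
    return 1 if len(buckets) > 1 else 0
```

Under these assumptions, two submissions of "Benign" and "Likely benign" fall into the same bucket and are not a conflict, while "Likely benign" together with "Pathogenic" is.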

This solution demonstrates how to 1) automate the preparation of a genomics machine learning training dataset, 2) develop genomics machine learning model training and deployment pipelines, and 3) generate predictions and evaluate model performance using test data. Users can repeat or edit these steps for their specific use cases.
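
The third step, evaluating predictions against test data, can be sketched with a few standard binary classification metrics. The choice of metrics and the function names confusion_counts and evaluate are assumptions for illustration; the solution's notebook may report different measures.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (0 or 1)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def evaluate(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for binary predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Here a label of 1 would stand for a variant with a conflicting classification. Because conflicts are typically the minority class, precision and recall are usually more informative than accuracy alone.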

AWS Solutions Implementation overview

The diagram below presents the architecture you can automatically deploy using the solution's implementation guide and accompanying AWS CloudFormation template.

Genomics Tertiary Analysis and Machine Learning Using Amazon SageMaker solution architecture

The AWS CloudFormation template creates four CloudFormation stacks in your AWS account, including a setup stack that installs the solution. The other three stacks are a landing zone (zone) stack containing the common solution resources and artifacts; a deployment pipeline (pipe) stack defining the solution's continuous integration and continuous delivery (CI/CD) pipeline; and a code base (code) stack providing the ETL scripts, jobs, crawlers, a data catalog, and notebook resources.
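
After deployment, one way to confirm that the stacks were created is to query their status. The sketch below assumes a boto3 CloudFormation client is passed in and uses its describe_stacks call; the function names and any stack names you pass are placeholders, since the actual stack names depend on how you deploy the solution.

```python
def stack_statuses(cfn_client, stack_names):
    """Map each stack name to its current CloudFormation status.

    cfn_client is expected to behave like boto3.client("cloudformation").
    describe_stacks raises an error if a named stack does not exist.
    """
    statuses = {}
    for name in stack_names:
        response = cfn_client.describe_stacks(StackName=name)
        statuses[name] = response["Stacks"][0]["StackStatus"]
    return statuses


def all_created(statuses):
    """Return True if every stack finished creating or updating successfully."""
    done = ("CREATE_COMPLETE", "UPDATE_COMPLETE")
    return all(status in done for status in statuses.values())


# Example usage (stack names are hypothetical placeholders):
#   import boto3
#   names = ["my-setup", "my-zone", "my-pipe", "my-code"]
#   print(all_created(stack_statuses(boto3.client("cloudformation"), names)))
```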

The solution’s setup stack creates an AWS CodeBuild project containing the setup.sh script. This script creates the remaining CloudFormation stacks and provides the source code for both the AWS CodeCommit pipe repository and the code repository.

The landing zone (zone) stack creates the CodeCommit pipe repository. After the landing zone (zone) stack completes its setup, the setup.sh script pushes source code to the CodeCommit pipe repository.

The AWS CodePipeline code pipeline deploys the code base (code) CloudFormation stack. The resources deployed in your account include Amazon Simple Storage Service (Amazon S3) buckets for storing object access logs, build artifacts, and data; CodeCommit repositories for source code; an AWS CodeBuild project for building code artifacts (for example, third-party libraries used for data processing); a CodePipeline pipeline for automating builds and deployment of resources; example AWS Glue jobs; and an Amazon SageMaker Jupyter notebook instance. The example code includes the resources needed to quickly develop machine learning models using genomics data and generate predictions.
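
Once the example AWS Glue jobs are deployed, they can be started and monitored programmatically. The following is a minimal sketch, assuming a boto3 Glue client is passed in and using its start_job_run and get_job_run calls; the job name, the polling interval, and the function name run_glue_job are placeholders, not values from the solution.

```python
import time

# Glue run states after which the job run will not change again.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"}


def run_glue_job(glue_client, job_name, poll_seconds=30, sleep=time.sleep):
    """Start a Glue job run and wait until it reaches a terminal state.

    glue_client is expected to behave like boto3.client("glue").
    Returns a (run_id, final_state) tuple.
    """
    run_id = glue_client.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        run = glue_client.get_job_run(JobName=job_name, RunId=run_id)
        state = run["JobRun"]["JobRunState"]
        if state in TERMINAL_STATES:
            return run_id, state
        sleep(poll_seconds)  # wait before polling again


# Example usage (the job name is a hypothetical placeholder):
#   import boto3
#   print(run_glue_job(boto3.client("glue"), "my-example-etl-job"))
```

The sleep function is a parameter so the polling loop can be exercised without waiting, for example in a unit test with a stub client.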

Genomics Tertiary Analysis and Machine Learning Using Amazon SageMaker

Version 1.0
Last updated: 08/2020
Author: AWS

Estimated deployment time: 30 min

Features

Provide a scalable environment in AWS to run genomics analysis and research projects

Create a scalable environment in AWS to build machine learning models on genomic datasets using AWS managed services. This solution provides a broad platform for genomic machine learning in AWS using variant classification as an example of a scientifically meaningful problem that can be solved using this platform.

Leverage infrastructure as code best practices

Rapidly evolve the solution using infrastructure as code (IaC) principles and best practices.

Leverage continuous integration and continuous delivery (CI/CD)

Use AWS CodeCommit source code repositories, AWS CodeBuild projects, and AWS CodePipeline to build and deploy genomics machine learning model generation pipelines, deploy Jupyter notebooks, and create extract, transform, and load (ETL) jobs to generate new training datasets.
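
After a change is pushed to a repository, the progress of the pipeline can be inspected stage by stage. The sketch below assumes a boto3 CodePipeline client is passed in and uses its get_pipeline_state call; the pipeline name, the function name stage_summary, and the "NotRun" label for stages without an execution are placeholders chosen for illustration.

```python
def stage_summary(codepipeline_client, pipeline_name):
    """Return a list of (stage_name, status) tuples for a pipeline.

    codepipeline_client is expected to behave like boto3.client("codepipeline").
    Stages that have never executed are reported as "NotRun" (a label made up
    for this sketch, not a CodePipeline status).
    """
    state = codepipeline_client.get_pipeline_state(name=pipeline_name)
    summary = []
    for stage in state.get("stageStates", []):
        latest = stage.get("latestExecution", {})
        summary.append((stage["stageName"], latest.get("status", "NotRun")))
    return summary


# Example usage (the pipeline name is a hypothetical placeholder):
#   import boto3
#   for name, status in stage_summary(boto3.client("codepipeline"), "my-pipeline"):
#       print(name, status)
```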

Modify your genomics analysis and research projects

Modify the solution to fit your particular needs by adding your unique training datasets. Each change is tracked by the CI/CD pipeline, facilitating change control management, rollbacks, and auditing.