
VariantSpark Notebook

By: AEHRC
Latest Version: 1.1.2

Product Overview

VariantSpark is a scalable toolkit for genome-wide association studies (GWAS), optimized for GWAS-like datasets. Machine learning methods, and random forests (RFs) in particular, are a promising alternative to standard single-SNP analyses in GWAS and can scale to rare variants from whole-genome sequence data. RFs provide variable importance measures that rank genomic locations according to their predictive power for the disease or phenotype. Although a number of random forest implementations exist, some of them parallel or distributed, such as Random Jungle, ranger, and SparkML, none is optimized for modern whole-genome datasets containing thousands of samples and millions of variables. Implemented directly on Apache Spark core, VariantSpark builds random forest models from VCF and CSV files and estimates variable importance using the mean decrease Gini method. The package also includes a Jupyter notebook with examples for performing quality control and data manipulation tasks using tools included in the package, as well as for visualizing the results.
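
To illustrate what a mean decrease Gini importance measures, here is a minimal plain-Python sketch. It is not VariantSpark's implementation or API: the function names and the toy genotype data are hypothetical, and a real random forest would sum these decreases over every node that splits on a variable and average them over all trees.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(genotypes, labels, threshold):
    """Impurity decrease from splitting samples on genotype <= threshold."""
    left = [y for g, y in zip(genotypes, labels) if g <= threshold]
    right = [y for g, y in zip(genotypes, labels) if g > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted


def mean_decrease_gini(per_tree_decreases):
    """Average a variable's total impurity decrease over the trees in a forest."""
    return sum(per_tree_decreases) / len(per_tree_decreases)


# Hypothetical toy data: phenotype (0 = control, 1 = case) for six samples,
# and alternate-allele counts (0, 1, 2) at two made-up variants.
labels = [0, 0, 0, 1, 1, 1]
informative = [0, 0, 0, 2, 2, 1]    # separates cases from controls
uninformative = [0, 1, 0, 0, 1, 0]  # unrelated to the phenotype

informative_gain = gini_decrease(informative, labels, threshold=0)
uninformative_gain = gini_decrease(uninformative, labels, threshold=0)
```

In this toy case the informative variant yields a much larger impurity decrease than the uninformative one, which is the property that lets the measure rank genomic locations.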

VariantSpark can process 200 samples with 20M variables in 1 hour consuming $3 of AWS resources. VariantSpark compute time increases linearly with both variables and samples.
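
As a rough planning aid, the baseline above can be extrapolated in a few lines of Python. This is only a sketch under stated assumptions: it takes "linearly with both variables and samples" to mean time proportional to samples × variables, assumes the same cluster configuration as the baseline run, and assumes cost grows in proportion to compute time. The function name is hypothetical and the outputs are estimates, not measured results.

```python
# Baseline quoted in the listing: 200 samples, 20M variables, 1 hour, $3.
BASE_SAMPLES = 200
BASE_VARIABLES = 20_000_000
BASE_HOURS = 1.0
BASE_COST_USD = 3.0


def estimate_run(samples, variables):
    """Return (hours, cost_usd), assuming time scales with samples * variables.

    Assumption: cost scales in proportion to time on an unchanged cluster.
    """
    scale = (samples / BASE_SAMPLES) * (variables / BASE_VARIABLES)
    return BASE_HOURS * scale, BASE_COST_USD * scale


# Doubling the sample count at the same variable count doubles both estimates.
hours, cost = estimate_run(samples=400, variables=20_000_000)
```

Actual run time and cost depend on instance types, cluster size, and model parameters, so treat any extrapolation as an order-of-magnitude guide.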





Operating System

Linux/Unix, Ubuntu 20.04

Delivery Methods

  • CloudFormation Template
