Overview

VariantSpark is a scalable toolkit for genome-wide association studies (GWAS), optimized for GWAS-like datasets. Machine learning methods, and random forests (RFs) in particular, are a promising alternative to standard single-SNP analyses in GWAS and can scale to rare variants from whole-genome sequence data. RFs provide variable importance measures that rank genomic locations according to their predictive power for the disease or phenotype. Although a number of random forest implementations exist, some of them parallel or distributed (such as Random Jungle, ranger, or SparkML), none is optimized for modern whole-genome datasets containing thousands of samples and millions of variables. Implemented directly on Apache Spark core, VariantSpark builds random forest models from VCF and CSV files and estimates variable importance using the mean decrease in Gini impurity method. The package also includes a Jupyter notebook with examples of quality control and data manipulation tasks using Hail (hail.is, included in the package), as well as examples for visualizing the results.
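To make the idea of Gini-based variable importance concrete, here is a minimal sketch in plain Python. It is not VariantSpark's implementation: it only computes the impurity decrease of a single split on one variant, whereas a random forest averages such decreases over many splits and trees. The genotype coding (0/1/2), the toy data, and the function names are assumptions made for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(genotypes, labels, threshold):
    """Impurity decrease from splitting samples on genotype <= threshold."""
    left = [y for g, y in zip(genotypes, labels) if g <= threshold]
    right = [y for g, y in zip(genotypes, labels) if g > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted


# Toy data, made up for illustration: 8 samples, phenotype 0 = control, 1 = case.
phenotype = [0, 0, 0, 0, 1, 1, 1, 1]
variant_a = [0, 0, 0, 1, 2, 2, 1, 2]  # genotype tracks the phenotype closely
variant_b = [0, 1, 2, 0, 1, 2, 0, 1]  # genotype is largely unrelated

decrease_a = gini_decrease(variant_a, phenotype, threshold=0)
decrease_b = gini_decrease(variant_b, phenotype, threshold=0)
print(f"variant_a: {decrease_a:.3f}, variant_b: {decrease_b:.3f}")
```

A variant whose split separates cases from controls more cleanly produces a larger decrease, which is why ranking variants by this measure highlights candidate associations.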

VariantSpark can process 200 samples with 20M variables in 1 hour, consuming $3 of AWS resources. Its compute time increases linearly with both the number of variables and the number of samples.
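As a rough back-of-envelope aid, the figures above can be extrapolated to other dataset sizes. The sketch below assumes that "linear in both" means time grows in proportion to samples times variables, and that cost grows in proportion to time on the same cluster. Both are assumptions, not vendor guarantees, and the function name is hypothetical.

```python
# Reference point quoted in the listing.
BASE_SAMPLES = 200
BASE_VARIABLES = 20_000_000
BASE_HOURS = 1.0
BASE_COST_USD = 3.0


def estimate(samples, variables):
    """Return (hours, cost_usd) under the assumed proportional scaling."""
    scale = (samples / BASE_SAMPLES) * (variables / BASE_VARIABLES)
    return BASE_HOURS * scale, BASE_COST_USD * scale


hours, cost = estimate(samples=400, variables=40_000_000)
print(f"Estimated: {hours:.1f} h, ${cost:.2f}")
```

Real run time and cost also depend on cluster size, instance types, and data layout, so treat the result as an order-of-magnitude guide only.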

Highlights

Thanks to its novel approach to building random forest models, VariantSpark can work directly with VCF data, without the costly pre-processing required by other tools.
VariantSpark is implemented directly on top of Apache Spark - a modern distributed framework for big data processing, which gives VariantSpark the ability to scale horizontally to process even whole genome sequence data.
More information is available in our peer-reviewed publication (O'Brien et al., "VariantSpark: population scale clustering of genotype information", BMC Genomics, 2015) and our most recent pre-print (Bayat et al., "VariantSpark, A Random Forest Machine Learning Implementation for Ultra High Dimensional Data", bioRxiv, 2019).

Details

Sold by

AEHRC

Pricing

VariantSpark Notebook

Pricing is based on actual usage, with charges varying according to how much you consume. Subscriptions have no end date and may be canceled any time.

Additional AWS infrastructure costs may apply. Use the AWS Pricing Calculator to estimate your infrastructure costs.

Usage costs (9)

Instance type	Product cost/hour	EC2 cost/hour	Total/hour
t2.nano	$0.00	$0.006	$0.006
t2.micro (AWS Free Tier)	$0.00	$0.012	$0.012
t2.small	$0.00	$0.023	$0.023
t3.nano	$0.00	$0.005	$0.005
t3.micro (AWS Free Tier)	$0.00	$0.01	$0.01
t3.small	$0.00	$0.021	$0.021
t3a.nano	$0.00	$0.005	$0.005
t3a.micro	$0.00	$0.009	$0.009
t3a.small	$0.00	$0.019	$0.019

Vendor refund policy

We do not currently support refunds, but you can cancel at any time.

Legal

Vendor terms and conditions

Upon subscribing to this product, you must acknowledge and agree to the terms and conditions outlined in the vendor's End User License Agreement (EULA).

Content disclaimer

Vendors are responsible for their product descriptions and other product content. AWS does not warrant that vendors' product descriptions or other product content are accurate, complete, reliable, current, or error-free.

Usage information

Delivery details

VariantSpark Notebook

The VariantSpark Monitor EC2 instance uses a custom AMI, and the EMR cluster is instantiated using data from the Monitor. Both are contained within a VPC, and customers need only communicate with the EMR cluster through the master node.

CloudFormation Template (CFT)

AWS CloudFormation templates are JSON or YAML-formatted text files that simplify provisioning and management on AWS. The templates describe the service or application architecture you want to deploy, and AWS CloudFormation uses those templates to provision and configure the required services (such as Amazon EC2 instances or Amazon RDS DB instances). The deployed application and associated resources are called a "stack."
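For readers unfamiliar with the format, the sketch below builds a minimal JSON template as a Python dictionary and serializes it. It is an illustrative single-instance stack, not the VariantSpark template: the resource name `ExampleInstance`, the `ImageId` parameter, and the instance type are placeholders chosen for this example.

```python
import json

# Minimal illustrative template; NOT the template shipped with this product.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Illustrative single-instance stack",
    "Parameters": {
        # The AMI to launch is supplied when the stack is created.
        "ImageId": {"Type": "AWS::EC2::Image::Id"},
    },
    "Resources": {
        "ExampleInstance": {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "InstanceType": "t3.small",
                "ImageId": {"Ref": "ImageId"},
            },
        },
    },
    "Outputs": {
        # Ref on an EC2 instance resolves to its instance ID.
        "InstanceId": {"Value": {"Ref": "ExampleInstance"}},
    },
}

template_json = json.dumps(template, indent=2)
print(template_json)
```

The `Outputs` section is the mechanism behind "stack outputs" such as the URLs a deployed stack reports back after creation.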

Version release notes

Added security updates to monitor image

Additional details

Usage instructions

Subscribe to the product and click Launch to start deploying the stack. For information on how to use the CloudFormation template and notebook, see the product video (https://variantspark-marketplace-resources.s3.amazonaws.com/static/public/VariantSpark_AWS_Video.mp4). Access the notebook at the Jupyter Notebook URL given in the stack outputs. Cluster health can be inspected at the Ganglia URL given in the stack outputs. Please note that CloudFormation must have permission to create IAM roles. If you can connect to Ganglia but not to the Jupyter Notebook, check that the monitor instance is running and that its security group allows communication with the cluster on port 8080, which is enabled by default. If SSH access is desired, the security groups must be edited to allow access through port 22.
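When troubleshooting connectivity, a quick TCP reachability check can help separate security group problems from application problems. The helper below is a generic sketch using only the Python standard library. The function name is hypothetical, and the host in the usage comment is a placeholder that you would replace with the address taken from your own stack outputs.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example usage (placeholder host; run from a machine that should have access):
#   port_is_open("<host-from-stack-outputs>", 8080)
#   port_is_open("<host-from-stack-outputs>", 22)
```

A `False` result for port 8080 while Ganglia is reachable suggests checking the monitor instance state and its security group rules, as described above. A `False` result for port 22 is expected until the security groups are edited to allow SSH.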

Resources

Vendor resources

Example Notebook

AWS Podcast Feature

Support

Vendor support

AWS Support

AWS infrastructure support

AWS Support is a one-on-one, fast-response support channel that is staffed 24x7x365 with experienced and technical support engineers. The service helps customers of all sizes and technical abilities to successfully utilize the products and features provided by Amazon Web Services.

Customer reviews

Ratings and reviews

1 rating

1 AWS review

Lynn Langit

Try out VariantSpark in 15 minutes

Reviewed on Nov 27, 2019

Purchase verified by AWS

The CSIRO Bioinformatics team has been creating a number of powerful bioinformatics tools for several years. This implementation of their VariantSpark tool, which allows for rapid discovery of polygenic disease associations in large whole-genome sequencing cohorts, lets you quickly try out VariantSpark on AWS services and includes an example case from bioinformatics.

The example can be implemented via these AWS Marketplace CloudFormation templates and built in around 15 minutes. The example includes configuration for an EMR (Spark) cluster and its parameters, VariantSpark parameters, and also a client EC2 machine that includes an example Jupyter notebook.

Running the example in the notebook takes less than 5 minutes. There you can see VariantSpark in action, finding significant variants in a fun, synthetic phenotype (Hipsterism - or the genetic traits linked to being a Hipster).

You can also visualize the impact of running the workload on the AWS EMR cluster, using the included Ganglia libraries. Ganglia is a scalable, distributed monitoring tool for high-performance computing systems, clusters and networks.

Because the solution is built on CloudFormation templates, you can copy the solution template and quickly customize it further to support your production or research bioinformatics analyses.