Overview
VariantSpark is a scalable toolkit for genome-wide association studies (GWAS), optimized for GWAS-like datasets. Machine learning methods, and random forests (RFs) in particular, are a promising alternative to standard single-SNP analyses in GWAS and can scale to rare variants from whole genome sequence data. RFs provide variable importance measures that rank genomic locations according to their predictive power for the disease or phenotype. Although a number of random forest implementations exist, some of them parallel or distributed (such as Random Jungle, ranger, and SparkML), none is optimized for modern whole genome datasets, which contain thousands of samples and millions of variables. Implemented directly on Apache Spark core, VariantSpark builds random forest models from VCF and CSV files and estimates variable importance using the mean decrease Gini method. The package also includes a Jupyter notebook with examples of quality control and data manipulation tasks using HAIL.is (included in the package), as well as of visualizing the results.
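To make the mean decrease Gini idea concrete, here is a minimal sketch of the quantity that gets accumulated per variable: the drop in Gini impurity produced by a single split. It is a toy illustration of the general measure, not VariantSpark's implementation; the function names and the case/control data are made up for this example. In a random forest, these per-split decreases are aggregated over every split that uses a given variable, across all trees.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity decrease when `parent` is split into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted


# Toy data: 4 cases and 4 controls, split on a hypothetical variant.
parent = ["case"] * 4 + ["control"] * 4
left = ["case", "case", "case", "control"]         # carriers
right = ["case", "control", "control", "control"]  # non-carriers

decrease = gini_decrease(parent, left, right)
print(decrease)  # 0.125
```

A variable whose splits separate cases from controls more cleanly yields larger decreases and therefore ranks higher.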
VariantSpark can process 200 samples with 20 million variables in 1 hour, consuming $3 of AWS resources. Compute time increases linearly with both the number of variables and the number of samples.
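As a rough back-of-the-envelope sketch, the reference figures above can be extrapolated to other dataset sizes. This assumes that compute time is proportional to the product of samples and variables, that the cluster configuration stays the same, and that cost scales with compute time. None of these assumptions is confirmed by the listing, so treat the result as an order-of-magnitude guide only.

```python
# Reference point quoted in the overview.
REF_SAMPLES = 200
REF_VARIABLES = 20_000_000
REF_HOURS = 1.0
REF_COST_USD = 3.0


def estimate(samples, variables):
    """Scale the reference run linearly in both samples and variables.

    Assumption: time is proportional to samples * variables on an unchanged
    cluster, and cost is proportional to time.
    """
    factor = (samples / REF_SAMPLES) * (variables / REF_VARIABLES)
    return REF_HOURS * factor, REF_COST_USD * factor


# Doubling both dimensions gives four times the reference time and cost.
hours, cost = estimate(samples=400, variables=40_000_000)
print(hours, cost)  # 4.0 12.0
```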
Highlights
- Thanks to its novel approach to building random forest models, VariantSpark works directly with VCF data, without the costly pre-processing required by other tools.
- VariantSpark is implemented directly on top of Apache Spark, a modern distributed framework for big data processing, which lets it scale horizontally to process even whole genome sequence data.
- More information is available in our peer-reviewed publication (O'Brien et al., "VariantSpark: population scale clustering of genotype information", BMC Genomics, 2015) and our most recent pre-print (Bayat et al., "VariantSpark, A Random Forest Machine Learning Implementation for Ultra High Dimensional Data", bioRxiv, 2019).
Pricing
Instance type | Product cost/hour | EC2 cost/hour | Total/hour |
---|---|---|---|
t2.nano | $0.00 | $0.006 | $0.006 |
t2.micro (AWS Free Tier) | $0.00 | $0.012 | $0.012 |
t2.small | $0.00 | $0.023 | $0.023 |
t3.nano | $0.00 | $0.005 | $0.005 |
t3.micro (AWS Free Tier) | $0.00 | $0.01 | $0.01 |
t3.small | $0.00 | $0.021 | $0.021 |
t3a.nano | $0.00 | $0.005 | $0.005 |
t3a.micro | $0.00 | $0.009 | $0.009 |
t3a.small | $0.00 | $0.019 | $0.019 |
Vendor refund policy
We do not currently support refunds, but you can cancel at any time.
Delivery details
VariantSpark Notebook
The VariantSpark Monitor EC2 instance uses the custom AMI, and the EMR cluster is instantiated using data from the Monitor. Both run inside a VPC, and customers only need to communicate with the EMR cluster through its master node.
CloudFormation Template (CFT)
AWS CloudFormation templates are JSON or YAML-formatted text files that simplify provisioning and management on AWS. The templates describe the service or application architecture you want to deploy, and AWS CloudFormation uses those templates to provision and configure the required services (such as Amazon EC2 instances or Amazon RDS DB instances). The deployed application and associated resources are called a "stack."
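For readers who have not seen one, the sketch below builds a deliberately tiny template as a Python dictionary and serializes it to JSON. It is illustrative only and is not the VariantSpark template: the logical names `ExampleInstance`, `InstanceType`, and `ImageId` are placeholders chosen for this example, and the real product template also provisions the EMR cluster, VPC, and IAM roles.

```python
import json

# Minimal illustrative template: one EC2 instance, two parameters, one output.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Illustrative single-instance stack (not the VariantSpark template)",
    "Parameters": {
        "InstanceType": {"Type": "String", "Default": "t3.small"},
        "ImageId": {"Type": "AWS::EC2::Image::Id"},
    },
    "Resources": {
        "ExampleInstance": {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                # "Ref" resolves a parameter to the value supplied at launch.
                "InstanceType": {"Ref": "InstanceType"},
                "ImageId": {"Ref": "ImageId"},
            },
        }
    },
    "Outputs": {
        # "Ref" on an EC2 instance resource resolves to its instance ID.
        "InstanceId": {"Value": {"Ref": "ExampleInstance"}},
    },
}

template_json = json.dumps(template, indent=2)
print(template_json)
```

The `Outputs` section is where a stack exposes values such as URLs once deployment finishes.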
Version release notes
- Added security updates to monitor image
Additional details
Usage instructions
Subscribe to the product and click Launch to start deploying the stack. For information on how to use the CloudFormation template and notebook, see the product video at https://variantspark-marketplace-resources.s3.amazonaws.com/static/public/VariantSpark_AWS_Video.mp4 . Access the notebook at the Jupyter Notebook URL stack output. Cluster health can be inspected using the Ganglia URL stack output. Note that CloudFormation must have permission to create IAM roles. If you can connect to Ganglia but not to the Jupyter Notebook, check that the monitor instance is running and that its security group allows communication with the cluster on port 8080, which is enabled by default. If SSH access is desired, edit the security groups to allow access through port 22.
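If you prefer to read the stack outputs programmatically rather than from the console, a sketch along the following lines may help. The helper turns a `describe_stacks`-style response into a simple dictionary. The output keys `JupyterNotebookURL` and `GangliaURL`, the stack name, and the URLs are hypothetical stand-ins; check your deployed stack for the actual key names.

```python
def stack_outputs(response):
    """Map OutputKey -> OutputValue for the first stack in the response."""
    stack = response["Stacks"][0]
    return {o["OutputKey"]: o["OutputValue"] for o in stack.get("Outputs", [])}


# Hand-written stand-in for a real response. With boto3 installed and
# credentials configured, the response would instead come from:
#   boto3.client("cloudformation").describe_stacks(StackName="<your-stack-name>")
sample_response = {
    "Stacks": [
        {
            "StackName": "example-stack",
            "Outputs": [
                {"OutputKey": "JupyterNotebookURL",
                 "OutputValue": "http://example.invalid/notebook"},
                {"OutputKey": "GangliaURL",
                 "OutputValue": "http://example.invalid/ganglia"},
            ],
        }
    ]
}

outputs = stack_outputs(sample_response)
for key, value in outputs.items():
    print(f"{key}: {value}")
```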
Support
AWS infrastructure support
AWS Support is a one-on-one, fast-response support channel that is staffed 24x7x365 with experienced and technical support engineers. The service helps customers of all sizes and technical abilities to successfully utilize the products and features provided by Amazon Web Services.
Customer reviews
Try out VariantSpark in 15 minutes
The CSIRO Bioinformatics team has been creating powerful bioinformatics tools for several years. This implementation of their VariantSpark tool, which allows for rapid discovery of polygenic disease associations in large whole-genome sequencing cohorts, lets you quickly try out VariantSpark on AWS services and includes an example case from bioinformatics.
The example can be deployed via these AWS Marketplace CloudFormation templates and built in around 15 minutes. It includes configuration and parameters for an EMR (Spark) cluster, VariantSpark parameters, and a client EC2 machine that includes an example Jupyter notebook.
Running the example in the notebook takes less than 5 minutes. There you can see VariantSpark in action, finding significant variants in a fun, synthetic phenotype (Hipsterism - or the genetic traits linked to being a Hipster).
You can also visualize the impact of running the workload on the AWS EMR cluster, using the included Ganglia libraries. Ganglia is a scalable, distributed monitoring tool for high-performance computing systems, clusters and networks.
Because the solution is built on CloudFormation templates, you can copy the solution template and quickly customize it further to support your production or research bioinformatics analyses.