    VariantSpark Notebook

    Sold by: AEHRC 
    AWS Free Tier
    A scalable toolkit with a Jupyter notebook for genome-wide association studies, optimized for GWAS-like datasets.

    Overview

    VariantSpark is a scalable toolkit for genome-wide association studies (GWAS), optimized for GWAS-like datasets. Machine learning methods, and random forests (RFs) in particular, are a promising alternative to standard single-SNP analyses in GWAS, and extend to rare variants from whole genome sequence data. RFs provide variable importance measures that rank genomic locations according to their predictive power for the disease or phenotype. Although there are a number of existing random forest implementations, some of them parallel or distributed, such as Random Jungle, ranger, or SparkML, none is optimized for modern whole genome datasets containing thousands of samples and millions of variables. Implemented directly on Apache Spark core, VariantSpark builds random forest models from VCF and CSV files and estimates variable importance using the mean decrease Gini method. The package also includes a Jupyter notebook with examples that perform quality control and data manipulation tasks using HAIL.is (included in the package) and visualize the results.
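
    To make the "mean decrease Gini" idea concrete, the sketch below computes the Gini impurity decrease produced by a single split on one variant. It is a minimal illustration in plain Python, not VariantSpark's implementation: the function names, the 0/1/2 genotype encoding, and the toy data are assumptions made for this example. A random forest accumulates such decreases over every split that uses a variable and averages them across trees to obtain that variable's importance.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_decrease(genotypes, labels, threshold):
    """Impurity decrease from splitting samples on `genotype <= threshold`.

    Hypothetical helper for illustration only; genotypes are assumed to be
    encoded as 0, 1 or 2 copies of the alternate allele.
    """
    left = [y for g, y in zip(genotypes, labels) if g <= threshold]
    right = [y for g, y in zip(genotypes, labels) if g > threshold]
    n = len(labels)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted_children


# Toy data (made up): six samples, three controls (0) and three cases (1).
labels = [0, 0, 0, 1, 1, 1]
informative_variant = [0, 0, 0, 1, 2, 2]  # separates cases from controls
noise_variant = [0, 1, 2, 0, 1, 2]        # unrelated to the phenotype

informative_score = gini_decrease(informative_variant, labels, threshold=0)
noise_score = gini_decrease(noise_variant, labels, threshold=0)
print(informative_score, noise_score)
```

    In this toy case the informative variant removes all impurity at the split, while the noise variant removes none, which is the kind of contrast the importance ranking relies on.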

    VariantSpark can process 200 samples with 20M variables in 1 hour, consuming $3 of AWS resources. VariantSpark compute time increases linearly with both the number of variables and the number of samples.
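
    The reference figures above can be turned into a rough back-of-the-envelope estimate. The sketch below is an assumption-laden illustration, not a vendor-provided formula: it reads "linear in both" as time proportional to samples multiplied by variables, assumes the same cluster configuration as the reference run, and assumes cost scales with run time. The function name `estimate_run` is hypothetical.

```python
# Reference point quoted in the listing: 200 samples, 20M variables, 1 hour, $3.
REF_SAMPLES = 200
REF_VARIABLES = 20_000_000
REF_HOURS = 1.0
REF_COST_USD = 3.0


def estimate_run(samples, variables):
    """Return (hours, cost_usd) under the assumed linear scaling.

    Assumptions (not stated in the listing): time is proportional to
    samples * variables, the cluster is unchanged, and cost tracks time.
    """
    scale = (samples / REF_SAMPLES) * (variables / REF_VARIABLES)
    return REF_HOURS * scale, REF_COST_USD * scale


hours, cost = estimate_run(samples=400, variables=20_000_000)
print(f"Estimated: {hours:.1f} h, ${cost:.2f}")
```

    Actual time and cost depend on instance types, cluster size, and data characteristics, so treat the result as an order-of-magnitude guide only.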

    Highlights

    • VariantSpark can work directly with the VCF data, without the costly pre-processing required by other tools due to its novel approach of building random forest models.
    • VariantSpark is implemented directly on top of Apache Spark - a modern distributed framework for big data processing, which gives VariantSpark the ability to scale horizontally to process even whole genome sequence data.
    • More information is available in our peer-reviewed publication, O'Brien et al., "VariantSpark: population scale clustering of genotype information", BMC Genomics, 2015, and in our most recent pre-print, Bayat et al., "VariantSpark, A Random Forest Machine Learning Implementation for Ultra High Dimensional Data", BioRxiv, 2019.

    Details

    Sold by

    Delivery method

    Delivery option
    VariantSpark Notebook

    Latest version

    Operating system
    Ubuntu 20.04

    Pricing

    VariantSpark Notebook

    Pricing is based on actual usage, with charges varying according to how much you consume. Subscriptions have no end date and may be canceled any time.
    Additional AWS infrastructure costs may apply. Use the AWS Pricing Calculator to estimate your infrastructure costs.

    Usage costs (9)

    Instance type   Product cost/hour   EC2 cost/hour   Total/hour   Notes
    t2.nano         $0.00               $0.006          $0.006
    t2.micro        $0.00               $0.012          $0.012       AWS Free Tier
    t2.small        $0.00               $0.023          $0.023
    t3.nano         $0.00               $0.005          $0.005
    t3.micro        $0.00               $0.01           $0.01        AWS Free Tier
    t3.small        $0.00               $0.021          $0.021
    t3a.nano        $0.00               $0.005          $0.005
    t3a.micro       $0.00               $0.009          $0.009
    t3a.small       $0.00               $0.019          $0.019

    Vendor refund policy

    We do not currently support refunds, but you can cancel at any time.

    Legal

    Vendor terms and conditions

    Upon subscribing to this product, you must acknowledge and agree to the terms and conditions outlined in the vendor's End User License Agreement (EULA).

    Content disclaimer

    Vendors are responsible for their product descriptions and other product content. AWS does not warrant that vendors' product descriptions or other product content are accurate, complete, reliable, current, or error-free.

    Usage information


    Delivery details

    VariantSpark Notebook

    The VariantSpark Monitor EC2 instance uses the custom AMI, and the EMR cluster is instantiated using data from the Monitor. Both are contained within a VPC, and customers need only communicate with the EMR cluster through the master node.

    CloudFormation Template (CFT)

    AWS CloudFormation templates are JSON or YAML-formatted text files that simplify provisioning and management on AWS. The templates describe the service or application architecture you want to deploy, and AWS CloudFormation uses those templates to provision and configure the required services (such as Amazon EC2 instances or Amazon RDS DB instances). The deployed application and associated resources are called a "stack."

    Version release notes
    • Added security updates to monitor image

    Additional details

    Usage instructions

    Subscribe to the product and click Launch to start deploying the stack. For information on how to use the CloudFormation template and notebook, see the product video at https://variantspark-marketplace-resources.s3.amazonaws.com/static/public/VariantSpark_AWS_Video.mp4. Access the notebook at the Jupyter Notebook URL stack output. Cluster health can be inspected using the Ganglia URL stack output. Please note that CloudFormation must have permissions to create IAM roles. If you are able to connect to Ganglia but not the Jupyter Notebook, check that the monitor instance is running and that its security group allows communication with the cluster on port 8080, which is enabled by default. If SSH access is desired, the security groups must be edited to allow access through port 22.
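
    As a rough illustration of the port rules described above, the sketch below builds a CloudFormation-style `AWS::EC2::SecurityGroup` resource fragment as a Python dictionary and prints it as JSON. It is not taken from the product's template: the function name, the group description, and the CIDR range are hypothetical placeholders, and the real stack may express these rules differently (for example, by referencing another security group rather than a CIDR).

```python
import json


def build_security_group(allowed_cidr, allow_ssh=False):
    """Return a CloudFormation-style security group resource as a dict.

    Hypothetical helper: port 8080 mirrors the notebook/cluster port named in
    the usage instructions; port 22 is added only when SSH is requested.
    """
    ingress = [
        {"IpProtocol": "tcp", "FromPort": 8080, "ToPort": 8080, "CidrIp": allowed_cidr}
    ]
    if allow_ssh:
        ingress.append(
            {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22, "CidrIp": allowed_cidr}
        )
    return {
        "Type": "AWS::EC2::SecurityGroup",
        "Properties": {
            "GroupDescription": "Example rules for a notebook monitor instance",
            "SecurityGroupIngress": ingress,
        },
    }


# 10.0.0.0/16 is a placeholder range; substitute the CIDR of your own VPC.
resource = build_security_group("10.0.0.0/16", allow_ssh=True)
print(json.dumps(resource, indent=2))
```

    Keeping the SSH rule optional reflects the instructions above, where port 22 is closed unless the security groups are edited.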

    Resources

    Vendor resources

    Support

    Vendor support

    AWS Support

    AWS infrastructure support

    AWS Support is a one-on-one, fast-response support channel that is staffed 24x7x365 with experienced and technical support engineers. The service helps customers of all sizes and technical abilities to successfully utilize the products and features provided by Amazon Web Services.

    Customer reviews

    Ratings and reviews

    5 out of 5 (1 rating)
    5 star: 100%, 4 star: 0%, 3 star: 0%, 2 star: 0%, 1 star: 0%
    1 AWS review
    Lynn Langit

    Try out VariantSpark in 15 minutes

    Reviewed on Nov 27, 2019
    Purchase verified by AWS

    The CSIRO Bioinformatics team has been creating a number of powerful bioinformatics tools for several years. This implementation of their VariantSpark tool, which allows for rapid discovery of polygenic disease associations in large whole-genome sequencing cohorts, lets you quickly try out VariantSpark on AWS services and includes an example case from bioinformatics.

    The example can be implemented via these AWS Marketplace CloudFormation templates and built in around 15 minutes. The example includes configuration for an EMR (Spark) cluster and its parameters, VariantSpark parameters, and also a client EC2 machine that includes an example Jupyter notebook.

    Running the example in the notebook takes less than 5 minutes. There you can see VariantSpark in action, finding significant variants in a fun, synthetic phenotype (Hipsterism - or the genetic traits linked to being a Hipster).

    You can also visualize the impact of running the workload on the AWS EMR cluster, using the included Ganglia libraries. Ganglia is a scalable, distributed monitoring tool for high-performance computing systems, clusters and networks.

    Because the solution is built on CloudFormation templates, you can copy the solution template and quickly customize it further to support your production or research bioinformatics analyses.
