VariantSpark Notebook

By: AEHRC Latest Version: 1.1.2

Linux/Unix

Product Overview

VariantSpark is a scalable toolkit for genome-wide association studies (GWAS), optimized for GWAS-like datasets. Machine learning methods, and random forests (RFs) in particular, are a promising alternative to standard single-SNP analyses in GWAS, and scale to rare variants from whole-genome sequence data. RFs provide variable importance measures that rank genomic locations according to their predictive power for the disease or phenotype. Although a number of random forest implementations exist, some of them parallel or distributed, such as Random Jungle, ranger, and SparkML, none is optimized for modern whole-genome datasets containing thousands of samples and millions of variables.

Implemented directly on Apache Spark core, VariantSpark builds random forest models from VCF and CSV files and estimates variable importance using the mean decrease Gini method. The package also includes a Jupyter notebook with examples of quality control and data manipulation tasks using Hail (hail.is, included in the package), as well as examples of visualizing the results.
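To illustrate the idea behind a mean decrease Gini importance, the following minimal Python sketch computes the Gini impurity decrease produced by a single split on one variant. It is not VariantSpark code and does not use its API: the function names, the genotype encoding (0/1/2 alternate allele counts), and the toy data are assumptions made for illustration. In a random forest, decreases like this one are typically accumulated over every split that uses a variable and averaged over the trees.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_decrease(genotypes, labels, threshold):
    """Impurity decrease from splitting samples on genotype <= threshold."""
    left = [y for g, y in zip(genotypes, labels) if g <= threshold]
    right = [y for g, y in zip(genotypes, labels) if g > threshold]
    n = len(labels)
    weighted_children = (
        (len(left) / n) * gini_impurity(left)
        + (len(right) / n) * gini_impurity(right)
    )
    return gini_impurity(labels) - weighted_children


# Hypothetical toy data: 6 samples, labels 0 = control, 1 = case.
labels = [0, 0, 0, 1, 1, 1]
variant_a = [0, 0, 0, 2, 2, 2]  # genotype separates cases from controls
variant_b = [0, 2, 0, 2, 0, 2]  # genotype is only weakly related to the label

decrease_a = gini_decrease(variant_a, labels, threshold=0)
decrease_b = gini_decrease(variant_b, labels, threshold=0)
print(decrease_a, decrease_b)
```

A variant whose split separates cases from controls well (variant_a here) yields a larger decrease than one that does not (variant_b), which is why such a measure can be used to rank genomic locations.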

VariantSpark can process 200 samples with 20M variables in 1 hour, at a cost of $3 in AWS resources. Compute time increases linearly with both the number of variables and the number of samples.
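As a rough back-of-the-envelope illustration of the linear scaling stated above, the sketch below extrapolates from the quoted reference point (200 samples, 20M variables, 1 hour, $3). It rests on assumptions not stated in the listing: that time scales with the product of the two ratios, that cost is proportional to compute time, and that the same cluster configuration is used. The function name estimate_run is hypothetical, and actual run times and costs may differ.

```python
# Reference point quoted in the listing.
REF_SAMPLES = 200
REF_VARIABLES = 20_000_000
REF_HOURS = 1.0
REF_COST_USD = 3.0


def estimate_run(samples, variables):
    """Extrapolate (hours, cost in USD) from the reference point.

    Assumption: time is linear in samples and linear in variables,
    so it scales with the product of the two ratios; cost is assumed
    proportional to time.
    """
    scale = (samples / REF_SAMPLES) * (variables / REF_VARIABLES)
    return REF_HOURS * scale, REF_COST_USD * scale


# Doubling both samples and variables quadruples the estimate.
hours, cost = estimate_run(samples=400, variables=40_000_000)
print(f"{hours:.1f} h, ${cost:.2f}")
```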

Operating System

Linux/Unix, Ubuntu 20.04

Delivery Methods

CloudFormation Template

Highlights

VariantSpark works directly with VCF data, without the costly pre-processing required by other tools, thanks to its novel approach to building random forest models.
VariantSpark is implemented directly on top of Apache Spark, a modern distributed framework for big data processing, which allows it to scale horizontally to process even whole-genome sequence data.
More information is available in our peer-reviewed publication, O'Brien et al., "VariantSpark: population scale clustering of genotype information", BMC Genomics, 2015, and in our most recent preprint, Bayat et al., "VariantSpark, A Random Forest Machine Learning Implementation for Ultra High Dimensional Data", bioRxiv, 2019.
