Decreasing costs and rapid technological innovation have resulted in a tremendous increase in the volume and throughput of biological data being generated at large research institutes, individual labs and biopharma companies. At the same time, Life Sciences research is becoming increasingly collaborative and complex, leveraging multiple technologies to get a systems-level understanding of diseases and organisms. This exponential increase in the scale of data being generated, combined with increased collaboration, has resulted in a need to rethink how data is stored, analyzed and shared.
Researchers, bioinformaticians, software developers and IT departments in Life Sciences are now leveraging Amazon Web Services (AWS) to create scalable and highly available IT infrastructures which store, compute and share terabytes (often petabytes) of data.
AWS Public Data Sets provide a centralized repository where this data can be shared and seamlessly integrated into AWS cloud-based applications. Examples include the 1000 Genomes Project, an international public-private consortium building the most detailed map of human genetic variation to date.
Benefits at a Glance
AWS enables you to increase or decrease capacity within minutes. You can commission one, hundreds or even thousands of server instances simultaneously.
Sharing and collaboration
Create a common space where you and your collaborators can share data, results and methods.
Develop highly available web applications to provide services and pipelines to other users. Deploy Amazon Machine Images (AMIs) that encapsulate proven pipelines and new techniques.
Align your operational costs with your needs through the pay-as-you-go pricing model. Stop paying for equipment that sits idle between experiments.
Control who accesses, views, edits, adds or deletes data. Create and monitor all access logs through notifications.
Focus on your science
Focus on your science without worrying about building high-scale, distributed infrastructures.
Pharma companies, Biotech companies, research centers and academic laboratories can use AWS to address some of their most critical IT challenges.
Genome Sequencing and Data Distribution
The availability of low-cost, high-throughput genome sequencing instruments is making it very easy to produce vast quantities of genome sequence data. Additionally, the parallel nature of sequencing experiments means the pace of experimentation is throttled by access to computing resources.
Amazon S3 provides a highly scalable, available, and durable data storage system for storing and distributing genome sequence data. Amazon EC2 provides highly scalable compute resources for running a variety of assembly, mapping and alignment applications. The close proximity of Amazon S3 and Amazon EC2 provides a high bandwidth connection for serving up input data and durably storing output data. Also, there is no additional bandwidth cost for data transferred between Amazon EC2 and Amazon S3 within a region.
AWS offers a variety of options for data ingestion and distribution. For smaller data sets (<~1TB), data is typically ingested into Amazon S3 through our web API. For larger genomic data sets, data is typically moved into AWS on physical media using the AWS Import/Export service or by leveraging WAN-optimization software running on an Amazon EC2 instance. Data can be distributed directly out of Amazon S3, or for more latency-sensitive use cases, out of the Amazon CloudFront content delivery network (CDN). For extremely large data sets, data can be moved out of AWS on physical media using the AWS Import/Export service.
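The choice between network upload and physical media mostly comes down to transfer time. The following is a minimal sketch of that arithmetic, not an AWS tool: the function names, the 80% link efficiency and the example link speed are assumptions made for illustration, and the 1 TB cutoff simply mirrors the rough guide given above.

```python
# Rough guide from the text above ("<~1TB"); not a hard limit.
WEB_API_LIMIT_TB = 1.0


def transfer_days(size_tb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Estimate days needed to move size_tb terabytes over a link_mbps link.

    efficiency is an assumed fraction of the nominal bandwidth that is
    actually usable (protocol overhead, shared links).
    """
    bits = size_tb * 1e12 * 8                        # decimal terabytes to bits
    seconds = bits / (link_mbps * 1e6 * efficiency)  # usable bits per second
    return seconds / 86400


def suggest_ingestion(size_tb: float) -> str:
    """Map a data set size to the ingestion route described in the text."""
    if size_tb < WEB_API_LIMIT_TB:
        return "web API upload to Amazon S3"
    return "AWS Import/Export (physical media) or WAN-optimized transfer"
```

Under these assumptions, a 10 TB data set on a 100 Mbps link at 80% efficiency takes roughly 11.6 days to upload, which is why physical media or WAN optimization becomes attractive at that scale.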
Analyzing genomic and proteomic data often requires access to vast quantities of computing resources for short periods of time. Amazon EC2 provides an elastic computing environment that allows researchers to process vast quantities of data for a variety of biological analyses.
For analysis and processing, there are multiple options for data storage depending on your use case. Amazon S3 is ideal for streaming object data into worker nodes for batch processing. Amazon EBS provides block-level direct attached storage for high I/O tasks such as relational databases. Amazon RDS is a relational database service which allows you to easily manage MySQL databases. Amazon SimpleDB is a simpler, non-relational database service often used to store metadata.
New paradigms, like MapReduce, are ideally suited to processing large data sets and are being leveraged by a new breed of bioinformatics algorithms. Amazon Elastic MapReduce provides an easy to use, robust framework for deploying MapReduce applications on AWS on top of the Hadoop framework.
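To make the MapReduce pattern concrete, here is a minimal single-machine sketch in plain Python that counts k-mers (fixed-length subsequences) across sequence reads. It does not use Hadoop or Amazon Elastic MapReduce; the function names and the choice of k-mer counting as the example are assumptions made for illustration. In a real deployment the framework would run the map and reduce steps across many nodes and perform the grouping between them.

```python
from collections import defaultdict
from itertools import chain


def map_kmers(read: str, k: int = 3):
    """Map step: emit a (k-mer, 1) pair for every k-length window in a read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: sum the counts emitted for one k-mer."""
    return key, sum(values)


def count_kmers(reads, k: int = 3):
    """Run map, shuffle and reduce in sequence on one machine."""
    mapped = chain.from_iterable(map_kmers(read, k) for read in reads)
    return dict(reduce_counts(key, vals) for key, vals in shuffle(mapped).items())
```

Because each read is mapped independently and each key is reduced independently, both phases can be spread across as many workers as the data requires.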
As pipelines are created in the cloud, software developers and researchers can easily store and share Amazon Machine Images (AMIs). This allows for collaborative application sharing and fast deployment of analysis resources.
Structure-based drug design, virtual screening and other drug simulations save considerable time and resources by providing a preliminary read on drug effectiveness before entering expensive trial phases of drug development. Due to the ‘embarrassingly parallel’ nature of virtual screening, the speed and quality of the research is often constrained only by the scalability of the compute power. By leveraging Amazon EC2 compute instances, researchers can spin up 10s, 100s or even 1000s of compute instances to quickly deliver results.
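The sketch below illustrates why virtual screening parallelizes so readily: each candidate is scored independently, so the work can be split across any number of workers. It is an illustration only. `score_ligand` is a hypothetical toy function standing in for real docking software, and a local thread pool stands in for a fleet of Amazon EC2 instances.

```python
from concurrent.futures import ThreadPoolExecutor


def score_ligand(ligand: str) -> float:
    """Hypothetical stand-in for a docking score; lower is treated as better.

    A toy deterministic value so the example runs without chemistry software.
    """
    return -float(sum(ord(ch) for ch in ligand) % 97) / 10.0


def screen(ligands, workers: int = 4, top_n: int = 3):
    """Score every ligand independently and return the top_n best-scoring pairs."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # No ligand depends on another, so the scores can be computed in any order.
        scores = list(pool.map(score_ligand, ligands))
    ranked = sorted(zip(ligands, scores), key=lambda pair: pair[1])
    return ranked[:top_n]
```

Since no task waits on another, doubling the number of workers roughly halves the wall-clock time until the pool of candidates is exhausted, which is the property that makes elastic compute a good fit.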
Scientific Collaboration and Centralized Data Management
As scientific data sets have become larger, managing and moving information amongst global partners has become extremely difficult. AWS offers a scalable, easily accessible central repository of data which is close to computation power for experiments.
Additionally, due to the elastic, pay-as-you-go nature of the cloud, distributed scientific teams can get started more quickly without having to worry about provisioning resources or deciding where the data will reside. Once a given experiment is completed, some or all of the resources can be ramped down without the burden of ongoing equipment cost.
Data can be distributed directly out of Amazon S3, or for more latency-sensitive use cases, out of the Amazon CloudFront content delivery network. Amazon Machine Images (AMIs) can be stored and shared in the cloud to facilitate experiments. Also, creating EBS snapshots is an easy way to share data with collaborators.