AWS Official Blog

Paging Researchers, Analysts, and Developers

by Jeff Barr | in Announcements

Hi, I am Deepak Singh, a business development manager at Amazon Web Services. One of my areas of focus is scientific computing on AWS, and I am guest blogging today about an exciting new initiative that will bring great benefit to researchers and scientists.

“Science was always about mashing up, taking one result and applying it to your [work] in a different way. The question is Can we make that as effective [for] samples [of] data and analysis as it [is] for a map and set of addresses for a coffee shop? That is the vision.” — Cameron Neylon

One way to achieve Cameron Neylon’s vision is to have access to public sources of data. This becomes even more powerful if scientists and analysts can use the available data to perform all kinds of computational and analytical tasks. At Amazon Web Services we believe that making it easy for people to get access to data spurs innovation. In line with that thinking, we have launched Public Data Sets on AWS, a new program that significantly lowers the barrier for researchers and data analysts to access and use some of the most commonly used data sets in their communities, without the need to manage the data within their own AWS accounts. Public Data Sets on AWS provides a convenient way to share, access, and consume publicly available data within your Amazon EC2 environment. Here is how it works:

  • Select public data sets will be hosted by Amazon Web Services for free as Amazon EBS snapshots.
  • You can access the data by creating your own personal Amazon EBS volume from the publicly shared Amazon EBS snapshot of a data set.
  • You can then access, modify, and perform computations on these data sets directly using an Amazon EC2 instance, and pay only for the compute and storage resources that you use.
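The steps above could be scripted roughly as follows. This is a minimal sketch, not an official recipe: it assumes a boto3-style EC2 client (boto3 is not mentioned in this post), and the helper names and the default device name `/dev/sdf` are hypothetical choices made for illustration. The client is passed in as an argument so the functions can be tried out against a stand-in object without touching a real account.

```python
def create_volume_from_snapshot(ec2, snapshot_id, availability_zone):
    """Create a personal EBS volume from a shared snapshot and wait until it is available.

    `ec2` is assumed to behave like a boto3 EC2 client.
    """
    response = ec2.create_volume(
        SnapshotId=snapshot_id,
        AvailabilityZone=availability_zone,  # must match the zone of your instance
    )
    volume_id = response["VolumeId"]
    # Block until the new volume reaches the "available" state.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])
    return volume_id


def attach_volume_to_instance(ec2, volume_id, instance_id, device="/dev/sdf"):
    """Attach the volume to a running EC2 instance (device name is an assumption)."""
    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)
    return device
```

With boto3 installed and credentials configured, a call might look like `create_volume_from_snapshot(boto3.client("ec2"), "<snapshot-id>", "<availability-zone>")`, where both placeholders stand for values you would look up for the data set and instance you are using.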

Some of the areas we have found people interested in include scientific research, economic data analysis, and market research. An example of a data set that we have seen interest in from the life science community is Ensembl. Ensembl is a joint project of the European Bioinformatics Institute and the Wellcome Trust Sanger Institute, and produces and maintains automated annotation on a number of eukaryotic genomes. Ensembl have made their MySQL databases for Ensembl release 51 available via the Public Data Sets on AWS program and will continue to make updated versions of Ensembl available in the future. This data set consists of more than 650 GB of data and over 31,000 files. People who want to use the snapshot will be able to create an EBS volume from the snapshot, mount that volume on an instance launched from an AMI running MySQL, and configure the MySQL instance to point to the database files. In other words, you will now have the capability of doing bioinformatics in the cloud without needing to keep your Ensembl databases up to date.
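As a rough illustration of the mount-and-configure step, the sketch below generates the shell commands for mounting an attached volume and a MySQL option-file fragment whose `datadir` points at it. The device name, mount point, and helper names are hypothetical, and it is an assumption that the database files on the volume can serve directly as a MySQL data directory; check the layout of the snapshot you actually use.

```python
import configparser
import io
from typing import List


def mount_commands(device: str, mount_point: str) -> List[str]:
    """Return shell commands that mount an attached EBS volume (to be run on the instance)."""
    return [
        f"sudo mkdir -p {mount_point}",
        f"sudo mount {device} {mount_point}",
    ]


def mysql_config_for_volume(mount_point: str) -> str:
    """Render a my.cnf fragment that points MySQL's datadir at the mounted volume."""
    config = configparser.ConfigParser()
    config["mysqld"] = {"datadir": mount_point}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical device and mount point, for illustration only.
    for command in mount_commands("/dev/sdf", "/mnt/ensembl"):
        print(command)
    print(mysql_config_for_volume("/mnt/ensembl"))
```

The generated fragment would be merged into the instance's MySQL option file before restarting the server; the code only produces text and does not modify the system itself.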

The real power of these data sets comes from developers who can now provide tools and APIs that can be used to analyze the data, or mash it up with other data sources. It will be interesting to see how people make use of the available data sets, which data sets are used most, and what kinds of data are requested and submitted. With the availability of these initial data sets, and more in the future, we would like to invite developers to provide analysis pipelines, tools, and APIs that can be leveraged by the community and potential customers.

If you are interested in making a data set available as part of the Public Data Sets on AWS program, please submit your request on the form at http://aws.amazon.com/publicdatasets/. We would love to hear from you.