The Genomic Medicine Institute (GMI) of Biochemistry and Molecular Biology at the Seoul National University College of Medicine in South Korea analyzes large-scale genomic (DNA and RNA) data. GMI is one of the largest genome centers focusing exclusively on human genome analysis. Human genomes are about 99.9% identical, and discovering the differences between genomes is the key to understanding many diseases. GMI plans to establish an Asian Genome Database with comprehensive genome information specifically targeting Asian populations.

DNA sequence analysis works by collecting a very large number of tiny fragments, called reads, from random locations in the human genome. Scientists collect billions of reads, with twentyfold to thirtyfold oversampling, to create a single DNA sequence. This process generates about 100 gigabytes (GB) of compressed data for one human genome. GMI built a data center in its research laboratory to analyze data using traditional rack-mounted clustered servers and storage servers. However, the servers couldn’t store more than 100 individual genomic sequences, and the lack of space meant that scientists could analyze data from only four human genome samples simultaneously. Scientists could not add more genome data, and there was nowhere to store the results of the analysis. The physical limitations of the infrastructure were constraining the scientists’ research efforts. Additionally, GMI faced frequent temperature control problems and power outages.
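The oversampling and storage figures above can be related with some simple arithmetic. The sketch below is illustrative only: it uses the standard coverage relation (coverage = reads × read length / genome length), an approximate human genome length of 3.1 billion base pairs, and an assumed read length of 100 base pairs, which the text does not state. The function names are hypothetical.

```python
HUMAN_GENOME_BP = 3_100_000_000  # approximate haploid human genome length (base pairs)
READ_LENGTH_BP = 100             # ASSUMPTION: read length is not given in the text


def reads_needed(coverage, read_length_bp=READ_LENGTH_BP, genome_bp=HUMAN_GENOME_BP):
    """Smallest read count N such that N * read_length / genome_length >= coverage."""
    total_bases = coverage * genome_bp
    return -(-total_bases // read_length_bp)  # integer ceiling division


def storage_gb(genomes, gb_per_genome=100):
    """Compressed storage for a number of genomes at about 100 GB each (figure from the text)."""
    return genomes * gb_per_genome


for coverage in (20, 30):
    print(f"{coverage}x coverage: about {reads_needed(coverage) / 1e9:.2f} billion reads")

print(f"100 genomes: about {storage_gb(100) / 1000:.0f} TB of compressed data")
```

Shorter reads raise the read count proportionally, so the exact number depends on the sequencing technology in use.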

Faced with the rising cost of building and maintaining their own data center, the scientists began to consider alternatives. Scientists started formulating their cloud strategy in 2010 and used the following requirements to evaluate cloud platform providers:

  • Massive-scale, high-performance services
  • Low-cost implementation and deployment
  • Ability to easily scale up and down to meet changes in demand
  • Highly available and reliable infrastructure
  • Hadoop-based extensible cloud computing power

“We needed to be able to store about 200 GB per sample of analyzed data, extend the analysis pipeline to examine more than 10 human samples simultaneously, and open our analysis pipeline to the public,” says Jeong-Sun Seo, M.D., Ph.D. “We looked at other options, but Amazon Web Services (AWS) met all our requirements.”

GMI used local Hadoop clusters to develop a website called FX, which hosts an RNA-sequence analysis application. The scientists then deployed the FX web application on AWS using Amazon Elastic Compute Cloud (Amazon EC2), Amazon Simple Storage Service (Amazon S3), and Amazon Elastic MapReduce (Amazon EMR). The user interface for FX resides on a server in a GMI lab. Users upload data to Amazon S3 through the FX interface. After the upload, Amazon EMR processes the job and stores the result in Amazon S3. End users can access Amazon S3 directly to retrieve the results. Figure 1 shows the system architecture on AWS.
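The flow described above (input in Amazon S3, a Hadoop job on Amazon EMR, results back in Amazon S3) can be sketched as a job request. This is a minimal illustration, not GMI's actual code: it builds keyword arguments in the shape accepted by the `run_job_flow` call of the boto3 EMR client, which is not necessarily the tooling GMI used. The bucket names, JAR path, job names, instance type, and release label are all hypothetical, and the default of 40 instances simply mirrors the count GMI reports as performing best.

```python
def build_job_flow_request(input_uri, output_uri, instance_count=40):
    """Return keyword arguments shaped for boto3's EMR `run_job_flow` call."""
    if not (input_uri.startswith("s3://") and output_uri.startswith("s3://")):
        raise ValueError("input and output must be s3:// URIs")
    return {
        "Name": "fx-rna-seq-analysis",        # hypothetical job name
        "ReleaseLabel": "emr-6.15.0",         # assumed EMR release
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumed instance type
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": instance_count,
            # Shut the cluster down once the analysis step finishes.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "fx-pipeline",        # hypothetical step name
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    # Hypothetical location of the Hadoop analysis JAR.
                    "Jar": "s3://example-fx-bucket/jars/fx-pipeline.jar",
                    "Args": [input_uri, output_uri],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_job_flow_request(
    "s3://example-user-bucket/reads/",    # hypothetical uploaded input
    "s3://example-user-bucket/results/",  # hypothetical result location
)

# Submitting the request requires boto3 and valid AWS credentials:
#   import boto3
#   emr = boto3.client("emr", region_name="us-east-1")
#   response = emr.run_job_flow(**request)
```

Keeping the request construction separate from the submission makes it possible to inspect or adjust the instance count before any AWS resources are started.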


Figure 1: GMI FX Architecture

GMI estimates that by using AWS, scientists have been able to reduce computing time by more than half compared to their previous environment. GMI runs its infrastructure in the US East (Northern Virginia) Region and estimates that performance is best when using 40 Amazon EC2 instances. “AWS provides the wonderful flexibility to scale Amazon EC2 instances up and down to compute genome data,” reports Arang Rhie, a GMI researcher. “The data analysis pattern we use is well suited to the AWS infrastructure, which allows us to define the number of instances needed to process a few hundred GB of data and download the end-result as a few kilobytes (KB).”

GMI created an interface to FX that allows GMI users and other research institutions to access the website application using an AWS ID and credentials. This approach allows GMI researchers to pay only for the AWS resources that they use, and ensures that external users pay for using the FX web application with their own AWS accounts. “AWS helped us achieve our goals,” says Arang Rhie, Researcher. “Any user, commercial or academic, can upload genomic raw sequence using their own Amazon S3 bucket and analyze the uploaded data using our application on the AWS Cloud. We don’t worry about capacity because Amazon EMR dynamically configures the compute power in the cloud.”

GMI plans to use the FX application on AWS to process a massive amount of sequencing data for personalized genetic variation analysis. “We expect that using AWS will help reduce our IT costs while increasing our efficiency,” says Seo.
