“Our lncRNA analysis requires a lot of computational processing and integration. Using AWS, we can quickly compute across 1,000 or more nodes, which changes our time frame for genomic sequencing analysis from weeks to days.”
Dr. Mitch Guttman, Assistant Professor, Division of Biology and Biological Engineering

The Guttman Lab for lncRNA Biology at the California Institute of Technology (Caltech) is a research laboratory led by Dr. Mitch Guttman. His team studies a new class of genes called lncRNAs, short for large noncoding RNAs. Using genomic approaches along with biochemistry, molecular biology, cell biology, and computational biology, Guttman and his team are exploring how lncRNAs organize protein and DNA molecules in the cell to control precise gene expression programs.

When Dr. Guttman came to Caltech in 2013, he wanted to be sure his research team had a high-performance computing (HPC) cluster that was elastic and flexible. “When we thought about a cluster for our lab, we knew it had to support our fluctuating compute demands,” says Guttman. “Sometimes we need 1,000 compute nodes, and sometimes only 10. It depends on data availability and what stage of a research project we’re at. And the convergence of multiple projects simultaneously can push that number even higher.”

However, the lab did not want to build an on-premises cluster to meet those needs. “In California, we have some of the nation’s highest real estate and electricity costs, so we were concerned with the cost of creating our own cluster here,” says John Lilley, lead administrator, information management systems and services, Caltech. “We also didn’t want to spend our time managing and maintaining the cluster.”

Furthermore, Guttman and his team wanted to ensure they could easily manage cluster access credentials. “We wanted to be able to activate and deactivate cluster user accounts from one central location, without worrying that we missed credentials on any of the machines,” says Lilley.

Caltech had already moved its entire web presence to the Amazon Web Services (AWS) cloud platform, and the Guttman Lab chose AWS to host its HPC cluster as well. “We had been looking for a way to use the cloud for our compute resources, and AWS was the best choice because it offered the elasticity, flexibility, and cost savings we were looking for,” says Lilley.

The Guttman Lab’s HPC cluster consists of computers connected to an Amazon Virtual Private Cloud (Amazon VPC), a logically isolated section of the AWS cloud in which the lab launches AWS resources in a virtual network it defines. Researchers in dry and wet labs acquire genomic sequencing data and save it to a GlusterFS file system inside the Amazon VPC. They access the data through a shared AWS-based Linux workstation, which authenticates users via Simple AD, an Active Directory–compatible directory from AWS Directory Service.

The lab also uses the Amazon WorkSpaces managed desktop computing service for non-Linux users. “We wanted to give our Windows users the ability to connect from their dry lab PCs to Amazon WorkSpaces and have the same level of data access as Linux users,” says Lilley. “And we can use Simple AD to manage that access easily.” The lab runs its GlusterFS nodes on Amazon Elastic Compute Cloud (Amazon EC2) instances, and it uses the CfnCluster framework to deploy and maintain its HPC cluster on AWS. Using that cluster, the research team develops computational tools and statistical methods to analyze experimental data.

With AWS, the Guttman Lab now has the elasticity to manage its fluctuating compute demands. “We didn’t have to build our own physical cluster to manage our cyclical compute usage, because AWS scales automatically for us,” says Lilley. Guttman adds, “Now, we don’t need to spend time prioritizing projects in advance, and we know we’ll have enough compute power without having to refresh hardware every few years. We’re also able to actively develop and test new research methods. AWS is definitely an enabler for our lab.”
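To make the idea of a cluster that scales with demand concrete, the following is a minimal sketch of queue-driven sizing logic. It is not the lab's configuration or CfnCluster's actual code. The function name `desired_nodes`, the `slots_per_node` parameter, and the default bounds of 10 and 1,000 nodes (borrowed from the figures Guttman mentions) are assumptions made for illustration.

```python
import math


def desired_nodes(pending_jobs, slots_per_node, min_nodes=10, max_nodes=1000):
    """Return the node count a simple queue-driven scaler would request.

    All names and default bounds are hypothetical; the bounds echo the
    10 and 1,000 node figures quoted in the article.
    """
    if slots_per_node <= 0:
        raise ValueError("slots_per_node must be positive")
    if pending_jobs < 0:
        raise ValueError("pending_jobs must not be negative")

    # Nodes needed to give every pending job a slot, rounded up.
    needed = math.ceil(pending_jobs / slots_per_node)

    # Clamp the request to the configured floor and ceiling.
    return max(min_nodes, min(max_nodes, needed))


if __name__ == "__main__":
    for jobs in (0, 4000, 100000):
        print(jobs, "pending jobs ->", desired_nodes(jobs, slots_per_node=8), "nodes")
```

In this sketch, a quiet queue holds the cluster at its floor, while a burst from several overlapping projects grows it until it reaches the ceiling.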

The lab also has the agility necessary to easily add more compute resources when required. “We recently needed to expand our GlusterFS system from 5 terabytes to 24 terabytes, and we were able to do it without buying new hardware,” says Lilley. “We simply added more Amazon EC2 nodes and increased cloud storage, and it only took one hour. Previously, it would have taken weeks to do that, because there would be discussions about the hardware purchase prices, and then we would have had to do the procurement, installation, and testing.”
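The capacity planning behind an expansion like this can be sketched in a few lines. The example assumes a GlusterFS distributed or distributed-replicated volume, where usable capacity equals the number of bricks times the brick size divided by the replica count, and where bricks are added in multiples of the replica count. The brick size and replica count used here are hypothetical; the article does not state the lab's actual layout.

```python
import math


def extra_bricks(current_tb, target_tb, brick_tb, replica=1):
    """Return how many bricks to add to grow usable capacity to target_tb.

    brick_tb and replica are illustrative assumptions, not values taken
    from the Guttman Lab's deployment.
    """
    if brick_tb <= 0 or replica < 1:
        raise ValueError("brick_tb must be positive and replica at least 1")
    if target_tb < current_tb:
        raise ValueError("target capacity is smaller than current capacity")

    # Usable capacity still missing, in terabytes.
    missing_tb = target_tb - current_tb

    # Each replica set contributes one brick's worth of usable space.
    replica_sets = math.ceil(missing_tb / brick_tb)

    # Every replica set needs `replica` bricks.
    return replica_sets * replica


if __name__ == "__main__":
    # Growing from 5 TB to 24 TB with hypothetical 4 TB bricks.
    print(extra_bricks(5, 24, brick_tb=4))
    print(extra_bricks(5, 24, brick_tb=4, replica=2))
```

With cloud storage, each additional brick is a new volume attached to an Amazon EC2 node rather than a hardware purchase, which is why the change Lilley describes could be completed in an hour.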

Additionally, researchers at the lab can analyze lncRNA data faster using the AWS cloud. “Our lncRNA analysis requires a lot of computational processing and integration,” says Guttman. “Using AWS, we can quickly compute across 1,000 or more nodes, which changes our time frame for genomic sequencing analysis from weeks to days. We couldn’t do that with the limited capacity we had before.”

The lab has also been able to reduce costs by using Amazon EC2 Spot Instances to bid on spare Amazon EC2 compute capacity. “When you consider the elastic compute capabilities we get using AWS, as well as the cost-effectiveness of EC2 Spot instances, this cluster is far cheaper than anything we could have built ourselves,” says Guttman.
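As a rough illustration of how bidding on spare capacity works, the sketch below simulates one instance under a fixed maximum bid. It assumes the bid-based Spot model, in which an instance runs while the market price is at or below the bid and is billed at the market price rather than the bid. The prices are made-up values in cents, not AWS pricing, and the sketch simplifies by letting the instance resume after an interruption without any relaunch overhead.

```python
def simulate_spot(hourly_prices, max_bid):
    """Return (hours_run, total_cost) for one instance under a fixed bid.

    hourly_prices is a hypothetical sequence of market prices, one per hour.
    The instance runs only in hours where the market price does not exceed
    max_bid, and each such hour is billed at the market price.
    """
    hours_run = 0
    total_cost = 0
    for price in hourly_prices:
        if price <= max_bid:
            hours_run += 1
            total_cost += price  # billed at market price, not at the bid
        # Otherwise the instance is treated as interrupted for this hour.
    return hours_run, total_cost


if __name__ == "__main__":
    # Illustrative prices in cents per hour; the third hour spikes above the bid.
    prices_cents = [10, 12, 30, 11]
    hours, cost = simulate_spot(prices_cents, max_bid=15)
    print(f"ran {hours} of {len(prices_cents)} hours for {cost} cents")
```

Workloads that tolerate interruption, such as batch analysis jobs that can be requeued, are the ones that benefit most from this pricing model.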

Using Amazon WorkSpaces and Simple AD, the Guttman Lab can easily manage access to its HPC cluster. “When we first started with the cluster, it was quite a task to get credentials synced from the Linux desktop to the management hosts and the CfnCluster,” says Lilley. “With Simple AD integrated into the cluster, we have saved a lot of time because we can activate and deactivate user accounts from a central location. Simple AD helps us keep things consistent within the entire environment.”

Eventually, Caltech plans to have more labs and departments running on AWS. “We are taking what we’ve created on AWS and bringing it to other genomic researchers across campus,” says Lilley. “We see this as the template going forward for HPC at Caltech.”

To learn more about genomics in the cloud, visit our AWS Genomics details page.

To learn more about how AWS can help you manage your HPC cluster, visit our AWS High Performance Computing details page.