To drive collaboration among researchers, Sage Bionetworks built an online environment called Synapse. Synapse hosts clinical-genomic datasets and provides researchers with a platform for collaborative analyses. Just as GitHub and SourceForge provide tools and shared code for software engineers, Synapse provides a shared compute space and a suite of analysis tools for researchers. Synapse leverages a variety of AWS products to handle basic infrastructure tasks, which has freed the Sage Bionetworks development team to focus on the most scientifically relevant and unique aspects of their application.
Amazon Simple Workflow Service (Amazon SWF) is a key technology in Synapse, which relies on it to orchestrate complex, heterogeneous scientific workflows. Michael Kellen, Director of Technology for Sage Bionetworks, states, “SWF allowed us to quickly decompose analysis pipelines in an orderly way by separating state transition logic from the actual activities in each step of the pipeline. This allowed software engineers to work on the state transition logic and our scientists to implement the activities, all at the same time. Moreover, by using Amazon SWF, Synapse is able to use a heterogeneity of computing resources including our servers hosted in-house, shared infrastructure hosted at our partners’ sites, and public resources, such as Amazon’s Elastic Compute Cloud (Amazon EC2). This gives us immense flexibility in where we run computational jobs, which enables Synapse to leverage the right combination of infrastructure for every project.”
One of the initial pilots of Synapse was a project called MetaGEO. MetaGEO’s purpose is to improve the understanding and use of human gene expression data to predict key drivers of disease. Using Amazon SWF, the Sage Bionetworks team built a pipeline to automatically analyze a collection of datasets accessible from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO). This collection includes about 8,000 datasets, each ranging in size from around 100 MB to nearly 100 GB. The pipeline first added meaningful annotations to help researchers understand the datasets they picked for processing. It then applied numerical methods to remove erroneous or useless data elements, a quality control activity informally called dataset “QCing”. Once this step is complete, scientists can mine the data aggregated across multiple studies to look for genes whose expression is consistently associated with a particular disease.
Prior to using Amazon SWF, this kind of processing took a very long time. Initially, scientists implemented their algorithms by writing individual Perl scripts to execute computations on the aggregated data. They could get their data analyzed, but tracking failed computational jobs was a tedious, manual process, as was performing root cause analysis. With such an approach, scientists also could not leverage any parallel processing. What Synapse needed was a reliable mechanism to coordinate various jobs and track them, even when several of them ran in parallel or had dependencies between them.
As part of a private beta program, the Sage Bionetworks team was introduced to Amazon SWF, an AWS service that was already in use within AWS and Amazon.com. The team found Amazon SWF to be a perfect framework on which to base Synapse. They used the AWS Flow Framework to program a higher-order workflow with an initiation step followed by two processing steps, which are themselves defined as a separate workflow. In the initiation step, a “crawler” queries the GEO site to find datasets of interest and downloads the metadata for each dataset. For each dataset, a new workflow instance is created programmatically to run the processing steps. Since Amazon SWF allows customers to have millions of concurrent workflow executions open, these datasets can all be processed in parallel, supporting the computational needs of several simultaneous users. In each workflow execution, the “indexing” step first uploads metadata and web links pointing to GEO into Synapse. The “QC” step then processes the raw data and uploads the automatically curated data into Synapse.
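The shape of this workflow can be illustrated with a small, self-contained sketch. It does not use Amazon SWF or the AWS Flow Framework (whose programming model is Java); it only mimics the structure described above with Python's standard library. The function names, the dataset identifiers, and the sizes are all hypothetical placeholders, and a thread pool stands in for the parallel child workflow executions.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_geo():
    """Stand-in for the crawler: return metadata for datasets of interest.

    A real crawler would query the GEO site; these records are made up.
    """
    return [
        {"id": "example-1", "size_mb": 120},
        {"id": "example-2", "size_mb": 4800},
        {"id": "example-3", "size_mb": 95000},
    ]


def index_dataset(meta):
    """Stand-in for the indexing step (upload metadata and links)."""
    return {"id": meta["id"], "indexed": True}


def qc_dataset(indexed):
    """Stand-in for the QC step; it receives the output of indexing."""
    return {**indexed, "qc_passed": True}


def process_dataset(meta):
    """Child workflow for one dataset: indexing first, then QC."""
    return qc_dataset(index_dataset(meta))


def run_pipeline(max_workers=4):
    """Parent workflow: crawl once, then process every dataset in parallel."""
    datasets = crawl_geo()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the inputs.
        return list(pool.map(process_dataset, datasets))


if __name__ == "__main__":
    for result in run_pipeline():
        print(result)
```

The point of the sketch is the separation of concerns: `run_pipeline` and `process_dataset` hold only ordering and fan-out logic, while the step functions hold the actual work and could be swapped for scripts in any language.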
The overall coordination logic is written in Java using the AWS Flow Framework, as if it were a single-threaded program. The framework works in conjunction with the service to decompose this logic into distributed, asynchronous tasks. The crawler and QC logic, by contrast, were written independently by scientists in R and Perl and run on heterogeneous computational resources. Using Amazon SWF task lists, Synapse can specify the memory requirement for each execution based on the size of the dataset to be processed. Amazon SWF routes tasks to appropriately sized servers where the QC logic executes. Amazon SWF also makes it easy to capture execution traces even when the execution is distributed in nature, and these traces are always available for analysis in the AWS Management Console. Through the AWS Management Console, Synapse engineers can reconcile results and review logged information on a per-execution basis.
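The routing idea can be sketched as a simple lookup: the coordinator picks a task list name from the dataset size, and only workers running on suitably sized servers poll that task list. The sketch below is plain Python rather than an Amazon SWF client; the task list names and the size thresholds are assumptions chosen for illustration, not values used by Synapse.

```python
# Hypothetical task lists, ordered by the largest dataset (in MB) each one
# is assumed to handle. Neither the names nor the limits come from Synapse.
TASK_LISTS = [
    (1_000, "qc-small-memory"),
    (20_000, "qc-medium-memory"),
]
LARGEST_TASK_LIST = "qc-large-memory"


def choose_task_list(size_mb):
    """Return the name of the task list a QC task should be scheduled on."""
    if size_mb < 0:
        raise ValueError("dataset size must be non-negative")
    for limit_mb, name in TASK_LISTS:
        if size_mb <= limit_mb:
            return name
    # Anything above the last limit goes to the largest servers.
    return LARGEST_TASK_LIST


if __name__ == "__main__":
    for size in (100, 5_000, 95_000):
        print(size, "MB ->", choose_task_list(size))
```

Because the sizing decision lives in one function, adding a new server class would only mean adding a row to the table and starting workers that poll the new task list.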
Using Amazon SWF, Synapse is now able to execute data analysis algorithms written by scientists in various programming languages, in parallel, on a heterogeneous set of servers. Sage Bionetworks estimates that creating even a minimal orchestration framework in Synapse would cost more than $100K of software engineering labor to start, and more as their orchestration needs grow. Amazon SWF relieves them from having to deal with the complexities of distributed coordination and allows them to focus on their core mission. As Sage Bionetworks evolves Synapse into a scientific computing platform with social interactions, they will be leveraging many more powerful features in Amazon SWF and the AWS Flow Framework, without having to spend time and effort building them.
Brig Mecham is one of the scientists using Amazon SWF to process billions of biological measurements that have the potential to unlock key aspects of complex human diseases, including cancer, Alzheimer’s, and diabetes. Says Brig, “Of all the improvements obtained from using Amazon SWF for our computation platform, the most important is that it allowed us to quickly and efficiently start using the data for their true purpose: identifying cures for human diseases.”
To learn more, visit http://sagebase.org/synapse-overview/.