AWS Open Source Blog

Enabling Scientists to Collaborate with Amazon EKS and Open Science Studio

To enable scientists around the world to collaborate by sharing data and processes and by generating reproducible results, Navteca chose AWS and Amazon Elastic Kubernetes Service (Amazon EKS) as the foundation for its data platform. Navteca is a contractor supporting U.S. federal civilian agencies such as NASA, NOAA, and USGS, and has collaborated with AWS on this effort in support of the Year of Open Science initiative from the White House Office of Science and Technology Policy. Open science is a movement that aims to make scientific research more transparent, accessible, and collaborative. It attempts to address several problems in the current scientific research system, including the lack of reproducibility, the publication bias towards positive results, and limited access to research outputs. As part of this effort, AWS and Navteca leveraged numerous open source technologies such as JupyterHub, Dask, Crossplane, and Flux CD.

Navteca's ultimate goal is to create “scientific models as a service,” a way for researchers around the world to execute common scientific models directly from a familiar interface. “When scientists share their research or models with the community it is often hard to replicate the science because the underlying hardware and software requirements are complex to recreate,” said Ramon Ramirez-Linan, Navteca CTO. For example, to run a model created by another scientist, researchers need to download the dependent libraries, compile the code, and provision sufficient computing resources to run the model, all while meeting stringent security and governance requirements. For specialized IT professionals familiar with high-performance computing (HPC) workloads, this may be a straightforward task; however, scientists, researchers, and students trying to reproduce results may not have the expertise to deploy the required infrastructure and software reliably to the cloud. This makes results harder to reproduce, which can slow the overall progress of research.

As a first step towards this goal, Navteca wanted an open source solution to automate provisioning of DaskHub (JupyterHub with Dask) on demand. Prior to this solution, provisioning a new DaskHub installation could take up to a day and needed manual intervention to reach a working state. With this solution, provisioning all resources takes minutes and requires no manual intervention. This aligns with the ‘Data on EKS’ initiative at AWS, which acknowledges the importance of big data and machine learning (ML) on Kubernetes to global research agencies and industries, and strives to open source performant architectures that facilitate this work.

Open source components

The first implementation consists of several components: JupyterHub, Dask, Flux GitOps, Crossplane, and Navteca's open source JupyterLab extensions, all hosted on Amazon EKS.

Combining these components creates a highly scalable, multi-tenant data analysis environment that can support many concurrent users. It is also easy to support and can be configured or modified in minutes. It offers a graphical interface that researchers can use to leverage compute and analytic libraries without having to understand the underlying infrastructure, use the command line, or SSH into an HPC cluster. In addition, this environment can be instantiated in any of the AWS Regions, allowing researchers who may not have access to an expensive on-premises HPC cluster to quickly create a multi-tenant research environment. This environment can then be used for collaborative work by thousands of data scientists and researchers.

If you want to try Open Science Studio for free and do some data science of your own, you can visit NASA’s website to get started. Anyone can register using just an email address and spin up a notebook that can be used for many kinds of analysis.

Navteca is also developing additional JupyterLab extensions for the scientific community. These include Bucket Explorer (bexplorer), which lets users browse private datasets in Amazon Simple Storage Service (Amazon S3) as well as Open Data on AWS, and API Baker, which uses Amazon API Gateway and AWS Lambda to turn any Jupyter notebook into a secure API endpoint.

Solution walkthrough

Figure: Self-service data platform diagram

To deploy the solution in your own AWS account, follow these high-level steps:

  1. Create an Amazon EKS cluster, then deploy Crossplane with the AWS Crossplane providers and a GitOps engine (for example, Flux CD or Argo CD). You can find an example here.
  2. Deploy a Crossplane Composition that reconciles a new instance of the scientific research solution. Each instance includes an Amazon EKS cluster, the Helm chart for DaskHub (which includes JupyterHub), and an Amazon Cognito user pool. You can find Navteca’s composition along with instructions on how to deploy it here.
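
Once a composition like the one in step 2 is installed, a new environment is typically requested by submitting a small Crossplane claim. The following Python snippet is a minimal sketch of how such a claim could be generated for a GitOps repository. The API group, the kind, and the parameter names (region, nodeCount) are hypothetical placeholders and are not taken from Navteca's composition; the real values are defined by the composite resource definition that ships with it.

```python
import json


def build_daskhub_claim(name: str, namespace: str, region: str, node_count: int) -> dict:
    """Return a Crossplane claim manifest as a plain dictionary.

    The apiVersion, kind, and parameter names are hypothetical; check the
    composite resource definition in the composition repository for the
    real ones.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return {
        "apiVersion": "example.org/v1alpha1",  # hypothetical API group
        "kind": "DaskHubEnvironment",          # hypothetical claim kind
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Hypothetical parameters exposed to the person requesting the environment.
            "parameters": {"region": region, "nodeCount": node_count},
            # compositionSelector is a standard claim field; the label is an assumption.
            "compositionSelector": {"matchLabels": {"provider": "aws"}},
        },
    }


def render_manifest(claim: dict) -> str:
    """Serialize the claim as JSON, which Kubernetes tooling accepts alongside YAML."""
    return json.dumps(claim, indent=2, sort_keys=True)


if __name__ == "__main__":
    claim = build_daskhub_claim("team-a-hub", "research", "us-east-1", 3)
    print(render_manifest(claim))
```

Committing the rendered manifest to the Git repository watched by the GitOps engine would let the engine apply it to the cluster, leaving Crossplane to reconcile the underlying AWS resources.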

Working together, AWS and Navteca leveraged Crossplane running on Amazon EKS to allow for the rapid creation of a shared DaskHub environment where users can collaborate on data science and research. In the DaskHub environment, end users can not only run analyses themselves but also share those analyses and their results so that they can be reproduced, verified, and understood by others.

Approaching the creation of infrastructure through Kubernetes and Crossplane allows for a robust and performant shared services platform. This platform can be used for many workloads in addition to the DaskHub workload discussed in this post. It is our hope that the open source code used for this effort will not only allow organizations across the world to experience the positive outcomes from this specific workload, but also help promote the use of Amazon EKS and Crossplane in creating shared services platforms for a wide range of workloads.

Conclusion

We look forward to continuing to collaborate to help empower scientists, students, and researchers to do great things with open science. The work that can be done with these tools is important, and the opportunity to contribute to that work is meaningful. We wish NASA and Navteca luck in their future pursuits and look forward to what cutting-edge infrastructure built on AWS will enable scientists to do.

Jacob Mevorach

Jacob Mevorach is a senior specialist for containers for healthcare and the life sciences at AWS. Jacob has a background in bioinformatics and machine learning. Prior to joining AWS, Jacob focused on enabling and conducting large-scale analysis for genomics and other scientific areas.

Carlos Santana

Carlos Santana is a Senior Specialist Solutions Architect at AWS leading Container solutions in the Worldwide Application Modernization GTM team. He has more than 20 years of experience in distributed systems, open source, DevOps, containers, GitOps, Kubernetes, and serverless. He is a CNCF Ambassador and contributor to CNCF projects such as Kubernetes, Argo CD, and Knative.

Isaac Mosquera

At AWS, Isaac helps customers build internal developer platforms (IDPs) to drive developer productivity and application modernization initiatives. Isaac's background is mostly technical, but he is also passionate about product development and organizational psychology, and has a penchant for gummy bears.

Manabu McCloskey

Manabu is a Solutions Architect at Amazon Web Services. He works with AWS strategic customers to help them achieve their business goals. His current focus areas include GitOps, Kubernetes, Serverless, and Spinnaker.

Ramon Ramirez-Linan

Ramon is Co-founder and CTO of Navteca. In his current role, Ramon is the lead Cloud Solutions Architect for the NASA Science Cloud (SMCE). Ramon has also participated in many NASA Cloud initiatives, such as NASA NGAP and NASA HQ MCE. In the NASA Science Cloud, Ramon focuses on using Amazon EKS to deliver services like JupyterHub connected to HPC clusters and on using AWS Cloud services to facilitate and promote Open Science. Ramon also leads the technology vision for Navteca (https://www.navteca.com), delivering products and solutions like Voice Atlas (https://www.voiceatlas.com) and Open Science Studio (https://www.opensciencestudio.com/).