AWS Big Data Blog

How Digital Infuzion solves the challenge of large-scale scientific data collaboration with Amazon QuickSight

This is a guest post by Digital Infuzion. In their own words, “Digital Infuzion (DIFZ), a leader in information technology, helps solve complex challenges related to genomics, health, and biomedical data, while collaborating with partners including the J. Craig Venter Institute, Gryphon Scientific, ICF International, and others engaged in scientific research. Together, we create novel and highly scalable solutions that enable the extraction of new knowledge from data.”

One of our main areas of focus at Digital Infuzion is helping researchers collect, share, analyze, and visualize data using technology-agnostic, cloud-based data management platforms. By making high-quality data available to the scientific community, we’re helping researchers discover new insights. Ultimately, these insights can help reduce the prevalence and severity of disease by increasing knowledge about pathogenicity and transmissibility and improving the understanding of biological systems in general.

In this post, I’d like to talk about our work in developing and managing the Data Processing and Coordinating Center (DPCC) in support of a large and distributed scientific research network, Centers of Excellence for Influenza Research and Surveillance (CEIRS)*. CEIRS is a program supported by the National Institute of Allergy and Infectious Diseases (NIAID), part of the National Institutes of Health. We used AWS as the foundation of the solution and adopted Amazon QuickSight, a cloud-powered business intelligence (BI) service, to facilitate both the work of our internal data managers and the analytical needs of the CEIRS researchers.

A network of knowledge

The CEIRS network includes over 50 research institutions across 20 countries. Its mission is to conduct basic and clinical research to better understand disease-causing aspects of the influenza virus and to evaluate its transmission and evolution among human and animal hosts.

The data we manage is diverse, ranging from clinical trials to animal surveillance to genomes and pathogenesis. The individual variables include metadata, gene sequences, geolocation, and other types of complex numeric and text-based fields.

Data accuracy is critical, but the large size of the network and the wide geographic distribution made it challenging to enable seamless data sharing and analysis. That’s why we selected Amazon Relational Database Service (Amazon RDS) for MySQL. It provides the scalability and manageability we needed to deliver this kind of complex solution. We have a lot of sophisticated data scientists who are used to having full access to the infrastructure and applications they’re working with. With AWS, we never have to worry that our solution will place limits or abstractions in their way. We get the benefits of a managed cloud service with the comprehensive detail that allows us to innovate and scale the solution to accommodate a large research community.

Understanding and improving data

Initially, the team used spreadsheets for reporting and visualizing data. It soon became clear that we, along with the CEIRS scientists and managers, needed to collaborate, perform analytics in a more interactive and centralized way, share tailored dashboards, and seamlessly access the entire CEIRS dataset across all centers. We needed an embedded solution that supported data analytics, visualization, and secure sharing.

QuickSight made sense for a number of reasons. First, because the project’s entire technical infrastructure was already hosted on AWS, the amount of effort that it took to begin using QuickSight and connect it to our data was minimal. This was a big advantage, considering that we’re a relatively lean team.

Another reason is the Super-fast, Parallel, In-memory Calculation Engine (SPICE), one of the most powerful features of QuickSight. With SPICE, we can manipulate huge amounts of data and see results refresh almost instantaneously. QuickSight is also highly cost-effective compared to other BI solutions and offers flexible per-user licensing.

In addition, while a majority of CEIRS data is eventually shared in publicly accessible databases, the full unrestricted dataset is only available to account holders. QuickSight offers built-in integration with AWS Identity and Access Management (IAM) for internal management and interoperates with our self-managed OpenLDAP and KeyCloak servers hosted on Amazon Elastic Compute Cloud (Amazon EC2). This meant that we could fully integrate QuickSight into our single sign-on (SSO) platform. As we explain in the next section, these features allowed us to design a seamless experience for our users. Logging in with SSO automatically provisions QuickSight accounts and grants access to data according to user-specific permissions.

Serving the needs of many users

We use QuickSight in many of our data management workflows. Our own team members use it to analyze and explore the scientific data submitted to us by CEIRS scientists. We perform both automated and manual quality control of all the data to ensure compliance with agreed-upon standards and to ensure that the information is scientifically valid. We use our team’s scientific expertise to identify information that, although technically compliant, may not be scientifically valid when evaluated in full context. QuickSight integration in our workflow enables us to extract the datasets to be examined, run analytics to expose data issues, and collaborate with data submitters to fix errors.

In addition to quality control, our users rely on QuickSight to track the status of their data submissions and to issue both recurring and ad hoc data management reports. Program managers also use the platform to quickly respond to data calls from their leadership and to inform strategic planning decisions. The following screenshots show some examples of dashboards enabled by QuickSight.

The architecture of automated account creation

The main challenge we encountered when trying to integrate QuickSight into our permissioned web portal was user access management. The independent user account management system that QuickSight uses didn’t interface directly with our custom OpenLDAP/KeyCloak user authentication and authorization systems hosted on Amazon EC2.

The team developed custom software to bridge the authentication systems, enabling users with valid credentials to gain QuickSight reader permissions and access to data analytics without additional authentication.

In addition to KeyCloak, the solution rests on a set of purpose-built AWS Lambda functions that authenticate and relay access requests seamlessly and automatically (Step 2 in the following architectural diagram). If an account exists, Lambda forwards the request to QuickSight along with an authentication token generated by KeyCloak SSO (Step 6). If not, Lambda uses the newly released QuickSight SDK to create a new reader account on the fly (Step 4) and forwards the request to QuickSight alongside the authentication token generated by KeyCloak. We used the row-level security feature in QuickSight to filter and exclude data from the main analytics core based on the logged-in user’s identity and role (Step 6.1).
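
The check-then-create step described above can be sketched in Python. This is an illustration under stated assumptions, not the production Lambda code: it assumes a client that behaves like the boto3 QuickSight client, and the names PortalUser, ensure_reader, and the IAM role ARN passed in are hypothetical. It also assumes readers are registered as IAM identities, in which case QuickSight names the user "role-name/session-name".

```python
from dataclasses import dataclass


@dataclass
class PortalUser:
    """Identity claims assumed to come from an already validated SSO token."""
    username: str
    email: str


def ensure_reader(qs, account_id: str, user: PortalUser, role_arn: str,
                  namespace: str = "default") -> str:
    """Return "existing" if the QuickSight user is found; otherwise register
    a READER account and return "created".

    `qs` is expected to behave like boto3.client("quicksight").
    """
    # Assumption: IAM-based users are named "<role-name>/<session-name>".
    role_name = role_arn.rsplit("/", 1)[-1]
    qs_user_name = f"{role_name}/{user.username}"
    try:
        qs.describe_user(
            AwsAccountId=account_id,
            Namespace=namespace,
            UserName=qs_user_name,
        )
        return "existing"
    except qs.exceptions.ResourceNotFoundException:
        # No account yet: create a reader on the fly.
        qs.register_user(
            AwsAccountId=account_id,
            Namespace=namespace,
            IdentityType="IAM",
            IamArn=role_arn,
            SessionName=user.username,
            Email=user.email,
            UserRole="READER",
        )
        return "created"
```

Inside a Lambda handler, a function like this would typically be called with boto3.client("quicksight") before the request is forwarded to QuickSight. Passing the client in as a parameter keeps the logic testable without AWS credentials.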

Supporting the scientific collaboration of the future

Embedding QuickSight into the DPCC portal gives the entire CEIRS network direct access to a powerful analytical platform. With QuickSight integration, we have removed technical and usability barriers and increased data accessibility, inviting users to explore and harness the power of the CEIRS dataset.

Overall, QuickSight enables us and our partners to quickly identify and address data quality issues and to track and manage quality control efforts. With AWS solutions, scientists and other power users at our partner institutions can analyze data to make scientific discoveries and share content easily. The rapid rollout of new QuickSight features gives our team the confidence to make a long-term commitment to QuickSight as the engine of in silico discovery for our platform.

* This work was supported by the National Institute of Allergy and Infectious Diseases (NIAID) Centers of Excellence for Influenza Research and Surveillance (CEIRS) Data Processing and Coordination Center (DPCC), contract number HHSN272201400026C.

About the Authors

Stephan Bour, Ph.D. – Digital Infuzion – Chief Scientific Officer
Indresh Singh – J. Craig Venter Institute – Lead Core Informatics Services, Sr. Software Engineer
Brian Markowitz – Digital Infuzion – Lead Data Architect
Marcel Shields – Digital Infuzion – Lead Software Engineer