AWS Partner Network (APN) Blog

Privacy-Preserving Federated Learning on AWS with NVIDIA FLARE

By Kristopher Kersten (NVIDIA), Umair Khalid (AWS), Steve Fu, PhD (AWS), Olivia Choudhury, PhD (AWS), and Jonathan Schellack (AWS)


In recent years, researchers and engineers have benefited from advancements in the fields of medical imaging and artificial intelligence (AI), along with the accelerating adoption of cloud computing.

These advancements have allowed researchers to unlock the value of clinical data and have enabled engineers to develop next-generation AI applications to advance machine learning (ML) techniques, enable precision medicine, detect and prevent diseases, and improve patient care.

There are numerous examples of these techniques applied in real-world applications, such as improving screening mammography and providing augmented intelligence in acute stroke management.

Large-scale data is crucial in improving the performance of ML models. However, having access to large, diverse healthcare datasets is often challenging and beyond the scope of a single organization. It can be difficult to balance the constraints of data privacy and locality while developing robust and generalizable ML models with large and diverse datasets.

Federated learning (FL) addresses the need to preserve privacy while still training ML models on large datasets. It develops a global model across multiple clients and discrete datasets without sharing input data, enabling different institutions to collaborate on ML model development without sharing sensitive clinical data.

The overall goal of FL is to generate more generalizable models that perform well across datasets, rather than ML models biased by the patient demographics or imaging equipment of a specific hospital or clinic.

The NVIDIA FLARE (which stands for Federated Learning Application Runtime Environment) platform provides an open-source Python SDK for collaborative computation and offers privacy-preserving FL workflows at scale. NVIDIA is an AWS Competency Partner that has pioneered accelerated computing to tackle challenges in AI and computer graphics.

In this post, we present NVIDIA FLARE on Amazon Web Services (AWS) and describe how healthcare organizations can benefit from FL workflows in the medical imaging domain. We use the scenario depicted in Figure 2 as a reference to demonstrate how multiple institutions can collaborate on ML model development using private medical imaging data.

NVIDIA FLARE

The NVIDIA FLARE platform is designed to streamline the use of federated learning techniques in various domains, such as manufacturing, financial services, and healthcare. The architecture of FLARE allows researchers and data scientists to adapt machine learning, deep learning, or general compute workflows in a federated paradigm, and enables secure, privacy-preserving multi-party collaboration.

FLARE’s underlying engine originated in NVIDIA Clara Train and has already been used in the development of numerous AI applications within the medical imaging research domain. Now available as a standalone open-source platform, FLARE’s architecture provides components for securely provisioning a federation, establishing secure communication, and defining and orchestrating distributed computational workflows.


Figure 1 – NVIDIA FLARE components.

NVIDIA FLARE is designed with a componentized architecture built on a specification-based API that allows researchers and developers to easily adapt and experiment with customized workflows and deployment scenarios.

The components outlined in Figure 1 work in conjunction to enable end-to-end federated learning workflows. Management tools are a set of libraries used during the initial provisioning of an FL environment, as well as for the orchestration and monitoring of tasks.

Within the FLARE runtime, the federated specification incorporates commonly used algorithms that illustrate best practices and simplify the development of common FL workflows for training, evaluation, and privacy preservation.

The learner configuration allows developers to leverage open-source frameworks such as PyTorch-based MONAI (Medical Open Network for Artificial Intelligence) to build, train, and experiment with ML models. MONAI includes many domain-optimized foundational capabilities and tools for developing imaging training workflows in a native PyTorch paradigm, which integrates well with NVIDIA FLARE.
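
As a concrete illustration, the snippet below sketches a minimal MONAI training setup of the kind a FLARE learner might wrap. The network, loss, and hyperparameters are illustrative assumptions, not part of any FLARE reference application.

```python
import torch
from monai.losses import DiceLoss
from monai.networks.nets import UNet

# A small 3D segmentation network; argument names follow recent MONAI releases.
model = UNet(
    spatial_dims=3,
    in_channels=1,
    out_channels=2,
    channels=(16, 32, 64, 128),
    strides=(2, 2, 2),
)
loss_fn = DiceLoss(to_onehot_y=True, softmax=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(batch):
    """One local training step; in FL, only the resulting weights leave the site."""
    inputs, labels = batch["image"], batch["label"]
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```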

Figure 2 demonstrates a typical FL workflow in a healthcare setting, where multiple healthcare organizations can collaborate on training global ML models without having to share sensitive or private data with each other.

In this example, a community hospital, a medical research center, and a cancer treatment center collaborate on model development using medical imaging DICOM (Digital Imaging and Communications in Medicine) datasets. Each organization benefits from the privacy-preserving nature of federated learning workflows.


Figure 2 – Typical federated learning workflow in a healthcare setting.

How it Works

NVIDIA FLARE provides a reference implementation of the components needed to provision a federation, establish server and client workflows, and orchestrate and monitor federated applications, exposed through the open provision, server controller, client worker, and admin APIs.

Figure 3 depicts the relationship between these components and the communication patterns among them.


Figure 3 – High-level API interaction within NVIDIA FLARE federated workflow.

The first step in establishing a federation is provisioning, in which the open provision API and its builder modules are used to create the identities of the server, clients, and admin clients. Provisioning is separate from what is considered “operational federated learning” and happens only at the outset of a project.
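
For illustration, a minimal project file for the scenario in this post might look like the sketch below; the exact schema and builder modules vary between FLARE releases, so treat the field names as assumptions and consult the documentation for your version. Running the provision tool (for example, nvflare provision -p project.yml in recent releases) then generates a startup kit for each participant, containing its identity certificates and startup scripts.

```yaml
# project.yml - minimal provisioning sketch (roles and builder modules omitted)
api_version: 3
name: medical-imaging-study
participants:
  - name: flare-server
    type: server
    org: medical-research-center
  - name: community-hospital
    type: client
    org: community-hospital
  - name: cancer-treatment-center
    type: client
    org: cancer-treatment-center
  - name: admin@medical-research-center.org
    type: admin
    org: medical-research-center
```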

Operational federated learning is shown in Figure 3 as the interaction of server, clients, and admin client to execute a federated application. This is an ongoing process and may comprise many federated experiments within a project.

There are two communication channels in use during this phase of federated learning: 1) secure client-server communication over gRPC; and 2) secure admin-server communication over TCP. Both channels use shared Secure Sockets Layer (SSL) certificates generated during provisioning to establish the identities and secure communication between participants.

In the client-server communication model, all communications are initiated by the clients. The server’s response to the client’s request determines the action to be performed by the clients.

Similarly, in the admin-server communication model, the admin client issues commands or deploys applications to the server. The server sends a response to the admin client and changes its state based on subsequent client requests. In this way, the admin tool can be used to upload an application to the server and instruct it to deploy the application to clients and begin client execution.
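
An admin session following this pattern might look like the sketch below. Command names differ between FLARE releases (newer releases, for instance, consolidate the upload/deploy/start sequence into a single submit_job command), so treat this as indicative rather than exact.

```
> check_status server          # verify the server is up and clients are registered
> upload_app hello-monai      # upload the application to the server
> deploy_app hello-monai all  # deploy it to the server and all clients
> start_app all               # begin execution across the federation
```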

Federated Applications – Controller and Worker API

NVIDIA FLARE collaborative computing is achieved through controller/worker interaction. The controller is a Python object that controls or coordinates workers to perform tasks. It runs on the server and defines the overall collaborative computing workflow.

In its control logic, the controller assigns tasks to workers and processes task results from the workers. The controller and worker APIs are used to implement task-based interactions defined in the FLARE Application, as shown in Figure 4.
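
A condensed sketch of a custom controller is shown below. It broadcasts a single training task (tasks are described next) and follows the shape of the published controller API, though the import paths and method signatures should be checked against the FLARE release in use.

```python
from nvflare.apis.impl.controller import Controller, Task
from nvflare.apis.shareable import Shareable

class SimpleTrainingController(Controller):
    def start_controller(self, fl_ctx):
        pass  # set up components (aggregator, persistor, etc.) here

    def control_flow(self, abort_signal, fl_ctx):
        # Broadcast a "train" task and block until enough clients respond.
        task = Task(name="train", data=Shareable())
        self.broadcast_and_wait(
            task=task,
            fl_ctx=fl_ctx,
            min_responses=2,
            abort_signal=abort_signal,
        )

    def stop_controller(self, fl_ctx):
        pass  # clean up resources here
```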

A task is a piece of work (Python code) that’s assigned by the controller to client workers. Depending on how the task is assigned (broadcast, send, or relay), it will be performed by one or more clients. In the application, server configuration defines the components to be used in the controller workflow.

For example, server configuration may define the aggregator to accumulate client task data, a model persistor to initialize and save models, and the shareable object to exchange data. The server configuration also defines the controller workflow—for example, scatter and gather—that leverages these components with a series of tasks that are broadcast to the client participants for execution.
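
A server configuration along these lines, adapted from FLARE's published hello-world examples, is sketched below. The component module paths change between FLARE releases, and custom.model.SimpleNetwork is a hypothetical model class, so verify both against your installation.

```json
{
  "format_version": 2,
  "components": [
    {
      "id": "persistor",
      "path": "nvflare.app_common.pt.pt_file_model_persistor.PTFileModelPersistor",
      "args": {"model": {"path": "custom.model.SimpleNetwork"}}
    },
    {
      "id": "shareable_generator",
      "path": "nvflare.app_common.shareablegenerators.full_model_shareable_generator.FullModelShareableGenerator",
      "args": {}
    },
    {
      "id": "aggregator",
      "path": "nvflare.app_common.aggregators.intime_accumulate_model_aggregator.InTimeAccumulateWeightedAggregator",
      "args": {"expected_data_kind": "WEIGHT_DIFF"}
    }
  ],
  "workflows": [
    {
      "id": "scatter_and_gather",
      "path": "nvflare.app_common.workflows.scatter_and_gather.ScatterAndGather",
      "args": {
        "min_clients": 2,
        "num_rounds": 5,
        "train_task_name": "train",
        "aggregator_id": "aggregator",
        "persistor_id": "persistor",
        "shareable_generator_id": "shareable_generator"
      }
    }
  ]
}
```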


Figure 4 – Controller and worker task interaction between FLARE server and FLARE client.

The client configuration defines the set of tasks that are available for execution in a client worker, along with the path to the code that implements the task and any arguments required for executing it.

There may not be a one-to-one mapping between tasks assigned in the server controller workflow and client configuration. A client may be configured to be capable of additional tasks or only a subset of the tasks in the global workflow.
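
A corresponding client configuration might look like the sketch below, where custom.trainer.SimpleTrainer and its epochs_per_round argument are hypothetical names for site-specific training code.

```json
{
  "format_version": 2,
  "executors": [
    {
      "tasks": ["train"],
      "executor": {
        "path": "custom.trainer.SimpleTrainer",
        "args": {"epochs_per_round": 2}
      }
    }
  ],
  "task_result_filters": [],
  "task_data_filters": []
}
```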

The relationship between these components is shown in Figure 4, where the server controller workflow defines task assignments that are broadcast and executed on the client worker. The results are then returned to the server for aggregation.
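
The hypothetical SimpleTrainer referenced in the client configuration above could be implemented as an executor along the following lines; the class follows the shape of FLARE's executor spec, with the actual training logic elided.

```python
from nvflare.apis.executor import Executor
from nvflare.apis.fl_constant import ReturnCode
from nvflare.apis.fl_context import FLContext
from nvflare.apis.shareable import Shareable, make_reply
from nvflare.apis.signal import Signal

class SimpleTrainer(Executor):
    def __init__(self, epochs_per_round: int = 2):
        super().__init__()
        self.epochs_per_round = epochs_per_round

    def execute(self, task_name: str, shareable: Shareable,
                fl_ctx: FLContext, abort_signal: Signal) -> Shareable:
        if task_name != "train":
            return make_reply(ReturnCode.TASK_UNKNOWN)
        # Train on local, private data here, then return the updated
        # weights (or weight diffs) to the server for aggregation.
        result = Shareable()
        # ... populate result with the model update ...
        return result
```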

As the diagram shows, filtering can be applied on both the server and client sides, on task assignment as well as on task result submission.
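
As a sketch of what such a filter looks like, the class below follows FLARE's filter interface and drops a hypothetical metadata field from task results before they leave the client.

```python
from nvflare.apis.filter import Filter
from nvflare.apis.fl_context import FLContext
from nvflare.apis.shareable import Shareable

class DropMetadataFilter(Filter):
    """Illustrative result filter: remove a hypothetical field before sharing."""

    def process(self, shareable: Shareable, fl_ctx: FLContext) -> Shareable:
        shareable.pop("site_metadata", None)  # "site_metadata" is a made-up key
        return shareable
```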

Privacy and Security

NVIDIA FLARE employs differential privacy and homomorphic encryption to preserve data privacy during a federated learning workflow. The filters described in the previous section are entirely customizable as part of the server and client configuration, and can be used to implement these security measures.

NVIDIA FLARE ships with reference filters to implement privacy preservation through exclusion of variables, truncation of weights by percentile, or sparse vector techniques. It also provides a framework for homomorphic encryption through encryption and decryption filters, which can be used by clients to encrypt data before sharing it with a peer.
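
Wiring a reference filter into a client is a configuration change. The fragment below, for example, attaches the ExcludeVars filter to training results so the named variables never leave the site; the variable name shown is illustrative, and the module path should be checked against your FLARE release.

```json
"task_result_filters": [
  {
    "tasks": ["train"],
    "filters": [
      {
        "path": "nvflare.app_common.filters.exclude_vars.ExcludeVars",
        "args": {"exclude_vars": ["model.fc.bias"]}
      }
    ]
  }
]
```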

FLARE allows the server to operate on encrypted data for aggregation and return encrypted results that are decrypted by clients to continue local training. In this model, the server does not hold decryption keys, ensuring that unencrypted data is visible only to the client that owns the data and its decryption key.

NVIDIA FLARE on AWS

The compute and network requirements for the FLARE server and client systems depend on the workflow to be executed on both the server and client side. A federated learning workflow can be deployed on AWS with multiple possible configurations.

Unless GPU-based data processing is implemented as part of the server’s controller workflow, a GPU is typically not required for the server. The bulk of the computation happens on the clients, which typically benefit from GPU acceleration.

Network requirements depend on the data shared between the server and client systems. In the case of deep learning, the data shared between clients and server is the model weights (or updates to weights), which means the network bandwidth is a function of both the model architecture and number of clients.
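
To put rough numbers on this: a model with 25 million parameters stored as 32-bit floats produces about 100 MB of weight data per update, so ten clients each submitting an update per round send roughly 1 GB to the server every round, before the aggregated model is broadcast back out.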

For a simple demo, dedicated network bandwidth is typically not a concern. In a real-world use case, however, it may be beneficial to provision clients with gigabit networking and the server with 10+ gigabit networking, depending on the number of clients.

You can use a standalone workstation or laptop for provisioning the NVIDIA FLARE startup kits and running the admin client. The FLARE server and clients are provisioned on Amazon Elastic Compute Cloud (Amazon EC2) instances.

We will continue using our example federated learning environment from Figure 2, with three independent healthcare providers forming a federation in a hybrid cloud/on-premises environment.

Figure 5 below shows the community hospital hosting a FLARE client environment in an on-premises data center, while the medical research center and cancer treatment center run on AWS. The scenario also shows multiple researchers and a study owner interfacing with the environment and collaborating on model development.

Let’s review the pattern employed by the members of the federation in this scenario.


Figure 5 – NVIDIA FLARE deployment on AWS.

The community hospital is a member of the federation and runs an on-premises FLARE client. Researcher A uses an ML workstation to interface with the FLARE client and perform training tasks.

The FLARE client has access to the private DICOM data and communicates with the FLARE server (inside the medical research center) via a virtual private network (VPN) connection. The VPN connection is used for gRPC traffic and terminates into an AWS Transit Gateway within the medical research center’s AWS account.

The medical research center is the primary entity hosting the study and the FLARE server instance. It also has a segregated clinical research department, which is part of the federation and hosts a FLARE client instance with private medical imaging (DICOM) data.

The medical research center’s FLARE server and clients are hosted in two distinct virtual private clouds (VPCs) to keep resources isolated. The FLARE server environment is set up to support highly available deployments, where multiple instances can run on general-purpose EC2 instances within an Auto Scaling group. The FLARE server also uses Amazon Simple Storage Service (Amazon S3) for configuration data, which resides outside the VPC and is accessed through an S3 endpoint.

The study owner/lead researcher interfaces with the FLARE server instance using the admin client workstation via TCP-based traffic. To enable communication with the members of the federation (FLARE clients) over gRPC, we provision an AWS Transit Gateway which acts as the hub to interconnect between the VPCs and on premises.

The FLARE client environment in the second VPC leverages a GPU-optimized EC2 instance inside a private subnet. The client instance has access to private DICOM data, which is accessed via an S3 endpoint and used for model development. Researcher B interfaces with the FLARE client using an ML workstation.

The cancer treatment center is a member of the federation and hosts a FLARE client in its own AWS account. The FLARE client environment follows a similar pattern to the medical research center’s client VPC when it comes to communication with the FLARE server, although the communication is cross-account. The FLARE client has access to a private DICOM dataset via an S3 endpoint, and Researcher C interfaces with the client using an ML workstation.

Summary

In this post on NVIDIA FLARE, we introduced the key concepts of federated learning (FL) and the challenges it addresses in the medical imaging domain. We described the NVIDIA FLARE architecture, key components that enable FL workflows, and their interaction.

We also discussed the security and privacy features built into the FLARE platform and how it can be used with AWS in a hybrid healthcare provider setting. We encourage you to deploy NVIDIA FLARE examples or your own custom application on AWS and welcome feedback and feature requests on the NVIDIA FLARE GitHub.

Much of what’s included in this guide is based on the built-in features of NVIDIA FLARE. Because the framework is entirely specification-based and implemented in flexible APIs, these basic steps can be easily extended. Using this blueprint, one could, for example, include custom builder modules to automate project provisioning on AWS or automate administration of the FLARE experiment by leveraging the admin API rather than the command line interface (CLI).



NVIDIA – AWS Partner Spotlight

NVIDIA is an AWS Competency Partner that has pioneered accelerated computing to tackle challenges in AI and computer graphics.

Contact NVIDIA | Partner Overview | AWS Marketplace