AWS Partner Network (APN) Blog

Building a Cloud-Native Architecture for Vertical Federated Learning on AWS

By Hiroki Moriya, Sr. Research Engineer – DOCOMO Innovations
By Yoshitaka Inoue, Sr. Data Scientist – DOCOMO Innovations
By Qiong Zhang, Kris Skrinak, Rumi Olsen, Sr. Partner Solutions Architects – AWS


In recent years, machine learning (ML) has become a must-have technology across industries, and the backbone of any ML system is data. As more data becomes available for training, the performance of an ML model is expected to improve.

When gathering data for training, we often face geographically dispersed data sources such as Internet of Things (IoT) devices, smartphones, and edge locations. There are two major challenges when organizations try to centralize data from these data sources for the purpose of building ML models.

The first challenge is privacy. For example, the GDPR (General Data Protection Regulation) and CCPA (California Consumer Privacy Act) are privacy regulations that impose data residency requirements and may prevent data from moving to other locations. Organizations also have their own privacy policies that can prevent them from disclosing data to third parties.

The second challenge is the cost of data transfer. Many organizations have their own data lakes and, typically, there are two types of costs involved: storage and data transfer. For the latter, as data size grows, transfer becomes increasingly expensive, since moving large amounts of data requires high bandwidth and stable network connections.

Federated learning (FL) addresses these challenges in machine learning. It’s a distributed ML technique that doesn’t require data to be centralized, and it doesn’t disclose data to other parties while building the model. In federated learning, a central server coordinates a group of FL clients (such as devices, organizations, or companies) with local datasets to train a global ML model.

There are two flavors of FL which cover different use cases:

  • Horizontal federated learning (HFL) covers use cases where the distributed datasets at all FL clients have the same set of features.
  • Vertical federated learning (VFL) covers use cases where the distributed datasets at FL clients can have different sets of features, as long as there are overlapping samples (identified by a common key) that the clients can map to the same entity.

DOCOMO Innovations focuses on federated learning, and particularly on VFL, because it has the potential to achieve better model performance through collaboration with other data providers. DOCOMO Innovations has been investigating the VFL algorithm and its implementation on Amazon Web Services (AWS) for real-world scenarios.

In this post, we present a cloud-native architecture for VFL on AWS, and describe how DOCOMO Innovations implemented it as a serverless application that minimizes operational costs.

DOCOMO Innovations is a subsidiary of NTT DOCOMO, an AWS Select Tier Services Partner and Japan’s largest telecommunications provider. It delivers innovative, convenient, and secure mobile services that enable customers to realize smarter lives.

Vertical Federated Learning

In VFL, the server and multiple clients work together to train a global ML model by exchanging intermediate data (embeddings and gradients), while the server doesn’t access the local data stored at the clients.

Figure 1 shows an example of vertical federated learning. The global model consists of four sub-networks (models) including one server model and three client models. The output layers of the client models are connected with the input layer of the server model.


Figure 1 – Example of vertical federated learning.

In this example, the raw, original data remains at each client, and the server holds the label data. The clients don’t need to have the same set of features (column data), which provides flexibility in the datasets.

The only cross-client requirement on the data is a common unique identifier (UID) that uniquely identifies a corresponding data sample across the clients. For example, a customer ID or email address could serve as the UID. With a common UID, the local datasets and the label data can be shuffled into the same order.
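
As a minimal sketch (with made-up tables and column names), each party can sort its rows by the shared UID so that row i at every client refers to the same sample:

```python
import pandas as pd

# Hypothetical local tables at two clients; "uid" is the shared identifier.
client_a = pd.DataFrame({"uid": [3, 1, 2], "age": [34, 51, 27]})
client_b = pd.DataFrame({"uid": [2, 3, 1], "income": [48000, 61000, 52000]})

# Each party independently sorts its own rows by the common UID, so that
# row i at every client (and in the server's label data) is the same sample.
client_a = client_a.sort_values("uid").reset_index(drop=True)
client_b = client_b.sort_values("uid").reset_index(drop=True)

assert (client_a["uid"] == client_b["uid"]).all()
```

No raw feature columns need to be exchanged for this step; only the UID ordering must be agreed on.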

The global model is trained across the server and the clients. In a single training round, each client feeds its own data forward through its client model and sends embedding files to the server. The server feeds them forward through the server model, calculates the loss and gradients, updates the server model, and then returns the corresponding gradient files back to each client.
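
The round described above can be sketched in PyTorch. The layer sizes, two-client setup, and loss function below are illustrative assumptions, not the actual DOCOMO Innovations implementation:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sub-networks: two client models and one server model whose
# input layer consumes the concatenated client embeddings.
client_models = [nn.Linear(4, 8), nn.Linear(6, 8)]
server_model = nn.Linear(16, 1)
opts = [torch.optim.SGD(m.parameters(), lr=0.1)
        for m in client_models + [server_model]]

# One training round on a batch of 32 samples, aligned by UID order.
batches = [torch.randn(32, 4), torch.randn(32, 6)]
labels = torch.randint(0, 2, (32, 1)).float()   # labels live at the server

# Each client feeds its batch forward and "sends" only the embedding.
local_embs = [m(x) for m, x in zip(client_models, batches)]
received = [e.detach().requires_grad_() for e in local_embs]  # server's view

# Server forward pass, loss, and gradients w.r.t. the embeddings.
out = server_model(torch.cat(received, dim=1))
loss = nn.functional.binary_cross_entropy_with_logits(out, labels)
loss.backward()

# Server returns each embedding's gradient; each client backprops locally.
for emb, recv in zip(local_embs, received):
    emb.backward(recv.grad)

for opt in opts:
    opt.step()
    opt.zero_grad()
```

Note that the raw batches never cross the boundary; only the embeddings travel to the server and only their gradients travel back.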

The VFL workflow consists of two phases: the training phase and the validation phase. It repeats the two phases until the loss from the validation phase converges or the number of training epochs defined as a hyperparameter is reached.

Figure 2 shows the inference workflow, although its details are out of scope for this post. A user of the service can get an inference result by requesting a specific UID. The server model and client models work together to produce the inference result based on the latest data and return it to the user.


Figure 2 – Inference workflow.

Building Vertical Federated Learning on AWS

Next, we will show you a VFL implementation based on a serverless architecture, which helps reduce operational costs in real-world use cases.

Figure 3 shows the AWS reference architecture of VFL. The workflow on the server is managed by an AWS Step Functions state machine which orchestrates AWS Lambda functions and the steps of interaction with all clients.

The server and each client communicate through Amazon Simple Queue Service (SQS) messages. An Amazon Simple Storage Service (Amazon S3) bucket works as an intermediary between the server and clients for exchanging the object files required for building the model.

Lambda functions are used for training and validation, and are deployed as container images with Python ML libraries such as PyTorch and scikit-learn.

The use of containers is advantageous for packaging and deploying as we need multiple libraries and assets for this ML use case. You can replace them with other computing services like Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Container Service (Amazon ECS) tasks, or Amazon SageMaker training jobs if more computing resources including GPU are needed.

Lambda functions load embedding and gradient files into the function’s temporary storage (/tmp) for computation. This works because our example dataset doesn’t require persistent storage or more than the 10 GB of ephemeral storage that Lambda provides. However, Amazon Elastic File System (Amazon EFS) is a good option if you have such requirements.


Figure 3 – VFL architecture on AWS.

VFL Workflow Implementation Details

Now, let’s dive deep into the VFL workflow to understand how it works. Figure 4 shows the flow of the training phase. As is common in ML, the training dataset is divided into batches, and both the server and clients process one batch in each training round.

The following describes each step:

  1. The server requests embeddings from all clients.
  2. Each client calculates the embeddings by using its local model and its specific local batch.
  3. Each client sends the embeddings back to the server.
  4. The server calculates the loss and gradients using the server model and the embeddings received from the clients, and updates the server model.
  5. The server sends the gradients to the clients.
  6. Each client updates its local model using the gradients it received.
  7. Return to Step 1 with the next batch of the dataset until all the batches are processed.
  8. Repeat the above steps until the epoch count is achieved or the model converges.


Figure 4 – Training workflow.

The validation phase determines whether the model has converged. It’s similar to the training phase, except that it doesn’t have gradient updates. The server judges whether the loss has converged and, if so, sends a message telling the clients to stop the VFL workflow.
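
One simple convergence rule, for illustration, is early stopping on the validation loss; the patience and threshold values here are hypothetical:

```python
def has_converged(val_losses, patience=3, min_delta=1e-4):
    """Return True when the validation loss has not improved by at least
    min_delta for `patience` consecutive validation phases (hypothetical rule)."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    # Converged if none of the most recent `patience` losses beat the earlier best.
    return all(loss > best_before - min_delta for loss in val_losses[-patience:])
```

When this returns True, the server can send the stop message to the clients instead of another embeddings request.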

Workflow Management

For the VFL workload, the workflow management is a key component. AWS Step Functions works well for orchestrating the request/response between the server and the clients.

Figure 5 shows the entire workflow defined by Step Functions state machines. It consists of three state machines: main, training, and validation. The main state machine triggers the training and validation state machines by calling the Step Functions StartExecution API.

AWS Step Functions has a quota of 25,000 events in a single state machine execution history. The VFL workflow runs many steps for training and validation, so it may reach this quota. To avoid an unexpected end to the workflow, we follow the recommended practice of splitting the work across separate state machine executions so that no single execution approaches the limit.


Figure 5 – The state machine in AWS Step Functions.

We use the callback pattern to integrate the clients’ training and validation steps with the state machine. When the state machine needs an action from a client, it pauses the workflow and sends a message including a task token to the client through Amazon SQS. The client takes the action described in the message and sends the token back to the state machine. Once the state machine receives the task token, it resumes the workflow.

We also use the Map state in Step Functions to communicate with all clients in parallel. The Map state waits until the actions for all clients are complete, then passes all of the clients’ messages to the next step.

Interaction Between the Server and Clients

The interaction between the server and clients is based on SQS queues and an Amazon S3 bucket. For messages sent from the server to the clients, we use one SQS queue per client. The messages may contain the names of files for the clients to download from S3.

Similarly, the callback messages sent from the clients to the server may include the names of files for the server to download from the S3 bucket. We do not include the embedding and gradient files directly in SQS messages, since they are binary files and can be very large.

Model Accuracy and Training Time

We evaluated the accuracy of the global model trained with VFL as the number of clients varies between 2 and 4. Each client may contribute unique features to the global model, so a higher number of clients results in more features being input to the VFL global model.

In our experiments, the server is deployed in the Oregon (us-west-2) region and the clients are deployed on Amazon EC2 instances in different regions, as shown below.

Client Region
#1 Oregon – us-west-2
#2 Virginia – us-east-1
#3 Ireland – eu-west-1
#4 Tokyo – ap-northeast-1

We used the Adult dataset from the UCI Machine Learning Repository in these experiments. The dataset consists of 14 features, which are divided into four subsets, with each subset associated with one client.

The table below shows the ROC-AUC and total training time as the number of clients changes. You can see that the ROC-AUC improves as the number of clients increases, indicating that the model performs better as more clients are involved and more features are trained into the VFL global model.

The total training time grows as the number of clients increases. It depends on many factors, including the computing time at the server and clients, the size of the embedding and gradient files, and the network bandwidth and latency between the server and clients.

In our experiments, when Client 3 joins Clients 1 and 2, the total training time increases significantly, from 1,187 seconds to 1,575 seconds. This is because Client 3 is located in the eu-west-1 region, while the others are in U.S. regions, so the network latency between the server and Client 3 is much higher than the latency between the server and Clients 1 and 2.

Clients ROC-AUC Training Time (s)
#1 + #2 0.8117 1,187
#1 + #2 + #3 0.8887 1,575
#1 + #2 + #3 + #4 0.9007 1,758

Note: the number of epochs is 10 and the batch size is 1,024.


Conclusion

In this post, we introduced vertical federated learning (VFL) and a cloud-native architecture for VFL on AWS. VFL is a good fit for machine learning environments with privacy concerns and data transfer constraints.

The cloud-native VFL architecture leverages AWS serverless services, including AWS Step Functions, AWS Lambda, Amazon SQS, and Amazon S3, which helps minimize operational costs in real-world use cases.

DOCOMO Innovations ran experiments on the proposed architecture, and the experimental results showed the effectiveness of VFL. You can deploy or customize the implementation by referring to the template published on GitHub. In this implementation, the VFL clients are separate AWS Identity and Access Management (IAM) users under the same AWS account.


NTT DOCOMO – AWS Partner Spotlight

DOCOMO Innovations is a subsidiary of NTT DOCOMO, an AWS Select Tier Services Partner and Japan’s largest telecommunications provider. It delivers innovative, convenient, and secure mobile services that enable customers to realize smarter lives.

Contact NTT DOCOMO | Partner Overview | AWS Marketplace