
General

This document describes the architecture and overall idea of the federated learning project we are currently working on, as well as the initial steps for environment setup.

High level overview

Our solution is built mostly on existing, established technologies in the domain.

For several months, we conducted extensive market research, in which we identified several different federated learning (FL) frameworks. Some of the frameworks weren't actually able to run in distributed settings (only on localhost), some weren't able to utilize different machine learning models, while others did not come close to the reliability and resiliency we were looking for. After careful testing of different aspects, we decided to proceed with the FL framework NVFlare.

NVIDIA NVFlare

NVFlare is an FL framework created and maintained by NVIDIA. Originally, NVIDIA Clara featured FL capabilities specific to the medical domain. These became more general over time and evolved into a separate framework, which was open-sourced at RSNA 2021. Since then, FLARE has continued to evolve and is, to the best of our knowledge, the best framework to use today.

The links in the section below may be outdated. Please check in the bottom-left corner of the readthedocs.io site whether the selected version is the newest one.

Main features

While NVFlare is a federated learning library, it provides a large number of features that may be useful depending on your use case. Please check the documentation page.

Modus Operandi

Federating with NVFlare involves several steps:

1. Provisioning

Provisioning is the step in which the server address and ports, admin user names, client names and addresses, site owners, etc. are defined. This step is described in a .yaml file, from which zip files with everything necessary to run the server and clients are generated. These files include several scripts, several JSON configs (most of which shouldn't change once deployed, due to checksumming) and the certificates used to protect data transfer (TLS). Provisioning thus creates the server and client deployment zips (the so-called startup kits).

In earlier versions of NVFlare, provisioning was a static step that had to be done at the start of the experiment; if one wanted to add new sites during an experiment, that wasn't possible and provisioning had to be done again. Nowadays, NVFlare supports dynamic provisioning, which allows new sites to be provisioned even during experiments. Provisioning itself doesn't need to be done on the server or a client; it can be done on any computer with NVFlare installed, whether or not it participates in FL experiments.
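As a rough illustration, provisioning can be driven from any machine with NVFlare installed. A minimal sketch, assuming a hypothetical project.yml in the current directory; the exact output layout and whether kits come zipped or as plain folders depend on the NVFlare version and the configured builders:

```python
import subprocess
from pathlib import Path

# Hypothetical project file describing the server address/ports, admin users and client sites.
project_file = Path("project.yml")

# Run the NVFlare provisioning CLI; it reads project.yml and generates the startup kits
# (scripts, JSON configs and TLS certificates) into a workspace folder.
subprocess.run(["nvflare", "provision", "-p", str(project_file)], check=True)

# The generated kits can then be distributed to the server and the client sites.
for kit in Path("workspace").rglob("*.zip"):
    print("Generated kit:", kit)
```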

2. Running the clients and server

Now that we have the zips, we distribute them to the clients and the server. In general, NVFlare clients can run at the OS level or in a container. Everything a client or the server needs in order to run is an NVFlare environment (conda or venv) with NVFlare and its dependencies installed. The server and clients are started via a bash script.
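For illustration, starting a site from its unpacked kit boils down to invoking the startup script inside the prepared environment. A minimal sketch, assuming a hypothetical kit location (script names may differ slightly between NVFlare versions):

```python
import subprocess
from pathlib import Path

# Hypothetical location of an unpacked startup kit (server or client).
kit_dir = Path("/opt/nvflare/site-1")

# Each kit ships a startup folder with a bash start script; running it inside the
# conda/venv that has NVFlare installed brings the site up and connects it to the server.
subprocess.run(["bash", str(kit_dir / "startup" / "start.sh")], check=True, cwd=kit_dir)
```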

3. Launching the experiment

Once the server is launched and the clients are connected, we can actually launch the experiment. The two possible ways to do this are via the Admin Client or the FLAdminAPI. In order to launch an experiment, the code needs to follow some guidelines and accommodate some NVFlare-specific APIs. We will talk about the specifics in ... However, NVFlare should be able to work with TensorFlow, PyTorch, MONAI and scikit-learn models.
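As a hedged sketch of the programmatic route: newer NVFlare releases expose a FLARE API session alongside the FLAdminAPI mentioned above, and submitting a job through it looks roughly like the snippet below. The user name, paths and job folder are hypothetical placeholders.

```python
from nvflare.fuel.flare_api.flare_api import new_secure_session

# Hypothetical values: the admin user's startup kit and a prepared NVFlare job folder.
ADMIN_USER = "admin@example.com"
ADMIN_STARTUP_KIT = "/opt/nvflare/admin@example.com"
JOB_FOLDER = "/opt/jobs/hello-pt"

# Open a secure admin session against the FL server defined in the startup kit.
sess = new_secure_session(ADMIN_USER, ADMIN_STARTUP_KIT)
try:
    job_id = sess.submit_job(JOB_FOLDER)  # returns the server-assigned job id
    print("Submitted job:", job_id)
finally:
    sess.close()
```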

4. Monitoring the experiment

Once the experiment is launched, all the clients and the server output their results to the standard output channels. This makes it possible to monitor what is happening, what the metrics are, etc. Once the experiment finishes, one can download the aggregated model from the server.
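Beyond tailing the standard output, the same admin session from the previous sketch can be used to poll the job. A minimal sketch, assuming the `sess` and `job_id` defined above; result-download support varies by NVFlare version:

```python
# Block until the job reaches a terminal state.
rc = sess.monitor_job(job_id)
print("Job finished with:", rc)

# Fetch the job metadata (status, submit time, etc.) for a quick sanity check.
print(sess.get_job_meta(job_id))

# Depending on the NVFlare version, the aggregated result/workspace can also be
# downloaded through the session (otherwise use the admin console's download command).
# result_dir = sess.download_job_result(job_id)
```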

Well... this seems like a lot of manual work, right? Provisioning by hand on the command line, submitting by hand on the command line, launching experiments, jumping between VMs/containers... Also, while NVFlare is continuously developed to be as easy to work with as possible, it still requires quite a bit of engineering experience. To the best of our knowledge, no FL framework currently offers a reasonable user interface for model management, code deployment, etc. To ease the NVFlare workflow, we therefore needed a front-end. After considering the possibilities and the supporting technologies we could use, we decided to go with Azure Machine Learning.

Azure Machine Learning

Azure Machine Learning (AML) is an environment for machine learning in Azure, provided and developed by Microsoft.

Features

  1. Creating ML experiments in three different ways - Jupyter notebooks, AutoML and visual programming (Designer)
  2. Management of assets needed for or created by experiments - data, jobs, components, Python environments, models
  3. Management of services - compute, data labeling

For a deeper overview of AML capabilities, please refer to this presentation by Microsoft, provided to SHS.

This environment is created, maintained and actively developed by Microsoft, requiring zero maintenance effort on our side.

However, FL is based on the premise that data shouldn't leave the on-premises environment. We therefore also needed a local environment where we could launch an NVFlare client and do local computations, while still being able to control it via AML. The answer was Azure Arc.

Azure Arc

Azure Arc is a hybrid cloud technology that, in short, allows one to register their own server/Kubernetes cluster to Azure and submit tasks to it and control it from there. There are two options: Arc-enabled Kubernetes clusters or Arc-enabled servers. We went with the Kubernetes version, as it allows us to separate our training from the host.

In our scenario, Kubernetes clusters are registered to Azure via Azure Arc, and we deploy machine learning tasks/FL clients there through AML.
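For illustration, attaching an Arc-enabled Kubernetes cluster as an AML compute target could look roughly like the sketch below. All names, IDs and the namespace are hypothetical, and the AML Kubernetes extension is assumed to be installed on the cluster already.

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import KubernetesCompute
from azure.identity import DefaultAzureCredential

# Hypothetical workspace coordinates.
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<aml-workspace>",
)

# Resource ID of the Arc-enabled (connected) Kubernetes cluster, as registered in Azure.
arc_cluster_id = (
    "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
    "/providers/Microsoft.Kubernetes/connectedClusters/<cluster-name>"
)

# Attach the cluster to the AML workspace so jobs (e.g. NVFlare clients) can target it.
compute = KubernetesCompute(name="hospital-vishnu-k8s", resource_id=arc_cluster_id, namespace="azureml")
ml_client.begin_create_or_update(compute).result()
```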

What does AML provide for our project?

  1. Access control, budgeting, monitoring, logs - all the power Azure provides for resource management is also available for AML workspaces
  2. Code development in Jupyter notebooks - synced with Git, developed in VS Code (through the AML integration) or in the browser
  3. Data management - if there is also cloud data to be used from Azure, it can be registered in the workspace and managed from there
  4. Job management and logging - experiment and task management. We are able to get logs and logged files from every on-premises machine registered through Arc
  5. Container and conda management - as AML tasks are submitted as containers, Environments allow one to specify a Dockerfile and a conda config for the particular Environment in which the code/task is deployed
  6. Model management - models are logged directly in the Models tab, through the MLflow integration
  7. Cluster/machine monitoring - AML has an interface for checking the currently active machines, their utilization, the jobs running on them, job history, etc.
  8. Easy access to VMs and addition of VM sites - there may be cases where data is stored in the cloud and we need to process it and add it to FL without actually having on-premises HW at the site

Fig. 1 - Example of mixed on-prem/cloud deployment

General control flow

As mentioned, the whole NVFlare environment is controlled from a single point: an .ipynb notebook in AML Studio. This notebook initializes the environment, provisions the clients and the server based on the current config, creates Docker containers with the necessary information for both the server and the clients, and submits them.
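As a rough sketch of what the notebook does under the hood when submitting a site (the compute name, environment name and start command are hypothetical placeholders, not the exact contents of submit-fl-jobs.ipynb):

```python
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<aml-workspace>",
)

# Submit one command job per site; the environment wraps the Dockerfile with NVFlare,
# the startup kit and the job code baked in.
client_job = command(
    command="bash startup/start.sh && sleep infinity",  # placeholder start command
    environment="nvflare-client-env@latest",            # hypothetical registered environment
    compute="hospital-vishnu-k8s",                      # Arc-attached Kubernetes compute
    display_name="nvflare-client-hospital-vishnu",
)
returned_job = ml_client.jobs.create_or_update(client_job)
print("Submitted:", returned_job.name)
```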

Afterwards, once everything has been launched, the notebook initializes the NVFlare admin client against the server and allows one to submit their training code. In the notebook, one can check all the returned status codes and, through the admin client, also check the status of jobs. Metrics and models are submitted to MLflow, which is part of the AML workspace.

Thus, in the AML workspace one can also monitor the training at the level of each participant, through the jobs submitted to the clients and the server, and view their metrics and logs there. Some of the logs, metrics and status codes should also be available through the admin client. While the admin client can abort the training, it is also possible to abort it by stopping the jobs at the AML level.

Fig. 2 - Sequence diagram of the solution

General workflow visualisation

In order to explain the workflow of our solution, we briefly demonstrate what happens in a 4-client setup (3 on-premises, 1 cloud). We assume that the infrastructure is already in place. The setup is the following:

Clients:

Hospital Vishnu - Azure Arc enabled Kubernetes on-prem, client

Hospital Shiva - Azure Arc enabled Kubernetes on-prem, client

Hospital Vishnu - Azure Arc enabled Kubernetes on-prem, client

Azure Compute Cluster - Azure compute cluster in Azure cloud, client

Server:

Fedserver - Azure VM, admin and server

Compute instance - Azure compute instance, which powers the machine where the .ipynb notebook runs

Fig. 3 - Initial state after setup of architecture

Until NVFlare is deployed and fully functional, the Jupyter notebook running on the compute instance (currently submit-fl-jobs.ipynb) is the single point of control of the whole system.

The whole environment (libraries, environment variables and everything else) is created and assembled here before being submitted to the clients in the form of a Dockerfile, which is then turned into an image and run as a container. This is basically what happens up to the section Submit NVIDIA FLARE Job in the .ipynb file. The NVFlare jobs are submitted as part of the container, and we currently don't have a mechanism to copy them into the containers during the training itself, so please make sure to configure the paths in the .ipynb properly.
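For illustration, registering such a Dockerfile-based environment in AML could look roughly like this sketch (the build context path and the environment name are hypothetical):

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import BuildContext, Environment
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<aml-workspace>",
)

# The build context folder is assumed to contain the Dockerfile plus everything baked
# into the image: the NVFlare startup kit, the conda config and the NVFlare job folders.
env = Environment(
    name="nvflare-client-env",
    build=BuildContext(path="./docker_context", dockerfile_path="Dockerfile"),
    description="NVFlare client image built by the submit-fl-jobs notebook",
)
ml_client.environments.create_or_update(env)
```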

Fig. 4 - State of the machines after deployment of docker containers

Once the Dockerfiles are deployed to the machines and turned into containers, the NVFlare clients and server start running automatically. Most of the time, the server is the first to start, due to its smaller number of dependencies.

Fig. 5 - NVFlare containers are being run

Now that the clients and the server are fully initialized, we can initialize an admin client/APIRunner and submit a job. After the job is submitted, NVFlare takes care of the communication itself; there is no involvement from the Jupyter notebook anymore. However, if you want to cancel the task and reuse the same clients, this can be done via the admin client/APIRunner.
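If a running task needs to be cancelled from the notebook while keeping the clients alive, a hedged sketch using the same kind of admin session and `job_id` as in the earlier snippets (method availability varies by NVFlare version):

```python
# Abort the running job on the server and all participating clients; the NVFlare
# containers themselves keep running, so a new job can be submitted afterwards.
sess.abort_job(job_id)

# Optionally verify that the job reached a terminal state before resubmitting.
print(sess.get_job_meta(job_id).get("status"))
```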

Fig. 6 - NVFlare training is running, no involvement from .ipynb

In the last step, which comes into play during every aggregation round, two other things happen:

  1. Models are sent for aggregation
  2. Metrics are submitted to the Metrics tab of the server

In the second step, another component, MLflow, comes into play. This component collects the metrics from each and every machine that participates in the training.
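As a simplified illustration of what the metric flow amounts to on the client side (the actual wiring goes through NVFlare's MLflow/metric-streaming integration rather than direct calls like these, and the metric values below are dummies):

```python
import mlflow

# Inside the AML job, MLflow is already pointed at the workspace tracking server,
# so metrics logged per aggregation round show up in the job's Metrics tab.
with mlflow.start_run(run_name="site-1-training"):
    for round_num, (loss, acc) in enumerate([(0.92, 0.61), (0.55, 0.78)]):  # dummy values
        mlflow.log_metric("train_loss", loss, step=round_num)
        mlflow.log_metric("val_accuracy", acc, step=round_num)
```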

Fig. 7 - NVFlare training is running, no involvement from .ipynb