Commit 41ad4aa: initial reference implementation for Breast Density FL Challenge
1 parent b66354a

38 files changed: +40648 −0 lines
Lines changed: 3 additions & 0 deletions

# Ignore the following files/folders during docker build
__pycache__/
Lines changed: 13 additions & 0 deletions

# IDE
.idea/

# artifacts
poc/
*.pyc
result_*
*.pth
logs

# example data
*preprocessed*
Lines changed: 36 additions & 0 deletions

# use python base image
FROM python:3.8.10
ENV DEBIAN_FRONTEND noninteractive

# specify the server FQDN as command-line argument
ARG server_fqdn
RUN echo "Setting up FL workspace with FQDN: ${server_fqdn}"

# add your code to container
COPY code /code

# add code to path
ENV PYTHONPATH=${PYTHONPATH}:"/code"

# install dependencies
# RUN python -m pip install --upgrade pip
RUN pip3 install tensorboard sklearn torchvision
RUN pip3 install monai==0.8.1
RUN pip3 install nvflare==2.0.16

# mount nvflare from source
#RUN pip install tenseal
#WORKDIR /code
#RUN git clone https://github.com/NVIDIA/NVFlare.git
#ENV PYTHONPATH=${PYTHONPATH}:"/code/NVFlare"

# download pretrained weights
ENV TORCH_HOME=/opt/torch
RUN python3 /code/pt/utils/download_model.py --model_url=https://download.pytorch.org/models/resnet18-f37072fd.pth

# prepare FL workspace
WORKDIR /code
RUN sed -i "s|{SERVER_FQDN}|${server_fqdn}|g" fl_project.yml
RUN python3 -m nvflare.lighter.provision -p fl_project.yml
RUN cp -r workspace/fl_project/prod_00 fl_workspace
RUN mv fl_workspace/${server_fqdn} fl_workspace/server
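The `download_model.py` step above caches the ResNet18 weights under `TORCH_HOME` at image build time, since the containers have no internet access during training. The actual script is not part of this file's diff; as a rough sketch of the idea, the `checkpoint_path` helper below is hypothetical and only mirrors where `torch.hub` places downloaded weights:

```python
import os
from urllib.parse import urlparse

def checkpoint_path(model_url, torch_home="/opt/torch"):
    # torch.hub caches downloaded weights under $TORCH_HOME/hub/checkpoints/
    filename = os.path.basename(urlparse(model_url).path)
    return os.path.join(torch_home, "hub", "checkpoints", filename)

url = "https://download.pytorch.org/models/resnet18-f37072fd.pth"
print(checkpoint_path(url))
# At build time the real script would trigger the download, e.g. via
#   torch.hub.load_state_dict_from_url(url)
# so the weights already sit in the cache when training starts offline.
```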
Lines changed: 170 additions & 0 deletions

## MammoFL_MICCAI2022

Reference implementation for the
[ACR-NVIDIA-NCI Breast Density FL challenge](http://BreastDensityFL.acr.org),
held in conjunction with [MICCAI 2022](https://conferences.miccai.org/2022/en/).

------------------------------------------------
## 1. Run Training using the [NVFlare](https://github.com/NVIDIA/NVFlare) reference implementation

We provide a minimal example of how to implement Federated Averaging using [NVFlare 2.0](https://github.com/NVIDIA/NVFlare) and [MONAI](https://monai.io/) to train a breast density prediction model with ResNet18.

### 1.1 Download example data
Follow the steps described in [./data/README.md](./data/README.md) to download an example breast density mammography dataset.
Note, the data used in the actual challenge will be different. We do, however, follow the same preprocessing steps and use the same four BI-RADS breast density classes for prediction.

We provide a set of random data splits as `./data/dataset_blinded_site-*.json`, which follow the same format as the splits used in the challenge. Please do not modify the data list filenames in the configs, as they will be the same during the challenge.

Note, the location of the dataset and data lists will be given by the system.
Do not change the locations given in [config_fed_client.json](./code/configs/mammo_fedavg/config/config_fed_client.json):
```
"DATASET_ROOT": "/data/preprocessed",
"DATALIST_PREFIX": "/data/dataset_blinded_",
```
### 1.2 Build container
The argument specifies the FQDN of the FL server. Use `localhost` when simulating FL on your machine.
```
./build_docker.sh localhost
```
Note, all code and pretrained models need to be included in the docker image.
The virtual machines running the containers will not have public internet access during training.
For an example, please see the `download_model.py` script used to download ImageNet-pretrained weights in this example.

The Dockerfile should be submitted using the [MedICI platform](https://www.medici-challenges.org).
For detailed instructions, see the [challenge website](http://BreastDensityFL.acr.org).
### 1.3 Run server and client containers, and start training
Run all commands at once using the script below. Note, this will also create separate logs under `./logs`.
```
./run_all_fl.sh
```
Note, the GPU index to use for each client is specified inside `run_all_fl.sh`.
See the individual `run_docker_site-*.sh` commands described below.
Note, the server script will automatically kill all running containers used in this example, and the final results will be placed under `./result_server`.

(Optional) Run each command in a separate terminal to get site-specific printouts in separate windows.

The argument for each shell script specifies the GPU index to be used.
```
./run_docker_server.sh
./run_docker_site-1.sh 0
./run_docker_site-2.sh 1
./run_docker_site-3.sh 0
```
### 1.4 (Optional) Visualize training using TensorBoard
After training has completed, the training curves can be visualized using
```
tensorboard --logdir=./result_server
```
A visualization of the global accuracy and [Kappa](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html) validation scores for each site with the provided example data is shown below.
The current setup runs on a machine with two NVIDIA GPUs with 12 GB of memory each.
The runtime for this experiment is about 45 minutes.
You can adjust the argument to the `run_docker_site-*.sh` scripts to specify different GPU indices if needed in your environment.

![](./figs/example_data_val_global_acc_kappa.png)
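The Kappa score reported here is Cohen's kappa between the predicted and true density classes, computed with the scikit-learn function linked above. A minimal sketch, with made-up labels for the four BI-RADS classes:

```python
from sklearn.metrics import cohen_kappa_score

# hypothetical BI-RADS density labels (classes 0-3) for a handful of images
y_true = [0, 1, 2, 3, 1, 2, 2, 3]
y_pred = [0, 1, 2, 3, 1, 1, 2, 2]

# chance-corrected agreement: 1.0 is perfect, 0.0 is chance level
kappa = cohen_kappa_score(y_true, y_pred)
print(f"kappa = {kappa:.3f}")
```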
### 1.5 (Optional) Kill all containers
If you didn't use `run_all_fl.sh`, all containers can be killed by running
```
docker kill server site-1 site-2 site-3
```
------------------------------------------------
## 2. Modify the FL algorithm

You can modify and extend the provided example code under [./code/pt](./code/pt).

You could use other components available in [NVFlare](https://github.com/NVIDIA/NVFlare) or enhance the training pipeline using your custom code or features of other libraries.

See the [NVFlare examples](https://github.com/NVIDIA/NVFlare/tree/main/examples) for features that could be utilized in this challenge.

### 2.1 Debugging the learning algorithm

The example NVFlare `Learner` class is implemented at [./code/pt/learners/mammo_learner.py](./code/pt/learners/mammo_learner.py).
You can debug the file using the `MockClientEngine`, as shown in the script, by running
```
python3 code/pt/learners/mammo_learner.py
```
Furthermore, you can test it inside the container by first running
```
./run_docker_debug.sh
```
Note, set `inside_container = True` to reflect the changed file paths inside the container.
------------------------------------------------
## 3. Bring your own FL framework
If you would like to use your own FL framework to participate in the challenge, please modify the Dockerfile accordingly to include all the dependencies.

Your container needs to provide the following scripts, which implement the starting of the server and clients and the finalizing of the server.
They will be executed by the system in the following order.

### 3.1 Start server
```
/code/start_server.sh
```

### 3.2 Start each client (in parallel)
```
/code/start_site-1.sh
/code/start_site-2.sh
/code/start_site-3.sh
```

### 3.3 Finalize the server
```
/code/finalize_server.sh
```
For an example of how the challenge system will execute these commands, see the provided `run_docker*.sh` scripts.
### 3.4 Communication
The communication channels for FL will be restricted to the ports specified in [fl_project.yml](./code/fl_project.yml).
Your FL framework will also need to use those ports for its communication.

### 3.5 Results
Results will need to be written to `/result/predictions.json`.
Please follow the format produced by the reference implementation at [./result_server/predictions.json](./result_server/predictions.json).
The code is expected to return a JSON file containing at least a list of image names and the prediction probabilities for each breast density class for the global model (which should be named `SRV_best_FL_global_model.pt`).
```
{
    "site-1": {
        "SRV_best_FL_global_model.pt": {
            ...
            "test_probs": [{
                "image": "Calc-Test_P_00643_LEFT_MLO.npy",
                "probs": [0.005602597258985043, 0.7612965703010559, 0.23040543496608734, 0.0026953918859362602]
            }, {
            ...
    },
    "site-2": {
        "SRV_best_FL_global_model.pt": {
            ...
            "test_probs": [{
                "image": "Calc-Test_P_00643_LEFT_MLO.npy",
                "probs": [0.005602597258985043, 0.7612965703010559, 0.23040543496608734, 0.0026953918859362602]
            }, {
            ...
    },
    "site-3": {
        "SRV_best_FL_global_model.pt": {
            ...
            "test_probs": [{
                "image": "Calc-Test_P_00643_LEFT_MLO.npy",
                "probs": [0.005602597258985043, 0.7612965703010559, 0.23040543496608734, 0.0026953918859362602]
            }, {
            ...
    }
}
```
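Before submitting, it can help to sanity-check that your `/result/predictions.json` follows this nesting. The site and model names in the sketch below come from this README; the `check_predictions` helper itself is an illustrative assumption, not part of the challenge tooling:

```python
import json

REQUIRED_SITES = ["site-1", "site-2", "site-3"]
GLOBAL_MODEL = "SRV_best_FL_global_model.pt"

def check_predictions(preds):
    """Raise AssertionError if the nesting shown in the README is missing."""
    for site in REQUIRED_SITES:
        assert site in preds, f"missing {site}"
        model = preds[site].get(GLOBAL_MODEL)
        assert model is not None, f"{site}: missing {GLOBAL_MODEL}"
        for entry in model.get("test_probs", []):
            assert "image" in entry and "probs" in entry
            assert len(entry["probs"]) == 4  # four BI-RADS density classes
    return True

# toy result with the same shape as the example above
example = {
    site: {GLOBAL_MODEL: {"test_probs": [
        {"image": "Calc-Test_P_00643_LEFT_MLO.npy",
         "probs": [0.0056, 0.7613, 0.2304, 0.0027]}
    ]}} for site in REQUIRED_SITES
}
print(check_predictions(example))
```

With a real file you would pass `json.load(open("/result/predictions.json"))` instead of the toy dict.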
Lines changed: 15 additions & 0 deletions

#!/usr/bin/env bash

#SERVER_FQDN="localhost"
SERVER_FQDN=$1

if test -z "${SERVER_FQDN}"
then
  echo "Usage: ./build_docker.sh [SERVER_FQDN], e.g. ./build_docker.sh localhost"
  exit 1
fi

NEW_IMAGE=monai-nvflare:latest

export DOCKER_BUILDKIT=0 # show command outputs
docker build --network=host -t ${NEW_IMAGE} --build-arg server_fqdn=${SERVER_FQDN} -f Dockerfile .
Lines changed: 51 additions & 0 deletions

{
  "format_version": 2,

  "DATASET_ROOT": "/data/preprocessed",
  "DATALIST_PREFIX": "/data/dataset_blinded_",

  "executors": [
    {
      "tasks": [
        "train", "submit_model", "validate"
      ],
      "executor": {
        "id": "Executor",
        "path": "nvflare.app_common.executors.learner_executor.LearnerExecutor",
        "args": {
          "learner_id": "learner"
        }
      }
    }
  ],

  "task_result_filters": [
  ],
  "task_data_filters": [
  ],

  "components": [
    {
      "id": "learner",
      "path": "pt.learners.mammo_learner.MammoLearner",
      "args": {
        "dataset_root": "{DATASET_ROOT}",
        "datalist_prefix": "{DATALIST_PREFIX}",
        "aggregation_epochs": 1,
        "lr": 2e-3,
        "batch_size": 64,
        "val_frac": 0.1
      }
    },
    {
      "id": "analytic_sender",
      "name": "AnalyticsSender",
      "args": {}
    },
    {
      "id": "event_to_fed",
      "name": "ConvertToFedEvent",
      "args": {"events_to_convert": ["analytix_log_stats"], "fed_event_prefix": "fed."}
    }
  ]
}
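The `"{DATASET_ROOT}"` and `"{DATALIST_PREFIX}"` values in the learner args are config variables that NVFlare resolves against the top-level entries of the same file. A rough sketch of that substitution step (illustrative only, not NVFlare's actual resolver):

```python
# top-level config entries, as in config_fed_client.json above
config = {
    "DATASET_ROOT": "/data/preprocessed",
    "DATALIST_PREFIX": "/data/dataset_blinded_",
}
# component args referencing them via "{NAME}" placeholders
args = {"dataset_root": "{DATASET_ROOT}", "datalist_prefix": "{DATALIST_PREFIX}"}

def resolve(value, config):
    # replace a whole-string "{NAME}" placeholder with its config entry
    if isinstance(value, str) and value.startswith("{") and value.endswith("}"):
        return config.get(value[1:-1], value)
    return value

resolved = {k: resolve(v, config) for k, v in args.items()}
print(resolved["dataset_root"])
```

This is why the challenge instructions say to leave `DATASET_ROOT` and `DATALIST_PREFIX` untouched: the system mounts the data at exactly those paths.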
Lines changed: 88 additions & 0 deletions

{
  "format_version": 2,

  "min_clients": 3,
  "num_rounds": 100,

  "server": {
    "heart_beat_timeout": 600
  },
  "task_data_filters": [],
  "task_result_filters": [],
  "components": [
    {
      "id": "persistor",
      "name": "PTFileModelPersistor",
      "args": {
        "model": {
          "path": "monai.networks.nets.TorchVisionFCModel",
          "args": {
            "model_name": "resnet18",
            "n_classes": 4,
            "use_conv": false,
            "pretrained": true,
            "pool": null
          }
        }
      }
    },
    {
      "id": "shareable_generator",
      "name": "FullModelShareableGenerator",
      "args": {}
    },
    {
      "id": "aggregator",
      "name": "InTimeAccumulateWeightedAggregator",
      "args": {}
    },
    {
      "id": "model_selector",
      "name": "IntimeModelSelectionHandler",
      "args": {}
    },
    {
      "id": "model_locator",
      "name": "PTFileModelLocator",
      "args": {
        "pt_persistor_id": "persistor"
      }
    },
    {
      "id": "json_generator",
      "name": "ValidationJsonGenerator",
      "args": {}
    },
    {
      "id": "tb_analytics_receive",
      "name": "TBAnalyticsReceiver",
      "args": {"events": ["fed.analytix_log_stats"]}
    }
  ],
  "workflows": [
    {
      "id": "scatter_gather_ctl",
      "name": "ScatterAndGather",
      "args": {
        "min_clients": "{min_clients}",
        "num_rounds": "{num_rounds}",
        "start_round": 0,
        "wait_time_after_min_received": 10,
        "aggregator_id": "aggregator",
        "persistor_id": "persistor",
        "shareable_generator_id": "shareable_generator",
        "train_task_name": "train",
        "train_timeout": 0
      }
    },
    {
      "id": "global_model_eval",
      "name": "GlobalModelEval",
      "args": {
        "model_locator_id": "model_locator",
        "validation_timeout": 6000,
        "cleanup_models": true
      }
    }
  ]
}
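Conceptually, the `InTimeAccumulateWeightedAggregator` configured above implements the averaging step of Federated Averaging: client updates are combined as a weighted mean. The numbers below are made up, and this pure-Python sketch is only an illustration of the idea, not NVFlare's implementation (which also handles accumulation as results arrive):

```python
# Toy per-client parameters for one layer of the model.
client_weights = {
    "site-1": {"fc.bias": [0.10, 0.20]},
    "site-2": {"fc.bias": [0.30, 0.40]},
    "site-3": {"fc.bias": [0.50, 0.60]},
}
# Hypothetical contribution weights, e.g. local training steps per site.
n_steps = {"site-1": 100, "site-2": 300, "site-3": 100}

total = sum(n_steps.values())
aggregated = {}
for name in client_weights["site-1"]:
    # element-wise weighted mean across clients
    aggregated[name] = [
        sum(client_weights[s][name][i] * n_steps[s] / total for s in client_weights)
        for i in range(len(client_weights["site-1"][name]))
    ]
print(aggregated["fc.bias"])
```

The resulting global parameters are then redistributed to the clients by the `ScatterAndGather` workflow for the next round.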
Lines changed: 8 additions & 0 deletions

#!/usr/bin/env bash
SERVER="server"
echo "FINALIZING ${SERVER}"
cp -r ./fl_workspace/${SERVER}/run_1 /result/.
cp ./fl_workspace/${SERVER}/*.txt /result/.
cp ./fl_workspace/*_log.txt /result/.
cp ./fl_workspace/${SERVER}/run_1/cross_site_val/cross_val_results.json /result/predictions.json # only file required for leaderboard computation
# TODO: might need some more standardization of the result folder
