
Commit ff586e0

initial reference implementation for Breast Density FL Challenge (#680)
* initial reference implementation for Breast Density FL Challenge
1 parent cf3df67 commit ff586e0

30 files changed: +1879 −0 lines changed
Lines changed: 3 additions & 0 deletions
```
# Ignore the following files/folders during docker build

__pycache__/
```
Lines changed: 12 additions & 0 deletions
```
# IDE
.idea/

# artifacts
poc/
*.pyc
result_*
*.pth
logs

# example data
*preprocessed*
```
Lines changed: 36 additions & 0 deletions
```
# use python base image
FROM python:3.8.10
ENV DEBIAN_FRONTEND noninteractive

# specify the server FQDN as command-line argument
ARG server_fqdn
RUN echo "Setting up FL workspace with FQDN: ${server_fqdn}"

# add your code to container
COPY code /code

# add code to path
ENV PYTHONPATH=${PYTHONPATH}:"/code"

# install dependencies
# RUN python -m pip install --upgrade pip
RUN pip3 install tensorboard sklearn torchvision
RUN pip3 install monai==0.8.1
RUN pip3 install nvflare==2.0.16

# mount nvflare from source
#RUN pip install tenseal
#WORKDIR /code
#RUN git clone https://github.com/NVIDIA/NVFlare.git
#ENV PYTHONPATH=${PYTHONPATH}:"/code/NVFlare"

# download pretrained weights
ENV TORCH_HOME=/opt/torch
RUN python3 /code/pt/utils/download_model.py --model_url=https://download.pytorch.org/models/resnet18-f37072fd.pth

# prepare FL workspace
WORKDIR /code
RUN sed -i "s|{SERVER_FQDN}|${server_fqdn}|g" fl_project.yml
RUN python3 -m nvflare.lighter.provision -p fl_project.yml
RUN cp -r workspace/fl_project/prod_00 fl_workspace
RUN mv fl_workspace/${server_fqdn} fl_workspace/server
```
Lines changed: 176 additions & 0 deletions
## MammoFL_MICCAI2022

Reference implementation for the
[ACR-NVIDIA-NCI Breast Density FL challenge](http://BreastDensityFL.acr.org).

Held in conjunction with [MICCAI 2022](https://conferences.miccai.org/2022/en/).

------------------------------------------------
## 1. Run Training using the [NVFlare](https://github.com/NVIDIA/NVFlare) reference implementation

We provide a minimal example of how to implement Federated Averaging using [NVFlare 2.0](https://github.com/NVIDIA/NVFlare) and [MONAI](https://monai.io/) to train
a breast density prediction model with ResNet18.
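For orientation, the aggregation step of Federated Averaging boils down to a weighted average of client model weights. The sketch below is an illustration only and not the code path used here; in this example, the server-side averaging is performed by NVFlare's `ScatterAndGather` workflow with the `InTimeAccumulateWeightedAggregator` (see the server config later in this commit).
```
# Illustrative FedAvg aggregation: average client state_dicts,
# weighted by each client's number of local training samples.
from typing import Dict, List, Tuple

import torch


def fed_avg(updates: List[Tuple[Dict[str, torch.Tensor], int]]) -> Dict[str, torch.Tensor]:
    """Weighted average of (state_dict, n_samples) pairs."""
    total = sum(n_samples for _, n_samples in updates)
    avg = {key: torch.zeros_like(val, dtype=torch.float32) for key, val in updates[0][0].items()}
    for state_dict, n_samples in updates:
        for key, val in state_dict.items():
            avg[key] += val.float() * (n_samples / total)
    return avg
```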
### 1.1 Download example data
Follow the steps described in [./data/README.md](./data/README.md) to download an example breast density mammography dataset.
Note, the data used in the actual challenge will be different. We do, however, follow the same preprocessing steps and
use the same four BI-RADS breast density classes for prediction. See [./code/pt/utils/preprocess_dicomdir.py](./code/pt/utils/preprocess_dicomdir.py) for details.

We provide a set of random data splits. Please download them using
```
python3 ./code/pt/utils/download_datalists_and_predictions.py
```
After download, they will be available as `./data/dataset_blinded_site-*.json`, which follow the same format as the lists
that will be used in the challenge.
Please do not modify the data list filenames in the configs, as they will be the same during the challenge.

Note, the location of the dataset and data lists will be given by the system.
Do not change the locations given in [config_fed_client.json](./code/configs/mammo_fedavg/config/config_fed_client.json):
```
"DATASET_ROOT": "/data/preprocessed",
"DATALIST_PREFIX": "/data/dataset_blinded_",
```
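To sanity-check the download, you can list what each datalist file contains. The sketch below only assumes the usual JSON dict at the top level and makes no assumption about the fields inside, since the exact structure is defined by the challenge:
```
import glob
import json

# print the top-level keys of each downloaded datalist
for path in sorted(glob.glob("./data/dataset_blinded_site-*.json")):
    with open(path) as f:
        datalist = json.load(f)
    print(path, "->", list(datalist.keys()))
```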
### 1.2 Build container
The argument specifies the FQDN (Fully Qualified Domain Name) of the FL server. Use `localhost` when simulating FL on your machine.
```
./build_docker.sh localhost
```
Note, all code and pretrained models need to be included in the docker image.
The virtual machines running the containers will not have public internet access during training.
For an example, see the `download_model.py` script used to download ImageNet pretrained weights in this example.
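Such a download step can be as small as the sketch below (the repo's actual `download_model.py` may differ). The `--model_url` flag mirrors the invocation in the Dockerfile above, and `torch` caches the checkpoint under `$TORCH_HOME` so it remains available offline:
```
import argparse

import torch

# fetch a checkpoint at image-build time so no internet access
# is needed when the container runs during training
parser = argparse.ArgumentParser()
parser.add_argument("--model_url", type=str, required=True)
args = parser.parse_args()

torch.hub.load_state_dict_from_url(args.model_url, progress=True)
```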
The Dockerfile will be submitted using the [MedICI platform](https://www.medici-challenges.org).
For detailed instructions, see the [challenge website](http://BreastDensityFL.acr.org).

### 1.3 Run server and client containers, and start training
Run all commands at once using the script below. Note, this will also create separate logs under `./logs`.
```
./run_all_fl.sh
```
Note, the GPU index to use for each client is specified inside `run_all_fl.sh`.
See the individual `run_docker_site-*.sh` commands described below.
Note, the server script will automatically kill all running containers used in this example,
and the final results will be placed under `./result_server`.

(optional) Run each command in a separate terminal to get site-specific printouts in separate windows.

The argument for each shell script specifies the GPU index to be used.
```
./run_docker_server.sh
./run_docker_site-1.sh 0
./run_docker_site-2.sh 1
./run_docker_site-3.sh 0
```

### 1.4 (Optional) Visualize training using TensorBoard
After training has completed, the training curves can be visualized using
```
tensorboard --logdir=./result_server
```
A visualization of the global accuracy and [Kappa](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html) validation scores for each site with the provided example data is shown below.
The current setup runs on a machine with two NVIDIA GPUs with 12 GB memory each;
the runtime for this experiment is about 45 minutes.
You can adjust the argument to the `run_docker_site-*.sh` scripts to specify different
GPU indices if needed in your environment.

![](./figs/example_data_val_global_acc_kappa.png)
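The Kappa score shown above is scikit-learn's `cohen_kappa_score`. A toy example over the four BI-RADS density classes (the labels below are made up, and the challenge's exact scoring may apply a different weighting):
```
from sklearn.metrics import cohen_kappa_score

y_true = [0, 1, 2, 3, 1, 2, 0, 3]  # reference density classes
y_pred = [0, 1, 2, 2, 1, 3, 0, 3]  # model predictions
print(cohen_kappa_score(y_true, y_pred))  # 1.0 would mean perfect agreement
```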
### 1.5 (Optional) Kill all containers
If you didn't use `run_all_fl.sh`, all containers can be killed by running
```
docker kill server site-1 site-2 site-3
```

------------------------------------------------
## 2. Modify the FL algorithm

You can modify and extend the provided example code under [./code/pt](./code/pt).

You could use other components available in [NVFlare](https://github.com/NVIDIA/NVFlare)
or enhance the training pipeline using your custom code or features of other libraries.

See the [NVFlare examples](https://github.com/NVIDIA/NVFlare/tree/main/examples) for features that could be utilized in this challenge.

### 2.1 Debugging the learning algorithm

The example NVFlare `Learner` class is implemented at [./code/pt/learners/mammo_learner.py](./code/pt/learners/mammo_learner.py).
You can debug the file using the `MockClientEngine`, as shown in the script, by running
```
python3 code/pt/learners/mammo_learner.py
```
Furthermore, you can test it inside the container by first running
```
./run_docker_debug.sh
```
Note, set `inside_container = True` to reflect the changed filepaths inside the container.
------------------------------------------------
## 3. Bring your own FL framework
If you would like to use your own FL framework to participate in the challenge,
please modify the Dockerfile accordingly to include all the dependencies.

Your container needs to provide the following scripts, which implement starting the server, starting the clients, and finalizing the server.
They will be executed by the system in the following order.

### 3.1 start server
```
/code/start_server.sh
```

### 3.2 start each client (in parallel)
```
/code/start_site-1.sh
/code/start_site-2.sh
/code/start_site-3.sh
```

### 3.3 finalize the server
```
/code/finalize_server.sh
```
For an example of how the challenge system will execute these commands, see the provided `run_docker*.sh` scripts.
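As a rough Python rendering of that order (this assumes each start script blocks until its party finishes, which may not match exactly how the real system drives the docker containers):
```
import subprocess

# start the server, then the three clients in parallel
server = subprocess.Popen(["/code/start_server.sh"])
clients = [subprocess.Popen([f"/code/start_site-{i}.sh"]) for i in (1, 2, 3)]

# wait for all parties, then finalize the server
for proc in clients:
    proc.wait()
server.wait()
subprocess.run(["/code/finalize_server.sh"], check=True)
```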
### 3.4 Communication
The communication channels for FL will be restricted to the ports specified in [fl_project.yml](./code/fl_project.yml).
Your FL framework will also need to use those ports for its communication.
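For a quick local check that the server is listening, something like the sketch below works. The ports 8002/8003 are NVFlare's provisioning defaults and are an assumption here; use whatever ports `fl_project.yml` actually specifies:
```
import socket

# probe the FL ports on the local server
for port in (8002, 8003):
    with socket.socket() as s:
        s.settimeout(2)
        result = s.connect_ex(("localhost", port))
    print(port, "open" if result == 0 else "closed")
```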
### 3.5 Results
Results need to be written to `/result/predictions.json`.
Please follow the format produced by the reference implementation at [./result_server_example/predictions.json](./result_server_example/predictions.json)
(available after running `python3 ./code/pt/utils/download_datalists_and_predictions.py`).
The code is expected to return a JSON file containing at least a list of image names and the prediction probabilities for each breast density class
for the global model (which should be named `SRV_best_FL_global_model.pt`).
```
{
  "site-1": {
    "SRV_best_FL_global_model.pt": {
      ...
      "test_probs": [{
        "image": "Calc-Test_P_00643_LEFT_MLO.npy",
        "probs": [0.005602597258985043, 0.7612965703010559, 0.23040543496608734, 0.0026953918859362602]
      }, {
      ...
  },
  "site-2": {
    "SRV_best_FL_global_model.pt": {
      ...
      "test_probs": [{
        "image": "Calc-Test_P_00643_LEFT_MLO.npy",
        "probs": [0.005602597258985043, 0.7612965703010559, 0.23040543496608734, 0.0026953918859362602]
      }, {
      ...
  },
  "site-3": {
    "SRV_best_FL_global_model.pt": {
      ...
      "test_probs": [{
        "image": "Calc-Test_P_00643_LEFT_MLO.npy",
        "probs": [0.005602597258985043, 0.7612965703010559, 0.23040543496608734, 0.0026953918859362602]
      }, {
      ...
  }
}
```
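A minimal sanity check of the example output against this structure (the path and field names follow the excerpt above; the elided fields are ignored):
```
import json

with open("./result_server_example/predictions.json") as f:
    preds = json.load(f)

for site in ("site-1", "site-2", "site-3"):
    entries = preds[site]["SRV_best_FL_global_model.pt"]["test_probs"]
    for entry in entries:
        # each entry pairs an image name with four class probabilities
        assert len(entry["probs"]) == 4, entry["image"]
```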
Lines changed: 15 additions & 0 deletions
```
#!/usr/bin/env bash

#SERVER_FQDN="localhost"
SERVER_FQDN=$1

if test -z "${SERVER_FQDN}"
then
    echo "Usage: ./build_docker.sh [SERVER_FQDN], e.g. ./build_docker.sh localhost"
    exit 1
fi

NEW_IMAGE=monai-nvflare:latest

export DOCKER_BUILDKIT=0  # show command outputs
docker build --network=host -t ${NEW_IMAGE} --build-arg server_fqdn=${SERVER_FQDN} -f Dockerfile .
```
Lines changed: 51 additions & 0 deletions
```
{
  "format_version": 2,

  "DATASET_ROOT": "/data/preprocessed",
  "DATALIST_PREFIX": "/data/dataset_blinded_",

  "executors": [
    {
      "tasks": [
        "train", "submit_model", "validate"
      ],
      "executor": {
        "id": "Executor",
        "path": "nvflare.app_common.executors.learner_executor.LearnerExecutor",
        "args": {
          "learner_id": "learner"
        }
      }
    }
  ],

  "task_result_filters": [
  ],
  "task_data_filters": [
  ],

  "components": [
    {
      "id": "learner",
      "path": "pt.learners.mammo_learner.MammoLearner",
      "args": {
        "dataset_root": "{DATASET_ROOT}",
        "datalist_prefix": "{DATALIST_PREFIX}",
        "aggregation_epochs": 1,
        "lr": 2e-3,
        "batch_size": 64,
        "val_frac": 0.1
      }
    },
    {
      "id": "analytic_sender",
      "name": "AnalyticsSender",
      "args": {}
    },
    {
      "id": "event_to_fed",
      "name": "ConvertToFedEvent",
      "args": {"events_to_convert": ["analytix_log_stats"], "fed_event_prefix": "fed."}
    }
  ]
}
```
Lines changed: 88 additions & 0 deletions
```
{
  "format_version": 2,

  "min_clients": 3,
  "num_rounds": 100,

  "server": {
    "heart_beat_timeout": 600
  },
  "task_data_filters": [],
  "task_result_filters": [],
  "components": [
    {
      "id": "persistor",
      "name": "PTFileModelPersistor",
      "args": {
        "model": {
          "path": "monai.networks.nets.TorchVisionFCModel",
          "args": {
            "model_name": "resnet18",
            "n_classes": 4,
            "use_conv": false,
            "pretrained": true,
            "pool": null
          }
        }
      }
    },
    {
      "id": "shareable_generator",
      "name": "FullModelShareableGenerator",
      "args": {}
    },
    {
      "id": "aggregator",
      "name": "InTimeAccumulateWeightedAggregator",
      "args": {}
    },
    {
      "id": "model_selector",
      "name": "IntimeModelSelectionHandler",
      "args": {}
    },
    {
      "id": "model_locator",
      "name": "PTFileModelLocator",
      "args": {
        "pt_persistor_id": "persistor"
      }
    },
    {
      "id": "json_generator",
      "name": "ValidationJsonGenerator",
      "args": {}
    },
    {
      "id": "tb_analytics_receive",
      "name": "TBAnalyticsReceiver",
      "args": {"events": ["fed.analytix_log_stats"]}
    }
  ],
  "workflows": [
    {
      "id": "scatter_gather_ctl",
      "name": "ScatterAndGather",
      "args": {
        "min_clients": "{min_clients}",
        "num_rounds": "{num_rounds}",
        "start_round": 0,
        "wait_time_after_min_received": 10,
        "aggregator_id": "aggregator",
        "persistor_id": "persistor",
        "shareable_generator_id": "shareable_generator",
        "train_task_name": "train",
        "train_timeout": 0
      }
    },
    {
      "id": "global_model_eval",
      "name": "GlobalModelEval",
      "args": {
        "model_locator_id": "model_locator",
        "validation_timeout": 6000,
        "cleanup_models": true
      }
    }
  ]
}
```
Lines changed: 8 additions & 0 deletions
```
#!/usr/bin/env bash
SERVER="server"
echo "FINALIZING ${SERVER}"
cp -r ./fl_workspace/${SERVER}/run_1 /result/.
cp ./fl_workspace/${SERVER}/*.txt /result/.
cp ./fl_workspace/*_log.txt /result/.
cp ./fl_workspace/${SERVER}/run_1/cross_site_val/cross_val_results.json /result/predictions.json  # only file required for leaderboard computation
# TODO: might need some more standardization of the result folder
```
