670 add bundle example for multi-gpu training #673

Merged
merged 12 commits into from
Apr 29, 2022
54 changes: 33 additions & 21 deletions modules/bundles/get_started.ipynb
Original file line number Diff line number Diff line change
@@ -6,13 +6,13 @@
"source": [
"# Get started to MONAI bundle\n",
"\n",
"MONAI bundle usually includes the stored weights of a model, TorchScript model, JSON files that include configs and metadata about the model, information for constructing training, inference, and post-processing transform sequences, plain-text description, legal information, and other data the model creator wishes to include.\n",
"A MONAI bundle usually includes the stored weights of a model, a TorchScript version of the model, JSON files containing configs and metadata about the model, information for constructing training, inference, and post-processing transform sequences, a plain-text description, legal information, and other data the model creator wishes to include.\n",
"\n",
"For more information about MONAI bundle description: https://docs.monai.io/en/latest/bundle_intro.html.\n",
"For more information about MONAI bundles, see the description: https://docs.monai.io/en/latest/bundle_intro.html.\n",
"\n",
"This notebook is step-by-step tutorial to help get started to develop a bundle package, which contains a config file to construct the training pipeline and also have a `metadata.json` file to define the metadata information.\n",
"This notebook is a step-by-step tutorial to help you get started developing a bundle package, which contains a config file to construct the training pipeline and also has a `metadata.json` file to define the metadata information.\n",
"\n",
"This notebook mainly contains below sections:\n",
"This notebook mainly contains the following sections:\n",
"- Define a training config with `JSON` or `YAML` format\n",
"- Execute training based on bundle scripts and configs\n",
"- Hybrid programming with config and python code\n",
@@ -21,7 +21,6 @@
"- Instantiate a python object from a dictionary config with `_target_` indicating class or function name or module path.\n",
"- Execute python expression from a string config with the `$` syntax.\n",
"- Refer to other python object with the `@` syntax.\n",
"- Require other independent config items to execute or instantiate first with the `_requires_` syntax.\n",
"- Macro text replacement with the `%` syntax to simplify the config content.\n",
"- Leverage the `_disabled_` syntax to tune or debug different components.\n",
"- Override config content at runtime.\n",
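Two of the syntax features listed above — the `$` expression prefix and the `_disabled_` flag — can be sketched with a minimal toy interpreter. This is an illustrative sketch over plain dicts, NOT the real MONAI bundle parser:

```python
# Toy illustration of the "$" (expression) and "_disabled_" syntax described
# above. Simplified sketch only; the real parsing is done by MONAI bundle.

def evaluate_item(value):
    """Evaluate a config value: strings starting with "$" are python expressions."""
    if isinstance(value, str) and value.startswith("$"):
        return eval(value[1:])
    return value

def active_components(components):
    """Drop components whose "_disabled_" flag evaluates to True."""
    return [c for c in components if not evaluate_item(c.get("_disabled_", False))]

config = {"lr": "$1e-4 * 2", "epochs": 10}
print(evaluate_item(config["lr"]))   # 0.0002
print(active_components([
    {"_target_": "StatsHandler"},
    {"_target_": "TensorBoardStatsHandler", "_disabled_": "$1 > 0"},
]))                                  # keeps only StatsHandler
```

The same idea scales to the real config files below: any string value can defer a computation until parse time, and any component can be switched off for debugging without deleting it.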
@@ -144,13 +143,13 @@
"source": [
"## Define train config - Set imports and input / output environments\n",
"\n",
"Now let's start to define the config file for a regular training task. MONAI bundle support both `JSON` and `YAML` format, here we use `JSON` as example.\n",
"Now let's start to define the config file for a regular training task. MONAI bundles support both the `JSON` and `YAML` formats; here we use `JSON` as the example.\n",
"\n",
"According to the predefined syntax of MONAI bundle, `$` indicates an expression to evaluate, `@` refers to another object in the config content. For more details about the syntax in bundle config, please check: https://docs.monai.io/en/latest/config_syntax.html.\n",
"\n",
"Please note that MONAI bundle doesn't require any hard-code logic in the config, so users can define the config content in any structure.\n",
"Please note that a MONAI bundle doesn't require any hard-coded logic in the config, so users can define the config content in any structure.\n",
"\n",
"For the first step, import `os` and `glob` to use in the expressions (start with `$`). Then define input / output environments and enable `cudnn.benchmark` for better performance."
"For the first step, import `os` and `glob` to use in the expressions (starting with `$`), then define input / output environments and enable `cudnn.benchmark` for better performance."
]
},
{
@@ -164,8 +163,6 @@
" \"$import os\",\n",
" \"$import ignite\"\n",
" ],\n",
" \"determinism\": \"$monai.utils.set_determinism(seed=123)\",\n",
" \"cudnn_opt\": \"$setattr(torch.backends.cudnn, 'benchmark', True)\",\n",
" \"device\": \"$torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')\",\n",
" \"ckpt_path\": \"/workspace/data/models/model.pt\",\n",
" \"dataset_dir\": \"/workspace/data/Task09_Spleen\",\n",
@@ -325,7 +322,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The train and validation image file names are organized into a list of dictionaries."
"The train and validation image file names are organized into a list of dictionaries.\n",
"\n",
"Here we use the `dataset` instance as an argument of the `dataloader` via the `@` syntax. Please note that `\"#\"` in a reference ID is interpreted as a special character to go one level further into the nested config structure, for example: `\"dataset\": \"@train#dataset\"`."
]
},
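The `#` nesting semantics can be sketched with a tiny hypothetical resolver — an illustrative sketch only, not MONAI's actual reference resolution:

```python
# Toy resolver for the "#" nesting syntax used in reference ids like
# "@train#dataset". Illustrative sketch; MONAI resolves these internally.

def resolve(config, ref):
    """Resolve an id like 'train#dataloader' by walking nested dicts/lists."""
    node = config
    for key in ref.split("#"):  # each "#" goes one level deeper
        node = node[int(key)] if isinstance(node, list) else node[key]
    return node

config = {
    "train": {
        "dataset": {"_target_": "Dataset", "data": "@datalist"},
        "dataloader": {"_target_": "DataLoader", "dataset": "@train#dataset"},
    }
}
print(resolve(config, "train#dataset"))  # {'_target_': 'Dataset', 'data': '@datalist'}
```

So `"@train#dataset"` names the `dataset` entry nested inside the `train` section, and the parser substitutes the instantiated object at that position.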
{
@@ -430,8 +429,6 @@
"\n",
"Here we use MONAI engine `SupervisedTrainer` to execute a regular training.\n",
"\n",
"`determinism` and `cudnn_opt` are not args of the trainer, but should execute them before training, so here mark them in the `_requires_` field.\n",
"\n",
"If users have customized logic, they can put the logic in the `iteration_update` arg or implement their own `trainer` in python code and set `_target_` to the class directly."
]
},
@@ -442,7 +439,6 @@
"```json\n",
"\"trainer\": {\n",
" \"_target_\": \"SupervisedTrainer\",\n",
" \"_requires_\": [\"@determinism\", \"@cudnn_opt\"],\n",
" \"max_epochs\": 100,\n",
" \"device\": \"@device\",\n",
" \"train_data_loader\": \"@train#dataloader\",\n",
@@ -499,7 +495,7 @@
"source": [
"## Define metadata information\n",
"\n",
"Optinally, we can define a `metadata` file in the bundle, which contains the metadata information relating to the model, including what the shape and format of inputs and outputs are, what the meaning of the outputs are, what type of model is present, and other information. The structure is a dictionary containing a defined set of keys with additional user-specified keys.\n",
"We can define a `metadata` file in the bundle, which contains the metadata information relating to the model, including what the shape and format of inputs and outputs are, what the meaning of the outputs are, what type of model is present, and other information. The structure is a dictionary containing a defined set of keys with additional user-specified keys.\n",
"\n",
"A typical `metadata` example is available: \n",
"https://github.com/Project-MONAI/tutorials/blob/master/modules/bundles/spleen_segmentation/configs/metadata.json"
@@ -513,14 +509,29 @@
"\n",
"There are several predefined scripts in the MONAI bundle module to help execute `regular training`, `metadata verification based on schema`, `network input / output verification`, `export to TorchScript model`, etc.\n",
"\n",
"Here we leverage the `run` script and specify the ID of trainer in the config."
"Here we leverage the `run` script to execute the entry point defined in the config.\n",
"\n",
"Just define the entry point expressions in the config to execute in order, and specify the `runner_id` in the CLI command."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`python -m monai.bundle run \"'train#trainer'\" --config_file configs/train.json`"
"```json\n",
"\"training\": [\n",
" \"$monai.utils.set_determinism(seed=123)\",\n",
" \"$setattr(torch.backends.cudnn, 'benchmark', True)\",\n",
" \"$@train#trainer.run()\"\n",
"]\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`python -m monai.bundle run training --config_file configs/train.json`"
]
},
{
@@ -538,7 +549,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"`python -m monai.bundle run \"'train#trainer'\" --config_file configs/train.json --device \"\\$torch.device('cuda:1')\"`"
"`python -m monai.bundle run training --config_file configs/train.json --device \"\\$torch.device('cuda:1')\"`"
]
},
{
@@ -552,7 +563,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"`python -m monai.bundle run \"'train#trainer'\" --config_file configs/train.json --network \"%configs/test.json#network\"`"
"`python -m monai.bundle run training --config_file configs/train.json --network \"%configs/test.json#network\"`"
]
},
{
@@ -561,8 +572,9 @@
"source": [
"## Hybrid programming with config and python code\n",
"\n",
"MONAI bundle is flexible to support customized logic, there are several ways to achieve that:\n",
"- If defining own components like transform, loss, trainer, etc. in a python file, just use its module path in `_target_`.\n",
"A MONAI bundle supports flexible customized logic; there are several ways to achieve this:\n",
"\n",
"- If you define your own components (e.g. transform, loss, trainer) in a python file, just use the module path in `_target_` within the config file.\n",
"- Parse the config in your own python program and do lazy instantiation with customized logic.\n",
"\n",
"Here we show an example of parsing the config in python code and executing the training."
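The `_target_`-based lazy instantiation used throughout these configs can be sketched with only the standard library. This is a simplified illustration of the idea (MONAI's own config parser performs this for real bundles, with the extra `$`/`@` resolution on top); `fractions.Fraction` here is just a stand-in target for the demo:

```python
# Sketch of lazy instantiation from a dict config with a "_target_" key,
# using importlib to show the underlying idea. Illustrative only.
import importlib

def instantiate(cfg):
    """Instantiate cfg["_target_"] (a full module path) with remaining keys as kwargs."""
    module_path, _, name = cfg["_target_"].rpartition(".")
    cls = getattr(importlib.import_module(module_path), name)
    kwargs = {k: v for k, v in cfg.items() if k != "_target_"}
    return cls(**kwargs)

obj = instantiate({"_target_": "fractions.Fraction", "numerator": 1, "denominator": 3})
print(obj)  # 1/3
```

Parsing the config yourself like this is what enables the "hybrid" workflow: you can inspect or modify the dict in python before anything is constructed, then instantiate only the components you need.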
58 changes: 58 additions & 0 deletions modules/bundles/spleen_segmentation/configs/evaluate.json
@@ -0,0 +1,58 @@
{
"validate#postprocessing":{
"_target_": "Compose",
"transforms": [
{
"_target_": "Activationsd",
"keys": "pred",
"softmax": true
},
{
"_target_": "Invertd",
"keys": ["pred", "label"],
"transform": "@validate#preprocessing",
"orig_keys": "image",
"meta_key_postfix": "meta_dict",
"nearest_interp": [false, true],
"to_tensor": true
},
{
"_target_": "AsDiscreted",
"keys": ["pred", "label"],
"argmax": [true, false],
"to_onehot": 2
},
{
"_target_": "SaveImaged",
"keys": "pred",
"meta_keys": "pred_meta_dict",
"output_dir": "@output_dir",
"resample": false,
"squeeze_end_dims": true
}
]
},
"validate#handlers": [
{
"_target_": "CheckpointLoader",
"load_path": "$@ckpt_dir + '/model.pt'",
"load_dict": {"model": "@network"}
},
{
"_target_": "StatsHandler",
"iteration_log": false
},
{
"_target_": "MetricsSaver",
"save_dir": "@output_dir",
"metrics": ["val_mean_dice", "val_acc"],
"metric_details": ["val_mean_dice"],
"batch_transform": "$monai.handlers.from_engine(['image_meta_dict'])",
"summary_ops": "*"
}
],
"evaluating": [
"$setattr(torch.backends.cudnn, 'benchmark', True)",
"$@validate#evaluator.run()"
]
}
20 changes: 10 additions & 10 deletions modules/bundles/spleen_segmentation/configs/inference.json
@@ -3,12 +3,11 @@
"$import glob",
"$import os"
],
"cudnn_opt": "$setattr(torch.backends.cudnn, 'benchmark', True)",
"device": "$torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')",
"ckpt_path": "/workspace/data/tutorials/modules/bundles/spleen_segmentation/models/model.pt",
"download_ckpt": "$monai.apps.utils.download_url('https://huggingface.co/MONAI/example_spleen_segmentation/resolve/main/model.pt', @ckpt_path)",
"bundle_root": "/workspace/data/tutorials/modules/bundles/spleen_segmentation",
"output_dir": "$@bundle_root + '/eval'",
"dataset_dir": "/workspace/data/Task09_Spleen",
"datalist": "$list(sorted(glob.glob(@dataset_dir + '/imagesTs/*.nii.gz')))",
"device": "$torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')",
"network_def": {
"_target_": "UNet",
"spatial_dims": 3,
@@ -101,16 +100,14 @@
"_target_": "SaveImaged",
"keys": "pred",
"meta_keys": "pred_meta_dict",
"output_dir": "eval"
"output_dir": "@output_dir"
}
]
},
"handlers": [
{
"_target_": "CheckpointLoader",
"_requires_": "@download_ckpt",
"_disabled_": "$not os.path.exists(@ckpt_path)",
"load_path": "@ckpt_path",
"load_path": "$@bundle_root + '/models/model.pt'",
"load_dict": {"model": "@network"}
},
{
@@ -120,13 +117,16 @@
],
"evaluator": {
"_target_": "SupervisedEvaluator",
"_requires_": "@cudnn_opt",
"device": "@device",
"val_data_loader": "@dataloader",
"network": "@network",
"inferer": "@inferer",
"postprocessing": "@postprocessing",
"val_handlers": "@handlers",
"amp": true
}
},
"evaluating": [
"$setattr(torch.backends.cudnn, 'benchmark', True)",
"[email protected]()"
]
}
34 changes: 34 additions & 0 deletions modules/bundles/spleen_segmentation/configs/multi_gpu_train.json
@@ -0,0 +1,34 @@
{
"device": "$torch.device(f'cuda:{dist.get_rank()}')",
"network": {
"_target_": "torch.nn.parallel.DistributedDataParallel",
"module": "$@network_def.to(@device)",
"device_ids": ["@device"]
},
"train#sampler": {
"_target_": "DistributedSampler",
"dataset": "@train#dataset",
"even_divisible": true,
"shuffle": true
},
"train#dataloader#sampler": "@train#sampler",
"train#dataloader#shuffle": false,
"train#trainer#train_handlers": "$@train#handlers[: 1 if dist.get_rank() > 0 else None]",
"validate#sampler": {
"_target_": "DistributedSampler",
"dataset": "@validate#dataset",
"even_divisible": false,
"shuffle": false
},
"validate#dataloader#sampler": "@validate#sampler",
"validate#evaluator#val_handlers": "$None if dist.get_rank() > 0 else @validate#handlers",
"training": [
"$import torch.distributed as dist",
"$dist.init_process_group(backend='nccl')",
"$torch.cuda.set_device(@device)",
"$monai.utils.set_determinism(seed=123)",
"$setattr(torch.backends.cudnn, 'benchmark', True)",
"$@train#trainer.run()",
"$dist.destroy_process_group()"
]
}
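The `even_divisible` option on the distributed samplers above controls whether the dataset is padded so every rank receives the same number of samples. A pure-python sketch of that sharding idea (illustrative only — the real behaviour lives in MONAI's `DistributedSampler` on top of torch):

```python
# Pure-python sketch of "even_divisible" distributed sharding: pad the index
# list by repeating from the start, then interleave indices across ranks.
# Illustrative only; not the actual MONAI/torch sampler implementation.
import math

def shard_indices(num_samples, num_ranks, rank, even_divisible=True):
    indices = list(range(num_samples))
    if even_divisible:
        per_rank = math.ceil(num_samples / num_ranks)
        # pad so every rank gets exactly per_rank samples
        indices += indices[: per_rank * num_ranks - num_samples]
    return indices[rank::num_ranks]

print(shard_indices(5, 2, 0))  # [0, 2, 4]
print(shard_indices(5, 2, 1))  # [1, 3, 0]
```

This is why the config above sets `even_divisible: true` for training (ranks must stay in lockstep) but `false` for validation, where duplicated samples would skew the metrics.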
22 changes: 13 additions & 9 deletions modules/bundles/spleen_segmentation/configs/train.json
@@ -4,13 +4,13 @@
"$import os",
"$import ignite"
],
"determinism": "$monai.utils.set_determinism(seed=123)",
"cudnn_opt": "$setattr(torch.backends.cudnn, 'benchmark', True)",
"device": "$torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')",
"ckpt_dir": "/workspace/data/tutorials/modules/bundles/spleen_segmentation/models",
"bundle_root": "/workspace/data/tutorials/modules/bundles/spleen_segmentation",
"ckpt_dir": "$@bundle_root + '/models'",
"output_dir": "$@bundle_root + '/eval'",
"dataset_dir": "/workspace/data/Task09_Spleen",
"images": "$list(sorted(glob.glob(@dataset_dir + '/imagesTr/*.nii.gz')))",
"labels": "$list(sorted(glob.glob(@dataset_dir + '/labelsTr/*.nii.gz')))",
"device": "$torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')",
"network_def": {
"_target_": "UNet",
"spatial_dims": 3,
@@ -94,7 +94,7 @@
"_target_": "DataLoader",
"dataset": "@train#dataset",
"batch_size": 2,
"shuffle": false,
"shuffle": true,
"num_workers": 4
},
"inferer": {
@@ -130,7 +130,7 @@
},
{
"_target_": "TensorBoardStatsHandler",
"log_dir": "eval",
"log_dir": "@output_dir",
"tag_name": "train_loss",
"output_transform": "$monai.handlers.from_engine(['loss'], first=True)"
}
@@ -143,7 +143,6 @@
},
"trainer": {
"_target_": "SupervisedTrainer",
"_requires_": ["@determinism", "@cudnn_opt"],
"max_epochs": 100,
"device": "@device",
"train_data_loader": "@train#dataloader",
@@ -196,7 +195,7 @@
},
{
"_target_": "TensorBoardStatsHandler",
"log_dir": "eval",
"log_dir": "@output_dir",
"iteration_log": false
},
{
@@ -232,5 +231,10 @@
"val_handlers": "@validate#handlers",
"amp": true
}
}
},
"training": [
"$monai.utils.set_determinism(seed=123)",
"$setattr(torch.backends.cudnn, 'benchmark', True)",
"$@train#trainer.run()"
]
}
16 changes: 14 additions & 2 deletions modules/bundles/spleen_segmentation/docs/README.md
@@ -26,13 +26,25 @@ Mean Dice = 0.96
Execute training:

```
python -m monai.bundle run "'train#trainer'" --meta_file configs/metadata.json --config_file configs/train.json --logging_file configs/logging.conf
python -m monai.bundle run training --meta_file configs/metadata.json --config_file configs/train.json --logging_file configs/logging.conf
```

Override the `train` config to execute multi-GPU training:

```
torchrun --standalone --nnodes=1 --nproc_per_node=2 -m monai.bundle run training --meta_file configs/metadata.json --config_file "['configs/train.json','configs/multi_gpu_train.json']" --logging_file configs/logging.conf
```

Override the `train` config to execute evaluation with the trained model:

```
python -m monai.bundle run evaluating --meta_file configs/metadata.json --config_file "['configs/train.json','configs/evaluate.json']" --logging_file configs/logging.conf
```

Execute inference:

```
python -m monai.bundle run evaluator --meta_file configs/metadata.json --config_file configs/inference.json --logging_file configs/logging.conf
python -m monai.bundle run evaluating --meta_file configs/metadata.json --config_file configs/inference.json --logging_file configs/logging.conf
```

Verify the metadata format: