670 add bundle example for multi-gpu training (Project-MONAI#673)

Nic-Ma · pre-commit-ci[bot] · web-flow · commit e573874dfa3c · 2022-04-29T12:52:18.000+01:00
* [DLMED] draft config Signed-off-by: Nic Ma <nma@nvidia.com>
* [DLMED] update for test Signed-off-by: Nic Ma <nma@nvidia.com>
* [DLMED] update based on enhancement Signed-off-by: Nic Ma <nma@nvidia.com>
* [DLMED] update tutorial Signed-off-by: Nic Ma <nma@nvidia.com>
* [DLMED] simplify to override Signed-off-by: Nic Ma <nma@nvidia.com>
* [DLMED] update according to comments Signed-off-by: Nic Ma <nma@nvidia.com>
* [DLMED] remove test file Signed-off-by: Nic Ma <nma@nvidia.com>
* [DLMED] add evaluation config Signed-off-by: Nic Ma <nma@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* [DLMED] simplify inference Signed-off-by: Nic Ma <nma@nvidia.com>
* [DLMED] update according to comments Signed-off-by: Nic Ma <nma@nvidia.com>

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
diff --git a/modules/bundles/get_started.ipynb b/modules/bundles/get_started.ipynb
@@ -6,13 +6,13 @@
    "source": [
     "# Get started to MONAI bundle\n",
     "\n",
-    "MONAI bundle usually includes the stored weights of a model, TorchScript model, JSON files that include configs and metadata about the model, information for constructing training, inference, and post-processing transform sequences, plain-text description, legal information, and other data the model creator wishes to include.\n",
+    "A MONAI bundle usually includes the stored weights of a model, TorchScript model, JSON files which include configs and metadata about the model, information for constructing training, inference, and post-processing transform sequences, plain-text description, legal information, and other data the model creator wishes to include.\n",
     "\n",
-    "For more information about MONAI bundle description: https://docs.monai.io/en/latest/bundle_intro.html.\n",
+    "For more information about MONAI bundles read the description: https://docs.monai.io/en/latest/bundle_intro.html.\n",
     "\n",
-    "This notebook is step-by-step tutorial to help get started to develop a bundle package, which contains a config file to construct the training pipeline and also have a `metadata.json` file to define the metadata information.\n",
+    "This notebook is a step-by-step tutorial to help you get started developing a bundle package, which contains a config file to construct the training pipeline and a `metadata.json` file to define the metadata information.\n",
     "\n",
-    "This notebook mainly contains below sections:\n",
+    "This notebook mainly contains the following sections:\n",
     "- Define a training config with `JSON` or `YAML` format\n",
     "- Execute training based on bundle scripts and configs\n",
     "- Hybrid programming with config and python code\n",
@@ -21,7 +21,6 @@
     "- Instantiate a python object from a dictionary config with `_target_` indicating class or function name or module path.\n",
     "- Execute python expression from a string config with the `$` syntax.\n",
     "- Refer to other python object with the `@` syntax.\n",
-    "- Require other independent config items to execute or instantiate first with the `_requires_` syntax.\n",
     "- Macro text replacement with the `%` syntax to simplify the config content.\n",
     "- Leverage the `_disabled_` syntax to tune or debug different components.\n",
     "- Override config content at runtime.\n",
@@ -144,13 +143,13 @@
    "source": [
     "## Define train config - Set imports and input / output environments\n",
     "\n",
-    "Now let's start to define the config file for a regular training task. MONAI bundle support both `JSON` and `YAML` format, here we use `JSON` as example.\n",
+    "Now let's start to define the config file for a regular training task. MONAI bundles support both `JSON` and `YAML` formats; here we use `JSON` as the example.\n",
     "\n",
     "According to the predefined syntax of MONAI bundle, `$` indicates an expression to evaluate, `@` refers to another object in the config content. For more details about the syntax in bundle config, please check: https://docs.monai.io/en/latest/config_syntax.html.\n",
     "\n",
-    "Please note that MONAI bundle doesn't require any hard-code logic in the config, so users can define the config content in any structure.\n",
+    "Please note that a MONAI bundle doesn't require any hard-coded logic in the config, so users can define the config content in any structure.\n",
     "\n",
-    "For the first step, import `os` and `glob` to use in the expressions (start with `$`). Then define input / output environments and enable `cudnn.benchmark` for better performance."
+    "For the first step, import `os` and `glob` to use in the expressions (which start with `$`), then define the input / output environments."
    ]
   },
   {
@@ -164,8 +163,6 @@
     "        \"$import os\",\n",
     "        \"$import ignite\"\n",
     "    ],\n",
-    "    \"determinism\": \"$monai.utils.set_determinism(seed=123)\",\n",
-    "    \"cudnn_opt\": \"$setattr(torch.backends.cudnn, 'benchmark', True)\",\n",
     "    \"device\": \"$torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')\",\n",
     "    \"ckpt_path\": \"/workspace/data/models/model.pt\",\n",
     "    \"dataset_dir\": \"/workspace/data/Task09_Spleen\",\n",
@@ -325,7 +322,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The train and validation image file names are organized into a list of dictionaries."
+    "The train and validation image file names are organized into a list of dictionaries.\n",
+    "\n",
+    "Here we pass the `dataset` instance as an argument of `dataloader` using the `@` syntax. Note that `\"#\"` in a reference ID is interpreted as a special character that goes one level deeper into the nested config structure. For example: `\"dataset\": \"@train#dataset\"`."
    ]
   },
   {
@@ -430,8 +429,6 @@
     "\n",
     "Here we use MONAI engine `SupervisedTrainer` to execute a regular training.\n",
     "\n",
-    "`determinism` and `cudnn_opt` are not args of the trainer, but should execute them before training, so here mark them in the `_requires_` field.\n",
-    "\n",
     "If users have customized logic, then can put the logic in the `iteration_update` arg or implement their own `trainer` in python code and set `_target_` to the class directly."
    ]
   },
@@ -442,7 +439,6 @@
     "```json\n",
     "\"trainer\": {\n",
     "    \"_target_\": \"SupervisedTrainer\",\n",
-    "    \"_requires_\": [\"@determinism\", \"@cudnn_opt\"],\n",
     "    \"max_epochs\": 100,\n",
     "    \"device\": \"@device\",\n",
     "    \"train_data_loader\": \"@train#dataloader\",\n",
@@ -499,7 +495,7 @@
    "source": [
     "## Define metadata information\n",
     "\n",
-    "Optinally, we can define a `metadata` file in the bundle, which contains the metadata information relating to the model, including what the shape and format of inputs and outputs are, what the meaning of the outputs are, what type of model is present, and other information. The structure is a dictionary containing a defined set of keys with additional user-specified keys.\n",
+    "We can define a `metadata` file in the bundle, which contains the metadata information relating to the model, including what the shape and format of inputs and outputs are, what the meaning of the outputs are, what type of model is present, and other information. The structure is a dictionary containing a defined set of keys with additional user-specified keys.\n",
     "\n",
     "A typical `metadata` example is available:  \n",
     "https://github.com/Project-MONAI/tutorials/blob/master/modules/bundles/spleen_segmentation/configs/metadata.json"
@@ -513,14 +509,29 @@
     "\n",
     "There are several predefined scripts in MONAI bundle module to help execute `regular training`, `metadata verification base on schema`, `network input / output verification`, `export to TorchScript model`, etc.\n",
     "\n",
-    "Here we leverage the `run` script and specify the ID of trainer in the config."
+    "Here we leverage the `run` script and specify the ID of trainer in the config.\n",
+    "\n",
+    "Define the entry point expressions in the config as a list to execute in order, and pass the ID of that list as the `runner_id` in the CLI command."
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "`python -m monai.bundle run \"'train#trainer'\" --config_file configs/train.json`"
+    "```json\n",
+    "\"training\": [\n",
+    "    \"$monai.utils.set_determinism(seed=123)\",\n",
+    "    \"$setattr(torch.backends.cudnn, 'benchmark', True)\",\n",
+    "    \"$@train#trainer.run()\"\n",
+    "]\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "`python -m monai.bundle run training --config_file configs/train.json`"
    ]
   },
   {
@@ -538,7 +549,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "`python -m monai.bundle run \"'train#trainer'\" --config_file configs/train.json --device \"\\$torch.device('cuda:1')\"`"
+    "`python -m monai.bundle run training --config_file configs/train.json --device \"\\$torch.device('cuda:1')\"`"
    ]
   },
   {
@@ -552,7 +563,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "`python -m monai.bundle run \"'train#trainer'\" --config_file configs/train.json --network \"%configs/test.json#network\"`"
+    "`python -m monai.bundle run training --config_file configs/train.json --network \"%configs/test.json#network\"`"
    ]
   },
   {
@@ -561,8 +572,9 @@
    "source": [
     "## Hybrid programming with config and python code\n",
     "\n",
-    "MONAI bundle is flexible to support customized logic, there are several ways to achieve that:\n",
-    "- If defining own components like transform, loss, trainer, etc. in a python file, just use its module path in `_target_`.\n",
+    "A MONAI bundle flexibly supports customized logic. There are several ways to achieve this:\n",
+    "\n",
+    "- If you define your own components (transform, loss, trainer, etc.) in a python file, just use the module path in `_target_` within the config file.\n",
     "- Parse the config in your own python program and do lazy instantiation with customized logic.\n",
     "\n",
     "Here we show an example to parse the config in python code and execute the training."
diff --git a/modules/bundles/spleen_segmentation/configs/evaluate.json b/modules/bundles/spleen_segmentation/configs/evaluate.json
@@ -0,0 +1,58 @@
+{
+    "validate#postprocessing": {
+        "_target_": "Compose",
+        "transforms": [
+            {
+                "_target_": "Activationsd",
+                "keys": "pred",
+                "softmax": true
+            },
+            {
+                "_target_": "Invertd",
+                "keys": ["pred", "label"],
+                "transform": "@validate#preprocessing",
+                "orig_keys": "image",
+                "meta_key_postfix": "meta_dict",
+                "nearest_interp": [false, true],
+                "to_tensor": true
+            },
+            {
+                "_target_": "AsDiscreted",
+                "keys": ["pred", "label"],
+                "argmax": [true, false],
+                "to_onehot": 2
+            },
+            {
+                "_target_": "SaveImaged",
+                "keys": "pred",
+                "meta_keys": "pred_meta_dict",
+                "output_dir": "@output_dir",
+                "resample": false,
+                "squeeze_end_dims": true
+            }
+        ]
+    },
+    "validate#handlers": [
+        {
+            "_target_": "CheckpointLoader",
+            "load_path": "$@ckpt_dir + '/model.pt'",
+            "load_dict": {"model": "@network"}
+        },
+        {
+            "_target_": "StatsHandler",
+            "iteration_log": false
+        },
+        {
+            "_target_": "MetricsSaver",
+            "save_dir": "@output_dir",
+            "metrics": ["val_mean_dice", "val_acc"],
+            "metric_details": ["val_mean_dice"],
+            "batch_transform": "$monai.handlers.from_engine(['image_meta_dict'])",
+            "summary_ops": "*"
+        }
+    ],
+    "evaluating": [
+        "$setattr(torch.backends.cudnn, 'benchmark', True)",
+        "$@validate#evaluator.run()"
+    ]
+}
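
Note on the override keys used above: `validate#postprocessing` and `validate#handlers` use `#` to address nested entries when this file is layered over `train.json`. As a rough illustration only, the toy function below shows how such a key could replace a nested value in a base dictionary. `apply_overrides` and the trimmed config contents are hypothetical, this is not MONAI's implementation, and it ignores details such as list indices in ids.

```python
from copy import deepcopy


def apply_overrides(base, overrides, sep="#"):
    """Return a copy of `base` with `sep`-separated ids in `overrides` applied.

    Toy illustration only; MONAI's own config parser handles this internally.
    """
    merged = deepcopy(base)
    for key, value in overrides.items():
        *parents, leaf = key.split(sep)
        node = merged
        for name in parents:
            # Walk (or create) each nested level named in the id.
            node = node.setdefault(name, {})
        node[leaf] = value
    return merged


# Hypothetical, heavily trimmed stand-ins for train.json and evaluate.json.
train_cfg = {"validate": {"handlers": ["StatsHandler"], "inferer": "SlidingWindowInferer"}}
evaluate_cfg = {"validate#handlers": ["CheckpointLoader", "StatsHandler", "MetricsSaver"]}

merged_cfg = apply_overrides(train_cfg, evaluate_cfg)
print(merged_cfg["validate"]["handlers"])
```

Only the addressed entry is replaced; sibling entries such as `inferer` are left as the base config defined them, which is why the evaluation config can stay much shorter than the training config.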
diff --git a/modules/bundles/spleen_segmentation/configs/inference.json b/modules/bundles/spleen_segmentation/configs/inference.json
@@ -3,12 +3,11 @@
         "$import glob",
         "$import os"
     ],
-    "cudnn_opt": "$setattr(torch.backends.cudnn, 'benchmark', True)",
-    "device": "$torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')",
-    "ckpt_path": "/workspace/data/tutorials/modules/bundles/spleen_segmentation/models/model.pt",
-    "download_ckpt": "$monai.apps.utils.download_url('https://huggingface.co/MONAI/example_spleen_segmentation/resolve/main/model.pt', @ckpt_path)",
+    "bundle_root": "/workspace/data/tutorials/modules/bundles/spleen_segmentation",
+    "output_dir": "$@bundle_root + '/eval'",
     "dataset_dir": "/workspace/data/Task09_Spleen",
     "datalist": "$list(sorted(glob.glob(@dataset_dir + '/imagesTs/*.nii.gz')))",
+    "device": "$torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')",
     "network_def": {
         "_target_": "UNet",
         "spatial_dims": 3,
@@ -101,16 +100,14 @@
                 "_target_": "SaveImaged",
                 "keys": "pred",
                 "meta_keys": "pred_meta_dict",
-                "output_dir": "eval"
+                "output_dir": "@output_dir"
             }
         ]
     },
     "handlers": [
         {
             "_target_": "CheckpointLoader",
-            "_requires_": "@download_ckpt",
-            "_disabled_": "$not os.path.exists(@ckpt_path)",
-            "load_path": "@ckpt_path",
+            "load_path": "$@bundle_root + '/models/model.pt'",
             "load_dict": {"model": "@network"}
         },
         {
@@ -120,13 +117,16 @@
     ],
     "evaluator": {
         "_target_": "SupervisedEvaluator",
-        "_requires_": "@cudnn_opt",
         "device": "@device",
         "val_data_loader": "@dataloader",
         "network": "@network",
         "inferer": "@inferer",
         "postprocessing": "@postprocessing",
         "val_handlers": "@handlers",
         "amp": true
-    }
+    },
+    "evaluating": [
+        "$setattr(torch.backends.cudnn, 'benchmark', True)",
+        "$@evaluator.run()"
+    ]
 }
diff --git a/modules/bundles/spleen_segmentation/configs/multi_gpu_train.json b/modules/bundles/spleen_segmentation/configs/multi_gpu_train.json
@@ -0,0 +1,34 @@
+{
+    "device": "$torch.device(f'cuda:{dist.get_rank()}')",
+    "network": {
+        "_target_": "torch.nn.parallel.DistributedDataParallel",
+        "module": "$@network_def.to(@device)",
+        "device_ids": ["@device"]
+    },
+    "train#sampler": {
+        "_target_": "DistributedSampler",
+        "dataset": "@train#dataset",
+        "even_divisible": true,
+        "shuffle": true
+    },
+    "train#dataloader#sampler": "@train#sampler",
+    "train#dataloader#shuffle": false,
+    "train#trainer#train_handlers": "$@train#handlers[: 1 if dist.get_rank() > 0 else None]",
+    "validate#sampler": {
+        "_target_": "DistributedSampler",
+        "dataset": "@validate#dataset",
+        "even_divisible": false,
+        "shuffle": false
+    },
+    "validate#dataloader#sampler": "@validate#sampler",
+    "validate#evaluator#val_handlers": "$None if dist.get_rank() > 0 else @validate#handlers",
+    "training": [
+        "$import torch.distributed as dist",
+        "$dist.init_process_group(backend='nccl')",
+        "$torch.cuda.set_device(@device)",
+        "$monai.utils.set_determinism(seed=123)",
+        "$setattr(torch.backends.cudnn, 'benchmark', True)",
+        "$@train#trainer.run()",
+        "$dist.destroy_process_group()"
+    ]
+}
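
Note on the `train#trainer#train_handlers` expression above: the slice bound depends on the process rank. A stop of `None` keeps the whole list on rank 0, while a stop of `1` keeps only the first handler on every other rank. A minimal sketch in plain Python, using placeholder strings rather than the real handler objects from `train.json` (the helper name `handlers_for_rank` is hypothetical):

```python
def handlers_for_rank(handlers, rank):
    """Mirror the slice used in the config: all handlers on rank 0, only the first otherwise."""
    return handlers[: 1 if rank > 0 else None]


# Placeholder names standing in for the real handler objects.
handlers = ["first_handler", "second_handler", "third_handler"]

rank0_handlers = handlers_for_rank(handlers, rank=0)
rank1_handlers = handlers_for_rank(handlers, rank=1)
print(rank0_handlers)
print(rank1_handlers)
```

The validation side takes the simpler route of passing `None` on non-zero ranks, so only rank 0 runs the validation handlers at all.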
diff --git a/modules/bundles/spleen_segmentation/configs/train.json b/modules/bundles/spleen_segmentation/configs/train.json
@@ -4,13 +4,13 @@
         "$import os",
         "$import ignite"
     ],
-    "determinism": "$monai.utils.set_determinism(seed=123)",
-    "cudnn_opt": "$setattr(torch.backends.cudnn, 'benchmark', True)",
-    "device": "$torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')",
-    "ckpt_dir": "/workspace/data/tutorials/modules/bundles/spleen_segmentation/models",
+    "bundle_root": "/workspace/data/tutorials/modules/bundles/spleen_segmentation",
+    "ckpt_dir": "$@bundle_root + '/models'",
+    "output_dir": "$@bundle_root + '/eval'",
     "dataset_dir": "/workspace/data/Task09_Spleen",
     "images": "$list(sorted(glob.glob(@dataset_dir + '/imagesTr/*.nii.gz')))",
     "labels": "$list(sorted(glob.glob(@dataset_dir + '/labelsTr/*.nii.gz')))",
+    "device": "$torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')",
     "network_def": {
         "_target_": "UNet",
         "spatial_dims": 3,
@@ -94,7 +94,7 @@
             "_target_": "DataLoader",
             "dataset": "@train#dataset",
             "batch_size": 2,
-            "shuffle": false,
+            "shuffle": true,
             "num_workers": 4
         },
         "inferer": {
@@ -130,7 +130,7 @@
             },
             {
                 "_target_": "TensorBoardStatsHandler",
-                "log_dir": "eval",
+                "log_dir": "@output_dir",
                 "tag_name": "train_loss",
                 "output_transform": "$monai.handlers.from_engine(['loss'], first=True)"
             }
@@ -143,7 +143,6 @@
         },
         "trainer": {
             "_target_": "SupervisedTrainer",
-            "_requires_": ["@determinism", "@cudnn_opt"],
             "max_epochs": 100,
             "device": "@device",
             "train_data_loader": "@train#dataloader",
@@ -196,7 +195,7 @@
             },
             {
                 "_target_": "TensorBoardStatsHandler",
-                "log_dir": "eval",
+                "log_dir": "@output_dir",
                 "iteration_log": false
             },
             {
@@ -232,5 +231,10 @@
             "val_handlers": "@validate#handlers",
             "amp": true
         }
-    }
+    },
+    "training": [
+        "$monai.utils.set_determinism(seed=123)",
+        "$setattr(torch.backends.cudnn, 'benchmark', True)",
+        "$@train#trainer.run()"
+    ]
 }
diff --git a/modules/bundles/spleen_segmentation/docs/README.md b/modules/bundles/spleen_segmentation/docs/README.md
@@ -26,13 +26,25 @@ Mean Dice = 0.96
 Execute training:
 
 ```
-python -m monai.bundle run "'train#trainer'" --meta_file configs/metadata.json --config_file configs/train.json --logging_file configs/logging.conf
+python -m monai.bundle run training --meta_file configs/metadata.json --config_file configs/train.json --logging_file configs/logging.conf
+```
+
+Override the `train` config to execute multi-GPU training:
+
+```
+torchrun --standalone --nnodes=1 --nproc_per_node=2 -m monai.bundle run training --meta_file configs/metadata.json --config_file "['configs/train.json','configs/multi_gpu_train.json']" --logging_file configs/logging.conf
+```
+
+Override the `train` config to execute evaluation with the trained model:
+
+```
+python -m monai.bundle run evaluating --meta_file configs/metadata.json --config_file "['configs/train.json','configs/evaluate.json']" --logging_file configs/logging.conf
 ```
 
 Execute inference:
 
 ```
-python -m monai.bundle run evaluator --meta_file configs/metadata.json --config_file configs/inference.json --logging_file configs/logging.conf
+python -m monai.bundle run evaluating --meta_file configs/metadata.json --config_file configs/inference.json --logging_file configs/logging.conf
 ```
 
 Verify the metadata format: