Commit 6be932f

mingxin-zheng, pre-commit-ci[bot], and wyli authored

Make iteration-based training params epoch-based in auto3dseg (#1213)
Signed-off-by: Mingxin Zheng <[email protected]>

Fixes #1212

### Description

Convert the iteration-based training parameters (`num_iterations`, `num_iterations_per_validation`, `num_warmup_iterations`) used in the Auto3DSeg notebooks to their epoch-based equivalents (`num_epochs`, `num_epochs_per_validation`, `num_warmup_epochs`), removing the manual epoch-to-iteration conversion code.

### Checks

- [x] Avoid including large-size files in the PR.
- [x] Clean up long text outputs from code cells in the notebook.
- [x] For security purposes, please check the contents and remove any sensitive info such as user names and private key.
- [x] Ensure (1) hyperlinks and markdown anchors are working (2) use relative paths for tutorial repo files (3) put figure and graphs in the `./figure` folder
- [x] Notebook runs automatically
  - `./runner.sh -t 'auto3dseg/notebooks/auto3dseg_hello_world.ipynb'`
  - `./runner.sh -t 'auto3dseg/notebooks/auto3dseg_autorunner_ref_api.ipynb'`
  - `./runner.sh -t 'auto3dseg/notebooks/auto_runner.ipynb'`
  - `./runner.sh -t 'auto3dseg/notebooks/hpo_optuna.ipynb'`

---

Signed-off-by: Mingxin Zheng <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Wenqi Li <[email protected]>
1 parent e896eca · commit 6be932f

14 files changed: +107 additions, -164 deletions
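In short, the notebooks previously converted a target epoch count into iteration counts by hand; this commit passes epochs through directly. A minimal before/after sketch assembled from the removed and added cells below (the `datalist_file` path is a placeholder; the parameter values are the demo settings from the diff):

```python
import torch
from monai.auto3dseg import datafold_read

datalist_file = "msd_folds.json"  # placeholder: the fold-split datalist JSON built in the notebooks
max_epochs = 2
num_images_per_batch = 2

# Before: epochs were converted into iteration counts by hand (removed in this commit).
files_train_fold0, _ = datafold_read(datalist_file, "", 0)  # training files of fold 0
n_data = len(files_train_fold0)
num_gpus = max(torch.cuda.device_count(), 1)
n_iter = int(max_epochs * n_data / num_images_per_batch / num_gpus)
old_train_param = {
    "num_iterations": n_iter,
    "num_iterations_per_validation": int(n_iter / 2),
    "num_images_per_batch": num_images_per_batch,
    "num_epochs": max_epochs,
    "num_warmup_iterations": int(n_iter / 2),
}

# After: everything is expressed directly in epochs.
new_train_param = {
    "num_epochs_per_validation": 1,
    "num_images_per_batch": num_images_per_batch,
    "num_epochs": max_epochs,
    "num_warmup_epochs": 1,
}
```

The derived quantities drop out entirely: the bundle templates now accept `num_epochs`, `num_epochs_per_validation`, and `num_warmup_epochs` directly.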

3d_segmentation/unetr_btcv_segmentation_3d.ipynb

Lines changed: 1 addition & 1 deletion
```diff
@@ -33,7 +33,7 @@
 "\n",
 "Under Institutional Review Board (IRB) supervision, 50 abdomen CT scans of were randomly selected from a combination of an ongoing colorectal cancer chemotherapy trial, and a retrospective ventral hernia study. The 50 scans were captured during portal venous contrast phase with variable volume sizes (512 x 512 x 85 - 512 x 512 x 198) and field of views (approx. 280 x 280 x 280 mm3 - 500 x 500 x 650 mm3). The in-plane resolution varies from 0.54 x 0.54 mm2 to 0.98 x 0.98 mm2, while the slice thickness ranges from 2.5 mm to 5.0 mm. \n",
 "\n",
-"Target: 13 abdominal organs including 1. Spleen 2. Right Kidney 3. Left Kideny 4.Gallbladder 5.Esophagus 6. Liver 7. Stomach 8.Aorta 9. IVC 10. Portal and Splenic Veins 11. Pancreas 12 Right adrenal gland 13 Left adrenal gland.\n",
+"Target: 13 abdominal organs including 1. Spleen 2. Right Kidney 3. Left Kidney 4.Gallbladder 5.Esophagus 6. Liver 7. Stomach 8.Aorta 9. IVC 10. Portal and Splenic Veins 11. Pancreas 12 Right adrenal gland 13 Left adrenal gland.\n",
 "\n",
 "Modality: CT\n",
 "Size: 30 3D volumes (24 Training + 6 Testing) \n",
```

3d_segmentation/unetr_btcv_segmentation_3d_lightning.ipynb

Lines changed: 1 addition & 1 deletion
```diff
@@ -36,7 +36,7 @@
 "\n",
 "Under Institutional Review Board (IRB) supervision, 50 abdomen CT scans of were randomly selected from a combination of an ongoing colorectal cancer chemotherapy trial, and a retrospective ventral hernia study. The 50 scans were captured during portal venous contrast phase with variable volume sizes (512 x 512 x 85 - 512 x 512 x 198) and field of views (approx. 280 x 280 x 280 mm3 - 500 x 500 x 650 mm3). The in-plane resolution varies from 0.54 x 0.54 mm2 to 0.98 x 0.98 mm2, while the slice thickness ranges from 2.5 mm to 5.0 mm. \n",
 "\n",
-"Target: 13 abdominal organs including 1. Spleen 2. Right Kidney 3. Left Kideny 4.Gallbladder 5.Esophagus 6. Liver 7. Stomach 8.Aorta 9. IVC 10. Portal and Splenic Veins 11. Pancreas 12 Right adrenal gland 13 Left adrenal gland.\n",
+"Target: 13 abdominal organs including 1. Spleen 2. Right Kidney 3. Left Kidney 4.Gallbladder 5.Esophagus 6. Liver 7. Stomach 8.Aorta 9. IVC 10. Portal and Splenic Veins 11. Pancreas 12 Right adrenal gland 13 Left adrenal gland.\n",
 "\n",
 "Modality: CT\n",
 "Size: 30 3D volumes (24 Training + 6 Testing) \n",
```

auto3dseg/notebooks/auto3dseg_autorunner_ref_api.ipynb

Lines changed: 20 additions & 30 deletions
```diff
@@ -52,7 +52,6 @@
 "outputs": [],
 "source": [
 "import os\n",
-"import torch\n",
 "import tempfile\n",
 "\n",
 "from monai.apps import download_and_extract\n",
@@ -64,23 +63,21 @@
 " export_bundle_algo_history,\n",
 " import_bundle_algo_history,\n",
 ")\n",
-"from monai.auto3dseg import algo_to_pickle, datafold_read\n",
+"from monai.auto3dseg import algo_to_pickle\n",
 "from monai.bundle.config_parser import ConfigParser\n",
 "from monai.config import print_config\n",
 "\n",
 "print_config()"
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "## Download dataset\n",
 "\n",
-"We provide a toy datalist file that splits a subset of the downloaded datasets into five folds.\n",
-"\n",
-"> NOTE: Each validation set only has 6 images in one fold of training.\n",
-"> Therefore, we need to set a limit on the total number of GPUs we're using in this notebook."
+"We provide a toy datalist file that splits a subset of the downloaded datasets into five folds."
 ]
 },
 {
@@ -101,11 +98,7 @@
 "if not os.path.exists(dataroot):\n",
 " download_and_extract(resource, compressed_file, root_dir)\n",
 "\n",
-"datalist_file = os.path.join(\"..\", \"tasks\", \"msd\", msd_task, \"msd_\" + msd_task.lower() + \"_folds.json\")\n",
-"\n",
-"if torch.cuda.device_count() > 6:\n",
-" os.environ[\"CUDA_DEVICE_ORDER\"] = \"PCI_BUS_ID\"\n",
-" os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0,1,2,3,4,5\""
+"datalist_file = os.path.join(\"..\", \"tasks\", \"msd\", msd_task, \"msd_\" + msd_task.lower() + \"_folds.json\")"
 ]
 },
 {
@@ -231,14 +224,15 @@
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "## Getting and saving the algorithm generation history to the local drive\n",
 "\n",
 "If the users continue to train the algorithms on local system, The history of the algorithm generation can be fetched via `get_history` method of the `BundleGen` object. There also are scenarios that users need to stop the Python process after the `algo_gen`. For example, the users may need to transfer the files to a remote cluster to start the training. `Auto3DSeg` offers a utility function `export_bundle_algo_history` to dump the history to hard drive and recall it by `import_bundle_algo_history`. \n",
 "\n",
-"If the files are copied to a remote system, please make sure the algorithm templates are also copied there. Some functions require the path to instantiate the algorithm class properly."
+"If the files are copied to a remote system, please ensure the algorithm templates are also copied there. Some functions require the path to instantiate the algorithm class properly."
 ]
 },
 {
@@ -252,6 +246,7 @@
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {},
 "source": [
@@ -266,7 +261,15 @@
 "The users can use either `train()` or `train({})` if no changes are needed.\n",
 "Then the algorithms will go for the full training and repeat 5 folds.\n",
 "\n",
-"On the other hand, users can also use set `train_param` for each algorithm."
+"On the other hand, users can also use set `train_param` for each algorithm.\n",
+"\n",
+"\n",
+"For demo purposes, below is a code block to convert num_epoch to iteration style and override all algorithms with the same training parameters.\n",
+"The setup works fine for a machine that has GPUs less than or equal to 8.\n",
+"The datalist in this example is only using a subset of the original dataset.\n",
+"Users need to ensure the number of GPUs is not greater than the number that the training dataset can be partitioned.\n",
+"For example, the following code block is not suitable for a 16-GPU system.\n",
+"In such cases, please change the code block accordingly."
 ]
 },
 {
@@ -277,24 +280,11 @@
 "source": [
 "max_epochs = 2 # change epoch number to 2 to cut down the notebook running time\n",
 "\n",
-"# safeguard to ensure max_epochs is greater or equal to 2\n",
-"max_epochs = max(max_epochs, 2)\n",
-"\n",
-"num_gpus = 1 if \"multigpu\" in input and not input[\"multigpu\"] else torch.cuda.device_count()\n",
-"\n",
-"num_epoch = max_epochs\n",
-"num_images_per_batch = 2\n",
-"files_train_fold0, _ = datafold_read(datalist_file, \"\", 0)\n",
-"n_data = len(files_train_fold0)\n",
-"n_iter = int(num_epoch * n_data / num_images_per_batch / max(num_gpus, 1))\n",
-"n_iter_val = int(n_iter / 2)\n",
-"\n",
 "train_param = {\n",
-" \"num_iterations\": n_iter,\n",
-" \"num_iterations_per_validation\": n_iter_val,\n",
-" \"num_images_per_batch\": num_images_per_batch,\n",
-" \"num_epochs\": num_epoch,\n",
-" \"num_warmup_iterations\": n_iter_val,\n",
+" \"num_epochs_per_validation\": 1,\n",
+" \"num_images_per_batch\": 2,\n",
+" \"num_epochs\": max_epochs,\n",
+" \"num_warmup_epochs\": 1,\n",
 "}\n",
 "\n",
 "print(train_param)"
```
auto3dseg/notebooks/auto3dseg_hello_world.ipynb

Lines changed: 11 additions & 17 deletions
```diff
@@ -54,7 +54,6 @@
 "import nibabel as nib\n",
 "import numpy as np\n",
 "import matplotlib.pyplot as plt\n",
-"import torch\n",
 "\n",
 "from monai.apps.auto3dseg import AutoRunner\n",
 "from monai.config import print_config\n",
@@ -64,17 +63,15 @@
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "## Simulate a special dataset\n",
 "\n",
 "It is well known that AI takes time to train. To provide the \"Hello World!\" experience of Auto3D in this notebook, we will simulate a small dataset and run training only for multiple epochs. Due to the nature of AI, the performance shouldn't be highly expected, but the entire pipeline will be completed within minutes!\n",
 "\n",
-"`sim_datalist` provides the information of the simulated datasets. It lists 12 training and 2 testing images and labels. The training data are split into 3 folds. Each fold will use 8 images to train and 4 images to validate. The size of the dimension is defined by the `sim_dim` .\n",
-"\n",
-"> NOTE: Each validation set only has 4 images in one fold of training.\n",
-"> Therefore, we need to set a limit on the total number of GPUs we're using in this notebook."
+"`sim_datalist` provides the information of the simulated datasets. It lists 12 training and 2 testing images and labels. The training data are split into 3 folds. Each fold will use 8 images to train and 4 images to validate. The size of the dimension is defined by the `sim_dim` ."
 ]
 },
 {
@@ -104,11 +101,7 @@
 " ],\n",
 "}\n",
 "\n",
-"sim_dim = (64, 64, 64)\n",
-"\n",
-"if torch.cuda.device_count() > 4:\n",
-" os.environ[\"CUDA_DEVICE_ORDER\"] = \"PCI_BUS_ID\"\n",
-" os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0,1,2,3\""
+"sim_dim = (64, 64, 64)"
 ]
 },
 {
@@ -216,10 +209,15 @@
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Override the training parameters so that we can complete the pipeline in minutes"
+"## Override the training parameters so that we can complete the pipeline in minutes\n",
+"\n",
+"For demo purposes, below is a code block to convert num_epoch to iteration style and override all algorithms with the same training parameters.\n",
+"If users would like to use more than one GPU, they can change the `CUDA_VISIBLE_DEVICES`, or just remove the key to use all available devices.\n",
+"Users also need to ensure the number of GPUs is not greater than the number that the training dataset can be partitioned."
 ]
 },
 {
@@ -230,16 +228,12 @@
 "source": [
 "max_epochs = 2\n",
 "\n",
-"# safeguard to ensure max_epochs is greater or equal to 2\n",
-"max_epochs = max(max_epochs, 2)\n",
-"\n",
 "train_param = {\n",
 " \"CUDA_VISIBLE_DEVICES\": [0], # use only 1 gpu\n",
-" \"num_iterations\": 4 * max_epochs,\n",
-" \"num_iterations_per_validation\": 2 * max_epochs,\n",
+" \"num_epochs_per_validation\": 1,\n",
 " \"num_images_per_batch\": 2,\n",
 " \"num_epochs\": max_epochs,\n",
-" \"num_warmup_iterations\": 2 * max_epochs,\n",
+" \"num_warmup_epochs\": 1,\n",
 "}\n",
 "runner.set_training_params(train_param)\n",
 "runner.set_num_fold(num_fold=1)"
```

auto3dseg/notebooks/auto_runner.ipynb

Lines changed: 36 additions & 45 deletions
```diff
@@ -60,28 +60,24 @@
 "source": [
 "import os\n",
 "import tempfile\n",
-"import torch\n",
 "\n",
 "from monai.bundle.config_parser import ConfigParser\n",
 "from monai.apps import download_and_extract\n",
 "\n",
 "from monai.apps.auto3dseg import AutoRunner\n",
-"from monai.auto3dseg import datafold_read\n",
 "from monai.config import print_config\n",
 "\n",
 "print_config()"
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "## Download dataset\n",
 "\n",
-"We provide a toy datalist file that splits a subset of the downloaded datasets into five folds.\n",
-"\n",
-"> NOTE: Each validation set only has 6 images in one fold of training.\n",
-"> Therefore, we need to set a limit on the total number of GPUs we're using in this notebook."
+"We provide a toy datalist file that splits a subset of the downloaded datasets into five folds."
 ]
 },
 {
@@ -102,11 +98,7 @@
 "if not os.path.exists(dataroot):\n",
 " download_and_extract(resource, compressed_file, root_dir)\n",
 "\n",
-"datalist_file = os.path.join(\"..\", \"tasks\", \"msd\", msd_task, \"msd_\" + msd_task.lower() + \"_folds.json\")\n",
-"\n",
-"if torch.cuda.device_count() > 6:\n",
-" os.environ[\"CUDA_DEVICE_ORDER\"] = \"PCI_BUS_ID\"\n",
-" os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0,1,2,3,4,5\""
+"datalist_file = os.path.join(\"..\", \"tasks\", \"msd\", msd_task, \"msd_\" + msd_task.lower() + \"_folds.json\")"
 ]
 },
 {
@@ -267,18 +259,26 @@
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "## Customize training parameters by override the default values\n",
 "\n",
 "`set_training_params` in `AutoRunner` provides an interface to change all algorithms' training parameters in one line. \n",
 "\n",
-"> NOTE **Auto3DSeg** uses MONAI bundle templates to perform training, validation, and inference. The number of epochs/iterations of training is specified by the config files in each template.\n",
-"> Users can override these these values in the bundle templates.\n",
-"> But users should consider that some bundle templates may use `num_iterations` and other may use `num_epochs` to iterate.\n",
+"NOTE: \n",
+"**Auto3DSeg** uses MONAI bundle templates to perform training, validation, and inference.\n",
+"The number of epochs/iterations of training is specified by the config files in each template.\n",
+"Users can override these these values in the bundle templates.\n",
+"But users should consider that some bundle templates may use `num_iterations` and other may use `num_epochs` to iterate.\n",
 "\n",
-"For demo purpose, below is a code block to convert num_epoch to iteration style and override all algorithms with the same training parameters for 1-GPU/2-GPU machine. \n"
+"For demo purposes, below is a code block to convert num_epoch to iteration style and override all algorithms with the same training parameters.\n",
+"The setup works fine for a machine that has GPUs less than or equal to 8.\n",
+"The datalist in this example is only using a subset of the original dataset.\n",
+"Users need to ensure the number of GPUs is not greater than the number that the training dataset can be partitioned.\n",
+"For example, the following code block is not suitable for a 16-GPU system.\n",
+"In such cases, please change the code block accordingly.\n"
 ]
 },
 {
@@ -289,25 +289,13 @@
 "source": [
 "max_epochs = 2\n",
 "\n",
-"# safeguard to ensure max_epochs is greater or equal to 2\n",
-"max_epochs = max(max_epochs, 2)\n",
-"\n",
-"num_gpus = 1 if \"multigpu\" in input_cfg and not input_cfg[\"multigpu\"] else torch.cuda.device_count()\n",
-"\n",
-"num_epoch = max_epochs\n",
-"num_images_per_batch = 2\n",
-"files_train_fold0, _ = datafold_read(datalist_file, \"\", 0)\n",
-"n_data = len(files_train_fold0)\n",
-"n_iter = int(num_epoch * n_data / num_images_per_batch / num_gpus)\n",
-"n_iter_val = int(n_iter / 2)\n",
-"\n",
 "train_param = {\n",
-" \"num_iterations\": n_iter,\n",
-" \"num_iterations_per_validation\": n_iter_val,\n",
-" \"num_images_per_batch\": num_images_per_batch,\n",
-" \"num_epochs\": num_epoch,\n",
-" \"num_warmup_iterations\": n_iter_val,\n",
+" \"num_epochs_per_validation\": 1,\n",
+" \"num_images_per_batch\": 2,\n",
+" \"num_epochs\": max_epochs,\n",
+" \"num_warmup_epochs\": 1,\n",
 "}\n",
+"\n",
 "runner = AutoRunner(input=input)\n",
 "runner.set_training_params(params=train_param)\n",
 "# runner.run()"
@@ -360,25 +348,26 @@
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "## Train model with HPO\n",
 "\n",
 "**Auto3DSeg** supports hyper parameter optimization (HPO) via `NNI` and `Optuna` backends.\n",
-"If you wound like to the use `Optuna`, please check the [notebook](hpo_optuna.ipynb) for detailed usage.\n",
+"If you would like to the use `Optuna`, please check the [notebook](hpo_optuna.ipynb) for detailed usage.\n",
 "\n",
 "Here we demonstrate the HPO option with `NNI` by Microsoft.\n",
 "Please install it via `pip install nni` if you hope to execute HPO with it in tutorial and haven't done so in the beginning of the notebook.\n",
 "AutoRunner supports `NNI` backend with a grid search method via automatically generating a the `NNI` config and run `nnictl` commands in subprocess.\n",
 "\n",
 "## Use `AutoRunner` with `NNI` backend to perform grid search\n",
 "\n",
-"After `runner.run()` is executed, `nni` will attempt to start a web service using port 8088 by default. If you are running the tutorial in a remote host, please make sure the port is available on the system.\n",
+"After `runner.run()` is executed, `nni` will attempt to start a web service using port 8088 by default. If you are running the tutorial in a remote host, please ensure the port is available on the system.\n",
 "\n",
 "> NOTE: it is recommended to turn off ensemble if the users are using HPO features.\n",
 "> By default, all the models are saved under the working directory, including the ones tuned by the HPO package.\n",
-"> Users may want to read the HPO results before the taking the next step.\n",
+"> Users may want to read the HPO results before taking the next step.\n",
 "> If the users want to ensemble all the models, the `ensemble` option can be set to True."
 ]
 },
@@ -395,6 +384,7 @@
 ]
 },
 {
+"attachments": {},
 "cell_type": "markdown",
 "metadata": {},
 "source": [
@@ -403,6 +393,7 @@
 "The default `NNI` config that `AutoRunner` looks like below. User can override some of the parameters via the `set_hpo_params` interface:\n",
 "\n",
 "```python\n",
+"import torch\n",
 "default_nni_config = {\n",
 " \"trialCodeDirectory\": \".\",\n",
 " \"trialGpuNumber\": torch.cuda.device_count(),\n",
@@ -449,19 +440,19 @@
 "outputs": [],
 "source": [
 "runner = AutoRunner(input=input, hpo=True, ensemble=False)\n",
+"num_epoch = 2\n",
 "hpo_params = {\n",
 " \"maxTrialNumber\": 20,\n",
 " \"maxExperimentDuration\": \"30m\",\n",
-" \"num_iterations\": n_iter,\n",
-" \"num_iterations_per_validation\": n_iter_val,\n",
-" \"num_images_per_batch\": num_images_per_batch,\n",
-" \"num_epochs\": num_epoch,\n",
-" \"num_warmup_iterations\": n_iter_val,\n",
-" \"training#num_iterations\": n_iter,\n",
-" \"training#num_iterations_per_validation\": n_iter_val,\n",
-" \"searching#num_iterations\": n_iter,\n",
-" \"searching#num_iterations_per_validation\": n_iter_val,\n",
-" \"searching#num_warmup_iterations\": n_iter,\n",
+" \"num_epochs_per_validation\": 1,\n",
+" \"num_images_per_batch\": 1,\n",
+" \"num_epochs\": 2,\n",
+" \"num_warmup_epochs\": 1,\n",
+" \"training#num_epochs\": 2,\n",
+" \"training#num_epochs_per_validation\": 1,\n",
+" \"searching#num_epochs\": 2,\n",
+" \"searching#num_epochs_per_validation\": 1,\n",
+" \"searching#num_warmup_epochs\": 1,\n",
 "}\n",
 "search_space = {\"learning_rate\": {\"_type\": \"choice\", \"_value\": [0.0001, 0.01]}}\n",
 "runner.set_num_fold(num_fold=1)\n",
```
