Auto3dSeg CUDA out of memory issues. #1089
-
Hi, I've been experimenting with the Auto3dSeg pipeline for a few days, and I keep running into CUDA out-of-memory errors. Does anyone have any tips on how to avoid such problems? I've tried altering the patch sizes used by segresnet (they appear to be larger than those used by the other algorithms), but I'd like to avoid manually editing the hyperparameter YAML files if possible, as I'm trying to build an automated solution that requires minimal manual user intervention. Happy to provide more technical details and logs if that would be useful for the discussion. Many thanks, Peter
-
hi @peterhessey, there are two potential sources for OOM-type issues. The first is model training; reducing the batch size or patch size would resolve it. You can refer to this link to set up the parameters in the configuration. The second is model validation: the scripts load the entire image into GPU memory for sliding-window inference, so if the image is very large, the loading can cause an OOM issue. Could you please confirm whether the issue comes from the 1st or 2nd source? If it is the 2nd source, we will update the repo with a fix for the issue. Thanks!
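For the 2nd source, one way to keep very large volumes from exhausting GPU memory is to stitch the sliding-window output on the CPU. Below is a minimal sketch using MONAI's `SlidingWindowInferer` with the `sw_device`/`device` split; the `roi_size`, network settings, and dummy input are illustrative assumptions, not what Auto3dSeg ships with.

```python
# Sketch: run patches on the GPU, accumulate the stitched output on the CPU,
# so the full volume never has to fit in GPU memory at once.
import torch
from monai.inferers import SlidingWindowInferer
from monai.networks.nets import SegResNet

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SegResNet(spatial_dims=3, in_channels=1, out_channels=2).to(device).eval()

inferer = SlidingWindowInferer(
    roi_size=(96, 96, 96),        # patch size fed to the network (assumed value)
    sw_batch_size=1,              # patches per forward pass
    overlap=0.25,
    sw_device=device,             # patch computation on the GPU
    device=torch.device("cpu"),   # stitched output accumulates on the CPU
)

with torch.no_grad():
    image = torch.rand(1, 1, 256, 256, 256)  # dummy large volume kept on the CPU
    logits = inferer(inputs=image, network=model)
```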
-
hi @peterhessey, you can refer to this README to train the SegResNet model. Could you please share the log once you start the training command? I can help you further resolve the issue.
-
hi @peterhessey, thank you for sharing the information! There is currently no easy way to automatically generate .yaml configurations with smaller patch sizes. Users need to either manually modify the .yaml file or add additional options when launching the model training commands.
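For the second option, a rough sketch of passing overrides programmatically through `AutoRunner` (rather than hand-editing the generated files) might look like the following; the work directory, task config path, and override keys are assumptions, and the exact keys that control patch size are defined by each algorithm template, so check the generated config for their names before relying on them.

```python
# Sketch: forward training overrides to every generated algorithm,
# instead of editing each generated .yaml by hand.
from monai.apps.auto3dseg import AutoRunner

runner = AutoRunner(
    work_dir="./auto3dseg_work_dir",  # assumed output directory
    input="./task_input.yaml",        # assumed Auto3dSeg task config
)

# Overrides are forwarded to each algorithm's training entry point.
# "num_epochs" is used here only as an illustrative key.
runner.set_training_params({"num_epochs": 2})

runner.run()
```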
-
I believe the OOM is caused by the model validation steps during the training process. It is highly possible that some volumes after data pre-processing are very large. We are currently working towards resolving the OOM issues during validation, and the update will be released this month. At the current stage, you can modify the target re-sampling resolution/spacing in the transform configuration .yaml files (e.g., from 1 x 1 x 1 to 1.5 x 1.5 x 1.5 or 2.0 x 2.0 x 2.0).
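If you want to script that change rather than edit the file by hand, a minimal sketch along these lines could work; the file path and key name are assumptions about where the generated config stores the target spacing, so inspect your own .yaml first.

```python
# Sketch: coarsen the target re-sampling spacing in a generated config file
# to shrink the volumes seen during validation.
import yaml

config_path = "auto3dseg_work_dir/segresnet_0/configs/hyper_parameters.yaml"  # assumed path

with open(config_path) as f:
    config = yaml.safe_load(f)

config["resample_resolution"] = [1.5, 1.5, 1.5]  # assumed key name for the target spacing

with open(config_path, "w") as f:
    yaml.safe_dump(config, f)
```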
-
Awesome, thank you for all your help @dongyang0122!