Trainer and Handlers for DDP #1076

hw-ju · 2022-11-29T00:32:09Z

Hi!
In the dynunet_pipeline tutorial, which uses DDP, there appears to be no explicit synchronization (e.g. dist.barrier) between training and validation. Do the monai.engines.Trainer and Evaluator classes take care of such synchronization?

Are the CheckpointSaver and StatsHandler handlers executed only by rank 0? If so, does StatsHandler log metrics obtained from rank 0 alone, or aggregated metrics from all ranks?

yiheng-wang-nv · 2022-11-29T10:59:46Z

Collaborator

Hi @hw-ju , thanks for posting the questions.

Are the CheckpointSaver and StatsHandler handlers executed only by rank 0? If so, does StatsHandler log metrics obtained from rank 0 alone, or aggregated metrics from all ranks?

Yes, these handlers only need to be attached on rank 0, and the logged result is based on all data. In addition, I checked the code of the dynunet pipeline, and these handlers are currently executed on all ranks. I will submit a PR to fix this.

For the synchronization question, we may need @wyli @Nic-Ma @ericspod to help answer it, thanks!
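
A minimal sketch of the "attach only on rank 0" idea described above. It is not the change made in the tutorial PR: `select_handlers` is a hypothetical helper, and the factories return placeholder strings where real code would build MONAI handler instances.

```python
from typing import Callable, List


def select_handlers(
    rank: int,
    every_rank: List[Callable[[], object]],
    rank0_only: List[Callable[[], object]],
) -> List[object]:
    """Build the handlers for this process.

    Factories are used so that rank-0-only handlers (logging, checkpointing)
    are never even constructed on the other ranks.
    """
    handlers = [make() for make in every_rank]
    if rank == 0:
        handlers.extend(make() for make in rank0_only)
    return handlers


# Placeholder factories (assumption): real code would return handler objects.
every_rank = [lambda: "metric handler"]
rank0_only = [lambda: "StatsHandler", lambda: "CheckpointSaver"]

print(select_handlers(0, every_rank, rank0_only))
print(select_handlers(1, every_rank, rank0_only))
```

Handlers that take part in cross-process metric aggregation stay in the `every_rank` list, since every process has to contribute its share of the data.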

8 replies

wyli Nov 29, 2022
Collaborator

For when to use a barrier, perhaps this discussion answers your question: https://discuss.pytorch.org/t/why-the-second-barrier-is-used-in-the-ddp-tutorial/87849

I'm not sure about the cuda.synchronize command. The ignite example you are looking at is from v0.3.0, which was released in early 2020 (probably tested with pytorch v1.3 or v1.4); I suspect it's still a viable example now.

hw-ju Nov 30, 2022
Author

@wyli Thanks for pointing me to the pytorch discussion link, it's helpful!

You're right, that ignite example does not use the latest ignite. The reason I'm looking at it is that it's the only official example I can find that uses ignite together with the native torch distributed API for DDP (i.e., manually setting up the distributed process group, wrapping the model with nn.parallel.DistributedDataParallel, and executing the script with the torch.distributed.launch tool), which is the same way the dynunet pipeline tutorial sets up DDP.

For later versions of ignite, starting from v0.4.0, all the examples I can find use ignite.distributed (instead of the native torch distributed API) for DDP. They also no longer seem to use torch.cuda.synchronize; for example, there is no torch.cuda.synchronize() call in the custom event handler function run_validation, see https://github.com/pytorch/ignite/blob/master/examples/contrib/cifar10/main.py#L90.

Since the dynunet pipeline tutorial uses the ignite style of training together with the native torch distributed API, I wonder whether monai.Trainer or monai.Handler has any extra implementation that takes care of synchronization for DDP training and validation, beyond the synchronization offered by the native pytorch DistributedDataParallel, when DDP is run with the native torch distributed API (instead of ignite.distributed).

wyli Nov 30, 2022
Collaborator

I can see that the dynunet tutorial uses the MeanDice handler, which has a metric aggregation step that requires results from all processes: https://github.com/Project-MONAI/MONAI/blob/d0db5fd9d4da3bc7027d22c52bb9c405c3b9e879/monai/handlers/ignite_metric.py#L90
If that works fine during your runs, it means the delays among processes are within the allowed time (1800s by default). Apart from this and the DDP model you mentioned, I don't currently see any other parts that require or ensure synchronization.
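
To make the aggregation step concrete, here is a toy simulation in plain Python. No real inter-process communication takes place, and the per-rank Dice scores are made-up numbers; it only illustrates why a value aggregated over all ranks can differ from what rank 0 would compute alone.

```python
from typing import List, Sequence


def mean(values: Sequence[float]) -> float:
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def aggregate_across_ranks(per_rank_scores: List[List[float]]) -> float:
    """Emulate gather-then-reduce: pool every rank's per-sample scores, then average."""
    gathered = [score for rank_scores in per_rank_scores for score in rank_scores]
    return mean(gathered)


# Hypothetical per-sample Dice scores held by two ranks after validation.
per_rank_scores = [[0.50, 0.75], [0.25, 0.50]]

rank0_only = mean(per_rank_scores[0])
all_ranks = aggregate_across_ranks(per_rank_scores)
print(f"rank 0 only: {rank0_only:.4f}, all ranks: {all_ranks:.4f}")
```

In a real DDP run the gather is a collective operation, so every rank has to reach it before any rank can continue. That is why the aggregation acts as a synchronization point, subject to the process group timeout mentioned above.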

hw-ju Nov 30, 2022
Author

@wyli Great! Thanks a lot! :D

hw-ju Dec 21, 2022
Author

Hi @yiheng-wang-nv! Thanks for modifying the tutorial script! I have a quick question: the modified script https://github.com/Project-MONAI/tutorials/pull/1078/files# uses ignite.distributed.get_rank(). Can it be safely replaced by torch.distributed.get_rank()?
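
This question is not answered in the thread. As a hedged sketch only: torch.distributed.get_rank() expects an initialized default process group, so a script that may also run on a single GPU without one could guard the call. `safe_rank` is a hypothetical helper name, and falling back to rank 0 is an assumption about the desired single-process behaviour.

```python
def safe_rank() -> int:
    """Return this process's rank, or 0 when no process group is active."""
    try:
        import torch.distributed as dist
    except ImportError:
        # torch is not installed; treat the run as a single process.
        return 0
    # is_available() is checked first because is_initialized() only exists
    # in builds that ship the distributed package.
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank()
    return 0


if safe_rank() == 0:
    print("rank 0: attach logging and checkpoint handlers here")
```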

hw-ju · 2023-01-27T03:35:20Z

Author

Hi @yiheng-wang-nv! Could you help with the two questions below?

When I run /dynunet_pipeline/train.py on a single GPU, I get the following extra output at the end of each epoch:

2023-01-26 22:09:39,773 - Key metric: None best value: -1 at epoch: -1
2023-01-26 22:09:39,773 - Epoch[1] Complete. Time taken: 00:00:42.701

For example, they appear in the output below:

2023-01-26 22:09:30,366 - Epoch: 1/5, Iter: 1/23 -- train_loss: 2.4818
2023-01-26 22:09:30,429 - Epoch: 1/5, Iter: 2/23 -- train_loss: 1.8322 
2023-01-26 22:09:30,476 - Epoch: 1/5, Iter: 3/23 -- train_loss: 1.4205 
2023-01-26 22:09:30,522 - Epoch: 1/5, Iter: 4/23 -- train_loss: 1.2811 
2023-01-26 22:09:30,594 - Epoch: 1/5, Iter: 5/23 -- train_loss: 1.2181 
2023-01-26 22:09:30,636 - Epoch: 1/5, Iter: 6/23 -- train_loss: 1.1845 
2023-01-26 22:09:30,678 - Epoch: 1/5, Iter: 7/23 -- train_loss: 1.1818 
2023-01-26 22:09:30,719 - Epoch: 1/5, Iter: 8/23 -- train_loss: 1.1293 
2023-01-26 22:09:30,781 - Epoch: 1/5, Iter: 9/23 -- train_loss: 1.1004 
2023-01-26 22:09:30,848 - Epoch: 1/5, Iter: 10/23 -- train_loss: 1.0621 
2023-01-26 22:09:30,916 - Epoch: 1/5, Iter: 11/23 -- train_loss: 1.0695 
2023-01-26 22:09:30,981 - Epoch: 1/5, Iter: 12/23 -- train_loss: 1.0460 
2023-01-26 22:09:31,043 - Epoch: 1/5, Iter: 13/23 -- train_loss: 1.0227 
2023-01-26 22:09:31,130 - Epoch: 1/5, Iter: 14/23 -- train_loss: 1.0128 
2023-01-26 22:09:31,172 - Epoch: 1/5, Iter: 15/23 -- train_loss: 0.9831 
2023-01-26 22:09:31,246 - Epoch: 1/5, Iter: 16/23 -- train_loss: 0.9760 
2023-01-26 22:09:31,312 - Epoch: 1/5, Iter: 17/23 -- train_loss: 0.9469 
2023-01-26 22:09:31,383 - Epoch: 1/5, Iter: 18/23 -- train_loss: 0.9304 
2023-01-26 22:09:31,454 - Epoch: 1/5, Iter: 19/23 -- train_loss: 0.9177 
2023-01-26 22:09:31,516 - Epoch: 1/5, Iter: 20/23 -- train_loss: 0.8778 
2023-01-26 22:09:31,577 - Epoch: 1/5, Iter: 21/23 -- train_loss: 0.8785 
2023-01-26 22:09:31,638 - Epoch: 1/5, Iter: 22/23 -- train_loss: 0.8407 
2023-01-26 22:09:31,701 - Epoch: 1/5, Iter: 23/23 -- train_loss: 0.8926 
2023-01-26 22:09:31,701 - Engine run resuming from iteration 0, epoch 0 until 1 epochs
2023-01-26 22:09:39,642 - Got new best metric of val_mean_dice: 0.0
2023-01-26 22:09:39,643 - Epoch[1] Metrics -- val_mean_dice: 0.0000 
2023-01-26 22:09:39,643 - Key metric: val_mean_dice best value: 0.0 at epoch: 1
2023-01-26 22:09:39,744 - Epoch[1] Complete. Time taken: 00:00:08.022
2023-01-26 22:09:39,744 - Engine run complete. Time taken: 00:00:08.043
2023-01-26 22:09:39,773 - Key metric: None best value: -1 at epoch: -1
2023-01-26 22:09:39,773 - Epoch[1] Complete. Time taken: 00:00:42.701
2023-01-26 22:09:41,975 - Epoch: 2/5, Iter: 1/23 -- train_loss: 0.8395 
2023-01-26 22:09:42,088 - Epoch: 2/5, Iter: 2/23 -- train_loss: 0.8449 
2023-01-26 22:09:42,136 - Epoch: 2/5, Iter: 3/23 -- train_loss: 0.8083 
2023-01-26 22:09:42,184 - Epoch: 2/5, Iter: 4/23 -- train_loss: 0.7733 
2023-01-26 22:09:42,246 - Epoch: 2/5, Iter: 5/23 -- train_loss: 0.7907 
2023-01-26 22:09:42,338 - Epoch: 2/5, Iter: 6/23 -- train_loss: 0.7780 
2023-01-26 22:09:42,379 - Epoch: 2/5, Iter: 7/23 -- train_loss: 0.7505 
2023-01-26 22:09:42,421 - Epoch: 2/5, Iter: 8/23 -- train_loss: 0.7242 
2023-01-26 22:09:42,487 - Epoch: 2/5, Iter: 9/23 -- train_loss: 0.7244 
2023-01-26 22:09:42,549 - Epoch: 2/5, Iter: 10/23 -- train_loss: 0.7059 
2023-01-26 22:09:42,610 - Epoch: 2/5, Iter: 11/23 -- train_loss: 0.7084 
2023-01-26 22:09:42,682 - Epoch: 2/5, Iter: 12/23 -- train_loss: 0.6883 
2023-01-26 22:09:42,755 - Epoch: 2/5, Iter: 13/23 -- train_loss: 0.6935 
2023-01-26 22:09:42,830 - Epoch: 2/5, Iter: 14/23 -- train_loss: 0.6855 
2023-01-26 22:09:42,905 - Epoch: 2/5, Iter: 15/23 -- train_loss: 0.6419 
2023-01-26 22:09:42,969 - Epoch: 2/5, Iter: 16/23 -- train_loss: 0.6564 
2023-01-26 22:09:43,031 - Epoch: 2/5, Iter: 17/23 -- train_loss: 0.6395 
2023-01-26 22:09:43,092 - Epoch: 2/5, Iter: 18/23 -- train_loss: 0.6311 
2023-01-26 22:09:43,153 - Epoch: 2/5, Iter: 19/23 -- train_loss: 0.6220 
2023-01-26 22:09:43,214 - Epoch: 2/5, Iter: 20/23 -- train_loss: 0.6086 
2023-01-26 22:09:43,276 - Epoch: 2/5, Iter: 21/23 -- train_loss: 0.6378 
2023-01-26 22:09:43,338 - Epoch: 2/5, Iter: 22/23 -- train_loss: 0.6074 
2023-01-26 22:09:43,399 - Epoch: 2/5, Iter: 23/23 -- train_loss: 0.6151 
2023-01-26 22:09:43,400 - Engine run resuming from iteration 0, epoch 1 until 2 epochs
2023-01-26 22:09:48,496 - Got new best metric of val_mean_dice: 0.4782697260379791
2023-01-26 22:09:48,497 - Epoch[2] Metrics -- val_mean_dice: 0.4783 
2023-01-26 22:09:48,497 - Key metric: val_mean_dice best value: 0.4782697260379791 at epoch: 2
2023-01-26 22:09:48,581 - Epoch[2] Complete. Time taken: 00:00:05.158
2023-01-26 22:09:48,581 - Engine run complete. Time taken: 00:00:05.181
2023-01-26 22:09:48,613 - Key metric: None best value: -1 at epoch: -1
2023-01-26 22:09:48,613 - Epoch[2] Complete. Time taken: 00:00:08.839

Shall we add

if multi_gpu_flag:

before lines https://github.com/Project-MONAI/tutorials/blob/main/modules/dynunet_pipeline/train.py#L100 and https://github.com/Project-MONAI/tutorials/blob/main/modules/dynunet_pipeline/train.py#L230?

2 replies

yiheng-wang-nv Jan 30, 2023
Collaborator

Hi @hw-ju,
For 1), this is because key_train_metric of DynUNetTrainer is None, so the trainer has no key metric of its own to report at the end of each epoch.
For 2), you are right. Let me add the missing flag, thanks!

hw-ju Feb 1, 2023
Author

@yiheng-wang-nv Thanks for your explanation :)!
