Skip to content

670 add bundle example for multi-gpu training #673

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Apr 29, 2022
Merged

Conversation

Nic-Ma
Copy link
Contributor

@Nic-Ma Nic-Ma commented Apr 25, 2022

Fixes #670 .

Description

This PR added multi-gpu training example in the spleen bundle.

Status

Ready

Checks

  • Notebook runs automatically ./runner [-p <regex_pattern>]

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Apr 26, 2022

Depends on Project-MONAI/MONAI#4168.

Thanks.

@Nic-Ma Nic-Ma changed the title [WIP] 670 add bundle example for multi-gpu training 670 add bundle example for multi-gpu training Apr 26, 2022
@Nic-Ma Nic-Ma marked this pull request as ready for review April 26, 2022 16:35
@Nic-Ma Nic-Ma requested review from ericspod, rijobro and wyli April 26, 2022 16:36
@Nic-Ma Nic-Ma changed the title 670 add bundle example for multi-gpu training [WIP] 670 add bundle example for multi-gpu training Apr 27, 2022
@Nic-Ma Nic-Ma force-pushed the 670-support-multigpu branch from 67d1d33 to a7e0ed1 Compare April 27, 2022 11:40
@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Apr 27, 2022

Hi @wyli @ericspod ,

I added 2 methods to support multi-gpu training in this PR:
(1) Newly defined config train_multi_gpu.json which contains all the logic.
(2) A much simpler config multi_gpu_train.json which only overrides the existing train.json config.

Which method do you prefer?

Thanks in advance.

@ericspod
Copy link
Member

I think providing the override and an example of its usage would be a better idea. The override mechanism is quite powerful but I imagine for beginners it's not clear how it's meant to be used.

@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Apr 28, 2022

Depends on: Project-MONAI/MONAI#4192.

Thanks.

@Nic-Ma Nic-Ma changed the title [WIP] 670 add bundle example for multi-gpu training 670 add bundle example for multi-gpu training Apr 28, 2022
@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Apr 28, 2022

Hi @ericspod @wyli ,

Thanks for your review and comments.
I updated the PR to make it clearer with the multiple configs method instead of the "implicit" args_file method:
--config_file [configs/train.json, configs/multi_gpu_train.json]
It depends on this minor PR: Project-MONAI/MONAI#4192.
Could you please help review it again?

Thanks in advance.

@Nic-Ma Nic-Ma changed the title 670 add bundle example for multi-gpu training [WIP] 670 add bundle example for multi-gpu training Apr 28, 2022
@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Apr 28, 2022

I will also add the evaluation config in this PR soon.

Thanks.

@ericspod
Copy link
Member

I approved and committed Project-MONAI/MONAI#4192

@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Apr 28, 2022

I approved and committed Project-MONAI/MONAI#4192

Thanks @ericspod ,

@wyli @ericspod Then this PR is ready for review now.
I added multi-gpu training, evaluate configs based on override, and also simplified the inference config as we have completed the configs of this bundle now.

Thanks.

@Nic-Ma Nic-Ma changed the title [WIP] 670 add bundle example for multi-gpu training 670 add bundle example for multi-gpu training Apr 28, 2022
@ericspod
Copy link
Member

I tried reviewing through ReviewNB, I'm not sure it's as coherent as I'd hoped.

When you have a structure like:

"train":{
...some defs...
}

The "train" value is interpreted as a dict by the parser correct? So in places where you have "train#loss" or something like it that sets that entry in that dictionary. I think that needs to be explained in a bit more detail to make sense to the reader and to understand what to use it for.

Copy link
Member

@ericspod ericspod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had commented on a few changes to text and a question on references to members of dictionaries (I think) like "validate#postprocessing" that we should explain at little more.

@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Apr 28, 2022

I tried reviewing through ReviewNB, I'm not sure it's as coherent as I'd hoped.

When you have a structure like:

"train":{
...some defs...
}

The "train" value is interpreted as a dict by the parser correct? So in places where you have "train#loss" or something like it that sets that entry in that dictionary. I think that needs to be explained in a bit more detail to make sense to the reader and to understand what to use it for.

Good point, I added more description for it in the dataset and dataloader section.

Thanks.

@Nic-Ma
Copy link
Contributor Author

Nic-Ma commented Apr 28, 2022

Hi @ericspod ,

Thanks for your review and comments, I updated the PR according to them.
Could you please help review again?

Thanks in advance.

Copy link
Member

@ericspod ericspod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@wyli wyli merged commit b66354a into master Apr 29, 2022
@wyli wyli deleted the 670-support-multigpu branch April 29, 2022 11:52
boneseva pushed a commit to boneseva/MONAI-tutorials that referenced this pull request Apr 21, 2024
* [DLMED] draft config

Signed-off-by: Nic Ma <[email protected]>

* [DLMED] update for test

Signed-off-by: Nic Ma <[email protected]>

* [DLMED] update based on enhancement

Signed-off-by: Nic Ma <[email protected]>

* [DLMED] update tutorial

Signed-off-by: Nic Ma <[email protected]>

* [DLMED] simplify to override

Signed-off-by: Nic Ma <[email protected]>

* [DLMED] update according to comments

Signed-off-by: Nic Ma <[email protected]>

* [DLMED] remove test file

Signed-off-by: Nic Ma <[email protected]>

* [DLMED] add evaluation config

Signed-off-by: Nic Ma <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [DLMED] simplify inference

Signed-off-by: Nic Ma <[email protected]>

* [DLMED] update according to comments

Signed-off-by: Nic Ma <[email protected]>

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add multi-gpu example in the spleen segmentation bundle
3 participants