Update the fast training with profiling tool #757


Merged: 15 commits, Jul 5, 2022

Conversation

yuchen-xu
Contributor

@yuchen-xu yuchen-xu commented Jun 16, 2022

Signed-off-by: Yuchen Xu [email protected]

Fixes #729 .

Description

Made changes in acceleration/fast_training_tutorial.ipynb and added a profiling option. When profiling is off, the notebook runs as before. When profiling is on, the notebook must be run from the terminal as one piece (command provided), since Nsight Systems does not support being launched from inside the script.
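For illustration, the terminal invocation could look roughly like this (the flags and output name here are my assumption; the notebook itself documents the exact command to use):

```shell
# Execute the whole notebook under Nsight Systems (illustrative sketch only;
# see the notebook for the exact command it provides).
nsys profile -o fast_training_profile \
    jupyter nbconvert --to notebook --execute acceleration/fast_training_tutorial.ipynb
```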

Status

Ready

Checks

Signed-off-by: Yuchen Xu <[email protected]>

@yuchen-xu yuchen-xu changed the title draft for profiling [WIP] draft for profiling Jun 16, 2022
@yuchen-xu yuchen-xu marked this pull request as ready for review June 22, 2022 22:37
@Nic-Ma Nic-Ma requested a review from wyli June 23, 2022 02:07
@Nic-Ma
Contributor

Nic-Ma commented Jun 23, 2022

I think the outputs folder can be removed?
@wyli Could you please help take a look at this PR first? You raised the feature request.

Thanks in advance.

@wyli
Contributor

wyli commented Jun 23, 2022

thanks, it looks great. cc @AHarouni

some minor points perhaps beyond this profiling example:

@yuchen-xu
Contributor Author

@Nic-Ma I think the outputs/ folder makes outputs easier to manage (so that they don't live in the same directory as all the other tutorials), and allows users to customize where they want the outputs to go. Open to more discussion/opinions.

@yuchen-xu
Contributor Author

@wyli Thanks for the comments!

  • Could you clarify a little further what kinds of insights you are looking for?
  • The Thread Worker flag seems to be in development and not part of MONAI yet?
  • It would be interesting to try InstanceNorm indeed, but as you noted, they go beyond the profiling example - shall we open a new issue along the lines of "integrating latest features into fast training tutorial"?

@wyli
Contributor

wyli commented Jun 23, 2022

sure, they're probably out of the scope of this PR, but to clarify:

  • Could you clarify a little further what kinds of insights you are looking for?

such as which module is currently the main bottleneck and, if we look for further performance gains, where we should focus

  • The Thread Worker flag seems to be in development and not part of MONAI yet?

it's in v0.9

  • It would be interesting to try InstanceNorm indeed, but as you noted, they go beyond the profiling example - shall we open a new issue along the lines of "integrating latest features into fast training tutorial"?

perhaps if you use the latest docker it's a one-line change, @yiheng-wang-nv please advise.

@yuchen-xu
Contributor Author

@wyli Thanks. I'm working on those.

@yuchen-xu
Contributor Author

yuchen-xu commented Jun 24, 2022

Hi @wyli

  • The use_thread_workers argument does not seem to be available in the MONAI v0.9.0rc2 docker (see attached image). Also, in the fast version, the main bottleneck seems to be CacheDataset (~40 seconds) rather than ThreadDataLoader (~0.002 seconds).

[image: thread workers]

  • I was able to replace batch norm with instance norm and instance_nvfuser (the one you cited above). In the first few epochs, instance norm does seem to be faster than batch norm (~10% improvement from 10 seconds), and instance_nvfuser a little faster than that, but the two instance norms result in a worse metric. I will perform some more testing and include the results in the updated version.

@Nic-Ma
Contributor

Nic-Ma commented Jun 25, 2022

Hi @yuchen-xu ,

  1. Please try the MONAI v0.9.0 docker instead of RC2. And no need to worry about the cache-preparation time.
  2. If instance_nvfuser doesn't help, there is no need to update the tutorial for it. Please do some more experiments to confirm.

Thanks in advance.
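For reference, pulling and entering the suggested container could look like this (the image tag is my assumption based on Docker Hub naming conventions; please verify before use):

```shell
# Pull the MONAI v0.9.0 image and start it with GPU access (tag assumed, not confirmed here).
docker pull projectmonai/monai:0.9.0
docker run --gpus all -it --rm -v "$(pwd)":/workspace projectmonai/monai:0.9.0 bash
```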

@yuchen-xu
Contributor Author

@wyli @Nic-Ma I performed experiments on the two changes, here are the findings:

  1. Thread workers didn't seem to improve the speed of the tutorial. In the figure, "orig" refers to the original setup (use_thread_workers=False) and "thread1/2/3/4" refers to use_thread_workers=True with num_workers=1/2/3/4. "orig" is clearly the fastest. This is further confirmed by the second part of the image (tlpo = training load per operation), the average time taken by each next() call on an iterator over the train loader; again, "orig" is fastest. Curiously, even num_workers=1 makes it worse.
    [image]

  2. Keeping use_thread_workers=False, replacing batch norm with either instance norm or instance_nvfuser makes training faster. Loss and Dice metric end up at around the same level, although the trends differ; see the figures below. Also, instance_nvfuser has a significantly slow first iteration (~10 seconds under the fast regime) but is faster on later iterations. I'm using instance_nvfuser, but will note the option to use instance norm too.
    [fig1]
    [fig2]

Each experiment is run with 600 epochs and repeated 3 times to control variance.

The next version (which I expect to push tomorrow) will also fix a minor issue in the generated graph of Dice metrics, such that the Dice and loss graphs will both show epochs going to the same number (right now the Dice graph has wrong values on the x-axis).
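As an aside, the "tlpo" numbers above can be reproduced with a small timing loop. This is my own sketch (the helper name measure_load_time is hypothetical, not code from the tutorial): average the wall-clock time of each next() call on an iterator over the loader.

```python
# Hypothetical sketch of measuring average per-iteration load time ("tlpo").
import time

def measure_load_time(loader, n_iters=100):
    """Return the mean time per next() call over at most n_iters batches."""
    it = iter(loader)
    times = []
    for _ in range(n_iters):
        t0 = time.perf_counter()
        try:
            next(it)
        except StopIteration:
            break  # loader exhausted before n_iters batches
        times.append(time.perf_counter() - t0)
    return sum(times) / len(times)

# Example with a plain list standing in for a DataLoader:
mean_t = measure_load_time([object()] * 10, n_iters=5)
```

In the real tutorial one would pass the DataLoader or ThreadDataLoader instance instead of the placeholder list.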

@wyli
Contributor

wyli commented Jun 30, 2022

Nice analysis, thanks @yuchen-xu!

Nic-Ma
Nic-Ma previously approved these changes Jun 30, 2022
Contributor

@Nic-Ma Nic-Ma left a comment


Thanks for the enhancement, it overall looks good to me.
Just some minor comments inline.

Thanks.

@Nic-Ma Nic-Ma changed the title [WIP] draft for profiling Update the fast training with profiling tool Jun 30, 2022
@Nic-Ma
Contributor

Nic-Ma commented Jun 30, 2022

Hi @yuchen-xu ,

I think something may have gone wrong during your last round of training.
The epoch-time curve of regular training looks strange; it drops abruptly from ~60 s to ~40 s:
[image]
And the fast training takes much longer to reach mean_dice=0.94 than your previous round because it uses many more epochs (105 vs 60), so maybe we should not use instance norm?
[image]

Thanks.

@Nic-Ma Nic-Ma dismissed their stale review June 30, 2022 14:17

May need one more round of updates.

@yuchen-xu
Contributor Author

@Nic-Ma Thanks for the review!
The regular epoch time is indeed a little strange; I could see that happening if someone else was also using the GPU. I'll try running it again.
As noted in an earlier post, instance_nvfuser has the fastest overall runtime but is quite close to batch norm in metric trends, while instance norm is slightly slower (though still faster than batch norm) and may have better metric trends. I'll try again with instance norm and see what we get.

@yuchen-xu
Contributor Author

As it turns out, instance_nvfuser takes 105 epochs / 258 s to reach the target Dice of 0.94, compared to 95 epochs / 144 s for instance norm and 60 epochs / 106 s for batch norm (the original). While the two instance norms have faster per-epoch times, they take significantly longer to reach our goal metric, so we will continue using batch norm in the tutorial. @wyli

@wyli
Contributor

wyli commented Jul 4, 2022

the notebook testing script will modify the max_epochs variable to speed up the tests:

tutorials/runner.sh

Lines 84 to 85 in 2da31b6

echo "MONAI tutorials testing utilities. When running the notebooks, we first search for variables, such as"
echo "\"max_epochs\" and set them to 1 to reduce testing time."

Looks like this PR is not compatible with that; please help confirm.

@Nic-Ma
Contributor

Nic-Ma commented Jul 4, 2022

There seem to be 2 minor layout issues:

  1. I think the description should start from a new line:

[image]

  2. This `span` may not be necessary:

[image]

Others look good to me.

Thanks.

@yuchen-xu
Contributor Author

@wyli Could you elaborate on what you mean by "this PR is not compatible with that"? If you mean max_epochs being changed: runner.sh runs a copy of the notebook when replacing max_epochs (see notebook=$(cat "$filename") in runner.sh), so the values in the original notebook are not changed. If you mean the notebook taking forever to run when profiling = False (since it would still try to run for 600 epochs), I made an update to address that and ran it successfully locally.
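To illustrate the idea, here is my own sketch (the function patch_max_epochs is hypothetical, not the actual runner.sh logic) of rewriting max_epochs assignments in a copy of the notebook's text while leaving the original file untouched:

```python
# Hypothetical sketch: rewrite "max_epochs" assignments in notebook text.
import re

def patch_max_epochs(notebook_text: str, new_value: int = 1) -> str:
    """Replace assignments like `max_epochs = 600` with `max_epochs = <new_value>`."""
    return re.sub(r"max_epochs\s*=\s*\d+", f"max_epochs = {new_value}", notebook_text)

src = "max_epochs = 600\nfor epoch in range(max_epochs):\n    pass\n"
print(patch_max_epochs(src).splitlines()[0])  # → max_epochs = 1
```

Because the substitution is applied to a copy of the notebook's contents, the checked-in notebook keeps its original epoch count.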

@Nic-Ma Thanks for the comments, I have fixed them.

Contributor

@Nic-Ma Nic-Ma left a comment


Thanks for the quick update.
Looks good to me now, @wyli do you have any other comments?

Thanks.

@wyli wyli enabled auto-merge (squash) July 5, 2022 06:37
@wyli wyli merged commit 61dae66 into Project-MONAI:main Jul 5, 2022
@yuchen-xu yuchen-xu mentioned this pull request Jul 9, 2022
boneseva pushed a commit to boneseva/MONAI-tutorials that referenced this pull request Apr 21, 2024
* draft for profiling

Signed-off-by: Yuchen Xu <[email protected]>

* fixed coding style errors

* ready for discussion

* checkpoint missing nsys figures

* ready for review

* addressed Nic's comments

* fixed typo

* removed output files

* intermediate commit with experiments included

* ready for review

* various updates; ready for review

* addressed style comments

Co-authored-by: Yuchen Xu <[email protected]>
Successfully merging this pull request may close these issues: adding a profiling flag