-
Notifications
You must be signed in to change notification settings - Fork 739
Update the fast training with profiling tool #757
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Yuchen Xu <[email protected]>
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
I think the outputs folder can be removed? Thanks in advance. |
thanks, it looks great. cc @AHarouni some minor points perhaps beyond this profiling example:
|
@Nic-Ma I think the outputs/ folder makes outputs easier to manage (so that they don't live in the same directory as all the other tutorials), and allows users to customize where they want the outputs to go. Open to more discussion/opinions. |
@wyli Thanks for the comments!
|
sure, they're probably out of the scope of this PR, but to clarify:
such as, which module is currently the main bottleneck and if we look for further performance gain, where should we focus on
it's in v0.9
perhaps if you use the latest docker, it's a one-line change, @yiheng-wang-nv please advice.. |
@wyli Thanks. I'm working on those. |
Hi @wyli
|
Hi @yuchen-xu ,
Thanks in advance. |
@wyli @Nic-Ma I performed experiments on the two changes, here are the findings:
Each experiment is run with 600 epochs and repeated 3 times to control variance. The next version (which I expect to push tomorrow) will also fix a minor issue in the generated graph of Dice metrics, such that the Dice and loss graphs will both show epochs going to the same number (right now the Dice graph has wrong values on the x-axis). |
Nice analysis, thanks @yuchen-xu! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the enhancement, it overall looks good to me.
Just some minor comments inline.
Thanks.
Hi @yuchen-xu , I think maybe there is something wrong during your last round training. Thanks. |
@Nic-Ma Thanks for the review! |
As it turns out, instance_nvfuser norm takes 105 epochs/258 secs to achieve the target Dice of 0.94, compared to 95 epochs/144 secs for instance norm and 60 epochs/106 secs for batch norm (original). While the two instance norms result in faster per-epoch time, they take significantly longer to achieve our goal metric. Therefore, we will continue using batch norm in the tutorial. @wyli |
…torials into 729-ftt-profiling-tag
the notebook testing script will modify Lines 84 to 85 in 2da31b6
looks like this PR is not compatible with that. please help confirm |
@wyli Could you elaborate what you mean by "this PR is not compatible with that"? If you are talking about max_epochs being changed, it seems that runner.sh is running a copy of the notebook when replacing max_epochs, so the values in the original notebook are not changed; see Line 293 in 2da31b6
@Nic-Ma Thanks for the comments, I have fixed them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the quick update.
Looks good to me now, @wyli do you have any other comments?
Thanks.
* draft for profiling Signed-off-by: Yuchen Xu <[email protected]> * fixed coding style errors * ready for discussion * checkpoint missing nsys figures * ready for review * addressed Nic's comments * fixed typo * removed output files * intermediate commit with experiments included * ready for review * various updates; ready for review * addressed style comments Co-authored-by: Yuchen Xu <[email protected]>
Signed-off-by: Yuchen Xu [email protected]
Fixes #729 .
Description
Made changes in acceleration/fast_training_tutorial.ipynb and added a profiling option. When not profiling, the notebook runs like before. When profiling, the notebook must be run in the terminal as one piece (command provided), as Nsight systems does not support running from inside the script.
Status
Ready
Checks