-
Notifications
You must be signed in to change notification settings - Fork 1.2k
fix: Disable debugger when checkpointing is enabled with distributed training #2264
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
fd9dd85
58d6913
651b662
2fa1856
2b0944e
582b64d
4ab76aa
c9cd18e
3b2338d
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
|
@@ -2219,7 +2219,21 @@ def _validate_and_set_debugger_configs(self): | |||||||||||
): | ||||||||||||
self.debugger_hook_config = DebuggerHookConfig(s3_output_path=self.output_path) | ||||||||||||
elif not self.debugger_hook_config: | ||||||||||||
self.debugger_hook_config = None | ||||||||||||
# set hook config to False if _region_supports_debugger is False | ||||||||||||
self.debugger_hook_config = False | ||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why are we changing this variable to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The intent of the line above is to disable the debug hook config. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
It doesn't make sense to change this to a boolean, since the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Documentation however indicates to customers that they need to pass
Debugger is initialized on behalf of the customer it is I'm making sure that when we check the value of post the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, you're right: sagemaker-python-sdk/src/sagemaker/tensorflow/estimator.py Lines 361 to 365 in b66cb98
Thank you for clarifying. |
||||||||||||
|
||||||||||||
# Disable debugger if checkpointing is enabled by the customer | ||||||||||||
if self.checkpoint_s3_uri and self.checkpoint_local_path and self.debugger_hook_config: | ||||||||||||
if self._framework_name in {"mxnet", "pytorch", "tensorflow"}: | ||||||||||||
if self.instance_count > 1 or ( | ||||||||||||
hasattr(self, "distribution") | ||||||||||||
and self.distribution is not None # pylint: disable=no-member | ||||||||||||
): | ||||||||||||
logger.info( | ||||||||||||
"SMDebug Does Not Currently Support \ | ||||||||||||
Distributed Training Jobs With Checkpointing Enabled" | ||||||||||||
) | ||||||||||||
self.debugger_hook_config = False | ||||||||||||
|
||||||||||||
def _stage_user_code_in_s3(self): | ||||||||||||
"""Upload the user training script to s3 and return the location. | ||||||||||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to understand it, so it's possible for
self.debugger_hook_config
to beNone
orFalse
. And you're changing it toFalse
intentionally to disable it?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default value of
self.debugger_hook_config
is None.The if-statement above checks if the debugger is supported in the region in which it is invoked and if not it should be set to
False
instead ofNone
in my opinion.