-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Local mode improvements #117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Switch local mode to use nvidia-docker2 + docker-compose instead of nvidia-docker + nvidia-docker-compose. Also get rid of the log output for billable seconds in local mode as it doesn't make sense. Finally, let the users know if we can't cleanup a container directory but not fail the training job.
Codecov Report
@@ Coverage Diff @@
## master #117 +/- ##
=========================================
- Coverage 90.4% 90.2% -0.21%
=========================================
Files 36 36
Lines 2407 2419 +12
=========================================
+ Hits 2176 2182 +6
- Misses 231 237 +6
Continue to review full report at Codecov.
|
src/sagemaker/local/image.py
Outdated
@@ -120,9 +120,16 @@ def train(self, input_data_config, hyperparameters): | |||
shutil.rmtree(data_dir) | |||
# Also free the container config files. | |||
for host in self.hosts: | |||
shutil.rmtree(os.path.join(self.container_root, host)) | |||
container_config_path = os.path.join(self.container_root, host) | |||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you factor out this cleanup logic into a function? We don't know how long it'll stay around, and it might evolve over time, and we should keep it consistent between the various places.
src/sagemaker/local/image.py
Outdated
container_config_path = os.path.join(self.container_root, host) | ||
try: | ||
shutil.rmtree(container_config_path) | ||
except OSError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are probably other ways an OSError can be thrown here, right? I don't love catching a fairly generic exception type and swallowing it without showing a stacktrace.
I think we could go two directions here - be more specific about what we catch, or just catch everything and print out this error message with a stacktrace / enough information to debug. I'm guessing there's not a great way to do the first, so maybe we should do the second? Are there any cases where we would want to rethrow an exception because we failed cleanup?
Also improved the error handling to avoid catching a lot of blanket errors.
src/sagemaker/local/image.py
Outdated
try: | ||
shutil.rmtree(path) | ||
except OSError as exc: | ||
# handle permission denied gracefully. Any other error we will raise the exception up. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put a short explanation of the docker writing-files-as-root issue here in a comment.
Fixed: README grammar edits
Switch local mode to use nvidia-docker2 + docker-compose
instead of nvidia-docker + nvidia-docker-compose.
Also get rid of the log output for billable seconds in local mode as it
doesn't make sense.
Finally, let the users know if we can't cleanup a container directory
but not fail the training job.