Skip to content

Local mode improvements #117

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Apr 2, 2018
Merged

Conversation

iquintero
Copy link
Contributor

Switch local mode to use nvidia-docker2 + docker-compose
instead of nvidia-docker + nvidia-docker-compose.

Also get rid of the log output for billable seconds in local mode as it
doesn't make sense.

Finally, let the users know if we can't cleanup a container directory
but not fail the training job.

Switch local mode to use nvidia-docker2 + docker-compose
instead of nvidia-docker + nvidia-docker-compose.

Also get rid of the log output for billable seconds in local mode as it
doesn't make sense.

Finally, let the users know if we can't cleanup a container directory
but not fail the training job.
@iquintero iquintero requested a review from owen-t April 2, 2018 00:04
@codecov-io
Copy link

codecov-io commented Apr 2, 2018

Codecov Report

Merging #117 into master will decrease coverage by 0.2%.
The diff coverage is 63.15%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #117      +/-   ##
=========================================
- Coverage    90.4%   90.2%   -0.21%     
=========================================
  Files          36      36              
  Lines        2407    2419      +12     
=========================================
+ Hits         2176    2182       +6     
- Misses        231     237       +6
Impacted Files Coverage Δ
src/sagemaker/local/local_session.py 77.52% <50%> (-0.64%) ⬇️
src/sagemaker/local/image.py 84.46% <64.7%> (-1.43%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6184b22...9a7db0e. Read the comment docs.

@@ -120,9 +120,16 @@ def train(self, input_data_config, hyperparameters):
shutil.rmtree(data_dir)
# Also free the container config files.
for host in self.hosts:
shutil.rmtree(os.path.join(self.container_root, host))
container_config_path = os.path.join(self.container_root, host)
try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you factor out this cleanup logic into a function? We don't know how long it'll stay around, and it might evolve over time, and we should keep it consistent between the various places.

container_config_path = os.path.join(self.container_root, host)
try:
shutil.rmtree(container_config_path)
except OSError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are probably other ways an OSError can be thrown here, right? I don't love catching a fairly generic exception type and swallowing it without showing a stacktrace.

I think we could go two directions here - be more specific about what we catch, or just catch everything and print out this error message with a stacktrace / enough information to debug. I'm guessing there's not a great way to do the first, so maybe we should do the second? Are there any cases where we would want to rethrow an exception because we failed cleanup?

Also improved the error handling to avoid catching a lot of blanket
errors.
try:
shutil.rmtree(path)
except OSError as exc:
# handle permission denied gracefully. Any other error we will raise the exception up.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put a short explanation of the docker writing-files-as-root issue here in a comment.

@iquintero iquintero merged commit 7ff630f into aws:master Apr 2, 2018
apacker pushed a commit to apacker/sagemaker-python-sdk that referenced this pull request Nov 15, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants