Skip to content

Add support for TensorFlow script mode and Python 3 #475

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Nov 16, 2018

Conversation

icywang86rui
Copy link
Contributor

Issue #, if available:

Description of changes:

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

  • I have read the CONTRIBUTING doc
  • I have added tests that prove my fix is effective or that my feature works (if appropriate)
  • I have updated the changelog with a description of my changes (if appropriate)
  • I have updated any necessary documentation (if appropriate)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

* Add script_mode flag to TensorFlow estimator
* Add model_dir and distributions to tf estimator
* Add unit tests
* Add integ tests
@codecov-io
Copy link

codecov-io commented Nov 15, 2018

Codecov Report

Merging #475 into master will increase coverage by 0.17%.
The diff coverage is 96.49%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #475      +/-   ##
==========================================
+ Coverage   93.82%   93.99%   +0.17%     
==========================================
  Files          58       58              
  Lines        4323     4365      +42     
==========================================
+ Hits         4056     4103      +47     
+ Misses        267      262       -5
Impacted Files Coverage Δ
src/sagemaker/mxnet/estimator.py 100% <ø> (ø) ⬆️
src/sagemaker/fw_utils.py 100% <100%> (ø) ⬆️
src/sagemaker/estimator.py 90.42% <100%> (+0.03%) ⬆️
src/sagemaker/tensorflow/estimator.py 94.65% <96.36%> (+0.81%) ⬆️
src/sagemaker/local/image.py 89.96% <0%> (+1.88%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 11d3fcf...9cd35fb. Read the comment docs.

mvsusp
mvsusp previously approved these changes Nov 15, 2018
@@ -176,6 +185,8 @@ def __init__(self, training_steps=None, evaluation_steps=None, checkpoint_path=N
py_version (str): Python version you want to use for executing your model training code (default: 'py2').
framework_version (str): TensorFlow version you want to use for executing your model training code.
List of supported versions https://github.com/aws/sagemaker-python-sdk#tensorflow-sagemaker-estimators
model_dir (str): S3 location where the checkpoint data and models can be exported to during training
(default: None). If not specified a default S3 URI will be generated.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please explain that model_dir is always passed in the training job as script mode parameter

@@ -185,21 +196,54 @@ def __init__(self, training_steps=None, evaluation_steps=None, checkpoint_path=N
Examples:
123.dkr.ecr.us-west-2.amazonaws.com/my-custom-image:1.0
custom-image:latest.
script_mode (bool): If set to True will the estimator will use the Script Mode containers (default: False).
This will be ignored if py_version is set to 'py3'.
distribution (dict): A dictionary with information on how to run distributed training
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, explain the format of the dict or link to docs explaining it

if run_tensorboard_locally:
LOGGER.warning(_SCRIPT_MODE_TENSORBOARD_WARNING.format(self.model_dir))
fit_super()
elif run_tensorboard_locally:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this is simpler?

        if run_tensorboard_locally:
            tensorboard = Tensorboard(self)
            tensorboard.validate_requirements()
              try:
                 tensorboard.start()
                 fit_super()
             finally:
                 # sleep 20 secs for tensorboard start up if fit() quits instantly
                 time.sleep(20)
                 tensorboard.event.set()
                 tensorboard.join()
         else:
            if self._script_mode_enabled():
                LOGGER.warning(_SCRIPT_MODE_TENSORBOARD_WARNING.format(self.model_dir))
             fit_super()

or even

              try:
                  if run_tensorboard_locally:
                       tensorboard = Tensorboard(self)
                       tensorboard.validate_requirements()
                       tensorboard.start()
             finally:
                  if run_tensorboard_locally:
                       # sleep 20 secs for tensorboard start up if fit() quits instantly
                       time.sleep(20)
                       tensorboard.event.set()
                       tensorboard.join()
  
                  if self._script_mode_enabled():
                       LOGGER.warning(_SCRIPT_MODE_TENSORBOARD_WARNING.format(self.model_dir))
                  fit_super()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two doesn't read much better to me. I have combined the first two ifs. Let me know if you feel strongly about this. :)

@@ -328,7 +376,7 @@ def create_model(self, model_server_workers=None, role=None,
"""

role = role or self.role
if endpoint_type == 'tensorflow-serving':
if endpoint_type == 'tensorflow-serving' or self._script_mode_enabled():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

else:
self.checkpoint_path = os.path.join(self.output_path,
self._current_job_name, 'checkpoints')
self.checkpoint_path = self.checkpoint_path or self._default_s3_path('checkpoints')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this parameter not used in script mode? Let's not set it if that is the case to avoid future breaking changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this need to be set for the framework mode containers for now. We will need to remove them in the future.

@@ -18,6 +18,7 @@ CHANGELOG
* build: added pylint
* build: upgrade docker-compose to 1.23
* enhancement: Frameworks: update warning for not setting framework_version as we aren't planning a breaking change anymore
* feature: Estimator: add script mode and Python 3 support for TensorFlow
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sure this is in the correct changelog entry. maybe even warrants a bigger version bump?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can make this decision tomorrow with the pr to bump version?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

@@ -185,21 +196,54 @@ def __init__(self, training_steps=None, evaluation_steps=None, checkpoint_path=N
Examples:
123.dkr.ecr.us-west-2.amazonaws.com/my-custom-image:1.0
custom-image:latest.
script_mode (bool): If set to True will the estimator will use the Script Mode containers (default: False).
This will be ignored if py_version is set to 'py3'.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little worried about the implicit script mode if Python 3 given the popularity of the Python 3 support for TF feature request. Maybe we should log a warning somewhere if py_version is set but not script_mode?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I will do this in the follow up PR with the doc update

return request.param


def test_mnist(sagemaker_session, instance_type):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we add one of these to the continuous testing suite?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will do that in a follow up pr. The test only works in pdx now. I need to get the container change out so the test can work in all the regions.

laurenyu
laurenyu previously approved these changes Nov 15, 2018
@icywang86rui icywang86rui changed the title TF script mode support Add support for TensorFlow script mode and Python 3 Nov 15, 2018
laurenyu
laurenyu previously approved these changes Nov 16, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants