-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Add support for TensorFlow script mode and Python 3 #475
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
* Add script_mode flag to TensorFlow estimator * Add model_dir and distributions to tf estimator * Add unit tests * Add integ tests
Codecov Report
@@ Coverage Diff @@
## master #475 +/- ##
==========================================
+ Coverage 93.82% 93.99% +0.17%
==========================================
Files 58 58
Lines 4323 4365 +42
==========================================
+ Hits 4056 4103 +47
+ Misses 267 262 -5
Continue to review full report at Codecov.
|
@@ -176,6 +185,8 @@ def __init__(self, training_steps=None, evaluation_steps=None, checkpoint_path=N | |||
py_version (str): Python version you want to use for executing your model training code (default: 'py2'). | |||
framework_version (str): TensorFlow version you want to use for executing your model training code. | |||
List of supported versions https://github.com/aws/sagemaker-python-sdk#tensorflow-sagemaker-estimators | |||
model_dir (str): S3 location where the checkpoint data and models can be exported to during training | |||
(default: None). If not specified a default S3 URI will be generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please explain that model_dir is always passed in the training job as script mode parameter
@@ -185,21 +196,54 @@ def __init__(self, training_steps=None, evaluation_steps=None, checkpoint_path=N | |||
Examples: | |||
123.dkr.ecr.us-west-2.amazonaws.com/my-custom-image:1.0 | |||
custom-image:latest. | |||
script_mode (bool): If set to True will the estimator will use the Script Mode containers (default: False). | |||
This will be ignored if py_version is set to 'py3'. | |||
distribution (dict): A dictionary with information on how to run distributed training |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, explain the format of the dict or link to docs explaining it
if run_tensorboard_locally: | ||
LOGGER.warning(_SCRIPT_MODE_TENSORBOARD_WARNING.format(self.model_dir)) | ||
fit_super() | ||
elif run_tensorboard_locally: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this is simpler?
if run_tensorboard_locally:
tensorboard = Tensorboard(self)
tensorboard.validate_requirements()
try:
tensorboard.start()
fit_super()
finally:
# sleep 20 secs for tensorboard start up if fit() quits instantly
time.sleep(20)
tensorboard.event.set()
tensorboard.join()
else:
if self._script_mode_enabled():
LOGGER.warning(_SCRIPT_MODE_TENSORBOARD_WARNING.format(self.model_dir))
fit_super()
or even
try:
if run_tensorboard_locally:
tensorboard = Tensorboard(self)
tensorboard.validate_requirements()
tensorboard.start()
finally:
if run_tensorboard_locally:
# sleep 20 secs for tensorboard start up if fit() quits instantly
time.sleep(20)
tensorboard.event.set()
tensorboard.join()
if self._script_mode_enabled():
LOGGER.warning(_SCRIPT_MODE_TENSORBOARD_WARNING.format(self.model_dir))
fit_super()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two doesn't read much better to me. I have combined the first two ifs. Let me know if you feel strongly about this. :)
@@ -328,7 +376,7 @@ def create_model(self, model_server_workers=None, role=None, | |||
""" | |||
|
|||
role = role or self.role | |||
if endpoint_type == 'tensorflow-serving': | |||
if endpoint_type == 'tensorflow-serving' or self._script_mode_enabled(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
else: | ||
self.checkpoint_path = os.path.join(self.output_path, | ||
self._current_job_name, 'checkpoints') | ||
self.checkpoint_path = self.checkpoint_path or self._default_s3_path('checkpoints') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this parameter not used in script mode? Let's not set it if that is the case to avoid future breaking changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this need to be set for the framework mode containers for now. We will need to remove them in the future.
@@ -18,6 +18,7 @@ CHANGELOG | |||
* build: added pylint | |||
* build: upgrade docker-compose to 1.23 | |||
* enhancement: Frameworks: update warning for not setting framework_version as we aren't planning a breaking change anymore | |||
* feature: Estimator: add script mode and Python 3 support for TensorFlow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sure this is in the correct changelog entry. maybe even warrants a bigger version bump?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can make this decision tomorrow with the pr to bump version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
@@ -185,21 +196,54 @@ def __init__(self, training_steps=None, evaluation_steps=None, checkpoint_path=N | |||
Examples: | |||
123.dkr.ecr.us-west-2.amazonaws.com/my-custom-image:1.0 | |||
custom-image:latest. | |||
script_mode (bool): If set to True will the estimator will use the Script Mode containers (default: False). | |||
This will be ignored if py_version is set to 'py3'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little worried about the implicit script mode if Python 3 given the popularity of the Python 3 support for TF feature request. Maybe we should log a warning somewhere if py_version
is set but not script_mode
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I will do this in the follow up PR with the doc update
return request.param | ||
|
||
|
||
def test_mnist(sagemaker_session, instance_type): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we add one of these to the continuous testing suite?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will do that in a follow up pr. The test only works in pdx now. I need to get the container change out so the test can work in all the regions.
…r_raw less generic
Issue #, if available:
Description of changes:
Merge Checklist
Put an
x
in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.