Add support for TensorFlow script mode and Python 3 #475

icywang86rui · 2018-11-13T04:15:57Z

Issue #, if available:

Description of changes:

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

I have read the CONTRIBUTING doc
I have added tests that prove my fix is effective or that my feature works (if appropriate)
I have updated the changelog with a description of my changes (if appropriate)
I have updated any necessary documentation (if appropriate)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

* Add script_mode flag to TensorFlow estimator
* Add model_dir and distributions to tf estimator
* Add unit tests
* Add integ tests

src/sagemaker/tensorflow/estimator.py

tests/data/tensorflow_mnist/mnist.py

tests/unit/test_tf_estimator.py

tests/integ/test_tf_script_mode.py

CHANGELOG.rst

src/sagemaker/fw_utils.py

codecov-io · 2018-11-15T22:00:35Z

Codecov Report

Merging #475 into master will increase coverage by 0.17%.
The diff coverage is 96.49%.

@@            Coverage Diff             @@
##           master     #475      +/-   ##
==========================================
+ Coverage   93.82%   93.99%   +0.17%     
==========================================
  Files          58       58              
  Lines        4323     4365      +42     
==========================================
+ Hits         4056     4103      +47     
+ Misses        267      262       -5

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/sagemaker/mxnet/estimator.py | `100% <ø> (ø)` | ⬆️ |
| src/sagemaker/fw_utils.py | `100% <100%> (ø)` | ⬆️ |
| src/sagemaker/estimator.py | `90.42% <100%> (+0.03%)` | ⬆️ |
| src/sagemaker/tensorflow/estimator.py | `94.65% <96.36%> (+0.81%)` | ⬆️ |
| src/sagemaker/local/image.py | `89.96% <0%> (+1.88%)` | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 11d3fcf...9cd35fb.

mvsusp · 2018-11-15T21:59:58Z

src/sagemaker/tensorflow/estimator.py

@@ -176,6 +185,8 @@ def __init__(self, training_steps=None, evaluation_steps=None, checkpoint_path=N
            py_version (str): Python version you want to use for executing your model training code (default: 'py2').
            framework_version (str): TensorFlow version you want to use for executing your model training code.
                List of supported versions https://github.com/aws/sagemaker-python-sdk#tensorflow-sagemaker-estimators
+            model_dir (str): S3 location where the checkpoint data and models can be exported to during training
+                (default: None). If not specified a default S3 URI will be generated.


Please explain that model_dir is always passed to the training job as a script mode parameter.
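To make the request concrete, here is a minimal sketch of how a script mode training script might read that parameter. It assumes `model_dir` reaches the script as a `--model_dir` command-line argument, which this thread does not confirm; the S3 URI is a made-up placeholder and `parse_args` is a hypothetical helper, not part of the SDK.

```python
import argparse


def parse_args(argv=None):
    """Parse the arguments a script mode entry point might receive."""
    parser = argparse.ArgumentParser()
    # Assumption: the estimator forwards model_dir as a --model_dir argument.
    parser.add_argument("--model_dir", type=str, default=None)
    # parse_known_args tolerates hyperparameters this script does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Placeholder values for illustration only.
args = parse_args(["--model_dir", "s3://example-bucket/job-name/model", "--epochs", "1"])
print(args.model_dir)
```

Using `parse_known_args` rather than `parse_args` keeps the script from failing when the training job passes extra arguments it does not recognize.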

mvsusp · 2018-11-15T22:00:44Z

src/sagemaker/tensorflow/estimator.py

@@ -185,21 +196,54 @@ def __init__(self, training_steps=None, evaluation_steps=None, checkpoint_path=N
                    Examples:
                        123.dkr.ecr.us-west-2.amazonaws.com/my-custom-image:1.0
                        custom-image:latest.
+            script_mode (bool): If set to True, the estimator will use the Script Mode containers (default: False).
+                This will be ignored if py_version is set to 'py3'.
+            distribution (dict): A dictionary with information on how to run distributed training


Please explain the format of the dict, or link to docs that explain it.
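The thread itself never spells the format out. As a sketch only, the example below assumes a parameter-server style dictionary of the shape `{'parameter_server': {'enabled': True}}`; treat that shape as an assumption until the docs confirm it. `parameter_server_enabled` is a hypothetical helper written for this illustration.

```python
def parameter_server_enabled(distribution):
    """Return True if the (assumed) parameter-server option is switched on."""
    if not distribution:
        return False
    return bool(distribution.get("parameter_server", {}).get("enabled", False))


# Assumed shape; not confirmed by this review thread.
distribution = {"parameter_server": {"enabled": True}}
print(parameter_server_enabled(distribution))
```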

mvsusp · 2018-11-15T22:06:52Z

src/sagemaker/tensorflow/estimator.py

+            if run_tensorboard_locally:
+                LOGGER.warning(_SCRIPT_MODE_TENSORBOARD_WARNING.format(self.model_dir))
+            fit_super()
+        elif run_tensorboard_locally:


Maybe this is simpler?

    if run_tensorboard_locally:
        tensorboard = Tensorboard(self)
        tensorboard.validate_requirements()
        try:
            tensorboard.start()
            fit_super()
        finally:
            # sleep 20 secs for tensorboard start up if fit() quits instantly
            time.sleep(20)
            tensorboard.event.set()
            tensorboard.join()
    else:
        if self._script_mode_enabled():
            LOGGER.warning(_SCRIPT_MODE_TENSORBOARD_WARNING.format(self.model_dir))
        fit_super()

or even

    try:
        if run_tensorboard_locally:
            tensorboard = Tensorboard(self)
            tensorboard.validate_requirements()
            tensorboard.start()
    finally:
        if run_tensorboard_locally:
            # sleep 20 secs for tensorboard start up if fit() quits instantly
            time.sleep(20)
            tensorboard.event.set()
            tensorboard.join()
    if self._script_mode_enabled():
        LOGGER.warning(_SCRIPT_MODE_TENSORBOARD_WARNING.format(self.model_dir))
    fit_super()

These two don't read much better to me. I have combined the first two ifs. Let me know if you feel strongly about this. :)
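For readers following the control flow under discussion, here is a self-contained sketch of the underlying pattern: run a blocking call alongside a background thread, and always stop the thread in a `finally` block. `BackgroundMonitor` and `fit_with_monitor` are hypothetical stand-ins for the Tensorboard helper and `fit()`; they are not the SDK's classes, and the grace period defaults to zero here rather than the 20 seconds used in the PR.

```python
import threading
import time


class BackgroundMonitor(threading.Thread):
    """Hypothetical stand-in for the Tensorboard helper thread."""

    def __init__(self):
        super().__init__(daemon=True)
        self.event = threading.Event()  # set to ask the thread to stop

    def run(self):
        # Poll until the caller signals shutdown.
        while not self.event.is_set():
            self.event.wait(0.01)


def fit_with_monitor(fit, run_monitor, startup_grace=0.0):
    """Run fit(), optionally alongside a monitor thread that is always stopped."""
    if not run_monitor:
        return fit()
    monitor = BackgroundMonitor()
    monitor.start()
    try:
        return fit()
    finally:
        # Grace period in case fit() returns before the monitor is up.
        time.sleep(startup_grace)
        monitor.event.set()
        monitor.join()


result = fit_with_monitor(lambda: "trained", run_monitor=True)
print(result)
```

Keeping the shutdown in `finally` means the thread is joined even when the blocking call raises, which is the property both suggested variants try to preserve.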

mvsusp · 2018-11-15T22:07:45Z

src/sagemaker/tensorflow/estimator.py

@@ -328,7 +376,7 @@ def create_model(self, model_server_workers=None, role=None,
        """

        role = role or self.role
-        if endpoint_type == 'tensorflow-serving':
+        if endpoint_type == 'tensorflow-serving' or self._script_mode_enabled():


mvsusp · 2018-11-15T22:10:06Z

src/sagemaker/tensorflow/estimator.py

-            else:
-                self.checkpoint_path = os.path.join(self.output_path,
-                                                    self._current_job_name, 'checkpoints')
+        self.checkpoint_path = self.checkpoint_path or self._default_s3_path('checkpoints')


Is this parameter not used in script mode? Let's not set it if that is the case to avoid future breaking changes.

This needs to be set for the framework mode containers for now. We will need to remove them in the future.
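The body of `_default_s3_path` is not shown in this thread. The sketch below assumes it keeps the layout of the removed lines (output path, then job name, then a sub-directory); `default_s3_path` is a hypothetical free function, and the bucket and job names are placeholders.

```python
import posixpath


def default_s3_path(output_path, job_name, directory):
    """Join an output prefix, job name, and sub-directory into one S3 URI."""
    # posixpath keeps forward slashes on every platform, which S3 URIs need.
    return posixpath.join(output_path, job_name, directory)


explicit = None  # stands in for a user-supplied checkpoint_path
checkpoint_path = explicit or default_s3_path(
    "s3://example-bucket/output", "job-1", "checkpoints"
)
print(checkpoint_path)
```

The `explicit or default` form mirrors the one-line replacement in the diff: a user-supplied value wins, and the generated path is only a fallback.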

laurenyu · 2018-11-15T22:29:49Z

CHANGELOG.rst

@@ -18,6 +18,7 @@ CHANGELOG
 * build: added pylint
 * build: upgrade docker-compose to 1.23
 * enhancement: Frameworks: update warning for not setting framework_version as we aren't planning a breaking change anymore
+* feature: Estimator: add script mode and Python 3 support for TensorFlow


Make sure this is in the correct changelog entry. Maybe it even warrants a bigger version bump?

Can we make this decision tomorrow, with the PR that bumps the version?

laurenyu · 2018-11-15T22:33:14Z

src/sagemaker/tensorflow/estimator.py

@@ -185,21 +196,54 @@ def __init__(self, training_steps=None, evaluation_steps=None, checkpoint_path=N
                    Examples:
                        123.dkr.ecr.us-west-2.amazonaws.com/my-custom-image:1.0
                        custom-image:latest.
+            script_mode (bool): If set to True, the estimator will use the Script Mode containers (default: False).
+                This will be ignored if py_version is set to 'py3'.


I'm a little worried about script mode being implicit with Python 3, given the popularity of the Python 3 support for TF feature request. Maybe we should log a warning somewhere if py_version is set but script_mode is not?

Good point. I will do this in the follow-up PR with the doc update.
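One possible shape for the warning suggested here, as a sketch rather than what the follow-up PR actually did. It assumes, per the docstring above, that `py_version='py3'` implies script mode; `script_mode_enabled`, the logger name, and the message text are all hypothetical.

```python
import logging

LOGGER = logging.getLogger("sagemaker_sketch")


def script_mode_enabled(py_version, script_mode):
    """Return True when script mode applies, warning if it was only implied."""
    implied = py_version == "py3" and not script_mode
    if implied:
        # Hypothetical message; the SDK's real wording may differ.
        LOGGER.warning(
            "py_version='py3' implies script mode; "
            "set script_mode=True to make this explicit."
        )
    return script_mode or py_version == "py3"


print(script_mode_enabled("py3", False))
```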

laurenyu · 2018-11-15T22:34:27Z

tests/integ/test_tf_script_mode.py

+    return request.param
+
+
+def test_mnist(sagemaker_session, instance_type):


Should we add one of these to the continuous testing suite?

I will do that in a follow-up PR. The test only works in pdx now. I need to get the container change out so the test can work in all the regions.

icywang86rui added 2 commits November 12, 2018 20:10

Add script mode and py3 support for Tensorflow

429c474

* Add script_mode flag to TensorFlow estimator
* Add model_dir and distributions to tf estimator
* Add unit tests
* Add integ tests

Modify CHANGELOG

12a258b

icywang86rui requested review from mvsusp and laurenyu November 13, 2018 04:15

laurenyu reviewed Nov 13, 2018

icywang86rui added 2 commits November 13, 2018 13:26

Address PR comments

d14db32

Fix failing tests

2de88a7

laurenyu reviewed Nov 13, 2018

More minor refactoring

e9e2592

mvsusp suggested changes Nov 14, 2018

icywang86rui added 2 commits November 14, 2018 11:42

Address pr comments

b09867d

Move logger configuration to beginning of file

358a811

laurenyu reviewed Nov 14, 2018

icywang86rui added 4 commits November 14, 2018 14:27

Address pr comments

aca3958

Merge branch 'master' into tf-script-mode

311e164

Reduce batch size to avoid CUDA_ERROR_OUT_OF_MEMORY

cfea5ab

Address pr comments

75028a0

mvsusp previously approved these changes Nov 15, 2018

laurenyu reviewed Nov 15, 2018

More pr comments

8c7d644

icywang86rui dismissed mvsusp’s stale review via 8c7d644 November 15, 2018 22:38

Merge branch 'master' into tf-script-mode

9cd35fb

laurenyu previously approved these changes Nov 15, 2018

icywang86rui changed the title ~~TF script mode support~~ Add support for TensorFlow script mode and Python 3 Nov 15, 2018

icywang86rui added 4 commits November 15, 2018 15:59

Merge branch 'master' into tf-script-mode

fea9cfc

Merge branch 'master' into tf-script-mode

2b693c2

Merge branch 'master' into tf-script-mode

b5057a7

Skip flaky gpu test

b0794b8

icywang86rui dismissed laurenyu’s stale review via b0794b8 November 16, 2018 05:00

laurenyu previously approved these changes Nov 16, 2018

Make base job for test_tuning_kmeans_identical_dataset_algorithm_tuner_raw less generic

c762318

icywang86rui dismissed laurenyu’s stale review via c762318 November 16, 2018 05:37

laurenyu approved these changes Nov 16, 2018

icywang86rui merged commit 835d1af into aws:master Nov 16, 2018

This was referenced Nov 19, 2018

Consistent API Across Pytorch and Tensorflow? aws/amazon-sagemaker-examples#473

Closed

Support Python 3 for TensorFlow scripts #19

Closed
