Support MXNet 1.3 with its training script format changes #446

laurenyu · 2018-10-27T00:34:01Z

Description of changes:
This adds support for MXNet 1.3, which will come with changes in the training script format.

A note on integ tests: because we're leaving the default MXNet version as 1.2.1, I left the tuning integ test using framework mode, so at least one test (which is included in the continuous testing) still exercises that mode.

In other news, the underlying migration with our MXNet container code also means requirements.txt will be supported, which addresses #284.

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

I have read the CONTRIBUTING doc
I have added tests that prove my fix is effective or that my feature works (if appropriate)
I have updated the changelog with a description of my changes (if appropriate)
I have updated any necessary documentation (if appropriate)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

codecov-io · 2018-10-29T17:23:47Z

Codecov Report

Merging #446 into master will decrease coverage by 0.12%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #446      +/-   ##
==========================================
- Coverage   93.75%   93.63%   -0.13%     
==========================================
  Files          55       55              
  Lines        4100     4114      +14     
==========================================
+ Hits         3844     3852       +8     
- Misses        256      262       +6

Impacted Files	Coverage Δ
src/sagemaker/estimator.py	`90.57% <100%> (+0.43%)`	⬆️
src/sagemaker/mxnet/estimator.py	`100% <100%> (ø)`	⬆️
src/sagemaker/tensorflow/estimator.py	`93.57% <100%> (ø)`	⬆️
src/sagemaker/chainer/estimator.py	`100% <100%> (ø)`	⬆️
src/sagemaker/pytorch/estimator.py	`100% <100%> (ø)`	⬆️
src/sagemaker/local/image.py	`88.12% <0%> (-1.88%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3d091b4...6a44d88.
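As a sanity check on the report above, the coverage percentages can be recomputed from the hit and line counts in the diff table. This is only a sketch of the arithmetic; how Codecov rounds or truncates the displayed figures is not stated in this thread, so the printed values may differ from the report in the last digit.

```python
def coverage_percent(hits, lines):
    """Return line coverage as a percentage."""
    return 100.0 * hits / lines


# Counts copied from the coverage diff table (master vs. #446).
master_hits, master_misses, master_lines = 3844, 256, 4100
pr_hits, pr_misses, pr_lines = 3852, 262, 4114

# Hits plus misses should account for every line.
assert master_hits + master_misses == master_lines
assert pr_hits + pr_misses == pr_lines

master_cov = coverage_percent(master_hits, master_lines)
pr_cov = coverage_percent(pr_hits, pr_lines)

print('master: %.4f%%' % master_cov)
print('#446:   %.4f%%' % pr_cov)
print('change: %.4f points' % (pr_cov - master_cov))
```

The 14 added lines bring 8 new hits and 6 new misses, which is why total coverage drops slightly.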

eslesar-aws

Finished an edit pass.

eslesar-aws · 2018-10-29T17:18:34Z

src/sagemaker/mxnet/README.rst

+'''''''''''''''''''''''''''
+Your MXNet training script must be a Python 2.7 or 3.5 compatible source file.
+
+The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, such as


"...environment variables, including the following:"

Question: Is this an exhaustive list of the available environment variables? Does one exist?

there's an exhaustive list at https://github.com/aws/sagemaker-containers#list-of-provided-environment-variables-by-sagemaker-containers - I'll add that somewhere
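For illustration, here is a minimal sketch of how a training script might pick up these environment variables as argument defaults. The `--epochs` hyperparameter and the local fallback paths are hypothetical, added only so the sketch also runs outside a SageMaker container; they are not taken from the README.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse training settings, defaulting to SageMaker environment variables."""
    parser = argparse.ArgumentParser()
    # Hypothetical hyperparameter, included only for illustration.
    parser.add_argument('--epochs', type=int, default=10)
    # The './model' and './output' fallbacks are placeholders for local runs.
    parser.add_argument('--model-dir',
                        default=os.environ.get('SM_MODEL_DIR', './model'))
    parser.add_argument('--output-data-dir',
                        default=os.environ.get('SM_OUTPUT_DATA_DIR', './output'))
    return parser.parse_args(argv)


# Example: override one value on the command line, take the rest from the environment.
args = parse_args(['--epochs', '5'])
print(args.epochs, args.model_dir, args.output_data_dir)
```

Reading the variables inside `parse_args` rather than at import time keeps the function easy to test with a modified environment.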

eslesar-aws · 2018-10-29T17:26:42Z

src/sagemaker/mxnet/README.rst

+
+The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, such as
+
+* ``SM_MODEL_DIR``: A string representing the path to the directory to write model artifacts to.


"A string that represents the path where the training job writes the model artifacts."

Comment: the next sentence makes this unclear. Is SM_MODEL_DIR a directory in the container? And model artifacts are uploaded to S3 from there?

yes, that's correct. I reworded it to make it hopefully clearer - let me know if you think it needs more tweaks
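To make the flow concrete, here is a rough sketch of a script writing its artifacts into the directory named by `SM_MODEL_DIR`, from which they are then uploaded to S3 as confirmed above. The JSON file is a stand-in for real model serialization, and `save_model` and the file name are hypothetical; a real script would use MXNet's own save routines.

```python
import json
import os
import tempfile


def save_model(params, model_dir):
    """Write placeholder model artifacts into model_dir and return the file path."""
    if not os.path.isdir(model_dir):
        os.makedirs(model_dir)
    path = os.path.join(model_dir, 'model-params.json')  # hypothetical file name
    with open(path, 'w') as f:
        json.dump(params, f)
    return path


# Inside the container SM_MODEL_DIR is set; fall back to a temp dir for local runs.
model_dir = os.environ.get('SM_MODEL_DIR') or tempfile.mkdtemp()
print(save_model({'weights': [0.1, 0.2]}, model_dir))
```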

eslesar-aws · 2018-10-29T17:38:18Z

src/sagemaker/mxnet/README.rst

+* ``SM_OUTPUT_DATA_DIR``: A string representing the filesystem path to write output artifacts to. Output artifacts may include checkpoints, graphs, and other files to save, not including model artifacts.
+  These artifacts are compressed and uploaded to S3 to the same S3 prefix as the model artifacts.
+
+Supposing two input channels, 'train' and 'test', were used in the call to the MXNet estimator's ``fit`` method, the following will be set, following the format "SM_CHANNEL_[channel_name]":


Here's my attempt at restructuring this:

SM_CHANNEL_XXXX: A string that represents the path to the directory that contains the input data for the specified channel. For example, if two input channels, named 'train' and 'test', are specified in the call to the MXNet estimator's fit method, the environment variables SM_CHANNEL_TRAIN and SM_CHANNEL_TEST are set.

I like that a lot better. I changed the second sentence a little bit but added the extra bullet point and removed the paragraph/separate list
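A small sketch of the naming rule discussed here, assuming the variable name is simply `SM_CHANNEL_` followed by the upper-cased channel name, as in the 'train' and 'test' example. The helper names and the directory paths in the fake environment are illustrative only.

```python
import os


def channel_env_name(channel_name):
    """Map a channel name such as 'train' to its environment variable name."""
    return 'SM_CHANNEL_' + channel_name.upper()


def channel_dirs(channel_names, environ=None):
    """Look up the input directory for each channel; None if the variable is unset."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(channel_env_name(name)) for name in channel_names}


# Illustrative environment for a job started with 'train' and 'test' channels.
fake_env = {
    'SM_CHANNEL_TRAIN': '/opt/ml/input/data/train',
    'SM_CHANNEL_TEST': '/opt/ml/input/data/test',
}
print(channel_dirs(['train', 'test'], fake_env))
```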

icywang86rui · 2018-10-29T22:27:27Z

src/sagemaker/__init__.py

@@ -35,4 +35,4 @@
 from sagemaker.session import s3_input  # noqa: F401
 from sagemaker.session import get_execution_role  # noqa: F401

-__version__ = '1.12.0'
+__version__ = '1.13.0'


should we bump it to 2.x since this is a breaking change?

well, this PR doesn't technically have breaking changes because we're not bumping the default version of MXNet. I was going to wait until the PR that makes framework_version required.

icywang86rui · 2018-10-30T02:58:05Z

src/sagemaker/mxnet/estimator.py

+            logger.warning(empty_framework_version_warning(MXNET_VERSION))
+        self.framework_version = framework_version or MXNET_VERSION
+
+        if self._script_mode_version():


Do we still launch the parameter server with single host training?

yes, one can still use a kvstore that needs the parameter server, even with only one host
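For readers following the diff above, this is a rough sketch of the version handling it implies: fall back to the default version with a warning, then branch on whether the version is new enough for the new script format. It is not the SDK's actual implementation. `DEFAULT_MXNET_VERSION`, `LOWEST_SCRIPT_MODE_VERSION`, and both function names are hypothetical, and the 1.3 threshold is taken from the ">= 1.3.0" remark elsewhere in this thread.

```python
import logging

logger = logging.getLogger(__name__)

DEFAULT_MXNET_VERSION = '1.2.1'       # default kept by this PR, per the description
LOWEST_SCRIPT_MODE_VERSION = (1, 3)   # assumption: new script format starts at 1.3


def resolve_framework_version(framework_version=None):
    """Return the requested version, or the default (with a warning) if none given."""
    if framework_version is None:
        logger.warning('No framework_version specified; defaulting to %s.',
                       DEFAULT_MXNET_VERSION)
    return framework_version or DEFAULT_MXNET_VERSION


def is_script_mode_version(framework_version):
    """Compare major.minor numerically so that, e.g., 1.10 sorts after 1.3."""
    major_minor = tuple(int(part) for part in framework_version.split('.')[:2])
    return major_minor >= LOWEST_SCRIPT_MODE_VERSION


for requested in (None, '1.2.1', '1.3.0'):
    resolved = resolve_framework_version(requested)
    print(resolved, is_script_mode_version(resolved))
```

Comparing integer tuples instead of raw strings avoids the lexicographic trap where '1.10' would compare as lower than '1.3'.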

icywang86rui · 2018-11-05T18:25:35Z

src/sagemaker/mxnet/README.rst

+
+For versions 1.3 and higher
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Your MXNet training script must be a Python 2.7 or 3.5 compatible source file.


I am wondering if there is a good way to factor this part out since TensorFlow script mode will have this exact same document in the readme.

the original splitting of the README deliberately chose to have repeated documentation - I'd love to revisit this idea later, but I don't think now is the time for another README overhaul :/

icywang86rui · 2018-11-05T18:28:07Z

tests/unit/test_mxnet.py

+def test_estimator_script_mode_launch_parameter_server(sagemaker_session):
+    mx = MXNet(entry_point=SCRIPT_PATH, role=ROLE, sagemaker_session=sagemaker_session,
+               train_instance_count=INSTANCE_COUNT, train_instance_type=INSTANCE_TYPE,
+               distributions=LAUNCH_PS_DISTRIBUTIONS_DICT, framework_version='1.3.0')


The default framework version is stored in a constant, right? Can we use that here?

the default is being left at 1.2.1 so that this isn't a breaking change. Also the point of these three new unit tests is to be deliberate about the framework version

I guess what I was thinking was: do we need a latest_version constant? Are we going to change this framework version number here?

this test just needs a version >= 1.3.0; no need to change it with later versions
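As a sketch of what the test above pins down, the estimator arguments could be collected like this. The keyword names mirror the quoted test; the value of `LAUNCH_PS_DISTRIBUTIONS_DICT` is not shown in this thread, so the `distributions` shape below is an assumption, and `script_mode_estimator_kwargs` plus the sample values are hypothetical.

```python
def script_mode_estimator_kwargs(entry_point, role, instance_count, instance_type):
    """Collect keyword arguments for an MXNet estimator pinned to a 1.3 version."""
    return {
        'entry_point': entry_point,
        'role': role,
        'train_instance_count': instance_count,
        'train_instance_type': instance_type,
        # Pinned deliberately: the test only needs some version >= 1.3.0.
        'framework_version': '1.3.0',
        # Assumed shape of the dict that requests a parameter server.
        'distributions': {'parameter_server': {'enabled': True}},
    }


kwargs = script_mode_estimator_kwargs('train.py', 'SageMakerRole', 2, 'ml.c4.xlarge')
print(kwargs['framework_version'], kwargs['distributions'])
```

With the SDK installed, these would be passed along the lines of `MXNet(**kwargs)`, matching the call in the quoted test. Pinning the version in the arguments keeps the test independent of whatever default the SDK ships with.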

laurenyu added 2 commits October 25, 2018 16:29

Document script mode format

b7db5a0

Update MXNet estimator for changes with 1.3

fe2e9bf

laurenyu requested review from eslesar-aws and icywang86rui October 27, 2018 00:34

laurenyu added 2 commits October 26, 2018 17:44

Fix unit tests

2008468

update integ tests

f34b128

laurenyu changed the title ~~[DO NOT MERGE] Support MXNet 1.3 with its training script format changes~~ Support MXNet 1.3 with its training script format changes Oct 29, 2018

laurenyu added 5 commits October 29, 2018 09:30

update changelog

ca21338

bump SDK version

ffd20f6

update integ tests

51fe929

Merge branch 'master' into mx-13

9e9a3be

undo unnecessary docstring change

fe7bab3

laurenyu force-pushed the mx-13 branch from e5fbe2a to fe7bab3 Compare October 29, 2018 17:03

Fix unit tests

1fddc2e

eslesar-aws reviewed Oct 29, 2018

laurenyu added 3 commits October 29, 2018 11:16

Address PR feedback

072f19f

update info about dependencies

7175165

add documentation about parameter server

623c722

icywang86rui suggested changes Oct 30, 2018

fix lowest script mode version

3e2ab64

icywang86rui previously approved these changes Oct 30, 2018

fix integ tests

82257f2

laurenyu dismissed icywang86rui’s stale review via 82257f2 November 2, 2018 05:51

laurenyu added 6 commits November 2, 2018 11:20

update integ tests

eaac852

update integ tests

faec731

change launch_parameter_server to distributions

9d69562

change launch_parameter_server to distributions

e018f42

Merge branch 'master' into mx-13

01dcb9d

update changelog

1ff9836

icywang86rui reviewed Nov 5, 2018

laurenyu added 2 commits November 5, 2018 10:36

address PR comment

6a44d88

Merge branch 'master' into mx-13

c9ef5dc

icywang86rui approved these changes Nov 5, 2018

laurenyu merged commit 868f81b into aws:master Nov 5, 2018

laurenyu deleted the mx-13 branch November 5, 2018 21:46

laurenyu mentioned this pull request Nov 21, 2018

Add support for specifying env/requirements for sagemaker.mxnet.MXNet #284

Closed

