Skip to content

Support MXNet 1.3 with its training script format changes #446

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 23 commits into from
Nov 5, 2018

Conversation

laurenyu
Copy link
Contributor

@laurenyu laurenyu commented Oct 27, 2018

Description of changes:
This adds support for MXNet 1.3, which will come with changes in the training script format.

A note on integ tests - because we're leaving the default MXNet version as 1.2.1, I left the tuning integ test using framework mode so there's at least one test (and it is included in the continuous testing) running that.

In other news, the underlying migration with our MXNet container code also means requirements.txt will be supported, which addresses #284.

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

  • I have read the CONTRIBUTING doc
  • I have added tests that prove my fix is effective or that my feature works (if appropriate)
  • I have updated the changelog with a description of my changes (if appropriate)
  • I have updated any necessary documentation (if appropriate)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@laurenyu laurenyu changed the title [DO NOT MERGE] Support MXNet 1.3 with its training script format changes Support MXNet 1.3 with its training script format changes Oct 29, 2018
@codecov-io
Copy link

codecov-io commented Oct 29, 2018

Codecov Report

Merging #446 into master will decrease coverage by 0.12%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #446      +/-   ##
==========================================
- Coverage   93.75%   93.63%   -0.13%     
==========================================
  Files          55       55              
  Lines        4100     4114      +14     
==========================================
+ Hits         3844     3852       +8     
- Misses        256      262       +6
Impacted Files Coverage Δ
src/sagemaker/estimator.py 90.57% <100%> (+0.43%) ⬆️
src/sagemaker/mxnet/estimator.py 100% <100%> (ø) ⬆️
src/sagemaker/tensorflow/estimator.py 93.57% <100%> (ø) ⬆️
src/sagemaker/chainer/estimator.py 100% <100%> (ø) ⬆️
src/sagemaker/pytorch/estimator.py 100% <100%> (ø) ⬆️
src/sagemaker/local/image.py 88.12% <0%> (-1.88%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3d091b4...6a44d88. Read the comment docs.

Copy link
Contributor

@eslesar-aws eslesar-aws left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finished an edit pass.

'''''''''''''''''''''''''''
Your MXNet training script must be a Python 2.7 or 3.5 compatible source file.

The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, such as
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"...environment variables, including the following:"

Question: Is this an exhaustive list of the available environment variables? Does one exist?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, such as

* ``SM_MODEL_DIR``: A string representing the path to the directory to write model artifacts to.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"A string that represents the path where the training job writes the model artifacts to."

Comment--the next sentence makes this unclear. Is SM_MODEL_DIR a directory in the container? And model artifacts are uploaded to S3 from there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that's correct. I reworded it to make it hopefully clearer - let me know if you think it needs more tweaks

* ``SM_OUTPUT_DATA_DIR``: A string representing the filesystem path to write output artifacts to. Outut artifacts may include checkpoints, graphs, and other files to save, not including model artifacts.
These artifacts are compressed and uploaded to S3 to the same S3 prefix as the model artifacts.

Supposing two input channels, 'train' and 'test', were used in the call to the MXNet estimator's ``fit`` method, the following will be set, following the format "SM_CHANNEL_[channel_name]":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's my attempt at restructuring this:

  • SM_CHANNEL_XXXX: A string that represents the path to the directory that contains the input data for the specified channel. For example, if two input channels, named 'train' and 'test', are specified in the call to the MXNet estimator's fit method, the environment variables SM_CHANNEL_TRAIN and SM_CHANNEL_TEST are set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that a lot better. I changed the second sentence a little bit but added the extra bullet point and removed the paragraph/separate list

@@ -35,4 +35,4 @@
from sagemaker.session import s3_input # noqa: F401
from sagemaker.session import get_execution_role # noqa: F401

__version__ = '1.12.0'
__version__ = '1.13.0'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we bump it to 2.x since this is a breaking change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, this PR doesn't technically have breaking changes because we're not bumping the default version of MXNet. I was going to wait until the PR that makes framework_version required.

logger.warning(empty_framework_version_warning(MXNET_VERSION))
self.framework_version = framework_version or MXNET_VERSION

if self._script_mode_version():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still launch the parameter server with single host training?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, one can still use the kvstore needed even with only one host

icywang86rui
icywang86rui previously approved these changes Oct 30, 2018

For versions 1.3 and higher
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Your MXNet training script must be a Python 2.7 or 3.5 compatible source file.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if there is a good way to factor this part out since TensorFlow script mode will have this exact same document in the readme.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the original splitting of the README deliberately chose to have repeated documentation - I'd love to revisit this idea later, but I don't think now is the time for another README overhaul :/

def test_estimator_script_mode_launch_parameter_server(sagemaker_session):
mx = MXNet(entry_point=SCRIPT_PATH, role=ROLE, sagemaker_session=sagemaker_session,
train_instance_count=INSTANCE_COUNT, train_instance_type=INSTANCE_TYPE,
distributions=LAUNCH_PS_DISTRIBUTIONS_DICT, framework_version='1.3.0')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default framework version is stored in a constant, right? Can we use that here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the default is being left at 1.2.1 so that this isn't a breaking change. Also the point of these three new unit tests is to be deliberate about the framework version

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess what i was thinking was do we need a latest_version constant. Are we going to change this framework version number here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test just needs a version >= 1.3.0; no need to change it with later versions

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants