-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Support MXNet 1.3 with its training script format changes #446
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #446 +/- ##
==========================================
- Coverage 93.75% 93.63% -0.13%
==========================================
Files 55 55
Lines 4100 4114 +14
==========================================
+ Hits 3844 3852 +8
- Misses 256 262 +6
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finished an edit pass.
src/sagemaker/mxnet/README.rst
Outdated
''''''''''''''''''''''''''' | ||
Your MXNet training script must be a Python 2.7 or 3.5 compatible source file. | ||
|
||
The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, such as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"...environment variables, including the following:"
Question: Is this an exhaustive list of the available environment variables? Does one exist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's an exhaustive list at https://github.com/aws/sagemaker-containers#list-of-provided-environment-variables-by-sagemaker-containers - I'll add that somewhere
src/sagemaker/mxnet/README.rst
Outdated
|
||
The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, such as | ||
|
||
* ``SM_MODEL_DIR``: A string representing the path to the directory to write model artifacts to. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"A string that represents the path where the training job writes the model artifacts to."
Comment--the next sentence makes this unclear. Is SM_MODEL_DIR a directory in the container? And model artifacts are uploaded to S3 from there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, that's correct. I reworded it to make it hopefully clearer - let me know if you think it needs more tweaks
src/sagemaker/mxnet/README.rst
Outdated
* ``SM_OUTPUT_DATA_DIR``: A string representing the filesystem path to write output artifacts to. Outut artifacts may include checkpoints, graphs, and other files to save, not including model artifacts. | ||
These artifacts are compressed and uploaded to S3 to the same S3 prefix as the model artifacts. | ||
|
||
Supposing two input channels, 'train' and 'test', were used in the call to the MXNet estimator's ``fit`` method, the following will be set, following the format "SM_CHANNEL_[channel_name]": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's my attempt at restructuring this:
SM_CHANNEL_XXXX
: A string that represents the path to the directory that contains the input data for the specified channel. For example, if two input channels, named 'train' and 'test', are specified in the call to the MXNet estimator'sfit
method, the environment variablesSM_CHANNEL_TRAIN
andSM_CHANNEL_TEST
are set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that a lot better. I changed the second sentence a little bit but added the extra bullet point and removed the paragraph/separate list
src/sagemaker/__init__.py
Outdated
@@ -35,4 +35,4 @@ | |||
from sagemaker.session import s3_input # noqa: F401 | |||
from sagemaker.session import get_execution_role # noqa: F401 | |||
|
|||
__version__ = '1.12.0' | |||
__version__ = '1.13.0' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we bump it to 2.x since this is a breaking change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, this PR doesn't technically have breaking changes because we're not bumping the default version of MXNet. I was going to wait until the PR that makes framework_version
required.
src/sagemaker/mxnet/estimator.py
Outdated
logger.warning(empty_framework_version_warning(MXNET_VERSION)) | ||
self.framework_version = framework_version or MXNET_VERSION | ||
|
||
if self._script_mode_version(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still launch the parameter server with single host training?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, one can still use the kvstore needed even with only one host
|
||
For versions 1.3 and higher | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
Your MXNet training script must be a Python 2.7 or 3.5 compatible source file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering if there is a good way to factor this part out since TensorFlow script mode will have this exact same document in the readme.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the original splitting of the README deliberately chose to have repeated documentation - I'd love to revisit this idea later, but I don't think now is the time for another README overhaul :/
def test_estimator_script_mode_launch_parameter_server(sagemaker_session): | ||
mx = MXNet(entry_point=SCRIPT_PATH, role=ROLE, sagemaker_session=sagemaker_session, | ||
train_instance_count=INSTANCE_COUNT, train_instance_type=INSTANCE_TYPE, | ||
distributions=LAUNCH_PS_DISTRIBUTIONS_DICT, framework_version='1.3.0') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default framework version is stored in a constant, right? Can we use that here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the default is being left at 1.2.1 so that this isn't a breaking change. Also the point of these three new unit tests is to be deliberate about the framework version
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess what i was thinking was do we need a latest_version constant. Are we going to change this framework version number here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this test just needs a version >= 1.3.0; no need to change it with later versions
Description of changes:
This adds support for MXNet 1.3, which will come with changes in the training script format.
A note on integ tests - because we're leaving the default MXNet version as 1.2.1, I left the tuning integ test using framework mode so there's at least one test (and it is included in the continuous testing) running that.
In other news, the underlying migration with our MXNet container code also means
requirements.txt
will be supported, which addresses #284.Merge Checklist
Put an
x
in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.