[DO NOT MERGE] Enable distributed training with Horovod for TensorFlow Script Mode #529
Conversation
Codecov Report
@@ Coverage Diff @@
## master #529 +/- ##
==========================================
+ Coverage 92.79% 92.81% +0.01%
==========================================
Files 71 71
Lines 5373 5386 +13
==========================================
+ Hits 4986 4999 +13
Misses 387 387
Continue to review full report at Codecov.
please make the PR title an imperative statement
tests/integ/test_tf_script_mode.py
Outdated
@@ -74,6 +75,28 @@ def test_mnist_distributed(sagemaker_session, instance_type):
['graph.pbtxt', 'model.ckpt-0.index', 'model.ckpt-0.meta', 'saved_model.pb'])


@pytest.mark.skip(reason='The containers have not been updated in Prod yet.')
I assume we're not going to merge the PR until the containers are released? let's remove this skip
good point. I will remove it.
src/sagemaker/estimator.py
Outdated
@@ -734,6 +734,7 @@ class Framework(EstimatorBase):

    __framework_name__ = None
    LAUNCH_PS_ENV_NAME = 'sagemaker_parameter_server_enabled'
    USE_MPI_ENV_NAME = 'sagemaker_mpi_enabled'
nit: what about LAUNCH_MPI_ENV_NAME to be consistent with the other name?
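For reference, a minimal sketch of the naming this nit asks about; the LAUNCH_MPI_ENV_NAME spelling is the reviewer's suggestion, not necessarily what the PR ended up using:

    # Class-level constants on Framework; the second name is the hypothetical
    # rename suggested above, for consistency with LAUNCH_PS_ENV_NAME.
    LAUNCH_PS_ENV_NAME = 'sagemaker_parameter_server_enabled'
    LAUNCH_MPI_ENV_NAME = 'sagemaker_mpi_enabled'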
src/sagemaker/tensorflow/README.rst
Outdated
''''''''''''''''''''

To run your training job in a distributed fashion you need to set ``train_instance_count`` to a number larger than 1.
We support two different types of distributed training, parameter server and MPI. The ``distributions`` parameter is
We support more than these two distributed training types. I guess the difference is that these two types require additional setup.
I think it's possible for a user to run other types of distributed training, but I wouldn't say those are supported. These two types are set up by our code, and we are going to support and maintain that code.
I think we need to provide details about the 2 configuration options, custom_mpi_options and processes_per_host; here is the draft that I wrote:
Please see the distribution and Training with Horovod sections of https://github.com/uditbhatia/sagemaker-python-sdk/blob/horovod-documentation/src/sagemaker/tensorflow/README.rst
Please note a couple of links are broken as it is still a draft, but I hope this helps you.
re: Marcio's initial concern - I think this would read better if the "supporting two types of distributed training" were attributed to distributions rather than "us" (aka SageMaker). So maybe change this to:

To run your training job in a distributed fashion you need to set train_instance_count to a number larger than 1. In addition, you will need to ensure that the correct processes are started during training. You can either do this yourself or use the distributions parameter. The distributions parameter can be used for:

- launching parameter server: blah blah blah explanation
- using MPI: other explanation blah blah blah
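As a rough illustration of the two bullets above, a sketch of how distributions would be passed for each case; the entry point name, instance count, and instance type are placeholders rather than values taken from this PR:

    from sagemaker.tensorflow import TensorFlow

    # Launch a parameter server thread on each instance.
    ps_estimator = TensorFlow(entry_point='train.py', role='SageMakerRole',
                              train_instance_count=2, train_instance_type='ml.p2.xlarge',
                              framework_version='1.11', py_version='py3',
                              distributions={'parameter_server': {'enabled': True}})

    # Run the training script under MPI (for example, with Horovod).
    mpi_estimator = TensorFlow(entry_point='train.py', role='SageMakerRole',
                               train_instance_count=2, train_instance_type='ml.p2.xlarge',
                               framework_version='1.11', py_version='py3',
                               distributions={'mpi': {'enabled': True}})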
src/sagemaker/tensorflow/README.rst
Outdated
distributions={'mpi': {'enabled': True}})
tf_estimator.fit('s3://bucket/path/to/training/data')

If MPI is enabled the container will construct and run MPI commands which executes your training script. You can find
Suggested change:
- If MPI is enabled the container will construct and run MPI commands which executes your training script. You can find
+ If MPI is enabled the container will configure and execute `mpirun` with your training script. You can find
k
parser = argparse.ArgumentParser()
# Data, model, and output directories
parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
parser.add_argument('--model_dir', type=str, default=os.environ['SM_MODEL_DIR'])
Let's remove the default here given that it is passed through the hyperparameters.
Use os.environ.get instead to avoid errors running the script outside SageMaker.
k
hvd.init()

# Download and load MNIST dataset.
mnist = learn.datasets.mnist.read_data_sets('MNIST-data-%d' % hvd.rank())
What is the size of the dataset?
The training data is 164M. With the eval data and the labels it's about 200M.
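For readers skimming the thread, a short sketch of the Horovod pattern being discussed, assuming the TensorFlow 1.x tensorflow.contrib.learn API used in the example script:

    import horovod.tensorflow as hvd
    from tensorflow.contrib import learn

    # Initialize Horovod in every MPI process started by mpirun.
    hvd.init()

    # Download and load the MNIST dataset into a per-rank directory so that
    # processes sharing a host do not overwrite each other's files.
    mnist = learn.datasets.mnist.read_data_sets('MNIST-data-%d' % hvd.rank())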
''''''''''''''''''''

To run your training job with multiple instances in a distributed fashion you need to set ``train_instance_count``
to a number larger than 1. We support two different types of distributed training, parameter server and Horovod.
is running a script that uses MPI but not Horovod a use case?
I think it is, but if you use the TensorFlow container it uses Horovod. I could be wrong.
Yes, MPI without Horovod is a valid use case.
if that's the case, then I think this should be changed to something like "We support two different ways of handling distributed training: parameter servers and MPI. The use of MPI can be with or without Horovod." maybe include a link to Horovod documentation as well.
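To make the "MPI without Horovod" case concrete, a toy sketch of a script that only talks to the MPI communicator; mpi4py is used here purely as an assumed example library, not something this PR installs or requires:

    from mpi4py import MPI

    # Each process launched by mpirun reports its rank and the world size.
    comm = MPI.COMM_WORLD
    print('hello from rank %d of %d' % (comm.Get_rank(), comm.Get_size()))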
src/sagemaker/tensorflow/README.rst
Outdated
Training with ``MPI`` is configured by specifying following fields in ``distributions``:

- ``enabled (bool)``: If set to `True`, the MPI setup is performed and ``mpirun`` command is executed.
double backticks for True
src/sagemaker/tensorflow/README.rst
Outdated
tf_estimator = TensorFlow(entry_point='tf-train.py', role='SageMakerRole',
train_instance_count=1, train_instance_type='ml.p2.xlarge',
framework_version='1.11', py_version='py3',
distributions: {
line up the arguments, and also s/: /=
src/sagemaker/tensorflow/README.rst
Outdated
"mpi":{ | ||
"enabled":True, | ||
"processes_per_host":2, | ||
"custom_mpi_options": "--NCCL_DEBUG INFO" |
single quotes for strings
Updated.
src/sagemaker/tensorflow/README.rst
Outdated
distributions={
"mpi":{
"enabled":True,
"processes_per_host":2,
spaces after the colons
Updated.
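Taken together, the feedback above (distributions= rather than distributions:, aligned arguments, single-quoted strings, spaces after the colons) would make the example look roughly like this; the entry point, instance settings, and S3 path are the placeholders from the snippet being reviewed:

    from sagemaker.tensorflow import TensorFlow

    tf_estimator = TensorFlow(entry_point='tf-train.py', role='SageMakerRole',
                              train_instance_count=1, train_instance_type='ml.p2.xlarge',
                              framework_version='1.11', py_version='py3',
                              distributions={
                                  'mpi': {
                                      'enabled': True,
                                      'processes_per_host': 2,
                                      'custom_mpi_options': '--NCCL_DEBUG INFO'
                                  }
                              })
    tf_estimator.fit('s3://bucket/path/to/training/data')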
use the following setup:
distributions (dict): A dictionary with information on how to run distributed training
(default: None). Currently we support distributed training with parameter servers and MPI. To enable
parameter server use the following setup:
s/server/servers
for "To enable parameter server" - s/server/servers
local_code = get_config_value('local.local_code', self.sagemaker_session.config)
if self.sagemaker_session.local_mode and local_code:
    return '/opt/ml/shared/{}'.format(directory)
elif mpi:
    return '/opt/ml/model'
should we make this a constant?
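One possible answer to that question, sketched as a module-level constant with a free function; the constant name, function signature, and fallback are hypothetical, since in the PR this logic lives in a method on the estimator class:

    _MPI_MODEL_DIR = '/opt/ml/model'


    def model_dir_for_training(directory, local_mode, local_code, mpi):
        # Local mode with local code shares a directory on the host machine.
        if local_mode and local_code:
            return '/opt/ml/shared/{}'.format(directory)
        # MPI training writes the model to a fixed path inside the container.
        elif mpi:
            return _MPI_MODEL_DIR
        # Remaining cases are handled by the surrounding code in the real method.
        return directory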
tests/integ/test_tf_script_mode.py
Outdated
@@ -74,6 +75,28 @@ def test_mnist_distributed(sagemaker_session, instance_type):
['graph.pbtxt', 'model.ckpt-0.index', 'model.ckpt-0.meta', 'saved_model.pb'])


@pytest.mark.skipif(integ.PYTHON_VERSION != 'py3', reason="Script Mode tests are only configured to run with Python 3")
single quotes for strings
single quotes for the reason string
src/sagemaker/tensorflow/README.rst
Outdated
Distributed Training
''''''''''''''''''''

To run your training job with multiple instances in a distributed fashion you need to set ``train_instance_count``
"...in a distributed fashion, set...
Updated
src/sagemaker/tensorflow/README.rst
Outdated
Training with parameter servers
"""""""""""""""""""""""""""""""

If parameter server is enabled, the container will launch a parameter server thread in each instance first then execute
"If you specify parameter_server
as the value of the distributions
parameter, the container launches a parameter server thread on each instance in the training cluster, and then executes your training code. You can..."
Updated
src/sagemaker/tensorflow/README.rst
Outdated
Training with Horovod
"""""""""""""""""""""

Horovod is a distributed training framework based on MPI. You can find more details in `Horovod README <https://github.com/uber/horovod>`__.
"...more details at..."
Updated
all small comments. otherwise lgtm.
Training with parameter servers
"""""""""""""""""""""""""""""""

If you specify parameter_server as the value of the distributions parameter, the container launches a parameter server
backticks around parameter_server
parser = argparse.ArgumentParser()
# Data, model, and output directories
parser.add_argument('--output-data-dir', type=str, default=os.environ.get('SM_OUTPUT_DATA_DIR'))
parser.add_argument('--model_dir', type=str)
nit: it's strange to me that we would mix underscores and hyphens in our examples like this
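One hedged way to resolve the underscore/hyphen mix, assuming the script keeps --model_dir spelled with an underscore (that is how the SDK passes it as a hyperparameter) and spells its own flag the same way; whether the example was actually changed like this is not shown here:

    import argparse
    import os

    parser = argparse.ArgumentParser()
    # Data, model, and output directories
    parser.add_argument('--output_data_dir', type=str, default=os.environ.get('SM_OUTPUT_DATA_DIR'))
    parser.add_argument('--model_dir', type=str)
    # parse_known_args tolerates extra arguments passed by the training toolkit.
    args, _ = parser.parse_known_args()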
I previously approved this PR by mistake
Issue #, if available:
Description of changes:
Merge Checklist
Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.