
[DO NOT MERGE] Enable distributed training with Horovod for TensorFlow Script Mode #529


Closed
wants to merge 16 commits

Conversation

icywang86rui
Contributor

Issue #, if available:

Description of changes:

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

  • I have read the CONTRIBUTING doc
  • I have added tests that prove my fix is effective or that my feature works (if appropriate)
  • I have updated the changelog with a description of my changes (if appropriate)
  • I have updated any necessary documentation (if appropriate)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@codecov-io

codecov-io commented Dec 6, 2018

Codecov Report

Merging #529 into master will increase coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #529      +/-   ##
==========================================
+ Coverage   92.79%   92.81%   +0.01%     
==========================================
  Files          71       71              
  Lines        5373     5386      +13     
==========================================
+ Hits         4986     4999      +13     
  Misses        387      387
Impacted Files Coverage Δ
src/sagemaker/tensorflow/defaults.py 100% <100%> (ø) ⬆️
src/sagemaker/tensorflow/estimator.py 94.92% <100%> (+0.27%) ⬆️
src/sagemaker/estimator.py 90.35% <100%> (+0.08%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4ffdeda...3cfdc57. Read the comment docs.

@laurenyu
Contributor

laurenyu commented Dec 6, 2018

please make the PR title an imperative statement

@icywang86rui icywang86rui changed the title Horovod Enable distributed training with Horovod for TensorFlow Script Mode Dec 6, 2018
@@ -74,6 +75,28 @@ def test_mnist_distributed(sagemaker_session, instance_type):
['graph.pbtxt', 'model.ckpt-0.index', 'model.ckpt-0.meta', 'saved_model.pb'])


@pytest.mark.skip(reason='The containers have not been updated in Prod yet.')
Contributor

I assume we're not going to merge the PR until the containers are released? Let's remove this skip.

Contributor Author

good point. I will remove it.

@icywang86rui icywang86rui changed the title Enable distributed training with Horovod for TensorFlow Script Mode [DO NOT MERGE] Enable distributed training with Horovod for TensorFlow Script Mode Dec 6, 2018
@@ -734,6 +734,7 @@ class Framework(EstimatorBase):

__framework_name__ = None
LAUNCH_PS_ENV_NAME = 'sagemaker_parameter_server_enabled'
USE_MPI_ENV_NAME = 'sagemaker_mpi_enabled'
Contributor

nit: what about LAUNCH_MPI_ENV_NAME to be consistent with the other name?

''''''''''''''''''''

To run your training job in a distributed fashion you need to set ``train_instance_count`` to a number larger than 1.
We support two different types of distributed training, parameter server and MPI. The ``distributions`` parameter is
Contributor

We support more than these two distributed training types. I guess the difference is that these two types require additional setup.

Contributor Author

I think it's possible for users to run other types of distributed training, but I wouldn't say those are supported. These two types are set up by our code, and we are going to support and maintain that code.

Contributor

I think we need to provide details about the 2 configuration options, custom_mpi_options and processes_per_host; here is the draft that I wrote:

Please see distribution and Training with Horovod sections of https://github.com/uditbhatia/sagemaker-python-sdk/blob/horovod-documentation/src/sagemaker/tensorflow/README.rst

Please note that a couple of links are broken as it is still a draft. But I hope this helps you.

Contributor

re: Marcio's initial concern - I think this would read better if the "supporting two types of distributed training" were attributed to distributions rather than "us" (aka SageMaker). So maybe change this to:

To run your training job in a distributed fashion you need to set train_instance_count to a number larger than 1. In addition, you will need to ensure that the correct processes are started during training. You can either do this yourself or use the distributions parameter.

The distributions parameter can be used for:

  • launching parameter server: blah blah blah explanation
  • using MPI: other explanation blah blah blah
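
For illustration, a hedged sketch of what those two distributions configurations might look like with this SDK (the entry point, role, and instance settings are placeholders, not part of this PR):

from sagemaker.tensorflow import TensorFlow

# Launching a parameter server on each instance:
ps_estimator = TensorFlow(entry_point='train.py', role='SageMakerRole',
                          train_instance_count=2, train_instance_type='ml.p2.xlarge',
                          framework_version='1.11', py_version='py3',
                          distributions={'parameter_server': {'enabled': True}})

# Using MPI (for example, to run a Horovod script):
mpi_estimator = TensorFlow(entry_point='train.py', role='SageMakerRole',
                           train_instance_count=2, train_instance_type='ml.p2.xlarge',
                           framework_version='1.11', py_version='py3',
                           distributions={'mpi': {'enabled': True}})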

distributions={'mpi': {'enabled': True}})
tf_estimator.fit('s3://bucket/path/to/training/data')

If MPI is enabled the container will construct and run MPI commands which executes your training script. You can find
Contributor

Suggested change
If MPI is enabled the container will construct and run MPI commands which executes your training script. You can find
If MPI is enabled the container will configure and execute `mpirun` with your training script. You can find

Contributor Author

k

parser = argparse.ArgumentParser()
# Data, model, and output directories
parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
parser.add_argument('--model_dir', type=str, default=os.environ['SM_MODEL_DIR'])
Contributor

Let's remove the default here given that it is passed through the hyperparameters.
Use os.environ.get instead to avoid errors running the script outside SageMaker.
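
A minimal sketch of the suggested change (argument names taken from the snippet above):

import argparse
import os

parser = argparse.ArgumentParser()
# os.environ.get returns None instead of raising KeyError when the SageMaker
# environment variables are not set, e.g. when running the script locally.
parser.add_argument('--output-data-dir', type=str, default=os.environ.get('SM_OUTPUT_DATA_DIR'))
# model_dir is passed as a hyperparameter by the SDK, so no environment default is needed.
parser.add_argument('--model_dir', type=str)
args, _ = parser.parse_known_args()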

Contributor Author

k

hvd.init()

# Download and load MNIST dataset.
mnist = learn.datasets.mnist.read_data_sets('MNIST-data-%d' % hvd.rank())
Contributor

What is the size of the dataset?

Contributor Author

the training data is 164M. With the eval data and the label it's about 200M.

''''''''''''''''''''

To run your training job with multiple instances in a distributed fashion you need to set ``train_instance_count``
to a number larger than 1. We support two different types of distributed training, parameter server and Horovod.
Contributor

is running a script that uses MPI but not Horovod a use case?

Contributor Author

I think it is, but if you use the TensorFlow container it uses Horovod. I could be wrong.

Contributor

Yes, MPI without Horovod is a valid use case.

Contributor

if that's the case, then I think this should be changed to something like "We support two different ways of handling distributed training: parameter servers and MPI. The use of MPI can be with or without Horovod." maybe include a link to Horovod documentation as well.


Training with ``MPI`` is configured by specifying following fields in ``distributions``:

- ``enabled (bool)``: If set to `True`, the MPI setup is performed and ``mpirun`` command is executed.
Contributor

double backticks for True

tf_estimator = TensorFlow(entry_point='tf-train.py', role='SageMakerRole',
train_instance_count=1, train_instance_type='ml.p2.xlarge',
framework_version='1.11', py_version='py3',
distributions: {
Contributor

line up the arguments, and also s/: /=

"mpi":{
"enabled":True,
"processes_per_host":2,
"custom_mpi_options": "--NCCL_DEBUG INFO"
Contributor

single quotes for strings

Contributor

Updated.

distributions={
"mpi":{
"enabled":True,
"processes_per_host":2,
Contributor

spaces after the colons

Contributor

Updated.
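
Putting the formatting feedback together, a hedged version of the example from these hunks (aligned keyword arguments, single-quoted strings, spaces after colons) might read:

tf_estimator = TensorFlow(entry_point='tf-train.py', role='SageMakerRole',
                          train_instance_count=1, train_instance_type='ml.p2.xlarge',
                          framework_version='1.11', py_version='py3',
                          distributions={
                              'mpi': {
                                  'enabled': True,
                                  'processes_per_host': 2,
                                  'custom_mpi_options': '--NCCL_DEBUG INFO'
                              }
                          })
tf_estimator.fit('s3://bucket/path/to/training/data')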

use the following setup:
distributions (dict): A dictionary with information on how to run distributed training
(default: None). Currently we support distributed training with parameter servers and MPI. To enable
parameter server use the following setup:
Contributor

s/server/servers

Contributor

for "To enable parameter server" - s/server/servers

local_code = get_config_value('local.local_code', self.sagemaker_session.config)
if self.sagemaker_session.local_mode and local_code:
return '/opt/ml/shared/{}'.format(directory)
elif mpi:
return '/opt/ml/model'
Contributor

should we make this a constant?
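
A minimal sketch of the constant approach (the constant name here is hypothetical, not taken from this PR):

# Hypothetical name -- whatever the class ends up calling it, defined at module or class level.
_MPI_MODEL_DIR = '/opt/ml/model'

local_code = get_config_value('local.local_code', self.sagemaker_session.config)
if self.sagemaker_session.local_mode and local_code:
    return '/opt/ml/shared/{}'.format(directory)
elif mpi:
    return _MPI_MODEL_DIR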

@@ -74,6 +75,28 @@ def test_mnist_distributed(sagemaker_session, instance_type):
['graph.pbtxt', 'model.ckpt-0.index', 'model.ckpt-0.meta', 'saved_model.pb'])


@pytest.mark.skipif(integ.PYTHON_VERSION != 'py3', reason="Script Mode tests are only configured to run with Python 3")
Contributor

single quotes for strings

Contributor

single quotes for the reason string

@yangaws yangaws requested a review from eslesar-aws December 18, 2018 00:00
Distributed Training
''''''''''''''''''''

To run your training job with multiple instances in a distributed fashion you need to set ``train_instance_count``
Contributor

"...in a distributed fashion, set...

Contributor

Updated

Training with parameter servers
"""""""""""""""""""""""""""""""

If parameter server is enabled, the container will launch a parameter server thread in each instance first then execute
Contributor

"If you specify parameter_server as the value of the distributions parameter, the container launches a parameter server thread on each instance in the training cluster, and then executes your training code. You can..."

Contributor

Updated

Training with Horovod
"""""""""""""""""""""

Horovod is a distributed training framework based on MPI. You can find more details in `Horovod README <https://github.com/uber/horovod>`__.
Contributor

"...more details at..."

Contributor

Updated
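
For context, a minimal sketch of a Horovod training script that such a container could launch via mpirun (this follows the public horovod.tensorflow API for TF 1.x; it is illustrative, not the example shipped in this PR):

import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()  # discover the ranks set up by mpirun

# Pin each process to a single GPU (one MPI process per GPU).
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged across workers with allreduce.
opt = hvd.DistributedOptimizer(tf.train.AdamOptimizer(0.001 * hvd.size()))

# Broadcast initial variables from rank 0 so all workers start in sync;
# a MonitoredTrainingSession would then run the training loop with this hook.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]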

Contributor

@laurenyu laurenyu left a comment

all small comments. otherwise lgtm.

Training with parameter servers
"""""""""""""""""""""""""""""""

If you specify parameter_server as the value of the distributions parameter, the container launches a parameter server
Contributor

backticks around parameter_server

use the following setup:
distributions (dict): A dictionary with information on how to run distributed training
(default: None). Currently we support distributed training with parameter servers and MPI. To enable
parameter server use the following setup:
Contributor

for "To enable parameter server" - s/server/servers

@@ -74,6 +75,28 @@ def test_mnist_distributed(sagemaker_session, instance_type):
['graph.pbtxt', 'model.ckpt-0.index', 'model.ckpt-0.meta', 'saved_model.pb'])


@pytest.mark.skipif(integ.PYTHON_VERSION != 'py3', reason="Script Mode tests are only configured to run with Python 3")
Contributor

single quotes for the reason string

parser = argparse.ArgumentParser()
# Data, model, and output directories
parser.add_argument('--output-data-dir', type=str, default=os.environ.get('SM_OUTPUT_DATA_DIR'))
parser.add_argument('--model_dir', type=str)
Contributor

nit: it's strange to me that we would mix underscores and hyphens in our examples like this

''''''''''''''''''''

To run your training job with multiple instances in a distributed fashion you need to set ``train_instance_count``
to a number larger than 1. We support two different types of distributed training, parameter server and Horovod.
Contributor

Yes, MPI without Horovod is a valid use case.

Contributor

@mvsusp mvsusp left a comment

I previously approved this PR by mistake

@mvsusp mvsusp closed this Dec 20, 2018
metrizable pushed a commit to metrizable/sagemaker-python-sdk that referenced this pull request Dec 1, 2020