[DO NOT MERGE] Enable distributed training with Horovod for TensorFlow Script Mode #529
Conversation
Codecov Report
@@ Coverage Diff @@
## master #529 +/- ##
==========================================
+ Coverage 92.79% 92.81% +0.01%
==========================================
Files 71 71
Lines 5373 5386 +13
==========================================
+ Hits 4986 4999 +13
Misses 387 387
Continue to review full report at Codecov.
please make the PR title an imperative statement
tests/integ/test_tf_script_mode.py
Outdated
@@ -74,6 +75,28 @@ def test_mnist_distributed(sagemaker_session, instance_type):
['graph.pbtxt', 'model.ckpt-0.index', 'model.ckpt-0.meta', 'saved_model.pb'])


@pytest.mark.skip(reason='The containers have not been updated in Prod yet.')
I assume we're not going to merge the PR until the containers are released? let's remove this skip
good point. I will remove it.
src/sagemaker/estimator.py
Outdated
@@ -734,6 +734,7 @@ class Framework(EstimatorBase):

    __framework_name__ = None
    LAUNCH_PS_ENV_NAME = 'sagemaker_parameter_server_enabled'
    USE_MPI_ENV_NAME = 'sagemaker_mpi_enabled'
nit: what about LAUNCH_MPI_ENV_NAME to be consistent with the other name?
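For reference, a minimal sketch of the naming this nit asks about; the LAUNCH_MPI_ENV_NAME spelling is the reviewer's suggestion, not necessarily what the PR ended up using:

    # Class-level constants on Framework; the second name is the hypothetical
    # rename suggested above, for consistency with LAUNCH_PS_ENV_NAME.
    LAUNCH_PS_ENV_NAME = 'sagemaker_parameter_server_enabled'
    LAUNCH_MPI_ENV_NAME = 'sagemaker_mpi_enabled'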
src/sagemaker/tensorflow/README.rst
Outdated
''''''''''''''''''''

To run your training job in a distributed fashion you need to set ``train_instance_count`` to a number larger than 1.
We support two different types of distributed training, parameter server and MPI. The ``distributions`` parameter is
We support more than these two distributed training types. I guess the difference is that these two types require additional setup.
I think it's possible for a user to run other types of distributed training, but I wouldn't say those are supported. These two types are set up by our code, and we are going to support and maintain that code.
I think we need to provide details about the 2 configuration options, custom_mpi_options and processes_per_host; here is the draft that I wrote:
Please see the distribution and Training with Horovod sections of https://github.com/uditbhatia/sagemaker-python-sdk/blob/horovod-documentation/src/sagemaker/tensorflow/README.rst
Please note a couple of links are broken as it is still a draft, but I hope this helps you.
re: Marcio's initial concern - I think this would read better if the "supporting two types of distributed training" were attributed to distributions rather than "us" (aka SageMaker). So maybe change this to:

To run your training job in a distributed fashion you need to set train_instance_count to a number larger than 1. In addition, you will need to ensure that the correct processes are started during training. You can either do this yourself or use the distributions parameter. The distributions parameter can be used for:

- launching parameter server: blah blah blah explanation
- using MPI: other explanation blah blah blah
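As a rough illustration of the two bullets above, a sketch of how distributions would be passed for each case; the entry point name, instance count, and instance type are placeholders rather than values taken from this PR:

    from sagemaker.tensorflow import TensorFlow

    # Launch a parameter server thread on each instance.
    ps_estimator = TensorFlow(entry_point='train.py', role='SageMakerRole',
                              train_instance_count=2, train_instance_type='ml.p2.xlarge',
                              framework_version='1.11', py_version='py3',
                              distributions={'parameter_server': {'enabled': True}})

    # Run the training script under MPI (for example, with Horovod).
    mpi_estimator = TensorFlow(entry_point='train.py', role='SageMakerRole',
                               train_instance_count=2, train_instance_type='ml.p2.xlarge',
                               framework_version='1.11', py_version='py3',
                               distributions={'mpi': {'enabled': True}})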
src/sagemaker/tensorflow/README.rst
Outdated
distributions={'mpi': {'enabled': True}})
tf_estimator.fit('s3://bucket/path/to/training/data')

If MPI is enabled the container will construct and run MPI commands which executes your training script. You can find
Suggested change:
- If MPI is enabled the container will construct and run MPI commands which executes your training script. You can find
+ If MPI is enabled the container will configure and execute `mpirun` with your training script. You can find
k
parser = argparse.ArgumentParser()
# Data, model, and output directories
parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
parser.add_argument('--model_dir', type=str, default=os.environ['SM_MODEL_DIR'])
Let's remove the default here given that it is passed through the hyperparameters.
Use os.environ.get instead to avoid errors running the script outside SageMaker.
k
hvd.init()

# Download and load MNIST dataset.
mnist = learn.datasets.mnist.read_data_sets('MNIST-data-%d' % hvd.rank())
What is the size of the dataset?
The training data is 164M. With the eval data and the labels it's about 200M.
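For readers skimming the thread, a short sketch of the Horovod pattern being discussed, assuming the TensorFlow 1.x tensorflow.contrib.learn API used in the example script:

    import horovod.tensorflow as hvd
    from tensorflow.contrib import learn

    # Initialize Horovod in every MPI process started by mpirun.
    hvd.init()

    # Download and load the MNIST dataset into a per-rank directory so that
    # processes sharing a host do not overwrite each other's files.
    mnist = learn.datasets.mnist.read_data_sets('MNIST-data-%d' % hvd.rank())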
''''''''''''''''''''

To run your training job with multiple instances in a distributed fashion you need to set ``train_instance_count``
to a number larger than 1. We support two different types of distributed training, parameter server and Horovod.
is running a script that uses MPI but not Horovod a use case?
I think it is, but if you use the TensorFlow container it uses Horovod. I could be wrong.
Yes, MPI without Horovod is a valid use case.
if that's the case, then I think this should be changed to something like "We support two different ways of handling distributed training: parameter servers and MPI. The use of MPI can be with or without Horovod." maybe include a link to Horovod documentation as well.
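To make the "MPI without Horovod" case concrete, a toy sketch of a script that only talks to the MPI communicator; mpi4py is used here purely as an assumed example library, not something this PR installs or requires:

    from mpi4py import MPI

    # Each process launched by mpirun reports its rank and the world size.
    comm = MPI.COMM_WORLD
    print('hello from rank %d of %d' % (comm.Get_rank(), comm.Get_size()))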
src/sagemaker/tensorflow/README.rst
Outdated
Training with ``MPI`` is configured by specifying following fields in ``distributions``:

- ``enabled (bool)``: If set to `True`, the MPI setup is performed and ``mpirun`` command is executed.
double backticks for True
src/sagemaker/tensorflow/README.rst
Outdated
tf_estimator = TensorFlow(entry_point='tf-train.py', role='SageMakerRole',
train_instance_count=1, train_instance_type='ml.p2.xlarge',
framework_version='1.11', py_version='py3',
distributions: {
line up the arguments, and also s/: /=
src/sagemaker/tensorflow/README.rst
Outdated
"mpi":{ | ||
"enabled":True, | ||
"processes_per_host":2, | ||
"custom_mpi_options": "--NCCL_DEBUG INFO" |
single quotes for strings
Updated.
src/sagemaker/tensorflow/README.rst
Outdated
distributions={
"mpi":{
"enabled":True,
"processes_per_host":2,
spaces after the colons
Updated.
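Taken together, the feedback above (distributions= rather than distributions:, aligned arguments, single-quoted strings, spaces after the colons) would make the example look roughly like this; the entry point, instance settings, and S3 path are the placeholders from the snippet being reviewed:

    from sagemaker.tensorflow import TensorFlow

    tf_estimator = TensorFlow(entry_point='tf-train.py', role='SageMakerRole',
                              train_instance_count=1, train_instance_type='ml.p2.xlarge',
                              framework_version='1.11', py_version='py3',
                              distributions={
                                  'mpi': {
                                      'enabled': True,
                                      'processes_per_host': 2,
                                      'custom_mpi_options': '--NCCL_DEBUG INFO'
                                  }
                              })
    tf_estimator.fit('s3://bucket/path/to/training/data')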
use the following setup:
distributions (dict): A dictionary with information on how to run distributed training
(default: None). Currently we support distributed training with parameter servers and MPI. To enable
parameter server use the following setup:
s/server/servers
for "To enable parameter server" - s/server/servers
local_code = get_config_value('local.local_code', self.sagemaker_session.config)
if self.sagemaker_session.local_mode and local_code:
    return '/opt/ml/shared/{}'.format(directory)
elif mpi:
    return '/opt/ml/model'
should we make this a constant?
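One possible answer to that question, sketched as a module-level constant with a free function; the constant name, function signature, and fallback are hypothetical, since in the PR this logic lives in a method on the estimator class:

    _MPI_MODEL_DIR = '/opt/ml/model'


    def model_dir_for_training(directory, local_mode, local_code, mpi):
        # Local mode with local code shares a directory on the host machine.
        if local_mode and local_code:
            return '/opt/ml/shared/{}'.format(directory)
        # MPI training writes the model to a fixed path inside the container.
        elif mpi:
            return _MPI_MODEL_DIR
        # Remaining cases are handled by the surrounding code in the real method.
        return directory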
tests/integ/test_tf_script_mode.py
Outdated
@@ -74,6 +75,28 @@ def test_mnist_distributed(sagemaker_session, instance_type):
['graph.pbtxt', 'model.ckpt-0.index', 'model.ckpt-0.meta', 'saved_model.pb'])


@pytest.mark.skipif(integ.PYTHON_VERSION != 'py3', reason="Script Mode tests are only configured to run with Python 3")
single quotes for strings
single quotes for the reason string
src/sagemaker/tensorflow/README.rst
Outdated
Distributed Training
''''''''''''''''''''

To run your training job with multiple instances in a distributed fashion you need to set ``train_instance_count``
"...in a distributed fashion, set...
Updated
src/sagemaker/tensorflow/README.rst
Outdated
Training with parameter servers
"""""""""""""""""""""""""""""""

If parameter server is enabled, the container will launch a parameter server thread in each instance first then execute
"If you specify parameter_server
as the value of the distributions
parameter, the container launches a parameter server thread on each instance in the training cluster, and then executes your training code. You can..."
Updated
src/sagemaker/tensorflow/README.rst
Outdated
Training with Horovod
"""""""""""""""""""""

Horovod is a distributed training framework based on MPI. You can find more details in `Horovod README <https://github.com/uber/horovod>`__.
"...more details at..."
Updated
all small comments. otherwise lgtm.
Training with parameter servers
"""""""""""""""""""""""""""""""

If you specify parameter_server as the value of the distributions parameter, the container launches a parameter server
backticks around parameter_server
parser = argparse.ArgumentParser()
# Data, model, and output directories
parser.add_argument('--output-data-dir', type=str, default=os.environ.get('SM_OUTPUT_DATA_DIR'))
parser.add_argument('--model_dir', type=str)
nit: it's strange to me that we would mix underscores and hyphens in our examples like this
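One hedged way to resolve the underscore/hyphen mix, assuming the script keeps --model_dir spelled with an underscore (that is how the SDK passes it as a hyperparameter) and spells its own flag the same way; whether the example was actually changed like this is not shown here:

    import argparse
    import os

    parser = argparse.ArgumentParser()
    # Data, model, and output directories
    parser.add_argument('--output_data_dir', type=str, default=os.environ.get('SM_OUTPUT_DATA_DIR'))
    parser.add_argument('--model_dir', type=str)
    # parse_known_args tolerates extra arguments passed by the training toolkit.
    args, _ = parser.parse_known_args()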
I previously approved this PR by mistake
Issue #, if available:
Description of changes:
Merge Checklist
Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.