Support of Horovod and TF 1.12 for TensorFlow Script Mode. TFS 1.12 support #567
Merged
Commits (69)

1eb85ad Add horovod support (icywang86rui)
c63fe06 Add newline at eof (icywang86rui)
b91208a Do not skip integ test (icywang86rui)
a1b426a Edit README to include distributed training with MPI (icywang86rui)
10fe7bf PR commentsw (icywang86rui)
c5f68b1 Add processes_per_host and custom_mpi_options (icywang86rui)
858079e Add missing period (icywang86rui)
ffc0812 Use distribution in README (icywang86rui)
7587e52 Use distributions in README (icywang86rui)
f1f8583 Fix README (icywang86rui)
64449a5 Imporve documentation (yangaws)
c857afe Address comments from Eric (yangaws)
245b75f Merge remote-tracking branch 'origin/master' into horovod (mvsusp)
e3aeb6e Updated TF version (mvsusp)
2bcd290 Fix empty mpi distribution use case (mvsusp)
3cfdc57 Add check for necessary files in model.tar.gz (yangaws)
561414f Add benchmarks as submodule (mvsusp)
b5d4a1c Add benchmarks as submodule (mvsusp)
7843392 Handle PR comments (mvsusp)
1c4e8c5 Update version (mvsusp)
41175a2 Handle PR comments (mvsusp)
c137be1 Run TF tests against latest container instead of default. (nadiaya)
d44d590 Merge branch 'wru-horovod' of github.com:mvsusp/sagemaker-python-sdk … (nadiaya)
05ee7c1 Merge branch 'master' into wru-horovod (yangaws)
1680073 Fix urllib.parse import errors for python 2. (nadiaya)
0232e4a Merge branch 'wru-horovod' of github.com:mvsusp/sagemaker-python-sdk … (nadiaya)
c19021a Fix horovod integ test tar file extract error (yangaws)
6722c07 Merge branch 'master' into wru-horovod (yangaws)
24a5d61 fix flake8 (yangaws)
160e646 Removed unnecessary tests (mvsusp)
7f93812 Merge branch 'master' into wru-horovod (uditbhatia)
333ebf7 Removing duplicated/unused TF import (uditbhatia)
1f80caf Merge branch 'master' into wru-horovod (uditbhatia)
a1ec1b4 Add horovod support (icywang86rui)
9e8d88a Add newline at eof (icywang86rui)
fafc9bb Do not skip integ test (icywang86rui)
df313d8 Edit README to include distributed training with MPI (icywang86rui)
3fd1bf0 PR commentsw (icywang86rui)
e3051da Add processes_per_host and custom_mpi_options (icywang86rui)
ead6229 Add missing period (icywang86rui)
d41e163 Use distribution in README (icywang86rui)
2aff9fc Use distributions in README (icywang86rui)
3915406 Fix README (icywang86rui)
a07c0d6 Imporve documentation (yangaws)
308a31c Address comments from Eric (yangaws)
56d6d07 Updated TF version (mvsusp)
3145ffd Fix empty mpi distribution use case (mvsusp)
dd838ef Add check for necessary files in model.tar.gz (yangaws)
15bfe00 Add benchmarks as submodule (mvsusp)
8e9734e Add benchmarks as submodule (mvsusp)
b22671d Handle PR comments (mvsusp)
20e906e Update version (mvsusp)
430cd0a Handle PR comments (mvsusp)
bd9c92d Run TF tests against latest container instead of default. (nadiaya)
2fcdaea Fix urllib.parse import errors for python 2. (nadiaya)
3d06e11 Fix horovod integ test tar file extract error (yangaws)
c78eb31 fix flake8 (yangaws)
abce1dd Removed unnecessary tests (mvsusp)
3342e94 Removing duplicated/unused TF import (uditbhatia)
cb7610f Capitalizing the mpi_distribution ps_distribution constant (uditbhatia)
dca2173 resolving conflists (uditbhatia)
f91b29c Merge branch 'master' into wru-horovod (uditbhatia)
30995dd Restoring version default to 1.12 (uditbhatia)
7c93fdc Accomodating the mvs pr comments (uditbhatia)
a8d2cb0 Updating changelog (uditbhatia)
55c1998 chaing the TF_VERSION field to 1.11 from 1.12 in defaults.py (uditbhatia)
c913c40 Merge branch 'master' into wru-horovod (uditbhatia)
23d8074 Fixing flake 8 errors after merge from master and updating changelog (uditbhatia)
177f37f Bumping up the python SDK version to 1.17.3 (as per instructions in M… (uditbhatia)
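The core code change is to the `TensorFlow` estimator class: it adds a `LATEST_VERSION = '1.12'` constant, MPI support in the `distributions` argument, and an MPI-aware default for `model_dir`: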
```diff
@@ -171,6 +171,7 @@ class TensorFlow(Framework):
     """Handle end-to-end training and deployment of user-provided TensorFlow code."""

     __framework_name__ = 'tensorflow'
+    LATEST_VERSION = '1.12'

     def __init__(self, training_steps=None, evaluation_steps=None, checkpoint_path=None, py_version='py2',
                  framework_version=None, model_dir=None, requirements_file='', image_name=None,
@@ -200,14 +201,21 @@ def __init__(self, training_steps=None, evaluation_steps=None, checkpoint_path=N
             script_mode (bool): If set to True will the estimator will use the Script Mode containers (default: False).
                 This will be ignored if py_version is set to 'py3'.
             distributions (dict): A dictionary with information on how to run distributed training
-                (default: None). Currently we only support distributed training with parameter servers. To enable it
-                use the following setup:
+                (default: None). Currently we support distributed training with parameter servers and MPI. To enable
+                parameter servers use the following setup:

                 {
                     'parameter_server':
                     {
                         'enabled': True
                     }
                 }
+                To enable MPI:
+                {
+                    'mpi':
+                    {
+                        'enabled': True
+                    }
+                }
             **kwargs: Additional kwargs passed to the Framework constructor.
         """
         if framework_version is None:
@@ -419,13 +427,24 @@ def hyperparameters(self):
         hyperparameters = super(TensorFlow, self).hyperparameters()

         self.checkpoint_path = self.checkpoint_path or self._default_s3_path('checkpoints')
+        mpi_enabled = False

         if self._script_mode_enabled():
-            self.model_dir = self.model_dir or self._default_s3_path('model')
-            additional_hyperparameters = {'model_dir': self.model_dir}
+            additional_hyperparameters = {}

             if 'parameter_server' in self.distributions:
-                enabled = self.distributions['parameter_server'].get('enabled', False)
-                additional_hyperparameters[self.LAUNCH_PS_ENV_NAME] = enabled
+                ps_enabled = self.distributions['parameter_server'].get('enabled', False)
+                additional_hyperparameters[self.LAUNCH_PS_ENV_NAME] = ps_enabled
+
+            if 'mpi' in self.distributions:
+                mpi_dict = self.distributions['mpi']
+                mpi_enabled = mpi_dict.get('enabled', False)
+                additional_hyperparameters[self.LAUNCH_MPI_ENV_NAME] = mpi_enabled
+                additional_hyperparameters[self.MPI_NUM_PROCESSES_PER_HOST] = mpi_dict.get('processes_per_host', 1)
+                additional_hyperparameters[self.MPI_CUSTOM_MPI_OPTIONS] = mpi_dict.get('custom_mpi_options', '')
+
+            self.model_dir = self.model_dir or self._default_s3_path('model', mpi=mpi_enabled)
+            additional_hyperparameters['model_dir'] = self.model_dir
         else:
             additional_hyperparameters = {'checkpoint_path': self.checkpoint_path,
                                           'training_steps': self.training_steps,
@@ -435,10 +454,12 @@ def hyperparameters(self):
         hyperparameters.update(Framework._json_encode_hyperparameters(additional_hyperparameters))
         return hyperparameters

-    def _default_s3_path(self, directory):
+    def _default_s3_path(self, directory, mpi=False):
         local_code = get_config_value('local.local_code', self.sagemaker_session.config)
         if self.sagemaker_session.local_mode and local_code:
             return '/opt/ml/shared/{}'.format(directory)
+        elif mpi:
+            return '/opt/ml/model'
         else:
             return os.path.join(self.output_path, self._current_job_name, directory)
```
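Taken together, these changes let a Script Mode user turn on Horovod through the new `distributions` argument. A minimal sketch of the resulting usage (entry point, role, and bucket below are placeholders, not taken from this PR):

```python
from sagemaker.tensorflow import TensorFlow

# Placeholder script, role, and bucket; any script-mode TF training script works here.
estimator = TensorFlow(entry_point='train_hvd.py',
                       role='SageMakerRole',
                       train_instance_count=2,
                       train_instance_type='ml.p3.2xlarge',
                       framework_version='1.12',
                       py_version='py3',
                       distributions={'mpi': {
                           'enabled': True,
                           'processes_per_host': 1,          # becomes MPI_NUM_PROCESSES_PER_HOST
                           'custom_mpi_options': '-verbose'  # becomes MPI_CUSTOM_MPI_OPTIONS
                       }})
estimator.fit('s3://my-bucket/training-data')
```

With `mpi.enabled` set, `hyperparameters()` above defaults `model_dir` to the local path `/opt/ml/model` instead of an S3 URI, since each MPI process writes into the shared local model directory that SageMaker uploads after training.

The PR also adds test resources. The first is a shell entry point that drives the `tf_cnn_benchmarks` suite (pulled in via the new `benchmarks` submodule) under Horovod: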
```bash
#!/usr/bin/env bash

python benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_batches=500 --model vgg16 --variable_update horovod --horovod_device gpu --use_fp16 --summary_verbosity 1 --save_summaries_steps 10 --train_dir /opt/ml/model --eval_dir /opt/ml/model --batch_size 32
```
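`--variable_update horovod` makes `tf_cnn_benchmarks` aggregate gradients with Horovod's allreduce, and pointing `--train_dir`/`--eval_dir` at `/opt/ml/model` keeps the summaries in the directory SageMaker packages into `model.tar.gz`. A second, minimal integ-test script has each MPI process record its Horovod rank and world size: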
```python
import json
import os
import horovod.tensorflow as hvd

hvd.init()

with open(os.path.join('/opt/ml/model/rank-%s' % hvd.rank()), 'w+') as f:
    basic_info = {'rank': hvd.rank(), 'size': hvd.size()}

    print(basic_info)
    json.dump(basic_info, f)
```
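Since every process writes its own `rank-N` file into `/opt/ml/model`, the resulting `model.tar.gz` reveals how many MPI processes actually ran. A hypothetical check along those lines (illustrative only, not the PR's actual test code):

```python
import json
import tarfile

# Expect one rank-N file per MPI process in the packaged model artifact.
with tarfile.open('model.tar.gz') as tar:
    rank_files = [m for m in tar.getmembers() if 'rank-' in m.name]
    infos = [json.load(tar.extractfile(m)) for m in rank_files]

for info in infos:
    assert 0 <= info['rank'] < info['size']
assert len(infos) == infos[0]['size']  # every launched process reported in
```

Finally, the PR adds a Horovod MNIST training script, adapted from Horovod's standard TensorFlow MNIST example: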
```python
# Copyright 2017-2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"). You
# may not use this file except in compliance with the License. A copy of
# the License is located at
#
#     http://aws.amazon.com/apache2.0/
#
# or in the "license" file accompanying this file. This file is
# distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
# ANY KIND, either express or implied. See the License for the specific
# language governing permissions and limitations under the License.
from __future__ import absolute_import

import argparse
import os

import tensorflow as tf
import horovod.tensorflow as hvd

layers = tf.contrib.layers
learn = tf.contrib.learn

tf.logging.set_verbosity(tf.logging.INFO)


def _parse_args():
    parser = argparse.ArgumentParser()
    # Data, model, and output directories
    parser.add_argument('--output-data-dir', type=str, default=os.environ.get('SM_OUTPUT_DATA_DIR'))
    parser.add_argument('--model_dir', type=str)

    return parser.parse_known_args()


def conv_model(feature, target, mode):
    """2-layer convolution model."""
    # Convert the target to a one-hot tensor of shape (batch_size, 10) and
    # with a on-value of 1 for each one-hot vector of length 10.
    target = tf.one_hot(tf.cast(target, tf.int32), 10, 1, 0)

    # Reshape feature to 4d tensor with 2nd and 3rd dimensions being
    # image width and height final dimension being the number of color channels.
    feature = tf.reshape(feature, [-1, 28, 28, 1])

    # First conv layer will compute 32 features for each 5x5 patch
    with tf.variable_scope('conv_layer1'):
        h_conv1 = layers.conv2d(
            feature, 32, kernel_size=[5, 5], activation_fn=tf.nn.relu)
        h_pool1 = tf.nn.max_pool(
            h_conv1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

    # Second conv layer will compute 64 features for each 5x5 patch.
    with tf.variable_scope('conv_layer2'):
        h_conv2 = layers.conv2d(
            h_pool1, 64, kernel_size=[5, 5], activation_fn=tf.nn.relu)
        h_pool2 = tf.nn.max_pool(
            h_conv2, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
        # reshape tensor into a batch of vectors
        h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])

    # Densely connected layer with 1024 neurons.
    h_fc1 = layers.dropout(
        layers.fully_connected(
            h_pool2_flat, 1024, activation_fn=tf.nn.relu),
        keep_prob=0.5,
        is_training=mode == tf.contrib.learn.ModeKeys.TRAIN)

    # Compute logits (1 per class) and compute loss.
    logits = layers.fully_connected(h_fc1, 10, activation_fn=None)
    loss = tf.losses.softmax_cross_entropy(target, logits)

    return tf.argmax(logits, 1), loss


def main(_):
    args, unknown = _parse_args()

    # Horovod: initialize Horovod.
    hvd.init()

    # Download and load MNIST dataset.
    mnist = learn.datasets.mnist.read_data_sets('MNIST-data-%d' % hvd.rank())

    # Build model...
    with tf.name_scope('input'):
        image = tf.placeholder(tf.float32, [None, 784], name='image')
        label = tf.placeholder(tf.float32, [None], name='label')
    predict, loss = conv_model(image, label, tf.contrib.learn.ModeKeys.TRAIN)

    # Horovod: adjust learning rate based on number of GPUs.
    opt = tf.train.RMSPropOptimizer(0.001 * hvd.size())

    # Horovod: add Horovod Distributed Optimizer.
    opt = hvd.DistributedOptimizer(opt)

    global_step = tf.contrib.framework.get_or_create_global_step()
    train_op = opt.minimize(loss, global_step=global_step)

    hooks = [
        # Horovod: BroadcastGlobalVariablesHook broadcasts initial variable states
        # from rank 0 to all other processes. This is necessary to ensure consistent
        # initialization of all workers when training is started with random weights
        # or restored from a checkpoint.
        hvd.BroadcastGlobalVariablesHook(0),

        tf.train.StopAtStepHook(last_step=200 // hvd.size()),

        tf.train.LoggingTensorHook(tensors={'step': global_step, 'loss': loss},
                                   every_n_iter=10),
    ]

    # Horovod: pin GPU to be used to process local rank (one GPU per process)
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    config.gpu_options.visible_device_list = str(hvd.local_rank())

    # Horovod: save checkpoints only on worker 0 to prevent other workers from
    # corrupting them.
    checkpoint_dir = os.path.join(args.model_dir, 'checkpoints') if hvd.rank() == 0 else None

    # The MonitoredTrainingSession takes care of session initialization,
    # restoring from a checkpoint, saving to a checkpoint, and closing when done
    # or an error occurs.
    with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                           hooks=hooks,
                                           config=config) as mon_sess:
        while not mon_sess.should_stop():
            # Run a training step synchronously.
            image_, label_ = mnist.train.next_batch(100)
            mon_sess.run(train_op, feed_dict={image: image_, label: label_})


if __name__ == "__main__":
    tf.app.run()
```
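Note how this script cooperates with the estimator change above: with MPI enabled, `model_dir` defaults to `/opt/ml/model`, so rank 0's `checkpoints` directory lands on the local filesystem and is uploaded as part of the model artifact, while `StopAtStepHook(last_step=200 // hvd.size())` splits the fixed step budget across workers.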