Add distributed training support #98

Merged: 18 commits into aws:script-mode on Nov 6, 2018

Conversation

@icywang86rui (Contributor) commented Oct 26, 2018

Issue #, if available:

Description of changes:
This change sets up parameter servers for users in a distributed training context. We read whether the user wants to use parameter servers from the hyperparameter 'sagemaker_parameter_server_enabled', and use that to construct TF_CONFIG and launch the ps and worker processes.
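As a rough sketch of that flow (illustrative only; _parameter_server_enabled and _env_with_tf_config are placeholder names for this sketch, not necessarily the PR's actual helpers):

import json
import os


def _parameter_server_enabled(hyperparameters):
    # 'sagemaker_parameter_server_enabled' is the hyperparameter named in this
    # PR; reading it as a plain dict entry is an assumption of this sketch.
    return bool(hyperparameters.get('sagemaker_parameter_server_enabled', False))


def _env_with_tf_config(tf_config):
    # TensorFlow's distributed runtime reads the cluster layout from the
    # TF_CONFIG environment variable as a JSON string.
    env = os.environ.copy()
    env['TF_CONFIG'] = json.dumps(tf_config)
    return env

Each worker and ps process is then launched with a TF_CONFIG built for its own role (see the _build_tf_config hunks reviewed below).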

Compressed docker image sizes:

  • cpu-py3 - 401MB
  • gpu-py3 - 1651MB

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

* Launch user specified number of parameter servers
* Add integ tests
* Add unit tests
* Add distributed sagemaker integ test

def _build_tf_config(hosts, current_host, ps_num=0, ps_task=False):
    """
    Builds a dictionary containing cluster information based on number of hosts and number of
    parameter servers.
Contributor: nit: combine the """ with the first line

@icywang86rui (Author): Will do

    hosts (list[str]): List of host names in the cluster
    current_host (str): Current host name
    ps_num (int): Number of parameter servers
    ps_task (bool): Set to True if this config is built for a ps server process
Contributor: let's note the default for these
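For illustration, given the defaults visible in the signature above (ps_num=0, ps_task=False), the entries could read "ps_num (int): Number of parameter servers (default: 0)" and "ps_task (bool): Set to True if this config is built for a ps server process (default: False)" (suggested wording only, not the committed text).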

@icywang86rui (Author): will do

ps = hosts[:ps_num] if len(hosts) > 1 and ps_num > 0 else None

def host_addresses(hosts, port=2222):
    return ["{}:{}".format(host, port) for host in hosts]
Contributor: use single quotes for strings

return tf_config


def _get_env_vars_with_tf_config(env, ps_task):
Contributor: I think prefixing with "get" is considered a little unpythonic or whatever - anyway, you can make it just _env_vars_with_tf_config
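Pulling the reviewed hunks together, here is a self-contained sketch of what _build_tf_config could look like. The cluster/task layout follows TensorFlow's documented TF_CONFIG format; everything not visible in the hunks above (the master/worker split, the separate ps port, the task typing) is an assumption rather than the PR's committed code:

def _build_tf_config(hosts, current_host, ps_num=0, ps_task=False):
    """Builds a TF_CONFIG dictionary describing the cluster (sketch)."""
    def host_addresses(hosts, port=2222):
        return ['{}:{}'.format(host, port) for host in hosts]

    masters = hosts[:1]  # assumption: the first host acts as master
    workers = hosts[1:]  # assumption: the remaining hosts act as workers
    ps = hosts[:ps_num] if len(hosts) > 1 and ps_num > 0 else None

    cluster = {'master': host_addresses(masters), 'worker': host_addresses(workers)}
    if ps:
        # assumption: ps servers listen on a separate port so a host can run
        # both a worker and a ps process
        cluster['ps'] = host_addresses(ps, port=2223)

    if ps_task:
        if not ps:
            raise ValueError('ps_task requires ps_num > 0 and more than one host')
        task_type, task_index = 'ps', ps.index(current_host)
    elif current_host in masters:
        task_type, task_index = 'master', 0
    else:
        task_type, task_index = 'worker', workers.index(current_host)

    tf_config = {
        'cluster': cluster,
        'environment': 'cloud',
        'task': {'index': task_index, 'type': task_type},
    }
    return tf_config

A worker would then be launched with _env_with_tf_config(_build_tf_config(hosts, current_host)) from the earlier sketch, and a ps process with ps_task=True.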

@mvsusp (Contributor) left a comment:

Thanks for doing this.

  • Can you please add the container sizes in the PR description?
  • Please add docstrings in all public methods (without _ prefix)
  • I thought we had code coverage configured.
  • How long does the integ test take?

@icywang86rui (Author): @mvsusp I will add the docstrings and double-check the coverage test setup. I will document the container sizes as well. The integ tests only take a few minutes.

@laurenyu (Contributor) left a comment: make sure you're using single quotes for strings

&& rm $framework_installable_local \
&& rm $framework_support_installable_local

ENV SAGEMAKER_TRAINING_MODULE sagemaker_tensorflow_container.training:main
Contributor: You can use a linter like hadolint (https://github.com/hadolint/hadolint) to help you here.

@icywang86rui (Author): I have run this locally; the only thing that comes up now is the pin-version stuff. I will add it to the build in the next PR.

If the parameter server is enabled, we launch the parameter server process on all hosts. This change also includes some test fixes.
@laurenyu (Contributor) left a comment: looks generally fine to me, but looks like there are still a few comments from Marcio unaddressed

@icywang86rui merged commit 032cf60 into aws:script-mode on Nov 6, 2018
Elizaaaaa pushed a commit to Elizaaaaa/sagemaker-tensorflow-container that referenced this pull request Nov 4, 2019
* Implement distributed support

* Launch parameter server if user set sagemaker_parameter_server_enabled to be True
* Add integ tests
* Add unit tests
* Add distributed sagemaker integ test
* Add 1.11.0 and modify Dockerfile to reduce image size