-
Notifications
You must be signed in to change notification settings - Fork 162
Add distributed training support #98
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
* Launch user specified number of parameter servers * Add integ tests * Add unit tests * Add distributed sagemaker integ test
|
||
def _build_tf_config(hosts, current_host, ps_num=0, ps_task=False): | ||
""" | ||
Builds a dictionary containing cluster information based on number of hosts and number of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: combine the """ with the first line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do
hosts (list[str]): List of host name in the cluster | ||
current_host (str): Current host name | ||
ps_num (int): Number of parameter servers | ||
ps_task (bool): Set to True if this config is built for a ps server process |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's note the default for these
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do
ps = hosts[:ps_num] if len(hosts) > 1 and ps_num > 0 else None | ||
|
||
def host_addresses(hosts, port=2222): | ||
return ["{}:{}".format(host, port) for host in hosts] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use single quotes for strings
return tf_config | ||
|
||
|
||
def _get_env_vars_with_tf_config(env, ps_task): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think prefixing with "get" is considered a little unpythonic or whatever - anyway, you can make it just _env_vars_with_tf_config
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this.
- Can you please add the container sizes in the PR description?
- Please add doctrings in all public methods (without _ prefix)
- I thought we had code coverage configured.
- How long does the integ test take?
@mvsusp I will add the docstrings and double check the coverage test setup. Will document the container sizes as well. The integ tests only takes a few minutes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sure you're using single quotes for strings
&& rm $framework_installable_local \ | ||
&& rm $framework_support_installable_local | ||
|
||
ENV SAGEMAKER_TRAINING_MODULE sagemaker_tensorflow_container.training:main |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use a linter like https://github.com/hadolint to help you here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have run this locally, only thing comes up now is the pin version stuff. I will add it to the build in the next pr.
If parameter server is enabled we are going to launch the parameter server process on all the hosts. This chagne also includes some tests fixes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks generally fine to me, but looks like there are still a few comments from Marcio unaddressed
* Implement distributed support * Launch parameter server if user set sagemaker_parameter_server_enabled to be True * Add integ tests * Add unit tests * Add distributed sagemaker integ test * Add 1.11.0 and modify Dockerfile to reduce image size
* Implement distributed support * Launch parameter server if user set sagemaker_parameter_server_enabled to be True * Add integ tests * Add unit tests * Add distributed sagemaker integ test * Add 1.11.0 and modify Dockerfile to reduce image size
* Implement distributed support * Launch parameter server if user set sagemaker_parameter_server_enabled to be True * Add integ tests * Add unit tests * Add distributed sagemaker integ test * Add 1.11.0 and modify Dockerfile to reduce image size
Issue #, if available:
Description of changes:
This change sets up parameter servers for users in distributed training context. We read wether user wants to use parameter servers from hyperparameter 'sagemaker_parameter_server_enabled'. Use that to construct TF_CONFIG and launch ps and worker processes.
Compressed docker image sizes:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.