documentation: adding details about mpi options, other small updates #2135
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository
* ``"enabled"``: Set to ``True`` to launch the training job with MPI.

* ``"processes_per_host"``: Specifies the number of processes MPI should launch on each host.
  In SageMaker a host is a single Amazon EC2 ml instance. The SageMaker Python SDK maintains
We should say "SageMaker modelparallel library maintains ..." instead of "SageMaker Python SDK maintains ...".
<https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-customize-training-script.html#model-parallel-customize-training-script-pt-16>`_.

.. important::
   ``process_per_host`` must be less than the number of GPUs per instance, and typically will be equal to
"...must be less than or equal to the number of GPUs per instance"
such as an ml.p3.16xlarge.
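To make the constraint discussed in this thread concrete, here is a minimal sketch of the ``mpi`` options with a sanity check on the process count. It assumes the reviewer's corrected bound (less than or equal to the number of GPUs per instance) and an ml.p3.16xlarge with 8 GPUs. The helper ``build_mpi_options`` is hypothetical and is not part of the SageMaker Python SDK; only the ``"enabled"`` and ``"processes_per_host"`` keys come from the documentation under review.

```python
def build_mpi_options(processes_per_host, gpus_per_instance):
    """Hypothetical helper: build the ``mpi`` options dict after a bounds check.

    Assumes ``processes_per_host`` must be at least 1 and at most the number
    of GPUs on a single instance, as suggested in the review comment.
    """
    if not 1 <= processes_per_host <= gpus_per_instance:
        raise ValueError(
            f"processes_per_host={processes_per_host} must be between 1 and "
            f"{gpus_per_instance} (the number of GPUs per instance)"
        )
    return {"enabled": True, "processes_per_host": processes_per_host}


# An ml.p3.16xlarge instance has 8 GPUs, so one process per GPU gives 8.
mpi_options = build_mpi_options(processes_per_host=8, gpus_per_instance=8)

# The dict would sit under the "mpi" key of the estimator's distribution
# argument; other keys of that argument are omitted here.
distribution = {"mpi": mpi_options}
```

Keeping the check next to the configuration surfaces a bad process count before a training job is launched rather than after it fails on the instance.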
The following image illustrates how 2-way data parallelism and 4-way model parallelism is distributed across 8 GPUs:
the models is partitioned across 4 GPUs, and each partition is added to 2 GPUs.
models -> model
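Since the image itself is not reproduced in this conversation, the layout the sentence describes can be sketched as a small table of ranks. This is only an illustration: the function ``rank_layout`` and the names ``dp_rank`` and ``mp_rank`` are made up here, and the choice to place the partitions of one replica on consecutive ranks is an assumption. The library's actual placement may differ.

```python
def rank_layout(num_gpus, mp_degree):
    """Illustrative only: map each GPU rank to a (replica, partition) pair.

    Assumes consecutive ranks hold the partitions of one model replica.
    """
    if num_gpus % mp_degree != 0:
        raise ValueError("num_gpus must be a multiple of mp_degree")
    return [
        {"rank": rank, "dp_rank": rank // mp_degree, "mp_rank": rank % mp_degree}
        for rank in range(num_gpus)
    ]


# 8 GPUs with 4-way model parallelism leaves 2-way data parallelism.
layout = rank_layout(num_gpus=8, mp_degree=4)
dp_degree = 8 // 4

# Group the GPU ranks by the model partition they hold.
gpus_per_partition = {}
for entry in layout:
    gpus_per_partition.setdefault(entry["mp_rank"], []).append(entry["rank"])

for partition, gpus in sorted(gpus_per_partition.items()):
    print(f"partition {partition}: GPUs {gpus}")
```

Under this assumed layout, each of the 4 partitions lands on exactly 2 GPUs, one per model replica, which matches the description in the hunk above.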
Unlike the original DDP wrapper, when you use ``DistributedModel``,
model parameters and buffers are not immediately broadcast across
processes when the wrapper is called. Instead, the broadcast is deferred to the first call of the
``smp.step-decorated`` function when the partition is done.
Only smp.step should be in code style.
Description of changes:
mpi options
join() description.

Testing done:
tox -e black-check,flake8,pylint,docstyle,sphinx,doc8 --parallel all
Merge Checklist
Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

Tests
unique_name_from_base to create resource names in integ tests (if appropriate)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.