
Commit 9ef1642

Author: Talia Chopra (committed)
documentation: adding details about mpi options, other small updates
1 parent b28bb31 commit 9ef1642

File tree

4 files changed (+60, -6 lines)


doc/api/training/smd_model_parallel_general.rst

Lines changed: 41 additions & 3 deletions
@@ -5,13 +5,13 @@
 
 .. _sm-sdk-modelparallel-params:
 
-SageMaker Python SDK ``modelparallel`` parameters
-=================================================
+Required SageMaker Python SDK parameters
+========================================
 
 The TensorFlow and PyTorch ``Estimator`` objects contain a ``distribution`` parameter,
 which is used to enable and specify parameters for the
 initialization of the SageMaker distributed model parallel library. The library internally uses MPI,
-so in order to use model parallelism, MPI must be enabled using the ``distribution`` parameter.
+so in order to use model parallelism, MPI must also be enabled using the ``distribution`` parameter.
 
 The following is an example of how you can launch a new PyTorch training job with the library.
 
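The body of that example is not shown in this diff. Purely for orientation, here is a hedged sketch of what such a launch can look like with the SageMaker Python SDK; the entry point, role, framework version, instance type, and parameter values are illustrative assumptions, not values taken from this commit.

.. code:: python

    # Hedged sketch of launching a PyTorch training job with the model parallel
    # library and MPI enabled. Entry point, role, versions, instance type, and
    # parameter values are illustrative assumptions.
    from sagemaker.pytorch import PyTorch

    smd_mp_estimator = PyTorch(
        entry_point="train.py",            # assumed training script
        role="SageMakerRole",              # assumed IAM role
        instance_type="ml.p3.16xlarge",
        instance_count=1,
        framework_version="1.6.0",
        py_version="py36",
        distribution={
            "smdistributed": {
                "modelparallel": {
                    "enabled": True,
                    "parameters": {"partitions": 2},
                }
            },
            "mpi": {"enabled": True, "processes_per_host": 2},
        },
    )

    smd_mp_estimator.fit('s3://my_bucket/my_training_data/')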
@@ -55,6 +55,9 @@ The following is an example of how you can launch a new PyTorch training job wit
 
 smd_mp_estimator.fit('s3://my_bucket/my_training_data/')
 
+``smdistributed`` Parameters
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
 You can use the following parameters to initialize the library using the ``parameters``
 dictionary in the ``smdistributed`` section of ``distribution``.
 
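As an illustration of that nesting, the hedged sketch below shows where the ``parameters`` dict sits inside ``distribution``; the option names and values are examples, not a complete or authoritative list.

.. code:: python

    # Hedged sketch: library initialization options go in the "parameters" dict
    # nested under "smdistributed" -> "modelparallel". Values are illustrative.
    distribution = {
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": {
                    "partitions": 4,     # number of model partitions
                    "microbatches": 8,   # microbatches per training batch
                    "ddp": True,         # combine with data parallelism
                },
            }
        },
        "mpi": {"enabled": True, "processes_per_host": 8},
    }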
@@ -302,6 +305,41 @@ table are optional.
 |                   |                         |                 | SageMaker.                        |
 +-------------------+-------------------------+-----------------+-----------------------------------+
 
+``mpi`` Parameters
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+For the ``"mpi"`` key, a dict must be passed which contains:
+
+* ``"enabled"``: Set to ``True`` to launch the training job with MPI.
+
+* ``"processes_per_host"``: Specifies the number of processes MPI should launch on each host.
+  In SageMaker a host is a single Amazon EC2 ml instance. The SageMaker Python SDK maintains
+  a one-to-one mapping between processes and GPUs across model and data parallelism.
+  This means that SageMaker schedules each process on a single, separate GPU and no GPU contains more than one process.
+  If you are using PyTorch, you must restrict each process to its own device using
+  ``torch.cuda.set_device(smp.local_rank())``. To learn more, see
+  `Modify a PyTorch Training Script
+  <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-customize-training-script.html#model-parallel-customize-training-script-pt-16>`_.
+
+  .. important::
+     ``processes_per_host`` must not be greater than the number of GPUs per instance, and typically will be equal to
+     the number of GPUs per instance.
+
+  For example, if you use one instance with 4-way model parallelism and 2-way data parallelism,
+  then ``processes_per_host`` should be 2 x 4 = 8. Therefore, you must choose an instance that has at least 8 GPUs,
+  such as an ml.p3.16xlarge.
+
+  The following image illustrates how 2-way data parallelism and 4-way model parallelism are distributed across 8 GPUs:
+  the model is partitioned across 4 GPUs, and each partition is copied to 2 GPUs.
+
+  .. image:: smp_versions/model-data-parallel.png
+     :width: 650
+     :alt: 2-way data parallelism and 4-way model parallelism distributed across 8 GPUs
+
+
+* ``"custom_mpi_options"``: Use this key to pass any custom MPI options you might need.
+  To avoid Docker warnings from contaminating your training logs, we recommend the following flag:
+  ``--mca btl_vader_single_copy_mechanism none``
+
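Tying the ``"mpi"`` keys described above together, here is a hedged sketch matching the 4-way model parallel, 2-way data parallel example (4 x 2 = 8 processes on a single 8-GPU instance); the ``smdistributed`` values shown are illustrative.

.. code:: python

    # Hedged sketch of the "mpi" entry for the example above: 4-way model
    # parallelism x 2-way data parallelism = 8 processes on one ml.p3.16xlarge.
    distribution = {
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": {"partitions": 4, "ddp": True},  # illustrative values
            }
        },
        "mpi": {
            "enabled": True,
            "processes_per_host": 8,
            # Keep Docker warnings out of the training logs, as recommended above.
            "custom_mpi_options": "--mca btl_vader_single_copy_mechanism none",
        },
    }

    # In the PyTorch training script, pin each process to its own GPU:
    #     torch.cuda.set_device(smp.local_rank())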
 
 .. _ranking-basics:
 

doc/api/training/smp_versions/v1.2.0/smd_model_parallel_common_api.rst

Lines changed: 3 additions & 0 deletions
@@ -118,6 +118,9 @@ The following SageMaker distribute model parallel APIs are common across all fra
 - https://www.tensorflow.org/api_docs/python/tf/function\
 - https://www.tensorflow.org/guide/function\
 
+Each ``smp.step`` decorated function must have a return value that depends on the
+output of ``smp.DistributedModel``.
+
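As a PyTorch-flavored illustration of this requirement, the hedged sketch below returns values computed from the ``smp.DistributedModel`` forward pass; the function and variable names are assumptions.

.. code:: python

    # Hedged sketch: the smp.step-decorated function returns values that depend
    # on the output of smp.DistributedModel. Names are illustrative.
    import torch.nn.functional as F
    import smdistributed.modelparallel.torch as smp

    @smp.step
    def train_step(model, data, target):
        output = model(data)                 # forward through smp.DistributedModel
        loss = F.nll_loss(output, target)    # loss depends on the model output
        model.backward(loss)                 # library-aware backward pass
        return output, loss                  # return values depend on the model output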
 **Common parameters**
 
 - ``non_split_inputs`` (``list``): The list of arguments to the decorated function

doc/api/training/smp_versions/v1.2.0/smd_model_parallel_pytorch.rst

Lines changed: 16 additions & 3 deletions
@@ -31,7 +31,6 @@ This API document assumes you use the following import statements in your traini
 model in the training script can be wrapped with
 ``smp.DistributedModel``.
 
-
 **Example:**
 
 .. code:: python
@@ -89,6 +88,17 @@ This API document assumes you use the following import statements in your traini
 the model objects (``model(inputs)`` and ``model.backward(loss)``) must be made inside
 a ``smp.step``-decorated function.
 
+**Using DDP**
+
+If DDP is enabled, do not place a PyTorch
+``DistributedDataParallel`` wrapper around the ``DistributedModel`` because
+the ``DistributedModel`` wrapper will also handle data parallelism.
+
+Unlike the original DDP wrapper, when you use ``DistributedModel``,
+model parameters and buffers are not immediately broadcast across
+processes when the wrapper is called. Instead, the broadcast is deferred to the first call of the
+``smp.step``-decorated function, when the partition is done.
+
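A hedged sketch of this guidance (the model below is a stand-in): wrap the model once with ``smp.DistributedModel`` and do not add the PyTorch DDP wrapper on top of it.

.. code:: python

    # Hedged sketch: with DDP enabled, smp.DistributedModel handles data
    # parallelism itself, so the model is NOT wrapped in
    # torch.nn.parallel.DistributedDataParallel. The model here is a stand-in.
    import torch
    import smdistributed.modelparallel.torch as smp

    smp.init()                                 # initialize the library
    model = torch.nn.Linear(10, 10)            # stand-in for a real model
    model = smp.DistributedModel(model)        # handles model AND data parallelism
    # model = torch.nn.parallel.DistributedDataParallel(model)  # do NOT do this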
 **Parameters**
 
 - ``module`` (``torch.nn.Module``): Module to be distributed (data parallelism and model parallelism).
@@ -248,11 +258,14 @@ This API document assumes you use the following import statements in your traini
 .. function:: join( )
 
 **Available for PyTorch 1.7 only**
+
 A context manager to be used in conjunction with an instance of
-``smp.DistributedModel``to be able to train with uneven inputs across
+``smp.DistributedModel`` to be able to train with uneven inputs across
 participating processes. This is only supported when ``ddp=True`` for
 ``smp.DistributedModel``. This will use the join with the wrapped
-``DistributedDataParallel`` instance. Please see: `join <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel.join>`__.
+``DistributedDataParallel`` instance. For more information, see:
+`join <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel.join>`__
+in the PyTorch documentation.
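A hedged usage sketch, assuming ``model`` is an ``smp.DistributedModel`` created with ``ddp=True`` and ``train_step`` is an ``smp.step``-decorated function; the data loader name is illustrative.

.. code:: python

    # Hedged sketch: train with inputs that are unevenly distributed across
    # ranks. `model`, `train_step`, and `uneven_dataloader` are assumed names.
    with model.join():
        for data, target in uneven_dataloader:   # some ranks run out of batches first
            loss = train_step(model, data, target)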
 
 
 .. class:: smp.DistributedOptimizer
