documentation: adding details about mpi options, other small updates #2135

Merged
merged 3 commits into from Feb 11, 2021

44 changes: 41 additions & 3 deletions doc/api/training/smd_model_parallel_general.rst
Expand Up @@ -5,13 +5,13 @@

.. _sm-sdk-modelparallel-params:

SageMaker Python SDK ``modelparallel`` parameters
=================================================
Required SageMaker Python SDK parameters
========================================

The TensorFlow and PyTorch ``Estimator`` objects contain a ``distribution`` parameter,
which is used to enable and specify parameters for the
initialization of the SageMaker distributed model parallel library. The library internally uses MPI,
so in order to use model parallelism, MPI must be enabled using the ``distribution`` parameter.
so in order to use model parallelism, MPI must also be enabled using the ``distribution`` parameter.

The following is an example of how you can launch a new PyTorch training job with the library.

Expand Down Expand Up @@ -55,6 +55,9 @@ The following is an example of how you can launch a new PyTorch training job wit

smd_mp_estimator.fit('s3://my_bucket/my_training_data/')

``smdistributed`` Parameters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can use the following parameters to initialize the library. Pass them through the
``parameters`` key of the ``smdistributed`` entry in ``distribution`` (a sketch follows the table).

Expand Down Expand Up @@ -302,6 +305,41 @@ table are optional.
| | | | SageMaker. |
+-------------------+-------------------------+-----------------+-----------------------------------+
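
A minimal sketch of how a few of these parameters can be passed through the
``smdistributed`` entry of ``distribution`` (the parameter values below are
illustrative placeholders, not recommendations):

.. code:: python

    # Illustrative values only; tune them for your own model and instance type.
    smdistributed_options = {
        "modelparallel": {
            "enabled": True,
            "parameters": {
                "partitions": 4,      # number of model partitions
                "microbatches": 8,    # microbatches per minibatch for pipelining
                "ddp": True,          # also enable data parallelism
                "optimize": "speed",  # partitioning objective ("speed" or "memory")
            },
        }
    }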

``mpi`` Parameters
^^^^^^^^^^^^^^^^^^
For the ``"mpi"`` key, a dict must be passed which contains:

* ``"enabled"``: Set to ``True`` to launch the training job with MPI.

* ``"processes_per_host"``: Specifies the number of processes MPI should launch on each host.
In SageMaker, a host is a single Amazon EC2 ML instance. The SageMaker distributed model parallel library maintains
a one-to-one mapping between processes and GPUs across model and data parallelism.
This means that SageMaker schedules each process on a single, separate GPU and no GPU contains more than one process.
If you are using PyTorch, you must restrict each process to its own device using
``torch.cuda.set_device(smp.local_rank())``. To learn more, see
`Modify a PyTorch Training Script
<https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-customize-training-script.html#model-parallel-customize-training-script-pt-16>`_.

.. important::
``processes_per_host`` must be less than or equal to the number of GPUs per instance, and typically will be equal to
the number of GPUs per instance.

For example, if you use one instance with 4-way model parallelism and 2-way data parallelism,
then ``processes_per_host`` should be 2 x 4 = 8. Therefore, you must choose an instance that has at least 8 GPUs,
such as an ml.p3.16xlarge.

The following image illustrates how 2-way data parallelism and 4-way model parallelism are distributed across 8 GPUs:
the model is partitioned across 4 GPUs, and each partition is replicated on 2 GPUs.

.. image:: smp_versions/model-data-parallel.png
:width: 650
:alt: 2-way data parallelism and 4-way model parallelism distributed across 8 GPUs


* ``"custom_mpi_options"``: Use this key to pass any custom MPI options you might need.
To avoid Docker warnings from contaminating your training logs, we recommend the following flag.
```--mca btl_vader_single_copy_mechanism none```
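
Putting these together, the following is a hedged sketch of a complete ``distribution``
value for the 2-way data parallel, 4-way model parallel example above (all values are
illustrative and should be adjusted for your own training job):

.. code:: python

    # Sketch: 4-way model parallelism x 2-way data parallelism on one 8-GPU
    # instance (for example, ml.p3.16xlarge).
    distribution = {
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": {"partitions": 4, "microbatches": 8, "ddp": True},
            }
        },
        "mpi": {
            "enabled": True,
            "processes_per_host": 8,  # 4 (model parallel) x 2 (data parallel)
            "custom_mpi_options": "--mca btl_vader_single_copy_mechanism none",
        },
    }

This dict is passed as the ``distribution`` argument of the ``Estimator``, as in the
PyTorch example earlier on this page. Inside the PyTorch training script, each of these
processes should pin itself to its own GPU with ``torch.cuda.set_device(smp.local_rank())``,
as noted above.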


.. _ranking-basics:

Expand Down
Expand Up @@ -118,6 +118,9 @@ The following SageMaker distribute model parallel APIs are common across all fra
- https://www.tensorflow.org/api_docs/python/tf/function\
- https://www.tensorflow.org/guide/function\

Each ``smp.step``-decorated function must have a return value that depends on the
output of ``smp.DistributedModel``.
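
For example, a minimal PyTorch-style sketch (``model`` is assumed to be an
``smp.DistributedModel`` instance and the loss computation is illustrative):

.. code:: python

    import torch
    import smdistributed.modelparallel.torch as smp

    @smp.step
    def train_step(model, inputs, labels):
        # Forward and backward passes on the smp.DistributedModel happen inside
        # the smp.step-decorated function.
        outputs = model(inputs)
        loss = torch.nn.functional.cross_entropy(outputs, labels)
        model.backward(loss)
        # The return value depends on the output of smp.DistributedModel.
        return loss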

**Common parameters**

- ``non_split_inputs`` (``list``): The list of arguments to the decorated function
Expand Down
Expand Up @@ -31,7 +31,6 @@ This API document assumes you use the following import statements in your traini
model in the training script can be wrapped with
``smp.DistributedModel``.


**Example:**

.. code:: python
Expand Down Expand Up @@ -89,6 +88,17 @@ This API document assumes you use the following import statements in your traini
the model objects (``model(inputs)`` and ``model.backward(loss)``) must be made inside
a ``smp.step``-decorated function.

**Using DDP**

If DDP is enabled, do not place a PyTorch
``DistributedDataParallel`` wrapper around the ``DistributedModel`` because
the ``DistributedModel`` wrapper will also handle data parallelism.

Unlike the original DDP wrapper, when you use ``DistributedModel``,
model parameters and buffers are not immediately broadcast across
processes when the wrapper is called. Instead, the broadcast is deferred to the first call of the
``smp.step``-decorated function, after the model partitioning is complete.
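
A hedged sketch of this pattern (``Net`` is a placeholder module, and DDP is assumed to
be enabled through the library's configuration, for example ``"ddp": True`` in the
``smdistributed`` parameters):

.. code:: python

    import torch.nn as nn
    import smdistributed.modelparallel.torch as smp

    class Net(nn.Module):  # placeholder model
        def __init__(self):
            super().__init__()
            self.fc = nn.Linear(10, 10)

        def forward(self, x):
            return self.fc(x)

    model = smp.DistributedModel(Net())  # handles model AND data parallelism
    # Do not additionally wrap the model in torch.nn.parallel.DistributedDataParallel;
    # parameters and buffers are broadcast at the first smp.step call, after partitioning.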

**Parameters**

- ``module`` (``torch.nn.Module``): Module to be distributed (data parallelism and model parallelism).
Expand Down Expand Up @@ -248,11 +258,14 @@ This API document assumes you use the following import statements in your traini
.. function:: join( )

**Available for PyTorch 1.7 only**

A context manager to be used in conjunction with an instance of
``smp.DistributedModel``to be able to train with uneven inputs across
``smp.DistributedModel`` to be able to train with uneven inputs across
participating processes. This is only supported when ``ddp=True`` for
``smp.DistributedModel``. This will use the join with the wrapped
``DistributedDataParallel`` instance. Please see: `join <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel.join>`__.
``DistributedDataParallel`` instance. For more information, see:
`join <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel.join>`__
in the PyTorch documentation.
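
A hedged sketch of using ``join`` (assuming ``model`` is an ``smp.DistributedModel``
with DDP enabled, ``data_loader`` may yield a different number of batches on each
process, and ``train_step`` is an ``smp.step``-decorated training function):

.. code:: python

    # join() keeps collective communication consistent even when some processes
    # run out of batches before others.
    with model.join():
        for inputs, labels in data_loader:
            loss = train_step(model, inputs, labels)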


.. class:: smp.DistributedOptimizer
Expand Down