doc: Enhance smddp 1.2.2 doc #2852


Merged: 4 commits, Jan 13, 2022
2 changes: 1 addition & 1 deletion doc/Makefile
@@ -3,7 +3,7 @@

# You can set these variables from the command line.
SPHINXOPTS = -W
SPHINXBUILD = python -msphinx
SPHINXBUILD = python -msphinx
SPHINXPROJ = sagemaker
SOURCEDIR = .
BUILDDIR = _build
232 changes: 132 additions & 100 deletions doc/api/training/sdp_versions/latest/smd_data_parallel_pytorch.rst
@@ -2,10 +2,12 @@
PyTorch Guide to SageMaker's distributed data parallel library
##############################################################

.. admonition:: Contents
Use this guide to learn about the SageMaker distributed
data parallel library API for PyTorch.

- :ref:`pytorch-sdp-modify`
- :ref:`pytorch-sdp-api`
.. contents:: Topics
:depth: 3
:local:

.. _pytorch-sdp-modify:

@@ -55,7 +57,7 @@ API offered for PyTorch.


- Modify the ``torch.utils.data.distributed.DistributedSampler`` to
include the cluster’s information. Set``num_replicas`` to the
include the cluster’s information. Set ``num_replicas`` to the
total number of GPUs participating in training across all the nodes
in the cluster. This is called ``world_size``. You can get
``world_size`` with
@@ -110,7 +112,7 @@ you will have for distributed training with the distributed data parallel librar
def main():

    # Scale batch size by world size
    batch_size //= dist.get_world_size() // 8
    batch_size //= dist.get_world_size()
    batch_size = max(batch_size, 1)

    # Prepare dataset
@@ -153,9 +155,132 @@ you will have for distributed training with the distributed data parallel librar
PyTorch API
===========

.. rubric:: Supported versions
.. class:: smdistributed.dataparallel.torch.parallel.DistributedDataParallel(module, device_ids=None, output_device=None, broadcast_buffers=True, process_group=None, bucket_cap_mb=None)

``smdistributed.dataparallel``'s implementation of distributed data
parallelism for PyTorch. In most cases, wrapping your PyTorch Module
with ``smdistributed.dataparallel``'s ``DistributedDataParallel`` (DDP) is
all you need to do to use ``smdistributed.dataparallel``.

Creating an instance of this DDP class requires that ``smdistributed.dataparallel``
is already initialized
with ``smdistributed.dataparallel.torch.distributed.init_process_group()``.

This container parallelizes the application of the given module by
splitting the input across the specified devices, chunking it along the
batch dimension. The module is replicated on each machine and each
device, and each such replica handles a portion of the input. During the
backward pass, gradients from each node are averaged.

The batch size should be larger than the number of GPUs used locally.
Example usage
of ``smdistributed.dataparallel.torch.parallel.DistributedDataParallel``:

.. code:: python

import torch
import smdistributed.dataparallel.torch.distributed as dist
from smdistributed.dataparallel.torch.parallel import DistributedDataParallel as DDP

dist.init_process_group()

# Pin GPU to be used to process local rank (one GPU per process)
torch.cuda.set_device(dist.get_local_rank())

# Build model and optimizer
model = ...
optimizer = torch.optim.SGD(model.parameters(),
                            lr=1e-3 * dist.get_world_size())
# Wrap model with smdistributed.dataparallel's DistributedDataParallel
model = DDP(model)

**Parameters:**

- ``module (torch.nn.Module)(required):`` PyTorch NN module to be
parallelized.
- ``device_ids (list[int])(optional):`` CUDA devices. This should only
be provided when the input module resides on a single CUDA device.
For single-device modules, the ``i``-th module replica is placed on
``device_ids[i]``. For multi-device modules and CPU modules,
``device_ids`` must be ``None`` or an empty list, and input data for
the forward pass must be placed on the correct device. Defaults
to ``None``.
- ``output_device (int)(optional):`` Device location of output for
single-device CUDA modules. For multi-device modules and CPU modules,
it must be ``None``, and the module itself dictates the output
location (default: ``device_ids[0]`` for single-device modules).
Defaults to ``None``.
- ``broadcast_buffers (bool)(optional):`` Flag that enables syncing
(broadcasting) buffers of the module at the beginning of the forward
function. ``smdistributed.dataparallel`` does not support broadcast
buffers yet, so set this to ``False``.
- ``process_group (smdistributed.dataparallel.torch.distributed.group)(optional):`` Process
groups are not supported in ``smdistributed.dataparallel``. This
parameter exists only for API parity with ``torch.distributed``. The only
supported value is
``smdistributed.dataparallel.torch.distributed.group.WORLD``. Defaults
to ``None``.
- ``bucket_cap_mb (int)(optional):`` DistributedDataParallel buckets
parameters into multiple buckets so that the gradient reduction of
each bucket can potentially overlap with the backward
computation. ``bucket_cap_mb`` controls the bucket size in
megabytes (MB) (default: 25). See the sketch after this list for an
example of passing the optional parameters.
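
For illustration, below is a minimal sketch (based only on the signature
documented above) of wrapping a model while passing the optional
parameters explicitly; the ``bucket_cap_mb`` value shown is simply the
documented default, not a tuning recommendation.

.. code:: python

import torch
import smdistributed.dataparallel.torch.distributed as dist
from smdistributed.dataparallel.torch.parallel import DistributedDataParallel as DDP

dist.init_process_group()

# Pin each process to its local GPU, as in the example above
torch.cuda.set_device(dist.get_local_rank())

model = ...  # your torch.nn.Module

# broadcast_buffers is set to False because buffer broadcast is not supported;
# bucket_cap_mb keeps the documented 25 MB default for gradient buckets
model = DDP(model,
            broadcast_buffers=False,
            bucket_cap_mb=25)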

.. note::

This module assumes that all parameters are registered in the model by the
time it is created. No parameters should be added or removed later.

.. note::

This module assumes that all parameters are registered in the model of
each distributed process in the same order. The module itself
conducts gradient all-reduce following the reverse order of
the registered parameters of the model. In other words, it is the user's
responsibility to ensure that each distributed process has the exact
same model and thus the exact same parameter registration order.

.. note::

You should never change the set of your model's parameters after
wrapping your model with DistributedDataParallel. The
DistributedDataParallel constructor registers the additional
gradient reduction functions on all the parameters of the model
at construction time. If you change the model's
parameters after the DistributedDataParallel construction, this is
not supported and unexpected behavior can occur, because some
parameters' gradient reduction functions might not get called.

.. method:: no_sync()

``smdistributed.dataparallel`` supports the `PyTorch DDP no_sync() <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel.no_sync>`_
context manager. It enables gradient accumulation by skipping AllReduce
during training iterations inside the context.

.. note::

The ``no_sync()`` context manager is available in smdistributed-dataparallel v1.2.2 and later.
For the release note, see :ref:`sdp_1.2.2_release_note`.

**PyTorch 1.7.1, 1.8.1**
**Example:**

.. code:: python

# Gradients are accumulated while inside the no_sync context
with model.no_sync():
    ...
    loss.backward()

# First iteration after exiting the context:
# incoming gradients are added to the accumulated gradients and then synchronized via AllReduce
...
loss.backward()

# Update weights and reset gradients to zero after accumulation is finished
optimizer.step()
optimizer.zero_grad()
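
For additional context, the following is a minimal sketch of a complete
gradient-accumulation loop built on the pattern above; ``data_loader``,
``loss_fn``, and the ``accumulation_steps`` value are hypothetical
placeholders, not part of the library API.

.. code:: python

accumulation_steps = 4  # hypothetical number of micro-batches per optimizer step

for step, (inputs, targets) in enumerate(data_loader):
    outputs = model(inputs)
    loss = loss_fn(outputs, targets) / accumulation_steps

    if (step + 1) % accumulation_steps != 0:
        # Skip AllReduce; gradients accumulate locally on each GPU
        with model.no_sync():
            loss.backward()
    else:
        # This backward pass adds to the accumulated gradients and synchronizes them via AllReduce
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()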


.. function:: smdistributed.dataparallel.torch.distributed.is_available()
@@ -409,99 +534,6 @@ PyTorch API
otherwise.


.. class:: smdistributed.dataparallel.torch.parallel.DistributedDataParallel(module, device_ids=None, output_device=None, broadcast_buffers=True, process_group=None, bucket_cap_mb=None)

``smdistributed.dataparallel's`` implementation of distributed data
parallelism for PyTorch. In most cases, wrapping your PyTorch Module
with ``smdistributed.dataparallel's`` ``DistributedDataParallel (DDP)`` is
all you need to do to use ``smdistributed.dataparallel``.

Creation of this DDP class requires ``smdistributed.dataparallel``
already initialized
with ``smdistributed.dataparallel.torch.distributed.init_process_group()``.

This container parallelizes the application of the given module by
splitting the input across the specified devices by chunking in the
batch dimension. The module is replicated on each machine and each
device, and each such replica handles a portion of the input. During the
backwards pass, gradients from each node are averaged.

The batch size should be larger than the number of GPUs used locally.
Example usage
of ``smdistributed.dataparallel.torch.parallel.DistributedDataParallel``:

.. code:: python

import torch
import smdistributed.dataparallel.torch.distributed as dist
from smdistributed.dataparallel.torch.parallel import DistributedDataParallel as DDP

dist.init_process_group()

# Pin GPU to be used to process local rank (one GPU per process)
torch.cuda.set_device(dist.get_local_rank())

# Build model and optimizer
model = ...
optimizer = torch.optim.SGD(model.parameters(),
                            lr=1e-3 * dist.get_world_size())
# Wrap model with smdistributed.dataparallel's DistributedDataParallel
model = DDP(model)

**Parameters:**

- ``module (torch.nn.Module)(required):`` PyTorch NN Module to be
parallelized
- ``device_ids (list[int])(optional):`` CUDA devices. This should only
be provided when the input module resides on a single CUDA device.
For single-device modules,
the ``ith module replica is placed on device_ids[i]``. For
multi-device modules and CPU modules, device_ids must be None or an
empty list, and input data for the forward pass must be placed on the
correct device. Defaults to ``None``.
- ``output_device (int)(optional):`` Device location of output for
single-device CUDA modules. For multi-device modules and CPU modules,
it must be None, and the module itself dictates the output location.
(default: device_ids[0] for single-device modules).  Defaults
to ``None``.
- ``broadcast_buffers (bool)(optional):`` Flag that enables syncing
(broadcasting) buffers of the module at beginning of the forward
function. ``smdistributed.dataparallel`` does not support broadcast
buffer yet. Please set this to ``False``.
- ``process_group(smdistributed.dataparallel.torch.distributed.group)(optional):`` Process
group is not supported in ``smdistributed.dataparallel``. This
parameter exists for API parity with torch.distributed only. Only
supported value is
``smdistributed.dataparallel.torch.distributed.group.WORLD.`` Defaults
to ``None.``
- ``bucket_cap_mb (int)(optional):`` DistributedDataParallel will
bucket parameters into multiple buckets so that gradient reduction of
each bucket can potentially overlap with backward
computation. ``bucket_cap_mb`` controls the bucket size in
MegaBytes (MB) (default: 25).

.. rubric:: Notes

- This module assumes all parameters are registered in the model by the
time it is created. No parameters should be added nor removed later.
- This module assumes all parameters are registered in the model of
each distributed processes are in the same order. The module itself
will conduct gradient all-reduction following the reverse order of
the registered parameters of the model. In other words, it is users’
responsibility to ensure that each distributed process has the exact
same model and thus the exact same parameter registration order.
- You should never change the set of your model’s parameters after
wrapping up your model with DistributedDataParallel. In other words,
when wrapping up your model with DistributedDataParallel, the
constructor of DistributedDataParallel will register the additional
gradient reduction functions on all the parameters of the model
itself at the time of construction. If you change the model’s
parameters after the DistributedDataParallel construction, this is
not supported and unexpected behaviors can happen, since some
parameters’ gradient reduction functions might not get called.


.. class:: smdistributed.dataparallel.torch.distributed.ReduceOp

An enum-like class for supported reduction operations
@@ -155,10 +155,6 @@ script you will have for distributed training with the library.
TensorFlow API
==============

.. rubric:: Supported versions

**TensorFlow 2.3.1, 2.4.1, 2.5.0**

.. function:: smdistributed.dataparallel.tensorflow.init()

Initialize ``smdistributed.dataparallel``. Must be called at the