
documentation: add docs for Sagemaker Model Parallel 1.3, released with PT 1.8 #2219


Merged
merged 23 commits into from Mar 27, 2021
Changes from all commits
23 commits
3b52234
documentation: add SMP 1.3 docs
rahul003 Mar 16, 2021
70125f8
Merge branch 'master' into smp13
rahul003 Mar 16, 2021
df708ef
Merge branch 'master' into smp13
Mar 17, 2021
29d7497
Fix rst formatting
rahul003 Mar 17, 2021
6e17191
Merge branch 'smp13' of https://github.com/rahul003/sagemaker-python-…
rahul003 Mar 17, 2021
1bc5502
Merge branch 'master' into smp13
ajaykarpur Mar 17, 2021
a984d04
Merge branch 'master' into smp13
rahul003 Mar 18, 2021
a47d3b3
Format
rahul003 Mar 18, 2021
ff1002b
Add line about state dict
rahul003 Mar 19, 2021
791008c
Add 1.7.1 as a supported version
rahul003 Mar 19, 2021
5e383c4
Documentation: adding :noindex: to older versioned files. Updating th…
Mar 19, 2021
6151b4e
Merge remote-tracking branch 'rahul/smp13' into HEAD
Mar 19, 2021
06eb3bc
Removing white spaces
Mar 19, 2021
ea77752
Merge branch 'master' into smp13
TEChopra1000 Mar 19, 2021
b3bd544
Documentation: updating release notes to match aws style guidance
TEChopra1000 Mar 20, 2021
cd5657e
Merge branch 'master' into smp13
TEChopra1000 Mar 22, 2021
0f0d95e
Merge branch 'master' into smp13
TEChopra1000 Mar 23, 2021
d0bb197
Merge branch 'master' into smp13
ajaykarpur Mar 23, 2021
ee46bf5
Merge branch 'master' into smp13
TEChopra1000 Mar 23, 2021
a1befd4
Merge branch 'master' into smp13
TEChopra1000 Mar 24, 2021
8484537
Merge branch 'master' into smp13
TEChopra1000 Mar 25, 2021
30d01ff
Merge branch 'master' into smp13
TEChopra1000 Mar 25, 2021
c594a44
Merge branch 'master' into smp13
TEChopra1000 Mar 27, 2021
1 change: 1 addition & 0 deletions doc/api/training/smd_model_parallel.rst
@@ -34,6 +34,7 @@ Select a version to see the API documentation for version. To use the library, r
.. toctree::
:maxdepth: 1

smp_versions/v1_3_0.rst
smp_versions/v1_2_0.rst
smp_versions/v1_1_0.rst

140 changes: 83 additions & 57 deletions doc/api/training/smd_model_parallel_general.rst

Large diffs are not rendered by default.

@@ -1,3 +1,42 @@
# SageMaker Distributed Model Parallel 1.3.0 Release Notes

- New Features
- Bug Fixes
- Known Issues

## New Features

### PyTorch

#### Add support for PyTorch 1.8

- Adds a new ``register_comm_hook`` method to ``DistributedModel`` (for PyTorch 1.8 and newer only). This method behaves the same as the method of the same name in the
`torch.DistributedDataParallel` API. Refer to the [SageMaker distributed model parallel API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_pytorch.html#smp.DistributedModel) for more information.

#### Others
- Adds an ``active_microbatches`` configuration to the SageMaker SDK API for launching jobs, which controls the number of active microbatches during training. This helps limit memory usage when the number of microbatches is high. Refer to the [SageMaker Python SDK parameters API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_general.html) for more information.

- Adds a ``deterministic_server`` configuration to the SageMaker SDK API for launching jobs, which ensures that the execution server for pipeline parallelism processes requests in a deterministic order across data parallel ranks. Refer to the [SageMaker Python SDK parameters API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_general.html) for more information. A configuration sketch showing how both new parameters are passed follows this list.

- Parameter passing is now supported in ``module.forward`` methods for ``DistributedModel`` and its submodules. This removes the previous restriction of having to pass an `nn.Parameter` to the module's `__init__` call and make it a member of the module in order to use it.
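
The two new configuration parameters above are passed to the library through the ``distribution`` argument of a SageMaker Python SDK estimator. The following is a minimal sketch rather than a verbatim example from this release; the entry point, IAM role, instance settings, and parameter values are placeholders.

```python
from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled": True,
    "parameters": {
        "partitions": 2,
        "microbatches": 8,
        "active_microbatches": 4,      # new in 1.3: limit concurrently active microbatches
        "deterministic_server": True,  # new in 1.3: deterministic request ordering
        "ddp": True,
    },
}

estimator = PyTorch(
    entry_point="train.py",            # placeholder training script
    role="<your-iam-role>",            # placeholder IAM role ARN
    instance_type="ml.p3.16xlarge",
    instance_count=1,
    framework_version="1.8.0",
    py_version="py36",
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": {"enabled": True, "processes_per_host": 8},
    },
)
```
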
## Bug Fixes

### PyTorch

- Fixed a case where training hangs when a module has computation that requires gradients but is not used by the final output of the module. Such a situation now raises an error with suggestions on how to make the computation compatible.

- Fixed an issue with buffers that caused them to be placed on the wrong device after a model is partitioned, and to not be synchronized across steps (when ``broadcast_buffers`` is ``True``). This could have caused correctness issues in models with buffers.

## Known Issues

### PyTorch

- ``mp_barrier`` and ``get_mp_process_group`` are incorrectly marked as deprecated. Ignore the deprecation warning.

- A crash was observed when ``optimizer.step()`` was called for certain optimizers, such as AdaDelta, on a partition that has no local parameters assigned to it after partitioning. This is due to a bug in PyTorch which [has since been fixed](https://github.com/pytorch/pytorch/pull/52944). Until that fix makes its way into the next release of PyTorch, only call ``optimizer.step()`` on processes which have at least one local parameter. This can be checked with ``len(list(model.local_parameters())) > 0``; see the sketch after this list.

- A performance regression still exists when training on SMP with PyTorch 1.7.1 compared to 1.6. The root cause was found to be a slowdown of `.grad` method calls in PyTorch 1.7.1 compared to 1.6. See the related discussion: https://github.com/pytorch/pytorch/issues/50636. This issue does not exist with PyTorch 1.8.
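
A minimal sketch of the ``optimizer.step()`` workaround described above; ``model`` and ``optimizer`` are assumed to be the ``smp.DistributedModel`` and ``smp.DistributedOptimizer`` objects created earlier in a training script, so this is an excerpt rather than a complete program.

```python
# Only step the optimizer on processes whose partition received at least
# one parameter, to avoid the PyTorch bug described above.
if len(list(model.local_parameters())) > 0:
    optimizer.step()
```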

# SageMaker Distributed Model Parallel 1.2.0 Release Notes

- New Features
@@ -11,7 +50,7 @@
#### Add support for PyTorch 1.7.1

- Adds support for `gradient_as_bucket_view` (PyTorch 1.7.1 only), `find_unused_parameters` (PyTorch 1.7.1 only) and `broadcast_buffers` options to `smp.DistributedModel`. These options behave the same as the corresponding options (with the same names) in
`torch.DistributedDataParallel` API. Please refer to the [SageMaker distributed model parallel API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_pytorch.html#smp.DistributedModel) for more information.
`torch.DistributedDataParallel` API. Refer to the [SageMaker distributed model parallel API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_pytorch.html#smp.DistributedModel) for more information.

- Adds support for `join` (PyTorch 1.7.1 only) context manager, which is to be used in conjunction with an instance of `smp.DistributedModel` to be able to train with uneven inputs across participating processes.

@@ -36,7 +75,7 @@ regular dicts.

### PyTorch

- A performance regression was observed when training on SMP with PyTorch 1.7.1 compared to 1.6.0. The rootcause was found to be the slowdown in performance of `.grad` method calls in PyTorch 1.7.1 compared to 1.6.0. Please see the related discussion: https://github.com/pytorch/pytorch/issues/50636.
- A performance regression was observed when training on SMP with PyTorch 1.7.1 compared to 1.6.0. The rootcause was found to be the slowdown in performance of `.grad` method calls in PyTorch 1.7.1 compared to 1.6.0. See the related discussion: https://github.com/pytorch/pytorch/issues/50636.


# SageMaker Distributed Model Parallel 1.1.0 Release Notes
@@ -265,7 +265,9 @@ This API document assumes you use the following import statements in your traini
Returns the ``state_dict`` that contains optimizer state for the entire model.
It first collects the ``local_state_dict`` and gathers and merges
the ``local_state_dict`` from all ``mp_rank``s to create a full
``state_dict``.
``state_dict``. Note that this needs to be called on all ranks with
``dp_rank()==0`` to ensure the gather happens properly.
If it is not called on all such ranks, it can hang.
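
A minimal sketch of the calling pattern, assuming ``optimizer`` is the wrapped optimizer from the training script and ``torch`` is imported (the variable names are illustrative, not part of this API):

.. code:: python

    # Every rank with dp_rank() == 0 must participate in the gather;
    # calling state_dict() on only some of those ranks can hang.
    if smp.dp_rank() == 0:
        full_osd = optimizer.state_dict()   # gathered across mp_ranks
        if smp.rank() == 0:
            torch.save(full_osd, "optimizer_state.pt")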

.. function:: load_state_dict( )
:noindex:
@@ -24,10 +24,12 @@ The following SageMaker distribute model parallel APIs are common across all fra


.. function:: smp.init( )
:noindex:

Initialize the library. Must be called at the beginning of the training script.

.. function:: @smp.step(non_split_inputs, input_split_axes, [*args, **kwargs])
:noindex:

A decorator that must be placed over a function that represents a single
forward and backward pass (for training use cases), or a single forward
@@ -162,7 +164,7 @@ The following SageMaker distribute model parallel APIs are common across all fra


.. class:: StepOutput

:noindex:

A class that encapsulates all versions of a ``tf.Tensor``
or \ ``torch.Tensor`` across all microbatches.
@@ -191,27 +193,32 @@ The following SageMaker distribute model parallel APIs are common across all fra
post-processing operations on tensors.

.. data:: StepOutput.outputs
:noindex:

Returns a list of the underlying tensors, indexed by microbatch.

.. function:: StepOutput.reduce_mean( )
:noindex:

Returns a ``tf.Tensor`` / ``torch.Tensor`` that averages the constituent ``tf.Tensor``\ s /
``torch.Tensor``\ s. This is commonly used for averaging loss and gradients across microbatches.
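
A minimal sketch of typical PyTorch usage, assuming ``model`` is an ``smp.DistributedModel`` and ``train_step`` is a user-defined step function (illustrative names, not part of this API):

.. code:: python

    import torch.nn.functional as F

    @smp.step
    def train_step(model, data, target):
        output = model(data)
        loss = F.nll_loss(output, target)
        model.backward(loss)   # library-aware backward inside the step function
        return loss

    step_output = train_step(model, data, target)  # a StepOutput
    loss = step_output.reduce_mean()               # average across microbatches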

.. function:: StepOutput.reduce_sum( )
:noindex:

Returns a ``tf.Tensor`` /
``torch.Tensor`` that sums the constituent
``tf.Tensor``\ s/\ ``torch.Tensor``\ s.

.. function:: StepOutput.concat( )
:noindex:

Returns a
``tf.Tensor``/``torch.Tensor`` that concatenates tensors along the
batch dimension using ``tf.concat`` / ``torch.cat``.

.. function:: StepOutput.stack( )
:noindex:

Applies ``tf.stack`` / ``torch.stack``
operation to the list of constituent ``tf.Tensor``\ s /
@@ -220,13 +227,15 @@ The following SageMaker distribute model parallel APIs are common across all fra
**TensorFlow-only methods**

.. function:: StepOutput.merge( )
:noindex:

Returns a ``tf.Tensor`` that
concatenates the constituent ``tf.Tensor``\ s along the batch
dimension. This is commonly used for merging the model predictions
across microbatches.

.. function:: StepOutput.accumulate(method="variable", var=None)
:noindex:

Functionally the same as ``StepOutput.reduce_mean()``. However, it is
more memory-efficient, especially for large numbers of microbatches,
@@ -252,6 +261,7 @@ The following SageMaker distribute model parallel APIs are common across all fra
ignored.

.. _mpi_basics:
:noindex:

MPI Basics
^^^^^^^^^^
@@ -274,7 +284,8 @@ The library exposes the following basic MPI primitives to its Python API:
- ``smp.get_dp_group()``: The list of ranks that hold different
replicas of the same model partition.

.. _communication_api:
.. _communication_api:
:noindex:

Communication API
^^^^^^^^^^^^^^^^^
@@ -288,6 +299,7 @@ should involve.
**Helper structures**

.. data:: smp.CommGroup
:noindex:

An ``enum`` that takes the values
``CommGroup.WORLD``, ``CommGroup.MP_GROUP``, and ``CommGroup.DP_GROUP``.
@@ -306,6 +318,7 @@ should involve.
themselves.

.. data:: smp.RankType
:noindex:

An ``enum`` that takes the values
``RankType.WORLD_RANK``, ``RankType.MP_RANK``, and ``RankType.DP_RANK``.
@@ -321,6 +334,7 @@ should involve.
**Communication primitives:**

.. function:: smp.broadcast(obj, group)
:noindex:

Sends the object to all processes in the
group. The receiving process must call ``smp.recv_from`` to receive the
@@ -353,6 +367,7 @@ should involve.
    smp.recv_from(0, rank_type=smp.RankType.WORLD_RANK)

.. function:: smp.send(obj, dest_rank, rank_type)
:noindex:

Sends the object ``obj`` to
``dest_rank``, which is of a type specified by ``rank_type``.
@@ -376,6 +391,7 @@ should involve.
``recv_from`` call.

.. function:: smp.recv_from(src_rank, rank_type)
:noindex:

Receive an object from a peer process. Can be used with a matching
``smp.send`` or a ``smp.broadcast`` call.
@@ -401,6 +417,7 @@ should involve.
``broadcast`` call, and the object is received.

.. function:: smp.allgather(obj, group)
:noindex:

A collective call that gathers all the
submitted objects across all ranks in the specified ``group``. Returns a
@@ -434,6 +451,7 @@ should involve.
    out = smp.allgather(obj2, smp.CommGroup.MP_GROUP)  # returns [obj1, obj2]

.. function:: smp.barrier(group=smp.WORLD)
:noindex:

A statement that hangs until all
processes in the specified group reach the barrier statement, similar to
@@ -455,12 +473,14 @@ should involve.
processes outside that ``mp_group``.

.. function:: smp.dp_barrier()
:noindex:

Same as passing ``smp.DP_GROUP``\ to ``smp.barrier()``.
Waits for the processes in the same \ ``dp_group`` as
the current process to reach the same point in execution.

.. function:: smp.mp_barrier()
:noindex:

Same as passing ``smp.MP_GROUP`` to
``smp.barrier()``. Waits for the processes in the same ``mp_group`` as