documentation: add docs for Sagemaker Model Parallel 1.3, released with PT 1.8 #2219

Merged
merged 23 commits on Mar 27, 2021
Commits
3b52234
documentation: add SMP 1.3 docs
rahul003 Mar 16, 2021
70125f8
Merge branch 'master' into smp13
rahul003 Mar 16, 2021
df708ef
Merge branch 'master' into smp13
Mar 17, 2021
29d7497
Fix rst formatting
rahul003 Mar 17, 2021
6e17191
Merge branch 'smp13' of https://github.com/rahul003/sagemaker-python-…
rahul003 Mar 17, 2021
1bc5502
Merge branch 'master' into smp13
ajaykarpur Mar 17, 2021
a984d04
Merge branch 'master' into smp13
rahul003 Mar 18, 2021
a47d3b3
Format
rahul003 Mar 18, 2021
ff1002b
Add line about state dict
rahul003 Mar 19, 2021
791008c
Add 1.7.1 as a supported version
rahul003 Mar 19, 2021
5e383c4
Documentation: adding :noindex: to older versioned files. Updating th…
Mar 19, 2021
6151b4e
Merge remote-tracking branch 'rahul/smp13' into HEAD
Mar 19, 2021
06eb3bc
Removing white spaces
Mar 19, 2021
ea77752
Merge branch 'master' into smp13
TEChopra1000 Mar 19, 2021
b3bd544
Documentation: updating release notes to match aws style guidance
TEChopra1000 Mar 20, 2021
cd5657e
Merge branch 'master' into smp13
TEChopra1000 Mar 22, 2021
0f0d95e
Merge branch 'master' into smp13
TEChopra1000 Mar 23, 2021
d0bb197
Merge branch 'master' into smp13
ajaykarpur Mar 23, 2021
ee46bf5
Merge branch 'master' into smp13
TEChopra1000 Mar 23, 2021
a1befd4
Merge branch 'master' into smp13
TEChopra1000 Mar 24, 2021
8484537
Merge branch 'master' into smp13
TEChopra1000 Mar 25, 2021
30d01ff
Merge branch 'master' into smp13
TEChopra1000 Mar 25, 2021
c594a44
Merge branch 'master' into smp13
TEChopra1000 Mar 27, 2021
140 changes: 83 additions & 57 deletions doc/api/training/smd_model_parallel_general.rst
@@ -91,7 +91,7 @@ table are optional.
| | or ``"simple"`` | | schedule. |
| | | | |
+---------------------------+-------------------------+-------------------+-----------------------+
| ``optimize`` | ``"memory"`` or | ``"memory"`` | Whether the library |
| | ``"speed"`` | | should optimize |
| | | | for speed or |
| | | | memory during |
@@ -261,49 +261,75 @@ table are optional.
.. table::
:widths: 10 20 10 60

+--------------------------+-------------------------+--------------------+--------------------------------------+
| **Parameter** | **Type / Valid values** | **Default** | **Description** |
| | | | |
+--------------------------+-------------------------+--------------------+--------------------------------------+
| ``memory_weight`` | float (between | 0.2 if | The weight of |
| | 0.0 and 1.0) | ``optimize`` is | memory |
| | | ``"speed"``, | balancing in |
| | | else 0.8 | the |
| | | | auto-partitioni |
| | | | ng |
| | | | objective, as |
| | | | opposed to |
| | | | balancing |
| | | | computational |
| | | | load. If 0.0, |
| | | | the library only tries |
| | | | to balance |
| | | | computation; if |
| | | | 1.0 the library only |
| | | | tries to |
| | | | balance the |
| | | | memory use. Any |
| | | | value in |
| | | | between |
| | | | interpolates |
| | | | between these |
| | | | extremes. |
+--------------------------+-------------------------+--------------------+--------------------------------------+
| ``ddp`` | bool | ``False`` | Must be set to |
| | | | ``True`` if |
| | | | hybrid |
| | | | model/data |
| | | | parallelism is |
| | | | used |
| | | | with ``DistributedDataParallel``. |
| | | | ``DistributedDataParallel`` |
| | | | is used with |
| | | | NCCL backend, |
| | | | and uses the |
| | | | ``MASTER_PORT`` |
| | | | provided by |
| | | | SageMaker. |
+--------------------------+-------------------------+--------------------+--------------------------------------+
| ``active_microbatches`` | int | ``partitions`` + 2 | This is the maximum number of |
| (Only >= v1.3) | | | microbatches that are simultaneously |
| | | | in execution during pipelining. |
| | | | Jointly scaling batch |
| | | | size and number of microbatches |
| | | | can often mitigate the pipeline |
| | | | bubble overhead, but that can |
| | | | lead to increased memory usage |
| | | | if too many microbatches are |
| | | | simultaneously in execution. |
| | | | In such cases setting the |
| | | | number of active |
| | | | microbatches to a lower number |
| | | | can help control memory usage. |
| | | | By default this is set to two |
| | | | plus the number of |
| | | | partitions of the model. |
+--------------------------+-------------------------+--------------------+--------------------------------------+
| ``deterministic_server`` | bool | ``False`` | Setting this to true |
| (Only >= v1.3) | | | ensures that the execution |
| | | | server for pipelining |
| | | | executes requests in the |
| | | | same order across all |
| | | | data parallel ranks. |
+--------------------------+-------------------------+--------------------+--------------------------------------+
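
For reference, these ``smp`` parameters are passed to the SageMaker Python SDK estimator through the ``distribution`` argument. The following is a minimal sketch, not a complete example; the entry point, role, instance settings, and parameter values are placeholders to adapt to your own job.

.. code-block:: python

    from sagemaker.pytorch import PyTorch

    # Placeholder values for illustration only.
    smp_options = {
        "enabled": True,
        "parameters": {
            "partitions": 2,
            "microbatches": 4,
            "optimize": "speed",
            "ddp": True,
            "active_microbatches": 4,      # supported in v1.3 and later
            "deterministic_server": True,  # supported in v1.3 and later
        },
    }

    estimator = PyTorch(
        entry_point="train.py",            # your training script
        role="<your-iam-role>",
        instance_type="ml.p3.16xlarge",
        instance_count=1,
        framework_version="1.8.0",
        py_version="py36",
        distribution={
            "smdistributed": {"modelparallel": smp_options},
            "mpi": {"enabled": True, "processes_per_host": 8},
        },
    )
    estimator.fit()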


``mpi`` Parameters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -354,7 +380,7 @@ the process. For instance, if a training job is launched with 4
across all instances, and the ranks of these processes range from 0 to
31.

The ``local_rank`` of a process is the rank of the process among the
processes in the same instance. This can range from 0 up to the number
of GPUs in the instance, but can be lower if fewer processes than GPUs are
launched in the instance. For instance, in the preceding
@@ -363,25 +389,25 @@ since there are 8 GPUs in a ``p3dn.24xlarge`` instance.

When the library is used together with data parallelism (Horovod for TensorFlow
and DDP for PyTorch), the library partitions the set of processes into
disjoint \ ``mp_group``\ s. An ``mp_group`` is a subset of all processes
that together hold a single, partitioned model replica. For instance, if
a single node job is launched with 8 local processes, and
``partitions`` is 2 (meaning the model will be split into 2), there are
four \ ``mp_group``\ s. The specific sets of processes that form the
``mp_group``\ s can be adjusted by the ``placement_strategy`` option. In
this example, if ``placement_strategy`` is ``spread``, then the four
``mp_group``\ s are ``[0, 4], [1, 5], [2, 6], [3, 7]``. An
``mp_rank`` is the rank of a process within its own ``mp_group``. In the
previous example, the ``mp_rank`` of process 1 is 0, and ``mp_rank`` of
process 6 is 1.

Analogously, the library defines ``dp_group``\ s as the sets of processes that
all hold the same model partition, and perform data parallelism among
each other. In the example above, there are two ``dp_group``\ s,
``[0, 1, 2, 3]`` and ``[4, 5, 6, 7]``,

Review comment (Contributor Author): "Not sure why this happened, Talia ran this through an rst formatter"

since each process within the ``dp_group`` holds the same partition of
the model, and makes allreduce calls among themselves. Allreduce for
data parallelism does not take place *across* ``dp_group``\ s.
``dp_rank`` is defined as the rank of a process within its ``dp_group``.
In the preceding example, the \ ``dp_rank`` of process 6 is 2.
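
To make these definitions concrete, the following is a minimal sketch of how a training script can query these ranks with the library's PyTorch API (assuming ``smdistributed.modelparallel.torch``; the values in the comments refer to the examples above).

.. code-block:: python

    import smdistributed.modelparallel.torch as smp

    smp.init()

    # Global rank across all processes in the job (0-31 in the 4-instance example).
    global_rank = smp.rank()
    # Rank among the processes on the same instance (0-7 on a p3dn.24xlarge).
    local_rank = smp.local_rank()
    # Rank within the mp_group that holds one partitioned model replica.
    mp_rank = smp.mp_rank()
    # Rank within the dp_group that performs data parallelism for one partition.
    dp_rank = smp.dp_rank()

    if global_rank == 0:
        print(f"mp_size={smp.mp_size()}, dp_size={smp.dp_size()}")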
@@ -1,3 +1,42 @@
# Sagemaker Distributed Model Parallel 1.3.0 Release Notes

- New Features
- Bug Fixes
- Known Issues

## New Features

### PyTorch

#### Add support for PyTorch 1.8

- Adds a new method, `register_comm_hook`, to `DistributedModel` (PyTorch 1.8 and newer only). This method behaves the same as the method of the same name in the `torch.nn.parallel.DistributedDataParallel` API. Please refer to the [SageMaker distributed model parallel API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_pytorch.html#smp.DistributedModel) for more information.
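
As an illustration, the following is a minimal sketch of registering one of PyTorch's built-in DDP communication hooks on a `DistributedModel`. The model is a placeholder, and it assumes PyTorch 1.8's `default_hooks` module works here the same way it does with `DistributedDataParallel`.

```python
import torch.nn as nn
import smdistributed.modelparallel.torch as smp
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

smp.init()

# Placeholder model for illustration.
model = smp.DistributedModel(nn.Linear(1024, 1024))

# Compress gradients to FP16 during the data-parallel allreduce,
# mirroring DistributedDataParallel.register_comm_hook usage.
model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)
```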

#### Others
- Adds a configuration ``active_microbatches`` to the SageMaker SDK API for launching jobs, to control the number of active microbatches during training. This helps limit memory usage in cases where the number of microbatches is high. Please refer to the [SageMaker Python SDK parameters API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_general.html) for more information.

- Adds a configuration ``deterministic_server`` to the SageMaker SDK API for launching jobs, which ensures that the execution server for pipeline parallelism processes requests in a deterministic order across data parallel ranks. Please refer to the [SageMaker Python SDK parameters API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_general.html) for more information.

- Parameter passing is now supported in `module.forward` methods for `DistributedModel` and its submodules. This removes the previous restriction of having to pass an `nn.Parameter` to the `__init__` call and make it a member of the module in order to use it.

## Bug Fixes

### PyTorch

- Fixed a case where training could hang because a module performed computation requiring gradients that was not used by the final output of the module. Such a situation now raises an error with suggestions on making the computation compatible.

- Fixed an issue where buffers were not placed on the correct device after a model was partitioned, and were not synchronized across steps (when ``broadcast_buffers`` is ``True``). This could have caused correctness issues in models with buffers.

## Known Issues

### PyTorch

- ``mp_barrier`` and ``get_mp_process_group`` are incorrectly marked as deprecated. Please ignore the deprecation warnings.

- A crash was observed when ``optimizer.step()`` was called for certain optimizers, such as AdaDelta, when the partition on which the method was called had no local parameters assigned to it after partitioning. This is due to a bug in PyTorch which [has since been fixed](https://github.com/pytorch/pytorch/pull/52944). Until that fix makes its way into a PyTorch release, call ``optimizer.step()`` only on processes that have at least one local parameter, which can be checked with ``len(list(model.local_parameters())) > 0`` (see the sketch after this list).

- A performance regression still exists when training on SMP with PyTorch 1.7.1 compared to 1.6. The root cause was found to be a slowdown in accesses to the `.grad` attribute in PyTorch 1.7.1 compared to 1.6. Please see the related discussion: https://github.com/pytorch/pytorch/issues/50636. This issue does not exist with PyTorch 1.8.
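
For illustration, a minimal helper encoding the ``optimizer.step()`` workaround above might look like the following (the helper name is hypothetical; `model` is assumed to be an `smp.DistributedModel` and `optimizer` an smp-wrapped optimizer):

```python
def guarded_step(model, optimizer):
    """Call optimizer.step() only if this partition holds local parameters.

    Workaround for the PyTorch bug described above, until the fix lands
    in a PyTorch release.
    """
    if len(list(model.local_parameters())) > 0:
        optimizer.step()
```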

# Sagemaker Distributed Model Parallel 1.2.0 Release Notes

- New Features