documentation: add docs for Sagemaker Model Parallel 1.3, released with PT 1.8 #2219

Merged
merged 23 commits on Mar 27, 2021
Commits
3b52234
documentation: add SMP 1.3 docs
rahul003 Mar 16, 2021
70125f8
Merge branch 'master' into smp13
rahul003 Mar 16, 2021
df708ef
Merge branch 'master' into smp13
Mar 17, 2021
29d7497
Fix rst formatting
rahul003 Mar 17, 2021
6e17191
Merge branch 'smp13' of https://github.com/rahul003/sagemaker-python-…
rahul003 Mar 17, 2021
1bc5502
Merge branch 'master' into smp13
ajaykarpur Mar 17, 2021
a984d04
Merge branch 'master' into smp13
rahul003 Mar 18, 2021
a47d3b3
Format
rahul003 Mar 18, 2021
ff1002b
Add line about state dict
rahul003 Mar 19, 2021
791008c
Add 1.7.1 as a supported version
rahul003 Mar 19, 2021
5e383c4
Documentation: adding :noindex: to older versioned files. Updating th…
Mar 19, 2021
6151b4e
Merge remote-tracking branch 'rahul/smp13' into HEAD
Mar 19, 2021
06eb3bc
Removing white spaces
Mar 19, 2021
ea77752
Merge branch 'master' into smp13
TEChopra1000 Mar 19, 2021
b3bd544
Documentation: updating release notes to match aws style guidance
TEChopra1000 Mar 20, 2021
cd5657e
Merge branch 'master' into smp13
TEChopra1000 Mar 22, 2021
0f0d95e
Merge branch 'master' into smp13
TEChopra1000 Mar 23, 2021
d0bb197
Merge branch 'master' into smp13
ajaykarpur Mar 23, 2021
ee46bf5
Merge branch 'master' into smp13
TEChopra1000 Mar 23, 2021
a1befd4
Merge branch 'master' into smp13
TEChopra1000 Mar 24, 2021
8484537
Merge branch 'master' into smp13
TEChopra1000 Mar 25, 2021
30d01ff
Merge branch 'master' into smp13
TEChopra1000 Mar 25, 2021
c594a44
Merge branch 'master' into smp13
TEChopra1000 Mar 27, 2021
140 changes: 83 additions & 57 deletions doc/api/training/smd_model_parallel_general.rst
@@ -91,7 +91,7 @@ table are optional.
| | or ``"simple"`` | | schedule. |
| | | | |
+---------------------------+-------------------------+-------------------+-----------------------+
| ``optimize`` | ``"memory"`` or | ``"memory"`` | Whether the library |
| | ``"speed"`` | | should optimize |
| | | | for speed or |
| | | | memory during |
@@ -261,49 +261,75 @@ table are optional.
.. table::
:widths: 10 20 10 60

+--------------------------+-------------------------+--------------------+--------------------------------------+
| **Parameter** | **Type / Valid values** | **Default** | **Description** |
| | | | |
+--------------------------+-------------------------+--------------------+--------------------------------------+
| ``memory_weight`` | float (between | 0.2 if | The weight of |
| | 0.0 and 1.0) | ``optimize`` is | memory |
| | | ``"speed"``, | balancing in |
| | | else 0.8 | the |
| | | | auto-partitioni |
| | | | ng |
| | | | objective, as |
| | | | opposed to |
| | | | balancing |
| | | | computational |
| | | | load. If 0.0, |
| | | | the library only tries |
| | | | to balance |
| | | | computation; if |
| | | | 1.0 the library only |
| | | | tries to |
| | | | balance the |
| | | | memory use. Any |
| | | | value in |
| | | | between |
| | | | interpolates |
| | | | between these |
| | | | extremes. |
+--------------------------+-------------------------+--------------------+--------------------------------------+
| ``ddp`` | bool | ``False`` | Must be set to |
| | | | ``True`` if |
| | | | hybrid |
| | | | model/data |
| | | | parallelism is |
| | | | used |
| | | | with ``DistributedDataParallel``. |
| | | | ``DistributedDataParallel`` |
| | | | is used with |
| | | | NCCL backend, |
| | | | and uses the |
| | | | ``MASTER_PORT`` |
| | | | provided by |
| | | | SageMaker. |
+--------------------------+-------------------------+--------------------+--------------------------------------+
| ``active_microbatches`` | int | ``partitions`` + 2 | This is the maximum number of |
| (Only >= v1.3) | | | microbatches that are simultaneously |
| | | | in execution during pipelining. |
| | | | Jointly scaling batch |
| | | | size and number of microbatches |
| | | | can often mitigate the pipeline |
| | | | bubble overhead, but that can |
| | | | lead to increased memory usage |
| | | | if too many microbatches are |
| | | | simultaneously in execution. |
| | | | In such cases setting the |
| | | | number of active |
| | | | microbatches to a lower number |
| | | | can help control memory usage. |
| | | | By default this is set to two |
| | | | plus the number of |
| | | | partitions of the model. |
+--------------------------+-------------------------+--------------------+--------------------------------------+
| ``deterministic_server`` | bool | ``False`` | Setting this to true |
| (Only >= v1.3) | | | ensures that the execution |
| | | | server for pipelining |
| | | | executes requests in the |
| | | | same order across all |
| | | | data parallel ranks. |
+--------------------------+-------------------------+--------------------+--------------------------------------+
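
For reference, these ``smp`` parameters are passed to the SageMaker Python SDK estimator through the ``distribution`` argument. The following is a minimal sketch, not a complete example; the entry point, role, instance settings, and parameter values are placeholders to adapt to your own job.

.. code-block:: python

    from sagemaker.pytorch import PyTorch

    # Placeholder values for illustration only.
    smp_options = {
        "enabled": True,
        "parameters": {
            "partitions": 2,
            "microbatches": 4,
            "optimize": "speed",
            "ddp": True,
            "active_microbatches": 4,      # supported in v1.3 and later
            "deterministic_server": True,  # supported in v1.3 and later
        },
    }

    estimator = PyTorch(
        entry_point="train.py",            # your training script
        role="<your-iam-role>",
        instance_type="ml.p3.16xlarge",
        instance_count=1,
        framework_version="1.8.0",
        py_version="py36",
        distribution={
            "smdistributed": {"modelparallel": smp_options},
            "mpi": {"enabled": True, "processes_per_host": 8},
        },
    )
    estimator.fit()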


``mpi`` Parameters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -354,7 +380,7 @@ the process. For instance, if a training job is launched with 4
across all instances, and the ranks of these processes range from 0 to
31.

The ``local_rank`` of a process is the rank of the process among the
processes in the same instance. This can range from 0 up to the number
of GPUs in the instance, but can be lower if fewer processes than GPUs are
launched in the instance. For instance, in the preceding
@@ -363,25 +389,25 @@ since there are 8 GPUs in a ``p3dn.24xlarge`` instance.

When the library is used together with data parallelism (Horovod for TensorFlow
and DDP for PyTorch), the library partitions the set of processes into
disjoint \ ``mp_group``\ s. An ``mp_group`` is a subset of all processes
that together hold a single, partitioned model replica. For instance, if
a single node job is launched with 8 local processes, and
``partitions`` is 2 (meaning the model will be split into 2), there are
four \ ``mp_group``\ s. The specific sets of processes that form the
``mp_group``\ s can be adjusted by the ``placement_strategy`` option. In
this example, if ``placement_strategy`` is ``spread``, then the four
``mp_group``\ s are ``[0, 4], [1, 5], [2, 6], [3, 7]``. An
``mp_rank`` is the rank of a process within its own ``mp_group``. In the
previous example, the ``mp_rank`` of process 1 is 0, and ``mp_rank`` of
process 6 is 1.

Analogously, the library defines ``dp_group``\ s as the sets of processes that
all hold the same model partition, and perform data parallelism among
each other. In the example above, there are two ``dp_group``\ s,
``[0, 1, 2, 3]`` and ``[4, 5, 6, 7]``,

Review comment (Contributor Author): "Not sure why this happened, Talia ran this through an rst formatter"

since each process within the ``dp_group`` holds the same partition of
the model, and makes allreduce calls among themselves. Allreduce for
data parallelism does not take place *across* ``dp_group``\ s.
``dp_rank`` is defined as the rank of a process within its ``dp_group``.
In the preceding example, the \ ``dp_rank`` of process 6 is 2.
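
To make these definitions concrete, the following is a minimal sketch of how a training script can query these ranks with the library's PyTorch API (assuming ``smdistributed.modelparallel.torch``; the values in the comments refer to the examples above).

.. code-block:: python

    import smdistributed.modelparallel.torch as smp

    smp.init()

    # Global rank across all processes in the job (0-31 in the 4-instance example).
    global_rank = smp.rank()
    # Rank among the processes on the same instance (0-7 on a p3dn.24xlarge).
    local_rank = smp.local_rank()
    # Rank within the mp_group that holds one partitioned model replica.
    mp_rank = smp.mp_rank()
    # Rank within the dp_group that performs data parallelism for one partition.
    dp_rank = smp.dp_rank()

    if global_rank == 0:
        print(f"mp_size={smp.mp_size()}, dp_size={smp.dp_size()}")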
@@ -1,3 +1,42 @@
# Sagemaker Distributed Model Parallel 1.3.0 Release Notes

- New Features
- Bug Fixes
- Known Issues

## New Features

### PyTorch

#### Add support for PyTorch 1.8

- Adds a new method, `register_comm_hook`, to `DistributedModel` (PyTorch 1.8 and newer only). This method behaves the same as the method of the same name in the `torch.nn.parallel.DistributedDataParallel` API. Please refer to the [SageMaker distributed model parallel API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_pytorch.html#smp.DistributedModel) for more information.
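
As an illustration, the following is a minimal sketch of registering one of PyTorch's built-in DDP communication hooks on a `DistributedModel`. The model is a placeholder, and it assumes PyTorch 1.8's `default_hooks` module works here the same way it does with `DistributedDataParallel`.

```python
import torch.nn as nn
import smdistributed.modelparallel.torch as smp
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

smp.init()

# Placeholder model for illustration.
model = smp.DistributedModel(nn.Linear(1024, 1024))

# Compress gradients to FP16 during the data-parallel allreduce,
# mirroring DistributedDataParallel.register_comm_hook usage.
model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)
```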

#### Others
- Adds a configuration ``active_microbatches`` to the SageMaker SDK API for launching jobs, to control the number of active microbatches during training. This helps limit memory usage in cases where the number of microbatches is high. Please refer to the [SageMaker Python SDK parameters API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_general.html) for more information.

- Adds a configuration ``deterministic_server`` to the SageMaker SDK API for launching jobs, which ensures that the execution server for pipeline parallelism processes requests in a deterministic order across data parallel ranks. Please refer to the [SageMaker Python SDK parameters API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_general.html) for more information.

- Parameter passing is now supported in `module.forward` methods for `DistributedModel` and its submodules. This removes the previous restriction of having to pass an `nn.Parameter` to the `__init__` call and make it a member of the module in order to use it.

## Bug Fixes

### PyTorch

- Fixed a case where training could hang because a module performed computation requiring gradients that was not used by the final output of the module. Such a situation now raises an error with suggestions on making the computation compatible.

- Fixed an issue where buffers were not placed on the correct device after a model was partitioned, and were not synchronized across steps (when ``broadcast_buffers`` is ``True``). This could have caused correctness issues in models with buffers.

## Known Issues

### PyTorch

- ``mp_barrier`` and ``get_mp_process_group`` are incorrectly marked as deprecated. Please ignore the deprecation warnings.

- A crash was observed when ``optimizer.step()`` was called for certain optimizers, such as AdaDelta, when the partition on which the method was called had no local parameters assigned to it after partitioning. This is due to a bug in PyTorch which [has since been fixed](https://github.com/pytorch/pytorch/pull/52944). Until that fix makes its way into a PyTorch release, call ``optimizer.step()`` only on processes that have at least one local parameter, which can be checked with ``len(list(model.local_parameters())) > 0`` (see the sketch after this list).

- A performance regression still exists when training on SMP with PyTorch 1.7.1 compared to 1.6. The root cause was found to be a slowdown in accesses to the `.grad` attribute in PyTorch 1.7.1 compared to 1.6. Please see the related discussion: https://github.com/pytorch/pytorch/issues/50636. This issue does not exist with PyTorch 1.8.
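
For illustration, a minimal helper encoding the ``optimizer.step()`` workaround above might look like the following (the helper name is hypothetical; `model` is assumed to be an `smp.DistributedModel` and `optimizer` an smp-wrapped optimizer):

```python
def guarded_step(model, optimizer):
    """Call optimizer.step() only if this partition holds local parameters.

    Workaround for the PyTorch bug described above, until the fix lands
    in a PyTorch release.
    """
    if len(list(model.local_parameters())) > 0:
        optimizer.step()
```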

# Sagemaker Distributed Model Parallel 1.2.0 Release Notes

- New Features