
Commit c6f738a

Merge branch 'master' into env_support_training
2 parents 26767b6 + b7b4549 commit c6f738a

14 files changed (+1381 −70 lines)

doc/api/training/smd_model_parallel.rst

Lines changed: 1 addition & 0 deletions
@@ -34,6 +34,7 @@ Select a version to see the API documentation for version. To use the library, r
 .. toctree::
    :maxdepth: 1

+   smp_versions/v1_3_0.rst
    smp_versions/v1_2_0.rst
    smp_versions/v1_1_0.rst

doc/api/training/smd_model_parallel_general.rst

Lines changed: 83 additions & 57 deletions
Large diffs are not rendered by default.

doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.md

Lines changed: 41 additions & 2 deletions
@@ -1,3 +1,42 @@
+# Sagemaker Distributed Model Parallel 1.3.0 Release Notes
+
+- New Features
+- Bug Fixes
+- Known Issues
+
+## New Features
+
+### PyTorch
+
+#### Add support for PyTorch 1.8
+
+- Adds a new method, ``register_comm_hook``, to DistributedModel (PyTorch 1.8 and newer only). This method behaves the same as the method with the same name in the
+`torch.DistributedDataParallel` API. Refer to the [SageMaker distributed model parallel API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_pytorch.html#smp.DistributedModel) for more information.
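For illustration, a minimal sketch of how such a hook might be registered, assuming the call signature matches ``torch.nn.parallel.DistributedDataParallel.register_comm_hook`` and using the stock FP16-compression hook shipped with PyTorch 1.8; the wrapped module is a placeholder:

```python
import torch.nn as nn
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks
import smdistributed.modelparallel.torch as smp

smp.init()

# Wrap a (placeholder) module with the library's DistributedModel wrapper.
model = smp.DistributedModel(nn.Linear(1024, 1024))

# Register the built-in FP16 gradient-compression hook (PyTorch 1.8+),
# mirroring the DistributedDataParallel method of the same name.
model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)
```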
+
+#### Others
+- Adds a configuration, ``active_microbatches``, to the SageMaker SDK API for launching jobs, to control the number of active microbatches during training. This helps limit memory usage in cases where the number of microbatches is high. (An illustrative launch configuration follows this list.) Refer to the [SageMaker Python SDK parameters API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_general.html) for more information.
+
+- Adds a configuration, ``deterministic_server``, to the SageMaker SDK API for launching jobs, which ensures that the execution server for pipeline parallelism processes requests in a deterministic order across data parallel ranks. Refer to the [SageMaker Python SDK parameters API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_general.html) for more information.
+
+- Parameter passing is now supported in ``module.forward`` methods for DistributedModel and its submodules. This removes the restriction of having to pass ``nn.Parameter`` to the ``__init__`` call and make it a member of the module in order to use it.
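For illustration only, a hedged sketch of how these two parameters might be passed through the SageMaker Python SDK's ``distribution`` argument; the entry point, IAM role, instance settings, and the other ``parameters`` values below are placeholders, not recommendations:

```python
from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled": True,
    "parameters": {
        "partitions": 2,               # placeholder values
        "microbatches": 8,
        "pipeline": "interleaved",
        "active_microbatches": 4,      # new in 1.3.0: cap in-flight microbatches to limit memory
        "deterministic_server": True,  # new in 1.3.0: deterministic request ordering
    },
}

estimator = PyTorch(
    entry_point="train.py",            # placeholder training script
    role="<your-iam-role-arn>",        # placeholder IAM role
    instance_type="ml.p3.16xlarge",
    instance_count=1,
    framework_version="1.8.0",
    py_version="py36",
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": {"enabled": True, "processes_per_host": 8},
    },
)
estimator.fit("s3://<bucket>/<training-data-prefix>")  # placeholder data location
```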
+## Bug Fixes
+
+### PyTorch
+
+- Fixed a case where training hangs because a module has computation that requires gradients but is not used by the module's final output. Such a situation now raises an error with suggestions for making the computation compatible.
+
+- Fixed an issue where buffers were not placed on the correct device after a model is partitioned, and were not synchronized across steps (when ``broadcast_buffers`` is ``True``). This could have caused correctness issues in models with buffers.
+
+## Known Issues
+
+### PyTorch
+
+- ``mp_barrier`` and ``get_mp_process_group`` are incorrectly marked as deprecated methods. Ignore the deprecation warning.
+
+- A crash was observed when ``optimizer.step()`` was called for certain optimizers, such as AdaDelta, when the partition on which this method was called has no local parameters assigned to it after partitioning. This is due to a bug in PyTorch which [has since been fixed](https://github.com/pytorch/pytorch/pull/52944). Until that fix makes it into the next release of PyTorch, call ``optimizer.step()`` only on processes that have at least one local parameter, which can be checked with ``len(list(model.local_parameters())) > 0``. (A sketch of this workaround follows this list.)
+
+- A performance regression still exists when training on SMP with PyTorch 1.7.1 compared to 1.6. The root cause was found to be a slowdown in the performance of `.grad` method calls in PyTorch 1.7.1 compared to 1.6. See the related discussion: https://github.com/pytorch/pytorch/issues/50636. This issue does not exist with PyTorch 1.8.
+
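A minimal sketch of the workaround above, assuming a typical SMP training script in which the model and optimizer have already been wrapped by ``smp.DistributedModel`` and ``smp.DistributedOptimizer``:

```python
import torch
import torch.nn as nn
import smdistributed.modelparallel.torch as smp

smp.init()

model = smp.DistributedModel(nn.Linear(32, 32))  # placeholder module
optimizer = smp.DistributedOptimizer(torch.optim.Adadelta(model.parameters()))

# ... run the forward/backward pass inside an @smp.step-decorated function ...

# Workaround: only call optimizer.step() on processes that were assigned
# at least one local parameter after partitioning; other processes skip it.
if len(list(model.local_parameters())) > 0:
    optimizer.step()
```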
# Sagemaker Distributed Model Parallel 1.2.0 Release Notes

 - New Features
@@ -11,7 +50,7 @@
 #### Add support for PyTorch 1.7.1

 - Adds support for `gradient_as_bucket_view` (PyTorch 1.7.1 only), `find_unused_parameters` (PyTorch 1.7.1 only) and `broadcast_buffers` options to `smp.DistributedModel`. These options behave the same as the corresponding options (with the same names) in
-`torch.DistributedDataParallel` API. Please refer to the [SageMaker distributed model parallel API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_pytorch.html#smp.DistributedModel) for more information.
+`torch.DistributedDataParallel` API. Refer to the [SageMaker distributed model parallel API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_pytorch.html#smp.DistributedModel) for more information.

 - Adds support for `join` (PyTorch 1.7.1 only) context manager, which is to be used in conjunction with an instance of `smp.DistributedModel` to be able to train with uneven inputs across participating processes.

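For illustration, a hedged sketch of passing the ``smp.DistributedModel`` options named in the hunk above; the wrapped module is a placeholder and the values shown are not recommendations:

```python
import torch.nn as nn
import smdistributed.modelparallel.torch as smp

smp.init()

model = smp.DistributedModel(
    nn.Linear(1024, 1024),          # placeholder module
    gradient_as_bucket_view=True,   # PyTorch 1.7.1 only
    find_unused_parameters=False,   # PyTorch 1.7.1 only
    broadcast_buffers=True,         # keep buffers in sync across replicas
)
```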
@@ -36,7 +75,7 @@ regular dicts.

 ### PyTorch

-- A performance regression was observed when training on SMP with PyTorch 1.7.1 compared to 1.6.0. The rootcause was found to be the slowdown in performance of `.grad` method calls in PyTorch 1.7.1 compared to 1.6.0. Please see the related discussion: https://github.com/pytorch/pytorch/issues/50636.
+- A performance regression was observed when training on SMP with PyTorch 1.7.1 compared to 1.6.0. The rootcause was found to be the slowdown in performance of `.grad` method calls in PyTorch 1.7.1 compared to 1.6.0. See the related discussion: https://github.com/pytorch/pytorch/issues/50636.

# Sagemaker Distributed Model Parallel 1.1.0 Release Notes

doc/api/training/smp_versions/v1.1.0/smd_model_parallel_pytorch.rst

Lines changed: 3 additions & 1 deletion
@@ -265,7 +265,9 @@ This API document assumes you use the following import statements in your traini
    Returns the ``state_dict`` that contains optimizer state for the entire model.
    It first collects the ``local_state_dict`` and gathers and merges
    the ``local_state_dict`` from all ``mp_rank``s to create a full
-   ``state_dict``.
+   ``state_dict``. Please note that this needs to be called on all ranks with
+   ``dp_rank()==0`` to ensure the gather happens properly.
+   If it is called on only a subset of those ranks, it can hang.

 .. function:: load_state_dict( )
    :noindex:
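A hedged sketch of the calling pattern this note implies; ``optimizer`` is assumed to be the library-wrapped optimizer defined earlier in the training script, and the checkpoint path is a placeholder:

```python
import torch
import smdistributed.modelparallel.torch as smp

# Every rank with dp_rank() == 0 (one full set of model-parallel partitions)
# must participate in the gather, otherwise the call can hang.
if smp.dp_rank() == 0:
    full_optimizer_state = optimizer.state_dict()
    if smp.rank() == 0:
        # Only a single process writes the gathered state to disk.
        torch.save(full_optimizer_state, "/opt/ml/checkpoints/optimizer.pt")
```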

doc/api/training/smp_versions/v1.2.0/smd_model_parallel_common_api.rst

Lines changed: 22 additions & 2 deletions
@@ -24,10 +24,12 @@ The following SageMaker distribute model parallel APIs are common across all fra


 .. function:: smp.init( )
+   :noindex:

    Initialize the library. Must be called at the beginning of training script.

 .. function:: @smp.step(non_split_inputs, input_split_axes, [*args, **kwargs])
+   :noindex:

    A decorator that must be placed over a function that represents a single
    forward and backward pass (for training use cases), or a single forward
@@ -162,7 +164,7 @@ The following SageMaker distribute model parallel APIs are common across all fra


 .. class:: StepOutput
-
+   :noindex:

    A class that encapsulates all versions of a ``tf.Tensor``
    or \ ``torch.Tensor`` across all microbatches.
@@ -191,27 +193,32 @@ The following SageMaker distribute model parallel APIs are common across all fra
    post-processing operations on tensors.

 .. data:: StepOutput.outputs
+   :noindex:

    Returns a list of the underlying tensors, indexed by microbatch.

 .. function:: StepOutput.reduce_mean( )
+   :noindex:

    Returns a ``tf.Tensor``, ``torch.Tensor`` that averages the constituent ``tf.Tensor`` s
    ``torch.Tensor`` s. This is commonly used for averaging loss and gradients across microbatches.

 .. function:: StepOutput.reduce_sum( )
+   :noindex:

    Returns a ``tf.Tensor`` /
    ``torch.Tensor`` that sums the constituent
    ``tf.Tensor``\ s/\ ``torch.Tensor``\ s.

 .. function:: StepOutput.concat( )
+   :noindex:

    Returns a
    ``tf.Tensor``/``torch.Tensor`` that concatenates tensors along the
    batch dimension using ``tf.concat`` / ``torch.cat``.

 .. function:: StepOutput.stack( )
+   :noindex:

    Applies ``tf.stack`` / ``torch.stack``
    operation to the list of constituent ``tf.Tensor``\ s /
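A hedged sketch of how ``@smp.step`` and ``StepOutput.reduce_mean()`` fit together in a PyTorch training script; the model, optimizer, loss function, and data are placeholders, and the surrounding data-loading code is omitted:

```python
import torch.nn.functional as F
import smdistributed.modelparallel.torch as smp

@smp.step
def train_step(model, data, target):
    output = model(data)
    loss = F.nll_loss(output, target, reduction="mean")
    model.backward(loss)  # inside smp.step, backward is invoked on the model
    return loss

# train_step returns a StepOutput holding one loss tensor per microbatch;
# reduce_mean() averages them into a single scalar for logging.
loss = train_step(model, data, target).reduce_mean()
optimizer.step()
```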
@@ -220,13 +227,15 @@ The following SageMaker distribute model parallel APIs are common across all fra
220227
**TensorFlow-only methods**
221228

222229
.. function:: StepOutput.merge( )
230+
:noindex:
223231

224232
Returns a ``tf.Tensor`` that
225233
concatenates the constituent ``tf.Tensor``\ s along the batch
226234
dimension. This is commonly used for merging the model predictions
227235
across microbatches.
228236

229237
.. function:: StepOutput.accumulate(method="variable", var=None)
238+
:noindex:
230239

231240
Functionally the same as ``StepOutput.reduce_mean()``. However, it is
232241
more memory-efficient, especially for large numbers of microbatches,
@@ -252,6 +261,7 @@ The following SageMaker distribute model parallel APIs are common across all fra
    ignored.

 .. _mpi_basics:
+   :noindex:

 MPI Basics
 ^^^^^^^^^^
@@ -274,7 +284,8 @@ The library exposes the following basic MPI primitives to its Python API:
 - ``smp.get_dp_group()``: The list of ranks that hold different
   replicas of the same model partition.

-.. _communication_api:
+.. _communication_api:
+   :noindex:

 Communication API
 ^^^^^^^^^^^^^^^^^
@@ -288,6 +299,7 @@ should involve.
 **Helper structures**

 .. data:: smp.CommGroup
+   :noindex:

    An ``enum`` that takes the values
    ``CommGroup.WORLD``, ``CommGroup.MP_GROUP``, and ``CommGroup.DP_GROUP``.
@@ -306,6 +318,7 @@ should involve.
    themselves.

 .. data:: smp.RankType
+   :noindex:

    An ``enum`` that takes the values
    ``RankType.WORLD_RANK``, ``RankType.MP_RANK``, and ``RankType.DP_RANK``.
@@ -321,6 +334,7 @@ should involve.
 **Communication primitives:**

 .. function:: smp.broadcast(obj, group)
+   :noindex:

    Sends the object to all processes in the
    group. The receiving process must call ``smp.recv_from`` to receive the
@@ -353,6 +367,7 @@ should involve.
        smp.recv_from(0, rank_type=smp.RankType.WORLD_RANK)

 .. function:: smp.send(obj, dest_rank, rank_type)
+   :noindex:

    Sends the object ``obj`` to
    ``dest_rank``, which is of a type specified by ``rank_type``.
@@ -376,6 +391,7 @@ should involve.
    ``recv_from`` call.

 .. function:: smp.recv_from(src_rank, rank_type)
+   :noindex:

    Receive an object from a peer process. Can be used with a matching
    ``smp.send`` or a ``smp.broadcast`` call.
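A hedged sketch of a matching ``smp.send``/``smp.recv_from`` pair between world ranks 0 and 1, assuming the library has already been initialized with ``smp.init()`` and that ``smp.rank()`` returns the world rank:

```python
import smdistributed.modelparallel.torch as smp

if smp.rank() == 0:
    # World rank 0 sends an arbitrary picklable object to world rank 1.
    smp.send({"step": 42}, 1, smp.RankType.WORLD_RANK)
elif smp.rank() == 1:
    # World rank 1 blocks until the matching object arrives from rank 0.
    payload = smp.recv_from(0, rank_type=smp.RankType.WORLD_RANK)
```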
@@ -401,6 +417,7 @@ should involve.
    ``broadcast`` call, and the object is received.

 .. function:: smp.allgather(obj, group)
+   :noindex:

    A collective call that gathers all the
    submitted objects across all ranks in the specified ``group``. Returns a
@@ -434,6 +451,7 @@ should involve.
        out = smp.allgather(obj2, smp.CommGroup.MP_GROUP)  # returns [obj1, obj2]

 .. function:: smp.barrier(group=smp.WORLD)
+   :noindex:

    A statement that hangs until all
    processes in the specified group reach the barrier statement, similar to
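A hedged sketch of the barrier call with and without an explicit group, assuming ``smp.init()`` has been called earlier:

```python
import smdistributed.modelparallel.torch as smp

# Block until every process in the world reaches this point (default group).
smp.barrier()

# Block only until the processes in this process's data-parallel group arrive.
smp.barrier(smp.CommGroup.DP_GROUP)
```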
@@ -455,12 +473,14 @@ should involve.
    processes outside that ``mp_group``.

 .. function:: smp.dp_barrier()
+   :noindex:

    Same as passing ``smp.DP_GROUP``\ to ``smp.barrier()``.
    Waits for the processes in the same \ ``dp_group`` as
    the current process to reach the same point in execution.

 .. function:: smp.mp_barrier()
+   :noindex:

    Same as passing ``smp.MP_GROUP`` to
    ``smp.barrier()``. Waits for the processes in the same ``mp_group`` as
