
Commit 0035dff (parent 6adb716)
Author: Ajay P

    Updated release notes and API docs for smd model parallel 1.3.1

2 files changed: +46, -8 lines

doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.md

Lines changed: 30 additions & 0 deletions
@@ -1,3 +1,33 @@
# SageMaker Distributed Model Parallel 1.3.1 Release Notes

- New Features
- Bug Fixes
- Known Issues

## New Features

### TensorFlow

- Exposes a new decorator ``register_post_partition_hook``, which allows invoking the decorated method just after model partition but before executing the first step, for example to load a checkpoint (see the sketch below). Please refer to the [SageMaker distributed model parallel API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smp_versions/latest/smd_model_parallel_tensorflow.html) for more information.
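A minimal usage sketch, assuming ``ckpt_manager`` is an ``smp.CheckpointManager`` created earlier; ``load_checkpoint`` is a hypothetical name:

```python
import smdistributed.modelparallel.tensorflow as smp

@smp.register_post_partition_hook
def load_checkpoint():
    # Runs inside the first smp.step call, right after the model is
    # partitioned but before the first forward pass executes.
    ckpt_manager.restore()  # assumed smp.CheckpointManager from earlier setup
```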
## Bug Fixes

### PyTorch

- Improved memory efficiency when using active microbatches by clearing activations at the end of each microbatch.

### TensorFlow

- Fixed an issue that caused hangs when training some models with XLA enabled.
## Known Issues

### PyTorch

- A crash was observed when ``optimizer.step()`` was called for certain optimizers, such as AdaDelta, when the partition on which this method was called has no local parameters assigned to it after partitioning. This is due to a bug in PyTorch which [has since been fixed](https://github.com/pytorch/pytorch/pull/52944). Until that fix makes its way into the next PyTorch release, call ``optimizer.step()`` only on processes that have at least one local parameter, which you can check with ``len(list(model.local_parameters())) > 0`` (see the first sketch after this list).

- A performance regression still exists when training on SMP with PyTorch 1.7.1 compared to 1.6. The root cause was found to be a slowdown in ``.grad`` method calls in PyTorch 1.7.1 compared to 1.6; see the related discussion: https://github.com/pytorch/pytorch/issues/50636. This issue does not exist with PyTorch 1.8. A rough timing sketch follows the list.
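A minimal sketch of the suggested ``optimizer.step()`` guard; ``guarded_step`` is a hypothetical helper, and ``model`` is assumed to be an ``smp.DistributedModel``:

```python
def guarded_step(model, optimizer):
    # model is assumed to be an smp.DistributedModel, which exposes
    # local_parameters(); only step the optimizer on ranks whose
    # partition received at least one local parameter.
    if len(list(model.local_parameters())) > 0:
        optimizer.step()
```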
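And a rough way to observe the ``.grad`` regression, assuming only stock PyTorch; running this under 1.6 and 1.7.1 and comparing the timings should show the slowdown:

```python
import time
import torch

# Build a small model and populate gradients once.
model = torch.nn.Linear(1024, 1024)
model(torch.randn(64, 1024)).sum().backward()

# Time repeated .grad attribute accesses; their slowdown in PyTorch 1.7.1
# relative to 1.6 is the root cause referenced above.
start = time.perf_counter()
for _ in range(100_000):
    for p in model.parameters():
        _ = p.grad
print(f"PyTorch {torch.__version__}: {time.perf_counter() - start:.3f}s")
```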
# SageMaker Distributed Model Parallel 1.3.0 Release Notes

- New Features

doc/api/training/smp_versions/latest/smd_model_parallel_tensorflow.rst

Lines changed: 16 additions & 8 deletions
@@ -102,13 +102,6 @@ TensorFlow API
                       max_to_keep=None,
                       checkpoint_name="ckpt")
 
-
-**Important:** ``smp.CheckpointManager.restore()`` must be called after
-the first training step. This is because the first call of the
-``smp.step`` function constructs and partitions the model, which must
-take place before the checkpoint restore. Calling it before the first
-``smp.step`` call might result in hangs or unexpected behavior.
-
 **Parameters**
 
 - ``checkpoint``: A `tf.train.Checkpoint
@@ -154,7 +147,22 @@ TensorFlow API
 .. code:: python
 
     for step, inputs in enumerate(train_ds):
-        if step == 1:                    # NOTE: restore occurs on the second step
+        if step == 0:
             ckpt_manager.restore()
         loss = train_step(inputs)
 
+.. function:: register_post_partition_hook(hook)
+
+   Registers a callable ``hook`` to
+   be executed after the model is partitioned. This is useful in situations
+   where an operation needs to be executed after the model partition during
+   the first call to ``smp.step``, but before the actual execution of the
+   first forward pass.
+
+   .. code:: python
+
+      @smp.register_post_partition_hook
+      def test_eager():
+          # All statements here will be executed right after partition but before the first forward pass
+          tf.print("Entered hook through eager context")
+
