
documentation: sagemaker distributed model parallel 1.7.0 doc #2992


Merged
merged 21 commits on Mar 18, 2022

Changes from all commits
21 commits
f166b60
change: update code to get commit_id in codepipeline (#2961)
navinsoni Feb 26, 2022
086258d
feature: Data Serializer (#2956)
jeniyat Feb 28, 2022
a39b750
change: reorganize test files for workflow (#2960)
qidewenwhen Mar 3, 2022
28fd737
feature: TensorFlow 2.4 for Neo (#2861)
Qingzi-Lan Mar 3, 2022
20df3d7
fix: Remove sagemaker_job_name from hyperparameters in TrainingStep (…
staubhp Mar 3, 2022
b9f90dc
fix: Style update in DataSerializer (#2962)
jeniyat Mar 3, 2022
6db3774
documentation: smddp doc update (#2968)
mchoi8739 Mar 4, 2022
d610bfb
fix: container env generation for S3 URI and add test for the same (#…
shreyapandit Mar 7, 2022
169dffd
documentation: update sagemaker training compiler docstring (#2969)
mchoi8739 Mar 7, 2022
4325fcd
feat: Python 3.9 for readthedocs (#2973)
ahsan-z-khan Mar 8, 2022
92d0627
Merge branch 'master' of https://github.com/aws/sagemaker-python-sdk …
mchoi8739 Mar 12, 2022
4cb56d6
fix doc structure
mchoi8739 Mar 12, 2022
d5cad97
archive 1.6.0 doc
mchoi8739 Mar 12, 2022
704982a
add new args, refs, and links
mchoi8739 Mar 12, 2022
920ec06
fix version number
mchoi8739 Mar 12, 2022
efc0e48
incorp eng feedback, update docstrings, improve xref
mchoi8739 Mar 15, 2022
f290a34
Trigger Build
mchoi8739 Mar 15, 2022
634eea5
Merge branch 'master' of https://github.com/aws/sagemaker-python-sdk …
mchoi8739 Mar 15, 2022
feb88ea
minor fix, trigger build again
mchoi8739 Mar 15, 2022
767ea87
fix typo
mchoi8739 Mar 15, 2022
8e99b0e
Merge branch 'master' into smdmp-1.7.0-doc
mufaddal-rohawala Mar 18, 2022
3 changes: 3 additions & 0 deletions doc/api/training/distributed.rst
@@ -26,3 +26,6 @@ The SageMaker Distributed Model Parallel Library
:maxdepth: 3

smd_model_parallel
smp_versions/latest
smd_model_parallel_general
smd_model_parallel_release_notes/smd_model_parallel_change_log
20 changes: 0 additions & 20 deletions doc/api/training/smd_model_parallel.rst
@@ -9,15 +9,6 @@ allowing you to increase prediction accuracy by creating larger models with more
You can use the library to automatically partition your existing TensorFlow and PyTorch workloads
across multiple GPUs with minimal code changes. The library's API can be accessed through the Amazon SageMaker SDK.

See the following sections to learn more about the SageMaker model parallel library APIs.

.. toctree::
:maxdepth: 3

smp_versions/latest
smd_model_parallel_general


.. tip::

We recommend using this API documentation with the conceptual guide at
@@ -48,14 +39,3 @@ See the following sections to learn more about the SageMaker model parallel libr
`Extend or Adapt A Docker Container that Contains the Model Parallel Library
<https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-use-api.html#model-parallel-customize-container>`__
for more information.

Release Notes
=============

New features, bug fixes, and improvements are regularly made to the SageMaker
distributed model parallel library.

.. toctree::
:maxdepth: 1

smd_model_parallel_release_notes/smd_model_parallel_change_log
@@ -1,6 +1,62 @@
Sagemaker Distributed Model Parallel 1.6.0 Release Notes
#############
Release Notes
#############

New features, bug fixes, and improvements are regularly made to the SageMaker
distributed model parallel library.

SageMaker Distributed Model Parallel 1.7.0 Release Notes
========================================================

*Date: March 7, 2022*

**Currency Updates**

* Support for PyTorch 1.10.2
* Support for Hugging Face Transformers 4.16.2

**Improvements**

* Additional support for the :ref:`smdmp-pytorch-tensor-parallel`.

* Added support for FP32 residual addition to avoid overflow (NaN loss values)
for large models with more than 100 billion parameters when using FP16.
This is integrated into the following module:

* :class:`smp.nn.DistributedTransformerOutputLayer`


* Added support for the following two `NVIDIA Megatron fused kernels
<https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/fused_kernels>`_:

* Fusion of attention masking and softmax (``fused_softmax``)
* Fusion of bias addition and Gelu activation (``fused_bias_gelu``)

To learn more about these options and how to use them,
see the :class:`smp.tensor_parallelism` context manager.
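
For a quick orientation, the following is a minimal sketch of where these
options are set. The surrounding training script, the model parallel launch
configuration, and the modules constructed inside the context are assumptions
for illustration, not part of this release note.

.. code:: python

    # Sketch only: the transformer modules to be distributed would be
    # constructed inside the context; only the new v1.7.0 options are shown.
    import smdistributed.modelparallel.torch as smp

    smp.init()  # assumes the job was launched with a model-parallel config

    with smp.tensor_parallelism(
        enabled=True,
        fused_softmax=True,    # fusion of attention masking and softmax
        fused_bias_gelu=True,  # fusion of bias addition and Gelu activation
    ):
        ...  # construct the transformer modules to be distributed here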



**Migration to AWS Deep Learning Containers**

This version passed benchmark testing and has been migrated to the following AWS Deep Learning Container:


* PyTorch 1.10.2

.. code::

763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.10.2-gpu-py38-cu113-ubuntu20.04-sagemaker
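
For orientation, a minimal sketch of launching a training job that uses this
container through the SageMaker Python SDK follows. The entry point, role,
instance settings, and library parameters are illustrative assumptions, and
``framework_version``/``py_version`` are only intended to resolve to the
PyTorch 1.10.2 container family above.

.. code:: python

    # Sketch only: values below are placeholders, not recommendations.
    import sagemaker
    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(
        entry_point="train.py",                # hypothetical training script
        role=sagemaker.get_execution_role(),
        framework_version="1.10.2",
        py_version="py38",
        instance_type="ml.p4d.24xlarge",
        instance_count=1,
        distribution={
            "smdistributed": {
                "modelparallel": {
                    "enabled": True,
                    "parameters": {"partitions": 2, "ddp": True},
                }
            },
            "mpi": {"enabled": True, "processes_per_host": 8},
        },
    )
    estimator.fit()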


----

Release History
===============

SageMaker Distributed Model Parallel 1.6.0 Release Notes
--------------------------------------------------------

*Date: December 20, 2021*

**New Features**
@@ -9,10 +65,10 @@ Sagemaker Distributed Model Parallel 1.6.0 Release Notes

- Added extended memory-saving features for PyTorch 1.8.1:

- Tensor parallelism
- Optimizer state sharding
- Activation checkpointing
- Activation offloading
- `Tensor parallelism <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-tensor-parallelism.html>`_
- `Optimizer state sharding <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-optimizer-state-sharding.html>`_
- `Activation checkpointing <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-activation-checkpointing.html>`_
- `Activation offloading <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-activation-offloading.html>`_

For more information, see the following documentation:

@@ -30,12 +86,9 @@ AWS Deep Learning Container(s):

763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.8.1-gpu-py36-cu111-ubuntu18.04

----

Release History
===============

Sagemaker Distributed Model Parallel 1.5.0 Release Notes
SageMaker Distributed Model Parallel 1.5.0 Release Notes
--------------------------------------------------------

*Date: November 3, 2021*
@@ -59,7 +112,7 @@ AWS Deep Learning Containers:

----

Sagemaker Distributed Model Parallel 1.4.0 Release Notes
SageMaker Distributed Model Parallel 1.4.0 Release Notes
--------------------------------------------------------

*Date: June 29, 2021*
@@ -90,7 +143,7 @@ AWS Deep Learning Containers:

----

Sagemaker Distributed Model Parallel 1.3.1 Release Notes
SageMaker Distributed Model Parallel 1.3.1 Release Notes
--------------------------------------------------------

- New Features
@@ -143,7 +196,7 @@ Sagemaker Distributed Model Parallel 1.3.1 Release Notes

----

Sagemaker Distributed Model Parallel 1.3.0 Release Notes
SageMaker Distributed Model Parallel 1.3.0 Release Notes
--------------------------------------------------------

- New Features
@@ -235,7 +288,7 @@ Sagemaker Distributed Model Parallel 1.3.0 Release Notes

----

Sagemaker Distributed Model Parallel 1.2.0 Release Notes
SageMaker Distributed Model Parallel 1.2.0 Release Notes
--------------------------------------------------------

- New Features
@@ -312,7 +365,7 @@ Sagemaker Distributed Model Parallel 1.2.0 Release Notes

----

Sagemaker Distributed Model Parallel 1.1.0 Release Notes
SageMaker Distributed Model Parallel 1.1.0 Release Notes
--------------------------------------------------------

- New Features
1 change: 1 addition & 0 deletions doc/api/training/smp_versions/archives.rst
@@ -3,6 +3,7 @@
.. toctree::
:maxdepth: 1

v1_6_0.rst
v1_5_0.rst
v1_4_0.rst
v1_3_0.rst
2 changes: 1 addition & 1 deletion doc/api/training/smp_versions/latest.rst
@@ -10,7 +10,7 @@ depending on which version of the library you need to use.
To use the library, reference the
**Common API** documentation alongside the framework specific API documentation.

Version 1.6.0 (Latest)
Version 1.7.0 (Latest)
======================

To use the library, reference the Common API documentation alongside the framework specific API documentation.
@@ -16,7 +16,7 @@ you need to add the following import statement at the top of your training scrip
<https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-customize-training-script-pt.html>`_
to learn how to use the following API in your PyTorch training script.

.. py:class:: smp.DistributedModel()
.. class:: smp.DistributedModel

A sub-class of ``torch.nn.Module`` which specifies the model to be
partitioned. Accepts a ``torch.nn.Module`` object ``module`` which is
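
As a quick reference, a minimal sketch of the import statement mentioned above
and of wrapping a model follows; ``Net`` is a hypothetical ``torch.nn.Module``
used only for illustration.

.. code:: python

    # Sketch only: a bare-bones wrapping pattern for the model parallel library.
    import torch.nn as nn
    import smdistributed.modelparallel.torch as smp  # the import referenced above


    class Net(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc = nn.Linear(20, 10)

        def forward(self, x):
            return self.fc(x)


    smp.init()                            # initialize the library
    model = smp.DistributedModel(Net())   # the wrapped module gets partitioned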
@@ -205,7 +205,7 @@ Tensor Parallelism Module APIs
if ``add_lm_head`` is ``True``, the output passes through a single
LM head, which is a linear module without bias whose weight is
tied to the word embeddings.
- See ``DistributedTransformerLayer`` for a description of the rest
- See :class:`smp.nn.DistributedTransformerLayer` for descriptions of the rest
of the arguments.
- **Methods:**

@@ -344,11 +344,11 @@ Tensor Parallelism Module APIs
followed by the residual-connection and layer normalization.
- **Arguments:**

- See ``DistributedTransformerLayer`` for a description of the
- See :class:`smp.nn.DistributedTransformerLayer` for descriptions of the
arguments.
- If ``cross_attention`` is ``True``, computes the attentions
- ``cross_attention``: If ``True``, it computes the attentions
with respect to the ``cross_states`` tensor of the ``forward``
method input tuple.
method input tuple. (Default: ``False``)

- **Methods:**

@@ -363,7 +363,7 @@ Tensor Parallelism Module APIs
``[N, S, H]``, where ``N`` is batch size, ``S`` is
sequence length, and ``H`` is ``hidden_size``.
``attention_mask`` is assumed to be a tensor of
dimensions ``[N, 1, 1, S]``, \***\* where ``N`` is the
dimensions ``[N, 1, 1, S]``, where ``N`` is the
batch size, and ``S`` is the sequence length.
- If ``cross_attention=True``, ``inputs`` must be a tuple
``(hidden_states, cross_states, attention_mask)``, where
@@ -383,27 +383,30 @@ Tensor Parallelism Module APIs
- A single tensor that is the output of the attention
layer.

.. class:: smp.nn.DistributedTransformerOutputLayer(hidden_size=1024, intermediate_size=4096, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, pre_layernorm=False, post_layernorm=True)
.. class:: smp.nn.DistributedTransformerOutputLayer(hidden_size=1024, intermediate_size=4096, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, pre_layernorm=False, post_layernorm=True, fp32_residual_addition=False)

- Distributed implementation of a single transformer output layer. A
single ``DistributedTransformerLayer`` with
single :class:`smp.nn.DistributedTransformerLayer` with
``add_cross_attention=False`` consists of a single
``DistributedAttentionLayer`` immediately followed by a single
``DistributedTransformerOutputLayer``. The latter linearly maps
the last channel of the input tensor from ``hidden_size`` to
``intermediate_size``, and then maps it back to ``hidden_size``.
- **Arguments:**

- See ``DistributedTransformerLayer`` for a description of the
- See :class:`smp.nn.DistributedTransformerLayer` for descriptions of the
arguments.
- ``fp32_residual_addition``: Set to ``True`` if you want to avoid overflow
(NaN loss values) for large models with more than 100 billion parameters
when using FP16. (Default: ``False``)

.. class:: smp.nn.DistributedEmbedding(num_embeddings, embedding_dim, padding_idx=None, max_norm=None, norm_type=2.0, scale_grad_by_freq=False, sparse=False, _weight=None, initializer_range=0.02, _skip_allgather=False, _skip_scatter_and_merge=False)

- Distributed implementation of a single Embedding Layer. Currently
only supports splitting across the embedding_dim.
- **Arguments:**

- See ``DistributedEmbedding`` for a description of the
- See :class:`smp.nn.DistributedEmbedding` for descriptions of the
arguments.
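
For illustration, a minimal sketch of constructing these modules directly with
the documented signatures; the sizes (and the GPT-2-style vocabulary size) are
arbitrary assumptions.

.. code:: python

    # Sketch only: argument names follow the class signatures documented above.
    import smdistributed.modelparallel.torch as smp

    smp.init()  # assumes a job launched with a tensor-parallel configuration

    output_layer = smp.nn.DistributedTransformerOutputLayer(
        hidden_size=1024,
        intermediate_size=4096,
        fp32_residual_addition=True,  # new in v1.7.0
    )

    embedding = smp.nn.DistributedEmbedding(
        num_embeddings=50257,  # illustrative vocabulary size
        embedding_dim=1024,
    )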

.. _enabling-tp:
@@ -447,7 +450,7 @@ following API:

- A context manager that enables or disables tensor parallelism for
any supported module that is created inside. If there are nested
contexts, the innermost will override the rest. If there are
contexts, the innermost overrides the rest. If there are
multiple supported modules created within the context, where one
is the submodule of the other, only the outermost module will be
distributed. If a supported module shares weights with another
@@ -465,7 +468,25 @@ following API:
with smp.tensor_parallelism(enabled=False):
self.m1 = nn.Linear(20, 20) # will not be distributed

- Keyword arguments `kwargs` can be used to modify the configurations of the distributed modules created inside the context. If a keyword argument provided here matches any `__init__` method arguments of a `DistributedModule` that substitutes a module created inside the `smp.tensor_parallelism` context, this keyword will override the value defined in the `init_hook`.
- ``kwargs`` - Keyword arguments that can be used to modify the configurations of
the distributed modules created inside the context.
If a keyword argument passed this way matches any ``__init__`` method argument
of a ``DistributedModule`` that substitutes a module created inside
the ``smp.tensor_parallelism`` context, it overrides
the value defined in the ``init_hook`` (see the sketch after the following list).

- (*For v1.7.0 and later*) Through the following additional keyword arguments,
the library supports `NVIDIA Megatron’s fused kernels
<https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/fused_kernels>`_

- ``fused_softmax`` (bool) - Fusion of attention masking and softmax.
By default, it is set to ``True``. You can deactivate it by setting
``fused_softmax=False`` in the ``smp.tensor_parallelism`` context manager.
- ``fused_bias_gelu`` (bool) - Fusion of bias addition and Gelu activation.
By default, it is set to ``False``. You can activate it by setting
``fused_bias_gelu=True`` in the ``smp.tensor_parallelism`` context manager.
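
As an illustration of the keyword-argument override described above, the
following sketch passes ``initializer_range`` (an ``__init__`` argument of
several distributed module classes documented above) through the context
manager; the value and the elided module construction are assumptions.

.. code:: python

    # Sketch only: any matching __init__ argument of a substituted
    # DistributedModule created inside the context would be overridden.
    import smdistributed.modelparallel.torch as smp

    smp.init()

    with smp.tensor_parallelism(enabled=True, initializer_range=0.01):
        ...  # construct the modules to be distributed here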



.. function:: smp.set_tensor_parallelism(module, enabled=True, **kwargs)
