@@ -191,7 +191,7 @@ Tensor Parallelism Module APIs
- ``out_features``: The total number of output channels for the
linear layer across all tensor-parallel ranks.

- .. class:: smp.nn.DistributedTransformerLMHead(num_layers=12, num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, vocab_size=30522, num_positions=1024, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, num_token_types=0, causal_mask_size=None, add_cross_attention=False, add_lm_head=True, initializer_range=0.02, use_normal_initialization=False, pre_layernorm=False, post_layernorm=True, query_key_layer_scaling=False)
+ .. class:: smp.nn.DistributedTransformerLMHead(num_layers=12, num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, vocab_size=30522, num_positions=1024, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, num_token_types=0, causal_mask_size=None, add_cross_attention=False, add_lm_head=True, initializer_range=0.02, use_normal_initialization=False, pre_layernorm=False, post_layernorm=True)

- Constructs a distributed transformer model, including embeddings
and a single LM head. A word embedding of size
@@ -205,7 +205,7 @@ Tensor Parallelism Module APIs
if ``add_lm_head`` is ``True``, the output passes through a single
LM head, which is a linear module without bias whose weight is
tied to the word embeddings.
- - See ``DistributedTransformerLayer`` for a description of the rest
+ - See :class:`smp.nn.DistributedTransformerLayer` for descriptions of the rest
of the arguments.
- **Methods:**

@@ -223,7 +223,7 @@ Tensor Parallelism Module APIs
- ``attention_mask`` is assumed to be a 0-1 tensor of shape
``[N, S]``, where 1 represents a masked position.

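For illustration only (not part of the diff), a minimal construction sketch for the class above, assuming ``smdistributed.modelparallel.torch`` is installed and ``smp.init()`` was called with a tensor-parallel configuration; all keyword arguments are taken from the documented signature, and the values are placeholders:

.. code:: python

    import smdistributed.modelparallel.torch as smp

    # Hedged sketch: direct construction of the distributed transformer with an
    # LM head. With add_lm_head=True, the LM head weight is tied to the word
    # embeddings, as described above.
    lm_model = smp.nn.DistributedTransformerLMHead(
        num_layers=12,
        num_attention_heads=16,
        attention_head_size=64,
        hidden_size=1024,
        intermediate_size=4096,
        vocab_size=30522,
        num_positions=1024,
        causal_mask_size=1024,   # causal attention mask for autoregressive modeling
        add_lm_head=True,
    )
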
- .. class:: smp.nn.DistributedTransformer(num_layers=12, num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, causal_mask_size=None, add_cross_attention=False, pre_layernorm=False, post_layernorm=True, query_key_layer_scaling=False)
+ .. class:: smp.nn.DistributedTransformer(num_layers=12, num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, causal_mask_size=None, add_cross_attention=False, pre_layernorm=False, post_layernorm=True)

- A sequence of ``smp.nn.DistributedTransformerLayer``\ s, whose
number is given by ``num_layers`` argument. For the other
@@ -234,7 +234,7 @@ Tensor Parallelism Module APIs
the ``DistributedTransformer``, in addition to the intermediate
attention and transformer-output layers.

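A hedged sketch (not part of the diff) of stacking layers with the class above; the values are placeholders and the remaining keyword arguments keep their documented defaults:

.. code:: python

    import smdistributed.modelparallel.torch as smp

    # Hedged sketch: a stack of 24 DistributedTransformerLayer modules with
    # pre-layer-normalization, as used by GPT-style models.
    transformer = smp.nn.DistributedTransformer(
        num_layers=24,
        num_attention_heads=16,
        attention_head_size=64,
        hidden_size=1024,
        intermediate_size=4096,
        pre_layernorm=True,
        post_layernorm=False,
    )
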
- .. class:: smp.nn.DistributedTransformerLayer(num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, causal_mask_size=None, add_cross_attention=False, pre_layernorm=False, post_layernorm=True, query_key_layer_scaling=False)
+ .. class:: smp.nn.DistributedTransformerLayer(num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, causal_mask_size=None, add_cross_attention=False, pre_layernorm=False, post_layernorm=True)

- Tensor-parallel implementation of a single transformer layer.
Number of attention heads, hidden size, and intermediate size
@@ -336,15 +336,15 @@ Tensor Parallelism Module APIs
and the next three tensors are the same as the input
arguments.

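To make the return convention described above concrete, here is a hedged sketch (not part of the diff) of a single layer constructed with cross-attention enabled; the argument values are placeholders taken only from the documented signature:

.. code:: python

    import smdistributed.modelparallel.torch as smp

    # Hedged sketch: a single tensor-parallel layer that also attends over
    # cross_states (add_cross_attention=True), e.g. a decoder-style layer.
    decoder_layer = smp.nn.DistributedTransformerLayer(
        num_attention_heads=16,
        attention_head_size=64,
        hidden_size=1024,
        intermediate_size=4096,
        add_cross_attention=True,
        causal_mask_size=1024,
    )
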
- .. class:: smp.nn.DistributedAttentionLayer(num_attention_heads=32, attention_head_size=32, hidden_size=1024, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, cross_attention=False, causal_mask_size=None, pre_layernorm=False, post_layernorm=True, query_key_layer_scaling=False)
+ .. class:: smp.nn.DistributedAttentionLayer(num_attention_heads=32, attention_head_size=32, hidden_size=1024, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, cross_attention=False, causal_mask_size=None, pre_layernorm=False, post_layernorm=True)

- A distributed implementation for the attention block. Includes the
computation of the self- or cross-attention (context layer),
followed by a linear mapping and dropout, which is optionally
followed by the residual-connection and layer normalization.
- **Arguments:**

- - See ``DistributedTransformerLayer`` for a description of the
+ - See :class:`smp.nn.DistributedTransformerLayer` for descriptions of the
arguments.
- ``cross_attention``: If ``True``, it computes the attentions
with respect to the ``cross_states`` tensor of the ``forward``
@@ -386,26 +386,27 @@ Tensor Parallelism Module APIs
.. class:: smp.nn.DistributedTransformerOutputLayer(hidden_size=1024, intermediate_size=4096, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, pre_layernorm=False, post_layernorm=True, fp32_residual_addition=False)

- Distributed implementation of a single transformer output layer. A
- single ``DistributedTransformerLayer`` with
+ single :class:`smp.nn.DistributedTransformerLayer` with
``add_cross_attention=False`` consists of a single
``DistributedAttentionLayer`` immediately followed by a single
``DistributedTransformerOutputLayer``. The latter linearly maps
the last channel of the input tensor from ``hidden_size`` to
``intermediate_size``, and then maps it back to ``hidden_size``.
- **Arguments:**

- - See ``DistributedTransformerLayer`` for a description of the
+ - See :class:`smp.nn.DistributedTransformerLayer` for descriptions of the
arguments.
- ``fp32_residual_addition``: Set to ``True`` if you want to avoid overflow
- (NaN loss values) for large models when using FP16. (Default: False)
+ (NaN loss values) for large models with more than 100 billion parameters
+ when using FP16. (Default: False)

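A hedged sketch (not part of the diff) of the composition just described: an attention block followed by an output (MLP) block, which together mirror a single ``DistributedTransformerLayer`` with ``add_cross_attention=False``; the argument values are placeholders:

.. code:: python

    import smdistributed.modelparallel.torch as smp

    # Hedged sketch: the two sub-blocks of a transformer layer.
    attention = smp.nn.DistributedAttentionLayer(
        num_attention_heads=16,
        attention_head_size=64,
        hidden_size=1024,
        cross_attention=False,        # self-attention only
    )
    output_layer = smp.nn.DistributedTransformerOutputLayer(
        hidden_size=1024,             # maps 1024 -> 4096 -> 1024 in the last channel
        intermediate_size=4096,
        fp32_residual_addition=True,  # FP32 residual addition to avoid FP16 overflow
    )
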
.. class:: smp.nn.DistributedEmbedding(num_embeddings, embedding_dim, padding_idx=None, max_norm=None, norm_type=2.0, scale_grad_by_freq=False, sparse=False, _weight=None, initializer_range=0.02, _skip_allgather=False, _skip_scatter_and_merge=False)

- Distributed implementation of a single Embedding Layer. Currently
only supports splitting across the ``embedding_dim``.
- **Arguments:**

- - See ``DistributedEmbedding`` for a description of the
+ - See :class:`smp.nn.DistributedEmbedding` for descriptions of the
arguments.

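A hedged construction sketch for the class above (not part of the diff); the sizes are placeholders:

.. code:: python

    import smdistributed.modelparallel.torch as smp

    # Hedged sketch: an embedding table whose embedding_dim is split across
    # the tensor-parallel ranks.
    embedding = smp.nn.DistributedEmbedding(
        num_embeddings=30522,   # e.g. vocabulary size
        embedding_dim=1024,
    )
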
.. _enabling-tp:
@@ -449,7 +450,7 @@ following API:

- A context manager that enables or disables tensor parallelism for
any supported module that is created inside. If there are nested
- contexts, the innermost will override the rest. If there are
+ contexts, the innermost overrides the rest. If there are
multiple supported modules created within the context, where one
is the submodule of the other, only the outermost module will be
distributed. If a supported module shares weights with another
@@ -467,7 +468,25 @@ following API:
with smp.tensor_parallelism(enabled=False):
    self.m1 = nn.Linear(20, 20)  # will not be distributed

- - Keyword arguments `kwargs` can be used to modify the configurations of the distributed modules created inside the context. If a keyword argument provided here matches any `__init__` method arguments of a `DistributedModule` that substitutes a module created inside the `smp.tensor_parallelism` context, this keyword will override the value defined in the `init_hook`.
+ - ``kwargs`` - Keyword arguments that can be used to modify the configurations of
+ the distributed modules created inside the context.
+ If a keyword argument provided through it matches any ``__init__`` method arguments
+ of a ``DistributedModule`` that substitutes a module created inside
+ the ``smp.tensor_parallelism`` context, this keyword will override
+ the value defined in the ``init_hook``.
+
+ - (*For v1.7.0 and later*) Through the following additional keyword arguments,
+ the library supports `NVIDIA Megatron’s fused kernels
+ <https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/fused_kernels>`_
+
+ - ``fused_softmax`` (bool) - Fusion of attention masking and softmax.
+ By default, it is set to ``True``. You can deactivate it by setting
+ ``fused_softmax=False`` in the ``smp.tensor_parallelism`` context manager.
+ - ``fused_bias_gelu`` (bool) - Fusion of bias addition and Gelu activation.
+ By default, it is set to ``False``. You can activate it by setting
+ ``fused_bias_gelu=True`` in the ``smp.tensor_parallelism`` context manager.

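A hedged sketch (not part of the diff) showing how these keyword arguments can be passed through the context manager, following the ``nn.Linear`` example earlier on this page; the module and class names are placeholders:

.. code:: python

    import torch.nn as nn
    import smdistributed.modelparallel.torch as smp

    class Model(nn.Module):
        def __init__(self):
            super().__init__()
            # Hedged sketch: keyword arguments given here are forwarded to the
            # __init__ of the DistributedModule that substitutes each supported
            # module created inside the context (fused-kernel switches need v1.7.0+).
            with smp.tensor_parallelism(enabled=True, fused_softmax=False, fused_bias_gelu=True):
                self.m1 = nn.Linear(20, 20)  # will be distributed
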
.. function:: smp.set_tensor_parallelism(module, enabled=True, **kwargs)