Commit efc0e48 (1 parent: 920ec06)

incorp eng feedback, update docstrings, improve xref

2 files changed: +43 −31 lines

doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.rst

Lines changed: 12 additions & 19 deletions

@@ -17,34 +17,27 @@ Sagemaker Distributed Model Parallel 1.7.0 Release Notes
 * Support for PyTorch 1.10.2
 * Support for Hugging Face Transformers 4.16.2

-**New Features**
-
-Additional tensor parallelism features for PyTorch:
+**Improvements**

-* Support for query key layer scaling to avoid overflow for large model
+* Additional support for the :ref:`smdmp-pytorch-tensor-parallel`.

-  * This feature is integrated to the following modules:
+  * Added support for FP32 residual addition to avoid overflow (NaN loss values)
+    for large models with more than 100 billion parameters when using FP16.
+    This is integrated to the following module:

-    * :class:`smp.nn.DistributedTransformerLMHead`
-    * :class:`smp.nn.DistributedTransformer`
-    * :class:`smp.nn.DistributedTransformerLayer`
-    * :class:`smp.nn.DistributedAttentionLayer`
+    * :class:`smp.nn.DistributedTransformerOutputLayer`

-* Support for FP32 residual addition to avoid overflow (NaN loss values)
-  for large models when using FP16

-  * This feature is integrated to the following module:
+  * Added support for the following two `NVIDIA Megatron fused kernels
+    <https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/fused_kernels>`_:

-    * :class:`smp.nn.DistributedTransformerOutputLayer`
+    * Fusion of attention masking and softmax (``fused_softmax``)
+    * Fusion of bias addition and Gelu activation (``fused_bias_gelu``)

-**Improvements**
+    To learn more about these options and how to use them,
+    see the :class:`smp.tensor_parallelism` context manager.

-* Added support for a custom CUDA kernel for softmax to improve throughput
-* Added support for the following `NVIDIA Megatron’s fused kernels
-  <https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/fused_kernels>`_:

-  * Fusion of attention masking and softmax
-  * Fusion of bias addition and Gelu activation


 **Migration to AWS Deep Learning Containers**
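
A minimal sketch of the FP32 residual addition described in the release note above,
which surfaces as the ``fp32_residual_addition`` argument of
:class:`smp.nn.DistributedTransformerOutputLayer` (documented in the second file of
this commit). It assumes the library's conventional
``import smdistributed.modelparallel.torch as smp`` entry point; the layer sizes are
arbitrary placeholders, and the surrounding ``smp.init()`` and training setup are
omitted::

    import smdistributed.modelparallel.torch as smp  # assumed conventional import

    # Keep the residual addition in FP32 to avoid overflow (NaN loss values)
    # when training very large (100B+ parameter) models with FP16.
    output_layer = smp.nn.DistributedTransformerOutputLayer(
        hidden_size=1024,
        intermediate_size=4096,
        activation="gelu",
        fp32_residual_addition=True,
    )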

doc/api/training/smp_versions/latest/smd_model_parallel_pytorch_tensor_parallel.rst

Lines changed: 31 additions & 12 deletions

@@ -191,7 +191,7 @@ Tensor Parallelism Module APIs
   - ``out_features``: The total number of output channels for the
     linear layer across all tensor-parallel ranks.

-.. class:: smp.nn.DistributedTransformerLMHead(num_layers=12, num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, vocab_size=30522, num_positions=1024, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, num_token_types=0, causal_mask_size=None, add_cross_attention=False, add_lm_head=True, initializer_range=0.02, use_normal_initialization=False, pre_layernorm=False, post_layernorm=True, query_key_layer_scaling=False)
+.. class:: smp.nn.DistributedTransformerLMHead(num_layers=12, num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, vocab_size=30522, num_positions=1024, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, num_token_types=0, causal_mask_size=None, add_cross_attention=False, add_lm_head=True, initializer_range=0.02, use_normal_initialization=False, pre_layernorm=False, post_layernorm=True)

   - Constructs a distributed transformer model, including embeddings
     and a single LM head. A word embedding of size

@@ -205,7 +205,7 @@ Tensor Parallelism Module APIs
     if ``add_lm_head`` is ``True``, the output passes through a single
     LM head, which is a linear module without bias whose weight is
     tied to the word embeddings.
-  - See ``DistributedTransformerLayer`` for a description of the rest
+  - See :class:`smp.nn.DistributedTransformerLayer` for descriptions of the rest
     of the arguments.
   - **Methods:**

@@ -223,7 +223,7 @@ Tensor Parallelism Module APIs
       - ``attention_mask`` is assumed to be a 0-1 tensor of shape
        ``[N, S]``, where 1 represents a masked position.

-.. class:: smp.nn.DistributedTransformer(num_layers=12, num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, causal_mask_size=None, add_cross_attention=False, pre_layernorm=False, post_layernorm=True, query_key_layer_scaling=False)
+.. class:: smp.nn.DistributedTransformer(num_layers=12, num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, causal_mask_size=None, add_cross_attention=False, pre_layernorm=False, post_layernorm=True)

   - A sequence of ``smp.nn.DistributedTransformerLayer``\ s, whose
     number is given by ``num_layers`` argument. For the other

@@ -234,7 +234,7 @@ Tensor Parallelism Module APIs
     the ``DistributedTransformer``, in addition to the intermediate
     attention and transformer-output layers.

-.. class:: smp.nn.DistributedTransformerLayer(num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, causal_mask_size=None, add_cross_attention=False, pre_layernorm=False, post_layernorm=True, query_key_layer_scaling=False)
+.. class:: smp.nn.DistributedTransformerLayer(num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, causal_mask_size=None, add_cross_attention=False, pre_layernorm=False, post_layernorm=True)

   - Tensor-parallel implementation of a single transformer layer.
     Number of attention heads, hidden size, and intermediate size

@@ -336,15 +336,15 @@ Tensor Parallelism Module APIs
        and the next three tensors are the same as the input
        arguments.

-.. class:: smp.nn.DistributedAttentionLayer(num_attention_heads=32, attention_head_size=32, hidden_size=1024, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, cross_attention=False, causal_mask_size=None, pre_layernorm=False, post_layernorm=True, query_key_layer_scaling=False)
+.. class:: smp.nn.DistributedAttentionLayer(num_attention_heads=32, attention_head_size=32, hidden_size=1024, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, cross_attention=False, causal_mask_size=None, pre_layernorm=False, post_layernorm=True)

   - A distributed implementation for the attention block. Includes the
     computation of the self- or cross-attention (context layer),
     followed by a linear mapping and dropout, which is optionally
     followed by the residual-connection and layer normalization.
   - **Arguments:**

-    - See ``DistributedTransformerLayer`` for a description of the
+    - See :class:`smp.nn.DistributedTransformerLayer` for descriptions of the
       arguments.
     - ``cross_attention``: If ``True``, it computes the attentions
       with respect to the ``cross_states`` tensor of the ``forward``

@@ -386,26 +386,27 @@ Tensor Parallelism Module APIs
 .. class:: smp.nn.DistributedTransformerOutputLayer(hidden_size=1024, intermediate_size=4096, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, pre_layernorm=False, post_layernorm=True, fp32_residual_addition=False)

   - Distributed implementation of a single transformer output layer. A
-    single ``DistributedTransformerLayer`` with
+    single :class:`smp.nn.DistributedTransformerLayer` with
     ``add_cross_attention=False`` consists of a single
     ``DistributedAttentionLayer`` immediately followed by a single
     ``DistributedTransformerOutputLayer``. The latter linearly maps
     the last channel of the input tensor from ``hidden_size`` to
     ``intermediate_size``, and then maps it back to ``hidden_size``.
   - **Arguments:**

-    - See ``DistributedTransformerLayer`` for a description of the
+    - See :class:`smp.nn.DistributedTransformerLayer` for descriptions of the
       arguments.
     - ``fp32_residual_addition``: Set to ``True`` if you want to avoid overflow
-      (NaN loss values) for large models when using FP16. (Default: False)
+      (NaN loss values) for large models with more than 100 billion parameters
+      when using FP16. (Default: False)

 .. class:: smp.nn.DistributedEmbedding(num_embeddings,embedding_dim, padding_idx=None, max_norm=None, norm_type=2.0, scale_grad_by_freq=False, sparse=False, _weight=None, initializer_range=0.02, _skip_allgather=False,_skip_scatter_and_merge=False,)

   - Distributed implementation of a single Embedding Layer. Currently
     only supports splitting across the embedding_dim.
   - **Arguments:**

-    - See ``DistributedEmbedding`` for a description of the
+    - See :class:`smp.nn.DistributedEmbedding` for descriptions of the
       arguments.

 .. _enabling-tp:

@@ -449,7 +450,7 @@ following API:

   - A context manager that enables or disables tensor parallelism for
     any supported module that is created inside. If there are nested
-    contexts, the innermost will override the rest. If there are
+    contexts, the innermost overrides the rest. If there are
     multiple supported modules created within the context, where one
     is the submodule of the other, only the outermost module will be
     distributed. If a supported module shares weights with another

@@ -467,7 +468,25 @@ following API:
        with smp.tensor_parallelism(enabled=False):
            self.m1 = nn.Linear(20, 20) # will not be distributed

-  - Keyword arguments `kwargs` can be used to modify the configurations of the distributed modules created inside the context. If a keyword argument provided here matches any `__init__` method arguments of a `DistributedModule` that substitutes a module created inside the `smp.tensor_parallelism` context, this keyword will override the value defined in the `init_hook`.
+  - ``kwargs`` - Keyword arguments that can be used to modify the configurations of
+    the distributed modules created inside the context.
+    If a keyword argument provided through it matches any ``__init__`` method arguments
+    of a ``DistributedModule`` that substitutes a module created inside
+    the ``smp.tensor_parallelism`` context, this keyword will override
+    the value defined in the ``init_hook``.
+
+  - (*For v1.7.0 and later*) Through the following additional keyword arguments,
+    the library supports `NVIDIA Megatron’s fused kernels
+    <https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/fused_kernels>`_
+
+    - ``fused_softmax`` (bool) - Fusion of attention masking and softmax.
+      By default, it is set to ``True``. You can deactivate it by setting
+      ``fused_softmax=False`` in the ``smp.tensor_parallelism`` context manager.
+    - ``fused_bias_gelu`` (bool) - Fusion of bias addition and Gelu activation.
+      By default, it is set to ``False``. You can activate it by setting
+      ``fused_bias_gelu=True`` in the ``smp.tensor_parallelism`` context manager.
+

 .. function:: smp.set_tensor_parallelism(module, enabled=True, **kwargs)
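
A minimal sketch tying the pieces above together: the ``smp.tensor_parallelism``
context manager with the newly documented fused-kernel keyword arguments, and
``smp.set_tensor_parallelism`` for a module created outside the context. It assumes
the library's conventional ``import smdistributed.modelparallel.torch as smp`` entry
point; the module sizes are arbitrary and the rest of the training script is
omitted::

    import torch.nn as nn
    import smdistributed.modelparallel.torch as smp  # assumed conventional import

    class Model(nn.Module):
        def __init__(self):
            super().__init__()
            # Supported modules created inside the context are replaced by their
            # distributed counterparts. Extra keyword arguments that match the
            # substituted DistributedModule's __init__ override the init_hook values.
            with smp.tensor_parallelism(enabled=True,
                                        fused_softmax=True,     # default: True
                                        fused_bias_gelu=True):  # default: False
                self.m0 = nn.Linear(20, 20)   # will be distributed

            with smp.tensor_parallelism(enabled=False):
                self.m1 = nn.Linear(20, 20)   # will not be distributed

    # Tensor parallelism can also be requested for an already-created module:
    m2 = nn.Linear(20, 20)
    smp.set_tensor_parallelism(m2, enabled=True)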
