
Commit 704982a

add new args, refs, and links
1 parent d5cad97 commit 704982a

2 files changed: 26 additions & 12 deletions

doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.rst

Lines changed: 16 additions & 4 deletions
@@ -22,9 +22,21 @@ Sagemaker Distributed Model Parallel 1.7.0 Release Notes
 Additional tensor parallelism features for PyTorch:
 
 * Support for query key layer scaling to avoid overflow for large model
+
+  * This feature is integrated to the following modules:
+
+    * :class:`smp.nn.DistributedTransformerLMHead`
+    * :class:`smp.nn.DistributedTransformer`
+    * :class:`smp.nn.DistributedTransformerLayer`
+    * :class:`smp.nn.DistributedAttentionLayer`
+
 * Support for FP32 residual addition to avoid overflow (NaN loss values)
   for large models when using FP16
 
+  * This feature is integrated to the following module:
+
+    * :class:`smp.nn.DistributedTransformerOutputLayer`
+
 **Improvements**
 
 * Added support for a custom CUDA kernel for softmax to improve throughput
@@ -62,10 +74,10 @@ Sagemaker Distributed Model Parallel 1.6.0 Release Notes
 
 - Added extended memory-saving features for PyTorch 1.8.1:
 
-  - Tensor parallelism
-  - Optimizer state sharding
-  - Activation checkpointing
-  - Activation offloading
+  - `Tensor parallelism <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-tensor-parallelism.html>`_
+  - `Optimizer state sharding <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-optimizer-state-sharding.html>`_
+  - `Activation checkpointing <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-activation-checkpointing.html>`_
+  - `Activation offloading <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-activation-offloading.html>`_
 
 For more information, see the following documentation:
 
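For readers of the change log above, here is a minimal usage sketch (not part of this commit) showing how the two new 1.7.0 features could be switched on through the constructor arguments added in the API reference changes below. It assumes the library's usual ``import smdistributed.modelparallel.torch as smp`` alias, an already-initialized model-parallel training script, and purely illustrative layer sizes.

    # Hypothetical sketch: enabling the SMP 1.7.0 overflow-avoidance features.
    # Assumes smp.init() has already been called with a tensor-parallel config.
    import smdistributed.modelparallel.torch as smp

    # Query key layer scaling (integrated into DistributedTransformerLMHead,
    # DistributedTransformer, DistributedTransformerLayer, and
    # DistributedAttentionLayer) rescales attention scores to avoid FP16 overflow.
    transformer = smp.nn.DistributedTransformer(
        num_layers=48,
        num_attention_heads=32,
        attention_head_size=64,
        hidden_size=2048,
        intermediate_size=8192,
        query_key_layer_scaling=True,  # new argument in 1.7.0
    )

    # FP32 residual addition (DistributedTransformerOutputLayer) performs the
    # residual add in FP32 to avoid NaN loss values when training with FP16.
    output_layer = smp.nn.DistributedTransformerOutputLayer(
        hidden_size=2048,
        intermediate_size=8192,
        fp32_residual_addition=True,  # new argument in 1.7.0
    )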
doc/api/training/smp_versions/latest/smd_model_parallel_pytorch_tensor_parallel.rst

Lines changed: 10 additions & 8 deletions
@@ -191,7 +191,7 @@ Tensor Parallelism Module APIs
    - ``out_features``: The total number of output channels for the
      linear layer across all tensor-parallel ranks.
 
-.. class:: smp.nn.DistributedTransformerLMHead(num_layers=12, num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, vocab_size=30522, num_positions=1024, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, num_token_types=0, causal_mask_size=None, add_cross_attention=False, add_lm_head=True, initializer_range=0.02, use_normal_initialization=False, pre_layernorm=False, post_layernorm=True)
+.. class:: smp.nn.DistributedTransformerLMHead(num_layers=12, num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, vocab_size=30522, num_positions=1024, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, num_token_types=0, causal_mask_size=None, add_cross_attention=False, add_lm_head=True, initializer_range=0.02, use_normal_initialization=False, pre_layernorm=False, post_layernorm=True, query_key_layer_scaling=False)
 
    - Constructs a distributed transformer model, including embeddings
      and a single LM head. A word embedding of size
@@ -223,7 +223,7 @@ Tensor Parallelism Module APIs
    - ``attention_mask`` is assumed to be a 0-1 tensor of shape
      ``[N, S]``, where 1 represents a masked position.
 
-.. class:: smp.nn.DistributedTransformer(num_layers=12, num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, causal_mask_size=None, add_cross_attention=False, pre_layernorm=False, post_layernorm=True)
+.. class:: smp.nn.DistributedTransformer(num_layers=12, num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, causal_mask_size=None, add_cross_attention=False, pre_layernorm=False, post_layernorm=True, query_key_layer_scaling=False)
 
    - A sequence of ``smp.nn.DistributedTransformerLayer``\ s, whose
      number is given by ``num_layers`` argument. For the other
@@ -234,7 +234,7 @@ Tensor Parallelism Module APIs
      the ``DistributedTransformer``, in addition to the intermediate
      attention and transformer-output layers.
 
-.. class:: smp.nn.DistributedTransformerLayer(num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, causal_mask_size=None, add_cross_attention=False, pre_layernorm=False, post_layernorm=True)
+.. class:: smp.nn.DistributedTransformerLayer(num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, causal_mask_size=None, add_cross_attention=False, pre_layernorm=False, post_layernorm=True, query_key_layer_scaling=False)
 
    - Tensor-parallel implementation of a single transformer layer.
      Number of attention heads, hidden size, and intermediate size
@@ -336,7 +336,7 @@ Tensor Parallelism Module APIs
      and the next three tensors are the same as the input
      arguments.
 
-.. class:: smp.nn.DistributedAttentionLayer(num_attention_heads=32, attention_head_size=32, hidden_size=1024, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, cross_attention=False, causal_mask_size=None, pre_layernorm=False, post_layernorm=True)
+.. class:: smp.nn.DistributedAttentionLayer(num_attention_heads=32, attention_head_size=32, hidden_size=1024, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, cross_attention=False, causal_mask_size=None, pre_layernorm=False, post_layernorm=True, query_key_layer_scaling=False)
 
    - A distributed implementation for the attention block. Includes the
      computation of the self- or cross-attention (context layer),
@@ -346,9 +346,9 @@ Tensor Parallelism Module APIs
 
    - See ``DistributedTransformerLayer`` for a description of the
      arguments.
-   - If ``cross_attention`` is ``True``, computes the attentions
+   - ``cross_attention``: If ``True``, it computes the attentions
      with respect to the ``cross_states`` tensor of the ``forward``
-     method input tuple.
+     method input tuple. (Default: ``False``)
 
    - **Methods:**
 
@@ -363,7 +363,7 @@ Tensor Parallelism Module APIs
          ``[N, S, H]``, where ``N`` is batch size, ``S`` is
          sequence length, and ``H`` is ``hidden_size``.
          ``attention_mask`` is assumed to be a tensor of
-         dimensions ``[N, 1, 1, S]``, \***\* where ``N`` is the
+         dimensions ``[N, 1, 1, S]``, where ``N`` is the
          batch size, and ``S`` is the sequence length.
        - If ``cross_attention=True``, ``inputs`` must be a tuple
          ``(hidden_states, cross_states, attention_mask)``, where
@@ -383,7 +383,7 @@ Tensor Parallelism Module APIs
        - A single tensor that is the output of the attention
          layer.
 
-.. class:: smp.nn.DistributedTransformerOutputLayer(hidden_size=1024, intermediate_size=4096, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, pre_layernorm=False, post_layernorm=True)
+.. class:: smp.nn.DistributedTransformerOutputLayer(hidden_size=1024, intermediate_size=4096, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, pre_layernorm=False, post_layernorm=True, fp32_residual_addition=False)
 
    - Distributed implementation of a single transformer output layer. A
      single ``DistributedTransformerLayer`` with
@@ -396,6 +396,8 @@ Tensor Parallelism Module APIs
 
    - See ``DistributedTransformerLayer`` for a description of the
      arguments.
+   - ``fp32_residual_addition``: Set to ``True`` if you want to avoid overflow
+     (NaN loss values) for large models when using FP16. (Default: False)
 
 .. class:: smp.nn.DistributedEmbedding(num_embeddings,embedding_dim, padding_idx=None, max_norm=None, norm_type=2.0, scale_grad_by_freq=False, sparse=False, _weight=None, initializer_range=0.02, _skip_allgather=False,_skip_scatter_and_merge=False,)
 
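As a companion to the updated ``DistributedAttentionLayer`` entry above, the following hypothetical sketch (not part of this commit) illustrates the documented input shapes. The forward-call pattern and the tensor sizes are assumptions inferred from the docstring text, and an initialized SMP tensor-parallel environment is required for the construction to work.

    # Hypothetical shape sketch for smp.nn.DistributedAttentionLayer.
    # Assumes smp.init() has run with a tensor-parallel configuration.
    import torch
    import smdistributed.modelparallel.torch as smp

    N, S, H = 4, 128, 1024  # batch size, sequence length, hidden_size

    attn = smp.nn.DistributedAttentionLayer(
        num_attention_heads=32,
        attention_head_size=32,
        hidden_size=H,
        query_key_layer_scaling=True,  # new argument from this commit
    )

    hidden_states = torch.randn(N, S, H)      # documented shape [N, S, H]
    attention_mask = torch.zeros(N, 1, 1, S)  # documented shape [N, 1, 1, S]

    # Self-attention: the docs describe forward() as taking an ``inputs`` tuple,
    # so the call below assumes (hidden_states, attention_mask).
    context_layer = attn((hidden_states, attention_mask))

    # With cross_attention=True at construction, ``inputs`` must instead be the
    # tuple (hidden_states, cross_states, attention_mask).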
0 commit comments
