@@ -191,7 +191,7 @@ Tensor Parallelism Module APIs
  - ``out_features``: The total number of output channels for the
    linear layer across all tensor-parallel ranks.

- .. class:: smp.nn.DistributedTransformerLMHead(num_layers=12, num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, vocab_size=30522, num_positions=1024, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, num_token_types=0, causal_mask_size=None, add_cross_attention=False, add_lm_head=True, initializer_range=0.02, use_normal_initialization=False, pre_layernorm=False, post_layernorm=True)
+ .. class:: smp.nn.DistributedTransformerLMHead(num_layers=12, num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, vocab_size=30522, num_positions=1024, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, num_token_types=0, causal_mask_size=None, add_cross_attention=False, add_lm_head=True, initializer_range=0.02, use_normal_initialization=False, pre_layernorm=False, post_layernorm=True, query_key_layer_scaling=False)

  - Constructs a distributed transformer model, including embeddings
    and a single LM head. A word embedding of size
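
The only change above is the new ``query_key_layer_scaling`` keyword appended to the
constructor; everything else is unchanged. A minimal construction sketch, assuming an
``smp``-initialized SageMaker model-parallel training script and using illustrative
hyperparameter values (not recommendations):

.. code:: python

   # Sketch only: assumes smp.init() has already run inside a SageMaker
   # model-parallel job with tensor parallelism enabled.
   import smdistributed.modelparallel.torch as smp

   lm_head = smp.nn.DistributedTransformerLMHead(
       num_layers=24,                  # illustrative values
       num_attention_heads=16,
       attention_head_size=64,
       hidden_size=1024,
       intermediate_size=4096,
       vocab_size=30522,
       num_positions=1024,
       causal_mask_size=1024,          # e.g., GPT-style causal attention
       add_lm_head=True,
       query_key_layer_scaling=False,  # the flag added in this commit
   )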
@@ -223,7 +223,7 @@ Tensor Parallelism Module APIs
  - ``attention_mask`` is assumed to be a 0-1 tensor of shape
    ``[N, S]``, where 1 represents a masked position.

- .. class:: smp.nn.DistributedTransformer(num_layers=12, num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, causal_mask_size=None, add_cross_attention=False, pre_layernorm=False, post_layernorm=True)
+ .. class:: smp.nn.DistributedTransformer(num_layers=12, num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, causal_mask_size=None, add_cross_attention=False, pre_layernorm=False, post_layernorm=True, query_key_layer_scaling=False)

  - A sequence of ``smp.nn.DistributedTransformerLayer``\ s, whose
    number is given by the ``num_layers`` argument. For the other
@@ -234,7 +234,7 @@ Tensor Parallelism Module APIs
    the ``DistributedTransformer``, in addition to the intermediate
    attention and transformer-output layers.

- .. class:: smp.nn.DistributedTransformerLayer(num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, causal_mask_size=None, add_cross_attention=False, pre_layernorm=False, post_layernorm=True)
+ .. class:: smp.nn.DistributedTransformerLayer(num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, causal_mask_size=None, add_cross_attention=False, pre_layernorm=False, post_layernorm=True, query_key_layer_scaling=False)

  - Tensor-parallel implementation of a single transformer layer.
    Number of attention heads, hidden size, and intermediate size
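
Because ``DistributedTransformer`` is a stack of ``num_layers`` copies of
``smp.nn.DistributedTransformerLayer`` and takes the same per-layer arguments, the new
``query_key_layer_scaling`` flag can be supplied to either class. A hedged sketch with
illustrative argument values, assuming an ``smp``-initialized training script:

.. code:: python

   # Sketch only: both constructors accept the same per-layer keyword arguments,
   # including the query_key_layer_scaling flag introduced by this commit.
   import smdistributed.modelparallel.torch as smp

   layer = smp.nn.DistributedTransformerLayer(
       num_attention_heads=32,
       attention_head_size=32,
       hidden_size=1024,
       intermediate_size=4096,
       query_key_layer_scaling=True,
   )

   transformer = smp.nn.DistributedTransformer(
       num_layers=12,                  # number of stacked DistributedTransformerLayers
       num_attention_heads=32,
       attention_head_size=32,
       hidden_size=1024,
       intermediate_size=4096,
       query_key_layer_scaling=True,
   )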
@@ -336,7 +336,7 @@ Tensor Parallelism Module APIs
    and the next three tensors are the same as the input
    arguments.

- .. class:: smp.nn.DistributedAttentionLayer(num_attention_heads=32, attention_head_size=32, hidden_size=1024, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, cross_attention=False, causal_mask_size=None, pre_layernorm=False, post_layernorm=True)
+ .. class:: smp.nn.DistributedAttentionLayer(num_attention_heads=32, attention_head_size=32, hidden_size=1024, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, cross_attention=False, causal_mask_size=None, pre_layernorm=False, post_layernorm=True, query_key_layer_scaling=False)

  - A distributed implementation for the attention block. Includes the
    computation of the self- or cross-attention (context layer),
@@ -346,9 +346,9 @@ Tensor Parallelism Module APIs

  - See ``DistributedTransformerLayer`` for a description of the
    arguments.
- - If ``cross_attention`` is ``True``, computes the attentions
+ - ``cross_attention``: If ``True``, it computes the attentions
    with respect to the ``cross_states`` tensor of the ``forward``
-   method input tuple.
+   method input tuple. (Default: ``False``)

  - **Methods:**

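
A short sketch of the ``cross_attention`` switch described in the bullets above; the
argument values are illustrative and the module assumes an ``smp``-initialized script:

.. code:: python

   # Sketch only: with cross_attention=True the layer attends over the
   # cross_states tensor supplied in the forward input tuple.
   import smdistributed.modelparallel.torch as smp

   cross_attn = smp.nn.DistributedAttentionLayer(
       num_attention_heads=32,
       attention_head_size=32,
       hidden_size=1024,
       cross_attention=True,   # default is False (self-attention)
   )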
@@ -363,7 +363,7 @@ Tensor Parallelism Module APIs
    ``[N, S, H]``, where ``N`` is batch size, ``S`` is
    sequence length, and ``H`` is ``hidden_size``.
    ``attention_mask`` is assumed to be a tensor of
-   dimensions ``[N, 1, 1, S]``, \* ** \* where ``N`` is the
+   dimensions ``[N, 1, 1, S]``, where ``N`` is the
    batch size, and ``S`` is the sequence length.
  - If ``cross_attention=True``, ``inputs`` must be a tuple
    ``(hidden_states, cross_states, attention_mask)``, where
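
To make the shapes above concrete, the following standalone sketch prepares the
``forward`` inputs for ``DistributedAttentionLayer``: a ``[N, 1, 1, S]`` mask is viewed
from a ``[N, S]`` mask, and the input tuples follow the forms described above. The mask
contents are placeholders; the exact masking convention for this layer is not asserted
here.

.. code:: python

   # Sketch only: shapes follow the documentation above.
   import torch

   N, S, H = 4, 128, 1024                    # batch size, sequence length, hidden_size
   hidden_states = torch.randn(N, S, H)
   cross_states = torch.randn(N, S, H)       # only used when cross_attention=True

   mask_2d = torch.zeros(N, S)               # placeholder [N, S] mask
   attention_mask = mask_2d[:, None, None, :]  # -> shape [N, 1, 1, S]

   inputs = (hidden_states, attention_mask)                      # self-attention
   cross_inputs = (hidden_states, cross_states, attention_mask)  # cross_attention=True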
@@ -383,7 +383,7 @@ Tensor Parallelism Module APIs
  - A single tensor that is the output of the attention
    layer.

- .. class:: smp.nn.DistributedTransformerOutputLayer(hidden_size=1024, intermediate_size=4096, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, pre_layernorm=False, post_layernorm=True)
+ .. class:: smp.nn.DistributedTransformerOutputLayer(hidden_size=1024, intermediate_size=4096, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, pre_layernorm=False, post_layernorm=True, fp32_residual_addition=False)

  - Distributed implementation of a single transformer output layer. A
    single ``DistributedTransformerLayer`` with
@@ -396,6 +396,8 @@ Tensor Parallelism Module APIs

  - See ``DistributedTransformerLayer`` for a description of the
    arguments.
+ - ``fp32_residual_addition``: Set to ``True`` if you want to avoid overflow
+   (NaN loss values) for large models when using FP16. (Default: ``False``)

  .. class:: smp.nn.DistributedEmbedding(num_embeddings, embedding_dim, padding_idx=None, max_norm=None, norm_type=2.0, scale_grad_by_freq=False, sparse=False, _weight=None, initializer_range=0.02, _skip_allgather=False, _skip_scatter_and_merge=False)

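
A hedged sketch of the new ``fp32_residual_addition`` flag on
``DistributedTransformerOutputLayer``, per the bullet added above; the sizes are
illustrative and the module assumes an ``smp``-initialized FP16 training script:

.. code:: python

   # Sketch only: per the added bullet, enable fp32_residual_addition to avoid
   # overflow (NaN loss values) for large models when training with FP16.
   import smdistributed.modelparallel.torch as smp

   output_layer = smp.nn.DistributedTransformerOutputLayer(
       hidden_size=8192,               # illustrative large-model sizes
       intermediate_size=32768,
       activation="gelu",
       fp32_residual_addition=True,    # flag added in this commit
   )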