@@ -191,7 +191,7 @@ Tensor Parallelism Module APIs
- ``out_features``: The total number of output channels for the
linear layer across all tensor-parallel ranks.

- .. class:: smp.nn.DistributedTransformerLMHead(num_layers=12, num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, vocab_size=30522, num_positions=1024, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, num_token_types=0, causal_mask_size=None, add_cross_attention=False, add_lm_head=True, initializer_range=0.02, use_normal_initialization=False, pre_layernorm=False, post_layernorm=True, query_key_layer_scaling=False)
+ .. class:: smp.nn.DistributedTransformerLMHead(num_layers=12, num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, vocab_size=30522, num_positions=1024, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, num_token_types=0, causal_mask_size=None, add_cross_attention=False, add_lm_head=True, initializer_range=0.02, use_normal_initialization=False, pre_layernorm=False, post_layernorm=True)

- Constructs a distributed transformer model, including embeddings
and a single LM head. A word embedding of size
@@ -205,7 +205,7 @@ Tensor Parallelism Module APIs
if ``add_lm_head`` is ``True``, the output passes through a single
LM head, which is a linear module without bias whose weight is
tied to the word embeddings.
- - See ``DistributedTransformerLayer`` for a description of the rest
+ - See :class:`smp.nn.DistributedTransformerLayer` for descriptions of the rest
of the arguments.
- **Methods:**

@@ -223,7 +223,7 @@ Tensor Parallelism Module APIs
- ``attention_mask`` is assumed to be a 0-1 tensor of shape
``[N, S]``, where 1 represents a masked position.

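For illustration only (not part of the diff), a minimal construction sketch for the class above, assuming ``smdistributed.modelparallel.torch`` is installed and ``smp.init()`` was called with a tensor-parallel configuration; all keyword arguments are taken from the documented signature, and the values are placeholders:

.. code:: python

    import smdistributed.modelparallel.torch as smp

    # Hedged sketch: direct construction of the distributed transformer with an
    # LM head. With add_lm_head=True, the LM head weight is tied to the word
    # embeddings, as described above.
    lm_model = smp.nn.DistributedTransformerLMHead(
        num_layers=12,
        num_attention_heads=16,
        attention_head_size=64,
        hidden_size=1024,
        intermediate_size=4096,
        vocab_size=30522,
        num_positions=1024,
        causal_mask_size=1024,   # causal attention mask for autoregressive modeling
        add_lm_head=True,
    )
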
- .. class:: smp.nn.DistributedTransformer(num_layers=12, num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, causal_mask_size=None, add_cross_attention=False, pre_layernorm=False, post_layernorm=True, query_key_layer_scaling=False)
+ .. class:: smp.nn.DistributedTransformer(num_layers=12, num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, causal_mask_size=None, add_cross_attention=False, pre_layernorm=False, post_layernorm=True)

- A sequence of ``smp.nn.DistributedTransformerLayer``\ s, whose
number is given by ``num_layers`` argument. For the other
@@ -234,7 +234,7 @@ Tensor Parallelism Module APIs
the ``DistributedTransformer``, in addition to the intermediate
attention and transformer-output layers.

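A hedged sketch (not part of the diff) of stacking layers with the class above; the values are placeholders and the remaining keyword arguments keep their documented defaults:

.. code:: python

    import smdistributed.modelparallel.torch as smp

    # Hedged sketch: a stack of 24 DistributedTransformerLayer modules with
    # pre-layer-normalization, as used by GPT-style models.
    transformer = smp.nn.DistributedTransformer(
        num_layers=24,
        num_attention_heads=16,
        attention_head_size=64,
        hidden_size=1024,
        intermediate_size=4096,
        pre_layernorm=True,
        post_layernorm=False,
    )
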
- .. class:: smp.nn.DistributedTransformerLayer(num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, causal_mask_size=None, add_cross_attention=False, pre_layernorm=False, post_layernorm=True, query_key_layer_scaling=False)
+ .. class:: smp.nn.DistributedTransformerLayer(num_attention_heads=32, attention_head_size=32, hidden_size=1024, intermediate_size=4096, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, causal_mask_size=None, add_cross_attention=False, pre_layernorm=False, post_layernorm=True)

- Tensor-parallel implementation of a single transformer layer.
Number of attention heads, hidden size, and intermediate size
@@ -336,15 +336,15 @@ Tensor Parallelism Module APIs
and the next three tensors are the same as the input
arguments.

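To make the return convention described above concrete, here is a hedged sketch (not part of the diff) of a single layer constructed with cross-attention enabled; the argument values are placeholders taken only from the documented signature:

.. code:: python

    import smdistributed.modelparallel.torch as smp

    # Hedged sketch: a single tensor-parallel layer that also attends over
    # cross_states (add_cross_attention=True), e.g. a decoder-style layer.
    decoder_layer = smp.nn.DistributedTransformerLayer(
        num_attention_heads=16,
        attention_head_size=64,
        hidden_size=1024,
        intermediate_size=4096,
        add_cross_attention=True,
        causal_mask_size=1024,
    )
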
- .. class:: smp.nn.DistributedAttentionLayer(num_attention_heads=32, attention_head_size=32, hidden_size=1024, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, cross_attention=False, causal_mask_size=None, pre_layernorm=False, post_layernorm=True, query_key_layer_scaling=False)
+ .. class:: smp.nn.DistributedAttentionLayer(num_attention_heads=32, attention_head_size=32, hidden_size=1024, attention_dropout_prob=0.1, hidden_dropout_prob=0.1, layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, cross_attention=False, causal_mask_size=None, pre_layernorm=False, post_layernorm=True)

- A distributed implementation for the attention block. Includes the
computation of the self- or cross-attention (context layer),
followed by a linear mapping and dropout, which is optionally
followed by the residual-connection and layer normalization.
- **Arguments:**

- - See ``DistributedTransformerLayer`` for a description of the
+ - See :class:`smp.nn.DistributedTransformerLayer` for descriptions of the
arguments.
- ``cross_attention``: If ``True``, it computes the attentions
with respect to the ``cross_states`` tensor of the ``forward``
@@ -386,26 +386,27 @@ Tensor Parallelism Module APIs
.. class:: smp.nn.DistributedTransformerOutputLayer(hidden_size=1024, intermediate_size=4096, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, pre_layernorm=False, post_layernorm=True, fp32_residual_addition=False)

- Distributed implementation of a single transformer output layer. A
- single ``DistributedTransformerLayer`` with
+ single :class:`smp.nn.DistributedTransformerLayer` with
``add_cross_attention=False`` consists of a single
``DistributedAttentionLayer`` immediately followed by a single
``DistributedTransformerOutputLayer``. The latter linearly maps
the last channel of the input tensor from ``hidden_size`` to
``intermediate_size``, and then maps it back to ``hidden_size``.
- **Arguments:**

- - See ``DistributedTransformerLayer`` for a description of the
+ - See :class:`smp.nn.DistributedTransformerLayer` for descriptions of the
arguments.
- ``fp32_residual_addition``: Set to ``True`` if you want to avoid overflow
- (NaN loss values) for large models when using FP16. (Default: False)
+ (NaN loss values) for large models with more than 100 billion parameters
+ when using FP16. (Default: False)

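A hedged sketch (not part of the diff) of the composition just described: an attention block followed by an output (MLP) block, which together mirror a single ``DistributedTransformerLayer`` with ``add_cross_attention=False``; the argument values are placeholders:

.. code:: python

    import smdistributed.modelparallel.torch as smp

    # Hedged sketch: the two sub-blocks of a transformer layer.
    attention = smp.nn.DistributedAttentionLayer(
        num_attention_heads=16,
        attention_head_size=64,
        hidden_size=1024,
        cross_attention=False,        # self-attention only
    )
    output_layer = smp.nn.DistributedTransformerOutputLayer(
        hidden_size=1024,             # maps 1024 -> 4096 -> 1024 in the last channel
        intermediate_size=4096,
        fp32_residual_addition=True,  # FP32 residual addition to avoid FP16 overflow
    )
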
.. class:: smp.nn.DistributedEmbedding(num_embeddings, embedding_dim, padding_idx=None, max_norm=None, norm_type=2.0, scale_grad_by_freq=False, sparse=False, _weight=None, initializer_range=0.02, _skip_allgather=False, _skip_scatter_and_merge=False)

- Distributed implementation of a single Embedding Layer. Currently
only supports splitting across the ``embedding_dim``.
- **Arguments:**

- - See ``DistributedEmbedding`` for a description of the
+ - See :class:`smp.nn.DistributedEmbedding` for descriptions of the
arguments.

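A hedged construction sketch for the class above (not part of the diff); the sizes are placeholders:

.. code:: python

    import smdistributed.modelparallel.torch as smp

    # Hedged sketch: an embedding table whose embedding_dim is split across
    # the tensor-parallel ranks.
    embedding = smp.nn.DistributedEmbedding(
        num_embeddings=30522,   # e.g. vocabulary size
        embedding_dim=1024,
    )
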
.. _enabling-tp:
@@ -449,7 +450,7 @@ following API:

- A context manager that enables or disables tensor parallelism for
any supported module that is created inside. If there are nested
- contexts, the innermost will override the rest. If there are
+ contexts, the innermost overrides the rest. If there are
multiple supported modules created within the context, where one
is the submodule of the other, only the outermost module will be
distributed. If a supported module shares weights with another
@@ -467,7 +468,25 @@ following API:
with smp.tensor_parallelism(enabled=False):
    self.m1 = nn.Linear(20, 20)  # will not be distributed

- - Keyword arguments `kwargs` can be used to modify the configurations of the distributed modules created inside the context. If a keyword argument provided here matches any `__init__` method arguments of a `DistributedModule` that substitutes a module created inside the `smp.tensor_parallelism` context, this keyword will override the value defined in the `init_hook`.
+ - ``kwargs`` - Keyword arguments that can be used to modify the configurations of
+ the distributed modules created inside the context.
+ If a keyword argument provided through it matches any ``__init__`` method arguments
+ of a ``DistributedModule`` that substitutes a module created inside
+ the ``smp.tensor_parallelism`` context, this keyword will override
+ the value defined in the ``init_hook``.
+
+ - (*For v1.7.0 and later*) Through the following additional keyword arguments,
+ the library supports `NVIDIA Megatron’s fused kernels
+ <https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/fused_kernels>`_
+
+ - ``fused_softmax`` (bool) - Fusion of attention masking and softmax.
+ By default, it is set to ``True``. You can deactivate it by setting
+ ``fused_softmax=False`` in the ``smp.tensor_parallelism`` context manager.
+ - ``fused_bias_gelu`` (bool) - Fusion of bias addition and Gelu activation.
+ By default, it is set to ``False``. You can activate it by setting
+ ``fused_bias_gelu=True`` in the ``smp.tensor_parallelism`` context manager.

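A hedged sketch (not part of the diff) showing how these keyword arguments can be passed through the context manager, following the ``nn.Linear`` example earlier on this page; the module and class names are placeholders:

.. code:: python

    import torch.nn as nn
    import smdistributed.modelparallel.torch as smp

    class Model(nn.Module):
        def __init__(self):
            super().__init__()
            # Hedged sketch: keyword arguments given here are forwarded to the
            # __init__ of the DistributedModule that substitutes each supported
            # module created inside the context (fused-kernel switches need v1.7.0+).
            with smp.tensor_parallelism(enabled=True, fused_softmax=False, fused_bias_gelu=True):
                self.m1 = nn.Linear(20, 20)  # will be distributed
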
.. function:: smp.set_tensor_parallelism(module, enabled=True, **kwargs)