
documentation: sagemaker distributed model parallel 1.7.0 doc #2992


Merged
merged 21 commits on Mar 18, 2022

Changes from all commits
21 commits
f166b60
change: update code to get commit_id in codepipeline (#2961)
navinsoni Feb 26, 2022
086258d
feature: Data Serializer (#2956)
jeniyat Feb 28, 2022
a39b750
change: reorganize test files for workflow (#2960)
qidewenwhen Mar 3, 2022
28fd737
feature: TensorFlow 2.4 for Neo (#2861)
Qingzi-Lan Mar 3, 2022
20df3d7
fix: Remove sagemaker_job_name from hyperparameters in TrainingStep (…
staubhp Mar 3, 2022
b9f90dc
fix: Style update in DataSerializer (#2962)
jeniyat Mar 3, 2022
6db3774
documentation: smddp doc update (#2968)
mchoi8739 Mar 4, 2022
d610bfb
fix: container env generation for S3 URI and add test for the same (#…
shreyapandit Mar 7, 2022
169dffd
documentation: update sagemaker training compiler docstring (#2969)
mchoi8739 Mar 7, 2022
4325fcd
feat: Python 3.9 for readthedocs (#2973)
ahsan-z-khan Mar 8, 2022
92d0627
Merge branch 'master' of https://github.com/aws/sagemaker-python-sdk …
mchoi8739 Mar 12, 2022
4cb56d6
fix doc structure
mchoi8739 Mar 12, 2022
d5cad97
archive 1.6.0 doc
mchoi8739 Mar 12, 2022
704982a
add new args, refs, and links
mchoi8739 Mar 12, 2022
920ec06
fix version number
mchoi8739 Mar 12, 2022
efc0e48
incorp eng feedback, update docstrings, improve xref
mchoi8739 Mar 15, 2022
f290a34
Trigger Build
mchoi8739 Mar 15, 2022
634eea5
Merge branch 'master' of https://github.com/aws/sagemaker-python-sdk …
mchoi8739 Mar 15, 2022
feb88ea
minor fix, trigger build again
mchoi8739 Mar 15, 2022
767ea87
fix typo
mchoi8739 Mar 15, 2022
8e99b0e
Merge branch 'master' into smdmp-1.7.0-doc
mufaddal-rohawala Mar 18, 2022
3 changes: 3 additions & 0 deletions doc/api/training/distributed.rst
@@ -26,3 +26,6 @@ The SageMaker Distributed Model Parallel Library
:maxdepth: 3

smd_model_parallel
smp_versions/latest
smd_model_parallel_general
smd_model_parallel_release_notes/smd_model_parallel_change_log
20 changes: 0 additions & 20 deletions doc/api/training/smd_model_parallel.rst
@@ -9,15 +9,6 @@ allowing you to increase prediction accuracy by creating larger models with more
You can use the library to automatically partition your existing TensorFlow and PyTorch workloads
across multiple GPUs with minimal code changes. The library's API can be accessed through the Amazon SageMaker SDK.

See the following sections to learn more about the SageMaker model parallel library APIs.

.. toctree::
:maxdepth: 3

smp_versions/latest
smd_model_parallel_general


.. tip::

We recommend using this API documentation with the conceptual guide at
@@ -48,14 +39,3 @@ See the following sections to learn more about the SageMaker model parallel libr
`Extend or Adapt A Docker Container that Contains the Model Parallel Library
<https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-use-api.html#model-parallel-customize-container>`__
for more information.

Release Notes
=============

New features, bug fixes, and improvements are regularly made to the SageMaker
distributed model parallel library.

.. toctree::
:maxdepth: 1

smd_model_parallel_release_notes/smd_model_parallel_change_log
@@ -1,6 +1,62 @@
Sagemaker Distributed Model Parallel 1.6.0 Release Notes
#############
Release Notes
#############

New features, bug fixes, and improvements are regularly made to the SageMaker
distributed model parallel library.

SageMaker Distributed Model Parallel 1.7.0 Release Notes
========================================================

*Date: March 7, 2022*

**Currency Updates**

* Support for PyTorch 1.10.2
* Support for Hugging Face Transformers 4.16.2

**Improvements**

* Additional support for the :ref:`smdmp-pytorch-tensor-parallel`.

* Added support for FP32 residual addition to avoid overflow (NaN loss values)
for large models with more than 100 billion parameters when using FP16.
This is integrated into the following module:

* :class:`smp.nn.DistributedTransformerOutputLayer`


* Added support for the following two `NVIDIA Megatron fused kernels
<https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/fused_kernels>`_:

* Fusion of attention masking and softmax (``fused_softmax``)
* Fusion of bias addition and Gelu activation (``fused_bias_gelu``)

To learn more about these options and how to use them,
see the :class:`smp.tensor_parallelism` context manager.
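
For a quick orientation, the following is a minimal sketch of where these
options are set. The surrounding training script, the model parallel launch
configuration, and the modules constructed inside the context are assumptions
for illustration, not part of this release note.

.. code:: python

    # Sketch only: the transformer modules to be distributed would be
    # constructed inside the context; only the new v1.7.0 options are shown.
    import smdistributed.modelparallel.torch as smp

    smp.init()  # assumes the job was launched with a model-parallel config

    with smp.tensor_parallelism(
        enabled=True,
        fused_softmax=True,    # fusion of attention masking and softmax
        fused_bias_gelu=True,  # fusion of bias addition and Gelu activation
    ):
        ...  # construct the transformer modules to be distributed here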



**Migration to AWS Deep Learning Containers**

This version passed benchmark testing and has been migrated to the following AWS Deep Learning Container:


* PyTorch 1.10.2

.. code::

763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.10.2-gpu-py38-cu113-ubuntu20.04-sagemaker
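
For orientation, a minimal sketch of launching a training job that uses this
container through the SageMaker Python SDK follows. The entry point, role,
instance settings, and library parameters are illustrative assumptions, and
``framework_version``/``py_version`` are only intended to resolve to the
PyTorch 1.10.2 container family above.

.. code:: python

    # Sketch only: values below are placeholders, not recommendations.
    import sagemaker
    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(
        entry_point="train.py",                # hypothetical training script
        role=sagemaker.get_execution_role(),
        framework_version="1.10.2",
        py_version="py38",
        instance_type="ml.p4d.24xlarge",
        instance_count=1,
        distribution={
            "smdistributed": {
                "modelparallel": {
                    "enabled": True,
                    "parameters": {"partitions": 2, "ddp": True},
                }
            },
            "mpi": {"enabled": True, "processes_per_host": 8},
        },
    )
    estimator.fit()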


----

Release History
===============

SageMaker Distributed Model Parallel 1.6.0 Release Notes
--------------------------------------------------------

*Date: December 20, 2021*

**New Features**
@@ -9,10 +65,10 @@ Sagemaker Distributed Model Parallel 1.6.0 Release Notes

- Added extended memory-saving features for PyTorch 1.8.1:

- Tensor parallelism
- Optimizer state sharding
- Activation checkpointing
- Activation offloading
- `Tensor parallelism <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-tensor-parallelism.html>`_
- `Optimizer state sharding <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-optimizer-state-sharding.html>`_
- `Activation checkpointing <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-activation-checkpointing.html>`_
- `Activation offloading <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-activation-offloading.html>`_

For more information, see the following documentation:

@@ -30,12 +86,9 @@ AWS Deep Learning Container(s):

763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.8.1-gpu-py36-cu111-ubuntu18.04

----

Release History
===============

Sagemaker Distributed Model Parallel 1.5.0 Release Notes
SageMaker Distributed Model Parallel 1.5.0 Release Notes
--------------------------------------------------------

*Date: November 3, 2021*
@@ -59,7 +112,7 @@ AWS Deep Learning Containers:

----

Sagemaker Distributed Model Parallel 1.4.0 Release Notes
SageMaker Distributed Model Parallel 1.4.0 Release Notes
--------------------------------------------------------

*Date: June 29, 2021*
@@ -90,7 +143,7 @@ AWS Deep Learning Containers:

----

Sagemaker Distributed Model Parallel 1.3.1 Release Notes
SageMaker Distributed Model Parallel 1.3.1 Release Notes
--------------------------------------------------------

- New Features
@@ -143,7 +196,7 @@ Sagemaker Distributed Model Parallel 1.3.1 Release Notes

----

Sagemaker Distributed Model Parallel 1.3.0 Release Notes
SageMaker Distributed Model Parallel 1.3.0 Release Notes
--------------------------------------------------------

- New Features
@@ -235,7 +288,7 @@ Sagemaker Distributed Model Parallel 1.3.0 Release Notes

----

Sagemaker Distributed Model Parallel 1.2.0 Release Notes
SageMaker Distributed Model Parallel 1.2.0 Release Notes
--------------------------------------------------------

- New Features
@@ -312,7 +365,7 @@ Sagemaker Distributed Model Parallel 1.2.0 Release Notes

----

Sagemaker Distributed Model Parallel 1.1.0 Release Notes
SageMaker Distributed Model Parallel 1.1.0 Release Notes
--------------------------------------------------------

- New Features
1 change: 1 addition & 0 deletions doc/api/training/smp_versions/archives.rst
@@ -3,6 +3,7 @@
.. toctree::
:maxdepth: 1

v1_6_0.rst
v1_5_0.rst
v1_4_0.rst
v1_3_0.rst
2 changes: 1 addition & 1 deletion doc/api/training/smp_versions/latest.rst
@@ -10,7 +10,7 @@ depending on which version of the library you need to use.
To use the library, reference the
**Common API** documentation alongside the framework specific API documentation.

Version 1.6.0 (Latest)
Version 1.7.0 (Latest)
======================

To use the library, reference the Common API documentation alongside the framework specific API documentation.
@@ -16,7 +16,7 @@ you need to add the following import statement at the top of your training scrip
<https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-customize-training-script-pt.html>`_
to learn how to use the following API in your PyTorch training script.

.. py:class:: smp.DistributedModel()
.. class:: smp.DistributedModel

A sub-class of ``torch.nn.Module`` which specifies the model to be
partitioned. Accepts a ``torch.nn.Module`` object ``module`` which is
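
As a quick reference, a minimal sketch of the import statement mentioned above
and of wrapping a model follows; ``Net`` is a hypothetical ``torch.nn.Module``
used only for illustration.

.. code:: python

    # Sketch only: a bare-bones wrapping pattern for the model parallel library.
    import torch.nn as nn
    import smdistributed.modelparallel.torch as smp  # the import referenced above


    class Net(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc = nn.Linear(20, 10)

        def forward(self, x):
            return self.fc(x)


    smp.init()                            # initialize the library
    model = smp.DistributedModel(Net())   # the wrapped module gets partitioned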
@@ -205,7 +205,7 @@ Tensor Parallelism Module APIs
if ``add_lm_head`` is ``True``, the output passes through a single
LM head, which is a linear module without bias whose weight is
tied to the word embeddings.
- See ``DistributedTransformerLayer`` for a description of the rest
- See :class:`smp.nn.DistributedTransformerLayer` for descriptions of the rest
of the arguments.
- **Methods:**

@@ -344,11 +344,11 @@ Tensor Parallelism Module APIs
followed by the residual-connection and layer normalization.
- **Arguments:**

- See ``DistributedTransformerLayer`` for a description of the
- See :class:`smp.nn.DistributedTransformerLayer` for descriptions of the
arguments.
- If ``cross_attention`` is ``True``, computes the attentions
- ``cross_attention``: If ``True``, it computes the attentions
with respect to the ``cross_states`` tensor of the ``forward``
method input tuple.
method input tuple. (Default: ``False``)

- **Methods:**

@@ -363,7 +363,7 @@ Tensor Parallelism Module APIs
``[N, S, H]``, where ``N`` is batch size, ``S`` is
sequence length, and ``H`` is ``hidden_size``.
``attention_mask`` is assumed to be a tensor of
dimensions ``[N, 1, 1, S]``, \***\* where ``N`` is the
dimensions ``[N, 1, 1, S]``, where ``N`` is the
batch size, and ``S`` is the sequence length.
- If ``cross_attention=True``, ``inputs`` must be a tuple
``(hidden_states, cross_states, attention_mask)``, where
@@ -383,27 +383,30 @@ Tensor Parallelism Module APIs
- A single tensor that is the output of the attention
layer.

.. class:: smp.nn.DistributedTransformerOutputLayer(hidden_size=1024, intermediate_size=4096, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, pre_layernorm=False, post_layernorm=True)
.. class:: smp.nn.DistributedTransformerOutputLayer(hidden_size=1024, intermediate_size=4096, hidden_dropout_prob=0.1, activation="gelu", layernorm_epsilon=1e-5, initializer_range=0.02, use_normal_initialization=False, pre_layernorm=False, post_layernorm=True, fp32_residual_addition=False)

- Distributed implementation of a single transformer output layer. A
single ``DistributedTransformerLayer`` with
single :class:`smp.nn.DistributedTransformerLayer` with
``add_cross_attention=False`` consists of a single
``DistributedAttentionLayer`` immediately followed by a single
``DistributedTransformerOutputLayer``. The latter linearly maps
the last channel of the input tensor from ``hidden_size`` to
``intermediate_size``, and then maps it back to ``hidden_size``.
- **Arguments:**

- See ``DistributedTransformerLayer`` for a description of the
- See :class:`smp.nn.DistributedTransformerLayer` for descriptions of the
arguments.
- ``fp32_residual_addition``: Set to ``True`` if you want to avoid overflow
(NaN loss values) for large models with more than 100 billion parameters
when using FP16. (Default: ``False``)

.. class:: smp.nn.DistributedEmbedding(num_embeddings, embedding_dim, padding_idx=None, max_norm=None, norm_type=2.0, scale_grad_by_freq=False, sparse=False, _weight=None, initializer_range=0.02, _skip_allgather=False, _skip_scatter_and_merge=False)

- Distributed implementation of a single Embedding Layer. Currently
only supports splitting across the embedding_dim.
- **Arguments:**

- See ``DistributedEmbedding`` for a description of the
- See :class:`smp.nn.DistributedEmbedding` for descriptions of the
arguments.
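
For illustration, a minimal sketch of constructing these modules directly with
the documented signatures; the sizes (and the GPT-2-style vocabulary size) are
arbitrary assumptions.

.. code:: python

    # Sketch only: argument names follow the class signatures documented above.
    import smdistributed.modelparallel.torch as smp

    smp.init()  # assumes a job launched with a tensor-parallel configuration

    output_layer = smp.nn.DistributedTransformerOutputLayer(
        hidden_size=1024,
        intermediate_size=4096,
        fp32_residual_addition=True,  # new in v1.7.0
    )

    embedding = smp.nn.DistributedEmbedding(
        num_embeddings=50257,  # illustrative vocabulary size
        embedding_dim=1024,
    )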

.. _enabling-tp:
@@ -447,7 +450,7 @@ following API:

- A context manager that enables or disables tensor parallelism for
any supported module that is created inside. If there are nested
contexts, the innermost will override the rest. If there are
contexts, the innermost overrides the rest. If there are
multiple supported modules created within the context, where one
is the submodule of the other, only the outermost module will be
distributed. If a supported module shares weights with another
@@ -465,7 +468,25 @@ following API:
with smp.tensor_parallelism(enabled=False):
self.m1 = nn.Linear(20, 20) # will not be distributed

- Keyword arguments `kwargs` can be used to modify the configurations of the distributed modules created inside the context. If a keyword argument provided here matches any `__init__` method arguments of a `DistributedModule` that substitutes a module created inside the `smp.tensor_parallelism` context, this keyword will override the value defined in the `init_hook`.
- ``kwargs`` - Keyword arguments that can be used to modify the configurations of
the distributed modules created inside the context.
If a keyword argument passed this way matches any ``__init__`` method argument
of a ``DistributedModule`` that substitutes a module created inside
the ``smp.tensor_parallelism`` context, it overrides
the value defined in the ``init_hook`` (see the sketch after the following list).

- (*For v1.7.0 and later*) Through the following additional keyword arguments,
the library supports `NVIDIA Megatron’s fused kernels
<https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/fused_kernels>`_

- ``fused_softmax`` (bool) - Fusion of attention masking and softmax.
By default, it is set to ``True``. You can deactivate it by setting
``fused_softmax=False`` in the ``smp.tensor_parallelism`` context manager.
- ``fused_bias_gelu`` (bool) - Fusion of bias addition and Gelu activation.
By default, it is set to ``False``. You can activate it by setting
``fused_bias_gelu=True`` in the ``smp.tensor_parallelism`` context manager.
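
As an illustration of the keyword-argument override described above, the
following sketch passes ``initializer_range`` (an ``__init__`` argument of
several distributed module classes documented above) through the context
manager; the value and the elided module construction are assumptions.

.. code:: python

    # Sketch only: any matching __init__ argument of a substituted
    # DistributedModule created inside the context would be overridden.
    import smdistributed.modelparallel.torch as smp

    smp.init()

    with smp.tensor_parallelism(enabled=True, initializer_range=0.01):
        ...  # construct the modules to be distributed here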



.. function:: smp.set_tensor_parallelism(module, enabled=True, **kwargs)
