documentation: the SageMaker model parallel library 1.11.0 release #3321

Merged 16 commits on Sep 2, 2022
57 changes: 57 additions & 0 deletions doc/api/training/smd_model_parallel_general.rst
Expand Up @@ -19,6 +19,35 @@ The SageMaker model parallel library internally uses MPI.
To use model parallelism, both ``smdistributed`` and MPI must be enabled
through the ``distribution`` parameter.

The following code example is a template for setting up model parallelism with a PyTorch estimator.

.. code:: python

    import sagemaker
    from sagemaker.pytorch import PyTorch

    smp_options = {
        "enabled": True,
        "parameters": {
            ...
        }
    }

    mpi_options = {
        "enabled": True,
        ...
    }

    smdmp_estimator = PyTorch(
        ...
        distribution={
            "smdistributed": {"modelparallel": smp_options},
            "mpi": mpi_options
        }
    )

    smdmp_estimator.fit()
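
Because both ``smdistributed`` and MPI must be enabled, it can help to check the
``distribution`` dictionary before passing it to the estimator. The following is
a minimal sketch of such a check. The helper ``check_distribution`` is
hypothetical and not part of the SageMaker Python SDK, and the
``processes_per_host`` value is only illustrative.

.. code:: python

    def check_distribution(distribution):
        """Return a list of problems found in a ``distribution`` dictionary.

        Hypothetical helper for illustration; not part of the SageMaker Python SDK.
        """
        problems = []
        smp = distribution.get("smdistributed", {}).get("modelparallel", {})
        mpi = distribution.get("mpi", {})
        # Model parallelism requires both entries to be switched on.
        if not smp.get("enabled", False):
            problems.append("smdistributed.modelparallel is not enabled")
        if not mpi.get("enabled", False):
            problems.append("mpi is not enabled")
        return problems


    distribution = {
        "smdistributed": {"modelparallel": {"enabled": True, "parameters": {}}},
        "mpi": {"enabled": True, "processes_per_host": 8},  # 8 is an illustrative value
    }

    # An empty list means both components are enabled.
    problems = check_distribution(distribution)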

.. tip::

This page provides you a complete list of parameters you can use
Expand Down Expand Up @@ -214,6 +243,34 @@ PyTorch-specific Parameters
- False
- Skips the initial tracing step. This can be useful in very large models
where even model tracing at the CPU is not possible due to memory constraints.
* - ``sharded_data_parallel_degree`` (**smdistributed-modelparallel**>=v1.11)
- int
- 1
- To run a training job using sharded data parallelism, add this parameter and specify a number greater than 1.
Sharded data parallelism is a memory-saving distributed training technique that splits the training state of a model (model parameters, gradients, and optimizer states) across GPUs in a data parallel group.
For more information, see `Sharded Data Parallelism
<https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-sharded-data-parallelism.html>`_.
* - ``sdp_reduce_bucket_size`` (**smdistributed-modelparallel**>=v1.11)
- int
- 5e8
- Configuration parameter for sharded data parallelism (for ``sharded_data_parallel_degree > 2``).
Specifies the size of PyTorch DDP gradient buckets in number of elements of the default dtype.
* - ``sdp_param_persistence_threshold`` (**smdistributed-modelparallel**>=v1.11)
- int
- 1e6
- Specifies the size of a parameter tensor, in number of elements, that can persist on each GPU. Sharded data parallelism splits each parameter tensor across the GPUs of a data parallel group. If a parameter tensor has fewer elements than this threshold, it is not split but replicated across the data-parallel GPUs, which reduces communication overhead.
* - ``sdp_max_live_parameters`` (**smdistributed-modelparallel**>=v1.11)
- int
- 1e9
- Specifies the maximum number of parameters that can simultaneously be in a recombined training state during the forward and backward pass. Parameter fetching with the AllGather operation pauses when the number of active parameters reaches the given threshold. Note that increasing this parameter increases the memory footprint.
* - ``sdp_hierarchical_allgather`` (**smdistributed-modelparallel**>=v1.11)
- bool
- True
- If set to True, the AllGather operation runs hierarchically: it runs within each node first, and then runs across nodes. For multi-node distributed training jobs, the hierarchical AllGather operation is automatically activated.
* - ``sdp_gradient_clipping`` (**smdistributed-modelparallel**>=v1.11)
- float
- 1.0
- Specifies the threshold for clipping the L2 norm of the gradients before propagating them backward through the model parameters. When sharded data parallelism is activated, gradient clipping is also activated. The default threshold is 1.0. Adjust this parameter if you encounter exploding gradients.
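
The sharded data parallelism parameters listed in the table go into the
``parameters`` dictionary of ``smp_options``. The following is a minimal sketch
that starts from the defaults documented above and overrides only the degree.
The helper ``make_sdp_parameters`` is hypothetical, the degree ``2`` is an
illustrative value, and a real training job may need further parameters that
are not shown here.

.. code:: python

    # Defaults as listed in the table (smdistributed-modelparallel >= v1.11).
    SDP_DEFAULTS = {
        "sharded_data_parallel_degree": 1,
        "sdp_reduce_bucket_size": int(5e8),
        "sdp_param_persistence_threshold": int(1e6),
        "sdp_max_live_parameters": int(1e9),
        "sdp_hierarchical_allgather": True,
        "sdp_gradient_clipping": 1.0,
    }


    def make_sdp_parameters(**overrides):
        """Merge overrides into the documented defaults.

        Hypothetical helper for illustration; rejects names that are not
        in the table so that typos fail early.
        """
        unknown = set(overrides) - set(SDP_DEFAULTS)
        if unknown:
            raise ValueError(f"unknown parameters: {sorted(unknown)}")
        return {**SDP_DEFAULTS, **overrides}


    smp_options = {
        "enabled": True,
        # A degree greater than 1 turns on sharded data parallelism.
        "parameters": make_sdp_parameters(sharded_data_parallel_degree=2),
    }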


Parameters for ``mpi``
Expand Down
Expand Up @@ -5,9 +5,84 @@ Release Notes
New features, bug fixes, and improvements are regularly made to the SageMaker
distributed model parallel library.

SageMaker Distributed Model Parallel 1.10.0 Release Notes

SageMaker Distributed Model Parallel 1.11.0 Release Notes
=========================================================

*Date: August 17, 2022*

**New Features**

The following new features are added for PyTorch.

* The library implements sharded data parallelism, which is a memory-saving
  distributed training technique that splits the training state of a model
  (model parameters, gradients, and optimizer states) across GPUs in a data
  parallel group. With sharded data parallelism, you can reduce the per-GPU
  memory footprint of a model by sharding the training state over multiple
  GPUs. To learn more, see `Sharded Data Parallelism
  <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-sharded-data-parallelism.html>`_
  in the *Amazon SageMaker Developer Guide*.
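
As a rough illustration of the memory saving, the following sketch estimates the
per-GPU size of the training state with and without sharding. It assumes FP32
training with an Adam-style optimizer (4 bytes each for a parameter, its
gradient, and two optimizer moments, so 16 bytes per parameter) and ignores
activations, buffers, and any parameters that stay unsharded. The function name
and the numbers are illustrative, not measurements from the library.

.. code:: python

    # Assumption: FP32 weights (4) + FP32 gradients (4) + two Adam moments (8).
    BYTES_PER_PARAMETER = 4 + 4 + 8


    def training_state_bytes_per_gpu(num_parameters, sharded_data_parallel_degree=1):
        """Estimate the training-state bytes each GPU holds.

        The training state is divided evenly across the GPUs of the
        sharded data parallel group.
        """
        if sharded_data_parallel_degree < 1:
            raise ValueError("sharded_data_parallel_degree must be at least 1")
        return num_parameters * BYTES_PER_PARAMETER / sharded_data_parallel_degree


    # A model with one billion parameters, unsharded and sharded over 8 GPUs.
    unsharded = training_state_bytes_per_gpu(1_000_000_000)
    sharded = training_state_bytes_per_gpu(1_000_000_000, sharded_data_parallel_degree=8)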

**Migration to AWS Deep Learning Containers**

This version passed benchmark testing and has been migrated to the following AWS Deep Learning Containers (DLC):

- DLC for PyTorch 1.12.0

.. code::

763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker

Binary file of this version of the library for `custom container
<https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-sm-sdk.html#model-parallel-bring-your-own-container>`_ users:

- For PyTorch 1.12.0

.. code::

https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.12.0/build-artifacts/2022-08-12-16-58/smdistributed_modelparallel-1.11.0-cp38-cp38-linux_x86_64.whl

----

Release History
===============

SageMaker Distributed Model Parallel 1.10.1 Release Notes
---------------------------------------------------------

*Date: August 8, 2022*

**Currency Updates**

* Added support for Transformers v4.21.


**Migration to AWS Deep Learning Containers**

This version passed benchmark testing and has been migrated to the following AWS Deep Learning Containers (DLC):

- DLC for PyTorch 1.11.0

.. code::

763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.11.0-gpu-py38-cu113-ubuntu20.04-sagemaker


Binary file of this version of the library for `custom container
<https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-sm-sdk.html#model-parallel-bring-your-own-container>`_ users:

- For PyTorch 1.11.0

.. code::

https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.11.0/build-artifacts/2022-07-28-23-07/smdistributed_modelparallel-1.10.1-cp38-cp38-linux_x86_64.whl



SageMaker Distributed Model Parallel 1.10.0 Release Notes
---------------------------------------------------------

*Date: July 19, 2022*

**New Features**
Expand Down Expand Up @@ -62,10 +137,6 @@ Binary file of this version of the library for `custom container

https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.12.0/build-artifacts/2022-07-11-19-23/smdistributed_modelparallel-1.10.0-cp38-cp38-linux_x86_64.whl

----

Release History
===============

SageMaker Distributed Model Parallel 1.9.0 Release Notes
--------------------------------------------------------
Expand Down
1 change: 1 addition & 0 deletions doc/api/training/smp_versions/archives.rst
Expand Up @@ -3,6 +3,7 @@
.. toctree::
:maxdepth: 1

v1_10_0.rst
v1_9_0.rst
v1_6_0.rst
v1_5_0.rst
Expand Down
2 changes: 1 addition & 1 deletion doc/api/training/smp_versions/latest.rst
Expand Up @@ -10,7 +10,7 @@ depending on which version of the library you need to use.
To use the library, reference the
**Common API** documentation alongside the framework specific API documentation.

Version 1.10.0 (Latest)
Version 1.11.0 (Latest)
===========================================

To use the library, reference the Common API documentation alongside the framework specific API documentation.