Commit d57c5b2

smp v1.15.0 release note
1 parent 4c1c118 commit d57c5b2

3 files changed: 81 additions & 9 deletions

doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.rst

Lines changed: 66 additions & 6 deletions
@@ -3,21 +3,47 @@ Release Notes
 #############
 
 New features, bug fixes, and improvements are regularly made to the SageMaker
-distributed model parallel library.
+distributed model parallelism library.
 
 
-SageMaker Distributed Model Parallel 1.14.0 Release Notes
+SageMaker Distributed Model Parallel 1.15.0 Release Notes
 =========================================================
 
-*Date: Jan. 30. 2023*
+*Date: Apr. 25. 2023*
 
 **Currency Updates**
 
-* Added support for PyTorch v1.13.1
+* Added support for PyTorch v2.0.0.
+  However, the library does not support ``torch.compile`` in this release.
 
-**Improvements**
+**New Features**
 
-* Upgraded the flash-attention (https://github.com/HazyResearch/flash-attention) library to v0.2.6.post1
+* Sharded data parallelism and tensor parallelism can now be used together
+  for PyTorch 1.13.1. This allows you to train with smaller global batch
+  sizes while scaling up to large clusters. For more information, see `Sharded
+  data parallelism with tensor parallelism <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-sharded-data-parallelism.html#model-parallel-extended-features-pytorch-sharded-data-parallelism-with-tensor-parallelism>`_
+  in the *Amazon SageMaker Developer Guide*.
+* Added support for saving and loading full model checkpoints when using sharded
+  data parallelism. This is enabled by using the standard checkpointing API,
+  ``smp.save_checkpoint``, with ``partial=False``.
+  Previously, full checkpoints had to be created by merging partial checkpoint
+  files after training finished.
+* ``DistributedTransformer`` now supports ALiBi position embeddings.
+  When using ``DistributedTransformer``, you can set the ``use_alibi`` parameter
+  to ``True`` to use the Triton-based flash attention kernels. This helps
+  evaluate sequences longer than those used for training.
+
+**Bug Fixes**
+
+* When using tensor parallelism, parameters were initialized multiple times
+  unnecessarily. This release fixed the multiple initialization of parameters
+  so that each parameter is initialized exactly once.
+  This not only saves time, but also ensures that the random generator behavior
+  is similar to the non-tensor-parallelism case.
+
+**Known Issues**
+
+* Model initialization might take longer with PyTorch 2.0 than with PyTorch 1.13.
 
 **Migration to AWS Deep Learning Containers**
 
@@ -44,6 +70,40 @@ Binary file of this version of the library for `custom container
 Release History
 ===============
 
+SageMaker Distributed Model Parallel 1.14.0 Release Notes
+---------------------------------------------------------
+
+*Date: Jan. 30. 2023*
+
+**Currency Updates**
+
+* Added support for PyTorch v1.13.1
+
+**Improvements**
+
+* Upgraded the flash-attention (https://github.com/HazyResearch/flash-attention) library to v0.2.6.post1
+
+**Migration to AWS Deep Learning Containers**
+
+This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC):
+
+- SageMaker training container for PyTorch v1.13.1
+
+  .. code::
+
+    763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker
+
+
+Binary file of this version of the library for `custom container
+<https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-sm-sdk.html#model-parallel-bring-your-own-container>`_ users:
+
+- For PyTorch 1.13.1
+
+  .. code::
+
+    https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.13.1/build-artifacts/2023-01-19-18-35/smdistributed_modelparallel-1.14.0-cp39-cp39-linux_x86_64.whl
+
+
 SageMaker Distributed Model Parallel 1.13.0 Release Notes
 ---------------------------------------------------------
 

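To make the 1.15.0 notes above concrete, here is a minimal sketch of launching a job that combines sharded data parallelism with tensor parallelism through the SageMaker Python SDK. The entry point, IAM role, instance sizing, and parallelism degrees are placeholder values; the authoritative parameter list is the developer guide page linked in the release note.

.. code:: python

   # Minimal sketch, assuming the SageMaker Python SDK PyTorch estimator.
   # train.py, the IAM role ARN, and the degree values are placeholders.
   from sagemaker.pytorch import PyTorch

   estimator = PyTorch(
       entry_point="train.py",          # placeholder training script
       role="arn:aws:iam::111122223333:role/ExampleSageMakerRole",  # placeholder
       instance_type="ml.p4d.24xlarge",
       instance_count=2,
       framework_version="1.13.1",
       py_version="py39",
       distribution={
           "smdistributed": {
               "modelparallel": {
                   "enabled": True,
                   "parameters": {
                       "sharded_data_parallel_degree": 8,  # shard states across 8 ranks
                       "tensor_parallel_degree": 2,        # split layer weights across 2 ranks
                       "ddp": True,
                   },
               }
           },
           "mpi": {"enabled": True, "processes_per_host": 8},
       },
   )
   estimator.fit()

And a minimal sketch of the full-checkpoint save described above, assuming a training script that has already wrapped its model and optimizer with the library; ``build_model()``, the checkpoint path, and the tag are placeholders.

.. code:: python

   # Minimal sketch inside a training script; build_model(), the checkpoint
   # path, and the tag are placeholders.
   import torch
   import smdistributed.modelparallel.torch as smp

   smp.init()
   model = smp.DistributedModel(build_model())
   optimizer = smp.DistributedOptimizer(torch.optim.Adam(model.parameters()))

   # ... training loop ...

   # Save one full (unpartitioned) checkpoint instead of per-rank partial files.
   smp.save_checkpoint(
       path="/opt/ml/checkpoints",
       tag="model_final",
       partial=False,
       model=model,
       optimizer=optimizer,
   )

   # To load it later, use the matching resume/load API (for example,
   # smp.resume_from_checkpoint with partial=False); see the checkpoint API docs.

With ``partial=False``, the library writes a full checkpoint directly, so the separate merge step that earlier versions required is no longer needed.
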
doc/api/training/smp_versions/latest.rst

Lines changed: 2 additions & 2 deletions
@@ -10,8 +10,8 @@ depending on which version of the library you need to use.
 To use the library, reference the
 **Common API** documentation alongside the framework specific API documentation.
 
-Version 1.11.0, 1.13.0, 1.14.0 (Latest)
-=======================================
+Version 1.11.0, 1.13.0, 1.14.0, 1.15.0 (Latest)
+===============================================
 
 To use the library, reference the Common API documentation alongside the framework specific API documentation.
 

doc/api/training/smp_versions/latest/smd_model_parallel_pytorch_tensor_parallel.rst

Lines changed: 13 additions & 1 deletion
@@ -301,7 +301,19 @@ Tensor Parallelism Module APIs
   ``post_layernorm`` must be ``True``.
 - ``post_layernorm``: If ``True``, inserts layer normalization at
   the output. At least one of ``pre_layernorm`` and
-  ``post_layernorm`` must be ``True``.
+  ``post_layernorm`` must be ``True``. (Available from
+  the SageMaker model parallelism library v1.15.0.)
+- ``use_alibi`` (bool, default False): Activates Attention with
+  Linear Biases (ALiBi) for attention computation.
+  ALiBi facilitates efficient extrapolation on input sequences
+  and thus improves training efficiency.
+  The library enables ALiBi by using the `Triton
+  flash attention kernel
+  <https://github.com/HazyResearch/flash-attention>`_.
+  Refer to https://arxiv.org/abs/2108.12409 for more
+  details on the technique.
+  (Available from
+  the SageMaker model parallelism library v1.15.0.)
 
 - **Methods:**
 

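As a quick illustration of the new parameter, the following is a minimal sketch of constructing a distributed transformer layer with ALiBi enabled; the dimension values are arbitrary examples, and the remaining arguments follow the parameter list documented above.

.. code:: python

   # Minimal sketch: enabling ALiBi position embeddings on a distributed
   # transformer layer. Dimension values are arbitrary examples.
   import smdistributed.modelparallel.torch as smp

   smp.init()

   layer = smp.nn.DistributedTransformerLayer(
       num_attention_heads=16,
       attention_head_size=64,
       hidden_size=1024,
       intermediate_size=4096,
       causal_mask_size=2048,    # decoder-style causal attention, example length
       pre_layernorm=True,
       post_layernorm=False,     # at least one of pre/post layernorm must be True
       use_alibi=True,           # new in v1.15.0: Triton flash-attention ALiBi kernels
   )

Because ALiBi biases attention scores by relative token distance instead of adding learned position embeddings, a model trained this way can be evaluated on sequences longer than those it was trained on, which is the extrapolation benefit the release note refers to.
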