However, the library does not support ``torch.compile`` in this release.
**New Features**

* Sharded data parallelism can now be used together with tensor parallelism
  for PyTorch 1.13.1. This allows you to train with smaller global batch
  sizes while scaling up to large clusters. For more information, see `Sharded
  data parallelism with tensor parallelism <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-sharded-data-parallelism.html#model-parallel-extended-features-pytorch-sharded-data-parallelism-with-tensor-parallelism>`_
  in the *Amazon SageMaker Developer Guide*. A configuration sketch follows
  this list.
* Added support for saving and loading full model checkpoints when using sharded
  data parallelism. This is enabled through the standard checkpointing API,
  ``smp.save_checkpoint`` with ``partial=False``; see the checkpointing sketch
  after this list. Previously, full checkpoints had to be created by merging
  partial checkpoint files after training finished.
* ``DistributedTransformer`` now supports ALiBi position embeddings. When using
  ``DistributedTransformer``, you can set the ``use_alibi`` parameter to ``True``
  to use the Triton-based flash attention kernels. This helps evaluate sequences
  longer than those used for training; a usage sketch follows this list.
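
The sharded data parallelism plus tensor parallelism combination is configured
when the training job is launched. The following is a minimal sketch, not a
definitive recipe: it assumes the SageMaker Python SDK ``PyTorch`` estimator and
the ``sharded_data_parallel_degree`` and ``tensor_parallel_degree`` parameters
described in the developer guide linked above; the degrees, instance settings,
and entry point are placeholders.

.. code-block:: python

    # Sketch: launching a job that combines sharded data parallelism with
    # tensor parallelism. Degrees, instance settings, and the entry point are
    # placeholders; adjust them to your cluster and model.
    from sagemaker.pytorch import PyTorch

    smp_options = {
        "enabled": True,
        "parameters": {
            "ddp": True,
            "sharded_data_parallel_degree": 16,  # placeholder sharding degree
            "tensor_parallel_degree": 4,         # placeholder TP degree
        },
    }

    estimator = PyTorch(
        entry_point="train.py",                  # placeholder training script
        role="<your-iam-role>",
        instance_type="ml.p4d.24xlarge",
        instance_count=8,
        framework_version="1.13.1",
        py_version="py39",
        distribution={
            "smdistributed": {"modelparallel": smp_options},
            "mpi": {"enabled": True, "processes_per_host": 8},
        },
    )
    estimator.fit()

The example degrees multiply to the total number of processes (8 instances with
8 GPUs each); pick values that match your own cluster.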
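
The new full-checkpoint support can be used roughly as follows. This is a
sketch rather than the library's canonical example: only ``smp.save_checkpoint``
with ``partial=False`` comes from this release note, while the keyword
arguments, the ``smp.resume_from_checkpoint`` call, and all paths and tags are
assumptions or placeholders.

.. code-block:: python

    import smdistributed.modelparallel.torch as smp

    def save_full_checkpoint(model, optimizer, step):
        # With partial=False, a single full checkpoint is written instead of
        # per-rank partial files, so no post-training merge step is needed.
        smp.save_checkpoint(
            path="/opt/ml/checkpoints",   # placeholder output location
            tag=f"step_{step}",           # placeholder checkpoint tag
            partial=False,
            model=model,
            optimizer=optimizer,
        )

    def load_full_checkpoint():
        # Assumed counterpart for loading a full checkpoint; check the library
        # documentation for the exact resume workflow and arguments.
        smp.resume_from_checkpoint(
            path="/opt/ml/checkpoints",
            tag="step_1000",              # placeholder tag
            partial=False,
        )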
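
For the ALiBi option, a usage sketch is shown below. It assumes the common
workflow of letting the library replace supported modules with
``DistributedTransformer`` inside the ``smp.tensor_parallelism`` context and
forwarding ``use_alibi=True`` through that context; only the ``use_alibi``
parameter itself comes from this release note, and the model-building helper
is hypothetical.

.. code-block:: python

    import smdistributed.modelparallel.torch as smp

    smp.init()

    # Keyword arguments passed to smp.tensor_parallelism are forwarded to the
    # distributed module constructors, so use_alibi=True reaches the
    # DistributedTransformer blocks that replace the original layers.
    with smp.tensor_parallelism(enabled=True, use_alibi=True):
        model = build_transformer_lm()  # hypothetical helper that builds your model

    model = smp.DistributedModel(model)

With ALiBi enabled, the Triton-based flash attention kernels are used, which
helps when evaluating sequences longer than those seen during training.
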
**Bug Fixes**

* When using tensor parallelism, parameters were unnecessarily initialized
  multiple times. This release fixes the multiple initialization so that each
  parameter is initialized exactly once. This not only saves time, but also
  ensures that the random number generator behavior matches the
  non-tensor-parallelism case.

**Known Issues**

* Model initialization might take longer with PyTorch 2.0 than with PyTorch 1.13.

**Migration to AWS Deep Learning Containers**
Release History
===============
SageMaker Distributed Model Parallel 1.14.0 Release Notes