Commit 4d69fcd

documentation: smddp 1.2.1 release note / convert md to rst
1 parent 939fab0 commit 4d69fcd

File tree

4 files changed: +178 -96 lines

doc/api/training/sdp_versions/latest.rst

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@

-Version 1.2.0 (Latest)
+Version 1.2.x (Latest)
 ======================

 .. toctree::

doc/api/training/smd_data_parallel.rst

Lines changed: 6 additions & 4 deletions
@@ -101,8 +101,10 @@ Select a version to see the API documentation for version.
 Release Notes
 =============

-New features, bug fixes, and improvements are regularly made to the SageMaker distributed data parallel library.
+New features, bug fixes, and improvements are regularly made to the SageMaker
+distributed data parallel library.

-To see the the latest changes made to the library, refer to the library
-`Release Notes
-<https://github.com/aws/sagemaker-python-sdk/blob/master/doc/api/training/smd_data_parallel_release_notes/>`_.
+.. toctree::
+   :maxdepth: 1
+
+   smd_data_parallel_release_notes/smd_data_parallel_change_log

doc/api/training/smd_data_parallel_release_notes/smd_data_parallel_change_log.md

Lines changed: 0 additions & 91 deletions
This file was deleted.
doc/api/training/smd_data_parallel_release_notes/smd_data_parallel_change_log.rst

Lines changed: 171 additions & 0 deletions
@@ -0,0 +1,171 @@

SageMaker Distributed Data Parallel 1.2.1 Release Notes
=======================================================

**New Features:**

- Added support for TensorFlow 2.5.0.

**Improvements:**

- Improved performance on a single node.
- Improved performance on small clusters (2-4 nodes).
- Improved performance of ``Accumulator``.

**Bug Fixes:**

- Fixed device selection for SageMaker.
- Enabled ``sparse_as_dense`` by default in the SageMaker distributed data
  parallel library for the TensorFlow APIs ``DistributedGradientTape`` and
  ``DistributedOptimizer`` (see the sketch below).

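The two APIs named above are the library's TensorFlow 2.x hooks. Below is a
minimal sketch of a training step that uses them, based on the documented
``smdistributed.dataparallel.tensorflow`` entry points (``init``,
``local_rank``, ``size``, and ``DistributedGradientTape``); the model,
optimizer, and loss are placeholders, not part of this release.

.. code:: python

   # Minimal sketch of a TF2 training step with the SageMaker distributed
   # data parallel library; the model, optimizer, and loss are placeholders.
   import tensorflow as tf
   import smdistributed.dataparallel.tensorflow as sdp

   sdp.init()  # initialize the library for this process

   # Pin each process to a single GPU.
   gpus = tf.config.experimental.list_physical_devices("GPU")
   if gpus:
       tf.config.experimental.set_visible_devices(gpus[sdp.local_rank()], "GPU")

   model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
   loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
   optimizer = tf.keras.optimizers.SGD(0.01 * sdp.size())  # scale LR by workers

   @tf.function
   def training_step(features, labels):
       with tf.GradientTape() as tape:
           loss = loss_fn(labels, model(features, training=True))
       # Wrapping the tape averages gradients across workers; as of v1.2.1,
       # sparse gradients are treated as dense here by default.
       tape = sdp.DistributedGradientTape(tape)
       grads = tape.gradient(loss, model.trainable_variables)
       optimizer.apply_gradients(zip(grads, model.trainable_variables))
       return loss
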
**Migration to AWS Deep Learning Containers**

This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers:

- TensorFlow 2.5.0 DLC release: `v1.0-tf-2.5.0-tr-py37
  <https://github.com/aws/deep-learning-containers/releases/tag/v1.0-tf-2.5.0-tr-py37>`__

.. code::

   763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.5.0-gpu-py37-cu112-ubuntu18.04-v1.0

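As a usage note rather than part of the release notes, the sketch below shows
one way to target this DLC from the SageMaker Python SDK by requesting
TensorFlow 2.5.0 with the data parallel library enabled; the entry point,
role ARN, instance settings, and S3 path are placeholder assumptions.

.. code:: python

   # Hypothetical launch script; train.py, the role ARN, and the S3 path
   # are placeholders.
   from sagemaker.tensorflow import TensorFlow

   estimator = TensorFlow(
       entry_point="train.py",
       role="arn:aws:iam::111122223333:role/SageMakerRole",
       instance_count=2,
       instance_type="ml.p4d.24xlarge",
       framework_version="2.5.0",
       py_version="py37",
       # Enables the SageMaker distributed data parallel library.
       distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
   )
   estimator.fit("s3://my-bucket/training-data")
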
----

Release History
===============

SageMaker Distributed Data Parallel 1.2.0 Release Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- New Features
- Bug Fixes

**New Features:**

- Support for the `EFA network
  interface <https://aws.amazon.com/hpc/efa/>`__ for distributed
  AllReduce. For best performance, it is recommended that you use an
  instance type that supports Amazon Elastic Fabric Adapter
  (ml.p3dn.24xlarge or ml.p4d.24xlarge) when you train a model using
  the SageMaker distributed data parallel library.

**Bug Fixes:**

- Improved performance on a single node and on small clusters.

----

SageMaker Distributed Data Parallel 1.1.2 Release Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Bug Fixes
- Known Issues

**Bug Fixes:**

- Fixed a bug that caused some TensorFlow operations to not work with
  certain data types. Operations forwarded from C++ have been extended
  to support every dtype supported by NCCL.

**Known Issues:**

- The SageMaker distributed data parallel library has slower throughput
  than NCCL when run on a single node. For the best performance, use
  multi-node distributed training with ``smdistributed.dataparallel``.
  Use a single node only for experimental runs while preparing your
  training pipeline.

----

SageMaker Distributed Data Parallel 1.1.1 Release Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- New Features
- Bug Fixes
- Known Issues

**New Features:**

- Adds support for PyTorch 1.8.1.

**Bug Fixes:**

- Fixes a bug that caused gradients from one of the worker nodes to be
  added twice, resulting in incorrect ``all_reduce`` results under some
  conditions (see the sketch below).

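For context on the code path this fix touches, below is a minimal sketch of
the documented PyTorch usage for the 1.x releases, in which gradients are
averaged across workers with ``all_reduce`` during the backward pass; the
model and training data are placeholders.

.. code:: python

   # Minimal PyTorch sketch with the SageMaker distributed data parallel
   # library (1.x API); the model and training data are placeholders.
   import torch
   import smdistributed.dataparallel.torch.distributed as dist
   from smdistributed.dataparallel.torch.parallel.distributed import (
       DistributedDataParallel as DDP,
   )

   dist.init_process_group()  # one process per GPU
   torch.cuda.set_device(dist.get_local_rank())

   model = DDP(torch.nn.Linear(10, 2).cuda())
   optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
   loss_fn = torch.nn.CrossEntropyLoss()

   features = torch.randn(32, 10).cuda()
   labels = torch.randint(0, 2, (32,)).cuda()

   optimizer.zero_grad()
   loss = loss_fn(model(features), labels)
   loss.backward()  # gradients are all_reduce-averaged across workers here
   optimizer.step()
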
**Known Issues:**

- The SageMaker distributed data parallel library is still not efficient
  when run on a single node. For the best performance, use multi-node
  distributed training with ``smdistributed.dataparallel``. Use a single
  node only for experimental runs while preparing your training pipeline.

----

SageMaker Distributed Data Parallel 1.1.0 Release Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- New Features
- Bug Fixes
- Improvements
- Known Issues

**New Features:**

- Adds support for PyTorch 1.8.0 with CUDA 11.1 and cuDNN 8.

**Bug Fixes:**

- Fixes a crash that occurred when importing ``smdataparallel`` before
  PyTorch.

**Improvements:**

- Updates the ``smdataparallel`` name in Python packages, descriptions,
  and log outputs.

**Known Issues:**

- SageMaker DataParallel is not efficient when run on a single node.
  For the best performance, use multi-node distributed training with
  ``smdataparallel``. Use a single node only for experimental runs
  while preparing your training pipeline.

**Getting Started:**

For getting started, refer to the `SageMaker Distributed Data Parallel
Python SDK Guide
<https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-use-python-skd-api>`__.

----

SageMaker Distributed Data Parallel 1.0.0 Release Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- First Release
- Getting Started

First Release
-------------

SageMaker’s distributed data parallel library extends SageMaker’s
training capabilities on deep learning models with near-linear scaling
efficiency, achieving fast time-to-train with minimal code changes.
SageMaker Distributed Data Parallel:

- optimizes your training job for AWS network infrastructure and EC2
  instance topology.
- takes advantage of gradient updates to communicate between nodes with
  a custom AllReduce algorithm.

The library currently supports TensorFlow v2 and PyTorch via `AWS Deep
Learning
Containers <https://aws.amazon.com/machine-learning/containers/>`__.

Getting Started
---------------

For getting started, refer to the `SageMaker Distributed Data Parallel
Python SDK
Guide <https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-use-python-skd-api>`__.
