
Commit 3040723

documentation: smddp 1.2.1 release note / convert md to rst
1 parent 939fab0 commit 3040723

4 files changed (+180 / -96 lines)


doc/api/training/sdp_versions/latest.rst

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@

- Version 1.2.0 (Latest)
+ Version 1.2.x (Latest)
  ======================

  .. toctree::

doc/api/training/smd_data_parallel.rst

Lines changed: 6 additions & 4 deletions
@@ -101,8 +101,10 @@ Select a version to see the API documentation for version.
  Release Notes
  =============

- New features, bug fixes, and improvements are regularly made to the SageMaker distributed data parallel library.
+ New features, bug fixes, and improvements are regularly made to the SageMaker
+ distributed data parallel library.

- To see the the latest changes made to the library, refer to the library
- `Release Notes
- <https://github.com/aws/sagemaker-python-sdk/blob/master/doc/api/training/smd_data_parallel_release_notes/>`_.
+ .. toctree::
+    :maxdepth: 1
+
+    smd_data_parallel_release_notes/smd_data_parallel_change_log

doc/api/training/smd_data_parallel_release_notes/smd_data_parallel_change_log.md

Lines changed: 0 additions & 91 deletions
This file was deleted.
doc/api/training/smd_data_parallel_release_notes/smd_data_parallel_change_log.rst

Lines changed: 173 additions & 0 deletions

@@ -0,0 +1,173 @@
SageMaker Distributed Data Parallel 1.2.1 Release Notes
=========================================================

*Date: June 29, 2021*

**New Features:**

- Added support for TensorFlow 2.5.0.

**Improvements:**

- Improved performance on a single node.
- Improved performance on small clusters (2-4 nodes).
- Improved performance of ``Accumulator``.

**Bug Fixes:**

- Fixed device selection for SageMaker.
- Enabled ``sparse_as_dense`` by default for the SageMaker distributed data
  parallel library TensorFlow APIs ``DistributedGradientTape`` and
  ``DistributedOptimizer`` (see the sketch below).
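For orientation, the following is a minimal sketch of a TensorFlow 2.x training
step that uses ``DistributedGradientTape``. The model, optimizer, and data are
placeholders, and the helper calls (``init``, ``local_rank``, ``size``,
``broadcast_variables``) are assumed from the library's
``smdistributed.dataparallel.tensorflow`` module rather than taken from these notes.

.. code:: python

   import tensorflow as tf
   import smdistributed.dataparallel.tensorflow as sdp

   sdp.init()

   # Pin each worker process to a single GPU.
   gpus = tf.config.experimental.list_physical_devices("GPU")
   if gpus:
       tf.config.experimental.set_visible_devices(gpus[sdp.local_rank()], "GPU")

   model = tf.keras.Sequential([tf.keras.layers.Dense(10)])  # placeholder model
   loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
   optimizer = tf.keras.optimizers.SGD(learning_rate=0.001 * sdp.size())

   @tf.function
   def training_step(images, labels, first_batch):
       with tf.GradientTape() as tape:
           loss = loss_fn(labels, model(images, training=True))

       # Wrap the tape so gradients are allreduced across workers;
       # as of 1.2.1, sparse_as_dense is enabled by default here.
       tape = sdp.DistributedGradientTape(tape)
       grads = tape.gradient(loss, model.trainable_variables)
       optimizer.apply_gradients(zip(grads, model.trainable_variables))

       if first_batch:
           # Synchronize initial variables from rank 0 to all workers.
           sdp.broadcast_variables(model.variables, root_rank=0)
           sdp.broadcast_variables(optimizer.variables(), root_rank=0)
       return loss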

**Migration to AWS Deep Learning Containers**

This version passed benchmark testing and has been migrated to the following
AWS Deep Learning Containers:

- TensorFlow 2.5.0 DLC release: `v1.0-tf-2.5.0-tr-py37
  <https://github.com/aws/deep-learning-containers/releases/tag/v1.0-tf-2.5.0-tr-py37>`__

.. code::

   763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.5.0-gpu-py37-cu112-ubuntu18.04-v1.0
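A minimal launch sketch, assuming the SageMaker Python SDK's ``TensorFlow``
estimator and its ``distribution`` parameter; the entry point, role lookup, and
instance settings are placeholders, and ``framework_version``/``py_version``
are chosen to match the container above.

.. code:: python

   import sagemaker
   from sagemaker.tensorflow import TensorFlow

   estimator = TensorFlow(
       entry_point="train.py",            # placeholder training script
       role=sagemaker.get_execution_role(),
       instance_count=2,
       instance_type="ml.p4d.24xlarge",   # EFA-enabled instance type
       framework_version="2.5.0",
       py_version="py37",
       # Enable the SageMaker distributed data parallel library.
       distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
   )
   estimator.fit()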

----

Release History
===============

SageMaker Distributed Data Parallel 1.2.0 Release Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- New Features
- Bug Fixes

**New Features:**

- Support for the `EFA network
  interface <https://aws.amazon.com/hpc/efa/>`__ for distributed
  AllReduce. For best performance, it is recommended that you use an
  instance type that supports Amazon Elastic Fabric Adapter
  (ml.p3dn.24xlarge and ml.p4d.24xlarge) when you train a model using
  SageMaker distributed data parallel.

**Bug Fixes:**

- Improved performance on a single node and on small clusters.

----

SageMaker Distributed Data Parallel 1.1.2 Release Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Bug Fixes
- Known Issues

**Bug Fixes:**

- Fixed a bug that caused some TensorFlow operations to not work with
  certain data types. Operations forwarded from C++ have been extended
  to support every dtype supported by NCCL.

**Known Issues:**

- SageMaker distributed data parallel has slower throughput than NCCL
  when run on a single node. For the best performance, use multi-node
  distributed training with ``smdistributed.dataparallel``. Use a single
  node only for experimental runs while preparing your training pipeline.

----

SageMaker Distributed Data Parallel 1.1.1 Release Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- New Features
- Bug Fixes
- Known Issues

**New Features:**

- Added support for PyTorch 1.8.1.

**Bug Fixes:**

- Fixed a bug that caused gradients from one of the worker nodes to be
  added twice, resulting in incorrect ``all_reduce`` results under some
  conditions.

**Known Issues:**

- SageMaker distributed data parallel is still not efficient when run on
  a single node. For the best performance, use multi-node distributed
  training with ``smdistributed.dataparallel``. Use a single node only
  for experimental runs while preparing your training pipeline.

----

SageMaker Distributed Data Parallel 1.1.0 Release Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- New Features
- Bug Fixes
- Improvements
- Known Issues

**New Features:**

- Added support for PyTorch 1.8.0 with CUDA 11.1 and cuDNN 8.

**Bug Fixes:**

- Fixed a crash when importing ``smdataparallel`` before PyTorch (see the
  note below).
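A minimal illustration of the import ordering this fix addresses, assuming the
library's PyTorch module path ``smdistributed.dataparallel.torch.distributed``:

.. code:: python

   # Earlier versions crashed if the library was imported before PyTorch;
   # with this fix, either import order works.
   import smdistributed.dataparallel.torch.distributed as dist
   import torch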

**Improvements:**

- Updated the ``smdataparallel`` name in Python packages, descriptions,
  and log outputs.

**Known Issues:**

- SageMaker distributed data parallel is not efficient when run on a
  single node. For the best performance, use multi-node distributed
  training with ``smdataparallel``. Use a single node only for
  experimental runs while preparing your training pipeline.

Getting Started
---------------

To get started, refer to the `SageMaker Distributed Data Parallel Python
SDK Guide
<https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-use-python-skd-api>`__.

----

SageMaker Distributed Data Parallel 1.0.0 Release Notes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- First Release
- Getting Started

First Release
-------------

SageMaker's distributed data parallel library extends SageMaker's
training capabilities on deep learning models with near-linear scaling
efficiency, achieving fast time-to-train with minimal code changes.
SageMaker Distributed Data Parallel:

- optimizes your training job for AWS network infrastructure and EC2
  instance topology.
- takes advantage of gradient updates to communicate between nodes with
  a custom AllReduce algorithm.

The library currently supports TensorFlow v2 and PyTorch via `AWS Deep
Learning Containers <https://aws.amazon.com/machine-learning/containers/>`__.

Getting Started
---------------

To get started, refer to the `SageMaker Distributed Data Parallel Python
SDK Guide
<https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-use-python-skd-api>`__.
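
To give a concrete sense of the "minimal code changes" involved, the following
is a rough PyTorch sketch. The model, optimizer, and data are placeholders, and
the module paths (``smdistributed.dataparallel.torch.distributed`` and the
``DistributedDataParallel`` wrapper) are assumptions based on the library's
package name rather than details stated in these notes.

.. code:: python

   import torch
   import smdistributed.dataparallel.torch.distributed as dist
   from smdistributed.dataparallel.torch.parallel.distributed import (
       DistributedDataParallel as DDP,
   )

   dist.init_process_group()

   # Pin each process to a single GPU based on its local rank.
   torch.cuda.set_device(dist.get_local_rank())

   model = DDP(torch.nn.Linear(10, 2).cuda())  # placeholder model, wrapped for allreduce
   optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
   loss_fn = torch.nn.CrossEntropyLoss()

   # Synthetic placeholder data; a real job would use a DistributedSampler.
   data = [(torch.randn(32, 10), torch.randint(0, 2, (32,))) for _ in range(10)]

   for inputs, targets in data:
       inputs, targets = inputs.cuda(), targets.cuda()
       optimizer.zero_grad()
       loss = loss_fn(model(inputs), targets)
       loss.backward()
       optimizer.step()

   # Save the trained model from the leader process only.
   if dist.get_rank() == 0:
       torch.save(model.state_dict(), "model.pt")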
