`doc/api/training/smd_data_parallel_release_notes/smd_data_parallel_change_log.md`

# SageMaker Distributed Data Parallel 1.2.0 Release Notes

* New features
* Bug Fixes

*New features:*

* Support for the [EFA network interface](https://aws.amazon.com/hpc/efa/) for distributed AllReduce. For the best performance, it is recommended that you use an instance type that supports Amazon Elastic Fabric Adapter (ml.p3dn.24xlarge or ml.p4d.24xlarge) when you train a model using SageMaker distributed data parallel (see the launch sketch below).
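
A minimal launch sketch, assuming the SageMaker Python SDK (v2) with a PyTorch estimator; the entry-point script `train.py`, the IAM role ARN, and the framework/Python versions are placeholders, not part of this release note:

```python
# Sketch: launch a training job on an EFA-capable instance type with the
# smdistributed.dataparallel distribution enabled. Script name, role ARN,
# and framework versions below are placeholders.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                               # placeholder training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder IAM role
    framework_version="1.8.1",
    py_version="py36",
    instance_type="ml.p4d.24xlarge",                      # EFA-capable instance type
    instance_count=2,                                     # multi-node is recommended
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
estimator.fit()
```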

*Bug Fixes:*

* Improved performance on a single node and on small clusters.

# SageMaker Distributed Data Parallel 1.1.2 Release Notes

* Bug Fixes
* Known Issues

*Bug Fixes:*

* Fixed a bug that caused some TensorFlow operations to fail for certain data types. Operations forwarded from C++ have been extended to support every dtype supported by NCCL (see the illustrative sketch below).
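
For illustration only, a hedged sketch of a TF2 training step whose gradient AllReduce goes through `smdistributed.dataparallel`; with this fix, the C++-forwarded collectives cover any NCCL-supported dtype (float16 in this sketch). The model, optimizer, and loss below are placeholders, not the library's own test:

```python
# Sketch: gradient allreduce via smdistributed.dataparallel for a float16 model.
# Model, optimizer, loss, and data are illustrative placeholders.
import tensorflow as tf
import smdistributed.dataparallel.tensorflow as sdp

sdp.init()

# A deliberately float16 model so its variables and gradients are float16.
model = tf.keras.Sequential([tf.keras.layers.Dense(10, dtype="float16")])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

@tf.function
def train_step(images, labels, first_batch):
    with tf.GradientTape() as tape:
        logits = model(images, training=True)
        loss = loss_fn(labels, logits)
    # DistributedGradientTape averages the gradients across workers with AllReduce.
    tape = sdp.DistributedGradientTape(tape)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    if first_batch:
        # Synchronize initial variables across workers after the first step.
        sdp.broadcast_variables(model.variables, root_rank=0)
        sdp.broadcast_variables(optimizer.variables(), root_rank=0)
    return loss
```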

*Known Issues:*

* SageMaker distributed data parallel has slower throughput than NCCL when run on a single node. For the best performance, use multi-node distributed training with smdistributed.dataparallel. Use a single node only for experimental runs while preparing your training pipeline.

# SageMaker Distributed Data Parallel 1.1.1 Release Notes