# SageMaker Distributed Model Parallel 1.2.0 Release Notes
- New Features
- Bug Fixes
- Known Issues
## New Features
### PyTorch
#### Add support for PyTorch 1.7
- Adds support for the `gradient_as_bucket_view` (PyTorch 1.7 only), `find_unused_parameters` (PyTorch 1.7 only), and `broadcast_buffers` options to `smp.DistributedModel`. These options behave the same as the corresponding options (with the same names) in the `torch.nn.parallel.DistributedDataParallel` API, as shown in the sketch after this list. Refer to the [SageMaker distributed model parallel API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_pytorch.html#smp.DistributedModel) for more information.
- Adds support for the `join` context manager (PyTorch 1.7 only), used with an instance of `smp.DistributedModel` to train with uneven inputs across participating processes.
- Adds support for `_register_comm_hook` (PyTorch 1.7 only), which registers a callable as a communication hook for DDP. NOTE: As in DDP, this is an experimental API and subject to change.
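
A minimal sketch of how these options and the `join` context manager might be combined. This is illustrative only: the model, the step count, and the training loop are placeholders, and the `join` usage is assumed to mirror DDP's `model.join()` pattern rather than taken from the release itself.

```python
import torch
import smdistributed.modelparallel.torch as smp

smp.init()

# The new options mirror the identically named arguments of
# torch.nn.parallel.DistributedDataParallel.
model = smp.DistributedModel(
    torch.nn.Linear(10, 10),
    gradient_as_bucket_view=True,   # PyTorch 1.7 only
    find_unused_parameters=False,   # PyTorch 1.7 only
    broadcast_buffers=True,         # same semantics as in DDP
)

num_steps = 8  # in real training this may differ per rank (uneven inputs)

# `join` handles uneven numbers of batches across participating processes
# (hypothetical usage, assumed to mirror DDP's context manager).
with model.join():
    for _ in range(num_steps):
        ...  # forward/backward, typically via an @smp.step-decorated function
```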
### TensorFlow
- Adds support for TensorFlow 2.4.
## Bug Fixes
### PyTorch
- `Serialization`: Fixes a bug in serialization/flattening where instances of subclasses of `dict`/`OrderedDict` were serialized/deserialized, or internally flattened/unflattened, as regular dicts.
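
A hypothetical, self-contained illustration of this class of bug (it does not use the smp internals): rebuilding a flattened dict subclass as a plain `dict` silently drops the subclass type and its behavior.

```python
from collections import OrderedDict

class AttrDict(OrderedDict):
    """Example dict subclass a training script might pass around."""
    def __getattr__(self, name):
        return self[name]

outputs = AttrDict(loss=0.5, logits=[1.0, 2.0])

# Buggy round trip: unflattening as a plain dict loses the subclass.
flat = list(outputs.items())
rebuilt = dict(flat)
assert type(rebuilt) is dict        # no longer an AttrDict

# Correct round trip preserves the original type.
rebuilt = type(outputs)(flat)
assert isinstance(rebuilt, AttrDict)  # subclass behavior retained
```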
### TensorFlow
- Fixes a bug that could cause a hang during evaluation when there is no model input for one partition.
## Known Issues
### PyTorch
- A performance regression was observed when training on SMP with PyTorch 1.7.1 compared to 1.6. The root cause was found to be a slowdown of `.grad` attribute accesses in PyTorch 1.7.1 compared to 1.6. See the related discussion: https://github.com/pytorch/pytorch/issues/50636.
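
A hypothetical micro-benchmark (not from the release notes) for observing this regression: run it under PyTorch 1.6 and 1.7.1 and compare the timings of repeated `.grad` accesses.

```python
import time
import torch

# Build some parameters and populate their gradients.
params = [torch.nn.Parameter(torch.randn(64, 64)) for _ in range(100)]
loss = sum((p * p).sum() for p in params)
loss.backward()

# Time repeated `.grad` attribute accesses, the call that regressed in 1.7.1.
start = time.perf_counter()
for _ in range(1000):
    for p in params:
        _ = p.grad
elapsed = time.perf_counter() - start
print(f"{torch.__version__}: .grad accesses took {elapsed:.4f}s")
```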
# SageMaker Distributed Model Parallel 1.1.0 Release Notes