
Commit 7933173

Merge branch 'master' into clarify-acc
2 parents: b9f8ee4 + 7acd890

24 files changed: +1432 −85 lines
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+
+Version 1.1.0 (Latest)
+======================
+
+.. toctree::
+   :maxdepth: 1
+
+   latest/smd_data_parallel_pytorch.rst
+   latest/smd_data_parallel_tensorflow.rst

doc/api/training/sdp_versions/v1_1_0.rst

Lines changed: 0 additions & 9 deletions
This file was deleted.

doc/api/training/smd_data_parallel.rst

Lines changed: 1 addition & 1 deletion
@@ -84,7 +84,7 @@ Select a version to see the API documentation for version.
 .. toctree::
    :maxdepth: 1
 
-   sdp_versions/v1_1_0.rst
+   sdp_versions/latest.rst
    sdp_versions/v1_0_0.rst
 
 .. important::

doc/api/training/smd_model_parallel.rst

Lines changed: 1 addition & 0 deletions
@@ -34,6 +34,7 @@ Select a version to see the API documentation for version. To use the library, r
 .. toctree::
    :maxdepth: 1
 
+   smp_versions/latest.rst
    smp_versions/v1_2_0.rst
    smp_versions/v1_1_0.rst
 
doc/api/training/smd_model_parallel_general.rst

Lines changed: 83 additions & 57 deletions
Large diffs are not rendered by default.

doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.md

Lines changed: 41 additions & 2 deletions
@@ -1,3 +1,42 @@
+# Sagemaker Distributed Model Parallel 1.3.0 Release Notes
+
+- New Features
+- Bug Fixes
+- Known Issues
+
+## New Features
+
+### PyTorch
+
+#### Add support for PyTorch 1.8
+
+- Adds a new method, ``register_comm_hook``, to DistributedModel (for PyTorch 1.8 and newer only). This method behaves the same as the corresponding method with the same name in
+the `torch.DistributedDataParallel` API. Refer to the [SageMaker distributed model parallel API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_pytorch.html#smp.DistributedModel) for more information.
+
+#### Others
+- Adds a configuration, ``active_microbatches``, to the SageMaker SDK API for launching jobs, to control the number of active microbatches during training. This helps limit memory usage in cases where the number of microbatches is high. Refer to the [SageMaker Python SDK parameters API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_general.html) for more information.
+
+- Adds a configuration, ``deterministic_server``, to the SageMaker SDK API for launching jobs, which ensures that the execution server for pipeline parallelism processes requests in a deterministic order across data parallel ranks. Refer to the [SageMaker Python SDK parameters API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_general.html) for more information.
+
+- Parameter passing is now supported in ``module.forward`` methods for DistributedModel and its submodules. This removes the restriction of having to pass `nn.Parameter` to the `__init__` call and make it a member of the module in order to use it.
+## Bug Fixes
+
+### PyTorch
+
+- Fixed a case where training hangs because a module has computation that requires grads but is not used by the final output of the module. Such a situation now raises an error with suggestions on making the computation compatible.
+
+- Fixed an issue that caused buffers to not be on the correct device after a model is partitioned, and to not be synchronized across steps (when ``broadcast_buffers`` is True). This could have caused correctness issues in models with buffers.
+
+## Known Issues
+
+### PyTorch
+
+- ``mp_barrier`` and ``get_mp_process_group`` are wrongly marked as deprecated methods. Ignore the deprecation warning.
+
+- A crash was observed when ``optimizer.step()`` was called for certain optimizers, such as AdaDelta, when the partition on which this method was called has no local parameters assigned to it after partitioning. This is due to a bug in PyTorch which [has since been fixed](https://github.com/pytorch/pytorch/pull/52944). Until that fix makes its way into the next PyTorch release, only call ``optimizer.step()`` on processes that have at least one local parameter, which can be checked with ``len(list(model.local_parameters())) > 0``.
+
+- A performance regression still exists when training on SMP with PyTorch 1.7.1 compared to 1.6. The root cause was found to be a slowdown in the performance of `.grad` method calls in PyTorch 1.7.1 compared to 1.6. See the related discussion: https://github.com/pytorch/pytorch/issues/50636. This issue does not exist with PyTorch 1.8.
+
 # Sagemaker Distributed Model Parallel 1.2.0 Release Notes
 
 - New Features
@@ -11,7 +50,7 @@
 #### Add support for PyTorch 1.7.1
 
 - Adds support for `gradient_as_bucket_view` (PyTorch 1.7.1 only), `find_unused_parameters` (PyTorch 1.7.1 only) and `broadcast_buffers` options to `smp.DistributedModel`. These options behave the same as the corresponding options (with the same names) in
-`torch.DistributedDataParallel` API. Please refer to the [SageMaker distributed model parallel API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_pytorch.html#smp.DistributedModel) for more information.
+`torch.DistributedDataParallel` API. Refer to the [SageMaker distributed model parallel API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_pytorch.html#smp.DistributedModel) for more information.
 
 - Adds support for `join` (PyTorch 1.7.1 only) context manager, which is to be used in conjunction with an instance of `smp.DistributedModel` to be able to train with uneven inputs across participating processes.
 
@@ -36,7 +75,7 @@ regular dicts.
 
 ### PyTorch
 
-- A performance regression was observed when training on SMP with PyTorch 1.7.1 compared to 1.6.0. The rootcause was found to be the slowdown in performance of `.grad` method calls in PyTorch 1.7.1 compared to 1.6.0. Please see the related discussion: https://github.com/pytorch/pytorch/issues/50636.
+- A performance regression was observed when training on SMP with PyTorch 1.7.1 compared to 1.6.0. The rootcause was found to be the slowdown in performance of `.grad` method calls in PyTorch 1.7.1 compared to 1.6.0. See the related discussion: https://github.com/pytorch/pytorch/issues/50636.
 
 
 # Sagemaker Distributed Model Parallel 1.1.0 Release Notes
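A rough sketch of how the ``register_comm_hook`` method described in the 1.3.0 notes above might be used. The model definition and the choice of PyTorch's built-in FP16 compression hook are illustrative assumptions, not part of this commit, and the `(state, hook)` call form is assumed to match the DDP method it mirrors.

```python
# Sketch only -- assumes a SageMaker job launched with the model parallelism
# library enabled and PyTorch >= 1.8 on every worker.
import torch.nn as nn
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

import smdistributed.modelparallel.torch as smp

smp.init()


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(128, 64)
        self.fc2 = nn.Linear(64, 10)

    def forward(self, x):
        return self.fc2(self.fc1(x))


model = smp.DistributedModel(Net())

# Register PyTorch's built-in hook that casts gradients to FP16 before
# allreduce, exactly as one would on a torch DistributedDataParallel model.
model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)
```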
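The new launch parameters ``active_microbatches`` and ``deterministic_server`` are passed through the SageMaker Python SDK when the training job is created. A hedged sketch; the entry point, role, instance type, and the remaining parameter values are placeholders:

```python
from sagemaker.pytorch import PyTorch

smp_parameters = {
    "partitions": 2,
    "microbatches": 8,
    # New in 1.3.0: cap how many microbatches are in flight at once,
    # to limit memory usage when the microbatch count is high.
    "active_microbatches": 4,
    # New in 1.3.0: make the pipeline execution server process requests in a
    # deterministic order across data parallel ranks.
    "deterministic_server": True,
}

estimator = PyTorch(
    entry_point="train.py",                        # placeholder script
    role="arn:aws:iam::111122223333:role/SMRole",  # placeholder role
    instance_count=1,
    instance_type="ml.p3.16xlarge",
    framework_version="1.8.0",
    py_version="py36",
    distribution={
        "smdistributed": {
            "modelparallel": {"enabled": True, "parameters": smp_parameters},
        },
        "mpi": {"enabled": True, "processes_per_host": 8},
    },
)

estimator.fit("s3://my-bucket/train/")  # placeholder input channel
```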
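The "parameter passing" bullet means an ``nn.Parameter`` can now travel through the ``forward`` arguments of a DistributedModel's submodules instead of having to be created in the submodule's ``__init__``. A hypothetical illustration, not taken from the commit:

```python
import torch
import torch.nn as nn


class Scale(nn.Module):
    # Before 1.3.0, a parameter used here had to be passed to __init__ and
    # stored as a member of this module; now it can arrive as an argument.
    def forward(self, x, scale):
        return x * scale


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1))
        self.block = Scale()

    def forward(self, x):
        # Hand the nn.Parameter straight to the submodule's forward().
        return self.block(x, self.scale)
```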
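The AdaDelta known issue comes with a suggested guard. Folded into the library's usual ``smp.step`` training pattern, it might look like this (a sketch; the loss function and loop wiring are assumptions):

```python
import torch
import smdistributed.modelparallel.torch as smp


@smp.step
def train_step(model, data, target):
    output = model(data)
    loss = torch.nn.functional.nll_loss(output, target)
    model.backward(loss)
    return loss


def train(model, optimizer, loader):
    for data, target in loader:
        optimizer.zero_grad()
        train_step(model, data, target)
        # Workaround from the note above: until the referenced PyTorch fix is
        # released, call step() only on processes whose partition was assigned
        # at least one local parameter.
        if len(list(model.local_parameters())) > 0:
            optimizer.step()
```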
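For the 1.2.0 entries touched by the wording change, the DDP-style ``smp.DistributedModel`` options and the ``join`` context manager could be combined roughly as below. This is a sketch: the option values and batch list are placeholders, and ``model.join()`` is assumed to mirror the call form of the DDP method of the same name.

```python
import torch
import torch.nn as nn
import smdistributed.modelparallel.torch as smp

smp.init()

model = smp.DistributedModel(
    nn.Linear(32, 2),
    # Options added in 1.2.0 (PyTorch 1.7.1); they behave like the same-named
    # torch.nn.parallel.DistributedDataParallel arguments.
    broadcast_buffers=True,
    gradient_as_bucket_view=True,
    find_unused_parameters=False,
)

batches = [torch.randn(4, 32) for _ in range(3)]  # stand-in for a per-rank dataloader

# `join` lets ranks with fewer batches finish cleanly when inputs are uneven
# across the participating processes.
with model.join():
    for data in batches:
        ...  # run forward/backward through an smp.step-decorated function
```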
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+
+Version 1.3.0 (Latest)
+======================
+
+To use the library, reference the Common API documentation alongside the framework specific API documentation.
+
+.. toctree::
+   :maxdepth: 1
+
+   latest/smd_model_parallel_common_api
+   latest/smd_model_parallel_pytorch
+   latest/smd_model_parallel_tensorflow
