
Commit 0035dff (parent 6adb716)
Author: Ajay P

    Updated release notes and API docs for smd model parallel 1.3.1

2 files changed: +46, -8 lines

doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.md

Lines changed: 30 additions & 0 deletions
@@ -1,3 +1,33 @@
# SageMaker Distributed Model Parallel 1.3.1 Release Notes

- New Features
- Bug Fixes
- Known Issues

## New Features

### TensorFlow

- Exposes a new decorator ``register_post_partition_hook``, which allows invoking the decorated method just after model partition but before executing the first step, for example to load a checkpoint (see the sketch below). Please refer to the [SageMaker distributed model parallel API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smp_versions/latest/smd_model_parallel_tensorflow.html) for more information.
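A minimal usage sketch, assuming ``ckpt_manager`` is an ``smp.CheckpointManager`` created earlier; ``load_checkpoint`` is a hypothetical name:

```python
import smdistributed.modelparallel.tensorflow as smp

@smp.register_post_partition_hook
def load_checkpoint():
    # Runs inside the first smp.step call, right after the model is
    # partitioned but before the first forward pass executes.
    ckpt_manager.restore()  # assumed smp.CheckpointManager from earlier setup
```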
## Bug Fixes

### PyTorch

- Improved memory efficiency when using active microbatches by clearing activations at the end of each microbatch.

### TensorFlow

- Fixed an issue that caused hangs when training some models with XLA enabled.
## Known Issues

### PyTorch

- A crash was observed when ``optimizer.step()`` was called for certain optimizers, such as AdaDelta, when the partition on which this method was called has no local parameters assigned to it after partitioning. This is due to a bug in PyTorch which [has since been fixed](https://github.com/pytorch/pytorch/pull/52944). Until that fix makes its way into the next PyTorch release, call ``optimizer.step()`` only on processes that have at least one local parameter, which you can check with ``len(list(model.local_parameters())) > 0`` (see the first sketch after this list).

- A performance regression still exists when training on SMP with PyTorch 1.7.1 compared to 1.6. The root cause was found to be a slowdown in ``.grad`` method calls in PyTorch 1.7.1 compared to 1.6; see the related discussion: https://github.com/pytorch/pytorch/issues/50636. This issue does not exist with PyTorch 1.8. A rough timing sketch follows the list.
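A minimal sketch of the suggested ``optimizer.step()`` guard; ``guarded_step`` is a hypothetical helper, and ``model`` is assumed to be an ``smp.DistributedModel``:

```python
def guarded_step(model, optimizer):
    # model is assumed to be an smp.DistributedModel, which exposes
    # local_parameters(); only step the optimizer on ranks whose
    # partition received at least one local parameter.
    if len(list(model.local_parameters())) > 0:
        optimizer.step()
```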
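And a rough way to observe the ``.grad`` regression, assuming only stock PyTorch; running this under 1.6 and 1.7.1 and comparing the timings should show the slowdown:

```python
import time
import torch

# Build a small model and populate gradients once.
model = torch.nn.Linear(1024, 1024)
model(torch.randn(64, 1024)).sum().backward()

# Time repeated .grad attribute accesses; their slowdown in PyTorch 1.7.1
# relative to 1.6 is the root cause referenced above.
start = time.perf_counter()
for _ in range(100_000):
    for p in model.parameters():
        _ = p.grad
print(f"PyTorch {torch.__version__}: {time.perf_counter() - start:.3f}s")
```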
# SageMaker Distributed Model Parallel 1.3.0 Release Notes

- New Features

doc/api/training/smp_versions/latest/smd_model_parallel_tensorflow.rst

Lines changed: 16 additions & 8 deletions
@@ -102,13 +102,6 @@ TensorFlow API
                       max_to_keep=None,
                       checkpoint_name="ckpt")
 
-
-**Important:** ``smp.CheckpointManager.restore()`` must be called after
-the first training step. This is because the first call of the
-``smp.step`` function constructs and partitions the model, which must
-take place before the checkpoint restore. Calling it before the first
-``smp.step`` call might result in hangs or unexpected behavior.
-
 **Parameters**
 
 - ``checkpoint``: A `tf.train.Checkpoint
@@ -154,7 +147,22 @@ TensorFlow API
 .. code:: python
 
     for step, inputs in enumerate(train_ds):
-        if step == 1:                    # NOTE: restore occurs on the second step
+        if step == 0:
             ckpt_manager.restore()
         loss = train_step(inputs)
 
+.. function:: register_post_partition_hook(hook)
+
+   Registers a callable ``hook`` to
+   be executed after the model is partitioned. This is useful in situations
+   where an operation needs to be executed after the model partition during
+   the first call to ``smp.step``, but before the actual execution of the
+   first forward pass.
+
+   .. code:: python
+
+      @smp.register_post_partition_hook
+      def test_eager():
+          # All statements here will be executed right after partition but before the first forward pass
+          tf.print("Entered hook through eager context")
+
