
Commit c6f738a

Merge branch 'master' into env_support_training
2 parents 26767b6 + b7b4549 commit c6f738a

14 files changed (+1381 −70 lines)

doc/api/training/smd_model_parallel.rst

Lines changed: 1 addition & 0 deletions
@@ -34,6 +34,7 @@ Select a version to see the API documentation for version. To use the library, r
 .. toctree::
    :maxdepth: 1

+   smp_versions/v1_3_0.rst
    smp_versions/v1_2_0.rst
    smp_versions/v1_1_0.rst

doc/api/training/smd_model_parallel_general.rst

Lines changed: 83 additions & 57 deletions
Large diffs are not rendered by default.

doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.md

Lines changed: 41 additions & 2 deletions
@@ -1,3 +1,42 @@
+# Sagemaker Distributed Model Parallel 1.3.0 Release Notes
+
+- New Features
+- Bug Fixes
+- Known Issues
+
+## New Features
+
+### PyTorch
+
+#### Add support for PyTorch 1.8
+
+- Adds a new method, ``register_comm_hook``, to DistributedModel (PyTorch 1.8 and newer only). This method behaves the same as the method with the same name in the
+`torch.DistributedDataParallel` API. Refer to the [SageMaker distributed model parallel API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_pytorch.html#smp.DistributedModel) for more information.
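For illustration, a minimal sketch of how such a hook might be registered, assuming the call signature matches ``torch.nn.parallel.DistributedDataParallel.register_comm_hook`` and using the stock FP16-compression hook shipped with PyTorch 1.8; the wrapped module is a placeholder:

```python
import torch.nn as nn
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks
import smdistributed.modelparallel.torch as smp

smp.init()

# Wrap a (placeholder) module with the library's DistributedModel wrapper.
model = smp.DistributedModel(nn.Linear(1024, 1024))

# Register the built-in FP16 gradient-compression hook (PyTorch 1.8+),
# mirroring the DistributedDataParallel method of the same name.
model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)
```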
+
+#### Others
+- Adds a configuration, ``active_microbatches``, to the SageMaker SDK API for launching jobs, to control the number of active microbatches during training. This helps limit memory usage in cases where the number of microbatches is high. (An illustrative launch configuration follows this list.) Refer to the [SageMaker Python SDK parameters API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_general.html) for more information.
+
+- Adds a configuration, ``deterministic_server``, to the SageMaker SDK API for launching jobs, which ensures that the execution server for pipeline parallelism processes requests in a deterministic order across data parallel ranks. Refer to the [SageMaker Python SDK parameters API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_general.html) for more information.
+
+- Parameter passing is now supported in ``module.forward`` methods for DistributedModel and its submodules. This removes the restriction of having to pass ``nn.Parameter`` to the ``__init__`` call and make it a member of the module in order to use it.
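For illustration only, a hedged sketch of how these two parameters might be passed through the SageMaker Python SDK's ``distribution`` argument; the entry point, IAM role, instance settings, and the other ``parameters`` values below are placeholders, not recommendations:

```python
from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled": True,
    "parameters": {
        "partitions": 2,               # placeholder values
        "microbatches": 8,
        "pipeline": "interleaved",
        "active_microbatches": 4,      # new in 1.3.0: cap in-flight microbatches to limit memory
        "deterministic_server": True,  # new in 1.3.0: deterministic request ordering
    },
}

estimator = PyTorch(
    entry_point="train.py",            # placeholder training script
    role="<your-iam-role-arn>",        # placeholder IAM role
    instance_type="ml.p3.16xlarge",
    instance_count=1,
    framework_version="1.8.0",
    py_version="py36",
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": {"enabled": True, "processes_per_host": 8},
    },
)
estimator.fit("s3://<bucket>/<training-data-prefix>")  # placeholder data location
```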
+## Bug Fixes
+
+### PyTorch
+
+- Fixed a case where training hangs because a module has computation that requires gradients but is not used by the module's final output. Such a situation now raises an error with suggestions for making the computation compatible.
+
+- Fixed an issue where buffers were not placed on the correct device after a model is partitioned, and were not synchronized across steps (when ``broadcast_buffers`` is ``True``). This could have caused correctness issues in models with buffers.
+
+## Known Issues
+
+### PyTorch
+
+- ``mp_barrier`` and ``get_mp_process_group`` are incorrectly marked as deprecated methods. Ignore the deprecation warning.
+
+- A crash was observed when ``optimizer.step()`` was called for certain optimizers, such as AdaDelta, when the partition on which this method was called has no local parameters assigned to it after partitioning. This is due to a bug in PyTorch which [has since been fixed](https://github.com/pytorch/pytorch/pull/52944). Until that fix makes it into the next release of PyTorch, call ``optimizer.step()`` only on processes that have at least one local parameter, which can be checked with ``len(list(model.local_parameters())) > 0``. (A sketch of this workaround follows this list.)
+
+- A performance regression still exists when training on SMP with PyTorch 1.7.1 compared to 1.6. The root cause was found to be a slowdown in the performance of `.grad` method calls in PyTorch 1.7.1 compared to 1.6. See the related discussion: https://github.com/pytorch/pytorch/issues/50636. This issue does not exist with PyTorch 1.8.
+
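A minimal sketch of the workaround above, assuming a typical SMP training script in which the model and optimizer have already been wrapped by ``smp.DistributedModel`` and ``smp.DistributedOptimizer``:

```python
import torch
import torch.nn as nn
import smdistributed.modelparallel.torch as smp

smp.init()

model = smp.DistributedModel(nn.Linear(32, 32))  # placeholder module
optimizer = smp.DistributedOptimizer(torch.optim.Adadelta(model.parameters()))

# ... run the forward/backward pass inside an @smp.step-decorated function ...

# Workaround: only call optimizer.step() on processes that were assigned
# at least one local parameter after partitioning; other processes skip it.
if len(list(model.local_parameters())) > 0:
    optimizer.step()
```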
# Sagemaker Distributed Model Parallel 1.2.0 Release Notes

 - New Features
@@ -11,7 +50,7 @@
 #### Add support for PyTorch 1.7.1

 - Adds support for `gradient_as_bucket_view` (PyTorch 1.7.1 only), `find_unused_parameters` (PyTorch 1.7.1 only) and `broadcast_buffers` options to `smp.DistributedModel`. These options behave the same as the corresponding options (with the same names) in
-`torch.DistributedDataParallel` API. Please refer to the [SageMaker distributed model parallel API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_pytorch.html#smp.DistributedModel) for more information.
+`torch.DistributedDataParallel` API. Refer to the [SageMaker distributed model parallel API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_pytorch.html#smp.DistributedModel) for more information.

 - Adds support for `join` (PyTorch 1.7.1 only) context manager, which is to be used in conjunction with an instance of `smp.DistributedModel` to be able to train with uneven inputs across participating processes.

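For illustration, a hedged sketch of passing the ``smp.DistributedModel`` options named in the hunk above; the wrapped module is a placeholder and the values shown are not recommendations:

```python
import torch.nn as nn
import smdistributed.modelparallel.torch as smp

smp.init()

model = smp.DistributedModel(
    nn.Linear(1024, 1024),          # placeholder module
    gradient_as_bucket_view=True,   # PyTorch 1.7.1 only
    find_unused_parameters=False,   # PyTorch 1.7.1 only
    broadcast_buffers=True,         # keep buffers in sync across replicas
)
```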
@@ -36,7 +75,7 @@ regular dicts.

 ### PyTorch

-- A performance regression was observed when training on SMP with PyTorch 1.7.1 compared to 1.6.0. The rootcause was found to be the slowdown in performance of `.grad` method calls in PyTorch 1.7.1 compared to 1.6.0. Please see the related discussion: https://github.com/pytorch/pytorch/issues/50636.
+- A performance regression was observed when training on SMP with PyTorch 1.7.1 compared to 1.6.0. The rootcause was found to be the slowdown in performance of `.grad` method calls in PyTorch 1.7.1 compared to 1.6.0. See the related discussion: https://github.com/pytorch/pytorch/issues/50636.

# Sagemaker Distributed Model Parallel 1.1.0 Release Notes

doc/api/training/smp_versions/v1.1.0/smd_model_parallel_pytorch.rst

Lines changed: 3 additions & 1 deletion
@@ -265,7 +265,9 @@ This API document assumes you use the following import statements in your traini
    Returns the ``state_dict`` that contains optimizer state for the entire model.
    It first collects the ``local_state_dict`` and gathers and merges
    the ``local_state_dict`` from all ``mp_rank``s to create a full
-   ``state_dict``.
+   ``state_dict``. Please note that this needs to be called on all ranks with
+   ``dp_rank()==0`` to ensure the gather happens properly.
+   If it is called on only a subset of those ranks, it can hang.

 .. function:: load_state_dict( )
    :noindex:
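A hedged sketch of the calling pattern this note implies; ``optimizer`` is assumed to be the library-wrapped optimizer defined earlier in the training script, and the checkpoint path is a placeholder:

```python
import torch
import smdistributed.modelparallel.torch as smp

# Every rank with dp_rank() == 0 (one full set of model-parallel partitions)
# must participate in the gather, otherwise the call can hang.
if smp.dp_rank() == 0:
    full_optimizer_state = optimizer.state_dict()
    if smp.rank() == 0:
        # Only a single process writes the gathered state to disk.
        torch.save(full_optimizer_state, "/opt/ml/checkpoints/optimizer.pt")
```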

doc/api/training/smp_versions/v1.2.0/smd_model_parallel_common_api.rst

Lines changed: 22 additions & 2 deletions
@@ -24,10 +24,12 @@ The following SageMaker distribute model parallel APIs are common across all fra


 .. function:: smp.init( )
+   :noindex:

    Initialize the library. Must be called at the beginning of training script.

 .. function:: @smp.step(non_split_inputs, input_split_axes, [*args, **kwargs])
+   :noindex:

    A decorator that must be placed over a function that represents a single
    forward and backward pass (for training use cases), or a single forward
@@ -162,7 +164,7 @@ The following SageMaker distribute model parallel APIs are common across all fra


 .. class:: StepOutput
-
+   :noindex:

    A class that encapsulates all versions of a ``tf.Tensor``
    or \ ``torch.Tensor`` across all microbatches.
@@ -191,27 +193,32 @@ The following SageMaker distribute model parallel APIs are common across all fra
    post-processing operations on tensors.

 .. data:: StepOutput.outputs
+   :noindex:

    Returns a list of the underlying tensors, indexed by microbatch.

 .. function:: StepOutput.reduce_mean( )
+   :noindex:

    Returns a ``tf.Tensor``, ``torch.Tensor`` that averages the constituent ``tf.Tensor`` s
    ``torch.Tensor`` s. This is commonly used for averaging loss and gradients across microbatches.

 .. function:: StepOutput.reduce_sum( )
+   :noindex:

    Returns a ``tf.Tensor`` /
    ``torch.Tensor`` that sums the constituent
    ``tf.Tensor``\ s/\ ``torch.Tensor``\ s.

 .. function:: StepOutput.concat( )
+   :noindex:

    Returns a
    ``tf.Tensor``/``torch.Tensor`` that concatenates tensors along the
    batch dimension using ``tf.concat`` / ``torch.cat``.

 .. function:: StepOutput.stack( )
+   :noindex:

    Applies ``tf.stack`` / ``torch.stack``
    operation to the list of constituent ``tf.Tensor``\ s /
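A hedged sketch of how ``@smp.step`` and ``StepOutput.reduce_mean()`` fit together in a PyTorch training script; the model, optimizer, loss function, and data are placeholders, and the surrounding data-loading code is omitted:

```python
import torch.nn.functional as F
import smdistributed.modelparallel.torch as smp

@smp.step
def train_step(model, data, target):
    output = model(data)
    loss = F.nll_loss(output, target, reduction="mean")
    model.backward(loss)  # inside smp.step, backward is invoked on the model
    return loss

# train_step returns a StepOutput holding one loss tensor per microbatch;
# reduce_mean() averages them into a single scalar for logging.
loss = train_step(model, data, target).reduce_mean()
optimizer.step()
```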
@@ -220,13 +227,15 @@ The following SageMaker distribute model parallel APIs are common across all fra
220227
**TensorFlow-only methods**
221228

222229
.. function:: StepOutput.merge( )
230+
:noindex:
223231

224232
Returns a ``tf.Tensor`` that
225233
concatenates the constituent ``tf.Tensor``\ s along the batch
226234
dimension. This is commonly used for merging the model predictions
227235
across microbatches.
228236

229237
.. function:: StepOutput.accumulate(method="variable", var=None)
238+
:noindex:
230239

231240
Functionally the same as ``StepOutput.reduce_mean()``. However, it is
232241
more memory-efficient, especially for large numbers of microbatches,
@@ -252,6 +261,7 @@ The following SageMaker distribute model parallel APIs are common across all fra
    ignored.

 .. _mpi_basics:
+   :noindex:

 MPI Basics
 ^^^^^^^^^^
@@ -274,7 +284,8 @@ The library exposes the following basic MPI primitives to its Python API:
 - ``smp.get_dp_group()``: The list of ranks that hold different
   replicas of the same model partition.

-.. _communication_api:
+.. _communication_api:
+   :noindex:

 Communication API
 ^^^^^^^^^^^^^^^^^
@@ -288,6 +299,7 @@ should involve.
 **Helper structures**

 .. data:: smp.CommGroup
+   :noindex:

    An ``enum`` that takes the values
    ``CommGroup.WORLD``, ``CommGroup.MP_GROUP``, and ``CommGroup.DP_GROUP``.
@@ -306,6 +318,7 @@ should involve.
    themselves.

 .. data:: smp.RankType
+   :noindex:

    An ``enum`` that takes the values
    ``RankType.WORLD_RANK``, ``RankType.MP_RANK``, and ``RankType.DP_RANK``.
@@ -321,6 +334,7 @@ should involve.
 **Communication primitives:**

 .. function:: smp.broadcast(obj, group)
+   :noindex:

    Sends the object to all processes in the
    group. The receiving process must call ``smp.recv_from`` to receive the
@@ -353,6 +367,7 @@ should involve.
        smp.recv_from(0, rank_type=smp.RankType.WORLD_RANK)

 .. function:: smp.send(obj, dest_rank, rank_type)
+   :noindex:

    Sends the object ``obj`` to
    ``dest_rank``, which is of a type specified by ``rank_type``.
@@ -376,6 +391,7 @@ should involve.
    ``recv_from`` call.

 .. function:: smp.recv_from(src_rank, rank_type)
+   :noindex:

    Receive an object from a peer process. Can be used with a matching
    ``smp.send`` or a ``smp.broadcast`` call.
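A hedged sketch of a matching ``smp.send``/``smp.recv_from`` pair between world ranks 0 and 1, assuming the library has already been initialized with ``smp.init()`` and that ``smp.rank()`` returns the world rank:

```python
import smdistributed.modelparallel.torch as smp

if smp.rank() == 0:
    # World rank 0 sends an arbitrary picklable object to world rank 1.
    smp.send({"step": 42}, 1, smp.RankType.WORLD_RANK)
elif smp.rank() == 1:
    # World rank 1 blocks until the matching object arrives from rank 0.
    payload = smp.recv_from(0, rank_type=smp.RankType.WORLD_RANK)
```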
@@ -401,6 +417,7 @@ should involve.
    ``broadcast`` call, and the object is received.

 .. function:: smp.allgather(obj, group)
+   :noindex:

    A collective call that gathers all the
    submitted objects across all ranks in the specified ``group``. Returns a
@@ -434,6 +451,7 @@ should involve.
        out = smp.allgather(obj2, smp.CommGroup.MP_GROUP)  # returns [obj1, obj2]

 .. function:: smp.barrier(group=smp.WORLD)
+   :noindex:

    A statement that hangs until all
    processes in the specified group reach the barrier statement, similar to
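A hedged sketch of the barrier call with and without an explicit group, assuming ``smp.init()`` has been called earlier:

```python
import smdistributed.modelparallel.torch as smp

# Block until every process in the world reaches this point (default group).
smp.barrier()

# Block only until the processes in this process's data-parallel group arrive.
smp.barrier(smp.CommGroup.DP_GROUP)
```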
@@ -455,12 +473,14 @@ should involve.
    processes outside that ``mp_group``.

 .. function:: smp.dp_barrier()
+   :noindex:

    Same as passing ``smp.DP_GROUP``\ to ``smp.barrier()``.
    Waits for the processes in the same \ ``dp_group`` as
    the current process to reach the same point in execution.

 .. function:: smp.mp_barrier()
+   :noindex:

    Same as passing ``smp.MP_GROUP`` to
    ``smp.barrier()``. Waits for the processes in the same ``mp_group`` as
