
Commit 36b5f95

documentation: adding change log for smd model parallel (#2056)

* documentation: adding change log for smd model parallel
* documentation: small typo fixes
* documentation: nudging CI

1 parent 200a4c2 commit 36b5f95

File tree: 2 files changed, +116 -11 lines changed

doc/api/training/smd_model_parallel.rst

Lines changed: 21 additions & 11 deletions
@@ -20,6 +20,21 @@ Use the following sections to learn more about the model parallelism and the lib
   <https://integ-docs-aws.amazon.com/sagemaker/latest/dg/model-parallel-use-api.html#model-parallel-customize-container>`__
 for more information.
 
+How to Use this Guide
+=====================
+
+The library contains a Common API that is shared across frameworks, as well as APIs
+that are specific to supported frameworks, TensorFlow and PyTorch. To use the library, reference the
+**Common API** documentation alongside the framework specific API documentation.
+
+.. toctree::
+   :maxdepth: 1
+
+   smd_model_parallel_general
+   smd_model_parallel_common_api
+   smd_model_parallel_pytorch
+   smd_model_parallel_tensorflow
+
 It is recommended to use this documentation alongside `SageMaker Distributed Model Parallel
 <http://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel.html>`__ in the Amazon SageMaker
 developer guide. This developer guide documentation includes:
@@ -34,17 +49,12 @@ developer guide. This developer guide documentation includes:
 - `Configuration tips and pitfalls
   <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-customize-tips-pitfalls.html>`__
 
-**How to Use this Guide**
-
-The library contains a Common API that is shared across frameworks, as well as APIs
-that are specific to supported frameworks, TensorFlow and PyTroch. To use the library, reference the
-**Common API** documentation alongside framework specific API documentation.
+Latest Updates
+==============
 
+New features, bug fixes, and improvements are regularly made to the SageMaker distributed model parallel library.
 
-.. toctree::
-   :maxdepth: 1
+To see the latest changes made to the library, refer to the library
+`Release Notes
+<https://github.com/aws/sagemaker-python-sdk/blob/master/doc/api/training/smd_model_parallel_release_notes/>`_.
 
-   smd_model_parallel_general
-   smd_model_parallel_common_api
-   smd_model_parallel_pytorch
-   smd_model_parallel_tensorflow
Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
# SageMaker Distributed Model Parallel 1.1.0 Release Notes

- New Features
- Bug Fixes
- Improvements
- Performance
- Known Issues

## New Features

The following sections describe the new features that are common across frameworks and those that are framework specific.

### Common across frameworks

#### Custom slicing support (`smp_slice` method) for objects passed to `smp.step` decorated functions

To pass an object to `smp.step` that contains tensors that need to be split across
microbatches and is not an instance of list, dict, tuple, or set, implement the `smp_slice` method for the object.

Below is an example of how to use this with PyTorch:
```python
import torch
import smdistributed.modelparallel.torch as smp


class CustomType:
    def __init__(self, tensor):
        self.data = tensor

    # SMP will call this to invoke slicing on the object, passing in the total
    # number of microbatches (num_mb) and the current microbatch index (mb).
    def smp_slice(self, num_mb, mb, axis):
        dim_size = list(self.data.size())[axis]

        split_size = dim_size // num_mb
        sliced_tensor = self.data.narrow(axis, mb * split_size, split_size)
        return CustomType(sliced_tensor)


custom_obj = CustomType(torch.ones(4,))


# `model` is assumed to be an existing smp.DistributedModel instance.
@smp.step()
def step(custom_obj):
    loss = model(custom_obj)
    model.backward(loss)
    return loss
```

### PyTorch

#### Add support for smp.DistributedModel.cpu()

`smp.DistributedModel.cpu()`
[allgather](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_common_api.html#smp.allgather)s
parameters and buffers across all `mp_ranks` and moves them to the CPU.
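
As a rough sketch of where this fits in a training script (the wrapped module, checkpoint path, and surrounding setup below are illustrative assumptions, not part of the release), gathering onto the CPU is typically done before saving a consolidated checkpoint from a single rank:

```python
import torch
import torch.nn as nn
import smdistributed.modelparallel.torch as smp

smp.init()
# Placeholder module; any torch.nn.Module wrapped in smp.DistributedModel works the same way.
model = smp.DistributedModel(nn.Linear(1024, 1024))

# ... training with @smp.step-decorated functions happens here ...

# Allgather parameters and buffers from all mp_ranks and move them to the CPU,
# so a single rank can save a consolidated copy of the model.
model.cpu()
if smp.rank() == 0:
    torch.save(model.state_dict(), "model_checkpoint.pt")
```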

#### Add `trace_memory_usage` option to `smp.DistributedModel` to measure memory usage per module

Adds a `trace_memory_usage` option to `smp.DistributedModel`. This attempts to measure memory usage per module during
tracing. If this is disabled, memory usage is estimated through the sizes of tensors returned from the module.
This option is disabled by default.
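
A minimal sketch of enabling the option, assuming (as an illustration, not a confirmed signature) that `trace_memory_usage` is passed as a keyword argument when wrapping the model; the module itself is a placeholder:

```python
import torch.nn as nn
import smdistributed.modelparallel.torch as smp

smp.init()
module = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

# Assumption: the option is a constructor keyword argument. When enabled, SMP attempts to
# measure memory usage per module during tracing; when disabled (the default), memory usage
# is estimated from the sizes of the tensors each module returns.
model = smp.DistributedModel(module, trace_memory_usage=True)
```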

## Bug Fixes

### PyTorch

- `torch.nn.Sequential`: Fix a bug with `torch.nn.Sequential` which causes a failure with the error message: `shouldnt go less than 0, there is a bug` when the inputs to the first module don't require grads.

- `smp.DistributedModel`: Fix a bug with `DistributedModel` execution when a module has multiple parents. The bug surfaces with the error message: `actual_parent should be different than module_execution_stack parent only for torch.nn.ModuleList`

- `apex.optimizers.FusedNovoGrad`: Fix a bug with `apex.optimizers.FusedNovoGrad` which surfaces with the error message: `KeyError: 'exp_avg_sq'`

## Improvements

### Usability

#### PyTorch

- `smp.DistributedModel`: Improve the error message when the forward pass on `smp.DistributedModel` is called outside the `smp.step` decorated function.

- `smp.load`: Add user-friendly error messages when loading checkpoints with `smp.load`.

### Partitioning Algorithm

#### PyTorch

- Better memory balancing: while partitioning the children of a given module, the modules already assigned to the parent are now taken into account.

## Performance

### TensorFlow

- Addresses long pre-processing times introduced by the SMP XLA optimizer when dealing with large graphs and a large number of microbatches. BERT (large) preprocessing time goes down from 40 minutes to 6 minutes on p3.16xlarge.

## Known Issues

### PyTorch

- Serialization for Torch in SMP converts instances of dict subclasses into plain dict instances, rather than preserving the subclass. One use case that fails because of this is when a user implements a subclass of OrderedDict with a custom `__getitem__` method: after serialization/deserialization in SMP, indexing on the object leads to errors. A workaround is to use the dict keys to access the underlying items, as in the sketch below.
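
The following sketch illustrates the workaround; `IndexableDict` is a hypothetical example of an OrderedDict subclass with a custom `__getitem__`, not an API from the library:

```python
from collections import OrderedDict

import torch


class IndexableDict(OrderedDict):
    """Hypothetical OrderedDict subclass that also allows integer indexing."""

    def __getitem__(self, key):
        if isinstance(key, int):
            key = list(self.keys())[key]
        return super().__getitem__(key)


outputs = IndexableDict(loss=torch.tensor(0.5), logits=torch.zeros(4))

# After passing through SMP serialization/deserialization (for example, when returned
# from an smp.step-decorated function), the object may come back as a plain dict, so
# subclass-specific indexing such as outputs[0] would fail.
# Workaround from the release notes: access items through their dict keys instead.
loss = outputs["loss"]
logits = outputs["logits"]
```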
