# SageMaker Distributed Model Parallel 1.1.0 Release Notes

- New Features
- Bug Fixes
- Improvements
- Performance
- Known Issues

## New Features

The following sections describe the new features that are common across frameworks and those that are framework specific.

### Common across frameworks

#### Custom slicing support (`smp_slice` method) for objects passed to `smp.step` decorated functions

To pass an object to `smp.step` that contains tensors that need to be split across
microbatches and is not an instance of list, dict, tuple, or set, you should implement the `smp_slice` method for the object.

Below is an example of how to use this with PyTorch:

```python
import torch
import smdistributed.modelparallel.torch as smp

class CustomType:
    def __init__(self, tensor):
        self.data = tensor

    # SMP will call this to invoke slicing on the object, passing in the total
    # number of microbatches (num_mb) and the current microbatch index (mb).
    def smp_slice(self, num_mb, mb, axis):
        dim_size = list(self.data.size())[axis]

        split_size = dim_size // num_mb
        sliced_tensor = self.data.narrow(axis, mb * split_size, split_size)
        return CustomType(sliced_tensor)

custom_obj = CustomType(torch.ones(4,))

# model is assumed to be an smp.DistributedModel wrapping your torch.nn.Module
@smp.step()
def step(custom_obj):
    loss = model(custom_obj)
    model.backward(loss)
    return loss
```

### PyTorch

#### Add support for smp.DistributedModel.cpu()

`smp.DistributedModel.cpu()` [allgather](https://sagemaker.readthedocs.io/en/stable/api/training/smd_model_parallel_common_api.html#smp.allgather)s parameters and buffers across all `mp_rank`s and moves them to the CPU.
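
A minimal usage sketch follows, assuming a typical checkpointing flow; `nn.Linear` stands in for your own module, and the file name is arbitrary:

```python
import torch
import torch.nn as nn
import smdistributed.modelparallel.torch as smp

smp.init()

# nn.Linear is a stand-in for your own model.
model = smp.DistributedModel(nn.Linear(8, 2))

# ... training loop ...

# Allgather parameters and buffers across all mp_ranks and move them to the CPU,
# so that a single rank can save a full copy of the model.
model.cpu()
if smp.rank() == 0:
    torch.save(model.state_dict(), "model_cpu.pt")
```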

#### Add `trace_memory_usage` option to `smp.DistributedModel` to measure memory usage per module

Adds a `trace_memory_usage` option to `smp.DistributedModel`. When enabled, SMP attempts to measure memory usage per module during tracing. When disabled, memory usage is estimated from the sizes of the tensors returned from the module. This option is disabled by default.
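
A minimal sketch of enabling the option, assuming `trace_memory_usage` is passed as a constructor keyword argument when wrapping the model (the note above names the option but not its exact call signature):

```python
import torch.nn as nn
import smdistributed.modelparallel.torch as smp

smp.init()

# Assumption: trace_memory_usage is accepted as a keyword argument by smp.DistributedModel.
model = smp.DistributedModel(nn.Linear(8, 2), trace_memory_usage=True)
```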

## Bug Fixes

### PyTorch

- `torch.nn.Sequential`: Fix a bug with `torch.nn.Sequential` which causes a failure with the error message `shouldnt go less than 0, there is a bug` when the inputs to the first module don't require grads.

- `smp.DistributedModel`: Fix a bug with `DistributedModel` execution when a module has multiple parents. The bug surfaces with the error message `actual_parent should be different than module_execution_stack parent only for torch.nn.ModuleList`.

- `apex.optimizers.FusedNovoGrad`: Fix a bug with `apex.optimizers.FusedNovoGrad` which surfaces with the error message `KeyError: 'exp_avg_sq'`.

## Improvements

### Usability

#### PyTorch

- `smp.DistributedModel`: Improve the error message when the forward pass on `smp.DistributedModel` is called outside the `smp.step` decorated function.

- `smp.load`: Add user-friendly error messages when loading checkpoints with `smp.load`.

### Partitioning Algorithm

#### PyTorch

- Better memory balancing: while partitioning the children of a given module, the algorithm now takes into account the modules already assigned to the parent.

## Performance

### TensorFlow

- Addresses long pre-processing times introduced by the SMP XLA optimizer when dealing with large graphs and a large number of microbatches. BERT (large) preprocessing time goes down from 40 minutes to 6 minutes on p3.16xlarge.

## Known Issues

### PyTorch

- Serialization for Torch in SMP converts instances of `dict` subclasses into plain `dict` instances rather than preserving the subclass. One use case that fails because of this is when a user implements a subclass of `OrderedDict` with a custom `__getitem__` method: after serialization/deserialization in SMP, indexing on the object leads to errors. A workaround is to use the dict keys to access the underlying items.
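
  The sketch below is purely illustrative (the `Outputs` class is hypothetical): it shows the kind of subclass that is affected and the key-based access that works around the issue.

  ```python
  from collections import OrderedDict

  # Hypothetical subclass: relies on a custom __getitem__ that provides a default value.
  class Outputs(OrderedDict):
      def __getitem__(self, key):
          return super().__getitem__(key) if key in self else 0.0

  # After passing through SMP serialization (e.g. across an smp.step boundary),
  # the object comes back as a plain dict, so the custom __getitem__ is gone.
  received = dict(Outputs(loss=0.42))   # stands in for the deserialized object

  # Workaround: read values through the actual dict keys rather than relying on
  # the subclass's custom lookup behavior.
  loss = received.get("loss", 0.0)
  ```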