
Commit 2292b30

Add API enhancements for SMP (#2048)
* Add API enhancements for SMP
* Update doc/api/training/smd_model_parallel_common_api.rst
* Update doc/api/training/smd_model_parallel_pytorch.rst
* Update doc/api/training/smd_model_parallel_common_api.rst

Co-authored-by: Aaron Markham <[email protected]>
1 parent ffa9dc3 commit 2292b30

File tree: 2 files changed (+41, -0 lines)


doc/api/training/smd_model_parallel_common_api.rst

Lines changed: 31 additions & 0 deletions
@@ -57,6 +57,37 @@ The following APIs are common across all frameworks.
versions of the tensor across different microbatches
(see ``StepOutput`` entry for more information).

The argument to an ``smp.step``-decorated function should be a tensor or an
instance of list, tuple, dict, or set for it to be split across microbatches.
If your object does not fall into one of these categories, you can make the
library split it by implementing the ``smp_slice`` method.

Below is an example of how to use it with PyTorch.

.. code:: python

    class CustomType:
        def __init__(self, tensor):
            self.data = tensor

        # The library will call this to invoke slicing on the object, passing in
        # the total number of microbatches (num_mb) and the current microbatch index (mb).
        def smp_slice(self, num_mb, mb, axis):
            dim_size = list(self.data.size())[axis]

            split_size = dim_size // num_mb
            sliced_tensor = self.data.narrow(axis, mb * split_size, split_size)
            return CustomType(sliced_tensor)

    custom_obj = CustomType(torch.ones(4,))

    @smp.step()
    def step(custom_obj):
        loss = model(custom_obj)
        model.backward(loss)
        return loss
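To make the ``smp_slice`` contract concrete, the following sketch slices the ``CustomType`` instance from the example above the way a two-microbatch split would. It is plain PyTorch with no SMP calls; the helper ``split_into_microbatches`` is illustrative only and is not part of the library API.

.. code:: python

    import torch

    # Illustrative helper, not a library API: mimics how a step decorator
    # could slice an argument that implements smp_slice.
    def split_into_microbatches(obj, num_mb, axis=0):
        return [obj.smp_slice(num_mb, mb, axis) for mb in range(num_mb)]

    custom_obj = CustomType(torch.ones(4))  # reuses CustomType from the example above
    microbatches = split_into_microbatches(custom_obj, num_mb=2)
    print([mb.data for mb in microbatches])
    # [tensor([1., 1.]), tensor([1., 1.])]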
**Important:** ``smp.step`` splits the batch into microbatches, and
executes everything inside the decorated function once per microbatch.
This might affect the behavior of batch normalization, any operation

doc/api/training/smd_model_parallel_pytorch.rst

Lines changed: 10 additions & 0 deletions
@@ -128,6 +128,11 @@ This API document assumes you use the following import statements in your traini
computation. ``bucket_cap_mb`` controls the bucket size in MegaBytes
(MB).

- ``trace_memory_usage`` (default: False): When set to True, the library
  attempts to measure memory usage per module during tracing. If this is
  disabled, memory usage is estimated from the sizes of the tensors returned
  by the module. (A usage sketch follows below.)
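The parameter list above appears to belong to ``smp.DistributedModel``, the same class whose ``bucket_cap_mb`` argument is documented just before it. Below is a minimal usage sketch under that assumption, using the standard SMP PyTorch import; treat the exact call as illustrative rather than a verified signature.

.. code:: python

    # Hedged sketch: assumes trace_memory_usage is accepted as a keyword
    # argument of smp.DistributedModel, alongside bucket_cap_mb.
    import torch.nn as nn
    import smdistributed.modelparallel.torch as smp

    smp.init()
    model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))
    model = smp.DistributedModel(model, trace_memory_usage=True)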
**Properties**

- ``partitioned``: Is ``True`` if the model is partitioned, ``False``
@@ -215,6 +220,11 @@ This API document assumes you use the following import statements in your traini
first forward pass. Returns a ``RemovableHandle`` object ``handle``,
which can be used to remove the hook by calling ``handle.remove()``.

.. function:: cpu( )

   Allgathers parameters and buffers across all ``mp_rank``\ s and moves them
   to the CPU.
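One plausible use of ``cpu()``: because it allgathers parameters and buffers across ``mp_rank``\ s, it can collect a full copy of the model on the CPU before saving from a single rank. The sketch below assumes ``model`` is an ``smp.DistributedModel`` wrapper and that ``smp.rank()`` and ``state_dict()`` behave as elsewhere in this API; it is an illustration, not a prescribed checkpointing recipe.

.. code:: python

    # Hedged sketch: gather a full parameter copy on CPU, then save from rank 0.
    import torch
    import smdistributed.modelparallel.torch as smp

    model.cpu()  # allgather across mp_ranks and move parameters/buffers to CPU
    if smp.rank() == 0:
        torch.save(model.state_dict(), "model.pt")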
.. class:: smp.DistributedOptimizer

**Parameters**
