
Commit 9ef1642

Author: Talia Chopra (committed)
documentation: adding details about mpi options, other small updates
1 parent b28bb31 commit 9ef1642

File tree

4 files changed (+60, -6 lines)


doc/api/training/smd_model_parallel_general.rst

Lines changed: 41 additions & 3 deletions
@@ -5,13 +5,13 @@
 
 .. _sm-sdk-modelparallel-params:
 
-SageMaker Python SDK ``modelparallel`` parameters
-=================================================
+Required SageMaker Python SDK parameters
+========================================
 
 The TensorFlow and PyTorch ``Estimator`` objects contain a ``distribution`` parameter,
 which is used to enable and specify parameters for the
 initialization of the SageMaker distributed model parallel library. The library internally uses MPI,
-so in order to use model parallelism, MPI must be enabled using the ``distribution`` parameter.
+so in order to use model parallelism, MPI must also be enabled using the ``distribution`` parameter.
 
 The following is an example of how you can launch a new PyTorch training job with the library.
 
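The body of that example is not shown in this diff. Purely for orientation, here is a hedged sketch of what such a launch can look like with the SageMaker Python SDK; the entry point, role, framework version, instance type, and parameter values are illustrative assumptions, not values taken from this commit.

.. code:: python

    # Hedged sketch of launching a PyTorch training job with the model parallel
    # library and MPI enabled. Entry point, role, versions, instance type, and
    # parameter values are illustrative assumptions.
    from sagemaker.pytorch import PyTorch

    smd_mp_estimator = PyTorch(
        entry_point="train.py",            # assumed training script
        role="SageMakerRole",              # assumed IAM role
        instance_type="ml.p3.16xlarge",
        instance_count=1,
        framework_version="1.6.0",
        py_version="py36",
        distribution={
            "smdistributed": {
                "modelparallel": {
                    "enabled": True,
                    "parameters": {"partitions": 2},
                }
            },
            "mpi": {"enabled": True, "processes_per_host": 2},
        },
    )

    smd_mp_estimator.fit('s3://my_bucket/my_training_data/')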
@@ -55,6 +55,9 @@ The following is an example of how you can launch a new PyTorch training job wit
 
 smd_mp_estimator.fit('s3://my_bucket/my_training_data/')
 
+``smdistributed`` Parameters
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
 You can use the following parameters to initialize the library using the ``parameters``
 dictionary in the ``smdistributed`` section of ``distribution``.
 
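As an illustration of that nesting, the hedged sketch below shows where the ``parameters`` dict sits inside ``distribution``; the option names and values are examples, not a complete or authoritative list.

.. code:: python

    # Hedged sketch: library initialization options go in the "parameters" dict
    # nested under "smdistributed" -> "modelparallel". Values are illustrative.
    distribution = {
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": {
                    "partitions": 4,     # number of model partitions
                    "microbatches": 8,   # microbatches per training batch
                    "ddp": True,         # combine with data parallelism
                },
            }
        },
        "mpi": {"enabled": True, "processes_per_host": 8},
    }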
@@ -302,6 +305,41 @@ table are optional.
 |                   |                         |                 | SageMaker.                        |
 +-------------------+-------------------------+-----------------+-----------------------------------+
 
+``mpi`` Parameters
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+For the ``"mpi"`` key, a dict must be passed which contains:
+
+* ``"enabled"``: Set to ``True`` to launch the training job with MPI.
+
+* ``"processes_per_host"``: Specifies the number of processes MPI should launch on each host.
+  In SageMaker a host is a single Amazon EC2 ml instance. The SageMaker Python SDK maintains
+  a one-to-one mapping between processes and GPUs across model and data parallelism.
+  This means that SageMaker schedules each process on a single, separate GPU and no GPU contains more than one process.
+  If you are using PyTorch, you must restrict each process to its own device using
+  ``torch.cuda.set_device(smp.local_rank())``. To learn more, see
+  `Modify a PyTorch Training Script
+  <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-customize-training-script.html#model-parallel-customize-training-script-pt-16>`_.
+
+  .. important::
+     ``processes_per_host`` must not be greater than the number of GPUs per instance, and typically will be equal to
+     the number of GPUs per instance.
+
+  For example, if you use one instance with 4-way model parallelism and 2-way data parallelism,
+  then ``processes_per_host`` should be 2 x 4 = 8. Therefore, you must choose an instance that has at least 8 GPUs,
+  such as an ml.p3.16xlarge.
+
+  The following image illustrates how 2-way data parallelism and 4-way model parallelism are distributed across 8 GPUs:
+  the model is partitioned across 4 GPUs, and each partition is copied to 2 GPUs.
+
+  .. image:: smp_versions/model-data-parallel.png
+     :width: 650
+     :alt: 2-way data parallelism and 4-way model parallelism distributed across 8 GPUs
+
+
+* ``"custom_mpi_options"``: Use this key to pass any custom MPI options you might need.
+  To avoid Docker warnings from contaminating your training logs, we recommend the following flag:
+  ``--mca btl_vader_single_copy_mechanism none``
+
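Tying the ``"mpi"`` keys described above together, here is a hedged sketch matching the 4-way model parallel, 2-way data parallel example (4 x 2 = 8 processes on a single 8-GPU instance); the ``smdistributed`` values shown are illustrative.

.. code:: python

    # Hedged sketch of the "mpi" entry for the example above: 4-way model
    # parallelism x 2-way data parallelism = 8 processes on one ml.p3.16xlarge.
    distribution = {
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": {"partitions": 4, "ddp": True},  # illustrative values
            }
        },
        "mpi": {
            "enabled": True,
            "processes_per_host": 8,
            # Keep Docker warnings out of the training logs, as recommended above.
            "custom_mpi_options": "--mca btl_vader_single_copy_mechanism none",
        },
    }

    # In the PyTorch training script, pin each process to its own GPU:
    #     torch.cuda.set_device(smp.local_rank())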
 
 .. _ranking-basics:
 

doc/api/training/smp_versions/v1.2.0/smd_model_parallel_common_api.rst

Lines changed: 3 additions & 0 deletions
@@ -118,6 +118,9 @@ The following SageMaker distribute model parallel APIs are common across all fra
 - https://www.tensorflow.org/api_docs/python/tf/function\
 - https://www.tensorflow.org/guide/function\
 
+Each ``smp.step`` decorated function must have a return value that depends on the
+output of ``smp.DistributedModel``.
+
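As a PyTorch-flavored illustration of this requirement, the hedged sketch below returns values computed from the ``smp.DistributedModel`` forward pass; the function and variable names are assumptions.

.. code:: python

    # Hedged sketch: the smp.step-decorated function returns values that depend
    # on the output of smp.DistributedModel. Names are illustrative.
    import torch.nn.functional as F
    import smdistributed.modelparallel.torch as smp

    @smp.step
    def train_step(model, data, target):
        output = model(data)                 # forward through smp.DistributedModel
        loss = F.nll_loss(output, target)    # loss depends on the model output
        model.backward(loss)                 # library-aware backward pass
        return output, loss                  # return values depend on the model output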
 **Common parameters**
 
 - ``non_split_inputs`` (``list``): The list of arguments to the decorated function

doc/api/training/smp_versions/v1.2.0/smd_model_parallel_pytorch.rst

Lines changed: 16 additions & 3 deletions
@@ -31,7 +31,6 @@ This API document assumes you use the following import statements in your traini
 model in the training script can be wrapped with
 ``smp.DistributedModel``.
 
-
 **Example:**
 
 .. code:: python
@@ -89,6 +88,17 @@ This API document assumes you use the following import statements in your traini
 the model objects (``model(inputs)`` and ``model.backward(loss)``) must be made inside
 a ``smp.step``-decorated function.
 
+**Using DDP**
+
+If DDP is enabled, do not place a PyTorch
+``DistributedDataParallel`` wrapper around the ``DistributedModel`` because
+the ``DistributedModel`` wrapper will also handle data parallelism.
+
+Unlike the original DDP wrapper, when you use ``DistributedModel``,
+model parameters and buffers are not immediately broadcast across
+processes when the wrapper is called. Instead, the broadcast is deferred to the first call of the
+``smp.step``-decorated function, when the partition is done.
+
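A hedged sketch of this guidance (the model below is a stand-in): wrap the model once with ``smp.DistributedModel`` and do not add the PyTorch DDP wrapper on top of it.

.. code:: python

    # Hedged sketch: with DDP enabled, smp.DistributedModel handles data
    # parallelism itself, so the model is NOT wrapped in
    # torch.nn.parallel.DistributedDataParallel. The model here is a stand-in.
    import torch
    import smdistributed.modelparallel.torch as smp

    smp.init()                                 # initialize the library
    model = torch.nn.Linear(10, 10)            # stand-in for a real model
    model = smp.DistributedModel(model)        # handles model AND data parallelism
    # model = torch.nn.parallel.DistributedDataParallel(model)  # do NOT do this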
 **Parameters**
 
 - ``module`` (``torch.nn.Module``): Module to be distributed (data parallelism and model parallelism).
@@ -248,11 +258,14 @@ This API document assumes you use the following import statements in your traini
 .. function:: join( )
 
 **Available for PyTorch 1.7 only**
+
 A context manager to be used in conjunction with an instance of
-``smp.DistributedModel``to be able to train with uneven inputs across
+``smp.DistributedModel`` to be able to train with uneven inputs across
 participating processes. This is only supported when ``ddp=True`` for
 ``smp.DistributedModel``. This will use the join with the wrapped
-``DistributedDataParallel`` instance. Please see: `join <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel.join>`__.
+``DistributedDataParallel`` instance. For more information, see:
+`join <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel.join>`__
+in the PyTorch documentation.
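A hedged usage sketch, assuming ``model`` is an ``smp.DistributedModel`` created with ``ddp=True`` and ``train_step`` is an ``smp.step``-decorated function; the data loader name is illustrative.

.. code:: python

    # Hedged sketch: train with inputs that are unevenly distributed across
    # ranks. `model`, `train_step`, and `uneven_dataloader` are assumed names.
    with model.join():
        for data, target in uneven_dataloader:   # some ranks run out of batches first
            loss = train_step(model, data, target)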
 
 
 .. class:: smp.DistributedOptimizer
