
Commit 2c161b2

Author: Talia Chopra (committed)
documentation: small fixes to sm dist. mp updates
1 parent 9ef1642 commit 2c161b2


2 files changed: +4 -4 lines changed


doc/api/training/smd_model_parallel_general.rst

Lines changed: 3 additions & 3 deletions
@@ -312,7 +312,7 @@ For the ``"mpi"`` key, a dict must be passed which contains:
 * ``"enabled"``: Set to ``True`` to launch the training job with MPI.

 * ``"processes_per_host"``: Specifies the number of processes MPI should launch on each host.
-  In SageMaker a host is a single Amazon EC2 ml instance. The SageMaker Python SDK maintains
+  In SageMaker a host is a single Amazon EC2 ml instance. The SageMaker distributed model parallel library maintains
   a one-to-one mapping between processes and GPUs across model and data parallelism.
   This means that SageMaker schedules each process on a single, separate GPU and no GPU contains more than one process.
   If you are using PyTorch, you must restrict each process to its own device using
@@ -321,15 +321,15 @@ For the ``"mpi"`` key, a dict must be passed which contains:
   <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-customize-training-script.html#model-parallel-customize-training-script-pt-16>`_.

   .. important::
-     ``process_per_host`` must be less than the number of GPUs per instance, and typically will be equal to
+     ``process_per_host`` must be less than or equal to the number of GPUs per instance, and typically will be equal to
      the number of GPUs per instance.

   For example, if you use one instance with 4-way model parallelism and 2-way data parallelism,
   then processes_per_host should be 2 x 4 = 8. Therefore, you must choose an instance that has at least 8 GPUs,
   such as an ml.p3.16xlarge.

   The following image illustrates how 2-way data parallelism and 4-way model parallelism is distributed across 8 GPUs:
-  the models is partitioned across 4 GPUs, and each partition is added to 2 GPUs.
+  the model is partitioned across 4 GPUs, and each partition is added to 2 GPUs.

   .. image:: smp_versions/model-data-parallel.png
      :width: 650
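
For reference, a minimal sketch of the estimator configuration these lines describe, using the SageMaker Python SDK's ``PyTorch`` estimator with the ``smdistributed`` and ``mpi`` distribution keys documented on this page. The entry point, IAM role, and framework versions are illustrative placeholders for the 4-way model parallel, 2-way data parallel example above:

    # Sketch only: launches a training job with 4-way model parallelism and
    # 2-way data parallelism on one ml.p3.16xlarge (8 GPUs), so MPI starts
    # 4 x 2 = 8 processes on the host. entry_point and role are placeholders.
    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(
        entry_point="train.py",            # hypothetical training script
        role="<your-iam-role-arn>",        # placeholder IAM role
        instance_type="ml.p3.16xlarge",    # 8 GPUs per instance
        instance_count=1,
        framework_version="1.6.0",
        py_version="py36",
        distribution={
            "smdistributed": {
                "modelparallel": {
                    "enabled": True,
                    "parameters": {
                        "partitions": 4,   # 4-way model parallelism
                        "ddp": True,       # enable data parallelism
                    },
                }
            },
            "mpi": {
                "enabled": True,
                "processes_per_host": 8,   # one process per GPU: 4 x 2 = 8
            },
        },
    )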

doc/api/training/smp_versions/v1.2.0/smd_model_parallel_pytorch.rst

Lines changed: 1 addition & 1 deletion
@@ -97,7 +97,7 @@ This API document assumes you use the following import statements in your training script.
 Unlike the original DDP wrapper, when you use ``DistributedModel``,
 model parameters and buffers are not immediately broadcast across
 processes when the wrapper is called. Instead, the broadcast is deferred to the first call of the
-``smp.step-decorated`` function when the partition is done.
+``smp.step``-decorated function when the partition is done.

 **Parameters**
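
For reference, a minimal sketch of the deferred broadcast this hunk documents, assuming the ``smdistributed.modelparallel.torch`` API as described in this v1.2.0 page; the model, data shapes, and hyperparameters are illustrative:

    # Sketch only: parameters/buffers are NOT broadcast when smp.DistributedModel
    # wraps the model; partitioning and the broadcast happen on the first call
    # of the smp.step-decorated function.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import smdistributed.modelparallel.torch as smp

    smp.init()
    torch.cuda.set_device(smp.local_rank())    # restrict each process to its own GPU
    device = torch.device("cuda", smp.local_rank())

    model = smp.DistributedModel(
        nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
    )                                          # no broadcast happens here yet
    optimizer = smp.DistributedOptimizer(
        torch.optim.SGD(model.parameters(), lr=0.1)
    )

    @smp.step
    def train_step(model, data, target):
        output = model(data)
        loss = F.cross_entropy(output, target)
        model.backward(loss)                   # replaces loss.backward() under smp
        return loss

    # First call: the model is partitioned, then parameters and buffers
    # are broadcast across processes.
    data = torch.rand(64, 784).to(device)
    target = torch.randint(0, 10, (64,), device=device)
    optimizer.zero_grad()
    loss_mb = train_step(model, data, target)  # StepOutput over microbatches
    loss_mb.reduce_mean()                      # average loss across microbatches
    optimizer.step()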