
Commit 2c161b2

Author: Talia Chopra (committed)
documentation: small fixes to sm dist. mp updates
1 parent 9ef1642 commit 2c161b2


2 files changed: +4 -4 lines changed


doc/api/training/smd_model_parallel_general.rst

Lines changed: 3 additions & 3 deletions
@@ -312,7 +312,7 @@ For the ``"mpi"`` key, a dict must be passed which contains:
 * ``"enabled"``: Set to ``True`` to launch the training job with MPI.

 * ``"processes_per_host"``: Specifies the number of processes MPI should launch on each host.
-  In SageMaker a host is a single Amazon EC2 ml instance. The SageMaker Python SDK maintains
+  In SageMaker a host is a single Amazon EC2 ml instance. The SageMaker distributed model parallel library maintains
   a one-to-one mapping between processes and GPUs across model and data parallelism.
   This means that SageMaker schedules each process on a single, separate GPU and no GPU contains more than one process.
   If you are using PyTorch, you must restrict each process to its own device using
@@ -321,15 +321,15 @@ For the ``"mpi"`` key, a dict must be passed which contains:
   <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-customize-training-script.html#model-parallel-customize-training-script-pt-16>`_.

   .. important::
-     ``process_per_host`` must be less than the number of GPUs per instance, and typically will be equal to
+     ``process_per_host`` must be less than or equal to the number of GPUs per instance, and typically will be equal to
      the number of GPUs per instance.

   For example, if you use one instance with 4-way model parallelism and 2-way data parallelism,
   then processes_per_host should be 2 x 4 = 8. Therefore, you must choose an instance that has at least 8 GPUs,
   such as an ml.p3.16xlarge.

   The following image illustrates how 2-way data parallelism and 4-way model parallelism is distributed across 8 GPUs:
-  the models is partitioned across 4 GPUs, and each partition is added to 2 GPUs.
+  the model is partitioned across 4 GPUs, and each partition is added to 2 GPUs.

   .. image:: smp_versions/model-data-parallel.png
      :width: 650
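
For reference, a minimal sketch of the estimator configuration these lines describe, using the SageMaker Python SDK's ``PyTorch`` estimator with the ``smdistributed`` and ``mpi`` distribution keys documented on this page. The entry point, IAM role, and framework versions are illustrative placeholders for the 4-way model parallel, 2-way data parallel example above:

    # Sketch only: launches a training job with 4-way model parallelism and
    # 2-way data parallelism on one ml.p3.16xlarge (8 GPUs), so MPI starts
    # 4 x 2 = 8 processes on the host. entry_point and role are placeholders.
    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(
        entry_point="train.py",            # hypothetical training script
        role="<your-iam-role-arn>",        # placeholder IAM role
        instance_type="ml.p3.16xlarge",    # 8 GPUs per instance
        instance_count=1,
        framework_version="1.6.0",
        py_version="py36",
        distribution={
            "smdistributed": {
                "modelparallel": {
                    "enabled": True,
                    "parameters": {
                        "partitions": 4,   # 4-way model parallelism
                        "ddp": True,       # enable data parallelism
                    },
                }
            },
            "mpi": {
                "enabled": True,
                "processes_per_host": 8,   # one process per GPU: 4 x 2 = 8
            },
        },
    )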

doc/api/training/smp_versions/v1.2.0/smd_model_parallel_pytorch.rst

Lines changed: 1 addition & 1 deletion
@@ -97,7 +97,7 @@ This API document assumes you use the following import statements in your training script.
 Unlike the original DDP wrapper, when you use ``DistributedModel``,
 model parameters and buffers are not immediately broadcast across
 processes when the wrapper is called. Instead, the broadcast is deferred to the first call of the
-``smp.step-decorated`` function when the partition is done.
+``smp.step``-decorated function when the partition is done.

 **Parameters**
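
For reference, a minimal sketch of the deferred broadcast this hunk documents, assuming the ``smdistributed.modelparallel.torch`` API as described in this v1.2.0 page; the model, data shapes, and hyperparameters are illustrative:

    # Sketch only: parameters/buffers are NOT broadcast when smp.DistributedModel
    # wraps the model; partitioning and the broadcast happen on the first call
    # of the smp.step-decorated function.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import smdistributed.modelparallel.torch as smp

    smp.init()
    torch.cuda.set_device(smp.local_rank())    # restrict each process to its own GPU
    device = torch.device("cuda", smp.local_rank())

    model = smp.DistributedModel(
        nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
    )                                          # no broadcast happens here yet
    optimizer = smp.DistributedOptimizer(
        torch.optim.SGD(model.parameters(), lr=0.1)
    )

    @smp.step
    def train_step(model, data, target):
        output = model(data)
        loss = F.cross_entropy(output, target)
        model.backward(loss)                   # replaces loss.backward() under smp
        return loss

    # First call: the model is partitioned, then parameters and buffers
    # are broadcast across processes.
    data = torch.rand(64, 784).to(device)
    target = torch.randint(0, 10, (64,), device=device)
    optimizer.zero_grad()
    loss_mb = train_step(model, data, target)  # StepOutput over microbatches
    loss_mb.reduce_mean()                      # average loss across microbatches
    optimizer.step()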