Commit 973f72c

mchoi8739 authored and navinsoni committed
update the doc with my suggestions
1 parent c50ef5b commit 973f72c


doc/frameworks/pytorch/using_pytorch.rst

Lines changed: 31 additions & 6 deletions
@@ -196,6 +196,7 @@ fit Optional Arguments
 - ``logs``: Defaults to True, whether to show logs produced by training
   job in the Python session. Only meaningful when wait is True.

+----

 Distributed PyTorch Training
 ============================
@@ -262,15 +263,19 @@ during the PyTorch DDP initialization.

 .. note::

-  The SageMaker PyTorch estimator can operates both ``mpirun`` and ``torchrun`` in the backend for distributed training.
+  The SageMaker PyTorch estimator can operate both ``mpirun`` (for PyTorch 1.12.0 and later)
+  and ``torchrun`` (for PyTorch 1.13.1 and later)
+  in the backend for distributed training.

   For more information about setting up PyTorch DDP in your training script,
   see `Getting Started with Distributed Data Parallel
   <https://pytorch.org/tutorials/intermediate/ddp_tutorial.html>`_ in the
   PyTorch documentation.

-The following example shows how to run a PyTorch DDP training in SageMaker
-using two ``ml.p4d.24xlarge`` instances:
+The following examples show how to set up a PyTorch estimator
+to run a distributed training job on two ``ml.p4d.24xlarge`` instances.
+
+**Using PyTorch DDP with the ``mpirun`` backend**

 .. code:: python

@@ -290,7 +295,27 @@ using two ``ml.p4d.24xlarge`` instances:
         }
     )

-    pt_estimator.fit("s3://bucket/path/to/training/data")
+**Using PyTorch DDP with the ``torchrun`` backend**
+
+.. code:: python
+
+    from sagemaker.pytorch import PyTorch
+
+    pt_estimator = PyTorch(
+        entry_point="train_ptddp.py",
+        role="SageMakerRole",
+        framework_version="1.13.1",
+        py_version="py38",
+        instance_count=2,
+        instance_type="ml.p4d.24xlarge",
+        distribution={
+            "torch_distributed": {
+                "enabled": True
+            }
+        }
+    )
+
+----

 .. _distributed-pytorch-training-on-trainium:

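The body of the ``mpirun`` example falls between the two hunks above and is not shown in this diff. As a point of reference only, a minimal sketch of such an estimator, assuming the SDK's ``pytorchddp`` distribution option and a ``framework_version`` matching the "PyTorch 1.12.0 and later" note; the elided lines of the actual file may differ:

.. code:: python

    from sagemaker.pytorch import PyTorch

    # Sketch only: assumes the ``pytorchddp`` option enables the
    # mpirun-based DDP backend; the elided example may differ.
    pt_estimator = PyTorch(
        entry_point="train_ptddp.py",
        role="SageMakerRole",
        framework_version="1.12.0",
        py_version="py38",
        instance_count=2,
        instance_type="ml.p4d.24xlarge",
        distribution={
            "pytorchddp": {
                "enabled": True
            }
        }
    )

    # Launch the job; the S3 URI is the placeholder used in the
    # original doc, in the ``fit()`` call this commit removes.
    pt_estimator.fit("s3://bucket/path/to/training/data")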
@@ -316,14 +341,14 @@ with the ``torch_distributed`` option as the distribution strategy.
 .. note::

   This ``torch_distributed`` support is available
-  in the AWS Deep Learning Containers for PyTorch Neuron starting v1.11.0 and other gpu instances starting v1.13.1.
+  in the AWS Deep Learning Containers for PyTorch Neuron starting v1.11.0.
   To find a complete list of supported versions of PyTorch Neuron, see
   `Neuron Containers <https://github.com/aws/deep-learning-containers/blob/master/available_images.md#neuron-containers>`_
   in the *AWS Deep Learning Containers GitHub repository*.

 .. note::

-  SageMaker Debugger is currently not supported with Trn1 instances.
+  SageMaker Debugger is not compatible with Trn1 instances.

 Adapt Your Training Script to Initialize with the XLA backend
 -------------------------------------------------------------
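The heading that closes this hunk introduces the XLA initialization step for training on Trn1 instances. For orientation, a minimal sketch of that step, following the standard PyTorch/XLA convention (not part of this commit; the section body may differ):

.. code:: python

    import torch.distributed as dist

    # Importing this module registers the "xla" backend with
    # torch.distributed (PyTorch/XLA convention).
    import torch_xla.distributed.xla_backend

    # Initialize the process group with the XLA backend instead of
    # NCCL/Gloo so collectives run on the Trainium XLA devices.
    dist.init_process_group("xla")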
