
Commit 2c618a5

Merge pull request #1 from mchoi8739/doc-torchrun-update
update the doc with my suggestions
2 parents a1a93d7 + 57a4dfb

File tree

2 files changed: +42 -10 lines changed


doc/frameworks/pytorch/using_pytorch.rst

Lines changed: 38 additions & 6 deletions
@@ -196,6 +196,7 @@ fit Optional Arguments
 - ``logs``: Defaults to True, whether to show logs produced by training
 job in the Python session. Only meaningful when wait is True.
 
+----
 
 Distributed PyTorch Training
 ============================
@@ -262,15 +263,19 @@ during the PyTorch DDP initialization.
 
 .. note::
 
-The SageMaker PyTorch estimator can operates both ``mpirun`` and ``torchrun`` in the backend for distributed training.
+The SageMaker PyTorch estimator can operate both ``mpirun`` (for PyTorch 1.12.0 and later)
+and ``torchrun``
+(for PyTorch 1.13.1 and later) in the backend for distributed training.
 
 For more information about setting up PyTorch DDP in your training script,
 see `Getting Started with Distributed Data Parallel
 <https://pytorch.org/tutorials/intermediate/ddp_tutorial.html>`_ in the
 PyTorch documentation.
 
-The following example shows how to run a PyTorch DDP training in SageMaker
-using two ``ml.p4d.24xlarge`` instances:
+The following examples show how to set a PyTorch estimator
+to run a distributed training job on two ``ml.p4d.24xlarge`` instances.
+
+**Using PyTorch DDP with the mpirun backend**
 
 .. code:: python
 
@@ -290,7 +295,34 @@ using two ``ml.p4d.24xlarge`` instances:
         }
     )
 
-    pt_estimator.fit("s3://bucket/path/to/training/data")
+**Using PyTorch DDP with the torchrun backend**
+
+.. code:: python
+
+    from sagemaker.pytorch import PyTorch
+
+    pt_estimator = PyTorch(
+        entry_point="train_ptddp.py",
+        role="SageMakerRole",
+        framework_version="1.13.1",
+        py_version="py38",
+        instance_count=2,
+        instance_type="ml.p4d.24xlarge",
+        distribution={
+            "torch_distributed": {
+                "enabled": True
+            }
+        }
+    )
+
+
+.. note::
+
+For more information about setting up ``torchrun`` in your training script,
+see `torchrun (Elastic Launch) <https://pytorch.org/docs/stable/elastic/run.html>`_ in the
+*PyTorch documentation*.
+
+----
 
 .. _distributed-pytorch-training-on-trainium:
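
The hunk above shows only the closing lines of the mpirun-backend example and removes the trailing ``pt_estimator.fit()`` call. For context, here is a minimal sketch of what the full mpirun-backend estimator likely looks like; the ``pytorchddp`` distribution key and the version numbers are assumptions inferred from the "PyTorch 1.12.0 and later" note, not text taken from this commit:

.. code:: python

    from sagemaker.pytorch import PyTorch

    # Sketch of the mpirun-backend (PyTorch DDP) example whose closing braces
    # appear as context in the hunk above. The "pytorchddp" key and the
    # framework/Python versions are assumptions, not part of this commit.
    pt_estimator = PyTorch(
        entry_point="train_ptddp.py",
        role="SageMakerRole",
        framework_version="1.12.0",
        py_version="py38",
        instance_count=2,
        instance_type="ml.p4d.24xlarge",
        distribution={
            "pytorchddp": {
                "enabled": True
            }
        }
    )

    # The fit() call removed by this hunk is what starts the training job:
    pt_estimator.fit("s3://bucket/path/to/training/data")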

@@ -316,14 +348,14 @@ with the ``torch_distributed`` option as the distribution strategy.
 .. note::
 
 This ``torch_distributed`` support is available
-in the AWS Deep Learning Containers for PyTorch Neuron starting v1.11.0 and other gpu instances starting v1.13.1.
+in the AWS Deep Learning Containers for PyTorch Neuron starting v1.11.0.
 To find a complete list of supported versions of PyTorch Neuron, see
 `Neuron Containers <https://github.com/aws/deep-learning-containers/blob/master/available_images.md#neuron-containers>`_
 in the *AWS Deep Learning Containers GitHub repository*.
 
 .. note::
 
-SageMaker Debugger is currently not supported with Trn1 instances.
+SageMaker Debugger is not compatible with Trn1 instances.
 
 Adapt Your Training Script to Initialize with the XLA backend
 -------------------------------------------------------------
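
Taken together, the changes to ``using_pytorch.rst`` document launching the same ``train_ptddp.py`` entry point with either the mpirun or the torchrun backend. As a rough illustration only (not part of this commit), such a GPU entry point typically initializes DDP from the environment variables that ``torchrun`` exports, along these lines:

.. code:: python

    # Minimal DDP entry-point sketch; illustrative only and not part of this
    # commit. torchrun exports RANK, LOCAL_RANK, and WORLD_SIZE, which
    # init_process_group() picks up through the default env:// method.
    import os

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP


    def main():
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])  # GPU index on this node
        torch.cuda.set_device(local_rank)

        model = torch.nn.Linear(10, 1).to(local_rank)
        ddp_model = DDP(model, device_ids=[local_rank])
        # ... training loop and checkpointing with ddp_model ...

        dist.destroy_process_group()


    if __name__ == "__main__":
        main()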

src/sagemaker/pytorch/estimator.py

Lines changed: 4 additions & 4 deletions
@@ -171,7 +171,8 @@ def __init__(
 To learn more, see `Distributed PyTorch Training
 <https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#distributed-pytorch-training>`_.
 
-**To enable Torch Distributed (for Trainium instances only):**
+**To enable Torch Distributed:**
+This is available for general distributed training on GPU instances from PyTorch v1.13.1 and later.
 
 .. code:: python
 
@@ -181,6 +182,7 @@ def __init__(
 }
 }
 
+This option also supports distributed training on Trn1.
 To learn more, see `Distributed PyTorch Training on Trainium
 <https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#distributed-pytorch-training-on-trainium>`_.
 
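
The added docstring line states that the ``torch_distributed`` option also supports Trn1. A minimal sketch of what that looks like from the estimator side, with the entry point name, framework and Python versions, and instance type chosen purely for illustration (they are not part of this commit):

.. code:: python

    from sagemaker.pytorch import PyTorch

    # Illustrative Trn1 job using the torch_distributed option; the entry
    # point name, versions, and instance type below are assumptions.
    pt_estimator = PyTorch(
        entry_point="train_trn1.py",
        role="SageMakerRole",
        framework_version="1.11.0",  # PyTorch Neuron DLC support starts at v1.11.0 per the doc note
        py_version="py38",
        instance_count=1,
        instance_type="ml.trn1.2xlarge",
        distribution={
            "torch_distributed": {
                "enabled": True
            }
        }
    )

    pt_estimator.fit("s3://bucket/path/to/training/data")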
@@ -210,9 +212,7 @@ def __init__(
 To learn more, see `Training with parameter servers
 <https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html#training-with-parameter-servers>`_.
 
-**To enable distributed training with
-`SageMaker Training Compiler <https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler.html>`_
-for PyTorch:**
+**To enable distributed training with SageMaker Training Compiler:**
 
 .. code:: python
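
The rewritten heading above introduces the Training Compiler example, but the code block itself sits below the visible context. As a hedged sketch only: the usual pattern pairs ``compiler_config=TrainingCompilerConfig()`` with an XLA distribution entry. Treat the ``pytorchxla`` key name, versions, and instance type here as assumptions, not text from this commit:

.. code:: python

    from sagemaker.pytorch import PyTorch, TrainingCompilerConfig

    # Hedged sketch of distributed training with SageMaker Training Compiler.
    # The "pytorchxla" key, versions, and instance type are assumptions.
    pt_estimator = PyTorch(
        entry_point="train.py",
        role="SageMakerRole",
        framework_version="1.13.1",
        py_version="py39",
        instance_count=2,
        instance_type="ml.p4d.24xlarge",
        compiler_config=TrainingCompilerConfig(),
        distribution={
            "pytorchxla": {
                "enabled": True
            }
        }
    )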

0 commit comments
