
Commit ad4a92a

yl-tomchoi8739 authored
documentation: Torchrun gpu support documentation change (#3698)
Co-authored-by: Miyoung <[email protected]>
Co-authored-by: Miyoung Choi <[email protected]>
1 parent 9e5dca6 · commit ad4a92a

File tree

2 files changed: +42, −10 lines changed


doc/frameworks/pytorch/using_pytorch.rst

Lines changed: 36 additions & 6 deletions
@@ -196,6 +196,7 @@ fit Optional Arguments
 - ``logs``: Defaults to True, whether to show logs produced by training
   job in the Python session. Only meaningful when wait is True.
 
+----
 
 Distributed PyTorch Training
 ============================

@@ -262,16 +263,18 @@ during the PyTorch DDP initialization.
 
 .. note::
 
-  The SageMaker PyTorch estimator operates ``mpirun`` in the backend.
-  It doesn’t use ``torchrun`` for distributed training.
+  The SageMaker PyTorch estimator can operate both ``mpirun`` (for PyTorch 1.12.0 and later)
+  and ``torchrun`` (for PyTorch 1.13.1 and later) in the backend for distributed training.
 
 For more information about setting up PyTorch DDP in your training script,
 see `Getting Started with Distributed Data Parallel
 <https://pytorch.org/tutorials/intermediate/ddp_tutorial.html>`_ in the
 PyTorch documentation.
 
-The following example shows how to run a PyTorch DDP training in SageMaker
-using two ``ml.p4d.24xlarge`` instances:
+The following examples show how to set a PyTorch estimator
+to run a distributed training job on two ``ml.p4d.24xlarge`` instances.
+
+**Using PyTorch DDP with the mpirun backend**
 
 .. code:: python
 
@@ -291,7 +294,34 @@ using two ``ml.p4d.24xlarge`` instances:
         }
     )
 
-    pt_estimator.fit("s3://bucket/path/to/training/data")
+**Using PyTorch DDP with the torchrun backend**
+
+.. code:: python
+
+    from sagemaker.pytorch import PyTorch
+
+    pt_estimator = PyTorch(
+        entry_point="train_ptddp.py",
+        role="SageMakerRole",
+        framework_version="1.13.1",
+        py_version="py38",
+        instance_count=2,
+        instance_type="ml.p4d.24xlarge",
+        distribution={
+            "torch_distributed": {
+                "enabled": True
+            }
+        }
+    )
+
+
+.. note::
+
+    For more information about setting up ``torchrun`` in your training script,
+    see `torchrun (Elastic Launch) <https://pytorch.org/docs/stable/elastic/run.html>`_ in *the
+    PyTorch documentation*.
+
+----
 
 .. _distributed-pytorch-training-on-trainium:
 
@@ -324,7 +354,7 @@ with the ``torch_distributed`` option as the distribution strategy.
 
 .. note::
 
-  SageMaker Debugger is currently not supported with Trn1 instances.
+  SageMaker Debugger is not compatible with Trn1 instances.
 
 Adapt Your Training Script to Initialize with the XLA backend
 -------------------------------------------------------------
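The examples in this diff point ``entry_point`` at a script named ``train_ptddp.py``, whose contents are not part of the commit. As a rough illustration only, a minimal DDP entry point that works under the ``torchrun``/``torch_distributed`` backend (which exports ``RANK``, ``LOCAL_RANK``, and ``WORLD_SIZE`` as environment variables) could look like the sketch below; the model, data, and hyperparameters are placeholders, not taken from the repository.

    # Hypothetical sketch of an entry point such as train_ptddp.py; not part of this commit.
    # torchrun (and SageMaker's torch_distributed backend) exports RANK, LOCAL_RANK, and
    # WORLD_SIZE, so the script only needs to initialize the default process group.
    import os

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP


    def main():
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # NCCL is the usual backend for multi-GPU training on p4d instances.
        dist.init_process_group(backend="nccl")

        model = torch.nn.Linear(10, 1).to(local_rank)  # placeholder model
        ddp_model = DDP(model, device_ids=[local_rank])

        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
        loss_fn = torch.nn.MSELoss()

        for _ in range(10):  # placeholder training loop with random data
            inputs = torch.randn(32, 10, device=local_rank)
            targets = torch.randn(32, 1, device=local_rank)
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(inputs), targets)
            loss.backward()
            optimizer.step()

        dist.destroy_process_group()


    if __name__ == "__main__":
        main()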

src/sagemaker/pytorch/estimator.py

Lines changed: 6 additions & 4 deletions
@@ -171,7 +171,10 @@ def __init__(
 To learn more, see `Distributed PyTorch Training
 <https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#distributed-pytorch-training>`_.
 
-**To enable Torch Distributed (for Trainium instances only):**
+**To enable Torch Distributed:**
+
+This is available for general distributed training on
+GPU instances from PyTorch v1.13.1 and later.
 
 .. code:: python
 
@@ -181,6 +184,7 @@ def __init__(
         }
     }
 
+This option also supports distributed training on Trn1.
 To learn more, see `Distributed PyTorch Training on Trainium
 <https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#distributed-pytorch-training-on-trainium>`_.
 
@@ -210,9 +214,7 @@ def __init__(
 To learn more, see `Training with parameter servers
 <https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html#training-with-parameter-servers>`_.
 
-**To enable distributed training with
-`SageMaker Training Compiler <https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler.html>`_
-for PyTorch:**
+**To enable distributed training with SageMaker Training Compiler:**
 
 .. code:: python
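For context, the ``torch_distributed`` dictionary described in this docstring is passed through the estimator's ``distribution`` argument. A minimal usage sketch, mirroring the ``using_pytorch.rst`` example from this same commit (the S3 path and instance settings are illustrative placeholders), might be:

    # Sketch of how the torch_distributed option from the updated docstring is used;
    # the instance type, framework version, and S3 path below are placeholders.
    from sagemaker.pytorch import PyTorch

    pt_estimator = PyTorch(
        entry_point="train_ptddp.py",
        role="SageMakerRole",
        framework_version="1.13.1",
        py_version="py38",
        instance_count=2,
        instance_type="ml.p4d.24xlarge",
        distribution={
            "torch_distributed": {
                "enabled": True
            }
        },
    )

    pt_estimator.fit("s3://bucket/path/to/training/data")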
