Commit 2a4beb6

Merge branch 'master' into xgb-1.7-1_launch
2 parents: 4cfebe4 + ca1e535

File tree: 4 files changed (+60, -11 lines)

CHANGELOG.md

Lines changed: 17 additions & 0 deletions
@@ -1,5 +1,22 @@
 # Changelog
 
+## v2.136.0 (2023-03-09)
+
+### Features
+
+ * with_feature_group [feature_store]
+ * Djl Large Model Support
+ * Decouple model.right_size() from model registry
+
+### Bug Fixes and Other Changes
+
+ * Fix integration test error in test_default_right_size_and_deploy_unregistered_base_model
+ * Add djl 0.21.0 dlc images
+
+### Documentation Changes
+
+ * Torchrun gpu support documentation change
+
 ## v2.135.1.post0 (2023-03-02)
 
 ### Documentation Changes

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-2.135.2.dev0
+2.136.1.dev0

doc/frameworks/pytorch/using_pytorch.rst

Lines changed: 36 additions & 6 deletions
@@ -196,6 +196,7 @@ fit Optional Arguments
 - ``logs``: Defaults to True, whether to show logs produced by training
   job in the Python session. Only meaningful when wait is True.
 
+----
 
 Distributed PyTorch Training
 ============================
@@ -262,16 +263,18 @@ during the PyTorch DDP initialization.
 
 .. note::
 
-    The SageMaker PyTorch estimator operates ``mpirun`` in the backend.
-    It doesn’t use ``torchrun`` for distributed training.
+    The SageMaker PyTorch estimator can operate both ``mpirun`` (for PyTorch 1.12.0 and later)
+    and ``torchrun`` (for PyTorch 1.13.1 and later) in the backend for distributed training.
 
     For more information about setting up PyTorch DDP in your training script,
     see `Getting Started with Distributed Data Parallel
     <https://pytorch.org/tutorials/intermediate/ddp_tutorial.html>`_ in the
     PyTorch documentation.
 
-The following example shows how to run a PyTorch DDP training in SageMaker
-using two ``ml.p4d.24xlarge`` instances:
+The following examples show how to set a PyTorch estimator
+to run a distributed training job on two ``ml.p4d.24xlarge`` instances.
+
+**Using PyTorch DDP with the mpirun backend**
 
 .. code:: python
 
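The body of the mpirun-backend example is unchanged by this commit, so the diff only shows its closing lines as context in the next hunk. As a hedged sketch, assuming the example uses the SDK's ``pytorchddp`` distribution option and the same ``train_ptddp.py`` entry point as the torchrun example, it looks roughly like this (versions and instance settings are assumptions, not part of this diff):

.. code:: python

    from sagemaker.pytorch import PyTorch

    # Sketch of the mpirun-backed PyTorch DDP configuration; the
    # "pytorchddp" key and the version/instance values are assumed,
    # not taken from this diff.
    pt_estimator = PyTorch(
        entry_point="train_ptddp.py",
        role="SageMakerRole",
        framework_version="1.12.0",
        py_version="py38",
        instance_count=2,
        instance_type="ml.p4d.24xlarge",
        distribution={
            "pytorchddp": {
                "enabled": True
            }
        }
    )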
@@ -291,7 +294,34 @@ using two ``ml.p4d.24xlarge`` instances:
         }
     )
 
-    pt_estimator.fit("s3://bucket/path/to/training/data")
+**Using PyTorch DDP with the torchrun backend**
+
+.. code:: python
+
+    from sagemaker.pytorch import PyTorch
+
+    pt_estimator = PyTorch(
+        entry_point="train_ptddp.py",
+        role="SageMakerRole",
+        framework_version="1.13.1",
+        py_version="py38",
+        instance_count=2,
+        instance_type="ml.p4d.24xlarge",
+        distribution={
+            "torch_distributed": {
+                "enabled": True
+            }
+        }
+    )
+
+
+.. note::
+
+    For more information about setting up ``torchrun`` in your training script,
+    see `torchrun (Elastic Launch) <https://pytorch.org/docs/stable/elastic/run.html>`_ in *the
+    PyTorch documentation*.
+
+----
 
 .. _distributed-pytorch-training-on-trainium:

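The note added above points readers to the ``torchrun`` documentation for adapting the training script. A minimal sketch of what a ``train_ptddp.py`` launched by ``torchrun`` can look like (not part of this commit; it relies on the ``RANK``, ``LOCAL_RANK``, and ``WORLD_SIZE`` environment variables that the launcher sets):

.. code:: python

    import os

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP


    def main():
        # torchrun exports RANK, LOCAL_RANK, and WORLD_SIZE, so the process
        # group can be initialized without explicit rank arguments.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        model = torch.nn.Linear(10, 1).to(local_rank)  # placeholder model
        ddp_model = DDP(model, device_ids=[local_rank])

        # ... run the usual training loop with ddp_model ...

        dist.destroy_process_group()


    if __name__ == "__main__":
        main()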
@@ -324,7 +354,7 @@ with the ``torch_distributed`` option as the distribution strategy.
 
 .. note::
 
-    SageMaker Debugger is currently not supported with Trn1 instances.
+    SageMaker Debugger is not compatible with Trn1 instances.
 
 Adapt Your Training Script to Initialize with the XLA backend
 -------------------------------------------------------------

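The trailing context above points to the section on initializing the training script with the XLA backend on Trn1. A minimal sketch, assuming the ``torch_xla``/``torch-neuronx`` packages shipped in the Trainium training containers (not part of this commit):

.. code:: python

    import torch.distributed as dist
    import torch_xla.core.xla_model as xm
    import torch_xla.distributed.xla_backend  # registers the "xla" backend


    def main():
        # The torch_distributed launcher prepares the process environment;
        # the training script only selects the XLA backend and device.
        dist.init_process_group("xla")
        device = xm.xla_device()
        # ... build the model on `device` and run the training loop,
        # stepping with xm.optimizer_step(optimizer) ...


    if __name__ == "__main__":
        main()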
src/sagemaker/pytorch/estimator.py

Lines changed: 6 additions & 4 deletions
@@ -171,7 +171,10 @@ def __init__(
 To learn more, see `Distributed PyTorch Training
 <https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#distributed-pytorch-training>`_.
 
-**To enable Torch Distributed (for Trainium instances only):**
+**To enable Torch Distributed:**
+
+    This is available for general distributed training on
+    GPU instances from PyTorch v1.13.1 and later.
 
 .. code:: python
 
@@ -181,6 +184,7 @@ def __init__(
         }
     }
 
+This option also supports distributed training on Trn1.
 To learn more, see `Distributed PyTorch Training on Trainium
 <https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#distributed-pytorch-training-on-trainium>`_.
 
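For the Trn1 support mentioned in the added line, a hedged sketch of the corresponding estimator configuration; the entry point, framework/Python versions, and instance size are illustrative assumptions, not taken from this diff:

.. code:: python

    from sagemaker.pytorch import PyTorch

    pt_estimator = PyTorch(
        entry_point="train_torch_distributed.py",  # hypothetical script name
        role="SageMakerRole",
        framework_version="1.13.1",  # assumed Trainium-capable version
        py_version="py39",
        instance_count=2,
        instance_type="ml.trn1.32xlarge",
        distribution={
            "torch_distributed": {
                "enabled": True
            }
        },
    )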
@@ -210,9 +214,7 @@ def __init__(
 To learn more, see `Training with parameter servers
 <https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html#training-with-parameter-servers>`_.
 
-**To enable distributed training with
-`SageMaker Training Compiler <https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler.html>`_
-for PyTorch:**
+**To enable distributed training with SageMaker Training Compiler:**
 
 .. code:: python
 

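The rewritten heading above refers to SageMaker Training Compiler for PyTorch, which the SDK enables through the estimator's ``compiler_config`` argument. A minimal sketch; the entry point, versions, and instance type are illustrative assumptions:

.. code:: python

    from sagemaker.pytorch import PyTorch, TrainingCompilerConfig

    pt_estimator = PyTorch(
        entry_point="train.py",  # hypothetical entry point
        role="SageMakerRole",
        framework_version="1.13.1",  # assumed compiler-supported version
        py_version="py39",
        instance_count=1,
        instance_type="ml.p3.2xlarge",
        compiler_config=TrainingCompilerConfig(),
    )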