Commit 2a4beb6

Merge branch 'master' into xgb-1.7-1_launch
2 parents: 4cfebe4 + ca1e535

File tree: 4 files changed (+60, -11 lines)

CHANGELOG.md

Lines changed: 17 additions & 0 deletions
@@ -1,5 +1,22 @@
 # Changelog
 
+## v2.136.0 (2023-03-09)
+
+### Features
+
+ * with_feature_group [feature_store]
+ * Djl Large Model Support
+ * Decouple model.right_size() from model registry
+
+### Bug Fixes and Other Changes
+
+ * Fix integration test error in test_default_right_size_and_deploy_unregistered_base_model
+ * Add djl 0.21.0 dlc images
+
+### Documentation Changes
+
+ * Torchrun gpu support documentation change
+
 ## v2.135.1.post0 (2023-03-02)
 
 ### Documentation Changes

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-2.135.2.dev0
+2.136.1.dev0

doc/frameworks/pytorch/using_pytorch.rst

Lines changed: 36 additions & 6 deletions
@@ -196,6 +196,7 @@ fit Optional Arguments
 - ``logs``: Defaults to True, whether to show logs produced by training
   job in the Python session. Only meaningful when wait is True.
 
+----
 
 Distributed PyTorch Training
 ============================
@@ -262,16 +263,18 @@ during the PyTorch DDP initialization.
 
 .. note::
 
-    The SageMaker PyTorch estimator operates ``mpirun`` in the backend.
-    It doesn’t use ``torchrun`` for distributed training.
+    The SageMaker PyTorch estimator can operate both ``mpirun`` (for PyTorch 1.12.0 and later)
+    and ``torchrun`` (for PyTorch 1.13.1 and later) in the backend for distributed training.
 
     For more information about setting up PyTorch DDP in your training script,
     see `Getting Started with Distributed Data Parallel
     <https://pytorch.org/tutorials/intermediate/ddp_tutorial.html>`_ in the
     PyTorch documentation.
 
-The following example shows how to run a PyTorch DDP training in SageMaker
-using two ``ml.p4d.24xlarge`` instances:
+The following examples show how to set a PyTorch estimator
+to run a distributed training job on two ``ml.p4d.24xlarge`` instances.
+
+**Using PyTorch DDP with the mpirun backend**
 
 .. code:: python
 
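The body of the mpirun-backend example is unchanged by this commit, so the diff only shows its closing lines as context in the next hunk. As a hedged sketch, assuming the example uses the SDK's ``pytorchddp`` distribution option and the same ``train_ptddp.py`` entry point as the torchrun example, it looks roughly like this (versions and instance settings are assumptions, not part of this diff):

.. code:: python

    from sagemaker.pytorch import PyTorch

    # Sketch of the mpirun-backed PyTorch DDP configuration; the
    # "pytorchddp" key and the version/instance values are assumed,
    # not taken from this diff.
    pt_estimator = PyTorch(
        entry_point="train_ptddp.py",
        role="SageMakerRole",
        framework_version="1.12.0",
        py_version="py38",
        instance_count=2,
        instance_type="ml.p4d.24xlarge",
        distribution={
            "pytorchddp": {
                "enabled": True
            }
        }
    )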
@@ -291,7 +294,34 @@ using two ``ml.p4d.24xlarge`` instances:
         }
     )
 
-    pt_estimator.fit("s3://bucket/path/to/training/data")
+**Using PyTorch DDP with the torchrun backend**
+
+.. code:: python
+
+    from sagemaker.pytorch import PyTorch
+
+    pt_estimator = PyTorch(
+        entry_point="train_ptddp.py",
+        role="SageMakerRole",
+        framework_version="1.13.1",
+        py_version="py38",
+        instance_count=2,
+        instance_type="ml.p4d.24xlarge",
+        distribution={
+            "torch_distributed": {
+                "enabled": True
+            }
+        }
+    )
+
+
+.. note::
+
+    For more information about setting up ``torchrun`` in your training script,
+    see `torchrun (Elastic Launch) <https://pytorch.org/docs/stable/elastic/run.html>`_ in *the
+    PyTorch documentation*.
+
+----
 
 .. _distributed-pytorch-training-on-trainium:

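The note added above points readers to the ``torchrun`` documentation for adapting the training script. A minimal sketch of what a ``train_ptddp.py`` launched by ``torchrun`` can look like (not part of this commit; it relies on the ``RANK``, ``LOCAL_RANK``, and ``WORLD_SIZE`` environment variables that the launcher sets):

.. code:: python

    import os

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP


    def main():
        # torchrun exports RANK, LOCAL_RANK, and WORLD_SIZE, so the process
        # group can be initialized without explicit rank arguments.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        model = torch.nn.Linear(10, 1).to(local_rank)  # placeholder model
        ddp_model = DDP(model, device_ids=[local_rank])

        # ... run the usual training loop with ddp_model ...

        dist.destroy_process_group()


    if __name__ == "__main__":
        main()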
@@ -324,7 +354,7 @@ with the ``torch_distributed`` option as the distribution strategy.
 
 .. note::
 
-    SageMaker Debugger is currently not supported with Trn1 instances.
+    SageMaker Debugger is not compatible with Trn1 instances.
 
 Adapt Your Training Script to Initialize with the XLA backend
 -------------------------------------------------------------

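The trailing context above points to the section on initializing the training script with the XLA backend on Trn1. A minimal sketch, assuming the ``torch_xla``/``torch-neuronx`` packages shipped in the Trainium training containers (not part of this commit):

.. code:: python

    import torch.distributed as dist
    import torch_xla.core.xla_model as xm
    import torch_xla.distributed.xla_backend  # registers the "xla" backend


    def main():
        # The torch_distributed launcher prepares the process environment;
        # the training script only selects the XLA backend and device.
        dist.init_process_group("xla")
        device = xm.xla_device()
        # ... build the model on `device` and run the training loop,
        # stepping with xm.optimizer_step(optimizer) ...


    if __name__ == "__main__":
        main()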
src/sagemaker/pytorch/estimator.py

Lines changed: 6 additions & 4 deletions
@@ -171,7 +171,10 @@ def __init__(
 To learn more, see `Distributed PyTorch Training
 <https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#distributed-pytorch-training>`_.
 
-**To enable Torch Distributed (for Trainium instances only):**
+**To enable Torch Distributed:**
+
+    This is available for general distributed training on
+    GPU instances from PyTorch v1.13.1 and later.
 
 .. code:: python
 
@@ -181,6 +184,7 @@ def __init__(
         }
     }
 
+This option also supports distributed training on Trn1.
 To learn more, see `Distributed PyTorch Training on Trainium
 <https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#distributed-pytorch-training-on-trainium>`_.
 
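For the Trn1 support mentioned in the added line, a hedged sketch of the corresponding estimator configuration; the entry point, framework/Python versions, and instance size are illustrative assumptions, not taken from this diff:

.. code:: python

    from sagemaker.pytorch import PyTorch

    pt_estimator = PyTorch(
        entry_point="train_torch_distributed.py",  # hypothetical script name
        role="SageMakerRole",
        framework_version="1.13.1",  # assumed Trainium-capable version
        py_version="py39",
        instance_count=2,
        instance_type="ml.trn1.32xlarge",
        distribution={
            "torch_distributed": {
                "enabled": True
            }
        },
    )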
@@ -210,9 +214,7 @@ def __init__(
 To learn more, see `Training with parameter servers
 <https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html#training-with-parameter-servers>`_.
 
-**To enable distributed training with
-`SageMaker Training Compiler <https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler.html>`_
-for PyTorch:**
+**To enable distributed training with SageMaker Training Compiler:**
 
 .. code:: python
 

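The rewritten heading above refers to SageMaker Training Compiler for PyTorch, which the SDK enables through the estimator's ``compiler_config`` argument. A minimal sketch; the entry point, versions, and instance type are illustrative assumptions:

.. code:: python

    from sagemaker.pytorch import PyTorch, TrainingCompilerConfig

    pt_estimator = PyTorch(
        entry_point="train.py",  # hypothetical entry point
        role="SageMakerRole",
        framework_version="1.13.1",  # assumed compiler-supported version
        py_version="py39",
        instance_count=1,
        instance_type="ml.p3.2xlarge",
        compiler_config=TrainingCompilerConfig(),
    )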