documentation: doc fix #3435

Merged 3 commits on Oct 25, 2022

101 changes: 0 additions & 101 deletions doc/frameworks/pytorch/using_pytorch.rst
@@ -293,107 +293,6 @@ using two ``ml.p4d.24xlarge`` instances:

pt_estimator.fit("s3://bucket/path/to/training/data")

.. _distributed-pytorch-training-on-trainium:

Distributed PyTorch Training on Trainium
========================================

SageMaker Training on Trainium instances now supports ``xla``-based
distributed training through ``torchrun``. With this, you do not need to manually pass
``RANK``, ``WORLD_SIZE``, ``MASTER_ADDR``, and ``MASTER_PORT``. You can launch the training
job using the :class:`sagemaker.pytorch.estimator.PyTorch` estimator class
with the ``torch_distributed`` option as the distribution strategy.

.. note::

   This ``torch_distributed`` support is available
   in the SageMaker Trainium (trn1) PyTorch Deep Learning Containers starting with v1.11.0.
   To find a complete list of supported versions of PyTorch Neuron, see `Neuron Containers <https://github.com/aws/deep-learning-containers/blob/master/available_images.md#neuron-containers>`_ in the *AWS Deep Learning Containers GitHub repository*.

SageMaker Debugger and Profiler are currently not supported with Trainium instances.

Adapt Your Training Script to Initialize with the XLA backend
-------------------------------------------------------------

To initialize distributed training in your script, call
`torch.distributed.init_process_group
<https://pytorch.org/docs/master/distributed.html#torch.distributed.init_process_group>`_
with the ``xla`` backend as shown below.

.. code:: python

    import torch.distributed as dist
    import torch_xla.distributed.xla_backend  # registers the 'xla' backend with torch.distributed

    dist.init_process_group('xla')

SageMaker takes care of ``MASTER_ADDR`` and ``MASTER_PORT`` for you through ``torchrun``.
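
As a quick illustration (a minimal sketch, assuming the process group has
already been initialized as shown above), each worker can also read the
values that ``torchrun`` exports:

.. code:: python

    import os

    import torch.distributed as dist

    # torchrun exports RANK and WORLD_SIZE for every worker, so the script
    # can read them directly instead of computing them itself.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # The same values are available from the initialized process group.
    assert rank == dist.get_rank()
    assert world_size == dist.get_world_size()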

For detailed documentation about modifying your training script for Trainium, see `Multi-worker data-parallel MLP training using torchrun <https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/mlp.html?highlight=torchrun#multi-worker-data-parallel-mlp-training-using-torchrun>`_ in the *AWS Neuron Documentation*.
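
The sketch below is a hypothetical end-to-end skeleton loosely following the
pattern in that tutorial; the model, data, and hyperparameters are placeholders
only, and real scripts should follow the *AWS Neuron Documentation*:

.. code:: python

    import torch
    import torch.distributed as dist
    import torch_xla.core.xla_model as xm
    import torch_xla.distributed.xla_backend  # registers the 'xla' backend


    def train():
        dist.init_process_group('xla')
        device = xm.xla_device()  # XLA device backed by a NeuronCore

        # Placeholder model, loss, and synthetic data.
        model = torch.nn.Linear(32, 2).to(device)
        ddp_model = torch.nn.parallel.DistributedDataParallel(model)
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
        loss_fn = torch.nn.CrossEntropyLoss()

        for _ in range(10):
            inputs = torch.randn(8, 32).to(device)
            labels = torch.randint(0, 2, (8,)).to(device)

            optimizer.zero_grad()
            loss = loss_fn(ddp_model(inputs), labels)
            loss.backward()
            optimizer.step()
            xm.mark_step()  # materialize the lazily traced XLA graph


    if __name__ == "__main__":
        train()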

**Currently supported backends:**

- ``xla`` for Trainium (Trn1) instances

For up-to-date information on supported backends for Trainium instances, see `AWS Neuron Documentation <https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html>`_.

Launching a Distributed Training Job on Trainium
------------------------------------------------

You can run multi-node distributed PyTorch training jobs on Trainium instances using the
:class:`sagemaker.pytorch.estimator.PyTorch` estimator class.
With ``instance_count=1``, the estimator submits a
single-node training job to SageMaker; with ``instance_count`` greater
than one, a multi-node training job is launched.

With the ``torch_distributed`` option, the SageMaker PyTorch estimator runs a SageMaker
training container for PyTorch Neuron, sets up the environment, and launches
the training job by running the ``torchrun`` command on each worker with the configuration you specify.

**Examples**

The following examples show how to run a PyTorch training job using ``torch_distributed`` in SageMaker
on one ``ml.trn1.2xlarge`` instance and on two ``ml.trn1.32xlarge`` instances:

.. code:: python

    from sagemaker.pytorch import PyTorch

    # Single-node training job on one ml.trn1.2xlarge instance
    pt_estimator = PyTorch(
        entry_point="train_ptddp.py",
        role="SageMakerRole",
        framework_version="1.11.0",
        py_version="py38",
        instance_count=1,
        instance_type="ml.trn1.2xlarge",
        distribution={
            "torch_distributed": {
                "enabled": True
            }
        }
    )

    pt_estimator.fit("s3://bucket/path/to/training/data")

.. code:: python

    from sagemaker.pytorch import PyTorch

    # Multi-node training job on two ml.trn1.32xlarge instances
    pt_estimator = PyTorch(
        entry_point="train_ptddp.py",
        role="SageMakerRole",
        framework_version="1.11.0",
        py_version="py38",
        instance_count=2,
        instance_type="ml.trn1.32xlarge",
        distribution={
            "torch_distributed": {
                "enabled": True
            }
        }
    )

    pt_estimator.fit("s3://bucket/path/to/training/data")

*********************
Deploy PyTorch Models
*********************