Commit 9a5e8bc

update distributed training doc
1 parent 5159c85 commit 9a5e8bc

File tree: 9 files changed (+208 −203 lines)

doc/api/training/distributed.rst

Lines changed: 8 additions & 2 deletions
@@ -4,14 +4,20 @@ SageMaker distributed training libraries offer both data parallel and model para
 They combine software and hardware technologies to improve inter-GPU and inter-node communications.
 They extend SageMaker’s training capabilities with built-in options that require only small code changes to your training scripts.

+.. _sdp_api_docs_toc:
+
 The SageMaker Distributed Data Parallel Library
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. toctree::
-   :maxdepth: 3
+   :maxdepth: 3

-   smd_data_parallel
+   smd_data_parallel
+   sdp_versions/latest
+   smd_data_parallel_use_sm_pysdk
+   smd_data_parallel_release_notes/smd_data_parallel_change_log

+.. _smp_api_docs_toc:

 The SageMaker Distributed Model Parallel Library
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

doc/api/training/sdp_versions/latest.rst

Lines changed: 9 additions & 7 deletions
@@ -1,8 +1,8 @@
 .. _sdp_api_docs:

-###############################################
-Use the Library's API to Adapt Training Scripts
-###############################################
+#############################################
+Use the Library to Adapt Your Training Script
+#############################################

 This section contains the SageMaker distributed data parallel API documentation.
 If you are a new user of this library, it is recommended you use this guide alongside
@@ -15,11 +15,13 @@ Select the latest or one of the previous versions of the API documentation
 depending on the version of the library you use.

 .. important::
+
    The distributed data parallel library supports training jobs using CUDA 11 or later.
-   When you define a SageMaker ``PyTorch`` or ``TensorFlow``
-   estimator with ``dataparallel`` parameter ``enabled`` set to ``True``,
-   it uses CUDA 11. When you extend or customize your own training image,
-   you must use a CUDA 11 base image. See
+   When you define a :class:`sagemaker.tensorflow.estimator.TensorFlow` or
+   :class:`sagemaker.pytorch.estimator.PyTorch`
+   estimator with the data parallel library enabled,
+   SageMaker uses CUDA 11. When you extend or customize your own training image,
+   you must use a base image with CUDA 11 or later. See
    `SageMaker Python SDK's distributed data parallel library APIs
    <https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-use-python-skd-api>`_
    for more information.

doc/api/training/sdp_versions/latest/smd_data_parallel_pytorch.rst

Lines changed: 20 additions & 134 deletions
@@ -11,154 +11,40 @@ data parallel library API for PyTorch.

 .. _pytorch-sdp-modify:

-Modify a PyTorch training script to use the SageMaker data parallel library
-===========================================================================
+Use the SageMaker Distributed Data Parallel Library as a Backend of ``torch.distributed``
+===========================================================================================

-The following steps show you how to convert a PyTorch training script to
-utilize SageMaker's distributed data parallel library.
-
-The distributed data parallel library works as a backend of the PyTorch distributed package.
-See `SageMaker distributed data parallel PyTorch examples <https://sagemaker-examples.readthedocs.io/en/latest/training/distributed_training/index.html#pytorch-distributed>`__
-for additional details on how to use the library.
-
-1. Import the SageMaker distributed data parallel library’s PyTorch client.
-
-   .. code:: python
-
-      import smdistributed.dataparallel.torch.torch_smddp
-
-2. Import the PyTorch distributed modules.
-
-   .. code:: python
-
-      import torch
-      import torch.distributed as dist
-      from torch.nn.parallel import DistributedDataParallel as DDP
-
-3. Set the backend of ``torch.distributed`` as ``smddp``.
-
-   .. code:: python
-
-      dist.init_process_group(backend='smddp')
-
-4. After parsing arguments and defining a batch size parameter
-   (for example, ``batch_size=args.batch_size``), add a two-line of code to
-   resize the batch size per worker (GPU). PyTorch's DataLoader operation
-   does not automatically handle the batch resizing for distributed training.
-
-   .. code:: python
-
-      batch_size //= dist.get_world_size()
-      batch_size = max(batch_size, 1)
-
-5. Pin each GPU to a single SageMaker data parallel library process with
-   ``local_rank``. This refers to the relative rank of the process within a given node.
-
-   You can retrieve the rank of the process from the ``LOCAL_RANK`` environment variable.
-
-   .. code:: python
-
-      import os
-      local_rank = os.environ["LOCAL_RANK"]
-      torch.cuda.set_device(local_rank)
-
-6. After defining a model, wrap it with the PyTorch DDP.
-
-   .. code:: python
-
-      model = ...
-
-      # Wrap the model with the PyTorch DistributedDataParallel API
-      model = DDP(model)
-
-7. When you call the ``torch.utils.data.distributed.DistributedSampler`` API,
-   specify the total number of processes (GPUs) participating in training across
-   all the nodes in the cluster. This is called ``world_size``, and you can retrieve
-   the number from the ``torch.distributed.get_world_size()`` API. Also, specify
-   the rank of each process among all processes using the ``torch.distributed.get_rank()`` API.
-
-   .. code:: python
-
-      train_sampler = DistributedSampler(
-          train_dataset,
-          num_replicas = dist.get_world_size(),
-          rank = dist.get_rank()
-      )
-
-8. Modify your script to save checkpoints only on the leader process (rank 0).
-   The leader process has a synchronized model. This also avoids other processes
-   overwriting the checkpoints and possibly corrupting the checkpoints.
-
-The following example code shows the structure of a PyTorch training script with DDP and smddp as the backend.
+To use the SageMaker distributed data parallel library,
+the only thing you need to do is to import the SageMaker distributed data
+parallel library’s PyTorch client (``smdistributed.dataparallel.torch.torch_smddp``).
+The client registers ``smddp`` as a backend for PyTorch.
+When you initialize the PyTorch distributed process group using
+the ``torch.distributed.init_process_group`` API,
+make sure you specify ``'smddp'`` to the backend argument.

 .. code:: python

-   import os
-   import torch
-
-   # SageMaker data parallel: Import the library PyTorch API
    import smdistributed.dataparallel.torch.torch_smddp
-
-   # SageMaker data parallel: Import PyTorch's distributed API
    import torch.distributed as dist
-   from torch.nn.parallel import DistributedDataParallel as DDP

-   # SageMaker data parallel: Initialize the process group
    dist.init_process_group(backend='smddp')

-   class Net(nn.Module):
-       ...
-       # Define model
-
-   def train(...):
-       ...
-       # Model training
-
-   def test(...):
-       ...
-       # Model evaluation
-
-   def main():
-
-       # SageMaker data parallel: Scale batch size by world size
-       batch_size //= dist.get_world_size()
-       batch_size = max(batch_size, 1)
-
-       # Prepare dataset
-       train_dataset = torchvision.datasets.MNIST(...)
-
-       # SageMaker data parallel: Set num_replicas and rank in DistributedSampler
-       train_sampler = torch.utils.data.distributed.DistributedSampler(
-           train_dataset,
-           num_replicas=dist.get_world_size(),
-           rank=dist.get_rank())
-
-       train_loader = torch.utils.data.DataLoader(..)
-
-       # SageMaker data parallel: Wrap the PyTorch model with the library's DDP
-       model = DDP(Net().to(device))

-       # SageMaker data parallel: Pin each GPU to a single library process.
-       local_rank = os.environ["LOCAL_RANK"]
-       torch.cuda.set_device(local_rank)
-       model.cuda(local_rank)
+If you already have a working PyTorch script and only need to add the
+backend specification, you can proceed to Using the SageMaker PyTorch Estimator
+in the Step 2: Launch a SageMaker Distributed Training Job Using the SageMaker Python SDK topic.

-       # Train
-       optimizer = optim.Adadelta(...)
-       scheduler = StepLR(...)
-       for epoch in range(1, args.epochs + 1):
-           train(...)
-           if rank == 0:
-               test(...)
-           scheduler.step()
+.. note::

-       # SageMaker data parallel: Save model on the leader node (rank 0).
-       if dist.get_rank() == 0:
-           torch.save(...)
+   The ``smddp`` backend currently does not support creating subprocess groups
+   with the ``torch.distributed.new_group()`` API.
+   You cannot use the ``smddp`` backend concurrently with other backends.

-   if __name__ == '__main__':
-       main()
+.. seealso::

+   If you still need to modify your training script to properly use
+   the PyTorch distributed package, see `Preparing a PyTorch Training Script for Distributed Training <https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-modify-sdp-pt.html>`_
+   in the *Amazon SageMaker Developer Guide*.

 .. _pytorch-sdp-api:

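For reference, the removed numbered steps above reduce to the following structure. This is a minimal sketch that reassembles them in one place; ``MyModel``, ``args.batch_size``, the dataset, and the training loop are hypothetical placeholders for your own code.

.. code:: python

   import os

   import torch
   import torch.distributed as dist
   from torch.nn.parallel import DistributedDataParallel as DDP
   from torch.utils.data.distributed import DistributedSampler

   # Importing the client registers 'smddp' as a torch.distributed backend.
   import smdistributed.dataparallel.torch.torch_smddp  # noqa: F401

   # Initialize the process group with the smddp backend.
   dist.init_process_group(backend='smddp')

   # Pin each GPU to a single library process using the local rank.
   local_rank = int(os.environ["LOCAL_RANK"])
   torch.cuda.set_device(local_rank)

   # Scale the per-worker batch size by the world size.
   batch_size = max(args.batch_size // dist.get_world_size(), 1)

   # Shard the dataset across all processes in the cluster.
   train_dataset = ...  # for example, torchvision.datasets.MNIST(...)
   train_sampler = DistributedSampler(
       train_dataset,
       num_replicas=dist.get_world_size(),
       rank=dist.get_rank(),
   )
   train_loader = torch.utils.data.DataLoader(
       train_dataset, batch_size=batch_size, sampler=train_sampler
   )

   # Wrap the model with PyTorch DDP; smddp handles the inter-GPU communication.
   model = DDP(MyModel().cuda(local_rank))

   # ... training loop over train_loader ...

   # Save checkpoints only on the leader process (rank 0) so that other
   # processes do not overwrite or corrupt the checkpoint.
   if dist.get_rank() == 0:
       torch.save(model.state_dict(), "model.pt")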

doc/api/training/smd_data_parallel.rst

Lines changed: 0 additions & 7 deletions
@@ -27,10 +27,3 @@ To learn more about the core features of this library, see
 `Introduction to SageMaker's Distributed Data Parallel Library
 <https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-intro.html>`_
 in the SageMaker Developer Guide.
-
-.. toctree::
-   :maxdepth: 3
-
-   sdp_versions/latest
-   smd_data_parallel_use_sm_pysdk
-   smd_data_parallel_release_notes/smd_data_parallel_change_log

doc/api/training/smd_data_parallel_release_notes/smd_data_parallel_change_log.rst

Lines changed: 39 additions & 4 deletions
@@ -10,20 +10,55 @@ distributed data parallel library.
 SageMaker Distributed Data Parallel 1.4.0 Release Notes
 =======================================================

-*Date: Feb. 18. 2022*
+*Date: Feb. 24. 2022*

 **New Features**

 * Integrated to PyTorch DDP as a backend option
 * Added support for PyTorch 1.10.2

-**Bug Fixes**
+**Breaking Changes**
+
+* As the library is migrated into the PyTorch distributed package as a backend,
+  the following smdistributed implementation APIs are deprecated in
+  the SageMaker data parallel library v1.4.0 and later.
+  Please use the `PyTorch distributed APIs <https://pytorch.org/docs/stable/distributed.html>`_ instead.
+
+  * ``smdistributed.dataparallel.torch.distributed``
+  * ``smdistributed.dataparallel.torch.parallel.DistributedDataParallel``
+* Please note the slight differences between the deprecated
+  ``smdistributed.dataparallel.torch`` APIs and the
+  `PyTorch distributed APIs <https://pytorch.org/docs/stable/distributed.html>`_.

-*
+  * `torch.distributed.barrier <https://pytorch.org/docs/master/distributed.html#torch.distributed.barrier)>`_
+    takes ``device_ids``, which the ``smddp`` backend does not support.
+  * The ``gradient_accumulation_steps`` option in
+    ``smdistributed.dataparallel.torch.parallel.DistributedDataParallel``
+    is no longer supported. Please use the PyTorch
+    `no_sync <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html?highlight=no_sync#torch.nn.parallel.DistributedDataParallel.no_sync>`_ API.
+
+
+* If you want to find documentation for the previous versions of the library
+  (v1.3.0 or before), see the `archived SageMaker distributed data parallel library documentation <https://sagemaker.readthedocs.io/en/stable/api/training/sdp_versions/latest.html#documentation-archive>`_.

 **Improvements**

-*
+* Support AllReduce Large Tensors
+* Support the following new arguments in the `PyTorch DDP class <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel>`_.
+
+  * ``broadcast_buffers``
+  * ``find_unused_parameters``
+  * ``gradient_as_bucket_view``
+
+**Bug Fixes**
+
+* Fixed stalling issues when training on ``ml.p3.16xlarge``.
+
+**Known Issues**
+
+* The library currently does not support the PyTorch sub-process groups API (`torch.distributed.new_group <https://pytorch.org/docs/stable/distributed.html#torch.distributed.new_group>`_).
+  This means that you cannot use the ``smddp`` backend concurrently with other
+  process group backends such as NCCL and Gloo.

 **Migration to AWS Deep Learning Containers**

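To illustrate the migration described in the breaking changes above, the following is a rough sketch of gradient accumulation using PyTorch's ``no_sync()`` context manager in place of the removed ``gradient_accumulation_steps`` option. The ``model`` (a DDP-wrapped module), ``optimizer``, ``loss_fn``, ``train_loader``, and the four-step accumulation window are assumed placeholders.

.. code:: python

   # Sketch only: accumulate gradients over several micro-batches and
   # synchronize them once per window with DistributedDataParallel.no_sync().
   accumulation_steps = 4  # hypothetical window size

   optimizer.zero_grad()
   for step, (inputs, targets) in enumerate(train_loader):
       if (step + 1) % accumulation_steps != 0:
           # Skip the gradient allreduce for intermediate micro-batches.
           with model.no_sync():
               loss = loss_fn(model(inputs), targets)
               loss.backward()
       else:
           # Gradients for the whole window are synchronized on this step.
           loss = loss_fn(model(inputs), targets)
           loss.backward()
           optimizer.step()
           optimizer.zero_grad()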

doc/api/training/smd_data_parallel_use_sm_pysdk.rst

Lines changed: 10 additions & 5 deletions
@@ -1,5 +1,5 @@
-Run a Distributed Training Job Using the SageMaker Python SDK
-=============================================================
+Launch a Distributed Training Job Using the SageMaker Python SDK
+================================================================

 To use the SageMaker distributed data parallel library with the SageMaker Python SDK,
 you will need the following:
@@ -18,13 +18,19 @@ you will need the following:
   inputs <https://sagemaker.readthedocs.io/en/stable/overview.html#use-file-systems-as-training-inputs>`__ documentation.

 When you define
-a Pytorch or TensorFlow ``Estimator`` using the SageMaker Python SDK,
-you must select ``dataparallel`` as your ``distribution`` strategy:
+a :class:`sagemaker.tensorflow.estimator.TensorFlow` or :class:`sagemaker.pytorch.estimator.PyTorch` estimator,
+you must select ``smdistributed`` and then ``dataparallel`` as your ``distribution`` strategy.

 .. code:: python

    distribution = { "smdistributed": { "dataparallel": { "enabled": True } } }

+.. seealso::
+
+   To learn more, see `Step 2: Launch a SageMaker Distributed Training Job Using the SageMaker Python SDK
+   <https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html>`_
+   in the *Amazon SageMaker Developer Guide*.
+
 We recommend you use one of the example notebooks as your template to launch a training job. When
 you use an example notebook you’ll need to swap your training script with the one that came with the
 notebook and modify any input functions as necessary. For instructions on how to get started using a
@@ -35,7 +41,6 @@ Once you have launched a training job, you can monitor it using CloudWatch. To l
 `Monitor and Analyze Training Jobs Using Metrics
 <https://docs.aws.amazon.com/sagemaker/latest/dg/training-metrics.html>`_.

-
 After you train a model, you can see how to deploy your trained model to an endpoint for inference by
 following one of the `example notebooks for deploying a model
 <https://sagemaker-examples.readthedocs.io/en/latest/inference/index.html>`_.
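As a quick illustration of the ``distribution`` setting above, the following is a sketch of a SageMaker ``PyTorch`` estimator with the data parallel library enabled. The entry point, IAM role, framework and Python versions, instance type and count, and the S3 input path are hypothetical values to replace with your own.

.. code:: python

   from sagemaker.pytorch import PyTorch

   estimator = PyTorch(
       entry_point="train.py",  # your training script
       role="arn:aws:iam::111122223333:role/SageMakerRole",  # example role ARN
       framework_version="1.10.2",
       py_version="py38",
       instance_count=2,
       instance_type="ml.p3.16xlarge",
       # Enable the SageMaker distributed data parallel library.
       distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
   )

   estimator.fit("s3://your-bucket/your-training-data")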

doc/api/training/smd_model_parallel_general.rst

Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
+.. _sm-sdk-modelparallel-general:
+
 #############################################################
 Run a Distributed Training Job Using the SageMaker Python SDK
 #############################################################
