
Commit 5159c85

polish doc style and structure

1 parent 5e4128e

8 files changed (+88, -63 lines)

doc/api/training/distributed.rst

Lines changed: 11 additions & 0 deletions
@@ -4,8 +4,19 @@ SageMaker distributed training libraries offer both data parallel and model para
 They combine software and hardware technologies to improve inter-GPU and inter-node communications.
 They extend SageMaker’s training capabilities with built-in options that require only small code changes to your training scripts.
 
+The SageMaker Distributed Data Parallel Library
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
 .. toctree::
    :maxdepth: 3
 
    smd_data_parallel
+
+
+The SageMaker Distributed Model Parallel Library
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. toctree::
+   :maxdepth: 3
+
    smd_model_parallel

doc/api/training/sdp_versions/latest.rst

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ Select the latest or one of the previous versions of the API documentation
 depending on the version of the library you use.
 
 .. important::
-   The distributed data parallel library only supports training jobs using CUDA 11 or later.
+   The distributed data parallel library supports training jobs using CUDA 11 or later.
    When you define a SageMaker ``PyTorch`` or ``TensorFlow``
    estimator with ``dataparallel`` parameter ``enabled`` set to ``True``,
    it uses CUDA 11. When you extend or customize your own training image,
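For context, the ``dataparallel`` flag this note refers to is passed through the estimator's ``distribution`` argument in the SageMaker Python SDK. The snippet below is a minimal sketch only; the entry point, role, framework version, and instance settings are illustrative placeholders, not part of this commit:

    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(
        entry_point="train.py",            # your adapted training script (placeholder name)
        role="<your-iam-role-arn>",
        framework_version="1.11.0",        # placeholder; use a version the library supports
        py_version="py38",
        instance_count=2,
        instance_type="ml.p4d.24xlarge",   # placeholder supported multi-GPU instance type
        # With dataparallel enabled, SageMaker uses a CUDA 11+ training image.
        distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    )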

doc/api/training/sdp_versions/latest/smd_data_parallel_pytorch.rst

Lines changed: 52 additions & 42 deletions
@@ -1,8 +1,8 @@
-##############################################################
-PyTorch Guide to SageMaker's distributed data parallel library
-##############################################################
+#################
+Guide for PyTorch
+#################
 
-Use this guide to learn about the SageMaker distributed
+Use this guide to learn how to use the SageMaker distributed
 data parallel library API for PyTorch.
 
 .. contents:: Topics
@@ -21,63 +21,73 @@ The distributed data parallel library works as a backend of the PyTorch distribu
 See `SageMaker distributed data parallel PyTorch examples <https://sagemaker-examples.readthedocs.io/en/latest/training/distributed_training/index.html#pytorch-distributed>`__
 for additional details on how to use the library.
 
-1. Import the SageMaker distributed data parallel library’s PyTorch client.
+1. Import the SageMaker distributed data parallel library’s PyTorch client.
 
-.. code:: python
+   .. code:: python
 
-  import smdistributed.dataparallel.torch.torch_smddp
+      import smdistributed.dataparallel.torch.torch_smddp
 
-2. Import the PyTorch distributed modules.
+2. Import the PyTorch distributed modules.
 
-.. code:: python
+   .. code:: python
 
-  import torch
-  import torch.distributed as dist
-  from torch.nn.parallel import DistributedDataParallel as DDP
+      import torch
+      import torch.distributed as dist
+      from torch.nn.parallel import DistributedDataParallel as DDP
 
-3. Set the backend of torch.distributed as smddp.
+3. Set the backend of ``torch.distributed`` as ``smddp``.
 
-.. code:: python
+   .. code:: python
 
-  dist.init_process_group(backend='smddp')
+      dist.init_process_group(backend='smddp')
 
-4. After parsing arguments and defining a batch size parameter (for example, batch_size=args.batch_size), add a two-line of code to resize the batch size per worker (GPU). PyTorch's DataLoader operation does not automatically handle the batch resizing for distributed training.
+4. After parsing arguments and defining a batch size parameter
+   (for example, ``batch_size=args.batch_size``), add two lines of code to
+   resize the batch size per worker (GPU). PyTorch's DataLoader operation
+   does not automatically handle the batch resizing for distributed training.
 
-.. code:: python
+   .. code:: python
 
-  batch_size //= dist.get_world_size()
-  batch_size = max(batch_size, 1)
+      batch_size //= dist.get_world_size()
+      batch_size = max(batch_size, 1)
 
-5. Pin each GPU to a single SageMaker data parallel library process with local_rank—this refers to the relative rank of the process within a given node.
+5. Pin each GPU to a single SageMaker data parallel library process with
+   ``local_rank``. This refers to the relative rank of the process within a given node.
 
-You can retreive the rank of the process from the LOCAL_RANK environment variable.
+   You can retrieve the rank of the process from the ``LOCAL_RANK`` environment variable.
 
-.. code:: python
+   .. code:: python
 
-  import os
-  local_rank = os.environ["LOCAL_RANK"]
-  torch.cuda.set_device(local_rank)
+      import os
+      local_rank = os.environ["LOCAL_RANK"]
+      torch.cuda.set_device(local_rank)
 
-6. After defining a model, wrap it with the PyTorch DDP.
+6. After defining a model, wrap it with the PyTorch DDP.
 
-.. code:: python
+   .. code:: python
 
-  model = ...
+      model = ...
 
-  # Wrap the model with the PyTorch DistributedDataParallel API
-  model = DDP(model)
+      # Wrap the model with the PyTorch DistributedDataParallel API
+      model = DDP(model)
 
-7. When you call the torch.utils.data.distributed.DistributedSampler API, specify the total number of processes (GPUs) participating in training across all the nodes in the cluster. This is called world_size, and you can retrieve the number from the torch.distributed.get_world_size() API. Also, specify the rank of each process among all processes using the torch.distributed.get_rank() API.
+7. When you call the ``torch.utils.data.distributed.DistributedSampler`` API,
+   specify the total number of processes (GPUs) participating in training across
+   all the nodes in the cluster. This is called ``world_size``, and you can retrieve
+   the number from the ``torch.distributed.get_world_size()`` API. Also, specify
+   the rank of each process among all processes using the ``torch.distributed.get_rank()`` API.
 
-.. code:: python
+   .. code:: python
 
-  train_sampler = DistributedSampler(
-    train_dataset,
-    num_replicas = dist.get_world_size(),
-    rank = dist.get_rank()
-  )
+      train_sampler = DistributedSampler(
+          train_dataset,
+          num_replicas = dist.get_world_size(),
+          rank = dist.get_rank()
+      )
 
-8. Modify your script to save checkpoints only on the leader process (rank 0). The leader process has a synchronized model. This also avoids other processes overwriting the checkpoints and possibly corrupting the checkpoints.
+8. Modify your script to save checkpoints only on the leader process (rank 0).
+   The leader process has a synchronized model. This also avoids other processes
+   overwriting the checkpoints and possibly corrupting the checkpoints.
 
 The following example code shows the structure of a PyTorch training script with DDP and smddp as the backend.
 
@@ -142,7 +152,7 @@ The following example code shows the structure of a PyTorch training script with
         test(...)
         scheduler.step()
 
-    # SageMaker data parallel: Save model on the main node (rank 0).
+    # SageMaker data parallel: Save model on the leader node (rank 0).
     if dist.get_rank() == 0:
         torch.save(...)
 
@@ -171,16 +181,16 @@ that are supported in the library v1.3.0 and before.
 
 .. warning::
 
-   The following ``smdistributed`` APIs for its implementation of distributed data parallelism
-   for PyTorch is deprecated.
+   The following APIs for the ``smdistributed`` implementation of the PyTorch distributed modules
+   are deprecated.
 
 
 .. class:: smdistributed.dataparallel.torch.parallel.DistributedDataParallel(module, device_ids=None, output_device=None, broadcast_buffers=True, process_group=None, bucket_cap_mb=None)
 
    .. deprecated:: 1.4.0
 
      Use the `torch.nn.parallel.DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`_
-     instead.
+     API instead.
 
 
 .. function:: smdistributed.dataparallel.torch.distributed.is_available()
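Taken together, the numbered steps in the hunk above amount to a script skeleton like the one below. This is a minimal sketch assembled from those steps, not the example from the documentation itself; the model, dataset, and batch size are placeholders, and it assumes a training environment with GPUs and ``smdistributed.dataparallel`` installed:

    import os

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.data.distributed import DistributedSampler

    import smdistributed.dataparallel.torch.torch_smddp  # registers the "smddp" backend


    def main():
        # Step 3: use smddp as the torch.distributed backend.
        dist.init_process_group(backend="smddp")

        # Step 5: pin this process to its GPU using LOCAL_RANK.
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # Step 4: resize the batch size per worker (GPU).
        batch_size = 64  # placeholder global batch size
        batch_size //= dist.get_world_size()
        batch_size = max(batch_size, 1)

        # Step 6: define a model and wrap it with the PyTorch DistributedDataParallel API.
        model = torch.nn.Linear(10, 2).cuda(local_rank)  # placeholder model
        model = DDP(model)

        # Step 7: shard the dataset across all processes (world_size) by rank.
        train_dataset = TensorDataset(                     # placeholder dataset
            torch.randn(1024, 10), torch.randint(0, 2, (1024,))
        )
        train_sampler = DistributedSampler(
            train_dataset,
            num_replicas=dist.get_world_size(),
            rank=dist.get_rank(),
        )
        train_loader = DataLoader(train_dataset, batch_size=batch_size, sampler=train_sampler)

        # ... training loop over train_loader goes here ...

        # Step 8: save checkpoints only on the leader process (rank 0).
        if dist.get_rank() == 0:
            torch.save(model.state_dict(), "model.pt")


    if __name__ == "__main__":
        main()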

doc/api/training/sdp_versions/latest/smd_data_parallel_tensorflow.rst

Lines changed: 10 additions & 8 deletions
@@ -1,16 +1,18 @@
-#################################################################
-TensorFlow Guide to SageMaker's distributed data parallel library
-#################################################################
+####################
+Guide for TensorFlow
+####################
 
-.. admonition:: Contents
+Use this guide to learn how to use the SageMaker distributed
+data parallel library API for TensorFlow.
 
-   - :ref:`tensorflow-sdp-modify`
-   - :ref:`tensorflow-sdp-api`
+.. contents:: Topics
+   :depth: 3
+   :local:
 
 .. _tensorflow-sdp-modify:
 
-Modify a TensorFlow 2.x training script to use SageMaker data parallel
-======================================================================
+Modify a TensorFlow 2.x training script to use the SageMaker data parallel library
+==================================================================================
 
 The following steps show you how to convert a TensorFlow 2.x training
 script to utilize the distributed data parallel library.

doc/api/training/smd_data_parallel.rst

Lines changed: 3 additions & 3 deletions
@@ -1,6 +1,6 @@
-###############################################
-The SageMaker Distributed Data Parallel Library
-###############################################
+########################################################
+The SageMaker Distributed Data Parallel Library Overview
+########################################################
 
 SageMaker's distributed data parallel library extends SageMaker’s training
 capabilities on deep learning models with near-linear scaling efficiency,

doc/api/training/smd_data_parallel_use_sm_pysdk.rst

Lines changed: 6 additions & 4 deletions
@@ -1,12 +1,14 @@
-Use with the SageMaker Python SDK
-=================================
+Run a Distributed Training Job Using the SageMaker Python SDK
+=============================================================
 
 To use the SageMaker distributed data parallel library with the SageMaker Python SDK,
 you will need the following:
 
 - A TensorFlow or PyTorch training script that is
-  adapted to use the distributed data parallel library. The :ref:`sdp_api_docs` includes
-  framework specific examples of training scripts that are adapted to use this library.
+  adapted to use the distributed data parallel library. Make sure you read through
+  the previous topic at
+  :ref:`sdp_api_docs`, which includes instructions on how to modify your script and
+  framework-specific examples.
 - Your input data must be in an S3 bucket or in FSx in the AWS region
   that you will use to launch your training job. If you use the Jupyter
   notebooks provided, create a SageMaker notebook instance in the same
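For reference, once the requirements listed in this file are in place, launching the job looks roughly like the sketch below; the script name, role, versions, instance settings, and S3 URI are placeholders for your own values, not part of this commit:

    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(
        entry_point="train.py",  # a script already adapted per the guides referenced above
        role="<your-iam-role-arn>",
        framework_version="1.11.0",
        py_version="py38",
        instance_count=2,
        instance_type="ml.p4d.24xlarge",
        distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    )

    # The input channel must point to S3 (or FSx) data in the same region as the job.
    estimator.fit({"training": "s3://<your-bucket>/<prefix>/train"})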

doc/api/training/smd_model_parallel.rst

Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,5 @@
-The SageMaker Distributed Model Parallel Library
-------------------------------------------------
+The SageMaker Distributed Model Parallel Library Overview
+---------------------------------------------------------
 
 The Amazon SageMaker distributed model parallel library is a model parallelism library for training
 large deep learning models that were previously difficult to train due to GPU memory limitations.

doc/api/training/smd_model_parallel_general.rst

Lines changed: 3 additions & 3 deletions
@@ -1,6 +1,6 @@
-#################################
-Use with the SageMaker Python SDK
-#################################
+#############################################################
+Run a Distributed Training Job Using the SageMaker Python SDK
+#############################################################
 
 Walk through the following pages to learn about the SageMaker model parallel library's APIs
 to configure and enable distributed model parallelism
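As background for the pages this file points to, the model parallel library is typically enabled through the same ``distribution`` argument of a SageMaker estimator, together with MPI settings. The sketch below is illustrative only; the parameter names and values shown (``partitions``, ``microbatches``, ``ddp``, ``processes_per_host``) are placeholders to be checked against the configuration pages themselves:

    from sagemaker.pytorch import PyTorch

    smp_options = {
        "enabled": True,
        "parameters": {
            "partitions": 2,        # number of model partitions (placeholder)
            "microbatches": 4,      # pipeline microbatches (placeholder)
            "ddp": True,            # combine with data parallelism (placeholder)
        },
    }

    mpi_options = {
        "enabled": True,
        "processes_per_host": 8,    # typically one process per GPU (placeholder)
    }

    estimator = PyTorch(
        entry_point="train.py",
        role="<your-iam-role-arn>",
        framework_version="1.11.0",
        py_version="py38",
        instance_count=1,
        instance_type="ml.p4d.24xlarge",
        distribution={
            "smdistributed": {"modelparallel": smp_options},
            "mpi": mpi_options,
        },
    )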
