The following steps show you how to convert a PyTorch training script to
utilize SageMaker's distributed data parallel library.

The distributed data parallel library works as a backend of the PyTorch distributed package.
See `SageMaker distributed data parallel PyTorch examples <https://sagemaker-examples.readthedocs.io/en/latest/training/distributed_training/index.html#pytorch-distributed>`__
for additional details on how to use the library.

1. Import the SageMaker distributed data parallel library’s PyTorch client.
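For example, a minimal sketch of this import and of setting the PyTorch
distributed backend; it assumes library v1.4.0 or later, where importing the
client module ``smdistributed.dataparallel.torch.torch_smddp`` registers the
``smddp`` backend:

.. code:: python

   # SageMaker data parallel: importing the client registers "smddp" as a
   # backend for the PyTorch distributed package.
   import smdistributed.dataparallel.torch.torch_smddp  # noqa: F401

   import torch.distributed as dist

   # Use the library as the collective communication backend.
   dist.init_process_group(backend="smddp")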
In the example training script, the model is wrapped with the library's DDP
and each GPU is pinned to a single library process:

.. code:: python

   # SageMaker data parallel: Wrap the PyTorch model with the library's DDP
   model = DDP(Net().to(device))

   # SageMaker data parallel: Pin each GPU to a single library process.
   local_rank = int(os.environ["LOCAL_RANK"])
   torch.cuda.set_device(local_rank)
   model.cuda(local_rank)
If you already have a working PyTorch script and only need to add the
backend specification, you can proceed to *Using the SageMaker PyTorch Estimator*
in the *Step 2: Launch a SageMaker Distributed Training Job Using the SageMaker Python SDK* topic.
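For reference, a minimal sketch of that launch step with the SageMaker Python
SDK; the entry point, role, versions, instance settings, and S3 paths below
are placeholders for illustration, not values from this guide:

.. code:: python

   from sagemaker.pytorch import PyTorch

   # Hypothetical estimator configuration; replace the placeholders with
   # values that match your account and training script.
   estimator = PyTorch(
       entry_point="train.py",
       role="<your-sagemaker-execution-role-arn>",
       framework_version="1.12.0",
       py_version="py38",
       instance_count=2,
       instance_type="ml.p4d.24xlarge",
       # Enable the SageMaker distributed data parallel library.
       distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
   )

   estimator.fit("s3://<your-bucket>/<training-data-prefix>")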
The rest of the example training script runs the training loop, evaluates on
the leader process (rank 0), and saves the model:

.. code:: python

   # Train
   optimizer = optim.Adadelta(...)
   scheduler = StepLR(...)
   for epoch in range(1, args.epochs + 1):
       train(...)
       if rank == 0:
           test(...)
       scheduler.step()
   # SageMaker data parallel: Save model on the leader node (rank 0).
   if dist.get_rank() == 0:
       torch.save(...)


   if __name__ == '__main__':
       main()

.. note::

   The ``smddp`` backend currently does not support creating subprocess groups
   with the ``torch.distributed.new_group()`` API.
   You cannot use the ``smddp`` backend concurrently with other backends.
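To make the restriction concrete, a small sketch (not from the original guide)
of what is and is not expected to work with this backend:

.. code:: python

   import torch.distributed as dist

   # Supported: one default process group initialized with the smddp backend.
   dist.init_process_group(backend="smddp")

   # Not supported: creating additional subprocess groups, or initializing
   # another backend (such as nccl or gloo) in the same training job.
   # subgroup = dist.new_group(ranks=[0, 1])  # torch.distributed.new_group is unsupported with smddp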
.. seealso::

   If you still need to modify your training script to properly use
   the PyTorch distributed package, see `Preparing a PyTorch Training Script for Distributed Training <https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-modify-sdp-pt.html>`_.
* If you want to find documentation for the previous versions of the library
  (v1.3.0 or before), see the `archived SageMaker distributed data parallel library documentation <https://sagemaker.readthedocs.io/en/stable/api/training/sdp_versions/latest.html#documentation-archive>`_.
**Improvements**

* Added support for AllReduce of large tensors.
* Added support for the following arguments of the `PyTorch DDP class <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel>`_
  (see the sketch after this list):

  * ``broadcast_buffers``
  * ``find_unused_parameters``
  * ``gradient_as_bucket_view``
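These are standard constructor options of PyTorch's ``DistributedDataParallel``.
A minimal, hypothetical sketch of passing them when the process group uses the
``smddp`` backend; the model, sizes, and environment variables are placeholders:

.. code:: python

   import os

   import torch
   import torch.distributed as dist
   import torch.nn as nn
   from torch.nn.parallel import DistributedDataParallel as DDP

   # Placeholder setup: assumes a launcher that sets LOCAL_RANK and an
   # instance where the smddp backend is available.
   dist.init_process_group(backend="smddp")
   local_rank = int(os.environ["LOCAL_RANK"])
   torch.cuda.set_device(local_rank)

   model = nn.Linear(128, 10).cuda(local_rank)

   model = DDP(
       model,
       device_ids=[local_rank],
       broadcast_buffers=True,          # synchronize module buffers (e.g. BatchNorm stats) across ranks
       find_unused_parameters=False,    # set True only if parts of the model receive no gradients
       gradient_as_bucket_view=True,    # gradients view the communication buckets, saving memory
   )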
**Bug Fixes**

* Fixed stalling issues when training on ``ml.p3.16xlarge``.
**Known Issues**

* The library currently does not support the PyTorch sub-process groups API (`torch.distributed.new_group <https://pytorch.org/docs/stable/distributed.html#torch.distributed.new_group>`_).
  This means that you cannot use the ``smddp`` backend concurrently with other backends.
0 commit comments