Many overdue updates

* updating the overview to include TP/PP and DTensor/DeviceMesh
* removing RPC, DataParallel, and Elastic as they are no longer supported

``DTensor`` and ``DeviceMesh`` are primitives used to build parallelism in terms of sharded or replicated tensors on N-dimensional process groups.
- `DTensor <https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/README.md>`__ represents a tensor that is sharded and/or replicated, and communicates automatically to reshard tensors as needed by operations.
- `DeviceMesh <https://pytorch.org/docs/stable/distributed.html#devicemesh>`__ abstracts the accelerator device communicators into a multi-dimensional array, which manages the underlying ``ProcessGroup`` instances for collective communications in multi-dimensional parallelisms. Try out our `Device Mesh Recipe <https://pytorch.org/tutorials/recipes/distributed_device_mesh.html>`__ to learn more, and see the short sketch after this list.
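
The sketch below is a minimal illustration of how the two primitives fit together: it builds a 1-D ``DeviceMesh`` over 4 GPUs and shards a tensor across it as a ``DTensor``. It assumes a ``torchrun`` launch; on older releases the ``distribute_tensor``/``Shard`` APIs live under ``torch.distributed._tensor`` rather than ``torch.distributed.tensor``.

.. code-block:: python

    # Minimal sketch: run with `torchrun --nproc-per-node=4 dtensor_demo.py`.
    import torch
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor import distribute_tensor, Shard  # torch.distributed._tensor on older releases

    # A 1-D mesh over 4 GPUs; DeviceMesh creates and manages the underlying ProcessGroup.
    mesh = init_device_mesh("cuda", (4,))

    # Shard the tensor along dim 0; each rank stores a 256 x 1024 local shard.
    full = torch.randn(1024, 1024)
    dtensor = distribute_tensor(full, mesh, placements=[Shard(0)])
    print(dtensor.to_local().shape)  # torch.Size([256, 1024]) on every rank
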
Communications APIs
*******************
The `PyTorch distributed communication layer (C10D) <https://pytorch.org/docs/stable/distributed.html>`__ offers both collective communication APIs (e.g., `all_reduce <https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_reduce>`__
and `all_gather <https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_gather>`__)
and P2P communication APIs (e.g., `send <https://pytorch.org/docs/stable/distributed.html#torch.distributed.send>`__ and `isend <https://pytorch.org/docs/stable/distributed.html#torch.distributed.isend>`__),
which are used across all of the parallelism implementations.
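
As a minimal illustration of the collective APIs, the sketch below sums one tensor across all ranks; it assumes the script is started with ``torchrun --nproc-per-node=2``.

.. code-block:: python

    # Minimal c10d sketch: run with `torchrun --nproc-per-node=2 allreduce_demo.py`.
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="gloo")  # use "nccl" for CUDA tensors
    rank = dist.get_rank()

    t = torch.tensor([rank + 1.0])
    dist.all_reduce(t, op=dist.ReduceOp.SUM)  # every rank now holds 1.0 + 2.0 = 3.0
    print(f"rank {rank}: {t.item()}")

    dist.destroy_process_group()
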
`torchrun <https://pytorch.org/docs/stable/elastic/run.html>`__ is a widely used launcher script that spawns processes on the local and remote machines for running distributed PyTorch programs.
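
For reference, a script launched this way can rely on the environment variables that ``torchrun`` sets for every worker; the sketch below is a minimal example (the launch flags in the comment are placeholders).

.. code-block:: python

    # Sketch of a worker started by, e.g., `torchrun --nnodes=2 --nproc-per-node=8 train.py`.
    import os
    import torch.distributed as dist

    rank = int(os.environ["RANK"])              # global rank across all machines
    local_rank = int(os.environ["LOCAL_RANK"])  # rank within this machine
    world_size = int(os.environ["WORLD_SIZE"])  # total number of workers

    # init_process_group reads MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE from the environment.
    dist.init_process_group(backend="gloo")
    print(f"worker {rank}/{world_size} (local rank {local_rank}) is up")
    dist.destroy_process_group()
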
Applying Parallelism To Scale Your Model
----------------------------------------
Data Parallelism is a widely adopted single-program multiple-data training paradigm
where the model is replicated on every process. Every model replica computes local gradients for
a different set of input data samples, and the gradients are averaged within the data-parallel communicator group before each optimizer step.
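
A minimal DDP sketch of this pattern is shown below, assuming a ``torchrun`` launch with one GPU per process; the model and tensor sizes are placeholders.

.. code-block:: python

    # Minimal data-parallel sketch: run with `torchrun --nproc-per-node=<num_gpus> ddp_demo.py`.
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(10, 10).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])  # one model replica per process

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    inputs = torch.randn(20, 10, device=f"cuda:{local_rank}")  # each rank feeds its own batch
    loss = ddp_model(inputs).sum()
    loss.backward()   # gradients are averaged across the data-parallel group here
    optimizer.step()

    dist.destroy_process_group()
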
Model Parallelism techniques (or Sharded Data Parallelism) are required when a model doesn't fit on a single GPU, and they can be combined to form multi-dimensional (N-D) parallelism.
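
For instance, ``FullyShardedDataParallel`` covers the sharded-data-parallel side of this: parameters, gradients, and optimizer states are sharded across ranks instead of being fully replicated. The sketch below is a minimal example using the FSDP wrapper API, assuming a ``torchrun`` launch with one GPU per process and a placeholder model.

.. code-block:: python

    # Minimal sharded-data-parallel sketch with the FSDP wrapper API.
    import os
    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    torch.manual_seed(0)  # identical initial weights on every rank

    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
    ).cuda()
    fsdp_model = FSDP(model)  # parameters, gradients, and optimizer states are sharded across ranks

    optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-3)
    loss = fsdp_model(torch.randn(8, 1024, device="cuda")).sum()
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()
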
When deciding what parallelism techniques to choose for your model, use these common guidelines:
#. Use `DistributedDataParallel (DDP) <https://pytorch.org/docs/stable/notes/ddp.html>`__
   if your model fits on a single GPU but you want to easily scale up training using multiple GPUs.
   * Use `torchrun <https://pytorch.org/docs/stable/elastic/run.html>`__ to launch multiple PyTorch processes if you are using more than one node.
   * See also: `Getting Started with Distributed Data Parallel <../intermediate/ddp_tutorial.html>`__
#. Use `FullyShardedDataParallel (FSDP) <https://pytorch.org/docs/stable/fsdp.html>`__ when your model cannot fit on one GPU.
   * See also: `Getting Started with FSDP <https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html>`__
#. Use `Tensor Parallel (TP) <https://pytorch.org/docs/stable/distributed.tensor.parallel.html>`__ and/or `Pipeline Parallel (PP) <https://pytorch.org/docs/main/distributed.pipelining.html>`__ if you reach scaling limitations with FSDP; a short TP sketch follows this list.

   * See also: `TorchTitan end-to-end example of 3D parallelism <https://github.com/pytorch/torchtitan>`__
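
The following is a minimal Tensor Parallel sketch (the ``MLP`` module and mesh size are placeholders), assuming a ``torchrun`` launch across 2 GPUs; it column-shards the first linear layer and row-shards the second, a common pairing for MLP blocks.

.. code-block:: python

    # Minimal tensor-parallel sketch: run with `torchrun --nproc-per-node=2 tp_demo.py`.
    import torch
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor.parallel import (
        ColwiseParallel,
        RowwiseParallel,
        parallelize_module,
    )

    class MLP(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.w1 = torch.nn.Linear(1024, 4096)
            self.w2 = torch.nn.Linear(4096, 1024)

        def forward(self, x):
            return self.w2(torch.relu(self.w1(x)))

    tp_mesh = init_device_mesh("cuda", (2,))
    torch.manual_seed(0)  # same weights and inputs on every rank (TP expects replicated inputs by default)
    model = parallelize_module(
        MLP().cuda(),
        tp_mesh,
        {"w1": ColwiseParallel(), "w2": RowwiseParallel()},  # shard w1 by columns, w2 by rows
    )
    out = model(torch.randn(8, 1024, device="cuda"))
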
.. note:: Data-parallel training also works with `Automatic Mixed Precision (AMP) <https://pytorch.org/docs/stable/notes/amp_examples.html#working-with-multiple-gpus>`__.
PyTorch Distributed Developers
------------------------------
If you'd like to contribute to PyTorch Distributed, refer to our