@@ -74,7 +74,10 @@ common development trajectory would be:
4. Use multi-machine `DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__
and the `launching script <https://github.com/pytorch/examples/blob/master/distributed/ddp/README.md>`__,
if the application needs to scale across machine boundaries.
- 5. Use `torch.distributed.elastic <https://pytorch.org/docs/stable/distributed.elastic.html>`__
+ 5. Use multi-GPU `FullyShardedDataParallel <https://pytorch.org/docs/stable/fsdp.html>`__
+ training on a single machine or across multiple machines when the data and model
+ cannot fit on one GPU.
+ 6. Use `torch.distributed.elastic <https://pytorch.org/docs/stable/distributed.elastic.html>`__
to launch distributed training if errors (e.g., out-of-memory) are expected or if
resources can join and leave dynamically during training.
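To make the tail of this trajectory concrete, here is a minimal sketch (not drawn from the tutorials themselves) of where steps 4 and 6 land: the model is wrapped in ``DistributedDataParallel`` and every machine starts the script with ``torchrun``, the ``torch.distributed.elastic`` launcher. The script name, rendezvous endpoint, node and GPU counts, and the toy model are placeholders; step 5 swaps the ``DDP`` wrapper for ``FSDP``.

.. code-block:: python

   # Placeholder launch command, run on each of two machines:
   #   torchrun --nnodes=2 --nproc_per_node=8 \
   #            --rdzv_backend=c10d --rdzv_endpoint=node0.example.com:29500 train.py
   import os

   import torch
   import torch.distributed as dist
   import torch.nn as nn
   from torch.nn.parallel import DistributedDataParallel as DDP


   def main():
       # torchrun sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, and MASTER_PORT,
       # so the default env:// initialization needs no extra arguments.
       dist.init_process_group(backend="nccl")
       local_rank = int(os.environ["LOCAL_RANK"])
       torch.cuda.set_device(local_rank)
       device = torch.device("cuda", local_rank)

       model = nn.Linear(10, 10).to(device)        # toy model, placeholder
       ddp_model = DDP(model, device_ids=[local_rank])
       optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

       for _ in range(3):
           optimizer.zero_grad()
           ddp_model(torch.randn(20, 10, device=device)).sum().backward()
           optimizer.step()

       dist.destroy_process_group()


   if __name__ == "__main__":
       main()

Because ``torchrun`` is the elastic entry point, the same invocation also covers step 6: it can restart failed workers (``--max_restarts``) and accept an elastic node range (``--nnodes=MIN:MAX``).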
@@ -134,6 +137,18 @@ DDP materials are listed below:
5. The `Distributed Training with Uneven Inputs Using the Join Context Manager <../advanced/generic_join.html>`__
tutorial walks through using the generic join context for distributed training with uneven inputs.
+
+ ``torch.distributed.FullyShardedDataParallel``
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ `FullyShardedDataParallel <https://pytorch.org/docs/stable/fsdp.html>`__
+ (FSDP) is a data parallelism paradigm that shards a model's parameters, gradients,
+ and optimizer states across data-parallel workers instead of maintaining a full
+ per-GPU copy of them as DDP does. Support for FSDP was added in PyTorch v1.11. The
+ `Getting Started with FSDP <https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html>`__
+ tutorial provides an in-depth explanation and examples of how FSDP works.
+
+
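As a rough sketch of how the wrapper is used (this is not the tutorial's example; the process group is assumed to be set up by a ``torchrun`` launch, and the layer sizes are arbitrary):

.. code-block:: python

   import os

   import torch
   import torch.distributed as dist
   import torch.nn as nn
   from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

   # Assumes a torchrun launch: env:// init reads RANK/WORLD_SIZE, and
   # LOCAL_RANK selects one GPU per process (all values are placeholders).
   dist.init_process_group(backend="nccl")
   torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

   model = nn.Sequential(
       nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)
   ).cuda()

   # FSDP shards the parameters, gradients, and optimizer state across ranks,
   # gathering full parameters only when they are needed for computation.
   fsdp_model = FSDP(model)

   # Build the optimizer from the wrapped model so it tracks the sharded parameters.
   optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-3)

   loss = fsdp_model(torch.randn(8, 1024, device="cuda")).sum()
   loss.backward()
   optimizer.step()

   dist.destroy_process_group()

The auto-wrapping policies and CPU offload options covered in the tutorial refine this basic pattern.
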
torch.distributed.elastic
~~~~~~~~~~~~~~~~~~~~~~~~~