Many overdue updates

* updating the overview to include TP/PP and DTensor/DeviceMesh
* removing RPC, DataParallel, and Elastic as they are no longer supported

``DTensor`` and ``DeviceMesh`` are primitives used to build parallelism in terms of sharded or replicated tensors on N-dimensional process groups.
- `DTensor <https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/README.md>`__ represents a tensor that is sharded and/or replicated, and communicates automatically to reshard tensors as needed by operations.
- `DeviceMesh <https://pytorch.org/docs/stable/distributed.html#devicemesh>`__ abstracts the accelerator device communicators into a multi-dimensional array, which manages the underlying ``ProcessGroup`` instances for collective communications in multi-dimensional parallelisms. Try out our `Device Mesh Recipe <https://pytorch.org/tutorials/recipes/distributed_device_mesh.html>`__ to learn more, and see the short sketch after this list.
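
The sketch below is a minimal illustration of how the two primitives fit together: it builds a 1-D ``DeviceMesh`` over 4 GPUs and shards a tensor across it as a ``DTensor``. It assumes a ``torchrun`` launch; on older releases the ``distribute_tensor``/``Shard`` APIs live under ``torch.distributed._tensor`` rather than ``torch.distributed.tensor``.

.. code-block:: python

    # Minimal sketch: run with `torchrun --nproc-per-node=4 dtensor_demo.py`.
    import torch
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor import distribute_tensor, Shard  # torch.distributed._tensor on older releases

    # A 1-D mesh over 4 GPUs; DeviceMesh creates and manages the underlying ProcessGroup.
    mesh = init_device_mesh("cuda", (4,))

    # Shard the tensor along dim 0; each rank stores a 256 x 1024 local shard.
    full = torch.randn(1024, 1024)
    dtensor = distribute_tensor(full, mesh, placements=[Shard(0)])
    print(dtensor.to_local().shape)  # torch.Size([256, 1024]) on every rank
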
Communications APIs
*******************
The `PyTorch distributed communication layer (C10D) <https://pytorch.org/docs/stable/distributed.html>`__ offers both collective communication APIs (e.g., `all_reduce <https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_reduce>`__
and `all_gather <https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_gather>`__)
and P2P communication APIs (e.g., `send <https://pytorch.org/docs/stable/distributed.html#torch.distributed.send>`__ and `isend <https://pytorch.org/docs/stable/distributed.html#torch.distributed.isend>`__),
which are used across all of the parallelism implementations.
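
As a minimal illustration of the collective APIs, the sketch below sums one tensor across all ranks; it assumes the script is started with ``torchrun --nproc-per-node=2``.

.. code-block:: python

    # Minimal c10d sketch: run with `torchrun --nproc-per-node=2 allreduce_demo.py`.
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="gloo")  # use "nccl" for CUDA tensors
    rank = dist.get_rank()

    t = torch.tensor([rank + 1.0])
    dist.all_reduce(t, op=dist.ReduceOp.SUM)  # every rank now holds 1.0 + 2.0 = 3.0
    print(f"rank {rank}: {t.item()}")

    dist.destroy_process_group()
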
`torchrun <https://pytorch.org/docs/stable/elastic/run.html>`__ is a widely used launcher script that spawns processes on the local and remote machines for running distributed PyTorch programs.
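
For reference, a script launched this way can rely on the environment variables that ``torchrun`` sets for every worker; the sketch below is a minimal example (the launch flags in the comment are placeholders).

.. code-block:: python

    # Sketch of a worker started by, e.g., `torchrun --nnodes=2 --nproc-per-node=8 train.py`.
    import os
    import torch.distributed as dist

    rank = int(os.environ["RANK"])              # global rank across all machines
    local_rank = int(os.environ["LOCAL_RANK"])  # rank within this machine
    world_size = int(os.environ["WORLD_SIZE"])  # total number of workers

    # init_process_group reads MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE from the environment.
    dist.init_process_group(backend="gloo")
    print(f"worker {rank}/{world_size} (local rank {local_rank}) is up")
    dist.destroy_process_group()
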
Applying Parallelism To Scale Your Model
----------------------------------------
Data Parallelism is a widely adopted single-program multiple-data training paradigm
where the model is replicated on every process. Every model replica computes local gradients for
a different set of input data samples, and the gradients are averaged within the data-parallel communicator group before each optimizer step.
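
A minimal DDP sketch of this pattern is shown below, assuming a ``torchrun`` launch with one GPU per process; the model and tensor sizes are placeholders.

.. code-block:: python

    # Minimal data-parallel sketch: run with `torchrun --nproc-per-node=<num_gpus> ddp_demo.py`.
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(10, 10).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])  # one model replica per process

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    inputs = torch.randn(20, 10, device=f"cuda:{local_rank}")  # each rank feeds its own batch
    loss = ddp_model(inputs).sum()
    loss.backward()   # gradients are averaged across the data-parallel group here
    optimizer.step()

    dist.destroy_process_group()
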
Model Parallelism techniques (or Sharded Data Parallelism) are required when a model doesn't fit on a single GPU, and they can be combined to form multi-dimensional (N-D) parallelism.
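
For instance, ``FullyShardedDataParallel`` covers the sharded-data-parallel side of this: parameters, gradients, and optimizer states are sharded across ranks instead of being fully replicated. The sketch below is a minimal example using the FSDP wrapper API, assuming a ``torchrun`` launch with one GPU per process and a placeholder model.

.. code-block:: python

    # Minimal sharded-data-parallel sketch with the FSDP wrapper API.
    import os
    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    torch.manual_seed(0)  # identical initial weights on every rank

    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
    ).cuda()
    fsdp_model = FSDP(model)  # parameters, gradients, and optimizer states are sharded across ranks

    optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-3)
    loss = fsdp_model(torch.randn(8, 1024, device="cuda")).sum()
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()
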
When deciding what parallelism techniques to choose for your model, use these common guidelines:
#. Use `DistributedDataParallel (DDP) <https://pytorch.org/docs/stable/notes/ddp.html>`__
   if your model fits on a single GPU but you want to easily scale up training using multiple GPUs.
   * Use `torchrun <https://pytorch.org/docs/stable/elastic/run.html>`__ to launch multiple PyTorch processes if you are using more than one node.
   * See also: `Getting Started with Distributed Data Parallel <../intermediate/ddp_tutorial.html>`__
#. Use `FullyShardedDataParallel (FSDP) <https://pytorch.org/docs/stable/fsdp.html>`__ when your model cannot fit on one GPU.
   * See also: `Getting Started with FSDP <https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html>`__
#. Use `Tensor Parallel (TP) <https://pytorch.org/docs/stable/distributed.tensor.parallel.html>`__ and/or `Pipeline Parallel (PP) <https://pytorch.org/docs/main/distributed.pipelining.html>`__ if you reach scaling limitations with FSDP; a short TP sketch follows this list.

   * See also: `TorchTitan end-to-end example of 3D parallelism <https://github.com/pytorch/torchtitan>`__
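
The following is a minimal Tensor Parallel sketch (the ``MLP`` module and mesh size are placeholders), assuming a ``torchrun`` launch across 2 GPUs; it column-shards the first linear layer and row-shards the second, a common pairing for MLP blocks.

.. code-block:: python

    # Minimal tensor-parallel sketch: run with `torchrun --nproc-per-node=2 tp_demo.py`.
    import torch
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor.parallel import (
        ColwiseParallel,
        RowwiseParallel,
        parallelize_module,
    )

    class MLP(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.w1 = torch.nn.Linear(1024, 4096)
            self.w2 = torch.nn.Linear(4096, 1024)

        def forward(self, x):
            return self.w2(torch.relu(self.w1(x)))

    tp_mesh = init_device_mesh("cuda", (2,))
    torch.manual_seed(0)  # same weights and inputs on every rank (TP expects replicated inputs by default)
    model = parallelize_module(
        MLP().cuda(),
        tp_mesh,
        {"w1": ColwiseParallel(), "w2": RowwiseParallel()},  # shard w1 by columns, w2 by rows
    )
    out = model(torch.randn(8, 1024, device="cuda"))
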
.. note:: Data-parallel training also works with `Automatic Mixed Precision (AMP) <https://pytorch.org/docs/stable/notes/amp_examples.html#working-with-multiple-gpus>`__.
PyTorch Distributed Developers
------------------------------
If you'd like to contribute to PyTorch Distributed, refer to our