Introduction to Context Parallel
================================
**Authors**: `Xilun Wu <https://github.com/XilunWu>`_, `Chien-Chin Huang <https://github.com/fegin>`__

.. note::
   |edit| View and edit this tutorial in `github <https://github.com/pytorch/tutorials/blob/main/intermediate_source/context_parallel.rst>`__.

.. grid:: 2

    .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn
       :class-card: card-prerequisites

       * `Context Parallel APIs <https://pytorch.org/docs/stable/distributed.tensor.html#torch.distributed.tensor.experimental.context_parallel>`__
       * `1M sequence training in torchtitan with Context Parallel <https://discuss.pytorch.org/t/distributed-w-torchtitan-breaking-barriers-training-long-context-llms-with-1m-sequence-length-in-pytorch-using-context-parallel/215082>`__

    .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites
       :class-card: card-prerequisites

       * PyTorch 2.7 or later


Introduction
------------

Context Parallel is an approach used in LLM training to reduce peak activation memory by sharding the long input sequence across multiple devices.
It removes the constraint on input sequence length that comes from the peak memory needed to store activations in Transformer blocks.

The core of Context Parallel is Ring Attention, a novel parallel implementation of the attention layer.
Ring Attention rotates the KV shards across devices and computes partial attention scores, repeating until every KV shard has been used on each device.
We implemented two Ring Attention variants: `pass-KV <https://arxiv.org/abs/2411.01783>`__ and `all-to-all <https://openreview.net/forum?id=WsRHpHH4s0>`__.
The pass-KV approach all-gathers the KV shards while performing the local SDPA (Scaled Dot Product Attention), then computes the remaining attention once the communication completes.
The all-to-all approach uses interleaved all-to-all collectives to ring-shuffle the KV shards, overlapping the SDPA computation with the all-to-all communication
needed for the next SDPA.
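
To make the ring pattern concrete, the snippet below is a schematic, single-process sketch of the idea rather than the PyTorch implementation: every "rank" keeps its query shard fixed while the KV shards rotate around the ring, and the partial attention results are merged with a numerically stable log-sum-exp correction. Causal masking and the actual communication collectives are omitted, and the function name is ours, not part of the API.

.. code:: python

    import torch


    def ring_attention_reference(q_shards, k_shards, v_shards):
        """Schematic single-process sketch of Ring Attention (non-causal)."""
        world_size = len(q_shards)
        scale = q_shards[0].shape[-1] ** -0.5
        outputs = []
        for rank in range(world_size):
            q = q_shards[rank]  # each rank keeps its own Q shard
            out = lse = None
            for step in range(world_size):
                # the KV shard that "arrives" at this rank on this ring step
                src = (rank - step) % world_size
                k, v = k_shards[src], v_shards[src]
                scores = (q @ k.transpose(-2, -1)) * scale
                block_lse = torch.logsumexp(scores, dim=-1, keepdim=True)
                block_out = torch.softmax(scores, dim=-1) @ v
                if out is None:
                    out, lse = block_out, block_lse
                else:
                    # fold the new partial result into the running output
                    new_lse = torch.logaddexp(lse, block_lse)
                    out = out * (lse - new_lse).exp() + block_out * (block_lse - new_lse).exp()
                    lse = new_lse
            outputs.append(out)
        return torch.cat(outputs, dim=-2)


    # sanity check against full (non-causal) attention on one device
    q = torch.rand(2, 4, 32, 16)
    k, v = torch.rand_like(q), torch.rand_like(q)
    ref = torch.softmax((q @ k.transpose(-2, -1)) * 16 ** -0.5, dim=-1) @ v
    out = ring_attention_reference(q.chunk(4, dim=-2), k.chunk(4, dim=-2), v.chunk(4, dim=-2))
    assert torch.allclose(out, ref, atol=1e-5)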

The Context Parallel APIs consist of two parts:

1. ``context_parallel()`` allows users to create a Python context in which the SDPA function (``torch.nn.functional.scaled_dot_product_attention``)
   is automatically replaced with Ring Attention. To shard Tensors along a dimension, pass the Tensors and their sharding dimensions to
   the ``buffers`` and ``buffer_seq_dims`` arguments, respectively.
2. ``set_rotate_method()`` allows users to choose between the pass-KV approach and the all-to-all approach, as sketched below.

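Schematically, the two APIs are used together as shown below. This is a condensed sketch rather than a runnable program: ``mesh`` stands for an existing ``DeviceMesh`` and ``q``, ``k``, ``v`` are placeholder tensors whose sequence dimension is dimension 2. A complete, runnable example follows in the next sections.

.. code:: python

    import torch.nn.functional as F
    from torch.distributed.tensor.experimental import context_parallel
    from torch.distributed.tensor.experimental._attention import set_rotate_method

    set_rotate_method("alltoall")  # optional: choose all-to-all instead of the default pass-KV rotation
    with context_parallel(mesh, buffers=(q, k, v), buffer_seq_dims=(2, 2, 2)):
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
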
Setup
-----

With ``torch.distributed.tensor.experimental.context_parallel()``, users can easily shard the Tensor input and parallelize the execution of the SDPA function.
To better demonstrate the usage of this API, we start with a simple code snippet that runs SDPA on a single GPU and then parallelize it using the API:

.. code:: python

    import torch
    import torch.nn.functional as F

    from torch.nn.attention import sdpa_kernel, SDPBackend


    def sdpa_example():
        assert torch.cuda.is_available()
        torch.cuda.set_device("cuda:0")
        torch.cuda.manual_seed(0)

        batch = 8
        nheads = 8
        qkv_len = 8192
        dim = 32
        backend = SDPBackend.FLASH_ATTENTION
        dtype = (
            torch.bfloat16
            if backend == SDPBackend.FLASH_ATTENTION
            or backend == SDPBackend.CUDNN_ATTENTION
            else torch.float32
        )

        # create query, key, and value tensors of shape (batch, nheads, qkv_len, dim)
        qkv = [
            torch.rand(
                (batch, nheads, qkv_len, dim),
                dtype=dtype,
                requires_grad=True,
                device="cuda",
            )
            for _ in range(3)
        ]

        # run causal SDPA with the chosen attention backend
        with sdpa_kernel(backend):
            out = F.scaled_dot_product_attention(*qkv, is_causal=True)


    if __name__ == "__main__":
        sdpa_example()

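For reference, the SDPA output has the same shape as the query tensor. A quick check that you could append inside ``sdpa_example()``, after the ``with`` block (a hypothetical addition to the snippet above):

.. code:: python

    print(out.shape)  # torch.Size([8, 8, 8192, 32]): (batch, nheads, qkv_len, dim)
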
Enable Context Parallel
-----------------------

Now, let's first adapt it to a distributed program where each rank has the same tensor input. Then we apply the context parallel API to
shard the input and distribute the computation across ranks:

.. code:: python

    # file: cp_sdpa_example.py
    import os

    import torch
    import torch.distributed as dist
    import torch.nn.functional as F
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor.experimental import context_parallel
    from torch.distributed.tensor.experimental._attention import context_parallel_unshard
    from torch.nn.attention import sdpa_kernel, SDPBackend


    def context_parallel_sdpa_example(world_size: int, rank: int):
        assert torch.cuda.is_available()
        assert dist.is_nccl_available()
        torch.cuda.set_device(f"cuda:{rank}")
        torch.cuda.manual_seed(0)

        dist.init_process_group(
            backend="nccl",
            init_method="env://",
            world_size=world_size,
            rank=rank,
        )
        device_mesh = init_device_mesh(
            device_type="cuda", mesh_shape=(world_size,), mesh_dim_names=("cp",)
        )

        batch = 8
        nheads = 8
        qkv_len = 64
        dim = 32
        backend = SDPBackend.FLASH_ATTENTION
        dtype = (
            torch.bfloat16
            if backend == SDPBackend.FLASH_ATTENTION
            or backend == SDPBackend.CUDNN_ATTENTION
            else torch.float32
        )

        # every rank seeds its RNG identically, so qkv is replicated across ranks
        qkv = [
            torch.rand(
                (batch, nheads, qkv_len, dim),
                dtype=dtype,
                requires_grad=True,
                device="cuda",
            )
            for _ in range(3)
        ]
        # cp_qkv is used for the context-parallel run; qkv is the single-device reference
        cp_qkv = [t.detach().clone() for t in qkv]

        with sdpa_kernel(backend):
            # context_parallel() shards cp_qkv along dim 2 (the sequence dim) and
            # replaces SDPA with Ring Attention inside the context
            with context_parallel(
                device_mesh, buffers=tuple(cp_qkv), buffer_seq_dims=(2, 2, 2)
            ):
                cp_out = F.scaled_dot_product_attention(*cp_qkv, is_causal=True)

            # all-gather the sharded output and compare it with the single-device result
            (cp_out,) = context_parallel_unshard(device_mesh, [cp_out], [2])
            out = F.scaled_dot_product_attention(*qkv, is_causal=True)

            assert torch.allclose(
                cp_out,
                out,
                atol=(1e-08 if dtype == torch.float32 else 1e-03 * world_size),
            )


    if __name__ == "__main__":
        rank = int(os.environ["RANK"])
        world_size = int(os.environ["WORLD_SIZE"])

        try:
            context_parallel_sdpa_example(world_size, rank)
        finally:
            dist.barrier()
            dist.destroy_process_group()

You can use the command ``torchrun --standalone --nnodes=1 --nproc-per-node=4 cp_sdpa_example.py`` to launch the above context parallel
SDPA on 4 GPUs. We demonstrate numerical correctness by comparing the output of Ring Attention to that of SDPA on a single GPU.
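
Inside the ``context_parallel`` context, the tensors passed through ``buffers`` are sharded in place along their ``buffer_seq_dims``, so each rank computes attention over only its local slice of the sequence; this is why ``context_parallel_unshard`` is needed before comparing against the single-device output. As a sanity check, you could add the following line inside the context in the example above. This is a hypothetical addition that assumes the in-place sharding behavior of the current experimental API:

.. code:: python

    # with 4 ranks, each rank holds qkv_len // world_size = 16 of the 64 sequence positions
    assert cp_qkv[0].size(2) == qkv_len // world_size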


Select Rotation Approach
------------------------

You can choose the desired shard rotation approach in Ring Attention by using ``torch.distributed.tensor.experimental._attention.set_rotate_method()``:

.. code:: python

    # file: cp_sdpa_example.py
    from torch.distributed.tensor.experimental._attention import set_rotate_method

    set_rotate_method("alltoall")  # rotate shards using all-to-all

    with sdpa_kernel(backend):
        with context_parallel(
            device_mesh, buffers=tuple(cp_qkv), buffer_seq_dims=(2, 2, 2)
        ):
            cp_out = F.scaled_dot_product_attention(*cp_qkv, is_causal=True)

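The default rotation method is the pass-KV approach. Assuming the accepted values of the current experimental API are ``"allgather"`` (pass-KV) and ``"alltoall"``, you can make the default choice explicit:

.. code:: python

    set_rotate_method("allgather")  # rotate shards by all-gathering KV (pass-KV, the default)
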
Conclusion
----------

In this tutorial, we have learned how to easily parallelize the SDPA computation along the sequence dimension with our Context Parallel APIs. For
design and implementation details, performance analysis, and an end-to-end training example in `torchtitan <https://github.com/pytorch/torchtitan>`__,
see our post on `PyTorch native long-context training <https://discuss.pytorch.org/t/distributed-w-torchtitan-breaking-barriers-training-long-context-llms-with-1m-sequence-length-in-pytorch-using-context-parallel/215082>`__.