Caveat: the ``tabulate`` module is needed, so you might need to ``pip install`` it first.
python fr_trace.py <dump dir containing trace files> -j [--selected-ranks i j k ...] [--pg-filters tp dp]
torchfrtrace <dump dir containing trace files> -j [--selected-ranks i j k ...] [--pg-filters 0 2]
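
As noted above, the tabular output of these commands relies on the ``tabulate`` module. If it is not already installed, you can add it with:

.. code:: bash

   pip install tabulate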

A Small Example
---------------
To put this all together, we demonstrate the Flight Recorder using a small program in which we induce mismatched collectives.
``rank0`` is programmed to do an additional collective.
The Flight Recorder dump files are written to the ``/tmp`` directory.
For the purpose of this example, we named the small program ``crash.py``.

.. code:: python

   import os
   from datetime import timedelta

   import torch
   import torch.distributed as dist

   local_rank = int(os.environ["LOCAL_RANK"])
   world_size = int(os.environ["WORLD_SIZE"])
   assert world_size <= 8, "world size must be less than or equal to 8"
   os.environ["TORCH_NCCL_DEBUG_INFO_TEMP_FILE"] = "/tmp/trace_"
   os.environ["TORCH_NCCL_DUMP_ON_TIMEOUT"] = "1"
   os.environ["TORCH_NCCL_TRACE_BUFFER_SIZE"] = "2000"
   device = torch.device(f"cuda:{local_rank}")
   print(f"{local_rank=} {world_size=} master addr: {os.environ['MASTER_ADDR']} master port: {os.environ['MASTER_PORT']} {device=}")

   # Initialize the process group with a small timeout so that the job fails quickly.
   dist.init_process_group("nccl", world_size=world_size, rank=local_rank, timeout=timedelta(seconds=1))

   a = torch.full((3, 4), float(local_rank), device=device)
   # Issue a few matching collectives to populate the Flight Recorder buffer.
   for i in range(2):
       print(f"calling allreduce on {local_rank=}")
       dist.all_reduce(a)

   # rank0 issues an additional collective that the other ranks do not.
   if local_rank == 0:
       print("rank0 is doing an allreduce on tensor b, but other ranks forgot")
       b = torch.full((4, 5), float(local_rank), device=device)
       dist.all_reduce(b)

   for i in range(2):
       print(f"calling allreduce on {local_rank=}")
       dist.all_reduce(a)

   torch.cuda.synchronize(device=device)
   print(f"{local_rank=} exiting")
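
Note that the three ``TORCH_NCCL_*`` variables could equally be exported in the shell before launching, since ``torchrun`` propagates the environment to each worker. A sketch with the same values as above:

.. code:: bash

   export TORCH_NCCL_DEBUG_INFO_TEMP_FILE=/tmp/trace_
   export TORCH_NCCL_DUMP_ON_TIMEOUT=1
   export TORCH_NCCL_TRACE_BUFFER_SIZE=2000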

To run this program, we use ``torchrun``:

.. code:: bash

   torchrun --nnodes=1 --nproc_per_node=2 crash.py

Because of the mismatched collective, the job times out, and with ``TORCH_NCCL_DUMP_ON_TIMEOUT=1`` each rank dumps its Flight Recorder buffer. You should then see two files in the ``/tmp`` directory:

.. code:: bash

   ls /tmp/trace*
   # Expected output
   # /tmp/trace_0 /tmp/trace_1

Finally, to analyze these two files, we use the ``torchfrtrace`` command:

.. code:: bash

   torchfrtrace --prefix "trace_" /tmp/
   # Expected output
   # Collective 3 at entry 2 error
   # ...
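
The same dumps can also be inspected entry by entry with the ``-j`` flag shown earlier, optionally restricted to particular ranks with ``--selected-ranks``. A sketch, assuming these flags combine with ``--prefix`` as the usage above suggests:

.. code:: bash

   # Print the recorded entries as seen by rank 0 only.
   torchfrtrace --prefix "trace_" /tmp/ -j --selected-ranks 0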

Conclusion
----------
In this tutorial, we have learned about a new PyTorch diagnostic tool called Flight Recorder.