Commit 143d736

rebase and address comments
1 parent 457ad11 commit 143d736

1 file changed

+11
-10
lines changed

prototype_source/flight_recorder_tutorial.rst

Lines changed: 11 additions & 10 deletions
@@ -48,8 +48,8 @@ Enabling Flight Recorder
4848
------------------------
4949
There are two required environment variables to get the initial version of Flight Recorder working.
5050

51-
- ``TORCH_NCCL_DEBUG_INFO_TEMP_FILE``: Setting the path where the flight recorder will be dumped with file prefix. The dump is one
52-
file per rank. The default value is ``/tmp/nccl_trace_rank_``.
51+
- ``TORCH_NCCL_DEBUG_INFO_TEMP_FILE``: Setting the path where the flight recorder will be dumped with file prefix. One file per
52+
rank. The default value is ``/tmp/nccl_trace_rank_``.
5353
- ``TORCH_NCCL_TRACE_BUFFER_SIZE = (0, N)``: Setting ``N`` to a positive number enables collection.
5454
``N`` represents the number of entries that will be kept internally in a circular buffer.
5555
We recommended to set this value at *2000*.
@@ -73,9 +73,9 @@ Additional Settings
7373

7474
``fast`` is a new experimental mode that is shown to be much faster than the traditional ``addr2line``.
7575
Use this setting in conjunction with ``TORCH_NCCL_TRACE_CPP_STACK`` to collect C++ traces in the Flight Recorder data.
76-
- If you don't want the flight recorder to be dumped into the local disk but instead onto your own storage, users can define your own writer class
77-
which inherits from class ``::c10d::DebugInfoWriter`` and then register the new writer using ``::c10d::DebugInfoWriter::registerWriter`` before
78-
we initiate c10d distributed.
76+
- If you prefer not to have the flight recorder data dumped into the local disk but rather onto your own storage, you can define your own writer class.
77+
This class should inherit from class ``::c10d::DebugInfoWriter`` and then register the new writer using ``::c10d::DebugInfoWriter::registerWriter``
78+
before we initiate PyTorch distributed.
7979

8080
Retrieving Flight Recorder Data via an API
8181
------------------------------------------
@@ -178,17 +178,18 @@ To run the convenience script, follow these steps:
178178
179179
python fr_trace.py <dump dir containing trace files> [-o <output file>]
180180
181-
Or if you install PyTorch nightly build or build from scratch (with ``USE_DISTRIBUTED=1``), you can directly use the following command:
181+
If you install the PyTorch nightly build or build from scratch with ``USE_DISTRIBUTED=1``, you can directly use the following
182+
command directly:
182183

183184
.. code:: shell
184185
185186
torchfrtrace <dump dir containing trace files> [-o <output file>]
186187
187188
188-
For now, we support two modes for the analyzer script, one is to let the script to apply some heuristics to the parsed flight recorder
189-
dumps to generate a report on potential culprit for the timeout/hang; the other one is to just print out raw dumps. For the latter, by
190-
default the script prints for all ranks and all ProcessGroups(PGs), and this can be narrowed down to certain ranks and PGs. Example
191-
command is:
189+
Currently, we support two modes for the analyzer script. The first mode allows the script to apply some heuristics to the parsed flight
190+
recorder dumps to generate a report identifying potential culprits for the timeout. The second mode is simply outputs the raw dumps.
191+
By default, the script prints flight recoder dumps for all ranks and all ``ProcessGroups``(PGs). This can be narrowed down to certain
192+
ranks and PGs. An example command is:
192193
193194
Caveat: tabulate module is needed, so you might need pip install it first.
194195

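Note on the first hunk: the two environment variables it documents could be set from Python roughly as follows. This is a minimal sketch, not part of the commit. The helper name ``flight_recorder_env`` is hypothetical, the default values are the ones quoted in the tutorial text (``/tmp/nccl_trace_rank_`` and *2000*), and it assumes the variables must be in place before the distributed process group is created.

```python
import os


def flight_recorder_env(dump_prefix="/tmp/nccl_trace_rank_", buffer_size=2000):
    """Build the two settings named in the tutorial as a plain dict.

    Hypothetical helper for illustration only; the variable names come
    from the tutorial text, everything else is an assumption.
    """
    if buffer_size <= 0:
        # The tutorial says a positive N enables collection.
        raise ValueError("buffer_size must be positive to enable collection")
    return {
        # File prefix for the dump; the tutorial says one file per rank.
        "TORCH_NCCL_DEBUG_INFO_TEMP_FILE": dump_prefix,
        # Number of entries kept in the internal circular buffer.
        "TORCH_NCCL_TRACE_BUFFER_SIZE": str(buffer_size),
    }


# Assumption: apply these before initializing the process group.
os.environ.update(flight_recorder_env())
```

Returning a dict instead of writing to ``os.environ`` inside the helper keeps it easy to pass the same settings to a launcher or a subprocess.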