prototype_source/flight_recorder_tutorial.rst
Lines changed: 11 additions & 10 deletions
@@ -48,8 +48,8 @@ Enabling Flight Recorder
48
48
------------------------
49
49
There are two required environment variables to get the initial version of Flight Recorder working.
50
50
51
-
- ``TORCH_NCCL_DEBUG_INFO_TEMP_FILE``: Setting the path where the flight recorder will be dumped with file prefix. The dump is one
52
-
file per rank. The default value is ``/tmp/nccl_trace_rank_``.
51
+
- ``TORCH_NCCL_DEBUG_INFO_TEMP_FILE``: Setting the path where the flight recorder will be dumped with file prefix. One file per
52
+
rank. The default value is ``/tmp/nccl_trace_rank_``.
53
53
- ``TORCH_NCCL_TRACE_BUFFER_SIZE = (0, N)``: Setting ``N`` to a positive number enables collection.
54
54
``N`` represents the number of entries that will be kept internally in a circular buffer.
55
55
We recommend setting this value to *2000*.
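As a minimal sketch of the two settings above, the variables can also be set from Python before the distributed backend is initialized. The helper name ``flight_recorder_env`` is hypothetical and exists only for illustration; the variable names, the default prefix, and the value 2000 come from the text above.

```python
import os


def flight_recorder_env(dump_prefix="/tmp/nccl_trace_rank_", buffer_size=2000):
    """Build the two environment variables that enable Flight Recorder.

    Hypothetical helper for illustration; not part of PyTorch.
    """
    if buffer_size <= 0:
        # A positive buffer size is what turns collection on.
        raise ValueError("buffer_size must be positive to enable collection")
    return {
        "TORCH_NCCL_DEBUG_INFO_TEMP_FILE": dump_prefix,
        "TORCH_NCCL_TRACE_BUFFER_SIZE": str(buffer_size),
    }


# Apply the settings in the current process; this is assumed to run
# before the process group is created so the values are seen at startup.
os.environ.update(flight_recorder_env())
```

Exporting the same two variables in the launching shell should have the same effect; the Python form is mainly convenient when the launch script is itself Python.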
@@ -73,9 +73,9 @@ Additional Settings
73
73
74
74
``fast`` is a new experimental mode that is shown to be much faster than the traditional ``addr2line``.
75
75
Use this setting in conjunction with ``TORCH_NCCL_TRACE_CPP_STACK`` to collect C++ traces in the Flight Recorder data.
76
-
- If you don't want the flight recorder to be dumped into the local disk but instead onto your own storage, users can define your own writer class
77
-
which inherits from class ``::c10d::DebugInfoWriter`` and then register the new writer using ``::c10d::DebugInfoWriter::registerWriter`` before
78
-
we initiate c10d distributed.
76
+
- If you prefer not to have the flight recorder data dumped into the local disk but rather onto your own storage, you can define your own writer class.
77
+
This class should inherit from class ``::c10d::DebugInfoWriter`` and then register the new writer using ``::c10d::DebugInfoWriter::registerWriter``
78
+
before we initiate PyTorch distributed.
79
79
80
80
Retrieving Flight Recorder Data via an API
81
81
------------------------------------------
@@ -178,17 +178,18 @@ To run the convenience script, follow these steps:
178
178
179
179
python fr_trace.py <dump dir containing trace files> [-o <output file>]
180
180
181
-
Or if you install PyTorch nightly build or build from scratch (with ``USE_DISTRIBUTED=1``), you can directly use the following command:
181
+
If you install the PyTorch nightly build or build from scratch with ``USE_DISTRIBUTED=1``, you can directly use the following
182
+
command:
182
183
183
184
.. code:: shell
184
185
185
186
torchfrtrace <dump dir containing trace files> [-o <output file>]
186
187
187
188
188
-
For now, we support two modes for the analyzer script, one is to let the script to apply some heuristics to the parsed flight recorder
189
-
dumps to generate a report on potential culprit for the timeout/hang; the other one is to just print out raw dumps. For the latter, by
190
-
default the script prints for all ranks and all ProcessGroups(PGs), and this can be narrowed down to certain ranks and PGs. Example
191
-
command is:
189
+
Currently, we support two modes for the analyzer script. The first mode allows the script to apply some heuristics to the parsed flight
190
+
recorder dumps to generate a report identifying potential culprits for the timeout. The second mode simply outputs the raw dumps.
191
+
By default, the script prints flight recorder dumps for all ranks and all ``ProcessGroups`` (PGs). This can be narrowed down to certain
192
+
ranks and PGs. An example command is:
192
193
193
194
Caveat: the ``tabulate`` module is required, so you might need to ``pip install`` it first.