Commit 2b8bff6

[doc][mlgo] Document the logger (serialization) and expose the doc (#141094)
1 parent 6a8dde0 commit 2b8bff6

File tree

2 files changed

+92
-6
lines changed

llvm/docs/MLGO.rst

Lines changed: 87 additions & 5 deletions
Original file line number, Diff line number, Diff line change
@@ -314,7 +314,7 @@ features.
314314
``MLModelRunner`` implementations
315315
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
316316

317-
We currently feature 3 implementations:
317+
We currently feature 4 implementations:
318318

319319
- ``ModelUnderTrainingRunner``. This requires the compiler be built with TFLite
320320
support. It allows loading a TFLite model dynamically and is primarily
@@ -338,15 +338,97 @@ requiring no out of tree build-time dependencies.
338338
presumably a Python training algorithm. We do not envision using this in a
339339
production environment.
340340

341+
- ``NoInferenceModelRunner``. This serves as a store for feature values, and its
342+
``evaluate`` should never be called. It's used for training scenarios, when we
343+
want to capture the behavior of the default (non-ML) heuristic.
344+
341345
Note that training leaves it to the training infrastructure to handle
342346
distributed computing. The assumed architecture has Python processes
343347
communicating remotely between themselves, but managing local communication with
344348
clang.
345349

346-
..
347-
TODO(mtrofin):
348-
- logging, and the use in interactive mode.
349-
- discuss an example (like the inliner)
350+
Logging Facility
351+
----------------
352+
353+
When training models, we need to expose the features we will want to use during
354+
inference, as well as outcomes, to guide reward-based learning techniques. This
355+
can happen in 2 forms:
356+
357+
- when running the compiler on some input, as a capture of the features and
358+
actions taken by some policy or a model currently being used.
359+
For example, see ``DevelopmentModeInlineAdvisor`` or ``DevelopmentModeEvictAdvisor``
360+
in ``MLRegallocEvictAdvisor.cpp``. In more detail, in the former case, if
361+
``-training-log`` is specified, the features and actions (inline/no inline)
362+
from each inlining decision are saved to the specified file. Since
363+
``MLModelRunner`` implementations hold on to feature values (they don't get
364+
cleared by ``evaluate``), logging is easily supported by just looping over the
365+
model runner's features and passing the tensor buffers to the logger. Note how
366+
we use the ``NoInferenceModelRunner`` to capture the features observed when
367+
using the default policy.
368+
369+
- as a serialization mechanism for the ``InteractiveModelRunner``. Here, we need
370+
to pass the observed features over IPC (a file descriptor, likely a named
371+
pipe).
372+
373+
Both cases require serializing the same kind of data and we support both with
374+
``Analysis/Utils/TrainingLogger``.
375+
376+
The goals of the logger design were to avoid any new dependency and to optimize
377+
for the tensor scenario - i.e. exchanging potentially large buffers of fixed
378+
size, containing scalars. We explicitly assume the reader of the format has the
379+
same endianness as the compiler host, and we further expect the reader and the
380+
compiler to run on the same host. This is because we expect the training scenarios to
381+
have a (typically Python) process managing the compiler process, and we leave it to
382+
the training side to handle remoting.
383+
384+
The logger produces the following sequence:
385+
386+
- a header describing the structure of the log. This is a one-line textual JSON
387+
dictionary with the following elements:
388+
389+
- ``features``: a list of JSON-serialized ``TensorSpec`` values. The position
390+
in the list matters, as it will be the order in which values will be
391+
subsequently recorded. If we are just logging (i.e. not using the
392+
``InteractiveModelRunner``), the last feature should be that of the action
393+
(e.g. "inline/no inline", or "index of evicted live range")
394+
- (optional) ``score``: a ``TensorSpec`` describing a value we will include to
395+
help formulate a reward. This could be a size estimate or a latency estimate.
396+
- (optional) ``advice``: a ``TensorSpec`` describing the action. This is used
397+
for the ``InteractiveModelRunner``, in which case it shouldn't be in the
398+
``features`` list.
399+
- a sequence of ``contexts``. Contexts are independent traces of the optimization
400+
problem. For module passes, there is only one context; for function passes,
401+
there is a context per function. The start of a context is marked with a
402+
one-line JSON dictionary of the form ``{"context": <context name, a string>}``.
403+
404+
Each context has a sequence of:
405+
406+
- ``observations``. An observation is:
407+
408+
a one-line JSON object ``{"observation": <observation number, 0-indexed>}``
409+
- a binary dump of the tensor buffers, in the order in which they were
410+
specified in the header.
411+
a newline character
412+
- if ``score`` was specified in the header:
413+
414+
- a one-line JSON object ``{"outcome": <value>}``, where the ``value``
415+
conforms to the ``TensorSpec`` defined for the ``score`` in the header.
416+
- the outcome value, as a binary dump
417+
a newline character.
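
The sequence above can be sketched as a small Python reader. This is only an illustration under stated assumptions, not the ``log_reader.py`` implementation: the ``name``/``type``/``shape`` keys of a JSON-serialized ``TensorSpec`` and the type-name table are assumed here, the log is synthetic, and the value in the ``outcome`` line is not interpreted.

```python
import io
import json
import struct

# Assumption: TensorSpec "type" strings and the matching struct codes.
STRUCT_CODES = {"int64_t": "q", "int32_t": "i", "float": "f", "double": "d"}


def read_tensor(stream, spec):
    """Read one fixed-size tensor buffer described by a TensorSpec dict."""
    count = 1
    for dim in spec["shape"]:
        count *= dim
    # "=" selects native byte order: reader and compiler share a host.
    fmt = "=%d%s" % (count, STRUCT_CODES[spec["type"]])
    return struct.unpack(fmt, stream.read(struct.calcsize(fmt)))


def read_log(stream):
    """Return (context, observation index, feature values, outcome) tuples."""
    header = json.loads(stream.readline())
    features = header["features"]
    score = header.get("score")  # optional
    context = None
    records = []
    while True:
        line = stream.readline()
        if not line:
            break  # end of log
        event = json.loads(line)
        if "context" in event:
            context = event["context"]
            continue
        # An observation: tensor buffers in header order, then a newline.
        values = [read_tensor(stream, spec) for spec in features]
        stream.readline()
        outcome = None
        if score is not None:
            stream.readline()  # the one-line {"outcome": ...} marker
            outcome = read_tensor(stream, score)
            stream.readline()
        records.append((context, event["observation"], values, outcome))
    return records


def write_example_log():
    """Build a synthetic log in memory (hypothetical feature names)."""
    header = {
        "features": [
            {"name": "callee_size", "type": "int64_t", "shape": [1]},
            {"name": "inlining_decision", "type": "int64_t", "shape": [1]},
        ],
        "score": {"name": "reward", "type": "float", "shape": [1]},
    }
    buf = io.BytesIO()
    buf.write(json.dumps(header).encode() + b"\n")
    buf.write(json.dumps({"context": "foo"}).encode() + b"\n")
    rows = [(10, 1, 0.5), (200, 0, -1.0)]
    for index, (size, decision, reward) in enumerate(rows):
        buf.write(json.dumps({"observation": index}).encode() + b"\n")
        buf.write(struct.pack("=q", size) + struct.pack("=q", decision) + b"\n")
        buf.write(json.dumps({"outcome": index}).encode() + b"\n")
        buf.write(struct.pack("=f", reward) + b"\n")
    buf.seek(0)
    return buf


for record in read_log(write_example_log()):
    print(record)
```

Tensor buffers are read by size rather than by line, since a binary dump may itself contain newline bytes (the value ``10`` in the synthetic log does).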
418+
419+
The format uses a mix of textual JSON (for headers) and binary dumps (for tensors)
420+
because the headers are not expected to dominate the payload - the tensor values
421+
are. We wanted to avoid overburdening the log reader - likely Python - with
422+
additional dependencies; and the one-line JSON makes it rudimentarily possible
423+
to inspect a log without additional tooling.
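
As a rough illustration of such inspection, the header line can be decoded with nothing but the Python standard library. The ``name``/``type``/``shape`` keys, the type-name table, and the feature names below are assumptions made for this sketch.

```python
import json
import struct

# Assumption: TensorSpec "type" strings and the matching struct codes.
CODES = {"int64_t": "q", "int32_t": "i", "float": "f", "double": "d"}


def summarize_header(first_line):
    """Return (feature name, buffer size in bytes) for each feature."""
    header = json.loads(first_line)
    summary = []
    for spec in header["features"]:
        count = 1
        for dim in spec["shape"]:
            count *= dim
        size = count * struct.calcsize("=" + CODES[spec["type"]])
        summary.append((spec["name"], size))
    return summary


# A hypothetical header line, as it might appear at the top of a log.
line = (
    b'{"features": ['
    b'{"name": "mask", "type": "int64_t", "shape": [33]}, '
    b'{"name": "weight", "type": "float", "shape": [2, 3]}]}\n'
)
print(summarize_header(line))
```

The per-feature sizes also give the number of bytes to skip for each observation's binary dump.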
424+
425+
A Python utility for reading logs, used for tests, is available at
426+
``Analysis/models/log_reader.py``. A utility showcasing the ``InteractiveModelRunner``,
427+
which uses this reader as well, is at ``Analysis/models/interactive_host.py``.
428+
The latter is also used in tests.
429+
430+
There is no C++ implementation of a log reader. We do not have a scenario
431+
motivating one.
350432

351433
IR2Vec Embeddings
352434
=================

llvm/docs/Reference.rst

Lines changed: 5 additions & 1 deletion
Original file line number, Diff line number, Diff line change
@@ -40,8 +40,8 @@ LLVM and API reference documentation.
4040
PCSectionsMetadata
4141
PDB/index
4242
PointerAuth
43-
ScudoHardenedAllocator
4443
MLGO
44+
ScudoHardenedAllocator
4545
MemoryModelRelaxationAnnotations
4646
MemTagSanitizer
4747
Security
@@ -239,3 +239,7 @@ Additional Topics
239239
:doc:`ConvergenceAndUniformity`
240240
A description of uniformity analysis in the presence of irreducible
241241
control flow, and its implementation.
242+
243+
:doc:`MLGO`
244+
Facilities for ML-Guided Optimization, such as collecting IR corpora from a
245+
build, interfacing with ML models, and exposing features for training.
