[doc][mlgo] Document the logger (serialization) and expose the doc #141094

Merged
mtrofin merged 3 commits into llvm:main from docs3 on May 22, 2025

Conversation

mtrofin (Member) commented May 22, 2025

No description provided.

mtrofin requested a review from boomanaiden154 on May 22, 2025 at 15:55
llvmbot added the mlgo label on May 22, 2025
llvmbot (Member) commented May 22, 2025

@llvm/pr-subscribers-mlgo

Author: Mircea Trofin (mtrofin)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/141094.diff

2 Files Affected:

  • (modified) llvm/docs/MLGO.rst (+87-5)
  • (modified) llvm/docs/Reference.rst (+4-1)
diff --git a/llvm/docs/MLGO.rst b/llvm/docs/MLGO.rst
index c88d8d68a7ce3..dd219f4d979fa 100644
--- a/llvm/docs/MLGO.rst
+++ b/llvm/docs/MLGO.rst
@@ -314,7 +314,7 @@ features.
 ``MLModelRunner`` implementations
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-We currently feature 3 implementations:
+We currently feature 4 implementations:
 
 - ``ModelUnderTrainingRunner``. This requires the compiler be built with TFLite
   support. It allows loading a TFLite model dynamically and is primarily
@@ -338,12 +338,94 @@ requiring no out of tree build-time dependencies.
   presumably a python training algorithm. We do not envision using this in a
   production environment.
 
+- ``NoInferenceModelRunner``. This serves as a store for feature values, and its
+  ``evaluate`` should never be called. It's used for training scenarios, when we
+  want to capture the behavior of the default (non-ML) heuristic.
+
 Note that training leaves it to the training infrastructure to handle
 distributed computing. The assumed architecture has python processes
 communicating remotely between themselves, but managing local communication with
 clang.
 
-..
-    TODO(mtrofin): 
-        - logging, and the use in interactive mode.
-        - discuss an example (like the inliner)
+Logging Facility
+----------------
+
+When training models, we need to expose the features we will want to use during
+inference, as well as outcomes, to guide reward-based learning techniques. This
+can happen in 2 forms:
+
+- as an effect of running the compiler on some input, as a capture of the
+  features and actions taken by some policy or a model currently being used.
+  For example, see ``DevelopmentModeInlineAdvisor`` or ``DevelopmentModeEvictAdvisor``
+  in ``MLRegallocEvictAdvisor.cpp``. In more detail, in the former case, if
+  ``-training-log`` is specified, the features and actions (inline/no inline)
+  from each inlining decision are saved to the specified file. Since
+  ``MLModelRunner`` implementations hold on to feature values (they don't get
+  cleared by ``evaluate``), logging is easily supported by just looping over the
+  model runner's features and passing the tensor buffers to the logger. Note how
+  we use the ``NoInferenceModelRunner`` to capture the features observed when
+  using the default policy.
+
+- as a serialization mechanism for the ``InteractiveModelRunner``. Here, we need
+  to pass the observed features over IPC (a file descriptor, likely a named
+  pipe).
+
+Both cases require serializing the same kind of data and we support both with
+``Analysis/Utils/TrainingLogger``.
+
+The goal of the logger design was to avoid any new dependency, and to optimize
+for the tensor scenario - i.e. exchanging potentially large buffers of fixed
+size, containing scalars. We explicitly assume the reader of the format has the
+same endianness as the compiler host, and we further expect the reader and the
+compiler run on the same host. This is because we expect the training scenarios
+to have a (typically python) process managing the compiler process, and we leave
+it to the training side to handle remoting.
+
+The logger produces the following sequence:
+
+- a header describing the structure of the log. This is a one-line textual JSON
+  dictionary with the following elements:
+  
+  - ``features``: a list of JSON-serialized ``TensorSpec`` values. The position
+    in the list matters, as it will be the order in which values will be
+    subsequently recorded. If we are just logging (i.e. not using the
+    ``InteractiveModelRunner``), the last feature should be that of the action
+    (e.g. "inline/no inline", or "index of evicted live range")
+  - (optional) ``score``: a ``TensorSpec`` describing a value we will include to
+    help formulate a reward. This could be a size estimate or a latency estimate.
+  - (optional) ``advice``: a ``TensorSpec`` describing the action. This is used
+    for the ``InteractiveModelRunner``, in which case it shouldn't be in the 
+    ``features`` list.
+- a sequence of ``contexts``. Contexts are independent traces of the optimization
+  problem. For module passes, there is only one context; for function passes,
+  there is a context per function. The start of a context is marked with a
+  one-line JSON dictionary of the form ``{"context": <context name, a string>}``
+  
+  Each context has a sequence of:
+
+  - ``observations``. An observation is:
+    
+    - one-line JSON ``{"observation": <observation number, 0-indexed>}``
+    - a binary dump of the tensor buffers, in the order in which they were
+      specified in the header.
+    - a new line character
+    - if ``score`` was specified in the header:
+    
+      - a one-line JSON object ``{"outcome": <value>}``, where the ``value``
+        conforms to the ``TensorSpec`` defined for the ``score`` in the header.
+      - the outcome value, as a binary dump
+      - a new line character.
+
+The format uses a mix of textual JSON (for headers) and binary dumps (for tensors)
+because the headers are not expected to dominate the payload - the tensor values
+are. We wanted to avoid overburdening the log reader - likely python - with
+additional dependencies; and the one-line JSON makes it rudimentarily possible
+to inspect a log without additional tooling.
+
+A python utility for reading logs, used for tests, is available at
+``Analysis/models/log_reader.py``. A utility showcasing the ``InteractiveModelRunner``,
+which uses this reader as well, is at ``Analysis/models/interactive_host.py``.
+The latter is also used in tests.
+
+There is no C++ implementation of a log reader. We do not have a scenario
+motivating one.
diff --git a/llvm/docs/Reference.rst b/llvm/docs/Reference.rst
index 565d5c6876d66..7f13aec2ec86b 100644
--- a/llvm/docs/Reference.rst
+++ b/llvm/docs/Reference.rst
@@ -41,7 +41,6 @@ LLVM and API reference documentation.
    PDB/index
    PointerAuth
    ScudoHardenedAllocator
-   MLGO
    MemoryModelRelaxationAnnotations
    MemTagSanitizer
    Security
@@ -239,3 +238,7 @@ Additional Topics
 :doc:`ConvergenceAndUniformity`
    A description of uniformity analysis in the presence of irreducible
    control flow, and its implementation.
+
+:doc:`MLGO`
+   Facilities for ML-Guided Optimization, such as collecting IR corpora from a
+   build, interfacing with ML models, and exposing features for training.
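
For illustration, here is a minimal Python sketch that writes a log following the sequence the MLGO.rst addition above describes: a one-line JSON header, a context marker, an observation marker, the tensor buffers as a binary dump, and (because a `score` is declared) an outcome record. The feature names, the `demo.log` file name, and the exact `TensorSpec` JSON fields (`name`, `port`, `type`, `shape`) are assumptions made for this sketch; the authoritative schema is whatever `TensorSpec` serializes and `Analysis/models/log_reader.py` accepts.

```python
import json
import struct

# Hypothetical specs for illustration only; the field names follow one reading
# of the TensorSpec JSON serialization and may differ in detail.
header = {
    "features": [
        {"name": "callee_basic_block_count", "port": 0,
         "type": "int64_t", "shape": [1]},
        {"name": "inlining_decision", "port": 0,
         "type": "int64_t", "shape": [1]},  # the action, logged last
    ],
    "score": {"name": "reward", "port": 0, "type": "float", "shape": [1]},
}

with open("demo.log", "wb") as f:
    # One-line JSON header describing the structure of the log.
    f.write(json.dumps(header).encode("utf-8") + b"\n")
    # Start of a context (e.g. one function, for a function pass).
    f.write(json.dumps({"context": "foo"}).encode("utf-8") + b"\n")
    # One observation: marker line, then the tensor buffers in header order,
    # in host byte order, then a newline.
    f.write(json.dumps({"observation": 0}).encode("utf-8") + b"\n")
    f.write(struct.pack("q", 42))  # callee_basic_block_count
    f.write(struct.pack("q", 1))   # inlining_decision
    f.write(b"\n")
    # Because the header declared a score, an outcome record follows: a marker
    # line, the score tensor as a binary dump, and a newline. (Check
    # log_reader.py for the exact value expected inside the marker.)
    f.write(json.dumps({"outcome": 0}).encode("utf-8") + b"\n")
    f.write(struct.pack("f", 0.5))
    f.write(b"\n")
```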

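A matching reader sketch, again only illustrative; the in-tree reference reader is `Analysis/models/log_reader.py`, and this version handles just the scalar element types used above:

```python
import json
import struct

# Byte sizes for the element types used in the sketch above; the real reader
# supports the full TensorSpec type set.
_ELEMENT_SIZES = {"int64_t": 8, "float": 4}


def _num_elements(shape):
    n = 1
    for dim in shape:
        n *= dim
    return n


def read_log(path):
    """Yield (context, observation index, raw feature buffers, raw outcome)."""
    with open(path, "rb") as f:
        header = json.loads(f.readline())
        specs = header["features"]
        score = header.get("score")
        context = None
        while True:
            line = f.readline()
            if not line:
                return
            marker = json.loads(line)
            if "context" in marker:
                context = marker["context"]
                continue
            index = marker["observation"]
            features = []
            for spec in specs:
                size = _num_elements(spec["shape"]) * _ELEMENT_SIZES[spec["type"]]
                features.append(f.read(size))  # decode with struct as needed
            f.readline()  # newline terminating the binary dump
            outcome = None
            if score is not None:
                f.readline()  # the {"outcome": ...} marker line
                size = _num_elements(score["shape"]) * _ELEMENT_SIZES[score["type"]]
                outcome = f.read(size)
                f.readline()  # trailing newline
            yield context, index, features, outcome


for context, index, features, outcome in read_log("demo.log"):
    print(context, index,
          [struct.unpack("q", buf)[0] for buf in features],
          struct.unpack("f", outcome)[0] if outcome else None)
```
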
boomanaiden154 (Contributor) left a comment

It looks like the docs build is failing. That needs to be fixed.

mtrofin merged commit 2b8bff6 into llvm:main on May 22, 2025
10 checks passed
mtrofin deleted the docs3 branch on May 22, 2025 at 19:28
sivan-shani pushed a commit to sivan-shani/llvm-project that referenced this pull request Jun 3, 2025
ajaden-codes pushed a commit to Jaddyen/llvm-project that referenced this pull request Jun 6, 2025