[MLGO][Docs] Add documentation on corpus tooling #139362

Merged (3 commits) on May 15, 2025
llvm/docs/MLGO.rst (174 additions, 2 deletions)
Corpus Tooling
==============

Within the LLVM monorepo, there is the ``mlgo-utils`` Python package, which
lives at ``llvm/utils/mlgo-utils``. This package primarily contains tooling
for working with corpora, or collections of LLVM bitcode, which we use to
train and evaluate ML models. A corpus consists of a description in JSON
format at ``corpus_description.json`` in the root of the corpus, plus a
bitcode file and a command-line flags file for each extracted module. The
corpus structure is designed to contain sufficient information to fully
compile the bitcode to bit-identical object files.
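
For illustration, a small corpus might be laid out roughly as follows (a
hypothetical sketch; apart from ``corpus_description.json``, the file names
and description fields shown are assumptions, not a format specification):

.. code-block:: none

  corpus/
    corpus_description.json   (e.g. {"modules": ["lib/foo"], ...})
    lib/
      foo.bc                  (extracted bitcode for module lib/foo)
      foo.cmd                 (command line flags for module lib/foo)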

.. program:: extract_ir.py

Synopsis
--------

Extracts a corpus from some form of a structured compilation database. This
tool supports a variety of different scenarios and input types.

Options
-------

.. option:: --input

The path to the input. This should be a path to a supported structured
compilation database. Currently, the supported inputs are
``compile_commands.json`` files, linker parameter files, directories
containing object files (for the local ThinLTO case only), and JSON files
containing a bazel aquery result.

.. option:: --input_type

The type of input that has been passed to the ``--input`` flag.

.. option:: --output_dir

The output directory to place the corpus in.

.. option:: --num_workers

The number of workers to use for extracting bitcode into the corpus. This
defaults to the number of hardware threads available on the host system.

.. option:: --llvm_objcopy_path

The path to the llvm-objcopy binary to use when extracting bitcode.

.. option:: --obj_base_dir

The base directory for object files. Bitcode files that get extracted into
the corpus will be placed into the output directory based on where their
source object files are placed relative to this path.

.. option:: --cmd_filter

Allows filtering of modules by command line. If set, only modules whose
command lines match the filter will be extracted into the corpus. Regular
expressions are supported in some instances.

.. option:: --thinlto_build

If the build was performed with ThinLTO, this should be set to either
``distributed`` or ``local`` depending upon how the build was performed.

.. option:: --cmd_section_name

This flag allows specifying the command line section name. This is needed
on non-ELF platforms where the section name might differ.

.. option:: --bitcode_section_name

This flag allows specifying the bitcode section name. This is needed on
non-ELF platforms where the section name might differ.

Example: CMake
--------------

CMake can output a ``compile_commands.json`` compilation database if the
``CMAKE_EXPORT_COMPILE_COMMANDS`` switch is turned on at configure time. It
is also necessary to enable bitcode embedding (done by passing
``-Xclang -fembed-bitcode=all`` to all C/C++ compilation actions in the
non-ThinLTO case). For example, to extract a corpus from clang, you would
run the following commands (assuming that the system C/C++ compiler is
clang):

.. code-block:: bash

  cmake -GNinja \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
    -DCMAKE_C_FLAGS="-Xclang -fembed-bitcode=all" \
    -DCMAKE_CXX_FLAGS="-Xclang -fembed-bitcode=all" \
    ../llvm
  ninja

After running CMake and building the project, there should be a
``compile_commands.json`` file within the build directory. You can then run
the following command to create a corpus:

.. code-block:: bash

  python3 ./extract_ir.py \
    --input=./build/compile_commands.json \
    --input_type=json \
    --output_dir=./corpus

After running the above command, there should be a full
corpus of bitcode within the ``./corpus`` directory.

Example: Bazel Aquery
---------------------

This tool also supports extracting bitcode from Bazel builds in multiple
ways, depending upon the exact configuration. For ThinLTO, a linker
parameters file is preferred. For the non-ThinLTO case, the script accepts
the output of ``bazel aquery``, which it uses to find all the object files
linked into a specific target and then extract bitcode from them. First,
you need to generate the aquery output:

.. code-block:: bash

  bazel aquery --output=jsonproto //path/to:target > /path/to/aquery.json

Afterwards, assuming that the build is already complete, you can run this
script to create a corpus:

.. code-block:: bash

  python3 ./extract_ir.py \
    --input=/path/to/aquery.json \
    --input_type=bazel_aquery \
    --output_dir=./corpus \
    --obj_base_dir=./bazel-bin

This will again leave a corpus containing all the bitcode files. Note that
this mode does not capture every object file in the build, only the ones
involved in the link for the binary passed to the ``bazel aquery``
invocation.

.. program:: make_corpus.py

Synopsis
--------

Creates a corpus from a collection of bitcode files.

Options
-------

.. option:: --input_dir

The input directory to search for bitcode files in.

.. option:: --output_dir

The output directory to place the constructed corpus in.

.. option:: --default_args

A list of space-separated flags that are put into the corpus description.
These are used by some tooling when compiling the modules within the corpus.
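
Conceptually, the tool walks the input directory for bitcode files and
records them, together with the default arguments, in a corpus description.
A rough sketch of that behavior in Python (illustrative only: the
``sketch_make_corpus`` helper, the field names, and the exact logic are
assumptions, not the tool's actual implementation):

.. code-block:: python

  import json
  from pathlib import Path

  def sketch_make_corpus(input_dir, output_dir, default_args):
      # Illustrative sketch: index the .bc files under input_dir into a
      # corpus description. Field names here are assumptions, not a spec.
      input_path = Path(input_dir)
      # Module paths are stored relative to the corpus root, without the
      # .bc extension.
      modules = sorted(
          str(p.relative_to(input_path).with_suffix(""))
          for p in input_path.rglob("*.bc")
      )
      description = {
          "global_command_override": default_args,
          "modules": modules,
      }
      out = Path(output_dir)
      out.mkdir(parents=True, exist_ok=True)
      (out / "corpus_description.json").write_text(
          json.dumps(description, indent=2))
      return description

Recording ``default_args`` in the description lets downstream tooling
reproduce compile commands for every module without a per-module flags file.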

.. program:: combine_training_corpus.py

Synopsis
--------

Combines two training corpora that share the same parent folder by generating
a new ``corpus_description.json`` that contains all the modules in both corpora.

Options
-------

.. option:: --root_dir

The root directory that contains subfolders consisting of the corpora that
should be combined.
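
The combination is mostly path bookkeeping: module paths from each
sub-corpus are prefixed with their subfolder name in the combined
description. A rough Python sketch (illustrative only: the
``sketch_combine_corpora`` helper and the field names are assumptions, not
the script's actual implementation):

.. code-block:: python

  import json
  from pathlib import Path

  def sketch_combine_corpora(root_dir):
      # Illustrative sketch: merge the descriptions of sub-corpora found
      # directly under root_dir into one combined description.
      root = Path(root_dir)
      combined_modules = []
      for sub_description in sorted(root.glob("*/corpus_description.json")):
          sub_name = sub_description.parent.name
          description = json.loads(sub_description.read_text())
          # Prefix each module path with its sub-corpus folder name so the
          # combined description resolves relative to root_dir.
          combined_modules.extend(
              f"{sub_name}/{m}" for m in description["modules"])
      combined = {"modules": combined_modules}
      (root / "corpus_description.json").write_text(
          json.dumps(combined, indent=2))
      return combined

Because only the description is rewritten, the bitcode and flags files stay
where they are; the combined corpus simply references them through the
subfolder-prefixed paths.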

Interacting with ML models
==========================