Commit a3c4a5c

[MLGO][Docs] Add documentation on corpus tooling (llvm#139362)
This adds some documentation on the three corpus tools, some examples, and fixes the TODO telling me to get this done.
1 parent f01f082 commit a3c4a5c

1 file changed: +174 -2 lines changed

llvm/docs/MLGO.rst

Lines changed: 174 additions & 2 deletions
@@ -27,8 +27,180 @@ of models during training.
Corpus Tooling
==============

30-
..
31-
TODO(boomanaiden154): Write this section.
Within the LLVM monorepo, there is the ``mlgo-utils`` Python package that
lives at ``llvm/utils/mlgo-utils``. This package primarily contains tooling
for working with corpora, or collections of LLVM bitcode. We use these corpora
to train and evaluate ML models. Corpora consist of a description in JSON
format at ``corpus_description.json`` in the root of the corpus, and then
a bitcode file and command line flags file for each extracted module. The
corpus structure is designed to contain sufficient information to fully
compile the bitcode to bit-identical object files.

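As a rough illustration of the layout described above, the following sketch
loads a corpus description and derives the per-module file paths. The
``modules`` key and the ``.bc``/``.cmd`` suffixes are assumptions made for
this example, and ``list_module_files`` is a hypothetical helper rather than
part of ``mlgo-utils``; check a generated corpus for the exact schema.

.. code-block:: python

  import json
  import os


  def list_module_files(corpus_dir):
      """Return (bitcode_path, cmd_path) pairs for each module in a corpus."""
      description_path = os.path.join(corpus_dir, "corpus_description.json")
      with open(description_path, encoding="utf-8") as description_file:
          description = json.load(description_file)
      # Assumption: module names are stored as a list under a "modules" key,
      # and each module has a bitcode (.bc) and a flags (.cmd) file.
      return [
          (
              os.path.join(corpus_dir, name + ".bc"),
              os.path.join(corpus_dir, name + ".cmd"),
          )
          for name in description.get("modules", [])
      ]
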
.. program:: extract_ir.py

Synopsis
--------

Extracts a corpus from some form of a structured compilation database. This
tool supports a variety of different scenarios and input types.

Options
-------

.. option:: --input

  The path to the input. This should be a path to a supported structured
  compilation database. Currently, the supported inputs are
  ``compile_commands.json`` files, linker parameter files, a directory
  containing object files (for the local ThinLTO case only), and a JSON file
  containing a ``bazel aquery`` result.

.. option:: --input_type

  The type of input that has been passed to the ``--input`` flag.

.. option:: --output_dir

  The output directory to place the corpus in.

.. option:: --num_workers

  The number of workers to use for extracting bitcode into the corpus. This
  defaults to the number of hardware threads available on the host system.

.. option:: --llvm_objcopy_path

  The path to the llvm-objcopy binary to use when extracting bitcode.

.. option:: --obj_base_dir

  The base directory for object files. Bitcode files that get extracted into
  the corpus will be placed into the output directory based on where their
  source object files are placed relative to this path.

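The relative placement described for ``--obj_base_dir`` can be sketched as
follows. This is only an illustration of the path arithmetic, not the tool's
actual code; ``corpus_output_path`` is a hypothetical helper, and any file
suffix the tool appends to extracted modules is deliberately left out.

.. code-block:: python

  import os


  def corpus_output_path(obj_path, obj_base_dir, output_dir):
      """Mirror an object file's location under the corpus output directory."""
      # Location of the object file relative to the base directory.
      relative_path = os.path.relpath(obj_path, start=obj_base_dir)
      # The same relative location, re-rooted at the output directory.
      return os.path.join(output_dir, relative_path)

With this sketch, an object at ``<base>/lib/Support/foo.o`` would map to
``<output>/lib/Support/foo.o``.
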
.. option:: --cmd_filter

  Allows filtering of modules by command line. If set, only modules that match
  the filter will be extracted into the corpus. Regular expressions are
  supported in some instances.

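A minimal sketch of command line filtering, assuming the filter is applied
as a regular expression search over the space-joined command line. The exact
matching semantics of ``extract_ir.py`` may differ, and ``matches_filter`` is
a hypothetical name used only for this example.

.. code-block:: python

  import re


  def matches_filter(command_line, cmd_filter):
      """Return True if a module's command line passes the filter."""
      # No filter set: every module is kept.
      if cmd_filter is None:
          return True
      # Assumption: the filter is searched for anywhere in the joined command.
      return re.search(cmd_filter, " ".join(command_line)) is not None
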
.. option:: --thinlto_build

  If the build was performed with ThinLTO, this should be set to either
  ``distributed`` or ``local`` depending upon how the build was performed.

.. option:: --cmd_section_name

  This flag allows specifying the command line section name. This is needed
  on non-ELF platforms where the section name might differ.

.. option:: --bitcode_section_name

  This flag allows specifying the bitcode section name. This is needed on
  non-ELF platforms where the section name might differ.

Example: CMake
--------------

CMake can output a ``compile_commands.json`` compilation database if the
``CMAKE_EXPORT_COMPILE_COMMANDS`` switch is turned on at configure time. It is
also necessary to enable bitcode embedding (done by passing
``-Xclang -fembed-bitcode=all`` to all C/C++ compilation actions in the
non-ThinLTO case). For example, to extract a corpus from clang, you would
run the following commands (assuming that the system C/C++ compiler is clang):

.. code-block:: bash

  cmake -GNinja \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
    -DCMAKE_C_FLAGS="-Xclang -fembed-bitcode=all" \
    -DCMAKE_CXX_FLAGS="-Xclang -fembed-bitcode=all" \
    ../llvm
  ninja

After running CMake and building the project, there should be a
``compile_commands.json`` file within the build directory. You can then
run the following command to create a corpus:

.. code-block:: bash

  python3 ./extract_ir.py \
    --input=./build/compile_commands.json \
    --input_type=json \
    --output_dir=./corpus

After running the above command, there should be a full
corpus of bitcode within the ``./corpus`` directory.

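If extraction produces fewer modules than expected, it can help to check that
the embedding flag actually reached every compile command. The sketch below
relies only on the standard ``compile_commands.json`` layout (a list of
entries with a ``file`` key and either ``arguments`` or ``command``);
``entries_missing_bitcode`` is a hypothetical helper, not part of
``mlgo-utils``, and it covers the non-ThinLTO flag only.

.. code-block:: python

  import json
  import shlex

  EMBED_FLAG = "-fembed-bitcode=all"


  def entries_missing_bitcode(compile_commands_path):
      """List source files whose compile command lacks the embedding flag."""
      with open(compile_commands_path, encoding="utf-8") as commands_file:
          entries = json.load(commands_file)
      missing = []
      for entry in entries:
          # An entry carries either an "arguments" list or a "command" string.
          if "arguments" in entry:
              arguments = entry["arguments"]
          else:
              arguments = shlex.split(entry["command"])
          if EMBED_FLAG not in arguments:
              missing.append(entry["file"])
      return missing
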
Example: Bazel Aquery
---------------------

This tool also supports extracting bitcode from Bazel in multiple ways
depending upon the exact configuration. For ThinLTO, a linker parameters file
is preferred. For the non-ThinLTO case, the script will accept the output of
``bazel aquery``, which it will use to find all the object files that are
linked into a specific target and then extract bitcode from them. First, you
need to generate the aquery output:

.. code-block:: bash

  bazel aquery --output=jsonproto //path/to:target > /path/to/aquery.json

Afterwards, assuming that the build is already complete, you can run this
script to create a corpus:

.. code-block:: bash

  python3 ./extract_ir.py \
    --input=/path/to/aquery.json \
    --input_type=bazel_aquery \
    --output_dir=./corpus \
    --obj_base_dir=./bazel-bin

This will again leave a corpus that contains all the bitcode files. Note,
however, that this mode does not capture every object file in the build, only
the ones that are involved in the link for the binary passed to the
``bazel aquery`` invocation.

.. program:: make_corpus.py

Synopsis
--------

Creates a corpus from a collection of bitcode files.

Options
-------

.. option:: --input_dir

  The input directory to search for bitcode files in.

.. option:: --output_dir

  The output directory to place the constructed corpus in.

.. option:: --default_args

  A list of space separated flags that are put into the corpus description.
  These are used by some tooling when compiling the modules within the corpus.

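The following sketch shows roughly what building a corpus description from a
directory of bitcode files might look like. The ``.bc`` extension and the
``modules`` and ``global_command_override`` keys are assumptions made for
illustration, and ``build_corpus_description`` is a hypothetical helper; the
real ``make_corpus.py`` may structure its output differently.

.. code-block:: python

  import pathlib


  def build_corpus_description(input_dir, default_args=""):
      """Collect bitcode files under input_dir into a description dict."""
      root = pathlib.Path(input_dir)
      # Module names are paths relative to the root, without the .bc suffix.
      modules = sorted(
          path.relative_to(root).with_suffix("").as_posix()
          for path in root.rglob("*.bc")
      )
      description = {"modules": modules}
      if default_args:
          # Space separated flags become a list of individual arguments.
          description["global_command_override"] = default_args.split()
      return description
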
.. program:: combine_training_corpus.py

Synopsis
--------

Combines two training corpora that share the same parent folder by generating
a new ``corpus_description.json`` that contains all the modules in both corpora.

Options
-------

.. option:: --root_dir

  The root directory that contains subfolders consisting of the corpora that
  should be combined.
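
The combination step can be sketched as merging the module lists of every
subfolder and prefixing each module with its subfolder name, so that the
paths resolve relative to the shared root. The ``modules`` key is an
assumption, and ``combine_descriptions`` is a hypothetical helper that only
approximates what ``combine_training_corpus.py`` does.

.. code-block:: python

  import json
  import pathlib


  def combine_descriptions(root_dir):
      """Merge module lists from each sub-corpus found under root_dir."""
      root = pathlib.Path(root_dir)
      modules = []
      for description_path in sorted(root.glob("*/corpus_description.json")):
          subfolder = description_path.parent.name
          with description_path.open(encoding="utf-8") as description_file:
              description = json.load(description_file)
          # Prefix each module so its path is relative to the shared root.
          modules.extend(
              f"{subfolder}/{name}" for name in description.get("modules", [])
          )
      return {"modules": modules}
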

Interacting with ML models
==========================