@@ -27,8 +27,180 @@ of models during training.
Corpus Tooling
==============

- ..
-   TODO(boomanaiden154): Write this section.
+ Within the LLVM monorepo, there is the ``mlgo-utils`` Python package, which
+ lives at ``llvm/utils/mlgo-utils``. This package primarily contains tooling
+ for working with corpora, or collections of LLVM bitcode. We use these corpora
+ to train and evaluate ML models. A corpus consists of a description in JSON
+ format at ``corpus_description.json`` in the root of the corpus, plus
+ a bitcode file and a command line flags file for each extracted module. The
+ corpus structure is designed to contain sufficient information to fully
+ compile the bitcode to bit-identical object files.
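
The layout described above can be read back with a few lines of Python. This is only an illustrative sketch, not code from ``mlgo-utils``: ``list_corpus_modules`` is a hypothetical helper, and the ``modules`` key, the ``.bc``/``.cmd`` file suffixes, and the NUL-separated flag encoding are assumptions about the on-disk format that the text does not state.

```python
import json
from pathlib import Path


def list_corpus_modules(corpus_dir):
    """Return (bitcode_path, flags) pairs for every module in a corpus.

    Assumed layout (not confirmed by the text): the description holds a
    "modules" list of relative names, and each module is stored as
    <name>.bc next to a <name>.cmd file of NUL-separated flags.
    """
    corpus_dir = Path(corpus_dir)
    with open(corpus_dir / "corpus_description.json") as f:
        description = json.load(f)

    modules = []
    for name in description.get("modules", []):
        bitcode_path = corpus_dir / (name + ".bc")
        cmd_path = corpus_dir / (name + ".cmd")
        flags = []
        if cmd_path.exists():
            # Drop empty entries produced by a trailing separator.
            flags = [flag for flag in cmd_path.read_text().split("\0") if flag]
        modules.append((bitcode_path, flags))
    return modules
```

Calling ``list_corpus_modules("./corpus")`` on an extracted corpus would then give one entry per module, which is a convenient starting point for sanity checks such as verifying that every listed bitcode file exists.
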
+
+ .. program:: extract_ir.py
+
+ Synopsis
+ --------
+
+ Extracts a corpus from some form of a structured compilation database. This
+ tool supports a variety of different scenarios and input types.
+
+ Options
+ -------
+
+ .. option:: --input
+
+   The path to the input. This should be a path to a supported structured
+   compilation database. The supported inputs are ``compile_commands.json``
+   files, linker parameter files, a directory containing object files (for the
+   local ThinLTO case only), and a JSON file containing a bazel aquery result.
+
+ .. option:: --input_type
+
+   The type of input that has been passed to the ``--input`` flag.
+
+ .. option:: --output_dir
+
+   The output directory to place the corpus in.
+
+ .. option:: --num_workers
+
+   The number of workers to use for extracting bitcode into the corpus. This
+   defaults to the number of hardware threads available on the host system.
+
+ .. option:: --llvm_objcopy_path
+
+   The path to the llvm-objcopy binary to use when extracting bitcode.
+
+ .. option:: --obj_base_dir
+
+   The base directory for object files. Bitcode files that get extracted into
+   the corpus will be placed into the output directory based on where their
+   source object files are placed relative to this path.
+
+ .. option:: --cmd_filter
+
+   Allows filtering of modules by command line. If set, only modules that match
+   the filter will be extracted into the corpus. Regular expressions are
+   supported in some instances.
+
+ .. option:: --thinlto_build
+
+   If the build was performed with ThinLTO, this should be set to either
+   ``distributed`` or ``local`` depending upon how the build was performed.
+
+ .. option:: --cmd_section_name
+
+   This flag allows specifying the command line section name. This is needed
+   on non-ELF platforms where the section name might differ.
+
+ .. option:: --bitcode_section_name
+
+   This flag allows specifying the bitcode section name. This is needed on
+   non-ELF platforms where the section name might differ.
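
To make the ``--obj_base_dir`` behaviour described above more concrete, here is a minimal sketch of how an object file's location could be mirrored into the output directory. ``corpus_path_for_object`` is a hypothetical helper written for illustration, and the ``.bc`` suffix appended to the object file name is an assumption; the real tool may differ in detail.

```python
import os


def corpus_path_for_object(obj_path, obj_base_dir, output_dir):
    """Mirror an object file's position under obj_base_dir into output_dir."""
    # Path of the object file relative to the base directory.
    relative = os.path.relpath(obj_path, start=obj_base_dir)
    # Assumption: the extracted bitcode keeps the object name plus ".bc".
    return os.path.join(output_dir, relative) + ".bc"
```

Under these assumptions, an object at ``bazel-bin/pkg/foo.o`` with ``--obj_base_dir=bazel-bin`` and ``--output_dir=corpus`` would map to a bitcode file under ``corpus/pkg/``, so the corpus keeps the directory structure of the build tree.
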
+
+ Example: CMake
+ --------------
+
+ CMake can output a ``compile_commands.json`` compilation database if the
+ ``CMAKE_EXPORT_COMPILE_COMMANDS`` switch is turned on at configure time. It is
+ also necessary to enable bitcode embedding (done by passing
+ ``-Xclang -fembed-bitcode=all`` to all C/C++ compilation actions in the
+ non-ThinLTO case). For example, to extract a corpus from clang, you would
+ run the following commands (assuming that the system C/C++ compiler is clang):
+
+ .. code-block:: bash
+
+   cmake -GNinja \
+     -DCMAKE_BUILD_TYPE=Release \
+     -DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
+     -DCMAKE_C_FLAGS="-Xclang -fembed-bitcode=all" \
+     -DCMAKE_CXX_FLAGS="-Xclang -fembed-bitcode=all" \
+     ../llvm
+   ninja
+
+ After running CMake and building the project, there should be a
+ ``compile_commands.json`` file within the build directory. You can then
+ run the following command to create a corpus:
+
+ .. code-block:: bash
+
+   python3 ./extract_ir.py \
+     --input=./build/compile_commands.json \
+     --input_type=json \
+     --output_dir=./corpus
+
+ After running the above command, there should be a full
+ corpus of bitcode within the ``./corpus`` directory.
136
+ Example: Bazel Aquery
137
+ ---------------------
138
+
139
+ This tool also supports extracting bitcode from bazel in multiple ways
140
+ depending upon the exact configuration. For ThinLTO, a linker parameters file
141
+ is preferred. For the non-ThinLTO case, the script will accept the output of
142
+ ``bazel aquery `` which it will use to find all the object files that are linked
143
+ into a specific target and then extract bitcode from them. First, you need
144
+ to generate the aquery output:
145
+
146
+ .. code-block :: bash
147
+
148
+ bazel aquery --output=jsonproto //path/to:target > /path/to/aquery.json
149
+
150
+ Afterwards, assuming that the build is already complete, you can run this
151
+ script to create a corpus:
152
+
153
+ .. code-block :: bash
154
+
155
+ python3 ./extract_ir.py \
156
+ --input=/path/to/aquery.json \
157
+ --input_type=bazel_aqeury \
158
+ --output_dir=./corpus \
159
+ --obj_base_dir=./bazel-bin
160
+
161
+ This will again leave a corpus that contains all the bitcode files. This mode
162
+ does not capture all object files in the build however, only the ones that
163
+ are involved in the link for the binary passed to the ``bazel aquery ``
164
+ invocation.
+
+ .. program:: make_corpus.py
+
+ Synopsis
+ --------
+
+ Creates a corpus from a collection of bitcode files.
+
+ Options
+ -------
+
+ .. option:: --input_dir
+
+   The input directory to search for bitcode files in.
+
+ .. option:: --output_dir
+
+   The output directory to place the constructed corpus in.
+
+ .. option:: --default_args
+
+   A list of space-separated flags that are put into the corpus description.
+   These are used by some tooling when compiling the modules within the corpus.
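
The three options above suggest a simple pipeline: find bitcode files, copy them, and record them in a description. The following is a rough sketch of that idea under stated assumptions, not the actual ``make_corpus.py`` implementation: ``make_corpus_sketch`` is a hypothetical name, bitcode files are assumed to end in ``.bc``, and the ``modules`` and ``global_command_override`` keys are assumed names for the description fields.

```python
import json
import shutil
from pathlib import Path


def make_corpus_sketch(input_dir, output_dir, default_args=()):
    """Copy *.bc files into output_dir and write a corpus description."""
    input_dir = Path(input_dir)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    modules = []
    for path in sorted(input_dir.rglob("*.bc")):
        relative = path.relative_to(input_dir)
        destination = output_dir / relative
        destination.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(path, destination)
        # Modules are recorded without the ".bc" suffix (an assumption).
        modules.append(relative.with_suffix("").as_posix())

    description = {
        # Assumed key name for the flags passed via --default_args.
        "global_command_override": list(default_args),
        "modules": modules,
    }
    with open(output_dir / "corpus_description.json", "w") as f:
        json.dump(description, f, indent=2)
    return description
```

Storing the default flags once in the description, rather than per module, fits the case where the bitcode files carry no recorded command line of their own.
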
+
+ .. program:: combine_training_corpus.py
+
+ Synopsis
+ --------
+
+ Combines two training corpora that share the same parent folder by generating
+ a new ``corpus_description.json`` that contains all the modules in both corpora.
+
+ Options
+ -------
+
+ .. option:: --root_dir
+
+   The root directory that contains subfolders consisting of the corpora that
+   should be combined.
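
As a rough illustration of the combining step described above, the sketch below merges the descriptions found in the immediate subfolders of a root directory. It is not the script's actual code: ``combine_corpus_descriptions`` is a hypothetical helper, the ``modules`` key is an assumption about the description format, and any other fields a real description may carry are ignored here.

```python
import json
import os


def combine_corpus_descriptions(root_dir):
    """Merge the corpus descriptions found in subfolders of root_dir."""
    combined_modules = []
    for entry in sorted(os.listdir(root_dir)):
        description_path = os.path.join(
            root_dir, entry, "corpus_description.json"
        )
        if not os.path.isfile(description_path):
            continue
        with open(description_path) as f:
            description = json.load(f)
        # Module names are relative to their own corpus, so prefix them with
        # the subfolder name to make them relative to root_dir.
        combined_modules.extend(
            f"{entry}/{module}" for module in description.get("modules", [])
        )

    combined = {"modules": combined_modules}
    with open(os.path.join(root_dir, "corpus_description.json"), "w") as f:
        json.dump(combined, f, indent=2)
    return combined
```

Because only the description is rewritten, no bitcode needs to be copied; the combined corpus simply points into the existing subfolders.
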

Interacting with ML models
==========================