@@ -89,18 +89,18 @@ instruction, run:

.. code-block:: bash

- $ llvm-exegesis -mode=latency -opcode-name=ADD64rr
+ $ llvm-exegesis --mode=latency --opcode-name=ADD64rr

Measuring the uop decomposition or inverse throughput of an instruction works similarly:

.. code-block:: bash

- $ llvm-exegesis -mode=uops -opcode-name=ADD64rr
- $ llvm-exegesis -mode=inverse_throughput -opcode-name=ADD64rr
+ $ llvm-exegesis --mode=uops --opcode-name=ADD64rr
+ $ llvm-exegesis --mode=inverse_throughput --opcode-name=ADD64rr


The output is a YAML document (the default is to write to stdout, but you can
- redirect the output to a file using `-benchmarks-file`):
+ redirect the output to a file using `--benchmarks-file`):

.. code-block:: none

@@ -125,7 +125,7 @@ To measure the latency of all instructions for the host architecture, run:

.. code-block:: bash

- $ llvm-exegesis -mode=latency -opcode-index=-1
+ $ llvm-exegesis --mode=latency --opcode-index=-1


EXAMPLE 2: benchmarking a custom code snippet
@@ -136,7 +136,7 @@ To measure the latency/uops of a custom piece of code, you can specify the

.. code-block:: bash

- $ echo "vzeroupper" | llvm-exegesis -mode=uops -snippets-file=-
+ $ echo "vzeroupper" | llvm-exegesis --mode=uops --snippets-file=-

Real-life code snippets typically depend on registers or memory.
:program:`llvm-exegesis` checks the liveness of registers (i.e. any register
@@ -189,10 +189,10 @@ following command:

.. code-block:: bash

- $ llvm-exegesis -mode=analysis \
- -benchmarks-file=/tmp/benchmarks.yaml \
- -analysis-clusters-output-file=/tmp/clusters.csv \
- -analysis-inconsistencies-output-file=/tmp/inconsistencies.html
+ $ llvm-exegesis --mode=analysis \
+ --benchmarks-file=/tmp/benchmarks.yaml \
+ --analysis-clusters-output-file=/tmp/clusters.csv \
+ --analysis-inconsistencies-output-file=/tmp/inconsistencies.html

This will group the instructions into clusters with the same performance
characteristics. The clusters will be written out to `/tmp/clusters.csv` in the
@@ -230,28 +230,28 @@ be shown. This does not invalidate any of the analysis results though.
OPTIONS
-------

- .. option:: -help
+ .. option:: --help

Print a summary of command line options.

- .. option:: -opcode-index=<LLVM opcode index>
+ .. option:: --opcode-index=<LLVM opcode index>

Specify the opcode to measure, by index. Specifying `-1` will result
in measuring every existing opcode. See example 1 for details.
Either `opcode-index`, `opcode-name` or `snippets-file` must be set.

- .. option:: -opcode-name=<opcode name 1>,<opcode name 2>,...
+ .. option:: --opcode-name=<opcode name 1>,<opcode name 2>,...

Specify the opcode to measure, by name. Several opcodes can be specified as
a comma-separated list. See example 1 for details.
Either `opcode-index`, `opcode-name` or `snippets-file` must be set.

- .. option:: -snippets-file=<filename>
+ .. option:: --snippets-file=<filename>

Specify the custom code snippet to measure. See example 2 for details.
Either `opcode-index`, `opcode-name` or `snippets-file` must be set.

- .. option:: -mode=[latency|uops|inverse_throughput|analysis]
+ .. option:: --mode=[latency|uops|inverse_throughput|analysis]

Specify the run mode. Note that some modes have additional requirements and options.

@@ -274,7 +274,7 @@ OPTIONS
* ``assemble-measured-code``: Same as ``prepare-and-assemble-snippet``, but also creates the full sequence that can be dumped to a file using ``--dump-object-to-disk``.
* ``measure``: Same as ``assemble-measured-code``, but also runs the measurement.

- .. option:: -x86-lbr-sample-period=<nBranches/sample>
+ .. option:: --x86-lbr-sample-period=<nBranches/sample>

Specify the LBR sampling period - how many branches before we take a sample.
When a positive value is specified for this option and when the mode is `latency`,
@@ -283,7 +283,7 @@ OPTIONS
could occur if the sampling is too frequent. A prime number should be used to
avoid consistently skipping certain blocks.

- .. option:: -x86-disable-upper-sse-registers
+ .. option:: --x86-disable-upper-sse-registers

Using the upper xmm registers (xmm8-xmm15) forces a longer instruction encoding
which may put greater pressure on the frontend fetch and decode stages,
@@ -292,7 +292,7 @@ OPTIONS
enabled can help determine the effects of the frontend and can be used to
improve latency and throughput estimates.

- .. option:: -repetition-mode=[duplicate|loop|min]
+ .. option:: --repetition-mode=[duplicate|loop|min]

Specify the repetition mode. `duplicate` will create a large, straight line
basic block with `num-repetitions` instructions (repeating the snippet
@@ -307,21 +307,21 @@ OPTIONS
instead use the `min` mode, which will run each other mode,
and produce the minimal measured result.

- .. option:: -num-repetitions=<Number of repetitions>
+ .. option:: --num-repetitions=<Number of repetitions>

Specify the target number of executed instructions. Note that the actual
repetition count of the snippet will be `num-repetitions`/`snippet size`.
Higher values lead to more accurate measurements but lengthen the benchmark.

- .. option:: -loop-body-size=<Preferred loop body size>
+ .. option:: --loop-body-size=<Preferred loop body size>

Only effective for `-repetition-mode=[loop|min]`.
Instead of looping over the snippet directly, first duplicate it so that the
loop body contains at least this many instructions. This potentially results
in the loop body being cached in the CPU Op Cache / Loop Cache,
which may have higher throughput than the CPU decoders.

- .. option:: -max-configs-per-opcode=<value>
+ .. option:: --max-configs-per-opcode=<value>

Specify the maximum configurations that can be generated for each opcode.
By default this is `1`, meaning that we assume that a single measurement is
@@ -333,67 +333,67 @@ OPTIONS
lead to different performance characteristics.


- .. option:: -benchmarks-file=</path/to/file>
+ .. option:: --benchmarks-file=</path/to/file>

File to read (`analysis` mode) or write (`latency`/`uops`/`inverse_throughput`
modes) benchmark results. "-" uses stdin/stdout.

- .. option:: -analysis-clusters-output-file=</path/to/file>
+ .. option:: --analysis-clusters-output-file=</path/to/file>

If provided, write the analysis clusters as CSV to this file. "-" prints to
stdout. By default, this analysis is not run.

- .. option:: -analysis-inconsistencies-output-file=</path/to/file>
+ .. option:: --analysis-inconsistencies-output-file=</path/to/file>

If non-empty, write inconsistencies found during analysis to this file. `-`
prints to stdout. By default, this analysis is not run.

- .. option:: -analysis-filter=[all|reg-only|mem-only]
+ .. option:: --analysis-filter=[all|reg-only|mem-only]

By default, all benchmark results are analysed, but sometimes it may be useful
to only look at those that do not involve memory, or vice versa. This option
lets you either keep all benchmarks, or filter out (ignore) either all the
ones that do involve memory (involve instructions that may read or write to
memory), or the opposite, to only keep such benchmarks.

- .. option:: -analysis-clustering=[dbscan,naive]
+ .. option:: --analysis-clustering=[dbscan,naive]

Specify the clustering algorithm to use. By default DBSCAN will be used.
The naive clustering algorithm is better for doing further work on the
`-analysis-inconsistencies-output-file=` output; it will create one cluster
per opcode, and check that the cluster is stable (all points are neighbours).

- .. option:: -analysis-numpoints=<dbscan numPoints parameter>
+ .. option:: --analysis-numpoints=<dbscan numPoints parameter>

Specify the numPoints parameter to be used for DBSCAN clustering
(`analysis` mode, DBSCAN only).

- .. option:: -analysis-clustering-epsilon=<dbscan epsilon parameter>
+ .. option:: --analysis-clustering-epsilon=<dbscan epsilon parameter>

Specify the epsilon parameter used for clustering of benchmark points
(`analysis` mode).

- .. option:: -analysis-inconsistency-epsilon=<epsilon>
+ .. option:: --analysis-inconsistency-epsilon=<epsilon>

Specify the epsilon parameter used for detection of when the cluster
is different from the LLVM schedule profile values (`analysis` mode).

- .. option:: -analysis-display-unstable-clusters
+ .. option:: --analysis-display-unstable-clusters

If there is more than one benchmark for an opcode, said benchmarks may end up
not being clustered into the same cluster if the measured performance
characteristics are different. By default, all such opcodes are filtered out.
This flag will instead show only such unstable opcodes.

- .. option:: -ignore-invalid-sched-class=false
+ .. option:: --ignore-invalid-sched-class=false

If set, ignore instructions that do not have a sched class (class idx = 0).

- .. option:: -mtriple=<triple name>
+ .. option:: --mtriple=<triple name>

Target triple. See `-version` for available targets.

- .. option:: -mcpu=<cpu name>
+ .. option:: --mcpu=<cpu name>

If set, measure the cpu characteristics using the counters for this CPU. This
is useful when creating new sched models (the host CPU is unknown to LLVM).
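
To see the double-dash spelling from this patch in one place, the sketch below prints (rather than runs) the measure-then-analyse workflow from the examples above. The `exegesis_cmd` helper is hypothetical and exists only for illustration, and the `/tmp` paths are the ones the analysis example already uses. Nothing in the script invokes `llvm-exegesis` itself.

```shell
#!/bin/sh
# Dry-run sketch: print llvm-exegesis command lines using double-dash options.
# exegesis_cmd is a hypothetical helper for illustration, not part of LLVM.
exegesis_cmd() {
    mode=$1
    shift
    printf 'llvm-exegesis --mode=%s' "$mode"
    # Each remaining argument is an option given without its leading dashes.
    for opt in "$@"; do
        printf ' --%s' "$opt"
    done
    printf '\n'
}

# Step 1: measure latency and write the YAML results to a file.
exegesis_cmd latency opcode-name=ADD64rr benchmarks-file=/tmp/benchmarks.yaml

# Step 2: read the results back in analysis mode and write the clusters as CSV.
exegesis_cmd analysis benchmarks-file=/tmp/benchmarks.yaml \
    analysis-clusters-output-file=/tmp/clusters.csv
```

Printing the commands keeps the sketch independent of the host: whether a given mode actually runs depends on the machine's performance counters and the installed LLVM build, which this example does not assume.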