Commit cc9e614

Merge branch 'main' into aligned_alloc
2 parents 7b1d493 + 54ca5a8 commit cc9e614

115 files changed, +50438 -28570 lines changed

bolt/docs/OptimizingLinux.md

Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
1+
# Optimizing Linux Kernel with BOLT
2+
3+
4+
## Introduction
5+
6+
Many Linux applications spend a significant share of their execution time in the kernel. When optimizing code for overall system performance, it is therefore essential to improve CPU utilization not only in user-space applications and libraries but also in the kernel. BOLT has demonstrated double-digit gains when applied to user-space programs. This guide shows how to apply BOLT to the x86-64 Linux kernel to improve your system's performance. In our experiments, BOLT boosted database TPS by 2 percent when applied to a kernel already compiled with the highest level of optimizations, including PGO and LTO. The database spent ~40% of its time in the kernel and was quite sensitive to kernel performance.
7+
8+
BOLT optimizes code layout based on a low-level execution profile collected with the Linux `perf` tool. The best quality profile should include branch history, such as Intel's last branch records (LBR). BOLT runs on a linked binary and reorders the code while combining frequently executed blocks of instructions in a manner best suited for the hardware. Other than branch instructions, most of the code is left unchanged. Additionally, BOLT updates all metadata associated with the modified code, including DWARF debug information and Linux ORC unwind information.
9+
10+
While BOLT optimizations are not specific to the Linux kernel, certain quirks distinguish the kernel from user-level applications.
11+
12+
BOLT has been successfully applied to and tested with several flavors of the x86-64 Linux kernel.
13+
14+
15+
## QuickStart Guide
16+
17+
BOLT operates on a statically linked kernel executable, a.k.a. the `vmlinux` binary. However, most Linux distributions boot from a compressed `vmlinuz` image. To use BOLT on the kernel, you must either repackage `vmlinuz` after BOLT optimizations or add steps for running BOLT to the kernel build and rebuild `vmlinuz`. Uncompressing `vmlinuz` and repackaging it with a new `vmlinux` binary is beyond the scope of this guide; at some point, we may add the capability to run BOLT directly on `vmlinuz`. Meanwhile, this guide focuses on integrating BOLT into the kernel build process.
18+
19+
20+
### Building the Kernel
21+
22+
After downloading the kernel sources and configuration for your distribution, you should be able to build `vmlinuz` using the `make bzImage` command. Ideally, the kernel you build should be a binary match for the kernel on the system you are about to optimize (the target system). Binary matching is critical because BOLT performs profile matching and optimizations at the binary level. We recommend installing the freshly built kernel on the target system to avoid any discrepancies.
23+
24+
Note that the kernel build produces several artifacts besides `bzImage`. The most important of them is the uncompressed `vmlinux` binary, which is used in the next steps. Make sure to save this file.
25+
26+
Both the build and target systems should have the `perf` tool installed for collecting and processing profiles. If your build system differs from the target, make sure the `perf` versions are compatible. The build system should also have the latest BOLT binary and tools (`llvm-bolt`, `perf2bolt`).
27+
28+
Once the target system boots with the freshly built kernel, start your workload, such as a database benchmark. While the system is under load, collect the kernel profile using `perf`:
29+
30+
31+
```bash
$ sudo perf record -a -e cycles -j any,k -F 5000 -- sleep 600
```
34+
35+
36+
Convert the `perf` profile into a format suitable for BOLT by passing the `vmlinux` binary to `perf2bolt`:
37+
38+
39+
```bash
$ sudo chown $USER perf.data
$ perf2bolt -p perf.data -o perf.fdata vmlinux
```
43+
44+
45+
Under high load, `perf.data` should be several gigabytes in size, and the converted `perf.fdata` should not exceed 100 MB.
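As a rough sanity check on these sizes, a small shell helper can flag a missing, empty, or unexpectedly large profile before it is fed to the kernel build. This is only a sketch: the `check_profile_size` name and its messages are made up for illustration, and the 100 MB limit in the usage note is simply the figure quoted in this guide.

```bash
# Hypothetical helper (not part of BOLT or perf): report whether a profile
# file exists, is non-empty, and stays under a size limit given in MB.
check_profile_size() {
  file=$1
  limit_mb=$2
  if [ ! -s "$file" ]; then
    echo "missing-or-empty"
    return 1
  fi
  # wc -c may pad its output with spaces; arithmetic expansion normalizes it.
  bytes=$(( $(wc -c < "$file") ))
  if [ "$bytes" -gt $(( limit_mb * 1024 * 1024 )) ]; then
    echo "larger-than-expected"
    return 2
  fi
  echo "ok"
}
```

For example, `check_profile_size perf.fdata 100` prints `ok` when the converted profile exists and is within the expected size.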
46+
47+
Two changes are required for the kernel build. The first one is optional but highly recommended. It introduces a BOLT-reserved space into the `vmlinux` code section:
48+
49+
50+
```diff
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -139,6 +139,11 @@ SECTIONS
 		STATIC_CALL_TEXT
 		*(.gnu.warning)
 
+	/* Allocate space for BOLT */
+	__bolt_reserved_start = .;
+	. += 2048 * 1024;
+	__bolt_reserved_end = .;
+
 #ifdef CONFIG_RETPOLINE
 		__indirect_thunk_start = .;
 		*(.text.__x86.*)
```
66+
67+
68+
The second patch adds a step that runs BOLT on the `vmlinux` binary:
69+
70+
71+
```diff
--- a/scripts/link-vmlinux.sh
+++ b/scripts/link-vmlinux.sh
@@ -340,5 +340,13 @@ if is_enabled CONFIG_KALLSYMS; then
 	fi
 fi
 
+# Apply BOLT
+BOLT=llvm-bolt
+BOLT_PROFILE=perf.fdata
+BOLT_OPTS="--dyno-stats --eliminate-unreachable=0 --reorder-blocks=ext-tsp --simplify-conditional-tail-calls=0 --skip-funcs=__entry_text_start,irq_entries_start --split-functions"
+mv vmlinux vmlinux.pre-bolt
+echo BOLTing vmlinux
+${BOLT} vmlinux.pre-bolt -o vmlinux --data ${BOLT_PROFILE} ${BOLT_OPTS}
+
 # For fixdep
 echo "vmlinux: $0" > .vmlinux.d
```
89+
90+
91+
If you skipped the first change (the BOLT-reserved space) or are running BOLT on a pre-built `vmlinux` binary, drop the `--split-functions` option.
92+
93+
94+
## Performance Expectations
95+
96+
By improving the code layout, BOLT can boost the kernel's performance by up to 5% by reducing instruction cache misses and branch mispredictions. When measuring total system performance, scale this number by the fraction of time your application spends in the kernel (excluding I/O time).
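To make the scaling concrete, here is a back-of-the-envelope estimate in shell. It assumes an Amdahl's-law style model in which only the kernel share of execution time speeds up; the `estimate_system_gain` helper is hypothetical, and the inputs (a 5% kernel speedup and a 40% kernel share) are taken from the figures quoted in this guide.

```bash
# Hypothetical helper: estimate the whole-system speedup (in percent).
# $1 = kernel speedup in percent, $2 = share of time spent in the kernel in percent.
# Assumed model: only the kernel share of execution time gets faster.
estimate_system_gain() {
  awk -v gain="$1" -v share="$2" 'BEGIN {
    f = share / 100          # fraction of time spent in the kernel
    s = 1 + gain / 100       # speedup factor of the kernel part
    printf "%.2f\n", (1 / ((1 - f) + f / s) - 1) * 100
  }'
}

estimate_system_gain 5 40   # prints 1.94
```

With these inputs the estimate is just under 2%, which is in line with the database result mentioned in the introduction.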
97+
98+
99+
## Profile Quality
100+
101+
The timing and duration of profiling can have a significant effect on the performance of the BOLTed kernel. If you don't know your workload well, profile for the whole duration of the benchmark run. Because longer runs produce larger `perf.data` files, you can lower the profiling frequency by passing a smaller value to the `-F` flag. For example, to record the kernel profile for half an hour, use the following command:
102+
103+
104+
```bash
$ sudo perf record -a -e cycles -j any,k -F 1000 -- sleep 1800
```
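A quick way to compare the two `perf record` invocations in this guide is to multiply the sampling frequency by the duration, which gives a rough upper bound on the number of samples per CPU. The `samples_per_cpu` helper is hypothetical, and real counts also depend on the number of CPUs, idle time, and how much branch data each sample carries.

```bash
# Hypothetical helper: frequency (Hz) x duration (s) = upper bound on samples per CPU.
samples_per_cpu() {
  echo $(( $1 * $2 ))
}

samples_per_cpu 5000 600    # 10-minute run at 5000 Hz: prints 3000000
samples_per_cpu 1000 1800   # 30-minute run at 1000 Hz: prints 1800000
```

By this estimate, the half-hour run at the lower frequency collects fewer samples per CPU than the ten-minute run at 5000 Hz, so its `perf.data` file should stay manageable.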
107+
108+
109+
110+
## BOLT Disassembly
111+
112+
BOLT annotates the disassembly with control-flow information and attaches Linux-specific metadata to the code. To view annotated disassembly, run:
113+
114+
115+
```bash
$ llvm-bolt vmlinux -o /dev/null --print-cfg
```
118+
119+
120+
If you want to limit the disassembly to a set of functions, add `--print-only=<func1regex>,<func2regex>,...`, where each function name is given as a regular expression.
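As an illustration, the comma-separated list can be assembled from individual patterns in shell. The `print_only_arg` helper and the two patterns (`tcp_sendmsg`, `do_syscall_.*`) are hypothetical placeholders; the sketch only prints the resulting command line so you can inspect it before running it.

```bash
# Hypothetical helper: join function-name regexes with commas
# into a single --print-only argument.
print_only_arg() {
  local IFS=,
  printf '%s%s\n' '--print-only=' "$*"
}

# Placeholder patterns; substitute the functions you care about.
arg=$(print_only_arg 'tcp_sendmsg' 'do_syscall_.*')
echo "llvm-bolt vmlinux -o /dev/null --print-cfg $arg"
```

Quoting each pattern keeps the shell from expanding characters such as `*` before they reach `llvm-bolt`.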

clang/include/clang/Driver/Options.td

Lines changed: 3 additions & 4 deletions
@@ -5421,7 +5421,6 @@ def module_file_info : Flag<["-"], "module-file-info">, Flags<[]>,
54215421
HelpText<"Provide information about a particular module file">;
54225422
def mthumb : Flag<["-"], "mthumb">, Group<m_Group>;
54235423
def mtune_EQ : Joined<["-"], "mtune=">, Group<m_Group>,
5424-
Visibility<[ClangOption, FlangOption]>,
54255424
HelpText<"Only supported on AArch64, PowerPC, RISC-V, SPARC, SystemZ, and X86">;
54265425
def multi__module : Flag<["-"], "multi_module">;
54275426
def multiply__defined__unused : Separate<["-"], "multiply_defined_unused">;
@@ -6738,6 +6737,9 @@ def emit_hlfir : Flag<["-"], "emit-hlfir">, Group<Action_Group>,
67386737

67396738
let Visibility = [CC1Option, CC1AsOption] in {
67406739

6740+
def tune_cpu : Separate<["-"], "tune-cpu">,
6741+
HelpText<"Tune for a specific cpu type">,
6742+
MarshallingInfoString<TargetOpts<"TuneCPU">>;
67416743
def target_abi : Separate<["-"], "target-abi">,
67426744
HelpText<"Target a particular ABI type">,
67436745
MarshallingInfoString<TargetOpts<"ABI">>;
@@ -6764,9 +6766,6 @@ def darwin_target_variant_triple : Separate<["-"], "darwin-target-variant-triple
67646766

67656767
let Visibility = [CC1Option, CC1AsOption, FC1Option] in {
67666768

6767-
def tune_cpu : Separate<["-"], "tune-cpu">,
6768-
HelpText<"Tune for a specific cpu type">,
6769-
MarshallingInfoString<TargetOpts<"TuneCPU">>;
67706769
def target_cpu : Separate<["-"], "target-cpu">,
67716770
HelpText<"Target a specific cpu type">,
67726771
MarshallingInfoString<TargetOpts<"CPU">>;

clang/lib/Driver/ToolChains/Flang.cpp

Lines changed: 1 addition & 9 deletions
@@ -15,7 +15,6 @@
1515
#include "llvm/Frontend/Debug/Options.h"
1616
#include "llvm/Support/FileSystem.h"
1717
#include "llvm/Support/Path.h"
18-
#include "llvm/TargetParser/Host.h"
1918
#include "llvm/TargetParser/RISCVISAInfo.h"
2019
#include "llvm/TargetParser/RISCVTargetParser.h"
2120

@@ -412,13 +411,6 @@ void Flang::addTargetOptions(const ArgList &Args,
412411
}
413412

414413
// TODO: Add target specific flags, ABI, mtune option etc.
415-
if (const Arg *A = Args.getLastArg(options::OPT_mtune_EQ)) {
416-
CmdArgs.push_back("-tune-cpu");
417-
if (A->getValue() == StringRef{"native"})
418-
CmdArgs.push_back(Args.MakeArgString(llvm::sys::getHostCPUName()));
419-
else
420-
CmdArgs.push_back(A->getValue());
421-
}
422414
}
423415

424416
void Flang::addOffloadOptions(Compilation &C, const InputInfoList &Inputs,
@@ -810,7 +802,7 @@ void Flang::ConstructJob(Compilation &C, const JobAction &JA,
810802
case CodeGenOptions::FramePointerKind::None:
811803
FPKeepKindStr = "-mframe-pointer=none";
812804
break;
813-
case CodeGenOptions::FramePointerKind::Reserved:
805+
case CodeGenOptions::FramePointerKind::Reserved:
814806
FPKeepKindStr = "-mframe-pointer=reserved";
815807
break;
816808
case CodeGenOptions::FramePointerKind::NonLeaf:

clang/test/Preprocessor/embed_weird.cpp

Lines changed: 6 additions & 4 deletions
@@ -1,7 +1,9 @@
1-
// RUN: printf "\0" > %S/Inputs/null_byte.bin
2-
// RUN: %clang_cc1 %s -fsyntax-only --embed-dir=%S/Inputs -verify=expected,cxx -Wno-c23-extensions
3-
// RUN: %clang_cc1 -x c -std=c23 %s -fsyntax-only --embed-dir=%S/Inputs -verify=expected,c
4-
// RUN: rm %S/Inputs/null_byte.bin
1+
// RUN: rm -rf %t && mkdir -p %t/media
2+
// RUN: cp %S/Inputs/single_byte.txt %S/Inputs/jk.txt %S/Inputs/numbers.txt %t/
3+
// RUN: cp %S/Inputs/media/empty %t/media/
4+
// RUN: printf "\0" > %t/null_byte.bin
5+
// RUN: %clang_cc1 %s -fsyntax-only --embed-dir=%t -verify=expected,cxx -Wno-c23-extensions
6+
// RUN: %clang_cc1 -x c -std=c23 %s -fsyntax-only --embed-dir=%t -verify=expected,c
57
#embed <media/empty>
68
;
79

compiler-rt/lib/asan/asan_globals.cpp

Lines changed: 4 additions & 4 deletions
@@ -344,8 +344,8 @@ void __asan_unregister_image_globals(uptr *flag) {
344344
}
345345

346346
void __asan_register_elf_globals(uptr *flag, void *start, void *stop) {
347-
if (*flag) return;
348-
if (!start) return;
347+
if (*flag || start == stop)
348+
return;
349349
CHECK_EQ(0, ((uptr)stop - (uptr)start) % sizeof(__asan_global));
350350
__asan_global *globals_start = (__asan_global*)start;
351351
__asan_global *globals_stop = (__asan_global*)stop;
@@ -354,8 +354,8 @@ void __asan_register_elf_globals(uptr *flag, void *start, void *stop) {
354354
}
355355

356356
void __asan_unregister_elf_globals(uptr *flag, void *start, void *stop) {
357-
if (!*flag) return;
358-
if (!start) return;
357+
if (!*flag || start == stop)
358+
return;
359359
CHECK_EQ(0, ((uptr)stop - (uptr)start) % sizeof(__asan_global));
360360
__asan_global *globals_start = (__asan_global*)start;
361361
__asan_global *globals_stop = (__asan_global*)stop;

flang/include/flang/Frontend/TargetOptions.h

Lines changed: 0 additions & 3 deletions
@@ -32,9 +32,6 @@ class TargetOptions {
3232
/// If given, the name of the target CPU to generate code for.
3333
std::string cpu;
3434

35-
/// If given, the name of the target CPU to tune code for.
36-
std::string cpuToTuneFor;
37-
3835
/// The list of target specific features to enable or disable, as written on
3936
/// the command line.
4037
std::vector<std::string> featuresAsWritten;

flang/include/flang/Lower/Bridge.h

Lines changed: 3 additions & 3 deletions
@@ -65,11 +65,11 @@ class LoweringBridge {
6565
const Fortran::lower::LoweringOptions &loweringOptions,
6666
const std::vector<Fortran::lower::EnvironmentDefault> &envDefaults,
6767
const Fortran::common::LanguageFeatureControl &languageFeatures,
68-
const llvm::TargetMachine &targetMachine, llvm::StringRef tuneCPU) {
68+
const llvm::TargetMachine &targetMachine) {
6969
return LoweringBridge(ctx, semanticsContext, defaultKinds, intrinsics,
7070
targetCharacteristics, allCooked, triple, kindMap,
7171
loweringOptions, envDefaults, languageFeatures,
72-
targetMachine, tuneCPU);
72+
targetMachine);
7373
}
7474

7575
//===--------------------------------------------------------------------===//
@@ -148,7 +148,7 @@ class LoweringBridge {
148148
const Fortran::lower::LoweringOptions &loweringOptions,
149149
const std::vector<Fortran::lower::EnvironmentDefault> &envDefaults,
150150
const Fortran::common::LanguageFeatureControl &languageFeatures,
151-
const llvm::TargetMachine &targetMachine, const llvm::StringRef tuneCPU);
151+
const llvm::TargetMachine &targetMachine);
152152
LoweringBridge() = delete;
153153
LoweringBridge(const LoweringBridge &) = delete;
154154

flang/include/flang/Optimizer/CodeGen/CGPasses.td

Lines changed: 0 additions & 4 deletions
@@ -31,8 +31,6 @@ def FIRToLLVMLowering : Pass<"fir-to-llvm-ir", "mlir::ModuleOp"> {
3131
"Override module's data layout.">,
3232
Option<"forcedTargetCPU", "target-cpu", "std::string", /*default=*/"",
3333
"Override module's target CPU.">,
34-
Option<"forcedTuneCPU", "tune-cpu", "std::string", /*default=*/"",
35-
"Override module's tune CPU.">,
3634
Option<"forcedTargetFeatures", "target-features", "std::string",
3735
/*default=*/"", "Override module's target features.">,
3836
Option<"applyTBAA", "apply-tbaa", "bool", /*default=*/"false",
@@ -70,8 +68,6 @@ def TargetRewritePass : Pass<"target-rewrite", "mlir::ModuleOp"> {
7068
"Override module's target triple.">,
7169
Option<"forcedTargetCPU", "target-cpu", "std::string", /*default=*/"",
7270
"Override module's target CPU.">,
73-
Option<"forcedTuneCPU", "tune-cpu", "std::string", /*default=*/"",
74-
"Override module's tune CPU.">,
7571
Option<"forcedTargetFeatures", "target-features", "std::string",
7672
/*default=*/"", "Override module's target features.">,
7773
Option<"noCharacterConversion", "no-character-conversion",

flang/include/flang/Optimizer/CodeGen/Target.h

Lines changed: 1 addition & 18 deletions
@@ -76,29 +76,14 @@ class CodeGenSpecifics {
7676
llvm::StringRef targetCPU, mlir::LLVM::TargetFeaturesAttr targetFeatures,
7777
const mlir::DataLayout &dl);
7878

79-
static std::unique_ptr<CodeGenSpecifics>
80-
get(mlir::MLIRContext *ctx, llvm::Triple &&trp, KindMapping &&kindMap,
81-
llvm::StringRef targetCPU, mlir::LLVM::TargetFeaturesAttr targetFeatures,
82-
const mlir::DataLayout &dl, llvm::StringRef tuneCPU);
83-
8479
static TypeAndAttr getTypeAndAttr(mlir::Type t) { return TypeAndAttr{t, {}}; }
8580

8681
CodeGenSpecifics(mlir::MLIRContext *ctx, llvm::Triple &&trp,
8782
KindMapping &&kindMap, llvm::StringRef targetCPU,
8883
mlir::LLVM::TargetFeaturesAttr targetFeatures,
8984
const mlir::DataLayout &dl)
9085
: context{*ctx}, triple{std::move(trp)}, kindMap{std::move(kindMap)},
91-
targetCPU{targetCPU}, targetFeatures{targetFeatures}, dataLayout{&dl},
92-
tuneCPU{""} {}
93-
94-
CodeGenSpecifics(mlir::MLIRContext *ctx, llvm::Triple &&trp,
95-
KindMapping &&kindMap, llvm::StringRef targetCPU,
96-
mlir::LLVM::TargetFeaturesAttr targetFeatures,
97-
const mlir::DataLayout &dl, llvm::StringRef tuneCPU)
98-
: context{*ctx}, triple{std::move(trp)}, kindMap{std::move(kindMap)},
99-
targetCPU{targetCPU}, targetFeatures{targetFeatures}, dataLayout{&dl},
100-
tuneCPU{tuneCPU} {}
101-
86+
targetCPU{targetCPU}, targetFeatures{targetFeatures}, dataLayout{&dl} {}
10287
CodeGenSpecifics() = delete;
10388
virtual ~CodeGenSpecifics() {}
10489

@@ -180,7 +165,6 @@ class CodeGenSpecifics {
180165
virtual unsigned char getCIntTypeWidth() const = 0;
181166

182167
llvm::StringRef getTargetCPU() const { return targetCPU; }
183-
llvm::StringRef getTuneCPU() const { return tuneCPU; }
184168

185169
mlir::LLVM::TargetFeaturesAttr getTargetFeatures() const {
186170
return targetFeatures;
@@ -198,7 +182,6 @@ class CodeGenSpecifics {
198182
llvm::StringRef targetCPU;
199183
mlir::LLVM::TargetFeaturesAttr targetFeatures;
200184
const mlir::DataLayout *dataLayout = nullptr;
201-
llvm::StringRef tuneCPU;
202185
};
203186

204187
} // namespace fir

flang/include/flang/Optimizer/Dialect/Support/FIRContext.h

Lines changed: 0 additions & 7 deletions
@@ -58,13 +58,6 @@ void setTargetCPU(mlir::ModuleOp mod, llvm::StringRef cpu);
5858
/// Get the target CPU string from the Module or return a null reference.
5959
llvm::StringRef getTargetCPU(mlir::ModuleOp mod);
6060

61-
/// Set the tune CPU for the module. `cpu` must not be deallocated while
62-
/// module `mod` is still live.
63-
void setTuneCPU(mlir::ModuleOp mod, llvm::StringRef cpu);
64-
65-
/// Get the tune CPU string from the Module or return a null reference.
66-
llvm::StringRef getTuneCPU(mlir::ModuleOp mod);
67-
6861
/// Set the target features for the module.
6962
void setTargetFeatures(mlir::ModuleOp mod, llvm::StringRef features);
7063

flang/include/flang/Optimizer/Transforms/Passes.td

Lines changed: 0 additions & 3 deletions
@@ -411,9 +411,6 @@ def FunctionAttr : Pass<"function-attr", "mlir::func::FuncOp"> {
411411
Option<"unsafeFPMath", "unsafe-fp-math",
412412
"bool", /*default=*/"false",
413413
"Set the unsafe-fp-math attribute on functions in the module.">,
414-
Option<"tuneCPU", "tune-cpu",
415-
"llvm::StringRef", /*default=*/"llvm::StringRef{}",
416-
"Set the tune-cpu attribute on functions in the module.">,
417414
];
418415
}
419416

flang/lib/Frontend/CompilerInvocation.cpp

Lines changed: 0 additions & 4 deletions
@@ -407,10 +407,6 @@ static void parseTargetArgs(TargetOptions &opts, llvm::opt::ArgList &args) {
407407
args.getLastArg(clang::driver::options::OPT_target_cpu))
408408
opts.cpu = a->getValue();
409409

410-
if (const llvm::opt::Arg *a =
411-
args.getLastArg(clang::driver::options::OPT_tune_cpu))
412-
opts.cpuToTuneFor = a->getValue();
413-
414410
for (const llvm::opt::Arg *currentArg :
415411
args.filtered(clang::driver::options::OPT_target_feature))
416412
opts.featuresAsWritten.emplace_back(currentArg->getValue());

flang/lib/Frontend/FrontendActions.cpp

Lines changed: 1 addition & 2 deletions
@@ -297,8 +297,7 @@ bool CodeGenAction::beginSourceFileAction() {
297297
ci.getParsing().allCooked(), ci.getInvocation().getTargetOpts().triple,
298298
kindMap, ci.getInvocation().getLoweringOpts(),
299299
ci.getInvocation().getFrontendOpts().envDefaults,
300-
ci.getInvocation().getFrontendOpts().features, targetMachine,
301-
ci.getInvocation().getTargetOpts().cpuToTuneFor);
300+
ci.getInvocation().getFrontendOpts().features, targetMachine);
302301

303302
// Fetch module from lb, so we can set
304303
mlirModule = std::make_unique<mlir::ModuleOp>(lb.getModule());

flang/lib/Lower/Bridge.cpp

Lines changed: 1 addition & 2 deletions
@@ -5929,7 +5929,7 @@ Fortran::lower::LoweringBridge::LoweringBridge(
59295929
const Fortran::lower::LoweringOptions &loweringOptions,
59305930
const std::vector<Fortran::lower::EnvironmentDefault> &envDefaults,
59315931
const Fortran::common::LanguageFeatureControl &languageFeatures,
5932-
const llvm::TargetMachine &targetMachine, const llvm::StringRef tuneCPU)
5932+
const llvm::TargetMachine &targetMachine)
59335933
: semanticsContext{semanticsContext}, defaultKinds{defaultKinds},
59345934
intrinsics{intrinsics}, targetCharacteristics{targetCharacteristics},
59355935
cooked{&cooked}, context{context}, kindMap{kindMap},
@@ -5986,7 +5986,6 @@ Fortran::lower::LoweringBridge::LoweringBridge(
59865986
fir::setTargetTriple(*module.get(), triple);
59875987
fir::setKindMapping(*module.get(), kindMap);
59885988
fir::setTargetCPU(*module.get(), targetMachine.getTargetCPU());
5989-
fir::setTuneCPU(*module.get(), tuneCPU);
59905989
fir::setTargetFeatures(*module.get(), targetMachine.getTargetFeatureString());
59915990
fir::support::setMLIRDataLayout(*module.get(),
59925991
targetMachine.createDataLayout());
