[HIP] add --offload-compression-level= option #83605
Author: Yaxun (Sam) Liu (yxsamliu)

Changes: change the compression level to 20 for zstd for better size reduction.

Full diff: https://github.com/llvm/llvm-project/pull/83605.diff (4 files affected)
diff --git a/clang/lib/Driver/OffloadBundler.cpp b/clang/lib/Driver/OffloadBundler.cpp
index 99a34d25cfcd56..ef48b2f33d3783 100644
--- a/clang/lib/Driver/OffloadBundler.cpp
+++ b/clang/lib/Driver/OffloadBundler.cpp
@@ -942,12 +942,19 @@ CompressedOffloadBundle::compress(const llvm::MemoryBuffer &Input,
Input.getBuffer().size());
llvm::compression::Format CompressionFormat;
+ int Level;
- if (llvm::compression::zstd::isAvailable())
+ if (llvm::compression::zstd::isAvailable()) {
CompressionFormat = llvm::compression::Format::Zstd;
- else if (llvm::compression::zlib::isAvailable())
+ // Use a high zstd compress level by default for better size reduction.
+ const int DefaultZstdLevel = 20;
+ Level = DefaultZstdLevel;
+ } else if (llvm::compression::zlib::isAvailable()) {
CompressionFormat = llvm::compression::Format::Zlib;
- else
+ // Use default level for zlib since higher level does not have significant
+ // improvement.
+ Level = llvm::compression::zlib::DefaultCompression;
+ } else
return createStringError(llvm::inconvertibleErrorCode(),
"Compression not supported");
@@ -955,7 +962,7 @@ CompressedOffloadBundle::compress(const llvm::MemoryBuffer &Input,
ClangOffloadBundlerTimerGroup);
if (Verbose)
CompressTimer.startTimer();
- llvm::compression::compress(CompressionFormat, BufferUint8, CompressedBuffer);
+ llvm::compression::compress({CompressionFormat, Level}, BufferUint8, CompressedBuffer);
if (Verbose)
CompressTimer.stopTimer();
@@ -980,6 +987,7 @@ CompressedOffloadBundle::compress(const llvm::MemoryBuffer &Input,
CompressionFormat == llvm::compression::Format::Zstd ? "zstd" : "zlib";
llvm::errs() << "Compressed bundle format version: " << Version << "\n"
<< "Compression method used: " << MethodUsed << "\n"
+ << "Compression level: " << Level << "\n"
<< "Binary size before compression: " << UncompressedSize
<< " bytes\n"
<< "Binary size after compression: " << CompressedBuffer.size()
diff --git a/clang/test/Driver/clang-offload-bundler-zlib.c b/clang/test/Driver/clang-offload-bundler-zlib.c
index a57ee6da9a86a6..aed3a03d5341b5 100644
--- a/clang/test/Driver/clang-offload-bundler-zlib.c
+++ b/clang/test/Driver/clang-offload-bundler-zlib.c
@@ -1,4 +1,4 @@
-// REQUIRES: zlib
+// REQUIRES: zlib && !zstd
// REQUIRES: x86-registered-target
// UNSUPPORTED: target={{.*}}-darwin{{.*}}, target={{.*}}-aix{{.*}}
@@ -34,7 +34,8 @@
// RUN: diff %t.tgt2 %t.res.tgt2
//
-// COMPRESS: Compression method used:
+// COMPRESS: Compression method used: zlib
+// COMPRESS: Compression level: 9
// DECOMPRESS: Decompression method:
// NOHOST-NOT: host-
// NOHOST-DAG: hip-amdgcn-amd-amdhsa--gfx900
diff --git a/clang/test/Driver/clang-offload-bundler-zstd.c b/clang/test/Driver/clang-offload-bundler-zstd.c
index 3b577d4d166a3f..200634e05fa145 100644
--- a/clang/test/Driver/clang-offload-bundler-zstd.c
+++ b/clang/test/Driver/clang-offload-bundler-zstd.c
@@ -31,7 +31,8 @@
// RUN: diff %t.tgt1 %t.res.tgt1
// RUN: diff %t.tgt2 %t.res.tgt2
//
-// COMPRESS: Compression method used
+// COMPRESS: Compression method used: zstd
+// COMPRESS: Compression level: 20
// DECOMPRESS: Decompression method
// NOHOST-NOT: host-
// NOHOST-DAG: hip-amdgcn-amd-amdhsa--gfx900
diff --git a/llvm/include/llvm/Support/Compression.h b/llvm/include/llvm/Support/Compression.h
index c3ba3274d6ed87..9c9c3e2cc4a6c0 100644
--- a/llvm/include/llvm/Support/Compression.h
+++ b/llvm/include/llvm/Support/Compression.h
@@ -94,6 +94,8 @@ struct Params {
constexpr Params(Format F)
: format(F), level(F == Format::Zlib ? zlib::DefaultCompression
: zstd::DefaultCompression) {}
+ constexpr Params(Format F, int L)
+ : format(F), level(L) {}
Params(DebugCompressionType Type) : Params(formatFor(Type)) {}
Format format;
✅ With the latest revision this PR passed the C/C++ code formatter.
clang/lib/Driver/OffloadBundler.cpp (outdated)

```cpp
CompressionFormat = llvm::compression::Format::Zstd;
else if (llvm::compression::zlib::isAvailable())
// Use a high zstd compress level by default for better size reduction.
const int DefaultZstdLevel = 20;
```
What's the default compression level for zstd?
It would be great if we could override the compression level. I'm somewhat reluctant to impose max compression level on everyone by default, without any way out, if it turns out to be a problem.
@MaskRay WDYT?
Max compression level may be fine. If we produce enough stuff for compression to take long, compilation time itself will likely dwarf the compression time. For the small TUs, even slow compression may be fine.
@yxsamliu how long does compilation without compression take in your benchmarks?
> What's the default compression level for zstd?

The default compression level for zstd is 6.

> It would be great if we could override the compression level. I'm somewhat reluctant to impose max compression level on everyone by default, without any way out, if it turns out to be a problem.

I can add a clang-offload-bundler option `-compression-level=n` and a clang option `--offload-compress-level=n` to control the compression level.

> @MaskRay WDYT?
>
> Max compression level may be fine. If we produce enough stuff for compression to take long, compilation time itself will likely dwarf the compression time. For the small TUs, even slow compression may be fine.
>
> @yxsamliu how long does compilation without compression take in your benchmarks?

For Blender 4.1, compiling kernels to bitcode for 6 GPU archs takes 30s; compression with zstd level 20 takes 2s.
Delete this used-once variable `const int DefaultZstdLevel = 20;`, which is also confusing since 20 is not the default.

High levels decrease compression speed substantially. Have you checked that 20 hits a good sweet spot between 9 and 22 (max)?

Is OffloadBundler used more like a linker than a compile action, where parallelism helps? If yes, lld/ELF/OutputSections.cpp has some code which can be copied here. (Please don't share the code; I think some duplication in lld/ELF makes lld code easier to change.)
Level 20 is a sweet spot for both compression rate and compression time, as shown in the following table for Blender 4.1 bundled bitcode for 6 GPU archs:

| Zstd Level | Size Before (bytes) | Size After (bytes) | Compression Rate | Compression Time (s) | Decompression Time (s) |
|---|---|---|---|---|---|
| 6 | 68,459,756 | 32,612,291 | 2.10 | 0.8891 | 0.1809 |
| 9 | 68,459,756 | 31,445,373 | 2.18 | 1.4200 | 0.1742 |
| 15 | 68,459,756 | 28,063,493 | 2.44 | 9.7994 | 0.1712 |
| 18 | 68,459,756 | 24,952,891 | 2.74 | 11.4201 | 0.1796 |
| 19 | 68,459,756 | 24,690,733 | 2.77 | 13.4060 | 0.1820 |
| 20 | 68,459,756 | 4,394,993 | 15.58 | 2.0946 | 0.1320 |
| 21 | 68,459,756 | 4,394,399 | 15.59 | 2.1500 | 0.1318 |
| 22 | 68,459,756 | 4,394,429 | 15.59 | 2.6635 | 0.1309 |

Level 20 and level 19 have some compression parameter differences (https://github.com/facebook/zstd/blob/a58b48ef0e543980888a4d9d16c9072ff22135ca/lib/compress/clevels.h#L48); the meaning of these parameters is defined at https://github.com/facebook/zstd/blob/a58b48ef0e543980888a4d9d16c9072ff22135ca/lib/zstd.h#L1299. It seems either the largest match distance or the fully searched segment makes the difference.
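As a sanity check, the compression-rate column is just the ratio of the two size columns; a quick stdlib sketch, with the sizes copied from the table above:

```python
# Sizes (bytes) copied from the table above: zstd level -> compressed size.
before = 68_459_756
after = {6: 32_612_291, 9: 31_445_373, 15: 28_063_493, 20: 4_394_993}

# Compression rate = uncompressed size / compressed size, rounded to 2 places.
rates = {level: round(before / size, 2) for level, size in after.items()}
print(rates)  # {6: 2.1, 9: 2.18, 15: 2.44, 20: 15.58}
```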
clang-offload-bundler just concatenates the binaries for the different GPU archs together. Parallelization does not help much, unless zstd can be parallelized.
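The "just concatenates" layout can be sketched as a toy container (illustrative only: the real clang-offload-bundler header and magic differ; `TOYBUNDLE`, `bundle`, and `unbundle` are made up for this sketch):

```python
import io
import struct

MAGIC = b"TOYBUNDLE"  # hypothetical magic; not the real bundler's header

def bundle(entries):
    """Concatenate {triple: code object} pairs with length-prefixed records."""
    out = io.BytesIO()
    out.write(MAGIC + struct.pack("<I", len(entries)))
    for triple, blob in entries.items():
        name = triple.encode()
        out.write(struct.pack("<II", len(name), len(blob)) + name + blob)
    return out.getvalue()

def unbundle(data):
    """Walk the records back out into a {triple: code object} dict."""
    assert data[:len(MAGIC)] == MAGIC
    (count,) = struct.unpack_from("<I", data, len(MAGIC))
    off, entries = len(MAGIC) + 4, {}
    for _ in range(count):
        name_len, blob_len = struct.unpack_from("<II", data, off)
        off += 8
        triple = data[off:off + name_len].decode()
        off += name_len
        entries[triple] = data[off:off + blob_len]
        off += blob_len
    return entries
```

Because the per-arch blobs sit back to back like this, whole-bundle compression only deduplicates across archs if the compressor's window spans at least one blob.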
compiling kernels to bitcode for 6 GPU takes 30s. compression with zstd level 20 takes 2s.
This looks acceptable for me.
unless zstd can be parallelized.
zstd does support multithreaded compression, but enabling it would run into the same issue we had with enabling multi-threaded compilation -- it will interfere with the build system's idea of resource usage.
@@ -1,4 +1,4 @@
-// REQUIRES: zlib
+// REQUIRES: zlib && !zstd
Since zstd configurations are becoming more popular, `zlib && !zstd` would essentially disable the test for increasingly more bots. But I guess this cannot be improved.
clang/lib/Driver/OffloadBundler.cpp (outdated)

```cpp
@@ -906,6 +906,16 @@ CreateFileHandler(MemoryBuffer &FirstInput,
}

OffloadBundlerConfig::OffloadBundlerConfig() {
  if (llvm::compression::zstd::isAvailable()) {
    CompressionFormat = llvm::compression::Format::Zstd;
    // Use a high zstd compress level by default for better size reduction.
```
I'd add more details here. While higher compression levels usually do improve compression ratio, in the typical use case it's an incremental improvement. Here, we do it to achieve a dramatic increase in compression ratio by exploiting the fact that we carry multiple sets of very similar large bitcode blobs, and that we need a compression level high enough to fit one complete blob into the compression window. At least that's the theory.

Should we print a warning (or just document it?) when the compression level ends up below what we'd expect? Considering that good compression starts at zstd-20, I suspect that the compression ratio will go back to ~2.5x if the binary size for one GPU doubles and no longer fits the window. On top of that, compression time will also increase, a lot. That will be a rather unpleasant surprise for whoever runs into it.
ZSTD's current compression parameters are set this way:
https://github.com/facebook/zstd/blob/dev/lib/compress/clevels.h#L47

```c
{ 23, 24, 22, 7, 3,256, ZSTD_btultra2}, /* level 19 */
{ 25, 25, 23, 7, 3,256, ZSTD_btultra2}, /* level 20 */
```

The first three numbers are log2 of (largest match distance, fully searched segment, dispatch table). 2^25 = 32MB, which happens to be about the size of the single-GPU binary in your example. I'm pretty sure this explains why zstd-20 works so well on it, while zstd-19 does not. It will work well for smaller binaries, but I'm pretty sure it will regress for a slightly larger binary.
I think it may be worth experimenting with fine-tuning the compression settings: instead of blindly setting zstd-20, consider the size of the binary we need to deal with and adjust only windowLog/chainLog appropriately.

Or we could set the default to a lower compression level plus a large windowLog. This should still give us most of the compression benefits for the binaries that fit into the window, but would avoid the performance cliff if the binary is too large.

I may be overcomplicating it, too. If someone does run into the problem, they now have a way to work around it by tweaking the compression level.
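The window-size effect described above can be reproduced with stdlib codecs as stand-ins for zstd at small vs. large windowLog: DEFLATE's 32 KiB window cannot reach back to an earlier identical blob, while LZMA's default 8 MiB dictionary can.

```python
import lzma
import os
import zlib

blob = os.urandom(200_000)   # one incompressible "code object"
bundle = blob * 4            # fat binary: four identical copies back to back

small_window = zlib.compress(bundle, 9)   # 32 KiB window: earlier copies out of reach
large_window = lzma.compress(bundle)      # 8 MiB dictionary: later copies deduplicated

# The small window sees only noise; the large window stores roughly one copy.
assert len(small_window) > 3 * len(blob)
assert len(large_window) < 2 * len(blob)
```

Same input, wildly different results, purely because one window spans a whole blob and the other does not; this is the cliff being discussed.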
Also, I've just discovered that zstd already has a `--long` option:
https://github.com/facebook/zstd/blob/b293d2ebc3a5d29309390a70b3e7861b6f5133ec/lib/zstd.h#L394

```c
ZSTD_c_enableLongDistanceMatching=160, /* Enable long distance matching.
 * This parameter is designed to improve compression ratio
 * for large inputs, by finding large matches at long distance.
 * It increases memory usage and window size.
 * Note: enabling this parameter increases default ZSTD_c_windowLog to 128 MB
 * except when expressly set to a different value.
 * Note: will be enabled by default if ZSTD_c_windowLog >= 128 MB and
 * compression strategy >= ZSTD_btopt (== compression level 16+) */
```

This sounds like something we could use here.
> Also, I've just discovered that zstd already has a `--long` option: https://github.com/facebook/zstd/blob/b293d2ebc3a5d29309390a70b3e7861b6f5133ec/lib/zstd.h#L394
>
> This sounds like something we could use here.
Thanks, this option is promising. Here are some benchmark results for a fat binary containing 13 code objects, each of which is about 2.7MB.

The following data is without `--long`. The numbers are: compression level, original size -> compressed size (compression rate), compression speed, decompression speed.
```
$ zstd -b1 -e22 -f --ultra tmp.o
 1#tmp.o : 34864866 ->  9169246 (3.802),  657.0 MB/s , 1691.0 MB/s
 2#tmp.o : 34864866 ->  7352667 (4.742),  626.3 MB/s , 1903.8 MB/s
 3#tmp.o : 34864866 ->  6885718 (5.063),  488.1 MB/s , 1900.2 MB/s
 4#tmp.o : 34864866 ->  6700508 (5.203),  416.7 MB/s , 1897.2 MB/s
 5#tmp.o : 34864866 ->  6405252 (5.443),  236.4 MB/s , 1918.8 MB/s
 6#tmp.o : 34864866 ->  6336706 (5.502),  211.8 MB/s , 1941.4 MB/s
 7#tmp.o : 34864866 ->  6170409 (5.650),  153.5 MB/s , 2032.5 MB/s
 8#tmp.o : 34864866 ->  6121226 (5.696),  131.1 MB/s , 2071.5 MB/s
 9#tmp.o : 34864866 ->  6098948 (5.717),  124.9 MB/s , 2080.4 MB/s
10#tmp.o : 34864866 ->  2555599 (13.64),  179.4 MB/s , 3504.2 MB/s
11#tmp.o : 34864866 ->  2545375 (13.70),  119.4 MB/s , 3516.8 MB/s
12#tmp.o : 34864866 ->  2542711 (13.71),  107.2 MB/s , 3518.4 MB/s
13#tmp.o : 34864866 ->  2601619 (13.40),   58.4 MB/s , 3507.6 MB/s
14#tmp.o : 34864866 ->  2590656 (13.46),   46.2 MB/s , 3520.4 MB/s
15#tmp.o : 34864866 ->  2518599 (13.84),   28.4 MB/s , 3557.4 MB/s
16#tmp.o : 34864866 ->  2527122 (13.80),   20.8 MB/s , 3348.5 MB/s
17#tmp.o : 34864866 ->  2277125 (15.31),   19.0 MB/s , 3370.6 MB/s
18#tmp.o : 34864866 ->  2138918 (16.30),   15.0 MB/s , 3182.2 MB/s
19#tmp.o : 34864866 ->  2118238 (16.46),   8.82 MB/s , 3194.5 MB/s
20#tmp.o : 34864866 ->  2041007 (17.08),   8.31 MB/s , 3178.4 MB/s
21#tmp.o : 34864866 ->  2039075 (17.10),   5.21 MB/s , 3170.6 MB/s
22#tmp.o : 34864866 ->  2038568 (17.10),   3.60 MB/s , 3171.5 MB/s
```
The following data is with `--long`:
```
$ zstd --long -b1 -e22 -f --ultra tmp.o
 1#tmp.o : 34864866 ->  3281430 (10.62),  375.0 MB/s , 3531.9 MB/s
 2#tmp.o : 34864866 ->  2854143 (12.22),  360.6 MB/s , 3536.7 MB/s
 3#tmp.o : 34864866 ->  2648807 (13.16),  325.4 MB/s , 3462.7 MB/s
 4#tmp.o : 34864866 ->  2548618 (13.68),  309.6 MB/s , 3345.9 MB/s
 5#tmp.o : 34864866 ->  2540406 (13.72),  265.8 MB/s , 3297.8 MB/s
 6#tmp.o : 34864866 ->  2518788 (13.84),  251.9 MB/s , 3296.0 MB/s
 7#tmp.o : 34864866 ->  2451360 (14.22),  206.5 MB/s , 3446.9 MB/s
 8#tmp.o : 34864866 ->  2421083 (14.40),  186.5 MB/s , 3522.7 MB/s
 9#tmp.o : 34864866 ->  2406717 (14.49),  172.0 MB/s , 3472.2 MB/s
10#tmp.o : 34864866 ->  2392819 (14.57),  139.6 MB/s , 3439.4 MB/s
11#tmp.o : 34864866 ->  2386599 (14.61),  113.0 MB/s , 3415.2 MB/s
12#tmp.o : 34864866 ->  2385088 (14.62),  104.5 MB/s , 3430.0 MB/s
13#tmp.o : 34864866 ->  2389264 (14.59),   69.5 MB/s , 3422.9 MB/s
14#tmp.o : 34864866 ->  2382705 (14.63),   61.2 MB/s , 3428.6 MB/s
15#tmp.o : 34864866 ->  2372640 (14.69),   51.2 MB/s , 3446.7 MB/s
16#tmp.o : 34864866 ->  2209022 (15.78),   20.5 MB/s , 3483.3 MB/s
17#tmp.o : 34864866 ->  2168474 (16.08),   18.2 MB/s , 3381.5 MB/s
18#tmp.o : 34864866 ->  2065724 (16.88),   14.2 MB/s , 3187.8 MB/s
19#tmp.o : 34864866 ->  2042810 (17.07),   8.50 MB/s , 3195.7 MB/s
20#tmp.o : 34864866 ->  2040443 (17.09),   8.13 MB/s , 3173.0 MB/s
21#tmp.o : 34864866 ->  2038794 (17.10),   5.11 MB/s , 3174.3 MB/s
22#tmp.o : 34864866 ->  2038375 (17.10),   3.54 MB/s , 3177.8 MB/s
```
From the data we can see that with `--long`, even at compression level 3, the compressed file size is below the size of one code object, while the compression speed is very fast. Higher compression levels only improve the compression rate slightly, at a much slower speed. Therefore compression level 3 seems to be more favorable.
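The shape of those curves (ratio flattening while speed drops as the level rises) can be reproduced with any level-parameterized codec; a stdlib zlib sketch, as a stand-in for zstd, on repetitive input:

```python
import zlib

# Repetitive input, a crude stand-in for similar per-arch code objects.
data = b"hip-amdgcn-amd-amdhsa--gfx900 code object payload " * 4000
sizes = {level: len(zlib.compress(data, level)) for level in (1, 3, 6, 9)}

# Every level compresses this input well; the highest level is never worse
# than the lowest, but each step up mainly costs CPU time.
assert sizes[9] <= sizes[1]
assert all(size < len(data) for size in sizes.values())
```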
I wonder how much this is overfitting for kernels of a particular size, though? (Is it making the window just large enough that there's some "memory" from one kernel to the next, but a slightly larger kernel would cause it to fail to see the similarities of two or more kernels? And equally, would a lower level have a smaller window which would be adequate for smaller kernels?) If level does imply window size, and that does imply the size of the kernels that can be efficiently compressed across multiple similar copies, it might be interesting to know what range of kernel sizes fit in the new default, and whether that size is representative of the majority of kernels.
It may be worth asking on https://github.com/facebook/zstd/. I am sure zstd maintainers are happy to see more adoption :)
Here is the size distribution of individual code object files for the ROCm 6.0 libraries (each code object file is for one GPU arch, and a fat binary contains a bunch of code object files, therefore the optimal compression parameter is mostly related to code object file size).

From the table we can see that 99.9% of code object files are below 32MB. Also, all code object files are below 128MB.
Posted a question to zstd: facebook/zstd#3932
zstd developers suggest enabling long distance matching (LDM).
Should an option like in #84337 be added for the new driver?

Yes, please.

Oh. I can add it.

Added the option to the linker wrapper.
Looks fine to me; I'll wait a bit to see if Artem or Fangrui have anything to add.
LGTM.
```cpp
TCArgs.getLastArg(options::OPT_offload_compression_level_EQ)) {
  std::string CompressionLevelArg =
      std::string("-compression-level=") + Arg->getValue();
  CmdArgs.push_back(TCArgs.MakeArgString(CompressionLevelArg));
```
This may be collapsed to just `CmdArgs.push_back(TCArgs.MakeArgString("-compression-level=" + Arg->getValue()));`, maybe with a `Twine` or `StringRef` wrapping the string literal.
will do
Added --offload-compression-level= option to clang and -compression-level= option to clang-offload-bundler and clang-linker-wrapper for controlling compression level. Added support of long distance matching (LDM) for llvm::zstd which is off by default. Enable it for clang-offload-bundler by default since it improves compression rate in general. Change default compression level to 3 for zstd for clang-offload-bundler since it works well for bundle entry size from 1KB to 32MB, which should cover most of the clang-offload-bundler usage. Users can still specify compression level by -compression-level= option if necessary. Change-Id: I5e2d7abcaa11a2a37a5e798476e3f572bba11cab
Added --offload-compression-level= option to clang and -compression-level= option to clang-offload-bundler for controlling compression level. Added support of long distance matching (LDM) for llvm::zstd which is off by default. Enable it for clang-offload-bundler by default since it improves compression rate in general. Change default compression level to 3 for zstd for clang-offload-bundler since it works well for bundle entry size from 1KB to 32MB, which should cover most of the clang-offload-bundler usage. Users can still specify compression level by -compression-level= option if necessary. Signed-off-by: Gavin Zhao <[email protected]>