[X86] Add missing vNbf16 handling in X86CallingConv.td file #127102
Conversation
@llvm/pr-subscribers-backend-x86

Author: Mikołaj Piróg (mikolaj-pirog)

Changes

Lack of these entries caused clang to crash on the following code:

```c
__m256bh fun(__m256bh arg) {
  return arg;
}

__m256bh run() {
  __m256bh arg = {0};
  fun(arg);
}
```

This caused FastISel to fail, since it handles call lowering based on the X86CallingConv table. Curiously, if FastISel fails somewhere down the line and SelectionDAGISel takes over as the fallback, the crash does not occur. The following code does not crash:

```c
__m256bh fun(__m256bh arg) {
  return arg;
}

__m256bh run() {
  __m256bh arg = {0};
  return fun(arg);
}
```

This is puzzling to me. Obviously, if FastISel fails, the compiler falls back to something else to lower these calls -- but since the X86CallingConv table doesn't have entries for vNbf16, how does this other mechanism manage not to crash? It has to use some other mechanism, one which doesn't use the table. This raises the following questions:

Nonetheless, this PR fixes the crash, though I didn't create a test for it, since I am unsure yet what it should look like. I would like to learn how the working non-FastISel mechanism works; I tried looking for it, but haven't managed to find anything yet.

Full diff: https://github.com/llvm/llvm-project/pull/127102.diff

1 file affected:

```diff
diff --git a/llvm/lib/Target/X86/X86CallingConv.td b/llvm/lib/Target/X86/X86CallingConv.td
index 72b103b0bb0c5..cf164acba9ec0 100644
--- a/llvm/lib/Target/X86/X86CallingConv.td
+++ b/llvm/lib/Target/X86/X86CallingConv.td
@@ -267,19 +267,19 @@ def RetCC_X86Common : CallingConv<[
// Vector types are returned in XMM0 and XMM1, when they fit. XMM2 and XMM3
// can only be used by ABI non-compliant code. If the target doesn't have XMM
// registers, it won't have vector types.
- CCIfType<[v16i8, v8i16, v4i32, v2i64, v8f16, v4f32, v2f64],
+ CCIfType<[v16i8, v8i16, v4i32, v2i64, v8f16, v8bf16, v4f32, v2f64],
CCAssignToReg<[XMM0,XMM1,XMM2,XMM3]>>,
// 256-bit vectors are returned in YMM0 and XMM1, when they fit. YMM2 and YMM3
// can only be used by ABI non-compliant code. This vector type is only
// supported while using the AVX target feature.
- CCIfType<[v32i8, v16i16, v8i32, v4i64, v16f16, v8f32, v4f64],
+ CCIfType<[v32i8, v16i16, v8i32, v4i64, v16f16, v16bf16, v8f32, v4f64],
CCAssignToReg<[YMM0,YMM1,YMM2,YMM3]>>,
// 512-bit vectors are returned in ZMM0 and ZMM1, when they fit. ZMM2 and ZMM3
// can only be used by ABI non-compliant code. This vector type is only
// supported while using the AVX-512 target feature.
- CCIfType<[v64i8, v32i16, v16i32, v8i64, v32f16, v16f32, v8f64],
+ CCIfType<[v64i8, v32i16, v16i32, v8i64, v32f16, v32bf16, v16f32, v8f64],
CCAssignToReg<[ZMM0,ZMM1,ZMM2,ZMM3]>>,
// Long double types are always returned in FP0 (even with SSE),
@@ -565,7 +565,7 @@ def CC_X86_64_C : CallingConv<[
CCIfType<[v64i1], CCPromoteToType<v64i8>>,
// The first 8 FP/Vector arguments are passed in XMM registers.
- CCIfType<[f16, f32, f64, f128, v16i8, v8i16, v4i32, v2i64, v8f16, v4f32, v2f64],
+ CCIfType<[f16, f32, f64, f128, v16i8, v8i16, v4i32, v2i64, v8f16, v8bf16, v4f32, v2f64],
CCIfSubtarget<"hasSSE1()",
CCAssignToReg<[XMM0, XMM1, XMM2, XMM3, XMM4, XMM5, XMM6, XMM7]>>>,
@@ -574,13 +574,13 @@ def CC_X86_64_C : CallingConv<[
// FIXME: This isn't precisely correct; the x86-64 ABI document says that
// fixed arguments to vararg functions are supposed to be passed in
// registers. Actually modeling that would be a lot of work, though.
- CCIfNotVarArg<CCIfType<[v32i8, v16i16, v8i32, v4i64, v16f16, v8f32, v4f64],
+ CCIfNotVarArg<CCIfType<[v32i8, v16i16, v8i32, v4i64, v16f16, v16bf16, v8f32, v4f64],
CCIfSubtarget<"hasAVX()",
CCAssignToReg<[YMM0, YMM1, YMM2, YMM3,
YMM4, YMM5, YMM6, YMM7]>>>>,
// The first 8 512-bit vector arguments are passed in ZMM registers.
- CCIfNotVarArg<CCIfType<[v64i8, v32i16, v16i32, v8i64, v32f16, v16f32, v8f64],
+ CCIfNotVarArg<CCIfType<[v64i8, v32i16, v16i32, v8i64, v32f16, v32bf16, v16f32, v8f64],
CCIfSubtarget<"hasAVX512()",
CCAssignToReg<[ZMM0, ZMM1, ZMM2, ZMM3, ZMM4, ZMM5, ZMM6, ZMM7]>>>>,
@@ -593,14 +593,14 @@ def CC_X86_64_C : CallingConv<[
CCIfType<[f80, f128], CCAssignToStack<0, 0>>,
// Vectors get 16-byte stack slots that are 16-byte aligned.
- CCIfType<[v16i8, v8i16, v4i32, v2i64, v8f16, v4f32, v2f64], CCAssignToStack<16, 16>>,
+ CCIfType<[v16i8, v8i16, v4i32, v2i64, v8f16, v8bf16, v4f32, v2f64], CCAssignToStack<16, 16>>,
// 256-bit vectors get 32-byte stack slots that are 32-byte aligned.
- CCIfType<[v32i8, v16i16, v8i32, v4i64, v16f16, v8f32, v4f64],
+ CCIfType<[v32i8, v16i16, v8i32, v4i64, v16f16, v16bf16, v8f32, v4f64],
CCAssignToStack<32, 32>>,
// 512-bit vectors get 64-byte stack slots that are 64-byte aligned.
- CCIfType<[v64i8, v32i16, v16i32, v8i64, v32f16, v16f32, v8f64],
+ CCIfType<[v64i8, v32i16, v16i32, v8i64, v32f16, v32bf16, v16f32, v8f64],
CCAssignToStack<64, 64>>
]>;
@@ -631,13 +631,13 @@ def CC_X86_Win64_C : CallingConv<[
CCIfCFGuardTarget<CCAssignToReg<[RAX]>>,
// 128 bit vectors are passed by pointer
- CCIfType<[v16i8, v8i16, v4i32, v2i64, v8f16, v4f32, v2f64], CCPassIndirect<i64>>,
+ CCIfType<[v16i8, v8i16, v4i32, v2i64, v8f16, v8bf16, v4f32, v2f64], CCPassIndirect<i64>>,
// 256 bit vectors are passed by pointer
- CCIfType<[v32i8, v16i16, v8i32, v4i64, v16f16, v8f32, v4f64], CCPassIndirect<i64>>,
+ CCIfType<[v32i8, v16i16, v8i32, v4i64, v16f16, v16bf16, v8f32, v4f64], CCPassIndirect<i64>>,
// 512 bit vectors are passed by pointer
- CCIfType<[v64i8, v32i16, v16i32, v32f16, v16f32, v8f64, v8i64], CCPassIndirect<i64>>,
+ CCIfType<[v64i8, v32i16, v16i32, v32f16, v32bf16, v16f32, v8f64, v8i64], CCPassIndirect<i64>>,
// Long doubles are passed by pointer
CCIfType<[f80], CCPassIndirect<i64>>,
@@ -734,15 +734,15 @@ def CC_X86_64_AnyReg : CallingConv<[
/// values are spilled on the stack.
def CC_X86_32_Vector_Common : CallingConv<[
// Other SSE vectors get 16-byte stack slots that are 16-byte aligned.
- CCIfType<[v16i8, v8i16, v4i32, v2i64, v8f16, v4f32, v2f64],
+ CCIfType<[v16i8, v8i16, v4i32, v2i64, v8f16, v8bf16, v4f32, v2f64],
CCAssignToStack<16, 16>>,
// 256-bit AVX vectors get 32-byte stack slots that are 32-byte aligned.
- CCIfType<[v32i8, v16i16, v8i32, v4i64, v16f16, v8f32, v4f64],
+ CCIfType<[v32i8, v16i16, v8i32, v4i64, v16f16, v16bf16, v8f32, v4f64],
CCAssignToStack<32, 32>>,
// 512-bit AVX 512-bit vectors get 64-byte stack slots that are 64-byte aligned.
- CCIfType<[v64i8, v32i16, v16i32, v8i64, v32f16, v16f32, v8f64],
+ CCIfType<[v64i8, v32i16, v16i32, v8i64, v32f16, v32bf16, v16f32, v8f64],
CCAssignToStack<64, 64>>
]>;
@@ -750,15 +750,15 @@ def CC_X86_32_Vector_Common : CallingConv<[
/// values are spilled on the stack.
def CC_X86_Win32_Vector : CallingConv<[
// Other SSE vectors get 16-byte stack slots that are 4-byte aligned.
- CCIfType<[v16i8, v8i16, v4i32, v2i64, v8f16, v4f32, v2f64],
+ CCIfType<[v16i8, v8i16, v4i32, v2i64, v8f16, v8bf16, v4f32, v2f64],
CCAssignToStack<16, 4>>,
// 256-bit AVX vectors get 32-byte stack slots that are 4-byte aligned.
- CCIfType<[v32i8, v16i16, v8i32, v4i64, v16f16, v8f32, v4f64],
+ CCIfType<[v32i8, v16i16, v8i32, v4i64, v16f16, v16bf16, v8f32, v4f64],
CCAssignToStack<32, 4>>,
// 512-bit AVX 512-bit vectors get 64-byte stack slots that are 4-byte aligned.
- CCIfType<[v64i8, v32i16, v16i32, v8i64, v32f16, v16f32, v8f64],
+ CCIfType<[v64i8, v32i16, v16i32, v8i64, v32f16, v32bf16, v16f32, v8f64],
CCAssignToStack<64, 4>>
]>;
@@ -766,16 +766,16 @@ def CC_X86_Win32_Vector : CallingConv<[
// vector registers
def CC_X86_32_Vector_Standard : CallingConv<[
// SSE vector arguments are passed in XMM registers.
- CCIfNotVarArg<CCIfType<[v16i8, v8i16, v4i32, v2i64, v8f16, v4f32, v2f64],
+ CCIfNotVarArg<CCIfType<[v16i8, v8i16, v4i32, v2i64, v8f16, v8bf16, v4f32, v2f64],
CCAssignToReg<[XMM0, XMM1, XMM2]>>>,
// AVX 256-bit vector arguments are passed in YMM registers.
- CCIfNotVarArg<CCIfType<[v32i8, v16i16, v8i32, v4i64, v16f16, v8f32, v4f64],
+ CCIfNotVarArg<CCIfType<[v32i8, v16i16, v8i32, v4i64, v16f16, v16bf16, v8f32, v4f64],
CCIfSubtarget<"hasAVX()",
CCAssignToReg<[YMM0, YMM1, YMM2]>>>>,
// AVX 512-bit vector arguments are passed in ZMM registers.
- CCIfNotVarArg<CCIfType<[v64i8, v32i16, v16i32, v8i64, v32f16, v16f32, v8f64],
+ CCIfNotVarArg<CCIfType<[v64i8, v32i16, v16i32, v8i64, v32f16, v32bf16, v16f32, v8f64],
CCAssignToReg<[ZMM0, ZMM1, ZMM2]>>>,
CCIfIsVarArgOnWin<CCDelegateTo<CC_X86_Win32_Vector>>,
@@ -786,16 +786,16 @@ def CC_X86_32_Vector_Standard : CallingConv<[
// vector registers.
def CC_X86_32_Vector_Darwin : CallingConv<[
// SSE vector arguments are passed in XMM registers.
- CCIfNotVarArg<CCIfType<[v16i8, v8i16, v4i32, v2i64, v8f16, v4f32, v2f64],
+ CCIfNotVarArg<CCIfType<[v16i8, v8i16, v4i32, v2i64, v8f16, v8bf16, v4f32, v2f64],
CCAssignToReg<[XMM0, XMM1, XMM2, XMM3]>>>,
// AVX 256-bit vector arguments are passed in YMM registers.
- CCIfNotVarArg<CCIfType<[v32i8, v16i16, v8i32, v4i64, v16f16, v8f32, v4f64],
+ CCIfNotVarArg<CCIfType<[v32i8, v16i16, v8i32, v4i64, v16f16, v16bf16, v8f32, v4f64],
CCIfSubtarget<"hasAVX()",
CCAssignToReg<[YMM0, YMM1, YMM2, YMM3]>>>>,
// AVX 512-bit vector arguments are passed in ZMM registers.
- CCIfNotVarArg<CCIfType<[v64i8, v32i16, v16i32, v8i64, v32f16, v16f32, v8f64],
+ CCIfNotVarArg<CCIfType<[v64i8, v32i16, v16i32, v8i64, v32f16, v32bf16, v16f32, v8f64],
CCAssignToReg<[ZMM0, ZMM1, ZMM2, ZMM3]>>>,
CCDelegateTo<CC_X86_32_Vector_Common>
```
DAGISel handles this as shown here: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/X86/X86ISelLoweringCall.cpp#L126-L129
We don't have enough coverage for FastISel, but I think that's OK as long as we don't touch the FastISel code, because it will fall back to DAGISel anyway. The ABI handling may differ, and that's why it failed. So I think we could create a single test file for the bf16 ABI and test both FastISel and DAGISel, or maybe even GISel. It may look like:
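A hypothetical sketch of what such a test could look like -- the RUN lines, check prefixes, and function names below are assumptions rather than the test that eventually landed, and real CHECK lines would be generated with update_llc_test_checks.py:

```llvm
; Hypothetical sketch of a bf16 ABI test; RUN lines and prefixes are
; illustrative only.
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512bf16,+avx512vl \
; RUN:   | FileCheck %s --check-prefix=AVX512
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avxneconvert \
; RUN:   | FileCheck %s --check-prefix=AVXNC
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512bf16,+avx512vl \
; RUN:   -fast-isel | FileCheck %s --check-prefix=FAST

declare <16 x bfloat> @fun(<16 x bfloat>)

; A v16bf16 argument and return value should travel in YMM0, just like
; v16f16, under both FastISel and DAGISel.
define <16 x bfloat> @run(<16 x bfloat> %arg) {
  %r = call <16 x bfloat> @fun(<16 x bfloat> %arg)
  ret <16 x bfloat> %r
}
```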
The issue can be reproduced with
Please can you regenerate the breaking tests?
Take a look at the tests; we cannot simply remove the handling in X86ISelLoweringCall. Unlike FP16, BF16 vectors are legal only when AVX512BF16/AVXNECONVERT are available. Though we could make them legal for SSE2, we may just leave the code as is.
This reverts commit 57c5ca4.
I have reverted the removal of the handling; it caused tests to fail (some crashing the compiler along the way). So the idea behind this piece of code is to pretend that bf16 is fp16 for call lowering purposes? Regarding the testing of this change, there already is a test for it,
ISel will do type legalization first. It can save some of our later work if we don't declare the type legal on SSE2.
The test only tests an illegal scenario (arguably just my personal opinion). What we should test are the meaningful combinations: SSE2, AVX512BF16+AVX512VL, or AVXNECONVERT, as I commented above.
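For reference, a minimal sketch of the "pretend bf16 is f16" idea discussed above, assuming it hooks in where DAGISel queries the register type for a calling-convention value; this illustrates the mechanism only and is not the verbatim code at the linked X86ISelLoweringCall.cpp lines:

```cpp
#include "llvm/CodeGen/MachineValueType.h"
using namespace llvm;

// Illustrative sketch (not the verbatim upstream code): when DAGISel asks
// which register type to use for a value, bf16 vectors are rewritten to the
// matching f16 vectors, so the calling-convention table is never queried
// with a vNbf16 type -- which is why only FastISel, which consults the
// table directly, crashed.
static MVT getRegisterTypeForBF16Sketch(MVT VT) {
  if (VT.isVector() && VT.getVectorElementType() == MVT::bf16)
    return VT.changeVectorElementType(MVT::f16); // e.g. v16bf16 -> v16f16
  return VT;
}
```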
Thanks for the explanation, I will create the test for this then.
LGTM.
LLVM Buildbot has detected a new failure on builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/51/builds/11104. Here is the relevant piece of the build log for reference: