[AArch64] Prevent v1f16 vselect/setcc type expansion. #72048

davemgreen · 2023-11-12T11:27:41Z

PR #71614 identified an issue in the lowering of v1f16 vector compares, where the v1i1 setcc is expanded to v1i16 setcc, and the v1i16 setcc tries to be expanded to a v2i16 setcc which fails. For floating point types we can let them scalarize instead though, generating a setcc f16 that can be lowered using normal fp16 lowering.

07a8ff4 added a special case combine for v1 vselect to expand the predicate type to the same size as the fcmp operands. This turns that off for float types, allowing them to scalarize naturally, which hopefully fixes the issue by preventing the v1i16 setcc, meaning it wont try to widen to larger vectors.

The codegen might not be optimal, but as far as I can tell everything generated successfully, providing that no v1i16 setcc v1f16 instructions get generated.

PR llvm#71614 identified an issue in the lowering of v1f16 vector compares, where the `v1i1 setcc` is expanded to `v1i16 setcc`, and the `v1i16 setcc` tries to be expanded to a `v2i16 setcc` which fails. For floating point types we can let them scalarize instead though, generating a `setcc f16` that can be lowered using normal fp16 lowering. 07a8ff4 added a special case combine for v1 vselect to expand the predicate type to the same size as the fcmp operands. This turns that off for float types, allowing them to scalarize naturally, which hopefully fixes the issue by preventing the v1i16 setcc, meaning it wont try to widen to larger vectors. The codegen might not be optimal, but as far as I can tell everything generated successfully, providing that no `v1i16 setcc v1f16` instructions get generated.

llvmbot · 2023-11-12T11:28:13Z

@llvm/pr-subscribers-backend-aarch64

Author: David Green (davemgreen)

Changes

PR #71614 identified an issue in the lowering of v1f16 vector compares, where the v1i1 setcc is expanded to v1i16 setcc, and the v1i16 setcc tries to be expanded to a v2i16 setcc which fails. For floating point types we can let them scalarize instead though, generating a setcc f16 that can be lowered using normal fp16 lowering.

07a8ff4 added a special case combine for v1 vselect to expand the predicate type to the same size as the fcmp operands. This turns that off for float types, allowing them to scalarize naturally, which hopefully fixes the issue by preventing the v1i16 setcc, meaning it wont try to widen to larger vectors.

The codegen might not be optimal, but as far as I can tell everything generated successfully, providing that no v1i16 setcc v1f16 instructions get generated.

Full diff: https://github.com/llvm/llvm-project/pull/72048.diff

2 Files Affected:

(modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+7-4)
(modified) llvm/test/CodeGen/AArch64/arm64-neon-v1i1-setcc.ll (+144)

diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index b61e4be705709e2..3bff2845b7a1342 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -22389,13 +22389,14 @@ static SDValue performVSelectCombine(SDNode *N, SelectionDAG &DAG) {
     }
   }
 
+  EVT CmpVT = N0.getOperand(0).getValueType();
   if (N0.getOpcode() != ISD::SETCC ||
       CCVT.getVectorElementCount() != ElementCount::getFixed(1) ||
-      CCVT.getVectorElementType() != MVT::i1)
+      CCVT.getVectorElementType() != MVT::i1 ||
+      CmpVT.getVectorElementType().isFloatingPoint())
     return SDValue();
 
   EVT ResVT = N->getValueType(0);
-  EVT CmpVT = N0.getOperand(0).getValueType();
   // Only combine when the result type is of the same size as the compared
   // operands.
   if (ResVT.getSizeInBits() != CmpVT.getSizeInBits())
@@ -22438,8 +22439,10 @@ static SDValue performSelectCombine(SDNode *N,
   EVT SrcVT = N0.getOperand(0).getValueType();
 
   // Don't try to do this optimization when the setcc itself has i1 operands.
-  // There are no legal vectors of i1, so this would be pointless.
-  if (SrcVT == MVT::i1)
+  // There are no legal vectors of i1, so this would be pointless. v1f16 is
+  // ruled out to prevent the creation of setcc that need to be scalarized.
+  if (SrcVT == MVT::i1 ||
+      (SrcVT.isFloatingPoint() && SrcVT.getSizeInBits() <= 16))
     return SDValue();
 
   int NumMaskElts = ResVT.getSizeInBits() / SrcVT.getSizeInBits();
diff --git a/llvm/test/CodeGen/AArch64/arm64-neon-v1i1-setcc.ll b/llvm/test/CodeGen/AArch64/arm64-neon-v1i1-setcc.ll
index 18392a26d846d4f..6c70d19a977a53d 100644
--- a/llvm/test/CodeGen/AArch64/arm64-neon-v1i1-setcc.ll
+++ b/llvm/test/CodeGen/AArch64/arm64-neon-v1i1-setcc.ll
@@ -105,3 +105,147 @@ if.then:
 if.end:
   ret i32 1;
 }
+
+
+define <1 x float> @test_vselect_f32(<1 x float> %i105, <1 x float> %in) {
+; CHECK-LABEL: test_vselect_f32:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $q0
+; CHECK-NEXT:    fcmp s0, s0
+; CHECK-NEXT:    cset w8, vs
+; CHECK-NEXT:    fmov s2, w8
+; CHECK-NEXT:    shl v2.2s, v2.2s, #31
+; CHECK-NEXT:    cmlt v2.2s, v2.2s, #0
+; CHECK-NEXT:    bit v0.8b, v1.8b, v2.8b
+; CHECK-NEXT:    ret
+  %i179 = fcmp uno <1 x float> %i105, zeroinitializer
+  %i180 = select <1 x i1> %i179, <1 x float> %in, <1 x float> %i105
+  ret <1 x float> %i180
+}
+
+define <1 x half> @test_vselect_f16(<1 x half> %i105, <1 x half> %in) {
+; CHECK-LABEL: test_vselect_f16:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    // kill: def $h0 killed $h0 def $s0
+; CHECK-NEXT:    fcvt s2, h0
+; CHECK-NEXT:    // kill: def $h1 killed $h1 def $s1
+; CHECK-NEXT:    fcmp s2, s2
+; CHECK-NEXT:    fcsel s0, s1, s0, vs
+; CHECK-NEXT:    // kill: def $h0 killed $h0 killed $s0
+; CHECK-NEXT:    ret
+  %i179 = fcmp uno <1 x half> %i105, zeroinitializer
+  %i180 = select <1 x i1> %i179, <1 x half> %in, <1 x half> %i105
+  ret <1 x half> %i180
+}
+
+define <1 x half> @test_select_f16(half %a, half %b, <1 x half> %c, <1 x half> %d ) {
+; CHECK-LABEL: test_select_f16:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    fcvt s1, h1
+; CHECK-NEXT:    fcvt s0, h0
+; CHECK-NEXT:    // kill: def $h3 killed $h3 def $s3
+; CHECK-NEXT:    // kill: def $h2 killed $h2 def $s2
+; CHECK-NEXT:    fcmp s0, s1
+; CHECK-NEXT:    fcsel s0, s2, s3, eq
+; CHECK-NEXT:    // kill: def $h0 killed $h0 killed $s0
+; CHECK-NEXT:    ret
+  %cmp31 = fcmp oeq half %a, %b
+  %e = select i1 %cmp31, <1 x half> %c, <1 x half> %d
+  ret <1 x half> %e
+}
+
+define <1 x i16> @test_vselect_f16_i16(<1 x half> %i105, <1 x half> %in, <1 x i16> %x, <1 x i16> %y) {
+; CHECK-LABEL: test_vselect_f16_i16:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    fcvt s0, h0
+; CHECK-NEXT:    fcmp s0, s0
+; CHECK-NEXT:    cset w8, vs
+; CHECK-NEXT:    fmov s0, w8
+; CHECK-NEXT:    shl v0.4h, v0.4h, #15
+; CHECK-NEXT:    cmlt v0.4h, v0.4h, #0
+; CHECK-NEXT:    bsl v0.8b, v2.8b, v3.8b
+; CHECK-NEXT:    ret
+  %i179 = fcmp uno <1 x half> %i105, zeroinitializer
+  %i180 = select <1 x i1> %i179, <1 x i16> %x, <1 x i16> %y
+  ret <1 x i16> %i180
+}
+
+define <1 x i16> @test_select_f16_i16(half %i105, half %in, <1 x i16> %x, <1 x i16> %y) {
+; CHECK-LABEL: test_select_f16_i16:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    fcvt s0, h0
+; CHECK-NEXT:    fcmp s0, s0
+; CHECK-NEXT:    csetm w8, vs
+; CHECK-NEXT:    dup v0.4h, w8
+; CHECK-NEXT:    bsl v0.8b, v2.8b, v3.8b
+; CHECK-NEXT:    ret
+  %i179 = fcmp uno half %i105, zeroinitializer
+  %i180 = select i1 %i179, <1 x i16> %x, <1 x i16> %y
+  ret <1 x i16> %i180
+}
+
+define <1 x i32> @test_vselect_f16_i32(<1 x half> %i105, <1 x half> %in, <1 x i32> %x, <1 x i32> %y) {
+; CHECK-LABEL: test_vselect_f16_i32:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    fcvt s0, h0
+; CHECK-NEXT:    fcmp s0, s0
+; CHECK-NEXT:    cset w8, vs
+; CHECK-NEXT:    fmov s0, w8
+; CHECK-NEXT:    shl v0.2s, v0.2s, #31
+; CHECK-NEXT:    cmlt v0.2s, v0.2s, #0
+; CHECK-NEXT:    bsl v0.8b, v2.8b, v3.8b
+; CHECK-NEXT:    ret
+  %i179 = fcmp uno <1 x half> %i105, zeroinitializer
+  %i180 = select <1 x i1> %i179, <1 x i32> %x, <1 x i32> %y
+  ret <1 x i32> %i180
+}
+
+define i64 @test_sext_extr_cmp_half(<1 x half> %v1, <1 x half> %v2) {
+; CHECK-LABEL: test_sext_extr_cmp_half:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    fcvt s1, h1
+; CHECK-NEXT:    fcvt s0, h0
+; CHECK-NEXT:    fcmp s0, s1
+; CHECK-NEXT:    cset w8, eq
+; CHECK-NEXT:    sbfx x0, x8, #0, #1
+; CHECK-NEXT:    ret
+  %1 = fcmp oeq <1 x half> %v1, %v2
+  %2 = extractelement <1 x i1> %1, i32 0
+  %vget_lane = sext i1 %2 to i64
+  ret i64 %vget_lane
+}
+
+define <1 x i64> @test_select_v1i1_half(half %lhs, half %rhs, <1 x i64> %v3) {
+; CHECK-LABEL: test_select_v1i1_half:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    fcvt s1, h1
+; CHECK-NEXT:    fcvt s0, h0
+; CHECK-NEXT:    fcmp s0, s1
+; CHECK-NEXT:    csetm x8, eq
+; CHECK-NEXT:    fmov d0, x8
+; CHECK-NEXT:    bic v0.8b, v2.8b, v0.8b
+; CHECK-NEXT:    ret
+  %tst = fcmp oeq half %lhs, %rhs
+  %evil = insertelement <1 x i1> undef, i1 %tst, i32 0
+  %res = select <1 x i1> %evil, <1 x i64> zeroinitializer, <1 x i64> %v3
+  ret <1 x i64> %res
+}
+
+define i32 @test_br_extr_cmp_half(<1 x half> %v1, <1 x half> %v2) {
+; CHECK-LABEL: test_br_extr_cmp_half:
+; CHECK:       // %bb.0: // %common.ret
+; CHECK-NEXT:    fcvt s1, h1
+; CHECK-NEXT:    fcvt s0, h0
+; CHECK-NEXT:    fcmp s0, s1
+; CHECK-NEXT:    cset w0, eq
+; CHECK-NEXT:    ret
+  %1 = fcmp oeq <1 x half> %v1, %v2
+  %2 = extractelement <1 x i1> %1, i32 0
+  br i1 %2, label %if.end, label %if.then
+
+if.then:
+  ret i32 0;
+
+if.end:
+  ret i32 1;
+}

qcolombet

Works for me.
Thanks for the fix

davemgreen · 2023-11-13T14:42:40Z

Thanks.

PR llvm#71614 identified an issue in the lowering of v1f16 vector compares, where the `v1i1 setcc` is expanded to `v1i16 setcc`, and the `v1i16 setcc` tries to be expanded to a `v2i16 setcc` which fails. For floating point types we can let them scalarize instead though, generating a `setcc f16` that can be lowered using normal fp16 lowering. 07a8ff4 added a special case combine for v1 vselect to expand the predicate type to the same size as the fcmp operands. This turns that off for float types, allowing them to scalarize naturally, which hopefully fixes the issue by preventing the v1i16 setcc, meaning it wont try to widen to larger vectors. The codegen might not be optimal, but as far as I can tell everything generated successfully, providing that no `v1i16 setcc v1f16` instructions get generated.

davemgreen requested review from TNorthover, SamTebbs33, rsuderman, qcolombet and sjoerdmeijer November 12, 2023 11:27

llvmbot added the backend:AArch64 label Nov 12, 2023

davemgreen mentioned this pull request Nov 12, 2023

[SDAG] Adding scalarization of llvm.vector.insert #71614

Closed

qcolombet approved these changes Nov 12, 2023

View reviewed changes

qcolombet mentioned this pull request Nov 13, 2023

[CPU][ARM] Failed to compile FP16 model with DT+UK for Android iree-org/iree#15505

Closed

davemgreen merged commit 2238363 into llvm:main Nov 13, 2023

davemgreen deleted the gh-a64-v1f16cmp branch November 13, 2023 14:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[AArch64] Prevent v1f16 vselect/setcc type expansion. #72048

[AArch64] Prevent v1f16 vselect/setcc type expansion. #72048

Uh oh!

davemgreen commented Nov 12, 2023

Uh oh!

llvmbot commented Nov 12, 2023

Uh oh!

qcolombet left a comment

Uh oh!

davemgreen commented Nov 13, 2023

Uh oh!

Uh oh!

[AArch64] Prevent v1f16 vselect/setcc type expansion. #72048

[AArch64] Prevent v1f16 vselect/setcc type expansion. #72048

Uh oh!

Conversation

davemgreen commented Nov 12, 2023

Uh oh!

llvmbot commented Nov 12, 2023

Uh oh!

qcolombet left a comment

Choose a reason for hiding this comment

Uh oh!

davemgreen commented Nov 13, 2023

Uh oh!

Uh oh!