[X86] LowerSelect - use BLENDV for scalar selection on all SSE41+ targets #125853

Merged
RKSimon merged 3 commits into llvm:main from sse41-scalar-blendv on Feb 10, 2025

Conversation

RKSimon
Collaborator

@RKSimon RKSimon commented Feb 5, 2025

When we first began (2015) to lower f32/f64 selects to X86ISD::BLENDV(scalar_to_vector(),scalar_to_vector(),scalar_to_vector()), we limited it to AVX targets to avoid issues with SSE41's xmm0 constraint for the condition mask.

Since then we've seen general improvements in TwoAddressInstruction and better handling of condition commutation for X86ISD::BLENDV nodes, which should address many of the original concerns of using SSE41 BLENDVPD/S. In most cases we will replace 3 logic instructions with the BLENDV node and (up to 3) additional moves. Although the BLENDV is often more expensive on original SSE41 targets, this should still be an improvement in a majority of cases.

We also have no equivalent SSE41 restrictions for v2f64/v4f32 vector selection.

Fixes #105807
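
For reference, a minimal C reproducer of the affected pattern (a hypothetical example; the function name and build flags are assumptions, but the IR it produces matches test17 in fp-select-cmp-and.ll in the diff below): a scalar select whose condition is a floating-point compare. Built with -msse4.1 and no AVX, this previously lowered to CMPSS + ANDPS/ANDNPS/ORPS and now lowers to CMPSS + BLENDVPS.

// Hypothetical reduced example of the scalar select pattern this patch affects.
// Build: clang -O2 -msse4.1 example.c -S   (no AVX)
float select_oge(float a, float b, float c, float eps) {
  // fcmp oge + select i1: on SSE4.1-only targets this is now lowered via
  // X86ISD::BLENDV (cmpless + blendvps) instead of andps/andnps/orps.
  return a >= eps ? c : b;
}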

[X86] LowerSelect - use BLENDV for scalar selection if not all operands are multi use

When we first began (2015) to lower f32/f64 selects to X86ISD::BLENDV(scalar_to_vector(),scalar_to_vector(),scalar_to_vector()), we limited it to AVX targets to avoid issues with SSE41's xmm0 constraint for the condition mask.

Since then we've seen general improvements in TwoAddressInstruction and better handling of condition commutation for X86ISD::BLENDV nodes, which should address many of the original concerns of using SSE41 BLENDVPD/S. If we allow SSE41 cases where the condition and another operand have one use, then the extra moves should never be as bad as the avoided logic ops (we still assume SSE41 BLENDV is more expensive than general logic).

Fixes llvm#105807
@llvmbot
Member

llvmbot commented Feb 5, 2025

@llvm/pr-subscribers-backend-x86

Author: Simon Pilgrim (RKSimon)

Changes

When we first began (2015) to lower f32/f64 selects to X86ISD::BLENDV(scalar_to_vector(),scalar_to_vector(),scalar_to_vector()), we limited it to AVX targets to avoid issues with SSE41's xmm0 constraint for the condition mask.

Since then we've seen general improvements in TwoAddressInstruction and better handling of condition commutation for X86ISD::BLENDV nodes, which should address many of the original concerns of using SSE41 BLENDVPD/S. If we allow SSE41 cases where the condition and another operand have one use, then the extra moves should never be as bad as the avoided logic ops (we still assume SSE41 BLENDV is more expensive than general logic).

We also have no equivalent restrictions for SSE41 for v2f64/v4f32 vector selection.

Fixes #105807


Patch is 67.14 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/125853.diff

10 Files Affected:

  • (modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+6-5)
  • (modified) llvm/test/CodeGen/X86/fmaxnum.ll (+80-44)
  • (modified) llvm/test/CodeGen/X86/fminnum.ll (+80-44)
  • (modified) llvm/test/CodeGen/X86/fp-select-cmp-and.ll (+4-6)
  • (modified) llvm/test/CodeGen/X86/setcc-combine.ll (+38-18)
  • (modified) llvm/test/CodeGen/X86/vec_floor.ll (+12-16)
  • (modified) llvm/test/CodeGen/X86/vector-reduce-fmax.ll (+123-135)
  • (modified) llvm/test/CodeGen/X86/vector-reduce-fmaximum.ll (+229-238)
  • (modified) llvm/test/CodeGen/X86/vector-reduce-fmin.ll (+119-130)
  • (modified) llvm/test/CodeGen/X86/vselect-zero.ll (+31-16)
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 6cf6061deba702..bc55d772b86b8d 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -24630,8 +24630,8 @@ SDValue X86TargetLowering::LowerSELECT(SDValue Op, SelectionDAG &DAG) const {
       SDValue Cmp = DAG.getNode(X86ISD::FSETCC, DL, VT, CondOp0, CondOp1,
                                 DAG.getTargetConstant(SSECC, DL, MVT::i8));
 
-      // If we have AVX, we can use a variable vector select (VBLENDV) instead
-      // of 3 logic instructions for size savings and potentially speed.
+      // If we have SSE41/AVX, we can use a variable vector select (VBLENDV)
+      // instead of 3 logic instructions for size savings and potentially speed.
       // Unfortunately, there is no scalar form of VBLENDV.
 
       // If either operand is a +0.0 constant, don't try this. We can expect to
@@ -24641,9 +24641,10 @@ SDValue X86TargetLowering::LowerSELECT(SDValue Op, SelectionDAG &DAG) const {
       // BLENDV was introduced with SSE 4.1, but the 2 register form implicitly
       // uses XMM0 as the selection register. That may need just as many
       // instructions as the AND/ANDN/OR sequence due to register moves, so
-      // don't bother.
-      if (Subtarget.hasAVX() && !isNullFPConstant(Op1) &&
-          !isNullFPConstant(Op2)) {
+      // only attempt this if at least one of ops (+ condition) are one use.
+      if (Subtarget.hasSSE41() && !isNullFPConstant(Op1) &&
+          !isNullFPConstant(Op2) &&
+          (Subtarget.hasAVX() || Op1->hasOneUse() || Op2->hasOneUse())) {
         // Convert to vectors, do a VSELECT, and convert back to scalar.
         // All of the conversions should be optimized away.
         MVT VecVT = VT == MVT::f32 ? MVT::v4f32 : MVT::v2f64;
diff --git a/llvm/test/CodeGen/X86/fmaxnum.ll b/llvm/test/CodeGen/X86/fmaxnum.ll
index 2e1af1e84e0762..d6252cc85e8b45 100644
--- a/llvm/test/CodeGen/X86/fmaxnum.ll
+++ b/llvm/test/CodeGen/X86/fmaxnum.ll
@@ -22,17 +22,26 @@ declare <8 x double> @llvm.maxnum.v8f64(<8 x double>, <8 x double>)
 ; FIXME: As the vector tests show, the SSE run shouldn't need this many moves.
 
 define float @test_fmaxf(float %x, float %y) {
-; SSE-LABEL: test_fmaxf:
-; SSE:       # %bb.0:
-; SSE-NEXT:    movaps %xmm0, %xmm2
-; SSE-NEXT:    cmpunordss %xmm0, %xmm2
-; SSE-NEXT:    movaps %xmm2, %xmm3
-; SSE-NEXT:    andps %xmm1, %xmm3
-; SSE-NEXT:    maxss %xmm0, %xmm1
-; SSE-NEXT:    andnps %xmm1, %xmm2
-; SSE-NEXT:    orps %xmm3, %xmm2
-; SSE-NEXT:    movaps %xmm2, %xmm0
-; SSE-NEXT:    retq
+; SSE2-LABEL: test_fmaxf:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movaps %xmm0, %xmm2
+; SSE2-NEXT:    cmpunordss %xmm0, %xmm2
+; SSE2-NEXT:    movaps %xmm2, %xmm3
+; SSE2-NEXT:    andps %xmm1, %xmm3
+; SSE2-NEXT:    maxss %xmm0, %xmm1
+; SSE2-NEXT:    andnps %xmm1, %xmm2
+; SSE2-NEXT:    orps %xmm3, %xmm2
+; SSE2-NEXT:    movaps %xmm2, %xmm0
+; SSE2-NEXT:    retq
+;
+; SSE4-LABEL: test_fmaxf:
+; SSE4:       # %bb.0:
+; SSE4-NEXT:    movaps %xmm1, %xmm2
+; SSE4-NEXT:    maxss %xmm0, %xmm2
+; SSE4-NEXT:    cmpunordss %xmm0, %xmm0
+; SSE4-NEXT:    blendvps %xmm0, %xmm1, %xmm2
+; SSE4-NEXT:    movaps %xmm2, %xmm0
+; SSE4-NEXT:    retq
 ;
 ; AVX1-LABEL: test_fmaxf:
 ; AVX1:       # %bb.0:
@@ -63,17 +72,26 @@ define float @test_fmaxf_minsize(float %x, float %y) minsize {
 ; FIXME: As the vector tests show, the SSE run shouldn't need this many moves.
 
 define double @test_fmax(double %x, double %y) {
-; SSE-LABEL: test_fmax:
-; SSE:       # %bb.0:
-; SSE-NEXT:    movapd %xmm0, %xmm2
-; SSE-NEXT:    cmpunordsd %xmm0, %xmm2
-; SSE-NEXT:    movapd %xmm2, %xmm3
-; SSE-NEXT:    andpd %xmm1, %xmm3
-; SSE-NEXT:    maxsd %xmm0, %xmm1
-; SSE-NEXT:    andnpd %xmm1, %xmm2
-; SSE-NEXT:    orpd %xmm3, %xmm2
-; SSE-NEXT:    movapd %xmm2, %xmm0
-; SSE-NEXT:    retq
+; SSE2-LABEL: test_fmax:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movapd %xmm0, %xmm2
+; SSE2-NEXT:    cmpunordsd %xmm0, %xmm2
+; SSE2-NEXT:    movapd %xmm2, %xmm3
+; SSE2-NEXT:    andpd %xmm1, %xmm3
+; SSE2-NEXT:    maxsd %xmm0, %xmm1
+; SSE2-NEXT:    andnpd %xmm1, %xmm2
+; SSE2-NEXT:    orpd %xmm3, %xmm2
+; SSE2-NEXT:    movapd %xmm2, %xmm0
+; SSE2-NEXT:    retq
+;
+; SSE4-LABEL: test_fmax:
+; SSE4:       # %bb.0:
+; SSE4-NEXT:    movapd %xmm1, %xmm2
+; SSE4-NEXT:    maxsd %xmm0, %xmm2
+; SSE4-NEXT:    cmpunordsd %xmm0, %xmm0
+; SSE4-NEXT:    blendvpd %xmm0, %xmm1, %xmm2
+; SSE4-NEXT:    movapd %xmm2, %xmm0
+; SSE4-NEXT:    retq
 ;
 ; AVX1-LABEL: test_fmax:
 ; AVX1:       # %bb.0:
@@ -111,17 +129,26 @@ define x86_fp80 @test_fmaxl(x86_fp80 %x, x86_fp80 %y) {
 }
 
 define float @test_intrinsic_fmaxf(float %x, float %y) {
-; SSE-LABEL: test_intrinsic_fmaxf:
-; SSE:       # %bb.0:
-; SSE-NEXT:    movaps %xmm0, %xmm2
-; SSE-NEXT:    cmpunordss %xmm0, %xmm2
-; SSE-NEXT:    movaps %xmm2, %xmm3
-; SSE-NEXT:    andps %xmm1, %xmm3
-; SSE-NEXT:    maxss %xmm0, %xmm1
-; SSE-NEXT:    andnps %xmm1, %xmm2
-; SSE-NEXT:    orps %xmm3, %xmm2
-; SSE-NEXT:    movaps %xmm2, %xmm0
-; SSE-NEXT:    retq
+; SSE2-LABEL: test_intrinsic_fmaxf:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movaps %xmm0, %xmm2
+; SSE2-NEXT:    cmpunordss %xmm0, %xmm2
+; SSE2-NEXT:    movaps %xmm2, %xmm3
+; SSE2-NEXT:    andps %xmm1, %xmm3
+; SSE2-NEXT:    maxss %xmm0, %xmm1
+; SSE2-NEXT:    andnps %xmm1, %xmm2
+; SSE2-NEXT:    orps %xmm3, %xmm2
+; SSE2-NEXT:    movaps %xmm2, %xmm0
+; SSE2-NEXT:    retq
+;
+; SSE4-LABEL: test_intrinsic_fmaxf:
+; SSE4:       # %bb.0:
+; SSE4-NEXT:    movaps %xmm1, %xmm2
+; SSE4-NEXT:    maxss %xmm0, %xmm2
+; SSE4-NEXT:    cmpunordss %xmm0, %xmm0
+; SSE4-NEXT:    blendvps %xmm0, %xmm1, %xmm2
+; SSE4-NEXT:    movaps %xmm2, %xmm0
+; SSE4-NEXT:    retq
 ;
 ; AVX1-LABEL: test_intrinsic_fmaxf:
 ; AVX1:       # %bb.0:
@@ -142,17 +169,26 @@ define float @test_intrinsic_fmaxf(float %x, float %y) {
 }
 
 define double @test_intrinsic_fmax(double %x, double %y) {
-; SSE-LABEL: test_intrinsic_fmax:
-; SSE:       # %bb.0:
-; SSE-NEXT:    movapd %xmm0, %xmm2
-; SSE-NEXT:    cmpunordsd %xmm0, %xmm2
-; SSE-NEXT:    movapd %xmm2, %xmm3
-; SSE-NEXT:    andpd %xmm1, %xmm3
-; SSE-NEXT:    maxsd %xmm0, %xmm1
-; SSE-NEXT:    andnpd %xmm1, %xmm2
-; SSE-NEXT:    orpd %xmm3, %xmm2
-; SSE-NEXT:    movapd %xmm2, %xmm0
-; SSE-NEXT:    retq
+; SSE2-LABEL: test_intrinsic_fmax:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movapd %xmm0, %xmm2
+; SSE2-NEXT:    cmpunordsd %xmm0, %xmm2
+; SSE2-NEXT:    movapd %xmm2, %xmm3
+; SSE2-NEXT:    andpd %xmm1, %xmm3
+; SSE2-NEXT:    maxsd %xmm0, %xmm1
+; SSE2-NEXT:    andnpd %xmm1, %xmm2
+; SSE2-NEXT:    orpd %xmm3, %xmm2
+; SSE2-NEXT:    movapd %xmm2, %xmm0
+; SSE2-NEXT:    retq
+;
+; SSE4-LABEL: test_intrinsic_fmax:
+; SSE4:       # %bb.0:
+; SSE4-NEXT:    movapd %xmm1, %xmm2
+; SSE4-NEXT:    maxsd %xmm0, %xmm2
+; SSE4-NEXT:    cmpunordsd %xmm0, %xmm0
+; SSE4-NEXT:    blendvpd %xmm0, %xmm1, %xmm2
+; SSE4-NEXT:    movapd %xmm2, %xmm0
+; SSE4-NEXT:    retq
 ;
 ; AVX1-LABEL: test_intrinsic_fmax:
 ; AVX1:       # %bb.0:
diff --git a/llvm/test/CodeGen/X86/fminnum.ll b/llvm/test/CodeGen/X86/fminnum.ll
index 1290a7b8191067..0ef8fdec33d937 100644
--- a/llvm/test/CodeGen/X86/fminnum.ll
+++ b/llvm/test/CodeGen/X86/fminnum.ll
@@ -22,17 +22,26 @@ declare <8 x double> @llvm.minnum.v8f64(<8 x double>, <8 x double>)
 ; FIXME: As the vector tests show, the SSE run shouldn't need this many moves.
 
 define float @test_fminf(float %x, float %y) {
-; SSE-LABEL: test_fminf:
-; SSE:       # %bb.0:
-; SSE-NEXT:    movaps %xmm0, %xmm2
-; SSE-NEXT:    cmpunordss %xmm0, %xmm2
-; SSE-NEXT:    movaps %xmm2, %xmm3
-; SSE-NEXT:    andps %xmm1, %xmm3
-; SSE-NEXT:    minss %xmm0, %xmm1
-; SSE-NEXT:    andnps %xmm1, %xmm2
-; SSE-NEXT:    orps %xmm3, %xmm2
-; SSE-NEXT:    movaps %xmm2, %xmm0
-; SSE-NEXT:    retq
+; SSE2-LABEL: test_fminf:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movaps %xmm0, %xmm2
+; SSE2-NEXT:    cmpunordss %xmm0, %xmm2
+; SSE2-NEXT:    movaps %xmm2, %xmm3
+; SSE2-NEXT:    andps %xmm1, %xmm3
+; SSE2-NEXT:    minss %xmm0, %xmm1
+; SSE2-NEXT:    andnps %xmm1, %xmm2
+; SSE2-NEXT:    orps %xmm3, %xmm2
+; SSE2-NEXT:    movaps %xmm2, %xmm0
+; SSE2-NEXT:    retq
+;
+; SSE4-LABEL: test_fminf:
+; SSE4:       # %bb.0:
+; SSE4-NEXT:    movaps %xmm1, %xmm2
+; SSE4-NEXT:    minss %xmm0, %xmm2
+; SSE4-NEXT:    cmpunordss %xmm0, %xmm0
+; SSE4-NEXT:    blendvps %xmm0, %xmm1, %xmm2
+; SSE4-NEXT:    movaps %xmm2, %xmm0
+; SSE4-NEXT:    retq
 ;
 ; AVX1-LABEL: test_fminf:
 ; AVX1:       # %bb.0:
@@ -63,17 +72,26 @@ define float @test_fminf_minsize(float %x, float %y) minsize {
 ; FIXME: As the vector tests show, the SSE run shouldn't need this many moves.
 
 define double @test_fmin(double %x, double %y) {
-; SSE-LABEL: test_fmin:
-; SSE:       # %bb.0:
-; SSE-NEXT:    movapd %xmm0, %xmm2
-; SSE-NEXT:    cmpunordsd %xmm0, %xmm2
-; SSE-NEXT:    movapd %xmm2, %xmm3
-; SSE-NEXT:    andpd %xmm1, %xmm3
-; SSE-NEXT:    minsd %xmm0, %xmm1
-; SSE-NEXT:    andnpd %xmm1, %xmm2
-; SSE-NEXT:    orpd %xmm3, %xmm2
-; SSE-NEXT:    movapd %xmm2, %xmm0
-; SSE-NEXT:    retq
+; SSE2-LABEL: test_fmin:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movapd %xmm0, %xmm2
+; SSE2-NEXT:    cmpunordsd %xmm0, %xmm2
+; SSE2-NEXT:    movapd %xmm2, %xmm3
+; SSE2-NEXT:    andpd %xmm1, %xmm3
+; SSE2-NEXT:    minsd %xmm0, %xmm1
+; SSE2-NEXT:    andnpd %xmm1, %xmm2
+; SSE2-NEXT:    orpd %xmm3, %xmm2
+; SSE2-NEXT:    movapd %xmm2, %xmm0
+; SSE2-NEXT:    retq
+;
+; SSE4-LABEL: test_fmin:
+; SSE4:       # %bb.0:
+; SSE4-NEXT:    movapd %xmm1, %xmm2
+; SSE4-NEXT:    minsd %xmm0, %xmm2
+; SSE4-NEXT:    cmpunordsd %xmm0, %xmm0
+; SSE4-NEXT:    blendvpd %xmm0, %xmm1, %xmm2
+; SSE4-NEXT:    movapd %xmm2, %xmm0
+; SSE4-NEXT:    retq
 ;
 ; AVX1-LABEL: test_fmin:
 ; AVX1:       # %bb.0:
@@ -111,17 +129,26 @@ define x86_fp80 @test_fminl(x86_fp80 %x, x86_fp80 %y) {
 }
 
 define float @test_intrinsic_fminf(float %x, float %y) {
-; SSE-LABEL: test_intrinsic_fminf:
-; SSE:       # %bb.0:
-; SSE-NEXT:    movaps %xmm0, %xmm2
-; SSE-NEXT:    cmpunordss %xmm0, %xmm2
-; SSE-NEXT:    movaps %xmm2, %xmm3
-; SSE-NEXT:    andps %xmm1, %xmm3
-; SSE-NEXT:    minss %xmm0, %xmm1
-; SSE-NEXT:    andnps %xmm1, %xmm2
-; SSE-NEXT:    orps %xmm3, %xmm2
-; SSE-NEXT:    movaps %xmm2, %xmm0
-; SSE-NEXT:    retq
+; SSE2-LABEL: test_intrinsic_fminf:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movaps %xmm0, %xmm2
+; SSE2-NEXT:    cmpunordss %xmm0, %xmm2
+; SSE2-NEXT:    movaps %xmm2, %xmm3
+; SSE2-NEXT:    andps %xmm1, %xmm3
+; SSE2-NEXT:    minss %xmm0, %xmm1
+; SSE2-NEXT:    andnps %xmm1, %xmm2
+; SSE2-NEXT:    orps %xmm3, %xmm2
+; SSE2-NEXT:    movaps %xmm2, %xmm0
+; SSE2-NEXT:    retq
+;
+; SSE4-LABEL: test_intrinsic_fminf:
+; SSE4:       # %bb.0:
+; SSE4-NEXT:    movaps %xmm1, %xmm2
+; SSE4-NEXT:    minss %xmm0, %xmm2
+; SSE4-NEXT:    cmpunordss %xmm0, %xmm0
+; SSE4-NEXT:    blendvps %xmm0, %xmm1, %xmm2
+; SSE4-NEXT:    movaps %xmm2, %xmm0
+; SSE4-NEXT:    retq
 ;
 ; AVX1-LABEL: test_intrinsic_fminf:
 ; AVX1:       # %bb.0:
@@ -142,17 +169,26 @@ define float @test_intrinsic_fminf(float %x, float %y) {
 }
 
 define double @test_intrinsic_fmin(double %x, double %y) {
-; SSE-LABEL: test_intrinsic_fmin:
-; SSE:       # %bb.0:
-; SSE-NEXT:    movapd %xmm0, %xmm2
-; SSE-NEXT:    cmpunordsd %xmm0, %xmm2
-; SSE-NEXT:    movapd %xmm2, %xmm3
-; SSE-NEXT:    andpd %xmm1, %xmm3
-; SSE-NEXT:    minsd %xmm0, %xmm1
-; SSE-NEXT:    andnpd %xmm1, %xmm2
-; SSE-NEXT:    orpd %xmm3, %xmm2
-; SSE-NEXT:    movapd %xmm2, %xmm0
-; SSE-NEXT:    retq
+; SSE2-LABEL: test_intrinsic_fmin:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movapd %xmm0, %xmm2
+; SSE2-NEXT:    cmpunordsd %xmm0, %xmm2
+; SSE2-NEXT:    movapd %xmm2, %xmm3
+; SSE2-NEXT:    andpd %xmm1, %xmm3
+; SSE2-NEXT:    minsd %xmm0, %xmm1
+; SSE2-NEXT:    andnpd %xmm1, %xmm2
+; SSE2-NEXT:    orpd %xmm3, %xmm2
+; SSE2-NEXT:    movapd %xmm2, %xmm0
+; SSE2-NEXT:    retq
+;
+; SSE4-LABEL: test_intrinsic_fmin:
+; SSE4:       # %bb.0:
+; SSE4-NEXT:    movapd %xmm1, %xmm2
+; SSE4-NEXT:    minsd %xmm0, %xmm2
+; SSE4-NEXT:    cmpunordsd %xmm0, %xmm0
+; SSE4-NEXT:    blendvpd %xmm0, %xmm1, %xmm2
+; SSE4-NEXT:    movapd %xmm2, %xmm0
+; SSE4-NEXT:    retq
 ;
 ; AVX1-LABEL: test_intrinsic_fmin:
 ; AVX1:       # %bb.0:
diff --git a/llvm/test/CodeGen/X86/fp-select-cmp-and.ll b/llvm/test/CodeGen/X86/fp-select-cmp-and.ll
index 0f6159d36ea818..1d006f725ca34d 100644
--- a/llvm/test/CodeGen/X86/fp-select-cmp-and.ll
+++ b/llvm/test/CodeGen/X86/fp-select-cmp-and.ll
@@ -189,10 +189,9 @@ define float @test17(float %a, float %b, float %c, float %eps) {
 ; CHECK-LABEL: test17:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    cmpless %xmm0, %xmm3
-; CHECK-NEXT:    andps %xmm3, %xmm2
-; CHECK-NEXT:    andnps %xmm1, %xmm3
-; CHECK-NEXT:    orps %xmm2, %xmm3
 ; CHECK-NEXT:    movaps %xmm3, %xmm0
+; CHECK-NEXT:    blendvps %xmm0, %xmm2, %xmm1
+; CHECK-NEXT:    movaps %xmm1, %xmm0
 ; CHECK-NEXT:    retq
   %cmp = fcmp oge float %a, %eps
   %cond = select i1 %cmp, float %c, float %b
@@ -203,10 +202,9 @@ define double @test18(double %a, double %b, double %c, double %eps) {
 ; CHECK-LABEL: test18:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    cmplesd %xmm0, %xmm3
-; CHECK-NEXT:    andpd %xmm3, %xmm2
-; CHECK-NEXT:    andnpd %xmm1, %xmm3
-; CHECK-NEXT:    orpd %xmm2, %xmm3
 ; CHECK-NEXT:    movapd %xmm3, %xmm0
+; CHECK-NEXT:    blendvpd %xmm0, %xmm2, %xmm1
+; CHECK-NEXT:    movapd %xmm1, %xmm0
 ; CHECK-NEXT:    retq
   %cmp = fcmp oge double %a, %eps
   %cond = select i1 %cmp, double %c, double %b
diff --git a/llvm/test/CodeGen/X86/setcc-combine.ll b/llvm/test/CodeGen/X86/setcc-combine.ll
index e723569bda8a12..f526db00df6062 100644
--- a/llvm/test/CodeGen/X86/setcc-combine.ll
+++ b/llvm/test/CodeGen/X86/setcc-combine.ll
@@ -463,14 +463,23 @@ define <2 x double> @oge(<2 x double> %x) {
 ; negative test - don't create an fneg to replace 0.0 operand
 
 define double @ogt_no_fneg(double %x, double %y) {
-; CHECK-LABEL: ogt_no_fneg:
-; CHECK:       # %bb.0:
-; CHECK-NEXT:    xorpd %xmm2, %xmm2
-; CHECK-NEXT:    cmpltsd %xmm0, %xmm2
-; CHECK-NEXT:    andpd %xmm2, %xmm0
-; CHECK-NEXT:    andnpd %xmm1, %xmm2
-; CHECK-NEXT:    orpd %xmm2, %xmm0
-; CHECK-NEXT:    retq
+; SSE2-LABEL: ogt_no_fneg:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    xorpd %xmm2, %xmm2
+; SSE2-NEXT:    cmpltsd %xmm0, %xmm2
+; SSE2-NEXT:    andpd %xmm2, %xmm0
+; SSE2-NEXT:    andnpd %xmm1, %xmm2
+; SSE2-NEXT:    orpd %xmm2, %xmm0
+; SSE2-NEXT:    retq
+;
+; SSE41-LABEL: ogt_no_fneg:
+; SSE41:       # %bb.0:
+; SSE41-NEXT:    movapd %xmm0, %xmm2
+; SSE41-NEXT:    xorpd %xmm0, %xmm0
+; SSE41-NEXT:    cmpltsd %xmm2, %xmm0
+; SSE41-NEXT:    blendvpd %xmm0, %xmm2, %xmm1
+; SSE41-NEXT:    movapd %xmm1, %xmm0
+; SSE41-NEXT:    retq
   %cmp = fcmp ogt double %x, 0.0
   %r = select i1 %cmp, double %x, double %y
   ret double %r
@@ -479,16 +488,27 @@ define double @ogt_no_fneg(double %x, double %y) {
 ; negative test - can't change the setcc for non-zero constant
 
 define double @ogt_no_zero(double %x) {
-; CHECK-LABEL: ogt_no_zero:
-; CHECK:       # %bb.0:
-; CHECK-NEXT:    movapd {{.*#+}} xmm1 = [-0.0E+0,-0.0E+0]
-; CHECK-NEXT:    xorpd %xmm0, %xmm1
-; CHECK-NEXT:    movsd {{.*#+}} xmm2 = [1.0E+0,0.0E+0]
-; CHECK-NEXT:    cmpltsd %xmm0, %xmm2
-; CHECK-NEXT:    andpd %xmm2, %xmm0
-; CHECK-NEXT:    andnpd %xmm1, %xmm2
-; CHECK-NEXT:    orpd %xmm2, %xmm0
-; CHECK-NEXT:    retq
+; SSE2-LABEL: ogt_no_zero:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movapd {{.*#+}} xmm1 = [-0.0E+0,-0.0E+0]
+; SSE2-NEXT:    xorpd %xmm0, %xmm1
+; SSE2-NEXT:    movsd {{.*#+}} xmm2 = [1.0E+0,0.0E+0]
+; SSE2-NEXT:    cmpltsd %xmm0, %xmm2
+; SSE2-NEXT:    andpd %xmm2, %xmm0
+; SSE2-NEXT:    andnpd %xmm1, %xmm2
+; SSE2-NEXT:    orpd %xmm2, %xmm0
+; SSE2-NEXT:    retq
+;
+; SSE41-LABEL: ogt_no_zero:
+; SSE41:       # %bb.0:
+; SSE41-NEXT:    movapd %xmm0, %xmm1
+; SSE41-NEXT:    movapd {{.*#+}} xmm2 = [-0.0E+0,-0.0E+0]
+; SSE41-NEXT:    xorpd %xmm0, %xmm2
+; SSE41-NEXT:    movsd {{.*#+}} xmm0 = [1.0E+0,0.0E+0]
+; SSE41-NEXT:    cmpltsd %xmm1, %xmm0
+; SSE41-NEXT:    blendvpd %xmm0, %xmm1, %xmm2
+; SSE41-NEXT:    movapd %xmm2, %xmm0
+; SSE41-NEXT:    retq
   %neg = fneg double %x
   %cmp = fcmp ogt double %x, 1.0
   %r = select i1 %cmp, double %x, double %neg
diff --git a/llvm/test/CodeGen/X86/vec_floor.ll b/llvm/test/CodeGen/X86/vec_floor.ll
index 65cde6ac91106b..abb85ac83464cb 100644
--- a/llvm/test/CodeGen/X86/vec_floor.ll
+++ b/llvm/test/CodeGen/X86/vec_floor.ll
@@ -1679,10 +1679,9 @@ define <4 x float> @floor_mask_ss_mask8(<4 x float> %x, <4 x float> %y, <4 x flo
 ; SSE41:       ## %bb.0:
 ; SSE41-NEXT:    roundss $9, %xmm0, %xmm3
 ; SSE41-NEXT:    cmpeqss %xmm1, %xmm0
-; SSE41-NEXT:    andps %xmm0, %xmm3
-; SSE41-NEXT:    andnps %xmm2, %xmm0
-; SSE41-NEXT:    orps %xmm3, %xmm0
-; SSE41-NEXT:    blendps {{.*#+}} xmm0 = xmm0[0],xmm1[1,2,3]
+; SSE41-NEXT:    blendvps %xmm0, %xmm3, %xmm2
+; SSE41-NEXT:    blendps {{.*#+}} xmm2 = xmm2[0],xmm1[1,2,3]
+; SSE41-NEXT:    movaps %xmm2, %xmm0
 ; SSE41-NEXT:    retq
 ;
 ; AVX-LABEL: floor_mask_ss_mask8:
@@ -1747,10 +1746,9 @@ define <2 x double> @floor_mask_sd_mask8(<2 x double> %x, <2 x double> %y, <2 x
 ; SSE41:       ## %bb.0:
 ; SSE41-NEXT:    roundsd $9, %xmm0, %xmm3
 ; SSE41-NEXT:    cmpeqsd %xmm1, %xmm0
-; SSE41-NEXT:    andpd %xmm0, %xmm3
-; SSE41-NEXT:    andnpd %xmm2, %xmm0
-; SSE41-NEXT:    orpd %xmm3, %xmm0
-; SSE41-NEXT:    blendpd {{.*#+}} xmm0 = xmm0[0],xmm1[1]
+; SSE41-NEXT:    blendvpd %xmm0, %xmm3, %xmm2
+; SSE41-NEXT:    blendpd {{.*#+}} xmm2 = xmm2[0],xmm1[1]
+; SSE41-NEXT:    movapd %xmm2, %xmm0
 ; SSE41-NEXT:    retq
 ;
 ; AVX-LABEL: floor_mask_sd_mask8:
@@ -2671,10 +2669,9 @@ define <4 x float> @ceil_mask_ss_mask8(<4 x float> %x, <4 x float> %y, <4 x floa
 ; SSE41:       ## %bb.0:
 ; SSE41-NEXT:    roundss $10, %xmm0, %xmm3
 ; SSE41-NEXT:    cmpeqss %xmm1, %xmm0
-; SSE41-NEXT:    andps %xmm0, %xmm3
-; SSE41-NEXT:    andnps %xmm2, %xmm0
-; SSE41-NEXT:    orps %xmm3, %xmm0
-; SSE41-NEXT:    blendps {{.*#+}} xmm0 = xmm0[0],xmm1[1,2,3]
+; SSE41-NEXT:    blendvps %xmm0, %xmm3, %xmm2
+; SSE41-NEXT:    blendps {{.*#+}} xmm2 = xmm2[0],xmm1[1,2,3]
+; SSE41-NEXT:    movaps %xmm2, %xmm0
 ; SSE41-NEXT:    retq
 ;
 ; AVX-LABEL: ceil_mask_ss_mask8:
@@ -2739,10 +2736,9 @@ define <2 x double> @ceil_mask_sd_mask8(<2 x double> %x, <2 x double> %y, <2 x d
 ; SSE41:       ## %bb.0:
 ; SSE41-NEXT:    roundsd $10, %xmm0, %xmm3
 ; SSE41-NEXT:    cmpeqsd %xmm1, %xmm0
-; SSE41-NEXT:    andpd %xmm0, %xmm3
-; SSE41-NEXT:    andnpd %xmm2, %xmm0
-; SSE41-NEXT:    orpd %xmm3, %xmm0
-; SSE41-NEXT:    blendpd {{.*#+}} xmm0 = xmm0[0],xmm1[1]
+; SSE41-NEXT:    blendvpd %xmm0, %xmm3, %xmm2
+; SSE41-NEXT:    blendpd {{.*#+}} xmm2 = xmm2[0],xmm1[1]
+; SSE41-NEXT:    movapd %xmm2, %xmm0
 ; SSE41-NEXT:    retq
 ;
 ; AVX-LABEL: ceil_mask_sd_mask8:
diff --git a/llvm/test/CodeGen/X86/vector-reduce-fmax.ll b/llvm/test/CodeGen/X86/vector-reduce-fmax.ll
index fe2c41f57cfab1..7048b98227620f 100644
--- a/llvm/test/CodeGen/X86/vector-reduce-fmax.ll
+++ b/llvm/test/CodeGen/X86/vector-reduce-fmax.ll
@@ -1,6 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+sse2 | FileCheck %s --check-prefixes=ALL,SSE,SSE2
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+sse4.1 | FileCheck %s --check-prefixes=ALL,SSE,SSE41
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+sse2 | FileCheck %s --check-prefixes=ALL,SSE2
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+sse4.1 | FileCheck %s --check-prefixes=ALL,SSE41
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx | FileCheck %s --check-prefixes=ALL,AVX
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx2 | FileCheck %s --check-prefixes=ALL,AVX
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512f,+avx512bw | FileCheck %s --check-prefixes=ALL,AVX512,AVX512BW
@@ -36,13 +36,10 @@ define float @test_v2f32(<2 x float> %a0) {
 ; SSE41-LABEL: test_v2f32:
 ; SSE41:       # %bb.0:
 ; SSE41-NEXT:    movshdup {{.*#+}} xmm2 = xmm0[1,1,3,3]
-; SSE41-NEXT:    movaps %xmm0, %xmm1
-; SSE41-NEXT:    cmpunordss %xmm0, %xmm1
-; SSE41-NEXT:    movaps %xmm1, %xmm3
-; SSE41-NEXT:    andps %xmm2, %xmm3
-; SSE41-NEXT:    maxss %xmm0, %xmm2
-; SSE41-NEXT:    andnps %xmm2, %xmm1
-; SSE41-NEXT:    orps %xmm3, %xmm1
+; SSE41-NEXT:    movaps %xmm2, %xmm1
+; SSE41-NEXT:    maxss %xmm0, %xmm1
+; SSE41-NEXT:    cmpunordss...
[truncated]

@goldsteinn
Contributor

Do any of these tests actually handle the case where one of the ops is multi-use?

@RKSimon
Collaborator Author

RKSimon commented Feb 5, 2025

Yes, removing the limit enables additional folds; some are actually beneficial, but others cause extra moves.

@goldsteinn
Contributor

Which tests?

@RKSimon
Collaborator Author

RKSimon commented Feb 5, 2025

Changes to sse-minmax.ll - I'll push the diff (tmp commit for review - I'll remove it again later)

@phoebewang
Contributor

We assume moves have negligible cost in the uarch and the total instruction count is not increased. Why is it not preferred?

@RKSimon
Collaborator Author

RKSimon commented Feb 6, 2025

Not all uarchs from the SSE4 era had move elimination, and often the BLENDV instructions were 2 uops or more - so the total uop count could increase if the 3 x 1uop logic ops (+ maybe a 1uop move for the ANDNP mask) were replaced with 3 x 1uop moves + 1 x 2uop BLENDV - that's the worst-case scenario. But we already always take that chance with BLENDV for vector selects; it's just the scalar selects that for some reason we were more cautious about. I was trying to find a compromise, but I'm not against dropping the multiuse limit for SSE4 entirely.
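
For a rough picture of the tradeoff being described, here is an intrinsics-level sketch of the two sequences from test_fmaxf in the diff above (illustrative only; the function names are made up and the uop notes repeat the approximate SSE4-era figures from this thread, not measured data):

#include <smmintrin.h>  // SSE4.1; build with -msse4.1

// SSE2-style select: cmpunordss mask + 3 x 1uop logic ops (AND/ANDN/OR),
// plus register moves for the two-address forms.
static float fmaxf_sse2_style(float xf, float yf) {
  __m128 x = _mm_set_ss(xf), y = _mm_set_ss(yf);
  __m128 nan_mask = _mm_cmpunord_ss(x, x);            // isnan(x) ? all-ones : 0
  __m128 mx = _mm_max_ss(y, x);                       // maxss; NaN-in-x fixed up by the select
  __m128 r = _mm_or_ps(_mm_and_ps(nan_mask, y),
                       _mm_andnot_ps(nan_mask, mx));
  return _mm_cvtss_f32(r);
}

// SSE4.1-style select: one BLENDVPS replaces the AND/ANDN/OR trio.
// On some SSE4-era cores BLENDVPS is 2+ uops and the non-VEX form keeps the
// mask in xmm0, which is where the extra register moves come from.
static float fmaxf_sse41_style(float xf, float yf) {
  __m128 x = _mm_set_ss(xf), y = _mm_set_ss(yf);
  __m128 mx = _mm_max_ss(y, x);
  __m128 nan_mask = _mm_cmpunord_ss(x, x);
  return _mm_cvtss_f32(_mm_blendv_ps(mx, y, nan_mask)); // mask set -> y, else mx
}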

@goldsteinn
Contributor

Either op having one use seems a reasonable compromise to me, although I also think removing the requirement entirely is sensible. My guess is most stuff compiled for SSE4 is probably running on newer hardware (mov elim, fast blendv) and just compiled with SSE4 for compatibility's sake.

@phoebewang
Contributor

+1. If it's OK either way, we can assume newer hardware performance is more important than that of older hardware. We used this strategy when bumping the general tuning.

@RKSimon RKSimon changed the title [X86] LowerSelect - use BLENDV for scalar selection if not all operands are multi use [X86] LowerSelect - use BLENDV for scalar selection on all SSE41+ targets Feb 10, 2025
@RKSimon RKSimon merged commit d9183fd into llvm:main Feb 10, 2025
8 checks passed
@RKSimon RKSimon deleted the sse41-scalar-blendv branch February 10, 2025 11:24
Icohedron pushed a commit to Icohedron/llvm-project that referenced this pull request Feb 11, 2025
[X86] LowerSelect - use BLENDV for scalar selection on all SSE41+ targets (llvm#125853)

When we first began (2015) to lower f32/f64 selects to
X86ISD::BLENDV(scalar_to_vector(),scalar_to_vector(),scalar_to_vector()),
we limited it to AVX targets to avoid issues with SSE41's xmm0
constraint for the condition mask.

Since then we've seen general improvements in TwoAddressInstruction and
better handling of condition commutation for X86ISD::BLENDV nodes, which
should address many of the original concerns of using SSE41 BLENDVPD/S.
In most cases we will replace 3 logic instructions with the BLENDV node
and (up to 3) additional moves. Although the BLENDV is often more
expensive on original SSE41 targets, this should still be an improvement
in a majority of cases.

We also have no equivalent restrictions for SSE41 for v2f64/v4f32 vector
selection.

Fixes llvm#105807

Successfully merging this pull request may close these issues.

Clang's "__builtin_reduce_min" algorithm for float vectors needs improvement when setting "-msse4.1".