[X86] LowerSelect - use BLENDV for scalar selection on all SSE41+ targets #125853

Merged
RKSimon merged 3 commits into llvm:main from sse41-scalar-blendv on Feb 10, 2025

Conversation

RKSimon
Collaborator

@RKSimon RKSimon commented Feb 5, 2025

When we first began (2015) to lower f32/f64 selects to X86ISD::BLENDV(scalar_to_vector(),scalar_to_vector(),scalar_to_vector()), we limited it to AVX targets to avoid issues with SSE41's xmm0 constraint for the condition mask.

Since then we've seen general improvements in TwoAddressInstruction and better handling of condition commutation for X86ISD::BLENDV nodes, which should address many of the original concerns of using SSE41 BLENDVPD/S. In most cases we will replace 3 logic instructions with the BLENDV node and (up to 3) additional moves. Although the BLENDV is often more expensive on original SSE41 targets, this should still be an improvement in a majority of cases.

We also have no equivalent SSE41 restrictions for v2f64/v4f32 vector selection.

Fixes #105807
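
For reference, a minimal C reproducer of the affected pattern (a hypothetical example; the function name and build flags are assumptions, but the IR it produces matches test17 in fp-select-cmp-and.ll in the diff below): a scalar select whose condition is a floating-point compare. Built with -msse4.1 and no AVX, this previously lowered to CMPSS + ANDPS/ANDNPS/ORPS and now lowers to CMPSS + BLENDVPS.

// Hypothetical reduced example of the scalar select pattern this patch affects.
// Build: clang -O2 -msse4.1 example.c -S   (no AVX)
float select_oge(float a, float b, float c, float eps) {
  // fcmp oge + select i1: on SSE4.1-only targets this is now lowered via
  // X86ISD::BLENDV (cmpless + blendvps) instead of andps/andnps/orps.
  return a >= eps ? c : b;
}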

[X86] LowerSelect - use BLENDV for scalar selection if not all operands are multi use

When we first began (2015) to lower f32/f64 selects to X86ISD::BLENDV(scalar_to_vector(),scalar_to_vector(),scalar_to_vector()), we limited it to AVX targets to avoid issues with SSE41's xmm0 constraint for the condition mask.

Since then we've seen general improvements in TwoAddressInstruction and better handling of condition commutation for X86ISD::BLENDV nodes, which should address many of the original concerns of using SSE41 BLENDVPD/S. If we allow SSE41 cases where the condition and another operand have one use, then the extra moves should never be as bad as the avoided logic ops (we still assume SSE41 BLENDV is more expensive than general logic).

Fixes llvm#105807
@llvmbot
Member

llvmbot commented Feb 5, 2025

@llvm/pr-subscribers-backend-x86

Author: Simon Pilgrim (RKSimon)

Changes

When we first began (2015) to lower f32/f64 selects to X86ISD::BLENDV(scalar_to_vector(),scalar_to_vector(),scalar_to_vector()), we limited it to AVX targets to avoid issues with SSE41's xmm0 constraint for the condition mask.

Since then we've seen general improvements in TwoAddressInstruction and better handling of condition commutation for X86ISD::BLENDV nodes, which should address many of the original concerns of using SSE41 BLENDVPD/S. If we allow SSE41 cases where the condition and another operand have one use, then the extra moves should never be as bad as the avoided logic ops (we still assume SSE41 BLENDV is more expensive than general logic).

We also have no equivalent restrictions for SSE41 for v2f64/v4f32 vector selection.

Fixes #105807


Patch is 67.14 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/125853.diff

10 Files Affected:

  • (modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+6-5)
  • (modified) llvm/test/CodeGen/X86/fmaxnum.ll (+80-44)
  • (modified) llvm/test/CodeGen/X86/fminnum.ll (+80-44)
  • (modified) llvm/test/CodeGen/X86/fp-select-cmp-and.ll (+4-6)
  • (modified) llvm/test/CodeGen/X86/setcc-combine.ll (+38-18)
  • (modified) llvm/test/CodeGen/X86/vec_floor.ll (+12-16)
  • (modified) llvm/test/CodeGen/X86/vector-reduce-fmax.ll (+123-135)
  • (modified) llvm/test/CodeGen/X86/vector-reduce-fmaximum.ll (+229-238)
  • (modified) llvm/test/CodeGen/X86/vector-reduce-fmin.ll (+119-130)
  • (modified) llvm/test/CodeGen/X86/vselect-zero.ll (+31-16)
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 6cf6061deba702..bc55d772b86b8d 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -24630,8 +24630,8 @@ SDValue X86TargetLowering::LowerSELECT(SDValue Op, SelectionDAG &DAG) const {
       SDValue Cmp = DAG.getNode(X86ISD::FSETCC, DL, VT, CondOp0, CondOp1,
                                 DAG.getTargetConstant(SSECC, DL, MVT::i8));
 
-      // If we have AVX, we can use a variable vector select (VBLENDV) instead
-      // of 3 logic instructions for size savings and potentially speed.
+      // If we have SSE41/AVX, we can use a variable vector select (VBLENDV)
+      // instead of 3 logic instructions for size savings and potentially speed.
       // Unfortunately, there is no scalar form of VBLENDV.
 
       // If either operand is a +0.0 constant, don't try this. We can expect to
@@ -24641,9 +24641,10 @@ SDValue X86TargetLowering::LowerSELECT(SDValue Op, SelectionDAG &DAG) const {
       // BLENDV was introduced with SSE 4.1, but the 2 register form implicitly
       // uses XMM0 as the selection register. That may need just as many
       // instructions as the AND/ANDN/OR sequence due to register moves, so
-      // don't bother.
-      if (Subtarget.hasAVX() && !isNullFPConstant(Op1) &&
-          !isNullFPConstant(Op2)) {
+      // only attempt this if at least one of ops (+ condition) are one use.
+      if (Subtarget.hasSSE41() && !isNullFPConstant(Op1) &&
+          !isNullFPConstant(Op2) &&
+          (Subtarget.hasAVX() || Op1->hasOneUse() || Op2->hasOneUse())) {
         // Convert to vectors, do a VSELECT, and convert back to scalar.
         // All of the conversions should be optimized away.
         MVT VecVT = VT == MVT::f32 ? MVT::v4f32 : MVT::v2f64;
diff --git a/llvm/test/CodeGen/X86/fmaxnum.ll b/llvm/test/CodeGen/X86/fmaxnum.ll
index 2e1af1e84e0762..d6252cc85e8b45 100644
--- a/llvm/test/CodeGen/X86/fmaxnum.ll
+++ b/llvm/test/CodeGen/X86/fmaxnum.ll
@@ -22,17 +22,26 @@ declare <8 x double> @llvm.maxnum.v8f64(<8 x double>, <8 x double>)
 ; FIXME: As the vector tests show, the SSE run shouldn't need this many moves.
 
 define float @test_fmaxf(float %x, float %y) {
-; SSE-LABEL: test_fmaxf:
-; SSE:       # %bb.0:
-; SSE-NEXT:    movaps %xmm0, %xmm2
-; SSE-NEXT:    cmpunordss %xmm0, %xmm2
-; SSE-NEXT:    movaps %xmm2, %xmm3
-; SSE-NEXT:    andps %xmm1, %xmm3
-; SSE-NEXT:    maxss %xmm0, %xmm1
-; SSE-NEXT:    andnps %xmm1, %xmm2
-; SSE-NEXT:    orps %xmm3, %xmm2
-; SSE-NEXT:    movaps %xmm2, %xmm0
-; SSE-NEXT:    retq
+; SSE2-LABEL: test_fmaxf:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movaps %xmm0, %xmm2
+; SSE2-NEXT:    cmpunordss %xmm0, %xmm2
+; SSE2-NEXT:    movaps %xmm2, %xmm3
+; SSE2-NEXT:    andps %xmm1, %xmm3
+; SSE2-NEXT:    maxss %xmm0, %xmm1
+; SSE2-NEXT:    andnps %xmm1, %xmm2
+; SSE2-NEXT:    orps %xmm3, %xmm2
+; SSE2-NEXT:    movaps %xmm2, %xmm0
+; SSE2-NEXT:    retq
+;
+; SSE4-LABEL: test_fmaxf:
+; SSE4:       # %bb.0:
+; SSE4-NEXT:    movaps %xmm1, %xmm2
+; SSE4-NEXT:    maxss %xmm0, %xmm2
+; SSE4-NEXT:    cmpunordss %xmm0, %xmm0
+; SSE4-NEXT:    blendvps %xmm0, %xmm1, %xmm2
+; SSE4-NEXT:    movaps %xmm2, %xmm0
+; SSE4-NEXT:    retq
 ;
 ; AVX1-LABEL: test_fmaxf:
 ; AVX1:       # %bb.0:
@@ -63,17 +72,26 @@ define float @test_fmaxf_minsize(float %x, float %y) minsize {
 ; FIXME: As the vector tests show, the SSE run shouldn't need this many moves.
 
 define double @test_fmax(double %x, double %y) {
-; SSE-LABEL: test_fmax:
-; SSE:       # %bb.0:
-; SSE-NEXT:    movapd %xmm0, %xmm2
-; SSE-NEXT:    cmpunordsd %xmm0, %xmm2
-; SSE-NEXT:    movapd %xmm2, %xmm3
-; SSE-NEXT:    andpd %xmm1, %xmm3
-; SSE-NEXT:    maxsd %xmm0, %xmm1
-; SSE-NEXT:    andnpd %xmm1, %xmm2
-; SSE-NEXT:    orpd %xmm3, %xmm2
-; SSE-NEXT:    movapd %xmm2, %xmm0
-; SSE-NEXT:    retq
+; SSE2-LABEL: test_fmax:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movapd %xmm0, %xmm2
+; SSE2-NEXT:    cmpunordsd %xmm0, %xmm2
+; SSE2-NEXT:    movapd %xmm2, %xmm3
+; SSE2-NEXT:    andpd %xmm1, %xmm3
+; SSE2-NEXT:    maxsd %xmm0, %xmm1
+; SSE2-NEXT:    andnpd %xmm1, %xmm2
+; SSE2-NEXT:    orpd %xmm3, %xmm2
+; SSE2-NEXT:    movapd %xmm2, %xmm0
+; SSE2-NEXT:    retq
+;
+; SSE4-LABEL: test_fmax:
+; SSE4:       # %bb.0:
+; SSE4-NEXT:    movapd %xmm1, %xmm2
+; SSE4-NEXT:    maxsd %xmm0, %xmm2
+; SSE4-NEXT:    cmpunordsd %xmm0, %xmm0
+; SSE4-NEXT:    blendvpd %xmm0, %xmm1, %xmm2
+; SSE4-NEXT:    movapd %xmm2, %xmm0
+; SSE4-NEXT:    retq
 ;
 ; AVX1-LABEL: test_fmax:
 ; AVX1:       # %bb.0:
@@ -111,17 +129,26 @@ define x86_fp80 @test_fmaxl(x86_fp80 %x, x86_fp80 %y) {
 }
 
 define float @test_intrinsic_fmaxf(float %x, float %y) {
-; SSE-LABEL: test_intrinsic_fmaxf:
-; SSE:       # %bb.0:
-; SSE-NEXT:    movaps %xmm0, %xmm2
-; SSE-NEXT:    cmpunordss %xmm0, %xmm2
-; SSE-NEXT:    movaps %xmm2, %xmm3
-; SSE-NEXT:    andps %xmm1, %xmm3
-; SSE-NEXT:    maxss %xmm0, %xmm1
-; SSE-NEXT:    andnps %xmm1, %xmm2
-; SSE-NEXT:    orps %xmm3, %xmm2
-; SSE-NEXT:    movaps %xmm2, %xmm0
-; SSE-NEXT:    retq
+; SSE2-LABEL: test_intrinsic_fmaxf:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movaps %xmm0, %xmm2
+; SSE2-NEXT:    cmpunordss %xmm0, %xmm2
+; SSE2-NEXT:    movaps %xmm2, %xmm3
+; SSE2-NEXT:    andps %xmm1, %xmm3
+; SSE2-NEXT:    maxss %xmm0, %xmm1
+; SSE2-NEXT:    andnps %xmm1, %xmm2
+; SSE2-NEXT:    orps %xmm3, %xmm2
+; SSE2-NEXT:    movaps %xmm2, %xmm0
+; SSE2-NEXT:    retq
+;
+; SSE4-LABEL: test_intrinsic_fmaxf:
+; SSE4:       # %bb.0:
+; SSE4-NEXT:    movaps %xmm1, %xmm2
+; SSE4-NEXT:    maxss %xmm0, %xmm2
+; SSE4-NEXT:    cmpunordss %xmm0, %xmm0
+; SSE4-NEXT:    blendvps %xmm0, %xmm1, %xmm2
+; SSE4-NEXT:    movaps %xmm2, %xmm0
+; SSE4-NEXT:    retq
 ;
 ; AVX1-LABEL: test_intrinsic_fmaxf:
 ; AVX1:       # %bb.0:
@@ -142,17 +169,26 @@ define float @test_intrinsic_fmaxf(float %x, float %y) {
 }
 
 define double @test_intrinsic_fmax(double %x, double %y) {
-; SSE-LABEL: test_intrinsic_fmax:
-; SSE:       # %bb.0:
-; SSE-NEXT:    movapd %xmm0, %xmm2
-; SSE-NEXT:    cmpunordsd %xmm0, %xmm2
-; SSE-NEXT:    movapd %xmm2, %xmm3
-; SSE-NEXT:    andpd %xmm1, %xmm3
-; SSE-NEXT:    maxsd %xmm0, %xmm1
-; SSE-NEXT:    andnpd %xmm1, %xmm2
-; SSE-NEXT:    orpd %xmm3, %xmm2
-; SSE-NEXT:    movapd %xmm2, %xmm0
-; SSE-NEXT:    retq
+; SSE2-LABEL: test_intrinsic_fmax:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movapd %xmm0, %xmm2
+; SSE2-NEXT:    cmpunordsd %xmm0, %xmm2
+; SSE2-NEXT:    movapd %xmm2, %xmm3
+; SSE2-NEXT:    andpd %xmm1, %xmm3
+; SSE2-NEXT:    maxsd %xmm0, %xmm1
+; SSE2-NEXT:    andnpd %xmm1, %xmm2
+; SSE2-NEXT:    orpd %xmm3, %xmm2
+; SSE2-NEXT:    movapd %xmm2, %xmm0
+; SSE2-NEXT:    retq
+;
+; SSE4-LABEL: test_intrinsic_fmax:
+; SSE4:       # %bb.0:
+; SSE4-NEXT:    movapd %xmm1, %xmm2
+; SSE4-NEXT:    maxsd %xmm0, %xmm2
+; SSE4-NEXT:    cmpunordsd %xmm0, %xmm0
+; SSE4-NEXT:    blendvpd %xmm0, %xmm1, %xmm2
+; SSE4-NEXT:    movapd %xmm2, %xmm0
+; SSE4-NEXT:    retq
 ;
 ; AVX1-LABEL: test_intrinsic_fmax:
 ; AVX1:       # %bb.0:
diff --git a/llvm/test/CodeGen/X86/fminnum.ll b/llvm/test/CodeGen/X86/fminnum.ll
index 1290a7b8191067..0ef8fdec33d937 100644
--- a/llvm/test/CodeGen/X86/fminnum.ll
+++ b/llvm/test/CodeGen/X86/fminnum.ll
@@ -22,17 +22,26 @@ declare <8 x double> @llvm.minnum.v8f64(<8 x double>, <8 x double>)
 ; FIXME: As the vector tests show, the SSE run shouldn't need this many moves.
 
 define float @test_fminf(float %x, float %y) {
-; SSE-LABEL: test_fminf:
-; SSE:       # %bb.0:
-; SSE-NEXT:    movaps %xmm0, %xmm2
-; SSE-NEXT:    cmpunordss %xmm0, %xmm2
-; SSE-NEXT:    movaps %xmm2, %xmm3
-; SSE-NEXT:    andps %xmm1, %xmm3
-; SSE-NEXT:    minss %xmm0, %xmm1
-; SSE-NEXT:    andnps %xmm1, %xmm2
-; SSE-NEXT:    orps %xmm3, %xmm2
-; SSE-NEXT:    movaps %xmm2, %xmm0
-; SSE-NEXT:    retq
+; SSE2-LABEL: test_fminf:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movaps %xmm0, %xmm2
+; SSE2-NEXT:    cmpunordss %xmm0, %xmm2
+; SSE2-NEXT:    movaps %xmm2, %xmm3
+; SSE2-NEXT:    andps %xmm1, %xmm3
+; SSE2-NEXT:    minss %xmm0, %xmm1
+; SSE2-NEXT:    andnps %xmm1, %xmm2
+; SSE2-NEXT:    orps %xmm3, %xmm2
+; SSE2-NEXT:    movaps %xmm2, %xmm0
+; SSE2-NEXT:    retq
+;
+; SSE4-LABEL: test_fminf:
+; SSE4:       # %bb.0:
+; SSE4-NEXT:    movaps %xmm1, %xmm2
+; SSE4-NEXT:    minss %xmm0, %xmm2
+; SSE4-NEXT:    cmpunordss %xmm0, %xmm0
+; SSE4-NEXT:    blendvps %xmm0, %xmm1, %xmm2
+; SSE4-NEXT:    movaps %xmm2, %xmm0
+; SSE4-NEXT:    retq
 ;
 ; AVX1-LABEL: test_fminf:
 ; AVX1:       # %bb.0:
@@ -63,17 +72,26 @@ define float @test_fminf_minsize(float %x, float %y) minsize {
 ; FIXME: As the vector tests show, the SSE run shouldn't need this many moves.
 
 define double @test_fmin(double %x, double %y) {
-; SSE-LABEL: test_fmin:
-; SSE:       # %bb.0:
-; SSE-NEXT:    movapd %xmm0, %xmm2
-; SSE-NEXT:    cmpunordsd %xmm0, %xmm2
-; SSE-NEXT:    movapd %xmm2, %xmm3
-; SSE-NEXT:    andpd %xmm1, %xmm3
-; SSE-NEXT:    minsd %xmm0, %xmm1
-; SSE-NEXT:    andnpd %xmm1, %xmm2
-; SSE-NEXT:    orpd %xmm3, %xmm2
-; SSE-NEXT:    movapd %xmm2, %xmm0
-; SSE-NEXT:    retq
+; SSE2-LABEL: test_fmin:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movapd %xmm0, %xmm2
+; SSE2-NEXT:    cmpunordsd %xmm0, %xmm2
+; SSE2-NEXT:    movapd %xmm2, %xmm3
+; SSE2-NEXT:    andpd %xmm1, %xmm3
+; SSE2-NEXT:    minsd %xmm0, %xmm1
+; SSE2-NEXT:    andnpd %xmm1, %xmm2
+; SSE2-NEXT:    orpd %xmm3, %xmm2
+; SSE2-NEXT:    movapd %xmm2, %xmm0
+; SSE2-NEXT:    retq
+;
+; SSE4-LABEL: test_fmin:
+; SSE4:       # %bb.0:
+; SSE4-NEXT:    movapd %xmm1, %xmm2
+; SSE4-NEXT:    minsd %xmm0, %xmm2
+; SSE4-NEXT:    cmpunordsd %xmm0, %xmm0
+; SSE4-NEXT:    blendvpd %xmm0, %xmm1, %xmm2
+; SSE4-NEXT:    movapd %xmm2, %xmm0
+; SSE4-NEXT:    retq
 ;
 ; AVX1-LABEL: test_fmin:
 ; AVX1:       # %bb.0:
@@ -111,17 +129,26 @@ define x86_fp80 @test_fminl(x86_fp80 %x, x86_fp80 %y) {
 }
 
 define float @test_intrinsic_fminf(float %x, float %y) {
-; SSE-LABEL: test_intrinsic_fminf:
-; SSE:       # %bb.0:
-; SSE-NEXT:    movaps %xmm0, %xmm2
-; SSE-NEXT:    cmpunordss %xmm0, %xmm2
-; SSE-NEXT:    movaps %xmm2, %xmm3
-; SSE-NEXT:    andps %xmm1, %xmm3
-; SSE-NEXT:    minss %xmm0, %xmm1
-; SSE-NEXT:    andnps %xmm1, %xmm2
-; SSE-NEXT:    orps %xmm3, %xmm2
-; SSE-NEXT:    movaps %xmm2, %xmm0
-; SSE-NEXT:    retq
+; SSE2-LABEL: test_intrinsic_fminf:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movaps %xmm0, %xmm2
+; SSE2-NEXT:    cmpunordss %xmm0, %xmm2
+; SSE2-NEXT:    movaps %xmm2, %xmm3
+; SSE2-NEXT:    andps %xmm1, %xmm3
+; SSE2-NEXT:    minss %xmm0, %xmm1
+; SSE2-NEXT:    andnps %xmm1, %xmm2
+; SSE2-NEXT:    orps %xmm3, %xmm2
+; SSE2-NEXT:    movaps %xmm2, %xmm0
+; SSE2-NEXT:    retq
+;
+; SSE4-LABEL: test_intrinsic_fminf:
+; SSE4:       # %bb.0:
+; SSE4-NEXT:    movaps %xmm1, %xmm2
+; SSE4-NEXT:    minss %xmm0, %xmm2
+; SSE4-NEXT:    cmpunordss %xmm0, %xmm0
+; SSE4-NEXT:    blendvps %xmm0, %xmm1, %xmm2
+; SSE4-NEXT:    movaps %xmm2, %xmm0
+; SSE4-NEXT:    retq
 ;
 ; AVX1-LABEL: test_intrinsic_fminf:
 ; AVX1:       # %bb.0:
@@ -142,17 +169,26 @@ define float @test_intrinsic_fminf(float %x, float %y) {
 }
 
 define double @test_intrinsic_fmin(double %x, double %y) {
-; SSE-LABEL: test_intrinsic_fmin:
-; SSE:       # %bb.0:
-; SSE-NEXT:    movapd %xmm0, %xmm2
-; SSE-NEXT:    cmpunordsd %xmm0, %xmm2
-; SSE-NEXT:    movapd %xmm2, %xmm3
-; SSE-NEXT:    andpd %xmm1, %xmm3
-; SSE-NEXT:    minsd %xmm0, %xmm1
-; SSE-NEXT:    andnpd %xmm1, %xmm2
-; SSE-NEXT:    orpd %xmm3, %xmm2
-; SSE-NEXT:    movapd %xmm2, %xmm0
-; SSE-NEXT:    retq
+; SSE2-LABEL: test_intrinsic_fmin:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movapd %xmm0, %xmm2
+; SSE2-NEXT:    cmpunordsd %xmm0, %xmm2
+; SSE2-NEXT:    movapd %xmm2, %xmm3
+; SSE2-NEXT:    andpd %xmm1, %xmm3
+; SSE2-NEXT:    minsd %xmm0, %xmm1
+; SSE2-NEXT:    andnpd %xmm1, %xmm2
+; SSE2-NEXT:    orpd %xmm3, %xmm2
+; SSE2-NEXT:    movapd %xmm2, %xmm0
+; SSE2-NEXT:    retq
+;
+; SSE4-LABEL: test_intrinsic_fmin:
+; SSE4:       # %bb.0:
+; SSE4-NEXT:    movapd %xmm1, %xmm2
+; SSE4-NEXT:    minsd %xmm0, %xmm2
+; SSE4-NEXT:    cmpunordsd %xmm0, %xmm0
+; SSE4-NEXT:    blendvpd %xmm0, %xmm1, %xmm2
+; SSE4-NEXT:    movapd %xmm2, %xmm0
+; SSE4-NEXT:    retq
 ;
 ; AVX1-LABEL: test_intrinsic_fmin:
 ; AVX1:       # %bb.0:
diff --git a/llvm/test/CodeGen/X86/fp-select-cmp-and.ll b/llvm/test/CodeGen/X86/fp-select-cmp-and.ll
index 0f6159d36ea818..1d006f725ca34d 100644
--- a/llvm/test/CodeGen/X86/fp-select-cmp-and.ll
+++ b/llvm/test/CodeGen/X86/fp-select-cmp-and.ll
@@ -189,10 +189,9 @@ define float @test17(float %a, float %b, float %c, float %eps) {
 ; CHECK-LABEL: test17:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    cmpless %xmm0, %xmm3
-; CHECK-NEXT:    andps %xmm3, %xmm2
-; CHECK-NEXT:    andnps %xmm1, %xmm3
-; CHECK-NEXT:    orps %xmm2, %xmm3
 ; CHECK-NEXT:    movaps %xmm3, %xmm0
+; CHECK-NEXT:    blendvps %xmm0, %xmm2, %xmm1
+; CHECK-NEXT:    movaps %xmm1, %xmm0
 ; CHECK-NEXT:    retq
   %cmp = fcmp oge float %a, %eps
   %cond = select i1 %cmp, float %c, float %b
@@ -203,10 +202,9 @@ define double @test18(double %a, double %b, double %c, double %eps) {
 ; CHECK-LABEL: test18:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    cmplesd %xmm0, %xmm3
-; CHECK-NEXT:    andpd %xmm3, %xmm2
-; CHECK-NEXT:    andnpd %xmm1, %xmm3
-; CHECK-NEXT:    orpd %xmm2, %xmm3
 ; CHECK-NEXT:    movapd %xmm3, %xmm0
+; CHECK-NEXT:    blendvpd %xmm0, %xmm2, %xmm1
+; CHECK-NEXT:    movapd %xmm1, %xmm0
 ; CHECK-NEXT:    retq
   %cmp = fcmp oge double %a, %eps
   %cond = select i1 %cmp, double %c, double %b
diff --git a/llvm/test/CodeGen/X86/setcc-combine.ll b/llvm/test/CodeGen/X86/setcc-combine.ll
index e723569bda8a12..f526db00df6062 100644
--- a/llvm/test/CodeGen/X86/setcc-combine.ll
+++ b/llvm/test/CodeGen/X86/setcc-combine.ll
@@ -463,14 +463,23 @@ define <2 x double> @oge(<2 x double> %x) {
 ; negative test - don't create an fneg to replace 0.0 operand
 
 define double @ogt_no_fneg(double %x, double %y) {
-; CHECK-LABEL: ogt_no_fneg:
-; CHECK:       # %bb.0:
-; CHECK-NEXT:    xorpd %xmm2, %xmm2
-; CHECK-NEXT:    cmpltsd %xmm0, %xmm2
-; CHECK-NEXT:    andpd %xmm2, %xmm0
-; CHECK-NEXT:    andnpd %xmm1, %xmm2
-; CHECK-NEXT:    orpd %xmm2, %xmm0
-; CHECK-NEXT:    retq
+; SSE2-LABEL: ogt_no_fneg:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    xorpd %xmm2, %xmm2
+; SSE2-NEXT:    cmpltsd %xmm0, %xmm2
+; SSE2-NEXT:    andpd %xmm2, %xmm0
+; SSE2-NEXT:    andnpd %xmm1, %xmm2
+; SSE2-NEXT:    orpd %xmm2, %xmm0
+; SSE2-NEXT:    retq
+;
+; SSE41-LABEL: ogt_no_fneg:
+; SSE41:       # %bb.0:
+; SSE41-NEXT:    movapd %xmm0, %xmm2
+; SSE41-NEXT:    xorpd %xmm0, %xmm0
+; SSE41-NEXT:    cmpltsd %xmm2, %xmm0
+; SSE41-NEXT:    blendvpd %xmm0, %xmm2, %xmm1
+; SSE41-NEXT:    movapd %xmm1, %xmm0
+; SSE41-NEXT:    retq
   %cmp = fcmp ogt double %x, 0.0
   %r = select i1 %cmp, double %x, double %y
   ret double %r
@@ -479,16 +488,27 @@ define double @ogt_no_fneg(double %x, double %y) {
 ; negative test - can't change the setcc for non-zero constant
 
 define double @ogt_no_zero(double %x) {
-; CHECK-LABEL: ogt_no_zero:
-; CHECK:       # %bb.0:
-; CHECK-NEXT:    movapd {{.*#+}} xmm1 = [-0.0E+0,-0.0E+0]
-; CHECK-NEXT:    xorpd %xmm0, %xmm1
-; CHECK-NEXT:    movsd {{.*#+}} xmm2 = [1.0E+0,0.0E+0]
-; CHECK-NEXT:    cmpltsd %xmm0, %xmm2
-; CHECK-NEXT:    andpd %xmm2, %xmm0
-; CHECK-NEXT:    andnpd %xmm1, %xmm2
-; CHECK-NEXT:    orpd %xmm2, %xmm0
-; CHECK-NEXT:    retq
+; SSE2-LABEL: ogt_no_zero:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movapd {{.*#+}} xmm1 = [-0.0E+0,-0.0E+0]
+; SSE2-NEXT:    xorpd %xmm0, %xmm1
+; SSE2-NEXT:    movsd {{.*#+}} xmm2 = [1.0E+0,0.0E+0]
+; SSE2-NEXT:    cmpltsd %xmm0, %xmm2
+; SSE2-NEXT:    andpd %xmm2, %xmm0
+; SSE2-NEXT:    andnpd %xmm1, %xmm2
+; SSE2-NEXT:    orpd %xmm2, %xmm0
+; SSE2-NEXT:    retq
+;
+; SSE41-LABEL: ogt_no_zero:
+; SSE41:       # %bb.0:
+; SSE41-NEXT:    movapd %xmm0, %xmm1
+; SSE41-NEXT:    movapd {{.*#+}} xmm2 = [-0.0E+0,-0.0E+0]
+; SSE41-NEXT:    xorpd %xmm0, %xmm2
+; SSE41-NEXT:    movsd {{.*#+}} xmm0 = [1.0E+0,0.0E+0]
+; SSE41-NEXT:    cmpltsd %xmm1, %xmm0
+; SSE41-NEXT:    blendvpd %xmm0, %xmm1, %xmm2
+; SSE41-NEXT:    movapd %xmm2, %xmm0
+; SSE41-NEXT:    retq
   %neg = fneg double %x
   %cmp = fcmp ogt double %x, 1.0
   %r = select i1 %cmp, double %x, double %neg
diff --git a/llvm/test/CodeGen/X86/vec_floor.ll b/llvm/test/CodeGen/X86/vec_floor.ll
index 65cde6ac91106b..abb85ac83464cb 100644
--- a/llvm/test/CodeGen/X86/vec_floor.ll
+++ b/llvm/test/CodeGen/X86/vec_floor.ll
@@ -1679,10 +1679,9 @@ define <4 x float> @floor_mask_ss_mask8(<4 x float> %x, <4 x float> %y, <4 x flo
 ; SSE41:       ## %bb.0:
 ; SSE41-NEXT:    roundss $9, %xmm0, %xmm3
 ; SSE41-NEXT:    cmpeqss %xmm1, %xmm0
-; SSE41-NEXT:    andps %xmm0, %xmm3
-; SSE41-NEXT:    andnps %xmm2, %xmm0
-; SSE41-NEXT:    orps %xmm3, %xmm0
-; SSE41-NEXT:    blendps {{.*#+}} xmm0 = xmm0[0],xmm1[1,2,3]
+; SSE41-NEXT:    blendvps %xmm0, %xmm3, %xmm2
+; SSE41-NEXT:    blendps {{.*#+}} xmm2 = xmm2[0],xmm1[1,2,3]
+; SSE41-NEXT:    movaps %xmm2, %xmm0
 ; SSE41-NEXT:    retq
 ;
 ; AVX-LABEL: floor_mask_ss_mask8:
@@ -1747,10 +1746,9 @@ define <2 x double> @floor_mask_sd_mask8(<2 x double> %x, <2 x double> %y, <2 x
 ; SSE41:       ## %bb.0:
 ; SSE41-NEXT:    roundsd $9, %xmm0, %xmm3
 ; SSE41-NEXT:    cmpeqsd %xmm1, %xmm0
-; SSE41-NEXT:    andpd %xmm0, %xmm3
-; SSE41-NEXT:    andnpd %xmm2, %xmm0
-; SSE41-NEXT:    orpd %xmm3, %xmm0
-; SSE41-NEXT:    blendpd {{.*#+}} xmm0 = xmm0[0],xmm1[1]
+; SSE41-NEXT:    blendvpd %xmm0, %xmm3, %xmm2
+; SSE41-NEXT:    blendpd {{.*#+}} xmm2 = xmm2[0],xmm1[1]
+; SSE41-NEXT:    movapd %xmm2, %xmm0
 ; SSE41-NEXT:    retq
 ;
 ; AVX-LABEL: floor_mask_sd_mask8:
@@ -2671,10 +2669,9 @@ define <4 x float> @ceil_mask_ss_mask8(<4 x float> %x, <4 x float> %y, <4 x floa
 ; SSE41:       ## %bb.0:
 ; SSE41-NEXT:    roundss $10, %xmm0, %xmm3
 ; SSE41-NEXT:    cmpeqss %xmm1, %xmm0
-; SSE41-NEXT:    andps %xmm0, %xmm3
-; SSE41-NEXT:    andnps %xmm2, %xmm0
-; SSE41-NEXT:    orps %xmm3, %xmm0
-; SSE41-NEXT:    blendps {{.*#+}} xmm0 = xmm0[0],xmm1[1,2,3]
+; SSE41-NEXT:    blendvps %xmm0, %xmm3, %xmm2
+; SSE41-NEXT:    blendps {{.*#+}} xmm2 = xmm2[0],xmm1[1,2,3]
+; SSE41-NEXT:    movaps %xmm2, %xmm0
 ; SSE41-NEXT:    retq
 ;
 ; AVX-LABEL: ceil_mask_ss_mask8:
@@ -2739,10 +2736,9 @@ define <2 x double> @ceil_mask_sd_mask8(<2 x double> %x, <2 x double> %y, <2 x d
 ; SSE41:       ## %bb.0:
 ; SSE41-NEXT:    roundsd $10, %xmm0, %xmm3
 ; SSE41-NEXT:    cmpeqsd %xmm1, %xmm0
-; SSE41-NEXT:    andpd %xmm0, %xmm3
-; SSE41-NEXT:    andnpd %xmm2, %xmm0
-; SSE41-NEXT:    orpd %xmm3, %xmm0
-; SSE41-NEXT:    blendpd {{.*#+}} xmm0 = xmm0[0],xmm1[1]
+; SSE41-NEXT:    blendvpd %xmm0, %xmm3, %xmm2
+; SSE41-NEXT:    blendpd {{.*#+}} xmm2 = xmm2[0],xmm1[1]
+; SSE41-NEXT:    movapd %xmm2, %xmm0
 ; SSE41-NEXT:    retq
 ;
 ; AVX-LABEL: ceil_mask_sd_mask8:
diff --git a/llvm/test/CodeGen/X86/vector-reduce-fmax.ll b/llvm/test/CodeGen/X86/vector-reduce-fmax.ll
index fe2c41f57cfab1..7048b98227620f 100644
--- a/llvm/test/CodeGen/X86/vector-reduce-fmax.ll
+++ b/llvm/test/CodeGen/X86/vector-reduce-fmax.ll
@@ -1,6 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+sse2 | FileCheck %s --check-prefixes=ALL,SSE,SSE2
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+sse4.1 | FileCheck %s --check-prefixes=ALL,SSE,SSE41
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+sse2 | FileCheck %s --check-prefixes=ALL,SSE2
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+sse4.1 | FileCheck %s --check-prefixes=ALL,SSE41
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx | FileCheck %s --check-prefixes=ALL,AVX
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx2 | FileCheck %s --check-prefixes=ALL,AVX
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512f,+avx512bw | FileCheck %s --check-prefixes=ALL,AVX512,AVX512BW
@@ -36,13 +36,10 @@ define float @test_v2f32(<2 x float> %a0) {
 ; SSE41-LABEL: test_v2f32:
 ; SSE41:       # %bb.0:
 ; SSE41-NEXT:    movshdup {{.*#+}} xmm2 = xmm0[1,1,3,3]
-; SSE41-NEXT:    movaps %xmm0, %xmm1
-; SSE41-NEXT:    cmpunordss %xmm0, %xmm1
-; SSE41-NEXT:    movaps %xmm1, %xmm3
-; SSE41-NEXT:    andps %xmm2, %xmm3
-; SSE41-NEXT:    maxss %xmm0, %xmm2
-; SSE41-NEXT:    andnps %xmm2, %xmm1
-; SSE41-NEXT:    orps %xmm3, %xmm1
+; SSE41-NEXT:    movaps %xmm2, %xmm1
+; SSE41-NEXT:    maxss %xmm0, %xmm1
+; SSE41-NEXT:    cmpunordss...
[truncated]

@goldsteinn
Contributor

Do any of these tests actually handle the case where one of the ops is multi-use?

@RKSimon
Collaborator Author

RKSimon commented Feb 5, 2025

Yes, removing the limit enables additional folds; some are actually beneficial, but others cause extra moves.

@goldsteinn
Contributor

Which tests?

@RKSimon
Collaborator Author

RKSimon commented Feb 5, 2025

Changes to sse-minmax.ll - I'll push the diff (tmp commit for review - I'll remove it again later)

@phoebewang
Contributor

We assume moves have negligible cost in the uarch and the total instruction count is not increased. Why is it not preferred?

@RKSimon
Collaborator Author

RKSimon commented Feb 6, 2025

Not all uarchs from the SSE4 era had move elimination, and often the BLENDV instructions were 2 uops or more - so the total uop count could increase if the 3 x 1uop logic ops (+ maybe a 1uop move for the ANDNP mask) were replaced with 3 x 1uop moves + 1 x 2uop BLENDV - that's the worst-case scenario. But we already always take that chance with BLENDV for vector selects; it's just the scalar selects that for some reason we were more cautious about. I was trying to find a compromise, but I'm not against dropping the multiuse limit for SSE4 entirely.
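
For a rough picture of the tradeoff being described, here is an intrinsics-level sketch of the two sequences from test_fmaxf in the diff above (illustrative only; the function names are made up and the uop notes repeat the approximate SSE4-era figures from this thread, not measured data):

#include <smmintrin.h>  // SSE4.1; build with -msse4.1

// SSE2-style select: cmpunordss mask + 3 x 1uop logic ops (AND/ANDN/OR),
// plus register moves for the two-address forms.
static float fmaxf_sse2_style(float xf, float yf) {
  __m128 x = _mm_set_ss(xf), y = _mm_set_ss(yf);
  __m128 nan_mask = _mm_cmpunord_ss(x, x);            // isnan(x) ? all-ones : 0
  __m128 mx = _mm_max_ss(y, x);                       // maxss; NaN-in-x fixed up by the select
  __m128 r = _mm_or_ps(_mm_and_ps(nan_mask, y),
                       _mm_andnot_ps(nan_mask, mx));
  return _mm_cvtss_f32(r);
}

// SSE4.1-style select: one BLENDVPS replaces the AND/ANDN/OR trio.
// On some SSE4-era cores BLENDVPS is 2+ uops and the non-VEX form keeps the
// mask in xmm0, which is where the extra register moves come from.
static float fmaxf_sse41_style(float xf, float yf) {
  __m128 x = _mm_set_ss(xf), y = _mm_set_ss(yf);
  __m128 mx = _mm_max_ss(y, x);
  __m128 nan_mask = _mm_cmpunord_ss(x, x);
  return _mm_cvtss_f32(_mm_blendv_ps(mx, y, nan_mask)); // mask set -> y, else mx
}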

@goldsteinn
Contributor

Either op having one use seems a reasonable compromise to me, although I also think removing the requirement entirely is sensible. My guess is most stuff compiled for SSE4 is probably running on newer hardware (mov elim, fast blendv) and just compiled with SSE4 for compatibility's sake.

@phoebewang
Contributor

+1. If it's OK either way, we can assume newer hardware performance is more important than that of older hardware. We used this strategy when bumping the general tuning.

@RKSimon RKSimon changed the title [X86] LowerSelect - use BLENDV for scalar selection if not all operands are multi use [X86] LowerSelect - use BLENDV for scalar selection on all SSE41+ targets Feb 10, 2025
@RKSimon RKSimon merged commit d9183fd into llvm:main Feb 10, 2025
8 checks passed
@RKSimon RKSimon deleted the sse41-scalar-blendv branch February 10, 2025 11:24
Icohedron pushed a commit to Icohedron/llvm-project that referenced this pull request Feb 11, 2025
[X86] LowerSelect - use BLENDV for scalar selection on all SSE41+ targets (llvm#125853)

When we first began (2015) to lower f32/f64 selects to
X86ISD::BLENDV(scalar_to_vector(),scalar_to_vector(),scalar_to_vector()),
we limited it to AVX targets to avoid issues with SSE41's xmm0
constraint for the condition mask.

Since then we've seen general improvements in TwoAddressInstruction and
better handling of condition commutation for X86ISD::BLENDV nodes, which
should address many of the original concerns of using SSE41 BLENDVPD/S.
In most cases we will replace 3 logic instructions with the BLENDV node
and (up to 3) additional moves. Although the BLENDV is often more
expensive on original SSE41 targets, this should still be an improvement
in a majority of cases.

We also have no equivalent restrictions for SSE41 for v2f64/v4f32 vector
selection.

Fixes llvm#105807

Successfully merging this pull request may close these issues.

Clang's "__builtin_reduce_min" algorithm for float vectors needs improvement when setting "-msse4.1".