[X86] LowerSelect - use BLENDV for scalar selection on all SSE41+ targets #125853
Conversation
…ds are multi use

When we first began (2015) to lower f32/f64 selects to X86ISD::BLENDV(scalar_to_vector(),scalar_to_vector(),scalar_to_vector()), we limited it to AVX targets to avoid issues with SSE41's xmm0 constraint for the condition mask. Since then we've seen general improvements in TwoAddressInstruction and better handling of condition commutation for X86ISD::BLENDV nodes, which should address many of the original concerns of using SSE41 BLENDVPD/S. If we allow SSE41 cases where the condition and another operand have one use, then the extra moves should never be as bad as the avoided logic ops (we still assume SSE41 BLENDV is more expensive than general logic).

Fixes llvm#105807
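For illustration, a reduced .ll test showing the pattern this changes - it mirrors test18 in the fp-select-cmp-and.ll hunk below, though the function name and RUN line here are assumptions rather than part of the patch. On a non-AVX SSE4.1 target the scalar FP select can now lower to a compare plus BLENDVPD instead of the ANDPD/ANDNPD/ORPD sequence:

; Illustrative sketch; RUN line and function name are assumed, checks mirror the updated test18.
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+sse4.1 | FileCheck %s
define double @select_oge(double %a, double %b, double %c, double %eps) {
; CHECK: cmplesd
; CHECK: blendvpd
  %cmp = fcmp oge double %a, %eps
  %cond = select i1 %cmp, double %c, double %b
  ret double %cond
}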
@llvm/pr-subscribers-backend-x86

Author: Simon Pilgrim (RKSimon)

Changes

When we first began (2015) to lower f32/f64 selects to X86ISD::BLENDV(scalar_to_vector(),scalar_to_vector(),scalar_to_vector()), we limited it to AVX targets to avoid issues with SSE41's xmm0 constraint for the condition mask. Since then we've seen general improvements in TwoAddressInstruction and better handling of condition commutation for X86ISD::BLENDV nodes, which should address many of the original concerns of using SSE41 BLENDVPD/S. If we allow SSE41 cases where the condition and another operand have one use, then the extra moves should never be as bad as the avoided logic ops (we still assume SSE41 BLENDV is more expensive than general logic). We also have no equivalent restrictions for SSE41 for v2f64/v4f32 vector selection.

Fixes #105807

Patch is 67.14 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/125853.diff

10 Files Affected:
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 6cf6061deba702..bc55d772b86b8d 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -24630,8 +24630,8 @@ SDValue X86TargetLowering::LowerSELECT(SDValue Op, SelectionDAG &DAG) const {
SDValue Cmp = DAG.getNode(X86ISD::FSETCC, DL, VT, CondOp0, CondOp1,
DAG.getTargetConstant(SSECC, DL, MVT::i8));
- // If we have AVX, we can use a variable vector select (VBLENDV) instead
- // of 3 logic instructions for size savings and potentially speed.
+ // If we have SSE41/AVX, we can use a variable vector select (VBLENDV)
+ // instead of 3 logic instructions for size savings and potentially speed.
// Unfortunately, there is no scalar form of VBLENDV.
// If either operand is a +0.0 constant, don't try this. We can expect to
@@ -24641,9 +24641,10 @@ SDValue X86TargetLowering::LowerSELECT(SDValue Op, SelectionDAG &DAG) const {
// BLENDV was introduced with SSE 4.1, but the 2 register form implicitly
// uses XMM0 as the selection register. That may need just as many
// instructions as the AND/ANDN/OR sequence due to register moves, so
- // don't bother.
- if (Subtarget.hasAVX() && !isNullFPConstant(Op1) &&
- !isNullFPConstant(Op2)) {
+ // only attempt this if at least one of ops (+ condition) are one use.
+ if (Subtarget.hasSSE41() && !isNullFPConstant(Op1) &&
+ !isNullFPConstant(Op2) &&
+ (Subtarget.hasAVX() || Op1->hasOneUse() || Op2->hasOneUse())) {
// Convert to vectors, do a VSELECT, and convert back to scalar.
// All of the conversions should be optimized away.
MVT VecVT = VT == MVT::f32 ? MVT::v4f32 : MVT::v2f64;
diff --git a/llvm/test/CodeGen/X86/fmaxnum.ll b/llvm/test/CodeGen/X86/fmaxnum.ll
index 2e1af1e84e0762..d6252cc85e8b45 100644
--- a/llvm/test/CodeGen/X86/fmaxnum.ll
+++ b/llvm/test/CodeGen/X86/fmaxnum.ll
@@ -22,17 +22,26 @@ declare <8 x double> @llvm.maxnum.v8f64(<8 x double>, <8 x double>)
; FIXME: As the vector tests show, the SSE run shouldn't need this many moves.
define float @test_fmaxf(float %x, float %y) {
-; SSE-LABEL: test_fmaxf:
-; SSE: # %bb.0:
-; SSE-NEXT: movaps %xmm0, %xmm2
-; SSE-NEXT: cmpunordss %xmm0, %xmm2
-; SSE-NEXT: movaps %xmm2, %xmm3
-; SSE-NEXT: andps %xmm1, %xmm3
-; SSE-NEXT: maxss %xmm0, %xmm1
-; SSE-NEXT: andnps %xmm1, %xmm2
-; SSE-NEXT: orps %xmm3, %xmm2
-; SSE-NEXT: movaps %xmm2, %xmm0
-; SSE-NEXT: retq
+; SSE2-LABEL: test_fmaxf:
+; SSE2: # %bb.0:
+; SSE2-NEXT: movaps %xmm0, %xmm2
+; SSE2-NEXT: cmpunordss %xmm0, %xmm2
+; SSE2-NEXT: movaps %xmm2, %xmm3
+; SSE2-NEXT: andps %xmm1, %xmm3
+; SSE2-NEXT: maxss %xmm0, %xmm1
+; SSE2-NEXT: andnps %xmm1, %xmm2
+; SSE2-NEXT: orps %xmm3, %xmm2
+; SSE2-NEXT: movaps %xmm2, %xmm0
+; SSE2-NEXT: retq
+;
+; SSE4-LABEL: test_fmaxf:
+; SSE4: # %bb.0:
+; SSE4-NEXT: movaps %xmm1, %xmm2
+; SSE4-NEXT: maxss %xmm0, %xmm2
+; SSE4-NEXT: cmpunordss %xmm0, %xmm0
+; SSE4-NEXT: blendvps %xmm0, %xmm1, %xmm2
+; SSE4-NEXT: movaps %xmm2, %xmm0
+; SSE4-NEXT: retq
;
; AVX1-LABEL: test_fmaxf:
; AVX1: # %bb.0:
@@ -63,17 +72,26 @@ define float @test_fmaxf_minsize(float %x, float %y) minsize {
; FIXME: As the vector tests show, the SSE run shouldn't need this many moves.
define double @test_fmax(double %x, double %y) {
-; SSE-LABEL: test_fmax:
-; SSE: # %bb.0:
-; SSE-NEXT: movapd %xmm0, %xmm2
-; SSE-NEXT: cmpunordsd %xmm0, %xmm2
-; SSE-NEXT: movapd %xmm2, %xmm3
-; SSE-NEXT: andpd %xmm1, %xmm3
-; SSE-NEXT: maxsd %xmm0, %xmm1
-; SSE-NEXT: andnpd %xmm1, %xmm2
-; SSE-NEXT: orpd %xmm3, %xmm2
-; SSE-NEXT: movapd %xmm2, %xmm0
-; SSE-NEXT: retq
+; SSE2-LABEL: test_fmax:
+; SSE2: # %bb.0:
+; SSE2-NEXT: movapd %xmm0, %xmm2
+; SSE2-NEXT: cmpunordsd %xmm0, %xmm2
+; SSE2-NEXT: movapd %xmm2, %xmm3
+; SSE2-NEXT: andpd %xmm1, %xmm3
+; SSE2-NEXT: maxsd %xmm0, %xmm1
+; SSE2-NEXT: andnpd %xmm1, %xmm2
+; SSE2-NEXT: orpd %xmm3, %xmm2
+; SSE2-NEXT: movapd %xmm2, %xmm0
+; SSE2-NEXT: retq
+;
+; SSE4-LABEL: test_fmax:
+; SSE4: # %bb.0:
+; SSE4-NEXT: movapd %xmm1, %xmm2
+; SSE4-NEXT: maxsd %xmm0, %xmm2
+; SSE4-NEXT: cmpunordsd %xmm0, %xmm0
+; SSE4-NEXT: blendvpd %xmm0, %xmm1, %xmm2
+; SSE4-NEXT: movapd %xmm2, %xmm0
+; SSE4-NEXT: retq
;
; AVX1-LABEL: test_fmax:
; AVX1: # %bb.0:
@@ -111,17 +129,26 @@ define x86_fp80 @test_fmaxl(x86_fp80 %x, x86_fp80 %y) {
}
define float @test_intrinsic_fmaxf(float %x, float %y) {
-; SSE-LABEL: test_intrinsic_fmaxf:
-; SSE: # %bb.0:
-; SSE-NEXT: movaps %xmm0, %xmm2
-; SSE-NEXT: cmpunordss %xmm0, %xmm2
-; SSE-NEXT: movaps %xmm2, %xmm3
-; SSE-NEXT: andps %xmm1, %xmm3
-; SSE-NEXT: maxss %xmm0, %xmm1
-; SSE-NEXT: andnps %xmm1, %xmm2
-; SSE-NEXT: orps %xmm3, %xmm2
-; SSE-NEXT: movaps %xmm2, %xmm0
-; SSE-NEXT: retq
+; SSE2-LABEL: test_intrinsic_fmaxf:
+; SSE2: # %bb.0:
+; SSE2-NEXT: movaps %xmm0, %xmm2
+; SSE2-NEXT: cmpunordss %xmm0, %xmm2
+; SSE2-NEXT: movaps %xmm2, %xmm3
+; SSE2-NEXT: andps %xmm1, %xmm3
+; SSE2-NEXT: maxss %xmm0, %xmm1
+; SSE2-NEXT: andnps %xmm1, %xmm2
+; SSE2-NEXT: orps %xmm3, %xmm2
+; SSE2-NEXT: movaps %xmm2, %xmm0
+; SSE2-NEXT: retq
+;
+; SSE4-LABEL: test_intrinsic_fmaxf:
+; SSE4: # %bb.0:
+; SSE4-NEXT: movaps %xmm1, %xmm2
+; SSE4-NEXT: maxss %xmm0, %xmm2
+; SSE4-NEXT: cmpunordss %xmm0, %xmm0
+; SSE4-NEXT: blendvps %xmm0, %xmm1, %xmm2
+; SSE4-NEXT: movaps %xmm2, %xmm0
+; SSE4-NEXT: retq
;
; AVX1-LABEL: test_intrinsic_fmaxf:
; AVX1: # %bb.0:
@@ -142,17 +169,26 @@ define float @test_intrinsic_fmaxf(float %x, float %y) {
}
define double @test_intrinsic_fmax(double %x, double %y) {
-; SSE-LABEL: test_intrinsic_fmax:
-; SSE: # %bb.0:
-; SSE-NEXT: movapd %xmm0, %xmm2
-; SSE-NEXT: cmpunordsd %xmm0, %xmm2
-; SSE-NEXT: movapd %xmm2, %xmm3
-; SSE-NEXT: andpd %xmm1, %xmm3
-; SSE-NEXT: maxsd %xmm0, %xmm1
-; SSE-NEXT: andnpd %xmm1, %xmm2
-; SSE-NEXT: orpd %xmm3, %xmm2
-; SSE-NEXT: movapd %xmm2, %xmm0
-; SSE-NEXT: retq
+; SSE2-LABEL: test_intrinsic_fmax:
+; SSE2: # %bb.0:
+; SSE2-NEXT: movapd %xmm0, %xmm2
+; SSE2-NEXT: cmpunordsd %xmm0, %xmm2
+; SSE2-NEXT: movapd %xmm2, %xmm3
+; SSE2-NEXT: andpd %xmm1, %xmm3
+; SSE2-NEXT: maxsd %xmm0, %xmm1
+; SSE2-NEXT: andnpd %xmm1, %xmm2
+; SSE2-NEXT: orpd %xmm3, %xmm2
+; SSE2-NEXT: movapd %xmm2, %xmm0
+; SSE2-NEXT: retq
+;
+; SSE4-LABEL: test_intrinsic_fmax:
+; SSE4: # %bb.0:
+; SSE4-NEXT: movapd %xmm1, %xmm2
+; SSE4-NEXT: maxsd %xmm0, %xmm2
+; SSE4-NEXT: cmpunordsd %xmm0, %xmm0
+; SSE4-NEXT: blendvpd %xmm0, %xmm1, %xmm2
+; SSE4-NEXT: movapd %xmm2, %xmm0
+; SSE4-NEXT: retq
;
; AVX1-LABEL: test_intrinsic_fmax:
; AVX1: # %bb.0:
diff --git a/llvm/test/CodeGen/X86/fminnum.ll b/llvm/test/CodeGen/X86/fminnum.ll
index 1290a7b8191067..0ef8fdec33d937 100644
--- a/llvm/test/CodeGen/X86/fminnum.ll
+++ b/llvm/test/CodeGen/X86/fminnum.ll
@@ -22,17 +22,26 @@ declare <8 x double> @llvm.minnum.v8f64(<8 x double>, <8 x double>)
; FIXME: As the vector tests show, the SSE run shouldn't need this many moves.
define float @test_fminf(float %x, float %y) {
-; SSE-LABEL: test_fminf:
-; SSE: # %bb.0:
-; SSE-NEXT: movaps %xmm0, %xmm2
-; SSE-NEXT: cmpunordss %xmm0, %xmm2
-; SSE-NEXT: movaps %xmm2, %xmm3
-; SSE-NEXT: andps %xmm1, %xmm3
-; SSE-NEXT: minss %xmm0, %xmm1
-; SSE-NEXT: andnps %xmm1, %xmm2
-; SSE-NEXT: orps %xmm3, %xmm2
-; SSE-NEXT: movaps %xmm2, %xmm0
-; SSE-NEXT: retq
+; SSE2-LABEL: test_fminf:
+; SSE2: # %bb.0:
+; SSE2-NEXT: movaps %xmm0, %xmm2
+; SSE2-NEXT: cmpunordss %xmm0, %xmm2
+; SSE2-NEXT: movaps %xmm2, %xmm3
+; SSE2-NEXT: andps %xmm1, %xmm3
+; SSE2-NEXT: minss %xmm0, %xmm1
+; SSE2-NEXT: andnps %xmm1, %xmm2
+; SSE2-NEXT: orps %xmm3, %xmm2
+; SSE2-NEXT: movaps %xmm2, %xmm0
+; SSE2-NEXT: retq
+;
+; SSE4-LABEL: test_fminf:
+; SSE4: # %bb.0:
+; SSE4-NEXT: movaps %xmm1, %xmm2
+; SSE4-NEXT: minss %xmm0, %xmm2
+; SSE4-NEXT: cmpunordss %xmm0, %xmm0
+; SSE4-NEXT: blendvps %xmm0, %xmm1, %xmm2
+; SSE4-NEXT: movaps %xmm2, %xmm0
+; SSE4-NEXT: retq
;
; AVX1-LABEL: test_fminf:
; AVX1: # %bb.0:
@@ -63,17 +72,26 @@ define float @test_fminf_minsize(float %x, float %y) minsize {
; FIXME: As the vector tests show, the SSE run shouldn't need this many moves.
define double @test_fmin(double %x, double %y) {
-; SSE-LABEL: test_fmin:
-; SSE: # %bb.0:
-; SSE-NEXT: movapd %xmm0, %xmm2
-; SSE-NEXT: cmpunordsd %xmm0, %xmm2
-; SSE-NEXT: movapd %xmm2, %xmm3
-; SSE-NEXT: andpd %xmm1, %xmm3
-; SSE-NEXT: minsd %xmm0, %xmm1
-; SSE-NEXT: andnpd %xmm1, %xmm2
-; SSE-NEXT: orpd %xmm3, %xmm2
-; SSE-NEXT: movapd %xmm2, %xmm0
-; SSE-NEXT: retq
+; SSE2-LABEL: test_fmin:
+; SSE2: # %bb.0:
+; SSE2-NEXT: movapd %xmm0, %xmm2
+; SSE2-NEXT: cmpunordsd %xmm0, %xmm2
+; SSE2-NEXT: movapd %xmm2, %xmm3
+; SSE2-NEXT: andpd %xmm1, %xmm3
+; SSE2-NEXT: minsd %xmm0, %xmm1
+; SSE2-NEXT: andnpd %xmm1, %xmm2
+; SSE2-NEXT: orpd %xmm3, %xmm2
+; SSE2-NEXT: movapd %xmm2, %xmm0
+; SSE2-NEXT: retq
+;
+; SSE4-LABEL: test_fmin:
+; SSE4: # %bb.0:
+; SSE4-NEXT: movapd %xmm1, %xmm2
+; SSE4-NEXT: minsd %xmm0, %xmm2
+; SSE4-NEXT: cmpunordsd %xmm0, %xmm0
+; SSE4-NEXT: blendvpd %xmm0, %xmm1, %xmm2
+; SSE4-NEXT: movapd %xmm2, %xmm0
+; SSE4-NEXT: retq
;
; AVX1-LABEL: test_fmin:
; AVX1: # %bb.0:
@@ -111,17 +129,26 @@ define x86_fp80 @test_fminl(x86_fp80 %x, x86_fp80 %y) {
}
define float @test_intrinsic_fminf(float %x, float %y) {
-; SSE-LABEL: test_intrinsic_fminf:
-; SSE: # %bb.0:
-; SSE-NEXT: movaps %xmm0, %xmm2
-; SSE-NEXT: cmpunordss %xmm0, %xmm2
-; SSE-NEXT: movaps %xmm2, %xmm3
-; SSE-NEXT: andps %xmm1, %xmm3
-; SSE-NEXT: minss %xmm0, %xmm1
-; SSE-NEXT: andnps %xmm1, %xmm2
-; SSE-NEXT: orps %xmm3, %xmm2
-; SSE-NEXT: movaps %xmm2, %xmm0
-; SSE-NEXT: retq
+; SSE2-LABEL: test_intrinsic_fminf:
+; SSE2: # %bb.0:
+; SSE2-NEXT: movaps %xmm0, %xmm2
+; SSE2-NEXT: cmpunordss %xmm0, %xmm2
+; SSE2-NEXT: movaps %xmm2, %xmm3
+; SSE2-NEXT: andps %xmm1, %xmm3
+; SSE2-NEXT: minss %xmm0, %xmm1
+; SSE2-NEXT: andnps %xmm1, %xmm2
+; SSE2-NEXT: orps %xmm3, %xmm2
+; SSE2-NEXT: movaps %xmm2, %xmm0
+; SSE2-NEXT: retq
+;
+; SSE4-LABEL: test_intrinsic_fminf:
+; SSE4: # %bb.0:
+; SSE4-NEXT: movaps %xmm1, %xmm2
+; SSE4-NEXT: minss %xmm0, %xmm2
+; SSE4-NEXT: cmpunordss %xmm0, %xmm0
+; SSE4-NEXT: blendvps %xmm0, %xmm1, %xmm2
+; SSE4-NEXT: movaps %xmm2, %xmm0
+; SSE4-NEXT: retq
;
; AVX1-LABEL: test_intrinsic_fminf:
; AVX1: # %bb.0:
@@ -142,17 +169,26 @@ define float @test_intrinsic_fminf(float %x, float %y) {
}
define double @test_intrinsic_fmin(double %x, double %y) {
-; SSE-LABEL: test_intrinsic_fmin:
-; SSE: # %bb.0:
-; SSE-NEXT: movapd %xmm0, %xmm2
-; SSE-NEXT: cmpunordsd %xmm0, %xmm2
-; SSE-NEXT: movapd %xmm2, %xmm3
-; SSE-NEXT: andpd %xmm1, %xmm3
-; SSE-NEXT: minsd %xmm0, %xmm1
-; SSE-NEXT: andnpd %xmm1, %xmm2
-; SSE-NEXT: orpd %xmm3, %xmm2
-; SSE-NEXT: movapd %xmm2, %xmm0
-; SSE-NEXT: retq
+; SSE2-LABEL: test_intrinsic_fmin:
+; SSE2: # %bb.0:
+; SSE2-NEXT: movapd %xmm0, %xmm2
+; SSE2-NEXT: cmpunordsd %xmm0, %xmm2
+; SSE2-NEXT: movapd %xmm2, %xmm3
+; SSE2-NEXT: andpd %xmm1, %xmm3
+; SSE2-NEXT: minsd %xmm0, %xmm1
+; SSE2-NEXT: andnpd %xmm1, %xmm2
+; SSE2-NEXT: orpd %xmm3, %xmm2
+; SSE2-NEXT: movapd %xmm2, %xmm0
+; SSE2-NEXT: retq
+;
+; SSE4-LABEL: test_intrinsic_fmin:
+; SSE4: # %bb.0:
+; SSE4-NEXT: movapd %xmm1, %xmm2
+; SSE4-NEXT: minsd %xmm0, %xmm2
+; SSE4-NEXT: cmpunordsd %xmm0, %xmm0
+; SSE4-NEXT: blendvpd %xmm0, %xmm1, %xmm2
+; SSE4-NEXT: movapd %xmm2, %xmm0
+; SSE4-NEXT: retq
;
; AVX1-LABEL: test_intrinsic_fmin:
; AVX1: # %bb.0:
diff --git a/llvm/test/CodeGen/X86/fp-select-cmp-and.ll b/llvm/test/CodeGen/X86/fp-select-cmp-and.ll
index 0f6159d36ea818..1d006f725ca34d 100644
--- a/llvm/test/CodeGen/X86/fp-select-cmp-and.ll
+++ b/llvm/test/CodeGen/X86/fp-select-cmp-and.ll
@@ -189,10 +189,9 @@ define float @test17(float %a, float %b, float %c, float %eps) {
; CHECK-LABEL: test17:
; CHECK: # %bb.0:
; CHECK-NEXT: cmpless %xmm0, %xmm3
-; CHECK-NEXT: andps %xmm3, %xmm2
-; CHECK-NEXT: andnps %xmm1, %xmm3
-; CHECK-NEXT: orps %xmm2, %xmm3
; CHECK-NEXT: movaps %xmm3, %xmm0
+; CHECK-NEXT: blendvps %xmm0, %xmm2, %xmm1
+; CHECK-NEXT: movaps %xmm1, %xmm0
; CHECK-NEXT: retq
%cmp = fcmp oge float %a, %eps
%cond = select i1 %cmp, float %c, float %b
@@ -203,10 +202,9 @@ define double @test18(double %a, double %b, double %c, double %eps) {
; CHECK-LABEL: test18:
; CHECK: # %bb.0:
; CHECK-NEXT: cmplesd %xmm0, %xmm3
-; CHECK-NEXT: andpd %xmm3, %xmm2
-; CHECK-NEXT: andnpd %xmm1, %xmm3
-; CHECK-NEXT: orpd %xmm2, %xmm3
; CHECK-NEXT: movapd %xmm3, %xmm0
+; CHECK-NEXT: blendvpd %xmm0, %xmm2, %xmm1
+; CHECK-NEXT: movapd %xmm1, %xmm0
; CHECK-NEXT: retq
%cmp = fcmp oge double %a, %eps
%cond = select i1 %cmp, double %c, double %b
diff --git a/llvm/test/CodeGen/X86/setcc-combine.ll b/llvm/test/CodeGen/X86/setcc-combine.ll
index e723569bda8a12..f526db00df6062 100644
--- a/llvm/test/CodeGen/X86/setcc-combine.ll
+++ b/llvm/test/CodeGen/X86/setcc-combine.ll
@@ -463,14 +463,23 @@ define <2 x double> @oge(<2 x double> %x) {
; negative test - don't create an fneg to replace 0.0 operand
define double @ogt_no_fneg(double %x, double %y) {
-; CHECK-LABEL: ogt_no_fneg:
-; CHECK: # %bb.0:
-; CHECK-NEXT: xorpd %xmm2, %xmm2
-; CHECK-NEXT: cmpltsd %xmm0, %xmm2
-; CHECK-NEXT: andpd %xmm2, %xmm0
-; CHECK-NEXT: andnpd %xmm1, %xmm2
-; CHECK-NEXT: orpd %xmm2, %xmm0
-; CHECK-NEXT: retq
+; SSE2-LABEL: ogt_no_fneg:
+; SSE2: # %bb.0:
+; SSE2-NEXT: xorpd %xmm2, %xmm2
+; SSE2-NEXT: cmpltsd %xmm0, %xmm2
+; SSE2-NEXT: andpd %xmm2, %xmm0
+; SSE2-NEXT: andnpd %xmm1, %xmm2
+; SSE2-NEXT: orpd %xmm2, %xmm0
+; SSE2-NEXT: retq
+;
+; SSE41-LABEL: ogt_no_fneg:
+; SSE41: # %bb.0:
+; SSE41-NEXT: movapd %xmm0, %xmm2
+; SSE41-NEXT: xorpd %xmm0, %xmm0
+; SSE41-NEXT: cmpltsd %xmm2, %xmm0
+; SSE41-NEXT: blendvpd %xmm0, %xmm2, %xmm1
+; SSE41-NEXT: movapd %xmm1, %xmm0
+; SSE41-NEXT: retq
%cmp = fcmp ogt double %x, 0.0
%r = select i1 %cmp, double %x, double %y
ret double %r
@@ -479,16 +488,27 @@ define double @ogt_no_fneg(double %x, double %y) {
; negative test - can't change the setcc for non-zero constant
define double @ogt_no_zero(double %x) {
-; CHECK-LABEL: ogt_no_zero:
-; CHECK: # %bb.0:
-; CHECK-NEXT: movapd {{.*#+}} xmm1 = [-0.0E+0,-0.0E+0]
-; CHECK-NEXT: xorpd %xmm0, %xmm1
-; CHECK-NEXT: movsd {{.*#+}} xmm2 = [1.0E+0,0.0E+0]
-; CHECK-NEXT: cmpltsd %xmm0, %xmm2
-; CHECK-NEXT: andpd %xmm2, %xmm0
-; CHECK-NEXT: andnpd %xmm1, %xmm2
-; CHECK-NEXT: orpd %xmm2, %xmm0
-; CHECK-NEXT: retq
+; SSE2-LABEL: ogt_no_zero:
+; SSE2: # %bb.0:
+; SSE2-NEXT: movapd {{.*#+}} xmm1 = [-0.0E+0,-0.0E+0]
+; SSE2-NEXT: xorpd %xmm0, %xmm1
+; SSE2-NEXT: movsd {{.*#+}} xmm2 = [1.0E+0,0.0E+0]
+; SSE2-NEXT: cmpltsd %xmm0, %xmm2
+; SSE2-NEXT: andpd %xmm2, %xmm0
+; SSE2-NEXT: andnpd %xmm1, %xmm2
+; SSE2-NEXT: orpd %xmm2, %xmm0
+; SSE2-NEXT: retq
+;
+; SSE41-LABEL: ogt_no_zero:
+; SSE41: # %bb.0:
+; SSE41-NEXT: movapd %xmm0, %xmm1
+; SSE41-NEXT: movapd {{.*#+}} xmm2 = [-0.0E+0,-0.0E+0]
+; SSE41-NEXT: xorpd %xmm0, %xmm2
+; SSE41-NEXT: movsd {{.*#+}} xmm0 = [1.0E+0,0.0E+0]
+; SSE41-NEXT: cmpltsd %xmm1, %xmm0
+; SSE41-NEXT: blendvpd %xmm0, %xmm1, %xmm2
+; SSE41-NEXT: movapd %xmm2, %xmm0
+; SSE41-NEXT: retq
%neg = fneg double %x
%cmp = fcmp ogt double %x, 1.0
%r = select i1 %cmp, double %x, double %neg
diff --git a/llvm/test/CodeGen/X86/vec_floor.ll b/llvm/test/CodeGen/X86/vec_floor.ll
index 65cde6ac91106b..abb85ac83464cb 100644
--- a/llvm/test/CodeGen/X86/vec_floor.ll
+++ b/llvm/test/CodeGen/X86/vec_floor.ll
@@ -1679,10 +1679,9 @@ define <4 x float> @floor_mask_ss_mask8(<4 x float> %x, <4 x float> %y, <4 x flo
; SSE41: ## %bb.0:
; SSE41-NEXT: roundss $9, %xmm0, %xmm3
; SSE41-NEXT: cmpeqss %xmm1, %xmm0
-; SSE41-NEXT: andps %xmm0, %xmm3
-; SSE41-NEXT: andnps %xmm2, %xmm0
-; SSE41-NEXT: orps %xmm3, %xmm0
-; SSE41-NEXT: blendps {{.*#+}} xmm0 = xmm0[0],xmm1[1,2,3]
+; SSE41-NEXT: blendvps %xmm0, %xmm3, %xmm2
+; SSE41-NEXT: blendps {{.*#+}} xmm2 = xmm2[0],xmm1[1,2,3]
+; SSE41-NEXT: movaps %xmm2, %xmm0
; SSE41-NEXT: retq
;
; AVX-LABEL: floor_mask_ss_mask8:
@@ -1747,10 +1746,9 @@ define <2 x double> @floor_mask_sd_mask8(<2 x double> %x, <2 x double> %y, <2 x
; SSE41: ## %bb.0:
; SSE41-NEXT: roundsd $9, %xmm0, %xmm3
; SSE41-NEXT: cmpeqsd %xmm1, %xmm0
-; SSE41-NEXT: andpd %xmm0, %xmm3
-; SSE41-NEXT: andnpd %xmm2, %xmm0
-; SSE41-NEXT: orpd %xmm3, %xmm0
-; SSE41-NEXT: blendpd {{.*#+}} xmm0 = xmm0[0],xmm1[1]
+; SSE41-NEXT: blendvpd %xmm0, %xmm3, %xmm2
+; SSE41-NEXT: blendpd {{.*#+}} xmm2 = xmm2[0],xmm1[1]
+; SSE41-NEXT: movapd %xmm2, %xmm0
; SSE41-NEXT: retq
;
; AVX-LABEL: floor_mask_sd_mask8:
@@ -2671,10 +2669,9 @@ define <4 x float> @ceil_mask_ss_mask8(<4 x float> %x, <4 x float> %y, <4 x floa
; SSE41: ## %bb.0:
; SSE41-NEXT: roundss $10, %xmm0, %xmm3
; SSE41-NEXT: cmpeqss %xmm1, %xmm0
-; SSE41-NEXT: andps %xmm0, %xmm3
-; SSE41-NEXT: andnps %xmm2, %xmm0
-; SSE41-NEXT: orps %xmm3, %xmm0
-; SSE41-NEXT: blendps {{.*#+}} xmm0 = xmm0[0],xmm1[1,2,3]
+; SSE41-NEXT: blendvps %xmm0, %xmm3, %xmm2
+; SSE41-NEXT: blendps {{.*#+}} xmm2 = xmm2[0],xmm1[1,2,3]
+; SSE41-NEXT: movaps %xmm2, %xmm0
; SSE41-NEXT: retq
;
; AVX-LABEL: ceil_mask_ss_mask8:
@@ -2739,10 +2736,9 @@ define <2 x double> @ceil_mask_sd_mask8(<2 x double> %x, <2 x double> %y, <2 x d
; SSE41: ## %bb.0:
; SSE41-NEXT: roundsd $10, %xmm0, %xmm3
; SSE41-NEXT: cmpeqsd %xmm1, %xmm0
-; SSE41-NEXT: andpd %xmm0, %xmm3
-; SSE41-NEXT: andnpd %xmm2, %xmm0
-; SSE41-NEXT: orpd %xmm3, %xmm0
-; SSE41-NEXT: blendpd {{.*#+}} xmm0 = xmm0[0],xmm1[1]
+; SSE41-NEXT: blendvpd %xmm0, %xmm3, %xmm2
+; SSE41-NEXT: blendpd {{.*#+}} xmm2 = xmm2[0],xmm1[1]
+; SSE41-NEXT: movapd %xmm2, %xmm0
; SSE41-NEXT: retq
;
; AVX-LABEL: ceil_mask_sd_mask8:
diff --git a/llvm/test/CodeGen/X86/vector-reduce-fmax.ll b/llvm/test/CodeGen/X86/vector-reduce-fmax.ll
index fe2c41f57cfab1..7048b98227620f 100644
--- a/llvm/test/CodeGen/X86/vector-reduce-fmax.ll
+++ b/llvm/test/CodeGen/X86/vector-reduce-fmax.ll
@@ -1,6 +1,6 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+sse2 | FileCheck %s --check-prefixes=ALL,SSE,SSE2
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+sse4.1 | FileCheck %s --check-prefixes=ALL,SSE,SSE41
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+sse2 | FileCheck %s --check-prefixes=ALL,SSE2
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+sse4.1 | FileCheck %s --check-prefixes=ALL,SSE41
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx | FileCheck %s --check-prefixes=ALL,AVX
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx2 | FileCheck %s --check-prefixes=ALL,AVX
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512f,+avx512bw | FileCheck %s --check-prefixes=ALL,AVX512,AVX512BW
@@ -36,13 +36,10 @@ define float @test_v2f32(<2 x float> %a0) {
; SSE41-LABEL: test_v2f32:
; SSE41: # %bb.0:
; SSE41-NEXT: movshdup {{.*#+}} xmm2 = xmm0[1,1,3,3]
-; SSE41-NEXT: movaps %xmm0, %xmm1
-; SSE41-NEXT: cmpunordss %xmm0, %xmm1
-; SSE41-NEXT: movaps %xmm1, %xmm3
-; SSE41-NEXT: andps %xmm2, %xmm3
-; SSE41-NEXT: maxss %xmm0, %xmm2
-; SSE41-NEXT: andnps %xmm2, %xmm1
-; SSE41-NEXT: orps %xmm3, %xmm1
+; SSE41-NEXT: movaps %xmm2, %xmm1
+; SSE41-NEXT: maxss %xmm0, %xmm1
+; SSE41-NEXT: cmpunordss...
[truncated]
Do any of these tests actually handle the case where one of the ops is multi-use?
Yes, removing the limit enables additional folds; some are actually beneficial, but others cause extra moves.
Which tests?
Changes to sse-minmax.ll - I'll push the diff (tmp commit for review - I'll remove it again later)
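For context, a minimal shape of the multi-use case being discussed (illustrative only, not copied from sse-minmax.ll): both %x and %y feed the compare as well as the select, so neither select operand is a one-use node in the DAG.

define double @select_both_multiuse(double %x, double %y) {
  %cmp = fcmp oge double %x, %y                 ; %x and %y are used here...
  %sel = select i1 %cmp, double %x, double %y   ; ...and again as the select operands
  ret double %sel
}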
We assume moves have negligible cost in the uarch and the total instruction count is not increased. Why is it not preferred?
Not all uarchs from the SSE4 era had move elimination, and often the BLENDV instructions were 2 uops or more - so the total uop count could increase if the 3 x 1uop logic ops (+ maybe 1uop move for the ANDNP mask) were replaced with 3 x 1uop moves + 1 x 2uop BLENDV - that's the worst-case scenario. But we already always take that chance with BLENDV for vector select; it's just the scalar selects where for some reason we were more cautious. I was trying to find a compromise, but I'm not against dropping the multiuse limit for SSE4 entirely.
Either op having one use seems a reasonable compromise to me, although I also think removing the requirement entirely is sensible. My guess is most stuff compiled for SSE4 is probably running on newer hardware (mov elim, fast blendv) and just compiled with SSE4 for compatibility's sake.
+1. If it's OK either way, we can assume newer hardware performance is more important than that of the older ones. We used this strategy when bumping the general tuning.
…gets (llvm#125853)

When we first began (2015) to lower f32/f64 selects to X86ISD::BLENDV(scalar_to_vector(),scalar_to_vector(),scalar_to_vector()), we limited it to AVX targets to avoid issues with SSE41's xmm0 constraint for the condition mask. Since then we've seen general improvements in TwoAddressInstruction and better handling of condition commutation for X86ISD::BLENDV nodes, which should address many of the original concerns of using SSE41 BLENDVPD/S. In most cases we will replace 3 logic instructions with the BLENDV node and (up to 3) additional moves. Although the BLENDV is often more expensive on original SSE41 targets, this should still be an improvement in a majority of cases. We also have no equivalent restrictions for SSE41 for v2f64/v4f32 vector selection.

Fixes llvm#105807
When we first began (2015) to lower f32/f64 selects to X86ISD::BLENDV(scalar_to_vector(),scalar_to_vector(),scalar_to_vector()), we limited it to AVX targets to avoid issues with SSE41's xmm0 constraint for the condition mask.
Since then we've seen general improvements in TwoAddressInstruction and better handling of condition commutation for X86ISD::BLENDV nodes, which should address many of the original concerns of using SSE41 BLENDVPD/S. In most cases we will replace 3 logic instructions with the BLENDV node and (up to 3) additional moves. Although the BLENDV is often more expensive on original SSE41 targets, this should still be an improvement in a majority of cases.
We also have no equivalent restrictions for SSE41 for v2f64/v4f32 vector selection.
Fixes #105807