
[X86] Fold AND(Y, XOR(X, SUB(0, X))) to ANDN(Y, BLSMSK(X)) #128348


Merged

RKSimon merged 2 commits into llvm:main from fix_103501_fold_andn_blsmsk on Feb 25, 2025

Conversation

mskamp (Contributor) commented Feb 22, 2025

XOR(X, SUB(0, X)) is the bitwise negation of a BLSMSK instruction
(i.e., of x ^ (x - 1)). On its own, this transformation is probably not
profitable, but when the XOR operation is an operand of an AND
operation, we can use an ANDN instruction to reduce the number of
emitted instructions by one.

Fixes #103501.
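
As an illustration (not part of the patch), here is a minimal C++ sketch of the underlying identity: in two's complement, -x == ~(x - 1), so x ^ -x == ~(x ^ (x - 1)); that is, XOR(X, SUB(0, X)) is the bitwise NOT of BLSMSK(X), and AND(Y, XOR(X, SUB(0, X))) equals Y & ~BLSMSK(X), which maps to one BLSMSK plus one ANDN.

    #include <cassert>
    #include <cstdint>

    // BLSMSK semantics: set all bits up to and including the lowest set bit of X.
    static uint32_t blsmsk(uint32_t X) { return X ^ (X - 1); }

    // Pattern before the fold: AND(Y, XOR(X, SUB(0, X))).
    static uint32_t beforeFold(uint32_t X, uint32_t Y) { return Y & (X ^ (0u - X)); }

    // Pattern after the fold: Y & ~BLSMSK(X) (one BLSMSK plus one ANDN).
    static uint32_t afterFold(uint32_t X, uint32_t Y) { return Y & ~blsmsk(X); }

    int main() {
      // Spot-check the identity, including the X == 0 edge case; it holds for
      // all X because unsigned arithmetic wraps and -X == ~(X - 1).
      for (uint32_t X : {0u, 1u, 2u, 3u, 4u, 0x80000000u, 0xFFFFFFFFu})
        for (uint32_t Y : {0u, 0xDEADBEEFu, 0xFFFFFFFFu})
          assert(beforeFold(X, Y) == afterFold(X, Y));
      return 0;
    }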

llvmbot (Member) commented Feb 22, 2025

@llvm/pr-subscribers-backend-x86

Author: Marius Kamp (mskamp)


Full diff: https://github.com/llvm/llvm-project/pull/128348.diff

2 Files Affected:

  • (modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+40)
  • (added) llvm/test/CodeGen/X86/andnot-blsmsk.ll (+346)
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 1c9d43ce4c062..81072f3a8995e 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -51045,6 +51045,41 @@ static SDValue combineBMILogicOp(SDNode *N, SelectionDAG &DAG,
   return SDValue();
 }
 
+/// Fold AND(Y, XOR(X, NEG(X))) -> ANDN(Y, BLSMSK(X)) if BMI is available.
+static SDValue combineAndXorSubWithBMI(SDValue Op, SDValue OtherOp, SDLoc DL,
+                                       SelectionDAG &DAG,
+                                       const X86Subtarget &Subtarget) {
+  EVT VT = Op.getValueType();
+  // Make sure this node is a candidate for BMI instructions.
+  if (!Subtarget.hasBMI() || !VT.isScalarInteger() ||
+      (VT != MVT::i32 && VT != MVT::i64))
+    return SDValue();
+
+  if (Op.getOpcode() != ISD::XOR || !Op.hasOneUse())
+    return SDValue();
+
+  // Make sure that the XOR operation corresponds to a negated BLSMSK
+  // instruction.
+  SDValue X = Op.getOperand(0);
+  SDValue Sub = Op.getOperand(1);
+  auto CheckOps = [&] {
+    return Sub.getOpcode() == ISD::SUB && Sub->hasOneUse() &&
+           isNullConstant(Sub.getOperand(0)) && Sub.getOperand(1) == X;
+  };
+  if (!CheckOps()) {
+    std::swap(X, Sub);
+    if (!CheckOps())
+      return SDValue();
+  }
+
+  SDValue BLSMSK =
+      DAG.getNode(ISD::XOR, DL, VT, X,
+                  DAG.getNode(ISD::SUB, DL, VT, X, DAG.getConstant(1, DL, VT)));
+  SDValue AndN = DAG.getNode(ISD::AND, SDLoc(Op), VT, OtherOp,
+                             DAG.getNOT(SDLoc(Op), BLSMSK, VT));
+  return AndN;
+}
+
 static SDValue combineX86SubCmpForFlags(SDNode *N, SDValue Flag,
                                         SelectionDAG &DAG,
                                         TargetLowering::DAGCombinerInfo &DCI,
@@ -51453,6 +51488,11 @@ static SDValue combineAnd(SDNode *N, SelectionDAG &DAG,
   if (SDValue R = combineBMILogicOp(N, DAG, Subtarget))
     return R;
 
+  if (SDValue R = combineAndXorSubWithBMI(N0, N1, dl, DAG, Subtarget))
+    return R;
+  if (SDValue R = combineAndXorSubWithBMI(N1, N0, dl, DAG, Subtarget))
+    return R;
+
   return SDValue();
 }
 
diff --git a/llvm/test/CodeGen/X86/andnot-blsmsk.ll b/llvm/test/CodeGen/X86/andnot-blsmsk.ll
new file mode 100644
index 0000000000000..1e31ee075bfee
--- /dev/null
+++ b/llvm/test/CodeGen/X86/andnot-blsmsk.ll
@@ -0,0 +1,346 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc < %s -mtriple=i686-- -mattr=-bmi,+sse2 | FileCheck %s --check-prefixes=X86,X86-NOBMI
+; RUN: llc < %s -mtriple=i686-- -mattr=+bmi,+sse2 | FileCheck %s --check-prefixes=X86,X86-BMI
+; RUN: llc < %s -mtriple=x86_64-- -mattr=-bmi | FileCheck %s --check-prefixes=X64,X64-NOBMI
+; RUN: llc < %s -mtriple=x86_64-- -mattr=+bmi | FileCheck %s --check-prefixes=X64,X64-BMI
+
+declare void @use(i32)
+
+define i32 @fold_and_xor_neg_v1_32(i32 %x, i32 %y) {
+; X86-NOBMI-LABEL: fold_and_xor_neg_v1_32:
+; X86-NOBMI:       # %bb.0:
+; X86-NOBMI-NEXT:    movl {{[0-9]+}}(%esp), %ecx
+; X86-NOBMI-NEXT:    movl %ecx, %eax
+; X86-NOBMI-NEXT:    negl %eax
+; X86-NOBMI-NEXT:    xorl %ecx, %eax
+; X86-NOBMI-NEXT:    andl {{[0-9]+}}(%esp), %eax
+; X86-NOBMI-NEXT:    retl
+;
+; X86-BMI-LABEL: fold_and_xor_neg_v1_32:
+; X86-BMI:       # %bb.0:
+; X86-BMI-NEXT:    blsmskl {{[0-9]+}}(%esp), %eax
+; X86-BMI-NEXT:    andnl {{[0-9]+}}(%esp), %eax, %eax
+; X86-BMI-NEXT:    retl
+;
+; X64-NOBMI-LABEL: fold_and_xor_neg_v1_32:
+; X64-NOBMI:       # %bb.0:
+; X64-NOBMI-NEXT:    movl %edi, %eax
+; X64-NOBMI-NEXT:    negl %eax
+; X64-NOBMI-NEXT:    xorl %edi, %eax
+; X64-NOBMI-NEXT:    andl %esi, %eax
+; X64-NOBMI-NEXT:    retq
+;
+; X64-BMI-LABEL: fold_and_xor_neg_v1_32:
+; X64-BMI:       # %bb.0:
+; X64-BMI-NEXT:    blsmskl %edi, %eax
+; X64-BMI-NEXT:    andnl %esi, %eax, %eax
+; X64-BMI-NEXT:    retq
+  %neg = sub i32 0, %x
+  %xor = xor i32 %x, %neg
+  %and = and i32 %xor, %y
+  ret i32 %and
+}
+
+define i32 @fold_and_xor_neg_v2_32(i32 %x, i32 %y) {
+; X86-NOBMI-LABEL: fold_and_xor_neg_v2_32:
+; X86-NOBMI:       # %bb.0:
+; X86-NOBMI-NEXT:    movl {{[0-9]+}}(%esp), %ecx
+; X86-NOBMI-NEXT:    movl %ecx, %eax
+; X86-NOBMI-NEXT:    negl %eax
+; X86-NOBMI-NEXT:    xorl %ecx, %eax
+; X86-NOBMI-NEXT:    andl {{[0-9]+}}(%esp), %eax
+; X86-NOBMI-NEXT:    retl
+;
+; X86-BMI-LABEL: fold_and_xor_neg_v2_32:
+; X86-BMI:       # %bb.0:
+; X86-BMI-NEXT:    blsmskl {{[0-9]+}}(%esp), %eax
+; X86-BMI-NEXT:    andnl {{[0-9]+}}(%esp), %eax, %eax
+; X86-BMI-NEXT:    retl
+;
+; X64-NOBMI-LABEL: fold_and_xor_neg_v2_32:
+; X64-NOBMI:       # %bb.0:
+; X64-NOBMI-NEXT:    movl %edi, %eax
+; X64-NOBMI-NEXT:    negl %eax
+; X64-NOBMI-NEXT:    xorl %edi, %eax
+; X64-NOBMI-NEXT:    andl %esi, %eax
+; X64-NOBMI-NEXT:    retq
+;
+; X64-BMI-LABEL: fold_and_xor_neg_v2_32:
+; X64-BMI:       # %bb.0:
+; X64-BMI-NEXT:    blsmskl %edi, %eax
+; X64-BMI-NEXT:    andnl %esi, %eax, %eax
+; X64-BMI-NEXT:    retq
+  %neg = sub i32 0, %x
+  %xor = xor i32 %x, %neg
+  %and = and i32 %y, %xor
+  ret i32 %and
+}
+
+define i32 @fold_and_xor_neg_v3_32(i32 %x, i32 %y) {
+; X86-NOBMI-LABEL: fold_and_xor_neg_v3_32:
+; X86-NOBMI:       # %bb.0:
+; X86-NOBMI-NEXT:    movl {{[0-9]+}}(%esp), %ecx
+; X86-NOBMI-NEXT:    movl %ecx, %eax
+; X86-NOBMI-NEXT:    negl %eax
+; X86-NOBMI-NEXT:    xorl %ecx, %eax
+; X86-NOBMI-NEXT:    andl {{[0-9]+}}(%esp), %eax
+; X86-NOBMI-NEXT:    retl
+;
+; X86-BMI-LABEL: fold_and_xor_neg_v3_32:
+; X86-BMI:       # %bb.0:
+; X86-BMI-NEXT:    blsmskl {{[0-9]+}}(%esp), %eax
+; X86-BMI-NEXT:    andnl {{[0-9]+}}(%esp), %eax, %eax
+; X86-BMI-NEXT:    retl
+;
+; X64-NOBMI-LABEL: fold_and_xor_neg_v3_32:
+; X64-NOBMI:       # %bb.0:
+; X64-NOBMI-NEXT:    movl %edi, %eax
+; X64-NOBMI-NEXT:    negl %eax
+; X64-NOBMI-NEXT:    xorl %edi, %eax
+; X64-NOBMI-NEXT:    andl %esi, %eax
+; X64-NOBMI-NEXT:    retq
+;
+; X64-BMI-LABEL: fold_and_xor_neg_v3_32:
+; X64-BMI:       # %bb.0:
+; X64-BMI-NEXT:    blsmskl %edi, %eax
+; X64-BMI-NEXT:    andnl %esi, %eax, %eax
+; X64-BMI-NEXT:    retq
+  %neg = sub i32 0, %x
+  %xor = xor i32 %neg, %x
+  %and = and i32 %xor, %y
+  ret i32 %and
+}
+
+define i32 @fold_and_xor_neg_v4_32(i32 %x, i32 %y) {
+; X86-NOBMI-LABEL: fold_and_xor_neg_v4_32:
+; X86-NOBMI:       # %bb.0:
+; X86-NOBMI-NEXT:    movl {{[0-9]+}}(%esp), %ecx
+; X86-NOBMI-NEXT:    movl %ecx, %eax
+; X86-NOBMI-NEXT:    negl %eax
+; X86-NOBMI-NEXT:    xorl %ecx, %eax
+; X86-NOBMI-NEXT:    andl {{[0-9]+}}(%esp), %eax
+; X86-NOBMI-NEXT:    retl
+;
+; X86-BMI-LABEL: fold_and_xor_neg_v4_32:
+; X86-BMI:       # %bb.0:
+; X86-BMI-NEXT:    blsmskl {{[0-9]+}}(%esp), %eax
+; X86-BMI-NEXT:    andnl {{[0-9]+}}(%esp), %eax, %eax
+; X86-BMI-NEXT:    retl
+;
+; X64-NOBMI-LABEL: fold_and_xor_neg_v4_32:
+; X64-NOBMI:       # %bb.0:
+; X64-NOBMI-NEXT:    movl %edi, %eax
+; X64-NOBMI-NEXT:    negl %eax
+; X64-NOBMI-NEXT:    xorl %edi, %eax
+; X64-NOBMI-NEXT:    andl %esi, %eax
+; X64-NOBMI-NEXT:    retq
+;
+; X64-BMI-LABEL: fold_and_xor_neg_v4_32:
+; X64-BMI:       # %bb.0:
+; X64-BMI-NEXT:    blsmskl %edi, %eax
+; X64-BMI-NEXT:    andnl %esi, %eax, %eax
+; X64-BMI-NEXT:    retq
+  %neg = sub i32 0, %x
+  %xor = xor i32 %neg, %x
+  %and = and i32 %y, %xor
+  ret i32 %and
+}
+
+define i64 @fold_and_xor_neg_v1_64(i64 %x, i64 %y) {
+; X86-LABEL: fold_and_xor_neg_v1_64:
+; X86:       # %bb.0:
+; X86-NEXT:    pushl %esi
+; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    .cfi_offset %esi, -8
+; X86-NEXT:    movl {{[0-9]+}}(%esp), %ecx
+; X86-NEXT:    movl {{[0-9]+}}(%esp), %esi
+; X86-NEXT:    xorl %edx, %edx
+; X86-NEXT:    movl %ecx, %eax
+; X86-NEXT:    negl %eax
+; X86-NEXT:    sbbl %esi, %edx
+; X86-NEXT:    xorl %esi, %edx
+; X86-NEXT:    xorl %ecx, %eax
+; X86-NEXT:    andl {{[0-9]+}}(%esp), %edx
+; X86-NEXT:    andl {{[0-9]+}}(%esp), %eax
+; X86-NEXT:    popl %esi
+; X86-NEXT:    .cfi_def_cfa_offset 4
+; X86-NEXT:    retl
+;
+; X64-NOBMI-LABEL: fold_and_xor_neg_v1_64:
+; X64-NOBMI:       # %bb.0:
+; X64-NOBMI-NEXT:    movq %rdi, %rax
+; X64-NOBMI-NEXT:    negq %rax
+; X64-NOBMI-NEXT:    xorq %rdi, %rax
+; X64-NOBMI-NEXT:    andq %rsi, %rax
+; X64-NOBMI-NEXT:    retq
+;
+; X64-BMI-LABEL: fold_and_xor_neg_v1_64:
+; X64-BMI:       # %bb.0:
+; X64-BMI-NEXT:    blsmskq %rdi, %rax
+; X64-BMI-NEXT:    andnq %rsi, %rax, %rax
+; X64-BMI-NEXT:    retq
+  %neg = sub i64 0, %x
+  %xor = xor i64 %x, %neg
+  %and = and i64 %xor, %y
+  ret i64 %and
+}
+
+; Negative test
+define i16 @fold_and_xor_neg_v1_16_negative(i16 %x, i16 %y) {
+; X86-LABEL: fold_and_xor_neg_v1_16_negative:
+; X86:       # %bb.0:
+; X86-NEXT:    movl {{[0-9]+}}(%esp), %ecx
+; X86-NEXT:    movl %ecx, %eax
+; X86-NEXT:    negl %eax
+; X86-NEXT:    xorl %ecx, %eax
+; X86-NEXT:    andw {{[0-9]+}}(%esp), %ax
+; X86-NEXT:    # kill: def $ax killed $ax killed $eax
+; X86-NEXT:    retl
+;
+; X64-LABEL: fold_and_xor_neg_v1_16_negative:
+; X64:       # %bb.0:
+; X64-NEXT:    movl %edi, %eax
+; X64-NEXT:    negl %eax
+; X64-NEXT:    xorl %edi, %eax
+; X64-NEXT:    andl %esi, %eax
+; X64-NEXT:    # kill: def $ax killed $ax killed $eax
+; X64-NEXT:    retq
+  %neg = sub i16 0, %x
+  %xor = xor i16 %x, %neg
+  %and = and i16 %xor, %y
+  ret i16 %and
+}
+
+; Negative test
+define <4 x i32> @fold_and_xor_neg_v1_v4x32_negative(<4 x i32> %x, <4 x i32> %y) {
+; X86-LABEL: fold_and_xor_neg_v1_v4x32_negative:
+; X86:       # %bb.0:
+; X86-NEXT:    pxor %xmm2, %xmm2
+; X86-NEXT:    psubd %xmm0, %xmm2
+; X86-NEXT:    pxor %xmm2, %xmm0
+; X86-NEXT:    pand %xmm1, %xmm0
+; X86-NEXT:    retl
+;
+; X64-LABEL: fold_and_xor_neg_v1_v4x32_negative:
+; X64:       # %bb.0:
+; X64-NEXT:    pxor %xmm2, %xmm2
+; X64-NEXT:    psubd %xmm0, %xmm2
+; X64-NEXT:    pxor %xmm2, %xmm0
+; X64-NEXT:    pand %xmm1, %xmm0
+; X64-NEXT:    retq
+  %neg = sub <4 x i32> zeroinitializer, %x
+  %xor = xor <4 x i32> %x, %neg
+  %and = and <4 x i32> %xor, %y
+  ret <4 x i32> %and
+}
+
+; Negative test
+define i32 @fold_and_xor_neg_v1_32_two_uses_xor_negative(i32 %x, i32 %y) {
+; X86-LABEL: fold_and_xor_neg_v1_32_two_uses_xor_negative:
+; X86:       # %bb.0:
+; X86-NEXT:    pushl %esi
+; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    .cfi_offset %esi, -8
+; X86-NEXT:    movl {{[0-9]+}}(%esp), %eax
+; X86-NEXT:    movl %eax, %ecx
+; X86-NEXT:    negl %ecx
+; X86-NEXT:    xorl %eax, %ecx
+; X86-NEXT:    movl {{[0-9]+}}(%esp), %esi
+; X86-NEXT:    andl %ecx, %esi
+; X86-NEXT:    pushl %ecx
+; X86-NEXT:    .cfi_adjust_cfa_offset 4
+; X86-NEXT:    calll use@PLT
+; X86-NEXT:    addl $4, %esp
+; X86-NEXT:    .cfi_adjust_cfa_offset -4
+; X86-NEXT:    movl %esi, %eax
+; X86-NEXT:    popl %esi
+; X86-NEXT:    .cfi_def_cfa_offset 4
+; X86-NEXT:    retl
+;
+; X64-LABEL: fold_and_xor_neg_v1_32_two_uses_xor_negative:
+; X64:       # %bb.0:
+; X64-NEXT:    pushq %rbx
+; X64-NEXT:    .cfi_def_cfa_offset 16
+; X64-NEXT:    .cfi_offset %rbx, -16
+; X64-NEXT:    movl %esi, %ebx
+; X64-NEXT:    movl %edi, %eax
+; X64-NEXT:    negl %eax
+; X64-NEXT:    xorl %eax, %edi
+; X64-NEXT:    andl %edi, %ebx
+; X64-NEXT:    callq use@PLT
+; X64-NEXT:    movl %ebx, %eax
+; X64-NEXT:    popq %rbx
+; X64-NEXT:    .cfi_def_cfa_offset 8
+; X64-NEXT:    retq
+  %neg = sub i32 0, %x
+  %xor = xor i32 %x, %neg
+  %and = and i32 %xor, %y
+  call void @use(i32 %xor)
+  ret i32 %and
+}
+
+; Negative test
+define i32 @fold_and_xor_neg_v1_32_two_uses_sub_negative(i32 %x, i32 %y) {
+; X86-LABEL: fold_and_xor_neg_v1_32_two_uses_sub_negative:
+; X86:       # %bb.0:
+; X86-NEXT:    pushl %esi
+; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    .cfi_offset %esi, -8
+; X86-NEXT:    movl {{[0-9]+}}(%esp), %esi
+; X86-NEXT:    movl %esi, %eax
+; X86-NEXT:    negl %eax
+; X86-NEXT:    xorl %eax, %esi
+; X86-NEXT:    andl {{[0-9]+}}(%esp), %esi
+; X86-NEXT:    pushl %eax
+; X86-NEXT:    .cfi_adjust_cfa_offset 4
+; X86-NEXT:    calll use@PLT
+; X86-NEXT:    addl $4, %esp
+; X86-NEXT:    .cfi_adjust_cfa_offset -4
+; X86-NEXT:    movl %esi, %eax
+; X86-NEXT:    popl %esi
+; X86-NEXT:    .cfi_def_cfa_offset 4
+; X86-NEXT:    retl
+;
+; X64-LABEL: fold_and_xor_neg_v1_32_two_uses_sub_negative:
+; X64:       # %bb.0:
+; X64-NEXT:    pushq %rbx
+; X64-NEXT:    .cfi_def_cfa_offset 16
+; X64-NEXT:    .cfi_offset %rbx, -16
+; X64-NEXT:    movl %edi, %ebx
+; X64-NEXT:    negl %edi
+; X64-NEXT:    xorl %edi, %ebx
+; X64-NEXT:    andl %esi, %ebx
+; X64-NEXT:    callq use@PLT
+; X64-NEXT:    movl %ebx, %eax
+; X64-NEXT:    popq %rbx
+; X64-NEXT:    .cfi_def_cfa_offset 8
+; X64-NEXT:    retq
+  %neg = sub i32 0, %x
+  %xor = xor i32 %x, %neg
+  %and = and i32 %xor, %y
+  call void @use(i32 %neg)
+  ret i32 %and
+}
+
+; Negative test
+define i32 @fold_and_xor_neg_v1_32_no_blsmsk_negative(i32 %x, i32 %y, i32 %z) {
+; X86-LABEL: fold_and_xor_neg_v1_32_no_blsmsk_negative:
+; X86:       # %bb.0:
+; X86-NEXT:    xorl %eax, %eax
+; X86-NEXT:    subl {{[0-9]+}}(%esp), %eax
+; X86-NEXT:    xorl {{[0-9]+}}(%esp), %eax
+; X86-NEXT:    andl {{[0-9]+}}(%esp), %eax
+; X86-NEXT:    retl
+;
+; X64-LABEL: fold_and_xor_neg_v1_32_no_blsmsk_negative:
+; X64:       # %bb.0:
+; X64-NEXT:    movl %edx, %eax
+; X64-NEXT:    negl %eax
+; X64-NEXT:    xorl %edi, %eax
+; X64-NEXT:    andl %esi, %eax
+; X64-NEXT:    retq
+  %neg = sub i32 0, %z
+  %xor = xor i32 %x, %neg
+  %and = and i32 %xor, %y
+  ret i32 %and
+}
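
For context on the negative tests above: the fold is limited to scalar i32 and i64 (BMI's BLSMSK and ANDN have only 32- and 64-bit forms) and to single-use XOR/SUB nodes, because extra uses force the intermediate values to be materialized anyway, so the rewrite would not save an instruction. Below is a simplified, hypothetical restatement of those guards in standalone C++ (illustrative names only, not the LLVM API):

    // Hypothetical mirror of the guards in combineAndXorSubWithBMI; the struct
    // and function names are illustrative and not part of LLVM.
    struct FoldCandidate {
      unsigned BitWidth;     // width of the AND/XOR/SUB nodes
      bool IsScalarInteger;  // vector and non-integer types are rejected
      bool XorHasOneUse;     // XOR(X, SUB(0, X)) must have a single use
      bool SubHasOneUse;     // SUB(0, X) must have a single use
      bool HasBMI;           // BLSMSK/ANDN require the BMI feature
    };

    static bool shouldFoldToAndnBlsmsk(const FoldCandidate &C) {
      if (!C.HasBMI || !C.IsScalarInteger)
        return false;
      if (C.BitWidth != 32 && C.BitWidth != 64) // no 16-bit BLSMSK/ANDN forms
        return false;
      // Extra uses mean the XOR/SUB results are needed anyway, so folding
      // would not reduce the number of emitted instructions.
      return C.XorHasOneUse && C.SubHasOneUse;
    }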

mskamp force-pushed the fix_103501_fold_andn_blsmsk branch from 782d944 to c941de3 on February 22, 2025 at 13:24
mskamp force-pushed the fix_103501_fold_andn_blsmsk branch from c941de3 to a93469e on February 22, 2025 at 17:30
RKSimon (Collaborator) left a comment

LGTM with 2 minors

mskamp force-pushed the fix_103501_fold_andn_blsmsk branch from a93469e to 2365de5 on February 25, 2025 at 15:28
RKSimon merged commit 8bea511 into llvm:main on Feb 25, 2025
11 checks passed

Successfully merging this pull request may close these issues:

  • [x86-64 BMI2] Missed canonicalization on ~blsmsk (#103501)