
[X86] Use fence(seq_cst) in IdempotentRMWIntoFencedLoad #126521


Merged
vchuravy merged 4 commits into main from users/vchuravy/x86isel_hasmfence on Mar 7, 2025

Conversation

vchuravy (Contributor)

This extends the optimization to scenarios where the subtarget lacks MFENCE (`!hasMFence()`) or the operation uses SyncScope SingleThread, by avoiding direct use of the `llvm.x86.sse2.mfence` intrinsic.

Originally part of #106555
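For context, this target hook rewrites an idempotent atomic RMW (one that cannot change memory, e.g. `or` with 0) into a fence followed by an ordinary atomic load; the removed FIXMEs explain why this happens at the IR level. Below is a minimal LLVM IR sketch of the before/after shape, mirroring the `or16` test case further down. The ordering placed on the emitted load is an assumption for illustration and is not shown in the excerpted diff.

```llvm
; Before: an idempotent RMW, as in the or16 test case in the diff below.
define i16 @or16(ptr %p) {
  %v = atomicrmw or ptr %p, i16 0 acquire
  ret i16 %v
}

; After (conceptual shape of the lowering with this patch): a seq_cst fence is
; now emitted for every subtarget and sync scope, and later codegen turns it
; into mfence, a dummy `lock or` to the stack, or a compiler-only barrier, as
; appropriate. The acquire ordering on the load is assumed for illustration.
define i16 @or16_lowered(ptr %p) {
  fence seq_cst
  %v = load atomic i16, ptr %p acquire, align 2
  ret i16 %v
}
```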

vchuravy (Contributor, Author)

This stack of pull requests is managed by Graphite.

@vchuravy vchuravy requested a review from RKSimon February 10, 2025 14:27
@vchuravy vchuravy marked this pull request as ready for review February 10, 2025 14:28
@llvmbot (Member)

llvmbot commented Feb 10, 2025

@llvm/pr-subscribers-backend-x86

Author: Valentin Churavy (vchuravy)

Changes

This extends the optimization to scenarios where the subtarget lacks MFENCE (`!hasMFence()`) or the operation uses SyncScope SingleThread, by avoiding direct use of the `llvm.x86.sse2.mfence` intrinsic.

Originally part of #106555


Full diff: https://github.com/llvm/llvm-project/pull/126521.diff

2 Files Affected:

  • (modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+3-14)
  • (modified) llvm/test/CodeGen/X86/atomic-idempotent.ll (+30-56)
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 995b4de12ce12c..8a27bd1acaa2cd 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -31807,21 +31807,10 @@ X86TargetLowering::lowerIdempotentRMWIntoFencedLoad(AtomicRMWInst *AI) const {
   // otherwise, we might be able to be more aggressive on relaxed idempotent
   // rmw. In practice, they do not look useful, so we don't try to be
   // especially clever.
-  if (SSID == SyncScope::SingleThread)
-    // FIXME: we could just insert an ISD::MEMBARRIER here, except we are at
-    // the IR level, so we must wrap it in an intrinsic.
-    return nullptr;
-
-  if (!Subtarget.hasMFence())
-    // FIXME: it might make sense to use a locked operation here but on a
-    // different cache-line to prevent cache-line bouncing. In practice it
-    // is probably a small win, and x86 processors without mfence are rare
-    // enough that we do not bother.
-    return nullptr;
 
-  Function *MFence =
-      llvm::Intrinsic::getOrInsertDeclaration(M, Intrinsic::x86_sse2_mfence);
-  Builder.CreateCall(MFence, {});
+  // Use `fence seq_cst` over `llvm.x64.sse2.mfence` here to get the correct
+  // lowering for SSID == SyncScope::SingleThread and !hasMFence
+  Builder.CreateFence(AtomicOrdering::SequentiallyConsistent, SSID);
 
   // Finally we can emit the atomic load.
   LoadInst *Loaded = Builder.CreateAlignedLoad(
diff --git a/llvm/test/CodeGen/X86/atomic-idempotent.ll b/llvm/test/CodeGen/X86/atomic-idempotent.ll
index 55b4d1af094f67..10e8cfc0ad497e 100644
--- a/llvm/test/CodeGen/X86/atomic-idempotent.ll
+++ b/llvm/test/CodeGen/X86/atomic-idempotent.ll
@@ -27,18 +27,16 @@ define i8 @add8(ptr %p) {
 ;
 ; X86-SLM-LABEL: add8:
 ; X86-SLM:       # %bb.0:
-; X86-SLM-NEXT:    movl {{[0-9]+}}(%esp), %ecx
-; X86-SLM-NEXT:    xorl %eax, %eax
-; X86-SLM-NEXT:    lock xaddb %al, (%ecx)
-; X86-SLM-NEXT:    # kill: def $al killed $al killed $eax
+; X86-SLM-NEXT:    movl {{[0-9]+}}(%esp), %eax
+; X86-SLM-NEXT:    lock orl $0, (%esp)
+; X86-SLM-NEXT:    movzbl (%eax), %eax
 ; X86-SLM-NEXT:    retl
 ;
 ; X86-ATOM-LABEL: add8:
 ; X86-ATOM:       # %bb.0:
-; X86-ATOM-NEXT:    movl {{[0-9]+}}(%esp), %ecx
-; X86-ATOM-NEXT:    xorl %eax, %eax
-; X86-ATOM-NEXT:    lock xaddb %al, (%ecx)
-; X86-ATOM-NEXT:    # kill: def $al killed $al killed $eax
+; X86-ATOM-NEXT:    movl {{[0-9]+}}(%esp), %eax
+; X86-ATOM-NEXT:    lock orl $0, (%esp)
+; X86-ATOM-NEXT:    movzbl (%eax), %eax
 ; X86-ATOM-NEXT:    nop
 ; X86-ATOM-NEXT:    nop
 ; X86-ATOM-NEXT:    retl
@@ -62,26 +60,18 @@ define i16 @or16(ptr %p) {
 ;
 ; X86-SLM-LABEL: or16:
 ; X86-SLM:       # %bb.0:
-; X86-SLM-NEXT:    movl {{[0-9]+}}(%esp), %ecx
-; X86-SLM-NEXT:    movzwl (%ecx), %eax
-; X86-SLM-NEXT:    .p2align 4
-; X86-SLM-NEXT:  .LBB1_1: # %atomicrmw.start
-; X86-SLM-NEXT:    # =>This Inner Loop Header: Depth=1
-; X86-SLM-NEXT:    lock cmpxchgw %ax, (%ecx)
-; X86-SLM-NEXT:    jne .LBB1_1
-; X86-SLM-NEXT:  # %bb.2: # %atomicrmw.end
+; X86-SLM-NEXT:    movl {{[0-9]+}}(%esp), %eax
+; X86-SLM-NEXT:    lock orl $0, (%esp)
+; X86-SLM-NEXT:    movzwl (%eax), %eax
 ; X86-SLM-NEXT:    retl
 ;
 ; X86-ATOM-LABEL: or16:
 ; X86-ATOM:       # %bb.0:
-; X86-ATOM-NEXT:    movl {{[0-9]+}}(%esp), %ecx
-; X86-ATOM-NEXT:    movzwl (%ecx), %eax
-; X86-ATOM-NEXT:    .p2align 4
-; X86-ATOM-NEXT:  .LBB1_1: # %atomicrmw.start
-; X86-ATOM-NEXT:    # =>This Inner Loop Header: Depth=1
-; X86-ATOM-NEXT:    lock cmpxchgw %ax, (%ecx)
-; X86-ATOM-NEXT:    jne .LBB1_1
-; X86-ATOM-NEXT:  # %bb.2: # %atomicrmw.end
+; X86-ATOM-NEXT:    movl {{[0-9]+}}(%esp), %eax
+; X86-ATOM-NEXT:    lock orl $0, (%esp)
+; X86-ATOM-NEXT:    movzwl (%eax), %eax
+; X86-ATOM-NEXT:    nop
+; X86-ATOM-NEXT:    nop
 ; X86-ATOM-NEXT:    retl
   %1 = atomicrmw or ptr %p, i16 0 acquire
   ret i16 %1
@@ -103,26 +93,18 @@ define i32 @xor32(ptr %p) {
 ;
 ; X86-SLM-LABEL: xor32:
 ; X86-SLM:       # %bb.0:
-; X86-SLM-NEXT:    movl {{[0-9]+}}(%esp), %ecx
-; X86-SLM-NEXT:    movl (%ecx), %eax
-; X86-SLM-NEXT:    .p2align 4
-; X86-SLM-NEXT:  .LBB2_1: # %atomicrmw.start
-; X86-SLM-NEXT:    # =>This Inner Loop Header: Depth=1
-; X86-SLM-NEXT:    lock cmpxchgl %eax, (%ecx)
-; X86-SLM-NEXT:    jne .LBB2_1
-; X86-SLM-NEXT:  # %bb.2: # %atomicrmw.end
+; X86-SLM-NEXT:    movl {{[0-9]+}}(%esp), %eax
+; X86-SLM-NEXT:    lock orl $0, (%esp)
+; X86-SLM-NEXT:    movl (%eax), %eax
 ; X86-SLM-NEXT:    retl
 ;
 ; X86-ATOM-LABEL: xor32:
 ; X86-ATOM:       # %bb.0:
-; X86-ATOM-NEXT:    movl {{[0-9]+}}(%esp), %ecx
-; X86-ATOM-NEXT:    movl (%ecx), %eax
-; X86-ATOM-NEXT:    .p2align 4
-; X86-ATOM-NEXT:  .LBB2_1: # %atomicrmw.start
-; X86-ATOM-NEXT:    # =>This Inner Loop Header: Depth=1
-; X86-ATOM-NEXT:    lock cmpxchgl %eax, (%ecx)
-; X86-ATOM-NEXT:    jne .LBB2_1
-; X86-ATOM-NEXT:  # %bb.2: # %atomicrmw.end
+; X86-ATOM-NEXT:    movl {{[0-9]+}}(%esp), %eax
+; X86-ATOM-NEXT:    lock orl $0, (%esp)
+; X86-ATOM-NEXT:    movl (%eax), %eax
+; X86-ATOM-NEXT:    nop
+; X86-ATOM-NEXT:    nop
 ; X86-ATOM-NEXT:    retl
   %1 = atomicrmw xor ptr %p, i32 0 release
   ret i32 %1
@@ -318,26 +300,18 @@ define i32 @and32 (ptr %p) {
 ;
 ; X86-SLM-LABEL: and32:
 ; X86-SLM:       # %bb.0:
-; X86-SLM-NEXT:    movl {{[0-9]+}}(%esp), %ecx
-; X86-SLM-NEXT:    movl (%ecx), %eax
-; X86-SLM-NEXT:    .p2align 4
-; X86-SLM-NEXT:  .LBB5_1: # %atomicrmw.start
-; X86-SLM-NEXT:    # =>This Inner Loop Header: Depth=1
-; X86-SLM-NEXT:    lock cmpxchgl %eax, (%ecx)
-; X86-SLM-NEXT:    jne .LBB5_1
-; X86-SLM-NEXT:  # %bb.2: # %atomicrmw.end
+; X86-SLM-NEXT:    movl {{[0-9]+}}(%esp), %eax
+; X86-SLM-NEXT:    lock orl $0, (%esp)
+; X86-SLM-NEXT:    movl (%eax), %eax
 ; X86-SLM-NEXT:    retl
 ;
 ; X86-ATOM-LABEL: and32:
 ; X86-ATOM:       # %bb.0:
-; X86-ATOM-NEXT:    movl {{[0-9]+}}(%esp), %ecx
-; X86-ATOM-NEXT:    movl (%ecx), %eax
-; X86-ATOM-NEXT:    .p2align 4
-; X86-ATOM-NEXT:  .LBB5_1: # %atomicrmw.start
-; X86-ATOM-NEXT:    # =>This Inner Loop Header: Depth=1
-; X86-ATOM-NEXT:    lock cmpxchgl %eax, (%ecx)
-; X86-ATOM-NEXT:    jne .LBB5_1
-; X86-ATOM-NEXT:  # %bb.2: # %atomicrmw.end
+; X86-ATOM-NEXT:    movl {{[0-9]+}}(%esp), %eax
+; X86-ATOM-NEXT:    lock orl $0, (%esp)
+; X86-ATOM-NEXT:    movl (%eax), %eax
+; X86-ATOM-NEXT:    nop
+; X86-ATOM-NEXT:    nop
 ; X86-ATOM-NEXT:    retl
   %1 = atomicrmw and ptr %p, i32 -1 acq_rel
   ret i32 %1

Inline review comment on llvm/lib/Target/X86/X86ISelLowering.cpp:

@@ -31807,21 +31807,10 @@ X86TargetLowering::lowerIdempotentRMWIntoFencedLoad(AtomicRMWInst *AI) const {
// otherwise, we might be able to be more aggressive on relaxed idempotent
// rmw. In practice, they do not look useful, so we don't try to be
// especially clever.
if (SSID == SyncScope::SingleThread)

Contributor

Do we have a test case to cover this condition?

vchuravy (Contributor, Author)

In 02f6abd I added a test, and a01668d shows the changes with this PR.
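
For illustration only (the actual test lives in the commits referenced above and is not reproduced here), a hypothetical input exercising the singlethread path could look like the sketch below. The old code bailed out and kept the lock-prefixed RMW for this scope; with this patch the RMW still becomes a fence plus a load, and the singlethread-scoped `fence seq_cst` does not need an `mfence`.

```llvm
; Hypothetical example, not the committed test: an idempotent RMW restricted
; to the singlethread sync scope. The new lowering emits
; `fence syncscope("singlethread") seq_cst` followed by the load instead of
; returning nullptr and keeping the lock-prefixed RMW.
define i16 @or16_singlethread(ptr %p) {
  %v = atomicrmw or ptr %p, i16 0 syncscope("singlethread") acquire
  ret i16 %v
}
```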

@phoebewang (Contributor) left a comment


I'm not an expert on atomics, but the changes in the tests and the old comments indicate it's correct to do so.

I'll give my +1, but it'd be better if someone familiar with atomics took another look.

@giordano (Contributor) commented Mar 6, 2025

Bump for a review.

@phoebewang (Contributor)

If there are no objections, I'll merge it tomorrow.

@RKSimon (Collaborator) left a comment


LGTM with one minor

vchuravy added 4 commits March 6, 2025 15:14
@vchuravy vchuravy force-pushed the users/vchuravy/x86isel_hasmfence branch from a01668d to af8f6d4 Compare March 6, 2025 15:27
@vchuravy vchuravy merged commit ed6bde9 into main Mar 7, 2025
11 checks passed
@vchuravy vchuravy deleted the users/vchuravy/x86isel_hasmfence branch March 7, 2025 05:39
@kazutakahirata (Contributor)

@vchuravy I've landed f83eeac to fix an unused variable warning. Thanks!

giordano pushed a commit to JuliaLang/llvm-project that referenced this pull request Mar 11, 2025
(cherry picked from commit ed6bde9)
jph-13 pushed a commit to jph-13/llvm-project that referenced this pull request Mar 21, 2025
Zentrik pushed a commit to JuliaLang/llvm-project that referenced this pull request Apr 14, 2025
(cherry picked from commit ed6bde9)