[AArch64] Create set.fpmr intrinsic and assembly lowering #114248
Conversation
This patch introduces a new llvm.set.fpmr intrinsic for setting the value of the FPMR register and adds its lowering to a series of read-compare-write instructions. This intrinsic will be generated during the lowering of FP8 C intrinsics into LLVM IR, introduced in a later patch.
@llvm/pr-subscribers-backend-aarch64 @llvm/pr-subscribers-llvm-ir
Author: None (Lukacma)
Full diff: https://github.com/llvm/llvm-project/pull/114248.diff (6 files affected)
diff --git a/llvm/include/llvm/IR/IntrinsicsAArch64.td b/llvm/include/llvm/IR/IntrinsicsAArch64.td
index 594069c619ceb0..4579f3fda523c3 100644
--- a/llvm/include/llvm/IR/IntrinsicsAArch64.td
+++ b/llvm/include/llvm/IR/IntrinsicsAArch64.td
@@ -778,6 +778,9 @@ def int_aarch64_get_fpcr : FPENV_Get_Intrinsic;
def int_aarch64_set_fpcr : FPENV_Set_Intrinsic;
def int_aarch64_get_fpsr : FPENV_Get_Intrinsic;
def int_aarch64_set_fpsr : FPENV_Set_Intrinsic;
+def int_aarch64_set_fpmr : DefaultAttrsIntrinsic<[], [llvm_i64_ty], [IntrInaccessibleMemOnly]>{
+ let TargetPrefix = "aarch64";
+}
// Armv8.5-A Random number generation intrinsics
def int_aarch64_rndr : RNDR_Intrinsic;
diff --git a/llvm/lib/CodeGen/LivePhysRegs.cpp b/llvm/lib/CodeGen/LivePhysRegs.cpp
index 96380d40848257..330ea65770847a 100644
--- a/llvm/lib/CodeGen/LivePhysRegs.cpp
+++ b/llvm/lib/CodeGen/LivePhysRegs.cpp
@@ -262,7 +262,7 @@ void llvm::addLiveIns(MachineBasicBlock &MBB, const LivePhysRegs &LiveRegs) {
const MachineRegisterInfo &MRI = MF.getRegInfo();
const TargetRegisterInfo &TRI = *MRI.getTargetRegisterInfo();
for (MCPhysReg Reg : LiveRegs) {
- if (MRI.isReserved(Reg))
+ if (TRI.getReservedRegs(MF).test(Reg))
continue;
// Skip the register if we are about to add one of its super registers.
if (any_of(TRI.superregs(Reg), [&](MCPhysReg SReg) {
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 32ba2866ac8180..f5d08bf8d57a4d 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -3007,6 +3007,49 @@ MachineBasicBlock *AArch64TargetLowering::EmitLoweredCatchRet(
return BB;
}
+MachineBasicBlock *
+AArch64TargetLowering::EmitLoweredSetFpmr(MachineInstr &MI,
+ MachineBasicBlock *MBB) const {
+ MachineFunction *MF = MBB->getParent();
+ const TargetInstrInfo *TII = Subtarget->getInstrInfo();
+ DebugLoc DL = MI.getDebugLoc();
+ const BasicBlock *LLVM_BB = MBB->getBasicBlock();
+ Register NewFpmrVal = MI.getOperand(0).getReg();
+
+ // Test if FPMR is set correctly already
+ Register OldFpmrVal =
+ MI.getMF()->getRegInfo().createVirtualRegister(&AArch64::GPR64RegClass);
+ BuildMI(*MBB, MI, DL, TII->get(AArch64::MRS), OldFpmrVal)
+ .addImm(0xda22)
+ .addUse(AArch64::FPMR, RegState::Implicit);
+ BuildMI(*MBB, MI, DL, TII->get(AArch64::SUBSXrs), AArch64::XZR)
+ .addReg(OldFpmrVal)
+ .addReg(NewFpmrVal)
+ .addImm(0);
+
+ MachineBasicBlock *MsrBB = MF->CreateMachineBasicBlock(LLVM_BB);
+ // Transfer rest of current basic-block to EndBB
+ MachineBasicBlock *EndBB = MBB->splitAt(MI);
+ MF->insert(++MBB->getIterator(), MsrBB);
+
+ // If already set continue
+ BuildMI(*MBB, MI, DL, TII->get(AArch64::Bcc))
+ .addImm(AArch64CC::EQ)
+ .addMBB(EndBB);
+
+ BuildMI(*MsrBB, MsrBB->begin(), DL, TII->get(AArch64::MSR))
+ .addImm(0xda22)
+ .addReg(NewFpmrVal)
+ .addDef(AArch64::FPMR, RegState::Implicit);
+
+ MBB->addSuccessor(MsrBB);
+ // MsrBB falls through to the end.
+ MsrBB->addSuccessor(EndBB);
+
+ MI.eraseFromParent();
+ return EndBB;
+}
+
MachineBasicBlock *
AArch64TargetLowering::EmitDynamicProbedAlloc(MachineInstr &MI,
MachineBasicBlock *MBB) const {
@@ -3292,6 +3335,8 @@ MachineBasicBlock *AArch64TargetLowering::EmitInstrWithCustomInserter(
return EmitZTInstr(MI, BB, AArch64::ZERO_T, /*Op0IsDef=*/true);
case AArch64::MOVT_TIZ_PSEUDO:
return EmitZTInstr(MI, BB, AArch64::MOVT_TIZ, /*Op0IsDef=*/true);
+ case AArch64::SET_FPMR:
+ return EmitLoweredSetFpmr(MI, BB);
}
}
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.h b/llvm/lib/Target/AArch64/AArch64ISelLowering.h
index d696355bb062a8..4d72c93d255821 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.h
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.h
@@ -647,6 +647,9 @@ class AArch64TargetLowering : public TargetLowering {
MachineBasicBlock *EmitLoweredCatchRet(MachineInstr &MI,
MachineBasicBlock *BB) const;
+ MachineBasicBlock *EmitLoweredSetFpmr(MachineInstr &MI,
+ MachineBasicBlock *BB) const;
+
MachineBasicBlock *EmitDynamicProbedAlloc(MachineInstr &MI,
MachineBasicBlock *MBB) const;
diff --git a/llvm/lib/Target/AArch64/AArch64InstrInfo.td b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
index 6194de2d56b630..be62daceeed750 100644
--- a/llvm/lib/Target/AArch64/AArch64InstrInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
@@ -2145,6 +2145,11 @@ def MSR_FPSR : Pseudo<(outs), (ins GPR64:$val),
PseudoInstExpansion<(MSR 0xda21, GPR64:$val)>,
Sched<[WriteSys]>;
+let Uses = [FPMR], Defs = [FPMR, NZCV], usesCustomInserter = 1 in
+def SET_FPMR : Pseudo<(outs), (ins GPR64:$val),
+ [(int_aarch64_set_fpmr i64:$val)]>,
+ Sched<[WriteSys]>;
+
// Generic system instructions
def SYSxt : SystemXtI<0, "sys">;
def SYSLxt : SystemLXtI<1, "sysl">;
diff --git a/llvm/test/CodeGen/AArch64/arm64-fpenv.ll b/llvm/test/CodeGen/AArch64/arm64-fpenv.ll
index 030809caee3394..deb0a2df7b7fba 100644
--- a/llvm/test/CodeGen/AArch64/arm64-fpenv.ll
+++ b/llvm/test/CodeGen/AArch64/arm64-fpenv.ll
@@ -1,5 +1,5 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
-; RUN: llc -mtriple=aarch64 < %s | FileCheck %s
+; RUN: llc -mtriple=aarch64 -mattr=+fpmr -verify-machineinstrs < %s | FileCheck %s
define i64 @get_fpcr() #0 {
; CHECK-LABEL: get_fpcr:
@@ -37,6 +37,20 @@ define void @set_fpsr(i64 %sr) {
ret void
}
+define dso_local void @set_fpmr(i64 %sr) {
+; CHECK-LABEL: set_fpmr:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mrs x8, FPMR
+; CHECK-NEXT: cmp x8, x0
+; CHECK-NEXT: b.eq .LBB4_2
+; CHECK-NEXT: // %bb.1:
+; CHECK-NEXT: msr FPMR, x0
+; CHECK-NEXT: .LBB4_2:
+; CHECK-NEXT: ret
+ call void @llvm.aarch64.set.fpmr(i64 %sr)
+ ret void
+}
+
declare i64 @llvm.aarch64.get.fpcr()
declare void @llvm.aarch64.set.fpcr(i64)
declare i64 @llvm.aarch64.get.fpsr()
Can you explain how these are intended to be used? I am maybe a bit out of touch but I was under the impression that the FP8 intrinsics had fpmr operands to make data-flow analysis possible. Does a stand-alone intrinsic not go against that?
Do you mean that the compiler would have a pass to improve the writes into FPMR? For this implementation the compiler would generate a write to FPMR for every FP8 intrinsic.
Hi. Yeah, sorry, there were two issues I had.
Hello David, thank you for your feedback! I have now fixed the first issue you had with the patch. As for the second one, for now I don't think creating data dependencies is necessary. Functionally, dependencies between defs and uses of FPMR are going to be handled as a memory dependency through inaccessible memory. I do not see a "nice" way to handle overlapping ranges without using a memory dependency, but if you know of one I would be keen to hear about it, and we might change to a data-only dependency later. I agree that it will help with optimization not to have to search through the program for where the FPMR value might be coming from, but that is something we will think about once we start optimizing these intrinsics.
As you ask here I will try to reply: we had a conversation about this internally, and it seemed the response was that we plan to look into this more and hopefully change it later. (My worry with that is that it will be harder to do the more time is spent with them working this way.) As for handling overlapping ranges, it would just be a case of sinking the intrinsic to the use; it sounds like this will need a pass to do the placement in either case. But I don't actually mean to say that it needs to remove the memory side effect entirely. It could keep the side effect to limit how much the intrinsics move, and have the benefits of knowing the FPMR value and no overlapping ranges at the same time. You can even remove one of them later if it is decided they are not useful (going the other way is obviously more difficult). Thanks for simplifying the codegen though; hopefully we won't end up needing it in the future, and if we do we can always resurrect it.
This patch introduces a new llvm.set.fpmr intrinsic for setting the value of the FPMR register and adds its lowering to a series of read-compare-write instructions. This intrinsic will be generated during the lowering of FP8 C intrinsics into LLVM IR, introduced in a later patch. ***This is an experimental implementation of handling FP8 intrinsics and is likely to change in the future.***