[NVPTX] Avoid introducing unnecessary ProxyRegs and Movs in ISel #120486


Merged
2 commits merged into llvm:main on Dec 19, 2024

Conversation

@AlexMaclean (Member)

Avoid introducing ProxyReg and MOV nodes during ISel when lowering bitconvert or similar operations. These nodes are all erased by a later pass, but not introducing them in the first place is simpler and likely saves compile time.

Also remove redundant MOV instruction definitions.

@llvmbot (Member) commented Dec 18, 2024

@llvm/pr-subscribers-backend-nvptx

Author: Alex MacLean (AlexMaclean)


Patch is 201.12 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/120486.diff

30 Files Affected:

  • (modified) llvm/lib/Target/NVPTX/NVPTXInstrInfo.td (+14-25)
  • (modified) llvm/lib/Target/NVPTX/NVPTXIntrinsics.td (+2-2)
  • (modified) llvm/test/CodeGen/NVPTX/atomics-sm70.ll (+13-13)
  • (modified) llvm/test/CodeGen/NVPTX/atomics-sm90.ll (+24-24)
  • (modified) llvm/test/CodeGen/NVPTX/bf16-instructions.ll (+68-68)
  • (modified) llvm/test/CodeGen/NVPTX/bf16x2-instructions.ll (+20-20)
  • (modified) llvm/test/CodeGen/NVPTX/chain-different-as.ll (+2-2)
  • (modified) llvm/test/CodeGen/NVPTX/cmpxchg.ll (+1-1)
  • (modified) llvm/test/CodeGen/NVPTX/compute-ptx-value-vts.ll (+2-2)
  • (modified) llvm/test/CodeGen/NVPTX/demote-vars.ll (+1-1)
  • (modified) llvm/test/CodeGen/NVPTX/extractelement.ll (+21-21)
  • (modified) llvm/test/CodeGen/NVPTX/f16x2-instructions.ll (+69-69)
  • (modified) llvm/test/CodeGen/NVPTX/fma-relu-contract.ll (+84-84)
  • (modified) llvm/test/CodeGen/NVPTX/fma-relu-fma-intrinsic.ll (+53-53)
  • (modified) llvm/test/CodeGen/NVPTX/fma-relu-instruction-flag.ll (+122-122)
  • (modified) llvm/test/CodeGen/NVPTX/i1-load-lower.ll (+1-1)
  • (modified) llvm/test/CodeGen/NVPTX/i128.ll (+9-9)
  • (modified) llvm/test/CodeGen/NVPTX/i16x2-instructions.ll (+36-36)
  • (modified) llvm/test/CodeGen/NVPTX/i8x4-instructions.ll (+86-86)
  • (modified) llvm/test/CodeGen/NVPTX/inline-asm-b128-test1.ll (+3-3)
  • (modified) llvm/test/CodeGen/NVPTX/inline-asm-b128-test2.ll (+4-4)
  • (modified) llvm/test/CodeGen/NVPTX/inline-asm-b128-test3.ll (+1-1)
  • (modified) llvm/test/CodeGen/NVPTX/math-intrins.ll (+88-88)
  • (modified) llvm/test/CodeGen/NVPTX/misched_func_call.ll (+1-1)
  • (modified) llvm/test/CodeGen/NVPTX/pr13291-i1-store.ll (+2-2)
  • (modified) llvm/test/CodeGen/NVPTX/reg-types.ll (+6-6)
  • (modified) llvm/test/CodeGen/NVPTX/unfold-masked-merge-vector-variablemask.ll (+130-130)
  • (modified) llvm/test/CodeGen/NVPTX/vaargs.ll (+1-1)
  • (modified) llvm/test/CodeGen/NVPTX/variadics-backend.ll (+6-6)
  • (modified) llvm/test/CodeGen/NVPTX/vector-returns.ll (+25-25)
diff --git a/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td b/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
index abaf8e0b0ec1f8..eb4918c43f0dce 100644
--- a/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
+++ b/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
@@ -1994,22 +1994,15 @@ let IsSimpleMove=1, hasSideEffects=0 in {
 def IMOV1ri : NVPTXInst<(outs Int1Regs:$dst), (ins i1imm:$src),
                         "mov.pred \t$dst, $src;",
                         [(set i1:$dst, imm:$src)]>;
-def IMOV16ri : NVPTXInst<(outs Int16Regs:$dst), (ins i16imm:$src),
-                         "mov.u16 \t$dst, $src;",
-                         [(set i16:$dst, imm:$src)]>;
-def IMOV32ri : NVPTXInst<(outs Int32Regs:$dst), (ins i32imm:$src),
-                         "mov.u32 \t$dst, $src;",
-                         [(set i32:$dst, imm:$src)]>;
-def IMOV64ri : NVPTXInst<(outs Int64Regs:$dst), (ins i64imm:$src),
-                        "mov.u64 \t$dst, $src;",
-                        [(set i64:$dst, imm:$src)]>;
-
 def IMOVB16ri : NVPTXInst<(outs Int16Regs:$dst), (ins i16imm:$src),
-                         "mov.b16 \t$dst, $src;", []>;
+                         "mov.b16 \t$dst, $src;",
+                         [(set i16:$dst, imm:$src)]>;
 def IMOVB32ri : NVPTXInst<(outs Int32Regs:$dst), (ins i32imm:$src),
-                         "mov.b32 \t$dst, $src;", []>;
+                         "mov.b32 \t$dst, $src;",
+                         [(set i32:$dst, imm:$src)]>;
 def IMOVB64ri : NVPTXInst<(outs Int64Regs:$dst), (ins i64imm:$src),
-                        "mov.b64 \t$dst, $src;", []>;
+                        "mov.b64 \t$dst, $src;",
+                        [(set i64:$dst, imm:$src)]>;
 
 def FMOV32ri : NVPTXInst<(outs Float32Regs:$dst), (ins f32imm:$src),
                          "mov.f32 \t$dst, $src;",
@@ -2018,8 +2011,8 @@ def FMOV64ri : NVPTXInst<(outs Float64Regs:$dst), (ins f64imm:$src),
                          "mov.f64 \t$dst, $src;",
                          [(set f64:$dst, fpimm:$src)]>;
 
-def : Pat<(i32 (Wrapper texternalsym:$dst)), (IMOV32ri texternalsym:$dst)>;
-def : Pat<(i64 (Wrapper texternalsym:$dst)), (IMOV64ri texternalsym:$dst)>;
+def : Pat<(i32 (Wrapper texternalsym:$dst)), (IMOVB32ri texternalsym:$dst)>;
+def : Pat<(i64 (Wrapper texternalsym:$dst)), (IMOVB64ri texternalsym:$dst)>;
 
 //---- Copy Frame Index ----
 def LEA_ADDRi :   NVPTXInst<(outs Int32Regs:$dst), (ins MEMri:$addr),
@@ -3104,21 +3097,17 @@ def: Pat<(f32 (bitconvert vt:$a)),
          (BITCONVERT_32_I2F Int32Regs:$a)>;
 }
 foreach vt = [f16, bf16] in {
-def: Pat<(vt (bitconvert (i16 UInt16Const:$a))),
-         (IMOVB16ri UInt16Const:$a)>;
-def: Pat<(vt (bitconvert i16:$a)),
-         (ProxyRegI16 Int16Regs:$a)>;
-def: Pat<(i16 (bitconvert vt:$a)),
-         (ProxyRegI16 Int16Regs:$a)>;
+  def: Pat<(vt (bitconvert i16:$a)),
+           (vt Int16Regs:$a)>;
+  def: Pat<(i16 (bitconvert vt:$a)),
+           (i16 Int16Regs:$a)>;
 }
 
 foreach ta = [v2f16, v2bf16, v2i16, v4i8, i32] in {
-  def: Pat<(ta (bitconvert (i32 UInt32Const:$a))),
-           (IMOVB32ri UInt32Const:$a)>;
   foreach tb = [v2f16, v2bf16, v2i16, v4i8, i32] in {
     if !ne(ta, tb) then {
-      def: Pat<(ta (bitconvert (tb Int32Regs:$a))),
-             (ProxyRegI32 Int32Regs:$a)>;
+      def: Pat<(ta (bitconvert tb:$a)),
+               (ta Int32Regs:$a)>;
     }
   }
 }
diff --git a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
index 256161d5d79c77..6d4a56f191825b 100644
--- a/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
+++ b/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td
@@ -2803,10 +2803,10 @@ def : Pat<(int_nvvm_ptr_param_to_gen i64:$src),
 
 // nvvm.ptr.gen.to.param
 def : Pat<(int_nvvm_ptr_gen_to_param i32:$src),
-          (IMOV32rr Int32Regs:$src)>;
+          (i32 Int32Regs:$src)>;
 
 def : Pat<(int_nvvm_ptr_gen_to_param i64:$src),
-          (IMOV64rr Int64Regs:$src)>;
+          (i64 Int64Regs:$src)>;
 
 // nvvm.move intrinsicc
 def nvvm_move_i16 : NVPTXInst<(outs Int16Regs:$r), (ins Int16Regs:$s),
diff --git a/llvm/test/CodeGen/NVPTX/atomics-sm70.ll b/llvm/test/CodeGen/NVPTX/atomics-sm70.ll
index bae7109288b99f..05f466f2138ec1 100644
--- a/llvm/test/CodeGen/NVPTX/atomics-sm70.ll
+++ b/llvm/test/CodeGen/NVPTX/atomics-sm70.ll
@@ -46,7 +46,7 @@ define void @test(ptr %dp0, ptr addrspace(1) %dp1, ptr addrspace(3) %dp3, half %
 ; CHECKPTX62-LABEL: test(
 ; CHECKPTX62:       {
 ; CHECKPTX62-NEXT:    .reg .pred %p<5>;
-; CHECKPTX62-NEXT:    .reg .b16 %rs<19>;
+; CHECKPTX62-NEXT:    .reg .b16 %rs<11>;
 ; CHECKPTX62-NEXT:    .reg .b32 %r<58>;
 ; CHECKPTX62-EMPTY:
 ; CHECKPTX62-NEXT:  // %bb.0:
@@ -65,8 +65,8 @@ define void @test(ptr %dp0, ptr addrspace(1) %dp1, ptr addrspace(3) %dp3, half %
 ; CHECKPTX62-NEXT:    // =>This Inner Loop Header: Depth=1
 ; CHECKPTX62-NEXT:    shr.u32 %r28, %r54, %r2;
 ; CHECKPTX62-NEXT:    cvt.u16.u32 %rs2, %r28;
-; CHECKPTX62-NEXT:    add.rn.f16 %rs4, %rs2, %rs1;
-; CHECKPTX62-NEXT:    cvt.u32.u16 %r29, %rs4;
+; CHECKPTX62-NEXT:    add.rn.f16 %rs3, %rs2, %rs1;
+; CHECKPTX62-NEXT:    cvt.u32.u16 %r29, %rs3;
 ; CHECKPTX62-NEXT:    shl.b32 %r30, %r29, %r2;
 ; CHECKPTX62-NEXT:    and.b32 %r31, %r54, %r3;
 ; CHECKPTX62-NEXT:    or.b32 %r32, %r31, %r30;
@@ -79,10 +79,10 @@ define void @test(ptr %dp0, ptr addrspace(1) %dp1, ptr addrspace(3) %dp3, half %
 ; CHECKPTX62-NEXT:  $L__BB0_3: // %atomicrmw.start27
 ; CHECKPTX62-NEXT:    // =>This Inner Loop Header: Depth=1
 ; CHECKPTX62-NEXT:    shr.u32 %r33, %r55, %r2;
-; CHECKPTX62-NEXT:    cvt.u16.u32 %rs6, %r33;
-; CHECKPTX62-NEXT:    mov.b16 %rs8, 0x3C00;
-; CHECKPTX62-NEXT:    add.rn.f16 %rs9, %rs6, %rs8;
-; CHECKPTX62-NEXT:    cvt.u32.u16 %r34, %rs9;
+; CHECKPTX62-NEXT:    cvt.u16.u32 %rs4, %r33;
+; CHECKPTX62-NEXT:    mov.b16 %rs5, 0x3C00;
+; CHECKPTX62-NEXT:    add.rn.f16 %rs6, %rs4, %rs5;
+; CHECKPTX62-NEXT:    cvt.u32.u16 %r34, %rs6;
 ; CHECKPTX62-NEXT:    shl.b32 %r35, %r34, %r2;
 ; CHECKPTX62-NEXT:    and.b32 %r36, %r55, %r3;
 ; CHECKPTX62-NEXT:    or.b32 %r37, %r36, %r35;
@@ -100,9 +100,9 @@ define void @test(ptr %dp0, ptr addrspace(1) %dp1, ptr addrspace(3) %dp3, half %
 ; CHECKPTX62-NEXT:  $L__BB0_5: // %atomicrmw.start9
 ; CHECKPTX62-NEXT:    // =>This Inner Loop Header: Depth=1
 ; CHECKPTX62-NEXT:    shr.u32 %r41, %r56, %r11;
-; CHECKPTX62-NEXT:    cvt.u16.u32 %rs11, %r41;
-; CHECKPTX62-NEXT:    add.rn.f16 %rs13, %rs11, %rs1;
-; CHECKPTX62-NEXT:    cvt.u32.u16 %r42, %rs13;
+; CHECKPTX62-NEXT:    cvt.u16.u32 %rs7, %r41;
+; CHECKPTX62-NEXT:    add.rn.f16 %rs8, %rs7, %rs1;
+; CHECKPTX62-NEXT:    cvt.u32.u16 %r42, %rs8;
 ; CHECKPTX62-NEXT:    shl.b32 %r43, %r42, %r11;
 ; CHECKPTX62-NEXT:    and.b32 %r44, %r56, %r12;
 ; CHECKPTX62-NEXT:    or.b32 %r45, %r44, %r43;
@@ -120,9 +120,9 @@ define void @test(ptr %dp0, ptr addrspace(1) %dp1, ptr addrspace(3) %dp3, half %
 ; CHECKPTX62-NEXT:  $L__BB0_7: // %atomicrmw.start
 ; CHECKPTX62-NEXT:    // =>This Inner Loop Header: Depth=1
 ; CHECKPTX62-NEXT:    shr.u32 %r49, %r57, %r17;
-; CHECKPTX62-NEXT:    cvt.u16.u32 %rs15, %r49;
-; CHECKPTX62-NEXT:    add.rn.f16 %rs17, %rs15, %rs1;
-; CHECKPTX62-NEXT:    cvt.u32.u16 %r50, %rs17;
+; CHECKPTX62-NEXT:    cvt.u16.u32 %rs9, %r49;
+; CHECKPTX62-NEXT:    add.rn.f16 %rs10, %rs9, %rs1;
+; CHECKPTX62-NEXT:    cvt.u32.u16 %r50, %rs10;
 ; CHECKPTX62-NEXT:    shl.b32 %r51, %r50, %r17;
 ; CHECKPTX62-NEXT:    and.b32 %r52, %r57, %r18;
 ; CHECKPTX62-NEXT:    or.b32 %r53, %r52, %r51;
diff --git a/llvm/test/CodeGen/NVPTX/atomics-sm90.ll b/llvm/test/CodeGen/NVPTX/atomics-sm90.ll
index 9ddb82321b4ea2..f81b785f13225c 100644
--- a/llvm/test/CodeGen/NVPTX/atomics-sm90.ll
+++ b/llvm/test/CodeGen/NVPTX/atomics-sm90.ll
@@ -46,7 +46,7 @@ define void @test(ptr %dp0, ptr addrspace(1) %dp1, ptr addrspace(3) %dp3, bfloat
 ; CHECKPTX71-LABEL: test(
 ; CHECKPTX71:       {
 ; CHECKPTX71-NEXT:    .reg .pred %p<5>;
-; CHECKPTX71-NEXT:    .reg .b16 %rs<34>;
+; CHECKPTX71-NEXT:    .reg .b16 %rs<22>;
 ; CHECKPTX71-NEXT:    .reg .b32 %r<4>;
 ; CHECKPTX71-NEXT:    .reg .f32 %f<12>;
 ; CHECKPTX71-EMPTY:
@@ -55,49 +55,49 @@ define void @test(ptr %dp0, ptr addrspace(1) %dp1, ptr addrspace(3) %dp3, bfloat
 ; CHECKPTX71-NEXT:    ld.param.u32 %r3, [test_param_2];
 ; CHECKPTX71-NEXT:    ld.param.u32 %r2, [test_param_1];
 ; CHECKPTX71-NEXT:    ld.param.u32 %r1, [test_param_0];
-; CHECKPTX71-NEXT:    ld.b16 %rs30, [%r1];
+; CHECKPTX71-NEXT:    ld.b16 %rs18, [%r1];
 ; CHECKPTX71-NEXT:    cvt.f32.bf16 %f1, %rs13;
 ; CHECKPTX71-NEXT:  $L__BB0_1: // %atomicrmw.start14
 ; CHECKPTX71-NEXT:    // =>This Inner Loop Header: Depth=1
-; CHECKPTX71-NEXT:    cvt.f32.bf16 %f2, %rs30;
+; CHECKPTX71-NEXT:    cvt.f32.bf16 %f2, %rs18;
 ; CHECKPTX71-NEXT:    add.rn.f32 %f3, %f2, %f1;
 ; CHECKPTX71-NEXT:    cvt.rn.bf16.f32 %rs14, %f3;
-; CHECKPTX71-NEXT:    atom.cas.b16 %rs17, [%r1], %rs30, %rs14;
-; CHECKPTX71-NEXT:    setp.ne.s16 %p1, %rs17, %rs30;
-; CHECKPTX71-NEXT:    mov.u16 %rs30, %rs17;
+; CHECKPTX71-NEXT:    atom.cas.b16 %rs3, [%r1], %rs18, %rs14;
+; CHECKPTX71-NEXT:    setp.ne.s16 %p1, %rs3, %rs18;
+; CHECKPTX71-NEXT:    mov.u16 %rs18, %rs3;
 ; CHECKPTX71-NEXT:    @%p1 bra $L__BB0_1;
 ; CHECKPTX71-NEXT:  // %bb.2: // %atomicrmw.end13
-; CHECKPTX71-NEXT:    ld.b16 %rs31, [%r1];
+; CHECKPTX71-NEXT:    ld.b16 %rs19, [%r1];
 ; CHECKPTX71-NEXT:  $L__BB0_3: // %atomicrmw.start8
 ; CHECKPTX71-NEXT:    // =>This Inner Loop Header: Depth=1
-; CHECKPTX71-NEXT:    cvt.f32.bf16 %f4, %rs31;
+; CHECKPTX71-NEXT:    cvt.f32.bf16 %f4, %rs19;
 ; CHECKPTX71-NEXT:    add.rn.f32 %f5, %f4, 0f3F800000;
-; CHECKPTX71-NEXT:    cvt.rn.bf16.f32 %rs18, %f5;
-; CHECKPTX71-NEXT:    atom.cas.b16 %rs21, [%r1], %rs31, %rs18;
-; CHECKPTX71-NEXT:    setp.ne.s16 %p2, %rs21, %rs31;
-; CHECKPTX71-NEXT:    mov.u16 %rs31, %rs21;
+; CHECKPTX71-NEXT:    cvt.rn.bf16.f32 %rs15, %f5;
+; CHECKPTX71-NEXT:    atom.cas.b16 %rs6, [%r1], %rs19, %rs15;
+; CHECKPTX71-NEXT:    setp.ne.s16 %p2, %rs6, %rs19;
+; CHECKPTX71-NEXT:    mov.u16 %rs19, %rs6;
 ; CHECKPTX71-NEXT:    @%p2 bra $L__BB0_3;
 ; CHECKPTX71-NEXT:  // %bb.4: // %atomicrmw.end7
-; CHECKPTX71-NEXT:    ld.global.b16 %rs32, [%r2];
+; CHECKPTX71-NEXT:    ld.global.b16 %rs20, [%r2];
 ; CHECKPTX71-NEXT:  $L__BB0_5: // %atomicrmw.start2
 ; CHECKPTX71-NEXT:    // =>This Inner Loop Header: Depth=1
-; CHECKPTX71-NEXT:    cvt.f32.bf16 %f7, %rs32;
+; CHECKPTX71-NEXT:    cvt.f32.bf16 %f7, %rs20;
 ; CHECKPTX71-NEXT:    add.rn.f32 %f8, %f7, %f1;
-; CHECKPTX71-NEXT:    cvt.rn.bf16.f32 %rs22, %f8;
-; CHECKPTX71-NEXT:    atom.global.cas.b16 %rs25, [%r2], %rs32, %rs22;
-; CHECKPTX71-NEXT:    setp.ne.s16 %p3, %rs25, %rs32;
-; CHECKPTX71-NEXT:    mov.u16 %rs32, %rs25;
+; CHECKPTX71-NEXT:    cvt.rn.bf16.f32 %rs16, %f8;
+; CHECKPTX71-NEXT:    atom.global.cas.b16 %rs9, [%r2], %rs20, %rs16;
+; CHECKPTX71-NEXT:    setp.ne.s16 %p3, %rs9, %rs20;
+; CHECKPTX71-NEXT:    mov.u16 %rs20, %rs9;
 ; CHECKPTX71-NEXT:    @%p3 bra $L__BB0_5;
 ; CHECKPTX71-NEXT:  // %bb.6: // %atomicrmw.end1
-; CHECKPTX71-NEXT:    ld.shared.b16 %rs33, [%r3];
+; CHECKPTX71-NEXT:    ld.shared.b16 %rs21, [%r3];
 ; CHECKPTX71-NEXT:  $L__BB0_7: // %atomicrmw.start
 ; CHECKPTX71-NEXT:    // =>This Inner Loop Header: Depth=1
-; CHECKPTX71-NEXT:    cvt.f32.bf16 %f10, %rs33;
+; CHECKPTX71-NEXT:    cvt.f32.bf16 %f10, %rs21;
 ; CHECKPTX71-NEXT:    add.rn.f32 %f11, %f10, %f1;
-; CHECKPTX71-NEXT:    cvt.rn.bf16.f32 %rs26, %f11;
-; CHECKPTX71-NEXT:    atom.shared.cas.b16 %rs29, [%r3], %rs33, %rs26;
-; CHECKPTX71-NEXT:    setp.ne.s16 %p4, %rs29, %rs33;
-; CHECKPTX71-NEXT:    mov.u16 %rs33, %rs29;
+; CHECKPTX71-NEXT:    cvt.rn.bf16.f32 %rs17, %f11;
+; CHECKPTX71-NEXT:    atom.shared.cas.b16 %rs12, [%r3], %rs21, %rs17;
+; CHECKPTX71-NEXT:    setp.ne.s16 %p4, %rs12, %rs21;
+; CHECKPTX71-NEXT:    mov.u16 %rs21, %rs12;
 ; CHECKPTX71-NEXT:    @%p4 bra $L__BB0_7;
 ; CHECKPTX71-NEXT:  // %bb.8: // %atomicrmw.end
 ; CHECKPTX71-NEXT:    ret;
diff --git a/llvm/test/CodeGen/NVPTX/bf16-instructions.ll b/llvm/test/CodeGen/NVPTX/bf16-instructions.ll
index 08ed317ef93007..6828bac18cad7f 100644
--- a/llvm/test/CodeGen/NVPTX/bf16-instructions.ll
+++ b/llvm/test/CodeGen/NVPTX/bf16-instructions.ll
@@ -17,7 +17,7 @@ define bfloat @test_fadd(bfloat %0, bfloat %1) {
 ; SM70-LABEL: test_fadd(
 ; SM70:       {
 ; SM70-NEXT:    .reg .pred %p<2>;
-; SM70-NEXT:    .reg .b16 %rs<3>;
+; SM70-NEXT:    .reg .b16 %rs<2>;
 ; SM70-NEXT:    .reg .b32 %r<11>;
 ; SM70-NEXT:    .reg .f32 %f<4>;
 ; SM70-EMPTY:
@@ -88,7 +88,7 @@ define bfloat @test_fsub(bfloat %0, bfloat %1) {
 ; SM70-LABEL: test_fsub(
 ; SM70:       {
 ; SM70-NEXT:    .reg .pred %p<2>;
-; SM70-NEXT:    .reg .b16 %rs<3>;
+; SM70-NEXT:    .reg .b16 %rs<2>;
 ; SM70-NEXT:    .reg .b32 %r<11>;
 ; SM70-NEXT:    .reg .f32 %f<4>;
 ; SM70-EMPTY:
@@ -159,8 +159,8 @@ define <2 x bfloat> @test_faddx2(<2 x bfloat> %a, <2 x bfloat> %b) #0 {
 ; SM70-LABEL: test_faddx2(
 ; SM70:       {
 ; SM70-NEXT:    .reg .pred %p<3>;
-; SM70-NEXT:    .reg .b16 %rs<9>;
-; SM70-NEXT:    .reg .b32 %r<25>;
+; SM70-NEXT:    .reg .b16 %rs<5>;
+; SM70-NEXT:    .reg .b32 %r<24>;
 ; SM70-NEXT:    .reg .f32 %f<7>;
 ; SM70-EMPTY:
 ; SM70-NEXT:  // %bb.0:
@@ -170,8 +170,8 @@ define <2 x bfloat> @test_faddx2(<2 x bfloat> %a, <2 x bfloat> %b) #0 {
 ; SM70-NEXT:    cvt.u32.u16 %r3, %rs2;
 ; SM70-NEXT:    shl.b32 %r4, %r3, 16;
 ; SM70-NEXT:    mov.b32 %f1, %r4;
-; SM70-NEXT:    mov.b32 {%rs4, %rs5}, %r1;
-; SM70-NEXT:    cvt.u32.u16 %r5, %rs5;
+; SM70-NEXT:    mov.b32 {%rs3, %rs4}, %r1;
+; SM70-NEXT:    cvt.u32.u16 %r5, %rs4;
 ; SM70-NEXT:    shl.b32 %r6, %r5, 16;
 ; SM70-NEXT:    mov.b32 %f2, %r6;
 ; SM70-NEXT:    add.rn.f32 %f3, %f2, %f1;
@@ -185,7 +185,7 @@ define <2 x bfloat> @test_faddx2(<2 x bfloat> %a, <2 x bfloat> %b) #0 {
 ; SM70-NEXT:    cvt.u32.u16 %r13, %rs1;
 ; SM70-NEXT:    shl.b32 %r14, %r13, 16;
 ; SM70-NEXT:    mov.b32 %f4, %r14;
-; SM70-NEXT:    cvt.u32.u16 %r15, %rs4;
+; SM70-NEXT:    cvt.u32.u16 %r15, %rs3;
 ; SM70-NEXT:    shl.b32 %r16, %r15, 16;
 ; SM70-NEXT:    mov.b32 %f5, %r16;
 ; SM70-NEXT:    add.rn.f32 %f6, %f5, %f4;
@@ -260,8 +260,8 @@ define <2 x bfloat> @test_fsubx2(<2 x bfloat> %a, <2 x bfloat> %b) #0 {
 ; SM70-LABEL: test_fsubx2(
 ; SM70:       {
 ; SM70-NEXT:    .reg .pred %p<3>;
-; SM70-NEXT:    .reg .b16 %rs<9>;
-; SM70-NEXT:    .reg .b32 %r<25>;
+; SM70-NEXT:    .reg .b16 %rs<5>;
+; SM70-NEXT:    .reg .b32 %r<24>;
 ; SM70-NEXT:    .reg .f32 %f<7>;
 ; SM70-EMPTY:
 ; SM70-NEXT:  // %bb.0:
@@ -271,8 +271,8 @@ define <2 x bfloat> @test_fsubx2(<2 x bfloat> %a, <2 x bfloat> %b) #0 {
 ; SM70-NEXT:    cvt.u32.u16 %r3, %rs2;
 ; SM70-NEXT:    shl.b32 %r4, %r3, 16;
 ; SM70-NEXT:    mov.b32 %f1, %r4;
-; SM70-NEXT:    mov.b32 {%rs4, %rs5}, %r1;
-; SM70-NEXT:    cvt.u32.u16 %r5, %rs5;
+; SM70-NEXT:    mov.b32 {%rs3, %rs4}, %r1;
+; SM70-NEXT:    cvt.u32.u16 %r5, %rs4;
 ; SM70-NEXT:    shl.b32 %r6, %r5, 16;
 ; SM70-NEXT:    mov.b32 %f2, %r6;
 ; SM70-NEXT:    sub.rn.f32 %f3, %f2, %f1;
@@ -286,7 +286,7 @@ define <2 x bfloat> @test_fsubx2(<2 x bfloat> %a, <2 x bfloat> %b) #0 {
 ; SM70-NEXT:    cvt.u32.u16 %r13, %rs1;
 ; SM70-NEXT:    shl.b32 %r14, %r13, 16;
 ; SM70-NEXT:    mov.b32 %f4, %r14;
-; SM70-NEXT:    cvt.u32.u16 %r15, %rs4;
+; SM70-NEXT:    cvt.u32.u16 %r15, %rs3;
 ; SM70-NEXT:    shl.b32 %r16, %r15, 16;
 ; SM70-NEXT:    mov.b32 %f5, %r16;
 ; SM70-NEXT:    sub.rn.f32 %f6, %f5, %f4;
@@ -361,8 +361,8 @@ define <2 x bfloat> @test_fmulx2(<2 x bfloat> %a, <2 x bfloat> %b) #0 {
 ; SM70-LABEL: test_fmulx2(
 ; SM70:       {
 ; SM70-NEXT:    .reg .pred %p<3>;
-; SM70-NEXT:    .reg .b16 %rs<9>;
-; SM70-NEXT:    .reg .b32 %r<25>;
+; SM70-NEXT:    .reg .b16 %rs<5>;
+; SM70-NEXT:    .reg .b32 %r<24>;
 ; SM70-NEXT:    .reg .f32 %f<7>;
 ; SM70-EMPTY:
 ; SM70-NEXT:  // %bb.0:
@@ -372,8 +372,8 @@ define <2 x bfloat> @test_fmulx2(<2 x bfloat> %a, <2 x bfloat> %b) #0 {
 ; SM70-NEXT:    cvt.u32.u16 %r3, %rs2;
 ; SM70-NEXT:    shl.b32 %r4, %r3, 16;
 ; SM70-NEXT:    mov.b32 %f1, %r4;
-; SM70-NEXT:    mov.b32 {%rs4, %rs5}, %r1;
-; SM70-NEXT:    cvt.u32.u16 %r5, %rs5;
+; SM70-NEXT:    mov.b32 {%rs3, %rs4}, %r1;
+; SM70-NEXT:    cvt.u32.u16 %r5, %rs4;
 ; SM70-NEXT:    shl.b32 %r6, %r5, 16;
 ; SM70-NEXT:    mov.b32 %f2, %r6;
 ; SM70-NEXT:    mul.rn.f32 %f3, %f2, %f1;
@@ -387,7 +387,7 @@ define <2 x bfloat> @test_fmulx2(<2 x bfloat> %a, <2 x bfloat> %b) #0 {
 ; SM70-NEXT:    cvt.u32.u16 %r13, %rs1;
 ; SM70-NEXT:    shl.b32 %r14, %r13, 16;
 ; SM70-NEXT:    mov.b32 %f4, %r14;
-; SM70-NEXT:    cvt.u32.u16 %r15, %rs4;
+; SM70-NEXT:    cvt.u32.u16 %r15, %rs3;
 ; SM70-NEXT:    shl.b32 %r16, %r15, 16;
 ; SM70-NEXT:    mov.b32 %f5, %r16;
 ; SM70-NEXT:    mul.rn.f32 %f6, %f5, %f4;
@@ -462,8 +462,8 @@ define <2 x bfloat> @test_fdiv(<2 x bfloat> %a, <2 x bfloat> %b) #0 {
 ; SM70-LABEL: test_fdiv(
 ; SM70:       {
 ; SM70-NEXT:    .reg .pred %p<3>;
-; SM70-NEXT:    .reg .b16 %rs<9>;
-; SM70-NEXT:    .reg .b32 %r<25>;
+; SM70-NEXT:    .reg .b16 %rs<5>;
+; SM70-NEXT:    .reg .b32 %r<24>;
 ; SM70-NEXT:    .reg .f32 %f<7>;
 ; SM70-EMPTY:
 ; SM70-NEXT:  // %bb.0:
@@ -473,8 +473,8 @@ define <2 x bfloat> @test_fdiv(<2 x bfloat> %a, <2 x bfloat> %b) #0 {
 ; SM70-NEXT:    cvt.u32.u16 %r3, %rs2;
 ; SM70-NEXT:    shl.b32 %r4, %r3, 16;
 ; SM70-NEXT:    mov.b32 %f1, %r4;
-; SM70-NEXT:    mov.b32 {%rs4, %rs5}, %r1;
-; SM70-NEXT:    cvt.u32.u16 %r5, %rs5;
+; SM70-NEXT:    mov.b32 {%rs3, %rs4}, %r1;
+; SM70-NEXT:    cvt.u32.u16 %r5, %rs4;
 ; SM70-NEXT:    shl.b32 %r6, %r5, 16;
 ; SM70-NEXT:    mov.b32 %f2, %r6;
 ; SM70-NEXT:    div.rn.f32 %f3, %f2, %f1;
@@ -488,7 +488,7 @@ define <2 x bfloat> @test_fdiv(<2 x bfloat> %a, <2 x bfloat> %b) #0 {
 ; SM70-NEXT:    cvt.u32.u16 %r13, %rs1;
 ; SM70-NEXT:    shl.b32 %r14, %r13, 16;
 ; SM70-NEXT:    mov.b32 %f4, %r14;
-; SM70-NEXT:    cvt.u32.u16 %r15, %rs4;
+; SM70-NEXT:    cvt.u32.u16 %r15, %rs3;
 ; SM70-NEXT:    shl.b32 %r16, %r15, 16;
 ; SM70-NEXT:    mov.b32 %f5, %r16;
 ; SM70-NEXT:    div.rn.f32 %f6, %f5, %f4;
@@ -648,7 +648,7 @@ define bfloat @test_fptrunc_float(float %a) #0 {
 ; SM70-LABEL: test_fptrunc_float(
 ; SM70:       {
 ; SM70-NEXT:    .reg .pred %p<2>;
-; SM70-NEXT:    .reg .b16 %rs<3>;
+; SM70-NEXT:    .reg .b16 %rs<2>;
 ; SM70-NEXT:    .reg .b32 %r<7>;
 ; SM70-NEXT:    .reg .f32 %f<2>;
 ; SM70-EMPTY:
@@ -705,7 +705,7 @@ define bfloat @test_fadd_imm_1(bfloat %a) #0 {
 ; SM70-LABEL: test_fadd_imm_1(
 ; SM70:       {
 ; SM70-NEXT:    .reg .pred %p<2>;
-; SM70-NEXT:    .reg .b16 %rs<3>;
+; SM70-NEXT:    .reg .b16 %rs<2>;
 ; SM70-NEXT:    .reg .b32 %r<9>;
 ; SM70-NEXT:    .reg .f32 %f<3>;
 ; SM70-EMPTY:
@@ -789,7 +789,7 @@ define bfloat @test_select_cc_bf16_f64(double %a, double %b, bfloat %c, bfloat %
 define <8 x float> @test_extload_bf16x8(ptr addrspace(3) noundef %arg) #0 {
 ; SM70-LABEL: test_extload_bf16x8(
 ; SM70:       {
-; SM70-NEXT:    .reg .b16 %rs<17>;
+; SM70-NEXT:    .reg .b16 %rs<9>;
 ; SM70-NEXT:    .reg .b32 %r<21>;
 ; SM70-NEXT:    .reg .f32 %f<9>;
 ; SM70-NEXT:    .reg .b64 %rd<2>;
@@ -1033,7 +1033,7 @@ define bfloat @test_sitofp_i16(i16 %a) {
 ; SM70-LABEL: test_sitofp_i16(
 ; SM70:       {
 ; SM70-NEXT:    .reg .pred %p<2>;
-; SM70-NEXT:    .reg .b16 %rs<4>;
+; SM70-NEXT:    .reg .b16 %rs<3>;
 ; SM70-NEXT:    .reg .b32 %r<7>;
 ; SM70-NEXT:    .reg .f32 %f<2>;
 ; SM70-EMPTY:
@@ -1092,7 +1092,7 @@ define bfloat @test_uitofp_i8(i8 %a) {
 ; SM70-LABEL: test_uitofp_i8(
 ; SM70:       {
 ; SM70-NEXT:    .reg .pred %p<2>;
-; SM70-NEXT:    .reg .b16 %rs<4>;
+; SM70-NEXT:    .reg .b16 %rs<3>;
 ; SM70-NEXT:    .reg .b32 %r<7>;
 ; SM70-NEXT:    .reg .f32 %f<2>;
 ; SM70-EMPTY:
@@ -1151,7 +1151,7 @@ define bfloat @test_uitofp_i1(i1 %a) {
 ; SM70-LABEL: test_uitofp_i1(
 ; SM70:       {
 ; SM70-NEXT:    .reg .pred %p<3>;
-; SM70-NEXT:    .reg .b16 %rs<5>;
+; SM70-NEXT:    .reg .b16 %rs<4>;
 ; SM70-NEXT:    .reg .b32 %r<8>;
 ; SM70-NEXT:    .reg .f32 %f<2>;
 ; SM70-EMPTY:
@@ -1228,7 +1228,7 @@ define bfloat @test_uitofp_i16(i16 %a) {
 ; SM70-LABEL: test_uitofp_i16(
 ; SM70:       {
 ; SM70-NEXT:    .reg .pred %p<2>;
-; SM70-NEXT:    .reg .b16 %rs<4>;
+; SM70-NEXT:    .reg .b16 %rs<3>;
 ; SM70-NEXT:    .reg .b32 %r<7>;
 ; SM70-NEXT:    .reg .f32 %f<2>;
 ; SM70-EMPTY:
@@ -1287,7 +1287,7 @@ define bfloat @test_uitofp_i32(i32 %a) {
 ; SM70-LABEL: test_uitofp_i32(
 ; SM70:       {
 ; SM70-NEXT:    .reg .pred %p<2>;
-; SM70-NEXT:    .reg .b16 %rs<3>;
+; SM70-NEXT:    .reg .b16 %rs<2>;
 ; SM70-NEXT:    .reg .b32 %r<8>;
 ; SM70-NEXT:    .reg .f32 %f<2>;
 ; SM70-EMPTY:
@@ -1349,7 +1349,7 @@ define bfloat @test_uitofp_i64(i64 %a) {
 ; SM70-LABEL: test_uitofp_i64(
 ; SM70:       ...
[truncated]

@justinfargnoli (Contributor) left a comment

These nodes are all erased by a later pass

What pass does this?

@justinfargnoli (Contributor) left a comment

LGTM :)

@AlexMaclean (Member, Author)

These nodes are all erased by a later pass

What pass does this?

NVPTXProxyRegErasure
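
For illustration only, a toy sketch of what a proxy-register-erasure step does (plain Python over an invented tuple "IR" — the real NVPTXProxyRegErasure pass operates on MachineInstrs in the NVPTX backend): rewrite every use of an identity copy to its underlying source, then drop the copy.

```python
# Toy model of proxy-register erasure. The tuple IR here is made up
# purely to show the idea; it is not LLVM's representation.

def erase_proxy_copies(instrs):
    """instrs: list of (dst, opcode, srcs). A 'proxy' op is an identity copy.

    Returns the program with every proxy removed and its uses rewritten
    to the copy's ultimate source (chains of proxies are followed)."""
    replacement = {}
    for dst, op, srcs in instrs:
        if op == "proxy":
            src = srcs[0]
            replacement[dst] = replacement.get(src, src)
    out = []
    for dst, op, srcs in instrs:
        if op == "proxy":
            continue  # the copy itself is erased
        out.append((dst, op, [replacement.get(s, s) for s in srcs]))
    return out
```

The point of this PR is that for bitconvert-style lowerings the proxy never needed to exist, so there is nothing for the later pass to clean up.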

@Artem-B (Member) left a comment

Nice.
LGTM.

Comment on lines +165 to +166
; CHECK-NEXT: { .reg .b32 tmp; mov.b64 {tmp, %r1}, %rd1; }
; CHECK-NEXT: cvt.u32.u64 %r2, %rd1;
Member

Future improvement opportunity: Coalesce partial mov and cvt into mov.b64 {%r2, %r1}, %rd1.
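
As a hedged, string-level sketch of that suggested peephole (nothing like how it would actually be implemented in the backend): when a 64-to-2x32 unpack discards its low half into a scratch register and the next instruction truncates the same 64-bit source, the truncation result can take the unpack's low lane.

```python
import re

# Illustrative only: coalesce
#   { .reg .b32 tmp; mov.b64 {tmp, %rN}, %rdM; }
#   cvt.u32.u64 %rK, %rdM;
# into
#   mov.b64 {%rK, %rN}, %rdM;
# A real implementation would be a MachineInstr peephole, not text matching.

UNPACK = re.compile(r"\{ \.reg \.b32 tmp; mov\.b64 \{tmp, (%r\d+)\}, (%rd\d+); \}")

def coalesce_unpack_cvt(lines):
    out, i = [], 0
    while i < len(lines):
        m = UNPACK.fullmatch(lines[i].strip())
        if m and i + 1 < len(lines):
            hi, src = m.groups()
            m2 = re.fullmatch(r"cvt\.u32\.u64 (%r\d+), " + re.escape(src) + ";",
                              lines[i + 1].strip())
            if m2:
                # cvt.u32.u64 keeps the low 32 bits, i.e. the unpack's
                # first lane, so both results come from one mov.b64.
                out.append("mov.b64 {%s, %s}, %s;" % (m2.group(1), hi, src))
                i += 2
                continue
        out.append(lines[i])
        i += 1
    return out
```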

@AlexMaclean (Member, Author)

Absolutely! I've been thinking about this for a while.

@AlexMaclean (Member, Author)

@Artem-B, @justinfargnoli heads up, I've done some further cleanup in 2756e29 while addressing Justin's request. If you have a minute please take another look.

Comment on lines -215 to -216
// There's no way to specify FP16 and BF16 immediates in .(b)f16 ops, so we
// have to load them into an .(b)f16 register first.
Contributor

We're deleting this code because the comment is false, right? Or am I misunderstanding something.

Contributor

If the source code comment is false, then the update LGTM.

@AlexMaclean (Member, Author)

The comment is sort of true, but it doesn't justify keeping things the way they were. It is simpler and cleaner to replace these with normally named MOV instructions and to use TableGen to generate the ISel logic, the same way we handle any other instruction that does not support an immediate operand.

Member

I'm not sure it's been fixed in ptxas: https://godbolt.org/z/d8EcMevc8

If anything, recent versions seem to be more restrictive than the older ones.

@Artem-B (Member), Dec 19, 2024

I'm fine with removing this code as long as we can still handle constants correctly. This should already be covered by the original tests in llvm/test/CodeGen/NVPTX/f16-instructions.ll

If LLVM tests are happy after the run with ptxas enabled, we should be fine.

@AlexMaclean AlexMaclean merged commit 310e798 into llvm:main Dec 19, 2024
8 checks passed