[NVPTX] cleanup & canonicalize mov
#129344
Conversation
@llvm/pr-subscribers-debuginfo @llvm/pr-subscribers-backend-nvptx

Author: Justin Fargnoli (justinfargnoli)

Changes: Use a `multiclass` to define `mov` and canonicalize the `mov` instruction to always use the `b<bit-size>` suffix.

Patch is 24.07 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/129344.diff

11 Files Affected:
diff --git a/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td b/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
index f94d7099f1b0e..b967bb6b1dd13 100644
--- a/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
+++ b/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
@@ -1961,50 +1961,27 @@ let hasSideEffects = false in {
// copyPhysreg is hard-coded in NVPTXInstrInfo.cpp
-let hasSideEffects=0, isAsCheapAsAMove=1 in {
- def IMOV1rr : NVPTXInst<(outs Int1Regs:$dst), (ins Int1Regs:$sss),
- "mov.pred \t$dst, $sss;", []>;
- def IMOV16rr : NVPTXInst<(outs Int16Regs:$dst), (ins Int16Regs:$sss),
- "mov.u16 \t$dst, $sss;", []>;
- def IMOV32rr : NVPTXInst<(outs Int32Regs:$dst), (ins Int32Regs:$sss),
- "mov.u32 \t$dst, $sss;", []>;
- def IMOV64rr : NVPTXInst<(outs Int64Regs:$dst), (ins Int64Regs:$sss),
- "mov.u64 \t$dst, $sss;", []>;
- def IMOV128rr : NVPTXInst<(outs Int128Regs:$dst), (ins Int128Regs:$sss),
- "mov.b128 \t$dst, $sss;", []>;
-
- def FMOV32rr : NVPTXInst<(outs Float32Regs:$dst), (ins Float32Regs:$src),
- "mov.f32 \t$dst, $src;", []>;
- def FMOV64rr : NVPTXInst<(outs Float64Regs:$dst), (ins Float64Regs:$src),
- "mov.f64 \t$dst, $src;", []>;
-
- def IMOV1ri : NVPTXInst<(outs Int1Regs:$dst), (ins i1imm:$src),
- "mov.pred \t$dst, $src;",
- [(set i1:$dst, imm:$src)]>;
- def IMOV16ri : NVPTXInst<(outs Int16Regs:$dst), (ins i16imm:$src),
- "mov.b16 \t$dst, $src;",
- [(set i16:$dst, imm:$src)]>;
- def IMOV32ri : NVPTXInst<(outs Int32Regs:$dst), (ins i32imm:$src),
- "mov.b32 \t$dst, $src;",
- [(set i32:$dst, imm:$src)]>;
- def IMOV64ri : NVPTXInst<(outs Int64Regs:$dst), (ins i64imm:$src),
- "mov.b64 \t$dst, $src;",
- [(set i64:$dst, imm:$src)]>;
-
- def FMOV16ri : NVPTXInst<(outs Int16Regs:$dst), (ins f16imm:$src),
- "mov.b16 \t$dst, $src;",
- [(set f16:$dst, fpimm:$src)]>;
- def BFMOV16ri : NVPTXInst<(outs Int16Regs:$dst), (ins bf16imm:$src),
- "mov.b16 \t$dst, $src;",
- [(set bf16:$dst, fpimm:$src)]>;
- def FMOV32ri : NVPTXInst<(outs Float32Regs:$dst), (ins f32imm:$src),
- "mov.f32 \t$dst, $src;",
- [(set f32:$dst, fpimm:$src)]>;
- def FMOV64ri : NVPTXInst<(outs Float64Regs:$dst), (ins f64imm:$src),
- "mov.f64 \t$dst, $src;",
- [(set f64:$dst, fpimm:$src)]>;
+let hasSideEffects = false, isAsCheapAsAMove = true in {
+ multiclass MOV<RegisterClass RC, string OpStr, ValueType VT, Operand IMMType, SDNode ImmNode> {
+ def rr : NVPTXInst<(outs RC:$dst), (ins RC:$src),
+ "mov." # OpStr # " \t$dst, $src;", []>;
+ def ri : NVPTXInst<(outs RC:$dst), (ins IMMType:$src),
+ "mov." # OpStr # " \t$dst, $src;",
+ [(set VT:$dst, ImmNode:$src)]>;
+ }
}
+defm IMOV1 : MOV<Int1Regs, "pred", i1, i1imm, imm>;
+defm IMOV16 : MOV<Int16Regs, "b16", i16, i16imm, imm>;
+defm IMOV32 : MOV<Int32Regs, "b32", i32, i32imm, imm>;
+defm IMOV64 : MOV<Int64Regs, "b64", i64, i64imm, imm>;
+def IMOV128rr : NVPTXInst<(outs Int128Regs:$dst), (ins Int128Regs:$src),
+ "mov.b128 \t$dst, $src;", []>;
+defm FMOV16 : MOV<Int16Regs, "b16", f16, f16imm, fpimm>;
+defm BFMOV16 : MOV<Int16Regs, "b16", bf16, bf16imm, fpimm>;
+defm FMOV32 : MOV<Float32Regs, "b32", f32, f32imm, fpimm>;
+defm FMOV64 : MOV<Float64Regs, "b64", f64, f64imm, fpimm>;
+
def : Pat<(i32 (Wrapper texternalsym:$dst)), (IMOV32ri texternalsym:$dst)>;
def : Pat<(i64 (Wrapper texternalsym:$dst)), (IMOV64ri texternalsym:$dst)>;
diff --git a/llvm/test/CodeGen/NVPTX/atomics-sm70.ll b/llvm/test/CodeGen/NVPTX/atomics-sm70.ll
index b180928af82a4..b14295020bc0e 100644
--- a/llvm/test/CodeGen/NVPTX/atomics-sm70.ll
+++ b/llvm/test/CodeGen/NVPTX/atomics-sm70.ll
@@ -72,7 +72,7 @@ define void @test(ptr %dp0, ptr addrspace(1) %dp1, ptr addrspace(3) %dp3, half %
; CHECKPTX62-NEXT: or.b32 %r32, %r31, %r30;
; CHECKPTX62-NEXT: atom.cas.b32 %r6, [%r1], %r54, %r32;
; CHECKPTX62-NEXT: setp.ne.s32 %p1, %r6, %r54;
-; CHECKPTX62-NEXT: mov.u32 %r54, %r6;
+; CHECKPTX62-NEXT: mov.b32 %r54, %r6;
; CHECKPTX62-NEXT: @%p1 bra $L__BB0_1;
; CHECKPTX62-NEXT: // %bb.2: // %atomicrmw.end44
; CHECKPTX62-NEXT: ld.u32 %r55, [%r1];
@@ -88,7 +88,7 @@ define void @test(ptr %dp0, ptr addrspace(1) %dp1, ptr addrspace(3) %dp3, half %
; CHECKPTX62-NEXT: or.b32 %r37, %r36, %r35;
; CHECKPTX62-NEXT: atom.cas.b32 %r9, [%r1], %r55, %r37;
; CHECKPTX62-NEXT: setp.ne.s32 %p2, %r9, %r55;
-; CHECKPTX62-NEXT: mov.u32 %r55, %r9;
+; CHECKPTX62-NEXT: mov.b32 %r55, %r9;
; CHECKPTX62-NEXT: @%p2 bra $L__BB0_3;
; CHECKPTX62-NEXT: // %bb.4: // %atomicrmw.end26
; CHECKPTX62-NEXT: and.b32 %r10, %r22, -4;
@@ -109,7 +109,7 @@ define void @test(ptr %dp0, ptr addrspace(1) %dp1, ptr addrspace(3) %dp3, half %
; CHECKPTX62-NEXT: or.b32 %r45, %r44, %r43;
; CHECKPTX62-NEXT: atom.global.cas.b32 %r15, [%r10], %r56, %r45;
; CHECKPTX62-NEXT: setp.ne.s32 %p3, %r15, %r56;
-; CHECKPTX62-NEXT: mov.u32 %r56, %r15;
+; CHECKPTX62-NEXT: mov.b32 %r56, %r15;
; CHECKPTX62-NEXT: @%p3 bra $L__BB0_5;
; CHECKPTX62-NEXT: // %bb.6: // %atomicrmw.end8
; CHECKPTX62-NEXT: and.b32 %r16, %r23, -4;
@@ -130,7 +130,7 @@ define void @test(ptr %dp0, ptr addrspace(1) %dp1, ptr addrspace(3) %dp3, half %
; CHECKPTX62-NEXT: or.b32 %r53, %r52, %r51;
; CHECKPTX62-NEXT: atom.shared.cas.b32 %r21, [%r16], %r57, %r53;
; CHECKPTX62-NEXT: setp.ne.s32 %p4, %r21, %r57;
-; CHECKPTX62-NEXT: mov.u32 %r57, %r21;
+; CHECKPTX62-NEXT: mov.b32 %r57, %r21;
; CHECKPTX62-NEXT: @%p4 bra $L__BB0_7;
; CHECKPTX62-NEXT: // %bb.8: // %atomicrmw.end
; CHECKPTX62-NEXT: ret;
diff --git a/llvm/test/CodeGen/NVPTX/atomics-sm90.ll b/llvm/test/CodeGen/NVPTX/atomics-sm90.ll
index 9027bd6a14780..f27e574724ce4 100644
--- a/llvm/test/CodeGen/NVPTX/atomics-sm90.ll
+++ b/llvm/test/CodeGen/NVPTX/atomics-sm90.ll
@@ -73,7 +73,7 @@ define void @test(ptr %dp0, ptr addrspace(1) %dp1, ptr addrspace(3) %dp3, bfloat
; CHECKPTX71-NEXT: or.b32 %r32, %r31, %r30;
; CHECKPTX71-NEXT: atom.relaxed.cas.b32 %r6, [%r1], %r54, %r32;
; CHECKPTX71-NEXT: setp.ne.s32 %p1, %r6, %r54;
-; CHECKPTX71-NEXT: mov.u32 %r54, %r6;
+; CHECKPTX71-NEXT: mov.b32 %r54, %r6;
; CHECKPTX71-NEXT: @%p1 bra $L__BB0_1;
; CHECKPTX71-NEXT: // %bb.2: // %atomicrmw.end44
; CHECKPTX71-NEXT: ld.u32 %r55, [%r1];
@@ -89,7 +89,7 @@ define void @test(ptr %dp0, ptr addrspace(1) %dp1, ptr addrspace(3) %dp3, bfloat
; CHECKPTX71-NEXT: or.b32 %r37, %r36, %r35;
; CHECKPTX71-NEXT: atom.relaxed.cas.b32 %r9, [%r1], %r55, %r37;
; CHECKPTX71-NEXT: setp.ne.s32 %p2, %r9, %r55;
-; CHECKPTX71-NEXT: mov.u32 %r55, %r9;
+; CHECKPTX71-NEXT: mov.b32 %r55, %r9;
; CHECKPTX71-NEXT: @%p2 bra $L__BB0_3;
; CHECKPTX71-NEXT: // %bb.4: // %atomicrmw.end26
; CHECKPTX71-NEXT: and.b32 %r10, %r22, -4;
@@ -111,7 +111,7 @@ define void @test(ptr %dp0, ptr addrspace(1) %dp1, ptr addrspace(3) %dp3, bfloat
; CHECKPTX71-NEXT: or.b32 %r45, %r44, %r43;
; CHECKPTX71-NEXT: atom.relaxed.global.cas.b32 %r15, [%r10], %r56, %r45;
; CHECKPTX71-NEXT: setp.ne.s32 %p3, %r15, %r56;
-; CHECKPTX71-NEXT: mov.u32 %r56, %r15;
+; CHECKPTX71-NEXT: mov.b32 %r56, %r15;
; CHECKPTX71-NEXT: @%p3 bra $L__BB0_5;
; CHECKPTX71-NEXT: // %bb.6: // %atomicrmw.end8
; CHECKPTX71-NEXT: and.b32 %r16, %r23, -4;
@@ -133,7 +133,7 @@ define void @test(ptr %dp0, ptr addrspace(1) %dp1, ptr addrspace(3) %dp3, bfloat
; CHECKPTX71-NEXT: or.b32 %r53, %r52, %r51;
; CHECKPTX71-NEXT: atom.relaxed.shared.cas.b32 %r21, [%r16], %r57, %r53;
; CHECKPTX71-NEXT: setp.ne.s32 %p4, %r21, %r57;
-; CHECKPTX71-NEXT: mov.u32 %r57, %r21;
+; CHECKPTX71-NEXT: mov.b32 %r57, %r21;
; CHECKPTX71-NEXT: @%p4 bra $L__BB0_7;
; CHECKPTX71-NEXT: // %bb.8: // %atomicrmw.end
; CHECKPTX71-NEXT: ret;
diff --git a/llvm/test/CodeGen/NVPTX/cmpxchg.ll b/llvm/test/CodeGen/NVPTX/cmpxchg.ll
index aaea0d2ee25ef..fd721a1bb0371 100644
--- a/llvm/test/CodeGen/NVPTX/cmpxchg.ll
+++ b/llvm/test/CodeGen/NVPTX/cmpxchg.ll
@@ -1068,12 +1068,11 @@ define i16 @acq_rel_sys_i16(ptr %addr, i16 %cmp, i16 %new) {
; SM30-NEXT: // %bb.2: // %partword.cmpxchg.failure
; SM30-NEXT: // in Loop: Header=BB8_1 Depth=1
; SM30-NEXT: and.b32 %r8, %r7, %r2;
-; SM30-NEXT: setp.ne.s32 %p2, %r19, %r8;
-; SM30-NEXT: mov.u32 %r19, %r8;
-; SM30-NEXT: @%p2 bra $L__BB8_1;
-; SM30-NEXT: $L__BB8_3: // %partword.cmpxchg.end
-; SM30-NEXT: membar.sys;
-; SM30-NEXT: st.param.b32 [func_retval0], %r14;
+; SM30-NEXT: setp.ne.s32 %p2, %r20, %r8;
+; SM30-NEXT: mov.b32 %r20, %r8;
+; SM30-NEXT: @%p2 bra $L__BB0_1;
+; SM30-NEXT: $L__BB0_3: // %partword.cmpxchg.end
+; SM30-NEXT: st.param.b32 [func_retval0], %r13;
; SM30-NEXT: ret;
;
; SM70-LABEL: acq_rel_sys_i16(
@@ -1110,12 +1109,11 @@ define i16 @acq_rel_sys_i16(ptr %addr, i16 %cmp, i16 %new) {
; SM70-NEXT: // %bb.2: // %partword.cmpxchg.failure
; SM70-NEXT: // in Loop: Header=BB8_1 Depth=1
; SM70-NEXT: and.b32 %r8, %r7, %r2;
-; SM70-NEXT: setp.ne.s32 %p2, %r19, %r8;
-; SM70-NEXT: mov.u32 %r19, %r8;
-; SM70-NEXT: @%p2 bra $L__BB8_1;
-; SM70-NEXT: $L__BB8_3: // %partword.cmpxchg.end
-; SM70-NEXT: fence.acq_rel.sys;
-; SM70-NEXT: st.param.b32 [func_retval0], %r14;
+; SM70-NEXT: setp.ne.s32 %p2, %r20, %r8;
+; SM70-NEXT: mov.b32 %r20, %r8;
+; SM70-NEXT: @%p2 bra $L__BB0_1;
+; SM70-NEXT: $L__BB0_3: // %partword.cmpxchg.end
+; SM70-NEXT: st.param.b32 [func_retval0], %r13;
; SM70-NEXT: ret;
; SM90-LABEL: acq_rel_sys_i16(
; SM90: {
@@ -1199,10 +1197,9 @@ define i16 @seq_cst_sys_i16(ptr %addr, i16 %cmp, i16 %new) {
; SM30-NEXT: // in Loop: Header=BB9_1 Depth=1
; SM30-NEXT: and.b32 %r8, %r7, %r2;
; SM30-NEXT: setp.ne.s32 %p2, %r19, %r8;
-; SM30-NEXT: mov.u32 %r19, %r8;
-; SM30-NEXT: @%p2 bra $L__BB9_1;
-; SM30-NEXT: $L__BB9_3: // %partword.cmpxchg.end
-; SM30-NEXT: membar.sys;
+; SM30-NEXT: mov.b32 %r19, %r8;
+; SM30-NEXT: @%p2 bra $L__BB1_1;
+; SM30-NEXT: $L__BB1_3: // %partword.cmpxchg.end
; SM30-NEXT: st.param.b32 [func_retval0], %r14;
; SM30-NEXT: ret;
;
@@ -1241,10 +1238,9 @@ define i16 @seq_cst_sys_i16(ptr %addr, i16 %cmp, i16 %new) {
; SM70-NEXT: // in Loop: Header=BB9_1 Depth=1
; SM70-NEXT: and.b32 %r8, %r7, %r2;
; SM70-NEXT: setp.ne.s32 %p2, %r19, %r8;
-; SM70-NEXT: mov.u32 %r19, %r8;
-; SM70-NEXT: @%p2 bra $L__BB9_1;
-; SM70-NEXT: $L__BB9_3: // %partword.cmpxchg.end
-; SM70-NEXT: fence.acq_rel.sys;
+; SM70-NEXT: mov.b32 %r19, %r8;
+; SM70-NEXT: @%p2 bra $L__BB1_1;
+; SM70-NEXT: $L__BB1_3: // %partword.cmpxchg.end
; SM70-NEXT: st.param.b32 [func_retval0], %r14;
; SM70-NEXT: ret;
; SM90-LABEL: seq_cst_sys_i16(
diff --git a/llvm/test/CodeGen/NVPTX/div.ll b/llvm/test/CodeGen/NVPTX/div.ll
index 3d14d36ed599b..4f9d58758ca9e 100644
--- a/llvm/test/CodeGen/NVPTX/div.ll
+++ b/llvm/test/CodeGen/NVPTX/div.ll
@@ -11,10 +11,10 @@ define float @div_full(float %a, float %b) {
; CHECK-NEXT: ld.param.f32 %f1, [div_full_param_0];
; CHECK-NEXT: ld.param.f32 %f2, [div_full_param_1];
; CHECK-NEXT: div.full.f32 %f3, %f1, %f2;
-; CHECK-NEXT: mov.f32 %f4, 0f40400000;
+; CHECK-NEXT: mov.b32 %f4, 0f40400000;
; CHECK-NEXT: div.full.f32 %f5, %f3, %f4;
; CHECK-NEXT: div.full.ftz.f32 %f6, %f5, %f2;
-; CHECK-NEXT: mov.f32 %f7, 0f40800000;
+; CHECK-NEXT: mov.b32 %f7, 0f40800000;
; CHECK-NEXT: div.full.ftz.f32 %f8, %f6, %f7;
; CHECK-NEXT: st.param.f32 [func_retval0], %f8;
; CHECK-NEXT: ret;
diff --git a/llvm/test/CodeGen/NVPTX/f16-instructions.ll b/llvm/test/CodeGen/NVPTX/f16-instructions.ll
index f78cfc3172621..70d1167bbb6e2 100644
--- a/llvm/test/CodeGen/NVPTX/f16-instructions.ll
+++ b/llvm/test/CodeGen/NVPTX/f16-instructions.ll
@@ -138,7 +138,7 @@ define half @test_fsub(half %a, half %b) #0 {
; CHECK-F16-FTZ-NEXT: mov.b16 [[Z:%rs[0-9]+]], 0x0000
; CHECK-F16-FTZ-NEXT: sub.rn.ftz.f16 [[R:%rs[0-9]+]], [[Z]], [[A]];
; CHECK-NOF16-DAG: cvt.f32.f16 [[A32:%f[0-9]+]], [[A]]
-; CHECK-NOF16-DAG: mov.f32 [[Z:%f[0-9]+]], 0f00000000;
+; CHECK-NOF16-DAG: mov.b32 [[Z:%f[0-9]+]], 0f00000000;
; CHECK-NOF16-NEXT: sub.rn.f32 [[R32:%f[0-9]+]], [[Z]], [[A32]];
; CHECK-NOF16-NEXT: cvt.rn.f16.f32 [[R:%rs[0-9]+]], [[R32]]
; CHECK-NEXT: st.param.b16 [func_retval0], [[R]];
@@ -646,7 +646,7 @@ else:
; CHECK: ld.param.u64 %[[P1:rd[0-9]+]], [test_phi_param_0];
; CHECK: ld.b16 {{%rs[0-9]+}}, [%[[P1]]];
; CHECK: [[LOOP:\$L__BB[0-9_]+]]:
-; CHECK: mov.u16 [[R:%rs[0-9]+]], [[AB:%rs[0-9]+]];
+; CHECK: mov.b16 [[R:%rs[0-9]+]], [[AB:%rs[0-9]+]];
; CHECK: ld.b16 [[AB:%rs[0-9]+]], [%[[P1]]];
; CHECK: {
; CHECK: st.param.b64 [param0], %[[P1]];
diff --git a/llvm/test/CodeGen/NVPTX/f16x2-instructions.ll b/llvm/test/CodeGen/NVPTX/f16x2-instructions.ll
index 1905fec8ab7a8..539e810c83cbd 100644
--- a/llvm/test/CodeGen/NVPTX/f16x2-instructions.ll
+++ b/llvm/test/CodeGen/NVPTX/f16x2-instructions.ll
@@ -260,7 +260,7 @@ define <2 x half> @test_fneg(<2 x half> %a) #0 {
; CHECK-NOF16-NEXT: ld.param.b32 %r1, [test_fneg_param_0];
; CHECK-NOF16-NEXT: mov.b32 {%rs1, %rs2}, %r1;
; CHECK-NOF16-NEXT: cvt.f32.f16 %f1, %rs2;
-; CHECK-NOF16-NEXT: mov.f32 %f2, 0f00000000;
+; CHECK-NOF16-NEXT: mov.b32 %f2, 0f00000000;
; CHECK-NOF16-NEXT: sub.rn.f32 %f3, %f2, %f1;
; CHECK-NOF16-NEXT: cvt.rn.f16.f32 %rs3, %f3;
; CHECK-NOF16-NEXT: cvt.f32.f16 %f4, %rs1;
diff --git a/llvm/test/CodeGen/NVPTX/fma.ll b/llvm/test/CodeGen/NVPTX/fma.ll
index 3416420367beb..90fbd5ba9dfd6 100644
--- a/llvm/test/CodeGen/NVPTX/fma.ll
+++ b/llvm/test/CodeGen/NVPTX/fma.ll
@@ -50,7 +50,7 @@ define ptx_device float @f32_iir(float %x) {
}
define ptx_device float @f32_iii(float %x) {
-; CHECK: mov.f32 %f{{[0-9]+}}, 0f41200000;
+; CHECK: mov.b32 %f{{[0-9]+}}, 0f41200000;
; CHECK: ret;
%r = call float @llvm.fma.f32(float 2.0, float 3.0, float 4.0)
ret float %r
diff --git a/llvm/test/CodeGen/NVPTX/i128.ll b/llvm/test/CodeGen/NVPTX/i128.ll
index ca1b5fdabbf8f..546700c2b0335 100644
--- a/llvm/test/CodeGen/NVPTX/i128.ll
+++ b/llvm/test/CodeGen/NVPTX/i128.ll
@@ -77,7 +77,7 @@ define i128 @srem_i128(i128 %lhs, i128 %rhs) {
; CHECK-NEXT: setp.gt.s32 %p16, %r10, 63;
; CHECK-NEXT: selp.b64 %rd124, %rd76, %rd75, %p16;
; CHECK-NEXT: shl.b64 %rd123, %rd3, %r10;
-; CHECK-NEXT: mov.u64 %rd114, %rd117;
+; CHECK-NEXT: mov.b64 %rd114, %rd117;
; CHECK-NEXT: @%p15 bra $L__BB0_4;
; CHECK-NEXT: // %bb.1: // %udiv-preheader
; CHECK-NEXT: cvt.u32.u64 %r13, %rd119;
@@ -93,7 +93,7 @@ define i128 @srem_i128(i128 %lhs, i128 %rhs) {
; CHECK-NEXT: add.cc.s64 %rd35, %rd5, -1;
; CHECK-NEXT: addc.cc.s64 %rd36, %rd6, -1;
; CHECK-NEXT: mov.b64 %rd114, 0;
-; CHECK-NEXT: mov.u64 %rd117, %rd114;
+; CHECK-NEXT: mov.b64 %rd117, %rd114;
; CHECK-NEXT: $L__BB0_2: // %udiv-do-while
; CHECK-NEXT: // =>This Inner Loop Header: Depth=1
; CHECK-NEXT: shr.u64 %rd83, %rd121, 63;
@@ -210,7 +210,7 @@ define i128 @urem_i128(i128 %lhs, i128 %rhs) {
; CHECK-NEXT: setp.gt.s32 %p14, %r10, 63;
; CHECK-NEXT: selp.b64 %rd110, %rd66, %rd65, %p14;
; CHECK-NEXT: shl.b64 %rd109, %rd41, %r10;
-; CHECK-NEXT: mov.u64 %rd100, %rd103;
+; CHECK-NEXT: mov.b64 %rd100, %rd103;
; CHECK-NEXT: @%p13 bra $L__BB1_4;
; CHECK-NEXT: // %bb.1: // %udiv-preheader
; CHECK-NEXT: cvt.u32.u64 %r13, %rd105;
@@ -226,7 +226,7 @@ define i128 @urem_i128(i128 %lhs, i128 %rhs) {
; CHECK-NEXT: add.cc.s64 %rd33, %rd3, -1;
; CHECK-NEXT: addc.cc.s64 %rd34, %rd4, -1;
; CHECK-NEXT: mov.b64 %rd100, 0;
-; CHECK-NEXT: mov.u64 %rd103, %rd100;
+; CHECK-NEXT: mov.b64 %rd103, %rd100;
; CHECK-NEXT: $L__BB1_2: // %udiv-do-while
; CHECK-NEXT: // =>This Inner Loop Header: Depth=1
; CHECK-NEXT: shr.u64 %rd73, %rd107, 63;
@@ -386,7 +386,7 @@ define i128 @sdiv_i128(i128 %lhs, i128 %rhs) {
; CHECK-NEXT: setp.gt.s32 %p16, %r10, 63;
; CHECK-NEXT: selp.b64 %rd119, %rd77, %rd76, %p16;
; CHECK-NEXT: shl.b64 %rd118, %rd1, %r10;
-; CHECK-NEXT: mov.u64 %rd109, %rd112;
+; CHECK-NEXT: mov.b64 %rd109, %rd112;
; CHECK-NEXT: @%p15 bra $L__BB4_4;
; CHECK-NEXT: // %bb.1: // %udiv-preheader
; CHECK-NEXT: cvt.u32.u64 %r13, %rd114;
@@ -402,7 +402,7 @@ define i128 @sdiv_i128(i128 %lhs, i128 %rhs) {
; CHECK-NEXT: add.cc.s64 %rd35, %rd3, -1;
; CHECK-NEXT: addc.cc.s64 %rd36, %rd4, -1;
; CHECK-NEXT: mov.b64 %rd109, 0;
-; CHECK-NEXT: mov.u64 %rd112, %rd109;
+; CHECK-NEXT: mov.b64 %rd112, %rd109;
; CHECK-NEXT: $L__BB4_2: // %udiv-do-while
; CHECK-NEXT: // =>This Inner Loop Header: Depth=1
; CHECK-NEXT: shr.u64 %rd84, %rd116, 63;
@@ -513,7 +513,7 @@ define i128 @udiv_i128(i128 %lhs, i128 %rhs) {
; CHECK-NEXT: setp.gt.s32 %p14, %r10, 63;
; CHECK-NEXT: selp.b64 %rd104, %rd66, %rd65, %p14;
; CHECK-NEXT: shl.b64 %rd103, %rd41, %r10;
-; CHECK-NEXT: mov.u64 %rd94, %rd97;
+; CHECK-NEXT: mov.b64 %rd94, %rd97;
; CHECK-NEXT: @%p13 bra $L__BB5_4;
; CHECK-NEXT: // %bb.1: // %udiv-preheader
; CHECK-NEXT: cvt.u32.u64 %r13, %rd99;
@@ -529,7 +529,7 @@ define i128 @udiv_i128(i128 %lhs, i128 %rhs) {
; CHECK-NEXT: add.cc.s64 %rd33, %rd43, -1;
; CHECK-NEXT: addc.cc.s64 %rd34, %rd44, -1;
; CHECK-NEXT: mov.b64 %rd94, 0;
-; CHECK-NEXT: mov.u64 %rd97, %rd94;
+; CHECK-NEXT: mov.b64 %rd97, %rd94;
; CHECK-NEXT: $L__BB5_2: // %udiv-do-while
; CHECK-NEXT: // =>This Inner Loop Header: Depth=1
; CHECK-NEXT: shr.u64 %rd73, %rd101, 63;
diff --git a/llvm/test/CodeGen/NVPTX/lower-args-gridconstant.ll b/llvm/test/CodeGen/NVPTX/lower-args-gridconstant.ll
index 90f9306d036cd..010eafdf2f2ac 100644
--- a/llvm/test/CodeGen/NVPTX/lower-args-gridconstant.ll
+++ b/llvm/test/CodeGen/NVPTX/lower-args-gridconstant.ll
@@ -134,7 +134,7 @@ define ptx_kernel void @grid_const_escape(ptr byval(%struct.s) align 4 %input) {
; PTX-EMPTY:
; PTX-NEXT: // %bb.0:
; PTX-NEXT: mov.b64 %rd2, grid_const_escape_param_0;
-; PTX-NEXT: mov.u64 %rd3, %rd2;
+; PTX-NEXT: mov.b64 %rd3, %rd2;
; PTX-NEXT: cvta.param.u64 %rd4, %rd3;
; PTX-NEXT: mov.u64 %rd1, escape;
; PTX-NEXT: { // callseq 0, 0
@@ -176,10 +176,10 @@ define ptx_kernel void @multiple_grid_const_escape(ptr byval(%struct.s) align 4
; PTX-NEXT: cvta.local.u64 %SP, %SPL;
; PTX-NEXT: mov.b64 %rd2, multiple_grid_const_escape_param_0;
; PTX-NEXT: mov.b64 %rd3, multiple_grid_const_escape_param_2;
-; PTX-NEXT: mov.u64 %rd4, %rd3;
+; PTX-NEXT: mov.b64 %rd4, %rd3;
; PTX-NEXT: ld.param.u32 %r1, [multiple_grid_const_escape_param_1];
; PTX-NEXT: cvta.param.u64 %rd5, %rd4;
-; PTX-NEXT: mov.u64 %rd6, %rd2;
+; PTX-NEXT: mov.b64 %rd6, %rd2;
; PTX-NEXT: cvta.param.u64 %rd7, %rd6;
; PTX-NEXT: add.u64 %rd8, %SP, 0;
; PTX-NEXT: add.u64 %rd9, %SPL, 0;
@@ -231,7 +231,7 @@ define ptx_kernel void @grid_const_memory_escape(ptr byval(%struct.s) align 4 %i
; PTX-NEXT: mov.b64 %rd1, grid_const_memory_escape_param_0;
; PTX-NEXT: ld.param.u64 %rd2, [grid_const_memory_escape_param_1];
; PTX-NEXT: cvta.to.global.u64 %rd3, %rd2;
-; PTX-NEXT: mov.u64 %rd4, %rd1;
+; PTX-NEXT: mov.b64 %rd4, %rd1;
; PTX-NEXT: cvta.param.u64 %rd5, %rd4;
; PTX-NEXT: st.global.u64 [%rd3], %rd5;
; PTX-NEXT: ret;
@@ -257,7 +257,7 @@ define ptx_kernel void @grid_const_inlineasm_escape(ptr byval(%struct.s) align 4
; PTX-NEXT: mov.b64 %rd4, grid_const_inlineasm_escape_param_0;
; PTX-NEXT: ld.param.u64 %rd5, [grid_const_inlineasm_escape_param_1];
; PTX-NEXT: cvta.to.global.u64 %rd6, %rd5;
-; PTX-NEXT: mov.u64 %rd7, %rd4;
+; PTX-NEXT: mov.b64 %rd7, %rd4;
; PTX-NEXT: cvta.param.u64 %rd2, %rd7;
; PTX-NEXT: add.s64 %rd3, %rd2, 4;
; PTX-NEXT: // begin inline asm
@@ -295,7 +295,7 @@ define ptx_kernel void @grid_const_partial_escape(ptr byval(i32) %input, ptr %ou
; PTX-NE...
[truncated]
Seems like some nice cleanup. While you're at it, it might be good to remove the redundant `r`, since the only thing you can move a value into is a register. Also, it looks like there are some other mov instructions, such as `MOV_ADDR` and `MOV_DEPOT_ADDR`, which you could switch to using the `b` variant as well.
LGTM, please make sure to run the lit tests locally with `LLVM_PTXAS_EXECUTABLE` set.
This causes some functional regressions on Triton; it seems to somehow expose a ptxas bug triggered by changing a `mov` from the `u32` suffix to the `b32` suffix.
For what it's worth, we do not see any interesting failures that may have been caused by this patch. It's been in use by XLA for about a week now, and I'd expect a major issue like this to have surfaced by now.
Thanks for sharing, this is good to know; it must be a niche case then. It looks like it only affects one of our kernels, and we found a workaround for now. This is clearly a ptxas bug, though; we'll try to see if NVIDIA can fix it.
Any chance you can post a minimized reproducer on godbolt?
I don't have a minimized reproducer yet. As it is related to a ptxas optimization, making it minimal is a challenge.
"Challenge" as in "it's hard to automatically reduce PTX in principle" (i.e. delta/creduce don't work well with PTX). Or as in "the error only happens on the large PTX input and disappears once PTX is smaller than some still-large size" ? If it's the former, I found that I can get most of the way towards minimal reproducer by reducing the IR instead. I use llvm-reduce (https://llvm.org/docs/CommandGuide/llvm-reduce.html) + a test which assembles PTX with ptxas, then disassembles it using cuobjdump or nvdisasm and looks at SASS for the signs of miscompilation. |
I guess it's both. How do you use llvm-reduce when it's a miscompile? I have used it for crashes, but here the problem is that the computed value is wrong, so I need a reference to compare it against.
That depends on the details. A typical pattern is to identify the sequence of invalid instructions (in this case, in SASS) and make sure it is always present in the compilation output. It's not always feasible (e.g., when it's a common pattern, just used inappropriately), and it's often tricky to make the checks robust enough (the wonders of an optimizing assembler), but sometimes it does work well enough. Another approach is to compile the test input with a known-good and a known-bad ptxas and watch the differences. That only works for regressions, when we have a "good" version of ptxas which produces close-enough output to the new ptxas with the regression. If the reproducer can be made public, even if it's not minimal, that would also be useful. I may get more specific ideas after poking at it for a bit.
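A minimal sketch of such an llvm-reduce interestingness test, with the NVPTX toolchain invocations left as comments so the filtering logic runs stand-alone. The `sm_90` target, the file paths, and `SUSPECT_PATTERN` are all illustrative assumptions, not details from this thread:

```shell
#!/usr/bin/env bash
# Interestingness test for llvm-reduce: exit 0 means the candidate IR
# still reproduces the suspected miscompile pattern in SASS.
set -euo pipefail

# In a real test, regenerate SASS from the candidate IR ($1):
#   llc -mtriple=nvptx64-nvidia-cuda -mcpu=sm_90 "$1" -o /tmp/c.ptx
#   ptxas -arch=sm_90 /tmp/c.ptx -o /tmp/c.cubin
#   nvdisasm /tmp/c.cubin > /tmp/c.sass
# Here we fake the disassembly to exercise the check itself:
printf 'LDG.E R0, [R2]\nSUSPECT_PATTERN R4, R0\n' > /tmp/c.sass

# The reduction stays "interesting" as long as the bad sequence survives.
if grep -q 'SUSPECT_PATTERN' /tmp/c.sass; then
  echo interesting
else
  exit 1
fi
```

With a script like this saved as `interesting.sh`, the reduction would be driven by something like `llvm-reduce --test=interesting.sh reproducer.ll`.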
@ThomasRaoux, in your interestingness test, perhaps you could check that the output of your test is:
Use a `multiclass` to define `mov`, and canonicalize the `mov` instruction to always use the `b<bit-size>` suffix.
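For reference, each `defm` in the patch instantiates the `MOV` multiclass into a register-to-register form and an immediate form; e.g., the `IMOV32` line expands to roughly the following (an approximate expansion, reconstructed from the multiclass body in the diff):

```tablegen
// Approximate expansion of: defm IMOV32 : MOV<Int32Regs, "b32", i32, i32imm, imm>;
def IMOV32rr : NVPTXInst<(outs Int32Regs:$dst), (ins Int32Regs:$src),
                         "mov.b32 \t$dst, $src;", []>;
def IMOV32ri : NVPTXInst<(outs Int32Regs:$dst), (ins i32imm:$src),
                         "mov.b32 \t$dst, $src;",
                         [(set i32:$dst, imm:$src)]>;
```

This replaces the hand-written `IMOVnrr`/`IMOVnri` pairs while emitting the same instructions, with the suffix canonicalized to `b32`.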