Skip to content

[AMDGPU] Fix machine verification failure from INIT_EXEC lowering #98333

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jul 11, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 3 additions & 1 deletion llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1676,6 +1676,8 @@ bool SIWholeQuadMode::runOnMachineFunction(MachineFunction &MF) {
if (!(GlobalFlags & (StateWQM | StateStrict)) && LowerToCopyInstrs.empty() &&
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am now even more convinced that we should not bother with this fast path. We should just make sure that the general path is fast when there is no work to do.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll run compile time testing on a version without it -- it's possible there really is little difference already.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing I can see that we might be able to skip over in some cases is the two for (auto BII : Blocks) loops below.

LowerToMovInstrs.empty() && KillInstrs.empty()) {
lowerLiveMaskQueries();
if (!InitExecInstrs.empty())
LIS->removeAllRegUnitsForPhysReg(AMDGPU::EXEC);
return !InitExecInstrs.empty() || !LiveMaskQueries.empty();
}

Expand Down Expand Up @@ -1717,7 +1719,7 @@ bool SIWholeQuadMode::runOnMachineFunction(MachineFunction &MF) {
LIS->removeAllRegUnitsForPhysReg(AMDGPU::SCC);

// If we performed any kills then recompute EXEC
if (!KillInstrs.empty())
if (!KillInstrs.empty() || !InitExecInstrs.empty())
LIS->removeAllRegUnitsForPhysReg(AMDGPU::EXEC);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be necessary in the first place. Reserved registers, like exec, do not have liveness tracked

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well I see stuff like this in the dumps:

# After SI Whole Quad Mode
********** INTERVALS **********
EXEC_LO_LO16 [16r,16d:0) 0@16r
EXEC_LO_HI16 [16r,16d:0) 0@16r
EXEC_HI_LO16 [16r,16d:0) 0@16r
EXEC_HI_HI16 [16r,16d:0) 0@16r

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those shouldn't have been created in the first place. Where did they come from?

Copy link
Contributor

@perlfu perlfu Jul 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this test case they are formed in Register Coalescer when it attempts to merge a copy of EXEC with a virtual and asks for an interval on a EXEC reg unit.

Reserved registers still have intervals, they are a sequence of a dead defs, one for each assignment to the register.
You can see this in LiveIntervals::computeRegUnitRange.
I do not know the use for this, but assume it is for interference calculations.


return true;
Expand Down
75 changes: 75 additions & 0 deletions llvm/test/CodeGen/AMDGPU/wqm.ll
Original file line number Diff line number Diff line change
Expand Up @@ -3463,6 +3463,81 @@ bb:
ret void
}

; Test a case that failed machine verification.
define amdgpu_gs void @wqm_init_exec_switch(i32 %arg) {
; GFX9-W64-LABEL: wqm_init_exec_switch:
; GFX9-W64: ; %bb.0:
; GFX9-W64-NEXT: s_mov_b64 exec, 0
; GFX9-W64-NEXT: v_cmp_lt_i32_e32 vcc, 0, v0
; GFX9-W64-NEXT: s_and_saveexec_b64 s[0:1], vcc
; GFX9-W64-NEXT: s_xor_b64 s[0:1], exec, s[0:1]
; GFX9-W64-NEXT: s_andn2_saveexec_b64 s[0:1], s[0:1]
; GFX9-W64-NEXT: s_endpgm
;
; GFX10-W32-LABEL: wqm_init_exec_switch:
; GFX10-W32: ; %bb.0:
; GFX10-W32-NEXT: s_mov_b32 exec_lo, 0
; GFX10-W32-NEXT: s_mov_b32 s0, exec_lo
; GFX10-W32-NEXT: v_cmpx_lt_i32_e32 0, v0
; GFX10-W32-NEXT: s_xor_b32 s0, exec_lo, s0
; GFX10-W32-NEXT: s_andn2_saveexec_b32 s0, s0
; GFX10-W32-NEXT: s_endpgm
call void @llvm.amdgcn.init.exec(i64 0)
switch i32 %arg, label %bb1 [
i32 0, label %bb3
i32 1, label %bb2
]
bb1:
ret void
bb2:
ret void
bb3:
ret void
}

define amdgpu_gs void @wqm_init_exec_wwm() {
; GFX9-W64-LABEL: wqm_init_exec_wwm:
; GFX9-W64: ; %bb.0:
; GFX9-W64-NEXT: s_mov_b64 exec, 0
; GFX9-W64-NEXT: s_mov_b32 s1, 0
; GFX9-W64-NEXT: s_mov_b32 s0, s1
; GFX9-W64-NEXT: s_cmp_lg_u64 exec, 0
; GFX9-W64-NEXT: s_cselect_b64 s[2:3], -1, 0
; GFX9-W64-NEXT: s_cmp_lg_u64 s[0:1], 0
; GFX9-W64-NEXT: s_cselect_b64 s[0:1], -1, 0
; GFX9-W64-NEXT: s_xor_b64 s[0:1], s[2:3], s[0:1]
; GFX9-W64-NEXT: v_cndmask_b32_e64 v0, 0, 1.0, s[0:1]
; GFX9-W64-NEXT: v_mov_b32_e32 v1, 0
; GFX9-W64-NEXT: exp mrt0 off, off, off, off
; GFX9-W64-NEXT: s_endpgm
;
; GFX10-W32-LABEL: wqm_init_exec_wwm:
; GFX10-W32: ; %bb.0:
; GFX10-W32-NEXT: s_mov_b32 exec_lo, 0
; GFX10-W32-NEXT: s_mov_b32 s1, 0
; GFX10-W32-NEXT: s_cmp_lg_u64 exec, 0
; GFX10-W32-NEXT: s_mov_b32 s0, s1
; GFX10-W32-NEXT: s_cselect_b32 s2, -1, 0
; GFX10-W32-NEXT: s_cmp_lg_u64 s[0:1], 0
; GFX10-W32-NEXT: v_mov_b32_e32 v1, 0
; GFX10-W32-NEXT: s_cselect_b32 s0, -1, 0
; GFX10-W32-NEXT: s_xor_b32 s0, s2, s0
; GFX10-W32-NEXT: v_cndmask_b32_e64 v0, 0, 1.0, s0
; GFX10-W32-NEXT: exp mrt0 off, off, off, off
; GFX10-W32-NEXT: s_endpgm
call void @llvm.amdgcn.init.exec(i64 0)
%i = call i64 @llvm.amdgcn.ballot.i64(i1 true)
%i1 = call i32 @llvm.amdgcn.wwm.i32(i32 0)
%i2 = insertelement <2 x i32> zeroinitializer, i32 %i1, i64 0
%i3 = bitcast <2 x i32> %i2 to i64
%i4 = icmp ne i64 %i, 0
%i5 = icmp ne i64 %i3, 0
%i6 = xor i1 %i4, %i5
%i7 = uitofp i1 %i6 to float
call void @llvm.amdgcn.exp.f32(i32 0, i32 0, float %i7, float 0.0, float 0.0, float 0.0, i1 false, i1 false)
ret void
}

declare void @llvm.amdgcn.exp.f32(i32, i32, float, float, float, float, i1, i1) #1
declare void @llvm.amdgcn.image.store.1d.v4f32.i32(<4 x float>, i32, i32, <8 x i32>, i32, i32) #1

Expand Down
Loading