Skip to content

[AMDGPU] Do not use original PHIs in coercion chains #98063

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jul 10, 2024

Conversation

jrbyrnes
Copy link
Contributor

@jrbyrnes jrbyrnes commented Jul 8, 2024

It's possible that we are unable to coerce all the incoming values of a PHINode (A). Thus, we are unable to coerce the PHINode. In this situation, we previously would add the PHINode back to the ValMap. This would cause a problem is PhiNode (B) was a user of A. In this scenario, if B has been coerced, we would hit an assert regarding the incompatible type between the PHINode and its incoming value.

Deleting non-coerced PHINodes from the map, and propagating the removal to users, resolves the issue.

Change-Id: Ib15b2716d69c796eefe6683bd5f0a6ba0b94f6a2
@jrbyrnes jrbyrnes requested a review from arsenm July 8, 2024 18:35
@llvmbot
Copy link
Member

llvmbot commented Jul 8, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Jeffrey Byrnes (jrbyrnes)

Changes

It's possible that we are unable to coerce all the incoming values of a PHINode (A). Thus, we are unable to coerce the PHINode. In this situation, we previously would add the PHINode back to the ValMap. This would cause a problem is PhiNode (B) was a user of A. In this scenario, if B has been coerced, we would hit an assert regarding the incompatible type between the PHINode and its incoming value.

Deleting non-coerced PHINodes from the map resolves the issue.


Full diff: https://github.com/llvm/llvm-project/pull/98063.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/vni8-across-blocks.ll (+83)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp b/llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp
index 2cc95f81d2f94d..88811a24e2fcd4 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp
@@ -369,7 +369,7 @@ bool LiveRegOptimizer::optimizeLiveType(
     if (MissingIncVal) {
       DeadInst = cast<Instruction>(ValMap[Phi]);
       // Do not use the dead phi
-      ValMap[Phi] = Phi;
+      ValMap.erase(Phi);
     }
     DeadInsts.emplace_back(DeadInst);
   }
diff --git a/llvm/test/CodeGen/AMDGPU/vni8-across-blocks.ll b/llvm/test/CodeGen/AMDGPU/vni8-across-blocks.ll
index 441f00faf329e4..3202926e940746 100644
--- a/llvm/test/CodeGen/AMDGPU/vni8-across-blocks.ll
+++ b/llvm/test/CodeGen/AMDGPU/vni8-across-blocks.ll
@@ -866,5 +866,88 @@ bb.3:
   ret void
 }
 
+; This should not cause Assertion `getType() == V->getType() && "All operands to PHI node must be the same type as the PHI node
+; Note: whether or not the assertion fires depends on the iteration ortder of PhiNodes in AMDGPULateCodeGenPrepare, which
+; is non-deterministic due to iterators over a set of pointers.
+
+define amdgpu_kernel void @MissingInc_PhiChain(i1 %cmp1.i.i.i.i.i.not, <16 x i8> %promotealloca31.i.i.i.i) {
+; GFX906-LABEL: MissingInc_PhiChain:
+; GFX906:       ; %bb.0: ; %entry
+; GFX906-NEXT:    s_load_dword s2, s[0:1], 0x24
+; GFX906-NEXT:    s_load_dwordx4 s[4:7], s[0:1], 0x34
+; GFX906-NEXT:    s_mov_b32 s10, 1
+; GFX906-NEXT:    v_mov_b32_e32 v4, 1
+; GFX906-NEXT:    s_mov_b32 s11, 1
+; GFX906-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX906-NEXT:    s_bitcmp1_b32 s2, 0
+; GFX906-NEXT:    s_cselect_b64 s[2:3], -1, 0
+; GFX906-NEXT:    s_xor_b64 s[0:1], s[2:3], -1
+; GFX906-NEXT:    v_cndmask_b32_e64 v0, 0, 1, s[0:1]
+; GFX906-NEXT:    v_cmp_ne_u32_e64 s[0:1], 1, v0
+; GFX906-NEXT:    s_branch .LBB14_2
+; GFX906-NEXT:  .LBB14_1: ; %if.end.1.i.i.i.i.i
+; GFX906-NEXT:    ; in Loop: Header=BB14_2 Depth=1
+; GFX906-NEXT:    v_lshrrev_b32_e32 v4, 8, v0
+; GFX906-NEXT:    s_mov_b32 s10, 0
+; GFX906-NEXT:    s_mov_b32 s11, 0
+; GFX906-NEXT:  .LBB14_2: ; %for.body10.i.i.i.i
+; GFX906-NEXT:    ; =>This Inner Loop Header: Depth=1
+; GFX906-NEXT:    s_and_b64 vcc, exec, s[0:1]
+; GFX906-NEXT:    s_mov_b64 s[8:9], s[2:3]
+; GFX906-NEXT:    ; implicit-def: $vgpr0_vgpr1_vgpr2_vgpr3
+; GFX906-NEXT:    s_cbranch_vccnz .LBB14_4
+; GFX906-NEXT:  ; %bb.3: ; %if.then.i.i.i.i.i
+; GFX906-NEXT:    ; in Loop: Header=BB14_2 Depth=1
+; GFX906-NEXT:    v_lshlrev_b16_e64 v0, 8, s11
+; GFX906-NEXT:    v_or_b32_sdwa v0, s10, v0 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
+; GFX906-NEXT:    v_lshlrev_b16_e32 v1, 8, v4
+; GFX906-NEXT:    v_or_b32_e32 v0, v1, v0
+; GFX906-NEXT:    s_mov_b64 s[8:9], -1
+; GFX906-NEXT:  .LBB14_4: ; %Flow
+; GFX906-NEXT:    ; in Loop: Header=BB14_2 Depth=1
+; GFX906-NEXT:    s_andn2_b64 vcc, exec, s[8:9]
+; GFX906-NEXT:    s_cbranch_vccnz .LBB14_7
+; GFX906-NEXT:  ; %bb.5: ; %if.end.i.i.i.i.i
+; GFX906-NEXT:    ; in Loop: Header=BB14_2 Depth=1
+; GFX906-NEXT:    s_and_b64 vcc, exec, s[0:1]
+; GFX906-NEXT:    s_cbranch_vccnz .LBB14_1
+; GFX906-NEXT:  ; %bb.6: ; %if.then.1.i.i.i.i.i
+; GFX906-NEXT:    ; in Loop: Header=BB14_2 Depth=1
+; GFX906-NEXT:    v_mov_b32_e32 v0, s4
+; GFX906-NEXT:    v_mov_b32_e32 v1, s5
+; GFX906-NEXT:    v_mov_b32_e32 v2, s6
+; GFX906-NEXT:    v_mov_b32_e32 v3, s7
+; GFX906-NEXT:    s_branch .LBB14_1
+; GFX906-NEXT:  .LBB14_7: ; in Loop: Header=BB14_2 Depth=1
+; GFX906-NEXT:    ; implicit-def: $vgpr4
+; GFX906-NEXT:    ; implicit-def: $sgpr10
+; GFX906-NEXT:    ; implicit-def: $sgpr11
+; GFX906-NEXT:    s_cbranch_execz .LBB14_2
+; GFX906-NEXT:  ; %bb.8: ; %DummyReturnBlock
+; GFX906-NEXT:    s_endpgm
+entry:
+  br label %for.body10.i.i.i.i
+
+for.body10.i.i.i.i:                               ; preds = %if.end.1.i.i.i.i.i, %entry
+  %promotealloca3237.i.i.i.i = phi <16 x i8> [ <i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1>, %entry ], [ %1, %if.end.1.i.i.i.i.i ]
+  br i1 %cmp1.i.i.i.i.i.not, label %if.end.i.i.i.i.i, label %if.then.i.i.i.i.i
+
+if.then.i.i.i.i.i:                                ; preds = %for.body10.i.i.i.i
+  %0 = insertelement <16 x i8> %promotealloca3237.i.i.i.i, i8 0, i64 0
+  br label %if.end.i.i.i.i.i
+
+if.end.i.i.i.i.i:                                 ; preds = %if.then.i.i.i.i.i, %for.body10.i.i.i.i
+  %promotealloca31.i.i.i.i3 = phi <16 x i8> [ %0, %if.then.i.i.i.i.i ], [ %promotealloca3237.i.i.i.i, %for.body10.i.i.i.i ]
+  br i1 %cmp1.i.i.i.i.i.not, label %if.end.1.i.i.i.i.i, label %if.then.1.i.i.i.i.i
+
+if.then.1.i.i.i.i.i:                              ; preds = %if.end.i.i.i.i.i
+  br label %if.end.1.i.i.i.i.i
+
+if.end.1.i.i.i.i.i:                               ; preds = %if.then.1.i.i.i.i.i, %if.end.i.i.i.i.i
+  %promotealloca30.i.i.i.i = phi <16 x i8> [ %promotealloca31.i.i.i.i, %if.then.1.i.i.i.i.i ], [ %promotealloca31.i.i.i.i3, %if.end.i.i.i.i.i ]
+  %1 = shufflevector <16 x i8> %promotealloca30.i.i.i.i, <16 x i8> zeroinitializer, <16 x i32> <i32 0, i32 1, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>
+  br label %for.body10.i.i.i.i
+}
+
 
 declare i32 @llvm.amdgcn.workitem.id.x()

jrbyrnes added 2 commits July 8, 2024 13:23
Change-Id: I29c485cb30bbc51324d6701fc77697936f324a96
Change-Id: Ie1f906462f094d8ffa3ac8d949dd01fadbdeab7b
Comment on lines +371 to +372
// The coercion chain of the PHI is broken. Delete the Phi
// from the ValMap and any connected / user Phis.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this have been avoided by bitcast + RAUW when it was processed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is when one PHINode uses another -- in this case, we don't use the bitcast + replace use approach.

In this case, we first create empty new typed PHI nodes. Then, we tie together the new PHI nodes with their incoming values. However, the tieing process can fail if we are missing an incoming value. Then the new typed PHI node becomes trash. This is a problem if another PHI node uses it. Currently, we replace the first PHI node with its original type in the ValMap, meaning the second PHI node will have an incompatible type.

The iteration order is non deterministic, so we may have processed the second PHI node before processing the first PHI node. In this scenario, the second PHI node will use the new-typed / trash version of first PHI node. So, when if we are unable to supply all incoming values for a PHI node, we must traverse the users and mark them as dead / unusable. Since the PHINode chain is removed from the ValMap, they won't be propagated to other users.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that what the WeakTrackingVH is for?

Change-Id: I402ab9cd835d9f725f31b6fe83147488800e1946
@jrbyrnes jrbyrnes merged commit dc8ea04 into llvm:main Jul 10, 2024
4 of 6 checks passed
@llvm-ci
Copy link
Collaborator

llvm-ci commented Jul 10, 2024

LLVM Buildbot has detected a new failure on builder arc-builder running on arc-worker while building llvm at step 5 "build-unified-tree".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/3/builds/1341

Here is the relevant piece of the build log for the reference:

Step 5 (build-unified-tree) failure: build (failure)
...
28.606 [2/11/104] Linking CXX executable bin/llvm-exegesis
28.613 [2/10/105] Linking CXX executable bin/llvm-isel-fuzzer
29.375 [2/9/106] Linking CXX executable bin/llvm-split
30.612 [2/8/107] Linking CXX executable bin/llvm-opt-fuzzer
31.225 [2/7/108] Linking CXX executable bin/llvm-reduce
31.788 [2/6/109] Linking CXX executable bin/llvm-lto2
32.183 [2/5/110] Linking CXX executable bin/clang-19
32.184 [1/5/111] Linking CXX executable bin/opt
32.198 [1/4/112] Creating executable symlink bin/clang
32.429 [1/3/113] Linking CXX executable bin/clang-repl
command timed out: 1200 seconds without output running [b'ninja', b'-j', b'16'], attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=1275.530542

@llvm-ci
Copy link
Collaborator

llvm-ci commented Jul 10, 2024

LLVM Buildbot has detected a new failure on builder flang-aarch64-dylib running on linaro-flang-aarch64-dylib while building llvm at step 5 "build-unified-tree".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/50/builds/938

Here is the relevant piece of the build log for the reference:

Step 5 (build-unified-tree) failure: build (failure)
...
48.128 [3335/4/3226] Copying clang's avx512vpopcntdqintrin.h...
48.134 [3335/3/3227] Copying clang's avx512vpopcntdqvlintrin.h...
48.163 [3329/8/3228] Linking CXX static library lib/libLLVMLTO.a
48.164 [3329/7/3229] Copying clang's avxifmaintrin.h...
48.164 [3329/6/3230] Copying clang's avxintrin.h...
48.164 [3329/5/3231] Copying clang's avxneconvertintrin.h...
48.165 [3329/4/3232] Copying clang's avxvnniint16intrin.h...
48.165 [3329/3/3233] Copying clang's avxvnniint8intrin.h...
48.170 [3329/2/3234] Copying clang's avxvnniintrin.h...
52.551 [3329/1/3235] Building CXX object tools/flang/lib/Optimizer/Dialect/CUF/CMakeFiles/obj.CUFDialect.dir/CUFOps.cpp.o
FAILED: tools/flang/lib/Optimizer/Dialect/CUF/CMakeFiles/obj.CUFDialect.dir/CUFOps.cpp.o 
/usr/local/bin/c++ -DFLANG_INCLUDE_TESTS=1 -DFLANG_LITTLE_ENDIAN=1 -DGTEST_HAS_RTTI=0 -D_DEBUG -D_GLIBCXX_ASSERTIONS -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/home/tcwg-buildbot/worker/flang-aarch64-dylib/build/tools/flang/lib/Optimizer/Dialect/CUF -I/home/tcwg-buildbot/worker/flang-aarch64-dylib/llvm-project/flang/lib/Optimizer/Dialect/CUF -I/home/tcwg-buildbot/worker/flang-aarch64-dylib/llvm-project/flang/include -I/home/tcwg-buildbot/worker/flang-aarch64-dylib/build/tools/flang/include -I/home/tcwg-buildbot/worker/flang-aarch64-dylib/build/include -I/home/tcwg-buildbot/worker/flang-aarch64-dylib/llvm-project/llvm/include -isystem /home/tcwg-buildbot/worker/flang-aarch64-dylib/llvm-project/llvm/../mlir/include -isystem /home/tcwg-buildbot/worker/flang-aarch64-dylib/build/tools/mlir/include -isystem /home/tcwg-buildbot/worker/flang-aarch64-dylib/build/tools/clang/include -isystem /home/tcwg-buildbot/worker/flang-aarch64-dylib/llvm-project/llvm/../clang/include -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wc++98-compat-extra-semi -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -Wno-deprecated-copy -Wno-string-conversion -Wno-ctad-maybe-unsupported -Wno-unused-command-line-argument -Wstring-conversion           -Wcovered-switch-default -Wno-nested-anon-types -O3 -DNDEBUG  -fno-exceptions -funwind-tables -fno-rtti -UNDEBUG -std=c++17 -MD -MT tools/flang/lib/Optimizer/Dialect/CUF/CMakeFiles/obj.CUFDialect.dir/CUFOps.cpp.o -MF tools/flang/lib/Optimizer/Dialect/CUF/CMakeFiles/obj.CUFDialect.dir/CUFOps.cpp.o.d -o tools/flang/lib/Optimizer/Dialect/CUF/CMakeFiles/obj.CUFDialect.dir/CUFOps.cpp.o -c /home/tcwg-buildbot/worker/flang-aarch64-dylib/llvm-project/flang/lib/Optimizer/Dialect/CUF/CUFOps.cpp
In file included from /home/tcwg-buildbot/worker/flang-aarch64-dylib/llvm-project/flang/lib/Optimizer/Dialect/CUF/CUFOps.cpp:16:
/home/tcwg-buildbot/worker/flang-aarch64-dylib/llvm-project/flang/include/flang/Optimizer/Dialect/FIRAttr.h:162:10: fatal error: 'flang/Optimizer/Dialect/FIREnumAttr.h.inc' file not found
  162 | #include "flang/Optimizer/Dialect/FIREnumAttr.h.inc"
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
ninja: build stopped: subcommand failed.

@jayfoad
Copy link
Contributor

jayfoad commented Jul 12, 2024

@jrbyrnes this is generating broken IR in some Vulkan CTS tests. Here's a reduced test case:

diff --git a/llvm/test/CodeGen/AMDGPU/amdgpu-late-codegenprepare.ll b/llvm/test/CodeGen/AMDGPU/amdgpu-late-codegenprepare.ll
index 83016f1d2d3c..d82293c09ef5 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgpu-late-codegenprepare.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgpu-late-codegenprepare.ll
@@ -93,3 +93,22 @@ define amdgpu_kernel void @constant_from_inttoptr() {
   store i8 %load, ptr addrspace(1) undef
   ret void
 }
+
+define void @broken_phi() {
+bb:
+  br label %bb1
+bb1:
+  %i = phi <4 x i8> [ <i8 1, i8 1, i8 1, i8 1>, %bb ], [ %i8, %bb7 ]
+  br i1 false, label %bb3, label %bb2
+bb2:
+  br label %bb3
+bb3:
+  %i4 = phi <4 x i8> [ zeroinitializer, %bb2 ], [ %i, %bb1 ]
+  br i1 false, label %bb7, label %bb5
+bb5:
+  %i6 = call <4 x i8> @llvm.smax.v4i8(<4 x i8> %i4, <4 x i8> zeroinitializer)
+  br label %bb7
+bb7:
+  %i8 = phi <4 x i8> [ zeroinitializer, %bb5 ], [ zeroinitializer, %bb3 ]
+  br label %bb1
+}

It fails IR verification with PHINode should have one entry for each predecessor of its parent basic block!. Please fix or revert!

@mariusz-sikora-at-amd @dstutt FYI.

aaryanshukla pushed a commit to aaryanshukla/llvm-project that referenced this pull request Jul 14, 2024
It's possible that we are unable to coerce all the incoming values of a
PHINode (A). Thus, we are unable to coerce the PHINode. In this
situation, we previously would add the PHINode back to the ValMap. This
would cause a problem is PhiNode (B) was a user of A. In this scenario,
if B has been coerced, we would hit an assert regarding the incompatible
type between the PHINode and its incoming value.

Deleting non-coerced PHINodes from the map, and propagating the removal
to users, resolves the issue.
jayfoad added a commit that referenced this pull request Jul 15, 2024
This reverts commit dc8ea04.

It generated broken IR as described here:
#98063 (comment)
jayfoad added a commit that referenced this pull request Jul 15, 2024
Add a test case to demonstrate broken IR caused by #98063 "[AMDGPU] Do
not use original PHIs in coercion chains" before it was reverted.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants