Skip to content

[GISel] Explicitly disable BF16 tablegen patterns. #124113

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jan 27, 2025

Conversation

davemgreen
Copy link
Collaborator

@davemgreen davemgreen commented Jan 23, 2025

We currently have an issue where bf16 patters can be used to match fp16 types, as GISel does not know about the difference between the two. This patch explicitly disables them to make sure that they are never used.

The opposite can also happen too, where fp16 patterns are used for operators that should be bf16. So this also changes any operations with bf16 types to now cause a fallback to SDAG.

The pass setup for GISel has been slightly adjusted to make sure that a verify pass does not get added between AMD-SDAG and SIFixSGPRCopiesPass, which otherwise can cause verifier issues when falling back.

@llvmbot
Copy link
Member

llvmbot commented Jan 23, 2025

@llvm/pr-subscribers-backend-amdgpu
@llvm/pr-subscribers-tablegen
@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-aarch64

Author: David Green (davemgreen)

Changes

We currently have an issue where bf16 patters can be used to match fp16 types, as GISel does not know about the difference between the two types. This patch explicitly disables them to make sure that they are never used.


Full diff: https://github.com/llvm/llvm-project/pull/124113.diff

2 Files Affected:

  • (modified) llvm/test/CodeGen/AArch64/fptrunc.ll (+1)
  • (modified) llvm/utils/TableGen/GlobalISelEmitter.cpp (+21)
diff --git a/llvm/test/CodeGen/AArch64/fptrunc.ll b/llvm/test/CodeGen/AArch64/fptrunc.ll
index 2187717c4148ae..b4c38e9f2df3b2 100644
--- a/llvm/test/CodeGen/AArch64/fptrunc.ll
+++ b/llvm/test/CodeGen/AArch64/fptrunc.ll
@@ -1,6 +1,7 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
 ; RUN: llc -mtriple=aarch64 -global-isel=0 -verify-machineinstrs %s -o - | FileCheck %s --check-prefixes=CHECK,CHECK-SD
 ; RUN: llc -mtriple=aarch64 -global-isel=1 -verify-machineinstrs %s -o - | FileCheck %s --check-prefixes=CHECK,CHECK-GI
+; RUN: llc -mtriple=aarch64 -global-isel=1 -mattr=+fullfp16,+bf16 -verify-machineinstrs %s -o - | FileCheck %s --check-prefixes=CHECK,CHECK-GI
 
 define float @fptrunc_f64_f32(double %a) {
 ; CHECK-LABEL: fptrunc_f64_f32:
diff --git a/llvm/utils/TableGen/GlobalISelEmitter.cpp b/llvm/utils/TableGen/GlobalISelEmitter.cpp
index 5466d315c05a49..2969dd9156ccbf 100644
--- a/llvm/utils/TableGen/GlobalISelEmitter.cpp
+++ b/llvm/utils/TableGen/GlobalISelEmitter.cpp
@@ -2346,6 +2346,20 @@ void GlobalISelEmitter::emitRunCustomAction(raw_ostream &OS) {
      << "}\n";
 }
 
+bool hasBFloatType(const TreePatternNode &Node) {
+  for (unsigned I = 0, E = Node.getNumTypes(); I < E; I++) {
+    auto Ty = Node.getType(I);
+    for (auto T : Ty)
+      if (T.second == MVT::bf16 ||
+          (T.second.isVector() && T.second.getScalarType() == MVT::bf16))
+        return true;
+  }
+  for (const TreePatternNode &C : Node.children())
+    if (hasBFloatType(C))
+      return true;
+  return false;
+}
+
 void GlobalISelEmitter::run(raw_ostream &OS) {
   if (!UseCoverageFile.empty()) {
     RuleCoverage = CodeGenCoverage();
@@ -2382,6 +2396,13 @@ void GlobalISelEmitter::run(raw_ostream &OS) {
 
     if (Pat.getGISelShouldIgnore())
       continue; // skip without warning
+
+    // Skip any patterns containing BF16 types, as GISel cannot currently tell
+    // the difference between fp16 and bf16. FIXME: This can be removed once
+    // BF16 is supported properly.
+    if (hasBFloatType(Pat.getSrcPattern()))
+      continue;
+
     auto MatcherOrErr = runOnPattern(Pat);
 
     // The pattern analysis can fail, indicating an unsupported pattern.

Copy link
Collaborator

@sjoerdmeijer sjoerdmeijer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixes a codegen/runtime error that we are seeing, as reported in #120363 (comment)

I am not a GISel guy, but I have applied and verified that this patch solves the problems that we are seeing, and the change looks very reasonable, so am confident to LGTM this.

Thanks for the quick fix!

@arsenm
Copy link
Contributor

arsenm commented Jan 23, 2025

Do we have a tracking issue for bf16? Should mention to revert this there

@madhur13490
Copy link
Contributor

I am relatively new to GISel intersection FP16. In the longer term, will we have a clean solution when GISel represents types natively? As Matt said above, an issue would be good to have and link to #115133?

@davemgreen
Copy link
Collaborator Author

Do we have a tracking issue for bf16? Should mention to revert this there

I couldn't see one for BF16 specifically, but I'll add a note to #115133 and FYI @tgymnich if you run into this. Hopefully it is fairly obvious that none of the patterns are importing or types work when we can use bf16 correctly. Let me know if you know of a better place, quite a lot has changed now..

Sorry - This is running into other issues, where the opposite happens for certain intrinsics and instructions with fp16 types are using bf16 patterns. As in https://godbolt.org/z/1KcvszrY7. I have applied the fairly heavy hammer of forcing all IR with bf16 types to fall back to SDAG. This shouldn't be a problem for data-processing only instructions (load, stores, call params, shuffles, inserts, extract, select, etc there are a lot of them), but they are currently included to make sure everything is covered properly.

Copy link

github-actions bot commented Jan 24, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@davemgreen davemgreen force-pushed the gh-gi-disablebf16pats branch from a8414e3 to e33932c Compare January 24, 2025 11:52
We currently have an issue where bf16 patters can be used to match fp16 types,
as GISel does not know about the difference between the two types. This patch
explicitly disables them to make sure that they are never used.

The opposite can also happen too, where fp16 patterns are used for operators
that should be bf16. So any operations with bf16 types now cause a fallback to
SDAG. For the moment this includes data-processing only instructions (loads,
stores, shufles, etc).

The pass setup for GISel has been slightly adjusted to make sure that a verify
pass does not get added between AMD-SDAG and SIFixSGPRCopiesPass, which
otherwise can cause verifier issues when falling back.
@davemgreen davemgreen force-pushed the gh-gi-disablebf16pats branch from e33932c to 7af3b30 Compare January 27, 2025 08:28
@davemgreen
Copy link
Collaborator Author

I've tried to clean this up to avoid typesize-only operations that are already tested and include just the instructions that cause problems. Hopefully it has not missed any, and calls are still excluded entirely.

@davemgreen davemgreen merged commit 5a81a55 into llvm:main Jan 27, 2025
8 checks passed
@davemgreen davemgreen deleted the gh-gi-disablebf16pats branch January 27, 2025 22:21
@llvm-ci
Copy link
Collaborator

llvm-ci commented Jan 27, 2025

LLVM Buildbot has detected a new failure on builder openmp-offload-amdgpu-runtime running on omp-vega20-0 while building llvm at step 7 "Add check check-offload".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/30/builds/14690

Here is the relevant piece of the build log for the reference
Step 7 (Add check check-offload) failure: test (failure)
******************** TEST 'libomptarget :: amdgcn-amd-amdhsa :: api/omp_host_call.c' FAILED ********************
Exit Code: 2

Command Output (stdout):
--
# RUN: at line 1
/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/clang -fopenmp    -I /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test -I /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src  -nogpulib -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib  -fopenmp-targets=amdgcn-amd-amdhsa /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/api/omp_host_call.c -o /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/api/Output/omp_host_call.c.tmp /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib/libomptarget.devicertl.a && /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/api/Output/omp_host_call.c.tmp | /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/api/omp_host_call.c
# executed command: /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/clang -fopenmp -I /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test -I /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -nogpulib -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib -fopenmp-targets=amdgcn-amd-amdhsa /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/api/omp_host_call.c -o /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/api/Output/omp_host_call.c.tmp /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib/libomptarget.devicertl.a
# note: command had no output on stdout or stderr
# executed command: /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/api/Output/omp_host_call.c.tmp
# note: command had no output on stdout or stderr
# error: command failed with exit status: -11
# executed command: /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/api/omp_host_call.c
# .---command stderr------------
# | FileCheck error: '<stdin>' is empty.
# | FileCheck command line:  /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/api/omp_host_call.c
# `-----------------------------
# error: command failed with exit status: 2

--

********************


@llvm-ci
Copy link
Collaborator

llvm-ci commented Jan 28, 2025

LLVM Buildbot has detected a new failure on builder llvm-clang-x86_64-expensive-checks-debian running on gribozavr4 while building llvm at step 6 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/16/builds/12788

Here is the relevant piece of the build log for the reference
Step 6 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'LLVM :: CodeGen/AMDGPU/llvm.amdgcn.mfma.gfx950.ll' FAILED ********************
Exit Code: 1

Command Output (stderr):
--
RUN: at line 2: /b/1/llvm-clang-x86_64-expensive-checks-debian/build/bin/llc -mtriple=amdgcn -mcpu=gfx950 -global-isel=0 < /b/1/llvm-clang-x86_64-expensive-checks-debian/llvm-project/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.mfma.gfx950.ll | /b/1/llvm-clang-x86_64-expensive-checks-debian/build/bin/FileCheck -enable-var-scope --check-prefixes=GCN,SDAG /b/1/llvm-clang-x86_64-expensive-checks-debian/llvm-project/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.mfma.gfx950.ll
+ /b/1/llvm-clang-x86_64-expensive-checks-debian/build/bin/FileCheck -enable-var-scope --check-prefixes=GCN,SDAG /b/1/llvm-clang-x86_64-expensive-checks-debian/llvm-project/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.mfma.gfx950.ll
+ /b/1/llvm-clang-x86_64-expensive-checks-debian/build/bin/llc -mtriple=amdgcn -mcpu=gfx950 -global-isel=0
RUN: at line 3: /b/1/llvm-clang-x86_64-expensive-checks-debian/build/bin/llc -mtriple=amdgcn -mcpu=gfx950 -global-isel=1 -global-isel-abort=2 < /b/1/llvm-clang-x86_64-expensive-checks-debian/llvm-project/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.mfma.gfx950.ll | /b/1/llvm-clang-x86_64-expensive-checks-debian/build/bin/FileCheck -enable-var-scope --check-prefixes=GCN,GISEL /b/1/llvm-clang-x86_64-expensive-checks-debian/llvm-project/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.mfma.gfx950.ll
+ /b/1/llvm-clang-x86_64-expensive-checks-debian/build/bin/FileCheck -enable-var-scope --check-prefixes=GCN,GISEL /b/1/llvm-clang-x86_64-expensive-checks-debian/llvm-project/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.mfma.gfx950.ll
+ /b/1/llvm-clang-x86_64-expensive-checks-debian/build/bin/llc -mtriple=amdgcn -mcpu=gfx950 -global-isel=1 -global-isel-abort=2
warning: Instruction selection used fallback path for test_mfma_f32_16x16x32_bf16
warning: Instruction selection used fallback path for test_mfma_f32_16x16x32_bf16__flags
warning: Instruction selection used fallback path for test_mfma_f32_16x16x32_bf16_no_agpr__vgprcd
error: <unknown>:0:0: no registers from class available to allocate in function 'test_mfma_f32_16x16x32_bf16_no_agpr__vgprcd'

# After Virtual Register Rewriter
********** INTERVALS **********


@mikaelholmen
Copy link
Collaborator

mikaelholmen commented Jan 28, 2025

Hi @davemgreen

With this patch

llvm-lit -av test/CodeGen/AMDGPU/llvm.amdgcn.mfma.gfx950.ll

fails when llc is compiled with EXPENSIVE_CHECKS on:

# After Virtual Register Rewriter
********** INTERVALS **********
VGPR9_LO16 EMPTY
VGPR9_HI16 EMPTY
RegMasks:
********** MACHINEINSTRS **********
# Machine code for function test_mfma_f32_16x16x32_bf16_no_agpr__vgprcd: NoPHIs, TracksLiveness, NoVRegs, TiedOpsRewritten, FailedRegAlloc, TracksDebugUserValues
Function Live Ins: $sgpr4_sgpr5

0B	bb.0 (%ir-block.0):
	  liveins: $sgpr4_sgpr5
64B	  early-clobber renamable $sgpr8_sgpr9_sgpr10_sgpr11_sgpr12_sgpr13_sgpr14_sgpr15 = S_LOAD_DWORDX8_IMM_ec renamable $sgpr4_sgpr5, 52, 0 :: (dereferenceable invariant load (s256) from %ir.arg0.kernarg.offset, align 4, addrspace 4)
80B	  early-clobber renamable $sgpr0_sgpr1_sgpr2_sgpr3 = S_LOAD_DWORDX4_IMM_ec renamable $sgpr4_sgpr5, 84, 0 :: (dereferenceable invariant load (s128) from %ir.arg2.kernarg.offset, align 4, addrspace 4)
96B	  renamable $vgpr0_vgpr1_vgpr2_vgpr3 = COPY renamable $sgpr8_sgpr9_sgpr10_sgpr11
112B	  renamable $vgpr8 = V_MOV_B32_e32 0, implicit $exec
128B	  early-clobber renamable $sgpr6_sgpr7 = S_LOAD_DWORDX2_IMM_ec killed renamable $sgpr4_sgpr5, 36, 0 :: (dereferenceable invariant load (s64) from %ir.out.kernarg.offset, align 4, addrspace 4)
144B	  renamable $vgpr4_vgpr5_vgpr6_vgpr7 = COPY killed renamable $sgpr12_sgpr13_sgpr14_sgpr15
160B	  renamable $vgpr10_vgpr11_vgpr12_vgpr13 = COPY killed renamable $sgpr0_sgpr1_sgpr2_sgpr3
168B	  renamable $agpr0_agpr1_agpr2_agpr3 = COPY killed renamable $vgpr10_vgpr11_vgpr12_vgpr13
176B	  renamable $agpr0_agpr1_agpr2_agpr3 = V_MFMA_F32_16X16X32_BF16_e64 killed $vgpr0_vgpr1_vgpr2_vgpr3, killed $vgpr4_vgpr5_vgpr6_vgpr7, killed $agpr0_agpr1_agpr2_agpr3, 0, 0, 0, implicit $mode, implicit $exec
184B	  renamable $vgpr0_vgpr1_vgpr2_vgpr3 = COPY killed renamable $agpr0_agpr1_agpr2_agpr3
192B	  GLOBAL_STORE_DWORDX4_SADDR killed renamable $vgpr8, killed renamable $vgpr0_vgpr1_vgpr2_vgpr3, killed renamable $sgpr6_sgpr7, 0, 0, implicit $exec :: (store (s128) into %ir.out.load, addrspace 1)
208B	  S_ENDPGM 0

# End machine code for function test_mfma_f32_16x16x32_bf16_no_agpr__vgprcd.

*** Bad machine code: isRenamable set on reserved register ***
- function:    test_mfma_f32_16x16x32_bf16_no_agpr__vgprcd
- basic block: %bb.0  (0x55dab89e8b30) [0B;224B)
- instruction: 168B	renamable $agpr0_agpr1_agpr2_agpr3 = COPY killed renamable $vgpr10_vgpr11_vgpr12_vgpr13
- operand 0:   renamable $agpr0_agpr1_agpr2_agpr3

*** Bad machine code: isRenamable set on reserved register ***
- function:    test_mfma_f32_16x16x32_bf16_no_agpr__vgprcd
- basic block: %bb.0  (0x55dab89e8b30) [0B;224B)
- instruction: 176B	renamable $agpr0_agpr1_agpr2_agpr3 = V_MFMA_F32_16X16X32_BF16_e64 killed $vgpr0_vgpr1_vgpr2_vgpr3, killed $vgpr4_vgpr5_vgpr6_vgpr7, killed $agpr0_agpr1_agpr2_agpr3, 0, 0, 0, implicit $mode, implicit $exec
- operand 0:   renamable $agpr0_agpr1_agpr2_agpr3

*** Bad machine code: isRenamable set on reserved register ***
- function:    test_mfma_f32_16x16x32_bf16_no_agpr__vgprcd
- basic block: %bb.0  (0x55dab89e8b30) [0B;224B)
- instruction: 184B	renamable $vgpr0_vgpr1_vgpr2_vgpr3 = COPY killed renamable $agpr0_agpr1_agpr2_agpr3
- operand 1:   killed renamable $agpr0_agpr1_agpr2_agpr3
LLVM ERROR: Found 3 machine code errors.

Can also be seen if you use a normal build and just add "-verify-machineinstrs" to the command on line 3:
llc -mtriple=amdgcn -mcpu=gfx950 -global-isel=1 -global-isel-abort=2 < test/CodeGen/AMDGPU/llvm.amdgcn.mfma.gfx950.ll -verify-machineinstrs

@davemgreen
Copy link
Collaborator Author

Thanks for the heads up - it looks like after we fall back from global-isel to SDAG, the verifier gets called, which calls getReservedRegs which uses SIMachineFunctionInfo::usesAGPRs which caches the result of UsesAGPRs. Because we have just fallen-back the function is empty and it incorrectly gets cached to false. This then causes a regallocation failure. One way around it might be to make sure we do not call getReservedRegs on an empty function in the verifier, but it may be better to reset that cached value somewhere, I'm just not sure where.

@davemgreen
Copy link
Collaborator Author

New, hopefully better idea in #124799

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants