Force FP32 compute in GLM4 FFN Down #13101

Merged: 4 commits merged into ggml-org:master from city96:glm4_cublas_fix on Apr 25, 2025

Conversation

@city96 (Contributor) commented Apr 24, 2025

See discussion in #12946 - the new GLM4 models seem to have issues with FP16 matmul. This should be reproducible with GGML_CUDA_FORCE_CUBLAS=1 on any card that supports FP16.

Based on the comment for GGML_CUDA_FORCE_MMQ in docs/build.md, this would most likely affect V100, CDNA and RDNA3+, though I only have a Volta card to test on. I'm unsure whether anything else ever hits this specific branch of the code.

It seems to mainly affect prompt processing speed (compute bound), but I'm not sure there is a better way to fix this specific code path other than adding more checks.
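To illustrate what "forcing FP32 compute in cuBLAS GEMM" means here: the FP16 operands are kept, but cuBLAS is asked to accumulate in FP32, so partial sums can no longer overflow the FP16 range. The snippet below is a minimal sketch of that idea rather than the actual ggml-cuda code; the helper name and fixed alpha/beta are illustrative.

```cpp
// Minimal sketch: FP16 inputs/outputs, FP32 accumulation via cublasGemmEx.
// Not the actual ggml-cuda code path; helper name and parameters are illustrative.
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Column-major C(m,n) = A(m,k) * B(k,n); A, B, C are device pointers.
static cublasStatus_t gemm_fp16_io_fp32_acc(cublasHandle_t handle,
                                            const __half * A, const __half * B, __half * C,
                                            int m, int n, int k) {
    const float alpha = 1.0f;
    const float beta  = 0.0f;
    // The only change vs. the fast path is CUBLAS_COMPUTE_32F instead of
    // CUBLAS_COMPUTE_16F (alpha/beta must then be FP32 as well), so the
    // accumulation happens in FP32 even though the data stays FP16.
    return cublasGemmEx(handle,
                        CUBLAS_OP_N, CUBLAS_OP_N,
                        m, n, k,
                        &alpha,
                        A, CUDA_R_16F, m,
                        B, CUDA_R_16F, k,
                        &beta,
                        C, CUDA_R_16F, m,
                        CUBLAS_COMPUTE_32F,
                        CUBLAS_GEMM_DEFAULT);
}
```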

Master:

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| glm4 32B Q6_K | 24.88 GiB | 32.57 B | CUDA | 99 | pp512 | 606.64 ± 5.58 |
| glm4 32B Q6_K | 24.88 GiB | 32.57 B | CUDA | 99 | tg128 | 15.89 ± 0.07 |
| llama 13B Q8_0 | 12.12 GiB | 12.25 B | CUDA | 99 | pp512 | 1661.09 ± 80.06 |
| llama 13B Q8_0 | 12.12 GiB | 12.25 B | CUDA | 99 | tg128 | 49.13 ± 0.03 |

PR:

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| glm4 32B Q6_K | 24.88 GiB | 32.57 B | CUDA | 99 | pp512 | 513.01 ± 7.13 |
| glm4 32B Q6_K | 24.88 GiB | 32.57 B | CUDA | 99 | tg128 | 16.05 ± 0.19 |
| llama 13B Q8_0 | 12.12 GiB | 12.25 B | CUDA | 99 | pp512 | 1575.93 ± 65.59 |
| llama 13B Q8_0 | 12.12 GiB | 12.25 B | CUDA | 99 | tg128 | 49.48 ± 0.13 |

(Also ran test-backend-ops, seems ok)

@github-actions bot added the "Nvidia GPU" (Issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) labels on Apr 24, 2025
@Beinsezii (Contributor) commented Apr 25, 2025

On RDNA3/GFX1100, this is the first time I've ever seen actual sentences come out of a GLM 4 model.

While the text-gen delta is also non-existent, for me the prompt processing delta is absolutely horrendous: 230 t/s on this PR vs 1016 t/s on master for Gemma QAT.

I think a better solution might just be to force-enable MMQ, which produces good outputs on GLM 4 and still has usable prompt processing at 636 t/s.

I wonder if part of the terrible RDNA3 perf comes from the poor F32 library support described in #12372 (comment)?

@JohannesGaessler (Collaborator) left a comment

The decision of whether to use FP16 or FP32 compute should be made based on what is set using ggml_mul_mat_set_prec.

@city96 (Contributor, Author) commented Apr 25, 2025

Oh, interesting. I assume that's what the dst->op_params[0] == GGML_PREC_DEFAULT check in use_fp16 in the cuBLAS code above is for.

So, theoretically, if I find which tensor/op is causing the issues, I can add a case for GLM4 that forces FP32 for that specific tensor with ggml_mul_mat_set_prec(tensor, GGML_PREC_F32); and it should work.
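As a rough sketch of the mechanism (assuming the public ggml.h API; the real check in the cuBLAS path has more conditions than shown here), the precision set by ggml_mul_mat_set_prec lands in the destination tensor's op_params and can be read back when picking the GEMM path:

```cpp
// Sketch only, assuming the public ggml.h API; the real ggml-cuda check
// also looks at dtypes, batch sizes and hardware capabilities.
#include "ggml.h"

// dst is the result tensor of a ggml_mul_mat() call.
static bool mul_mat_use_fp16(const struct ggml_tensor * dst) {
    // ggml_mul_mat_set_prec(dst, GGML_PREC_F32) stores the requested
    // precision in op_params[0]; GGML_PREC_DEFAULT means "no preference".
    const enum ggml_prec prec = (enum ggml_prec) dst->op_params[0];
    return prec == GGML_PREC_DEFAULT; // only take the FP16 path if F32 was not forced
}
```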

@JohannesGaessler (Collaborator)

Not quite; the current logic around which precision to use is not very good. As it is, the code will convert src0 to FP32, which is going to be very slow. I'm just saying that a solution, which still needs to be implemented, should use per-tensor precision.

@city96 (Contributor, Author) commented Apr 25, 2025

Okay, I'm a bit sleep-deprived, but the overflow happens inside the FFN.
I've edited this PR to force F32 just for the FFN down projection, and only for GLM4, which does seem to fix it.
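The shape of the change is roughly the following; this is a minimal sketch rather than the verbatim diff, and the helper name and is_glm4 flag are stand-ins for how the graph-building code actually identifies the architecture:

```cpp
// Sketch of the idea: force FP32 precision only for the GLM4 FFN down matmul.
// Helper name and is_glm4 flag are illustrative, not the actual llama.cpp code.
#include "ggml.h"

static struct ggml_tensor * build_ffn_down(struct ggml_context * ctx,
                                           struct ggml_tensor * down, // FFN down weight
                                           struct ggml_tensor * cur,  // FFN activations
                                           bool is_glm4) {
    cur = ggml_mul_mat(ctx, down, cur); // FFN down projection
    if (is_glm4) {
        // GLM4 activations overflow FP16 accumulation in this matmul,
        // so request FP32 compute for this one tensor only.
        ggml_mul_mat_set_prec(cur, GGML_PREC_F32);
    }
    return cur;
}
```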

Should I:

  • Change the title of this PR to reflect that it only applies to GLM4, with no CUDA changes
  • Extend ggml_prec with new types (e.g. GGML_PREC_F16, GGML_PREC_F16_F32C) and rework this PR, then split the GLM4 fix into a separate PR once that's done
  • (Move the GLM-specific code somewhere else, e.g. llama-model.cpp instead of llama-graph.cpp?)

@JohannesGaessler (Collaborator) left a comment

I think this is the correct solution for a hotfix; the top priority should be to get correct results even if the performance is suboptimal. The performance can then be improved in subsequent PRs.

@city96 changed the title from "CUDA: Force FP32 compute in cuBLAS GEMM" to "Force FP32 compute in GLM4 FFN Down" on Apr 25, 2025
Co-authored-by: Johannes Gäßler <[email protected]>
@city96 (Contributor, Author) commented Apr 25, 2025

@Beinsezii Could you re-test on AMD for correctness / performance impact if you have the time? I've done a few tests and it seems correct on my end with this change, even with long-ish context input / multi-turn conversations.

@JohannesGaessler merged commit 558a764 into ggml-org:master on Apr 25, 2025 (48 checks passed)
@JohannesGaessler (Collaborator)

I'm merging this for now; if more fixes are needed, those can be done in separate PRs.

@Mushoz commented Apr 25, 2025

While this PR fixes the issue with my AMD 7900 XTX under ROCm, I am still getting the repeating characters with the same 7900 XTX under the Vulkan backend. The issue does differ slightly from the one that I previously experienced under ROCm:

  1. The issue did not manifest with very short (< 64 token) inputs under ROCm. With Vulkan it does.
  2. The issue did not manifest with a very small (< 64 token) batch size under ROCm. With Vulkan, a small batch size does not avoid it.

They do share similarities:

  1. In both cases, it was usually the letter G that was repeated endlessly.

@city96 (Contributor, Author) commented Apr 25, 2025

Hmm, I don't have a working Vulkan build setup, so I'm not sure I can help troubleshoot that part.

The fact that < 64-token inputs still worked, and that the breakage only happened on the second turn or with longer prompts on Volta/ROCm, was because it was overflowing during prompt processing, I think. The Vulkan issue almost sounds like it happens during generation instead, but I'm mostly just guessing.

@city96 deleted the glm4_cublas_fix branch on April 25, 2025 at 16:33
pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Apr 28, 2025
* Force FP32 compute in cuBLAS GEMM

* Revert "Force FP32 compute in cuBLAS GEMM"

This reverts commit 6efd872.

* Force F32 compute in GLM4 ffn down

* Edit comment to clarify issue

Co-authored-by: Johannes Gäßler <[email protected]>

---------

Co-authored-by: Johannes Gäßler <[email protected]>