Force FP32 compute in GLM4 FFN Down #13101

Merged: 4 commits merged into ggml-org:master from city96:glm4_cublas_fix on Apr 25, 2025

Conversation

@city96 (Contributor) commented Apr 24, 2025

See discussion in #12946 - the new GLM4 models seem to have issues with FP16 matmul. This should be reproducible with GGML_CUDA_FORCE_CUBLAS=1 on any card that supports FP16.

Based on the comment for GGML_CUDA_FORCE_MMQ in docs/build.md, this would most likely affect V100, CDNA and RDNA3+, though I only have a Volta card to test on. I'm unsure whether anything else ever hits this specific branch of the code.

It seems to mainly affect prompt processing speed (compute bound), but I'm not sure there is a better way to fix this specific code path other than adding more checks.
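To illustrate what "forcing FP32 compute in cuBLAS GEMM" means here: the FP16 operands are kept, but cuBLAS is asked to accumulate in FP32, so partial sums can no longer overflow the FP16 range. The snippet below is a minimal sketch of that idea rather than the actual ggml-cuda code; the helper name and fixed alpha/beta are illustrative.

```cpp
// Minimal sketch: FP16 inputs/outputs, FP32 accumulation via cublasGemmEx.
// Not the actual ggml-cuda code path; helper name and parameters are illustrative.
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Column-major C(m,n) = A(m,k) * B(k,n); A, B, C are device pointers.
static cublasStatus_t gemm_fp16_io_fp32_acc(cublasHandle_t handle,
                                            const __half * A, const __half * B, __half * C,
                                            int m, int n, int k) {
    const float alpha = 1.0f;
    const float beta  = 0.0f;
    // The only change vs. the fast path is CUBLAS_COMPUTE_32F instead of
    // CUBLAS_COMPUTE_16F (alpha/beta must then be FP32 as well), so the
    // accumulation happens in FP32 even though the data stays FP16.
    return cublasGemmEx(handle,
                        CUBLAS_OP_N, CUBLAS_OP_N,
                        m, n, k,
                        &alpha,
                        A, CUDA_R_16F, m,
                        B, CUDA_R_16F, k,
                        &beta,
                        C, CUDA_R_16F, m,
                        CUBLAS_COMPUTE_32F,
                        CUBLAS_GEMM_DEFAULT);
}
```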

Master:

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| glm4 32B Q6_K | 24.88 GiB | 32.57 B | CUDA | 99 | pp512 | 606.64 ± 5.58 |
| glm4 32B Q6_K | 24.88 GiB | 32.57 B | CUDA | 99 | tg128 | 15.89 ± 0.07 |
| llama 13B Q8_0 | 12.12 GiB | 12.25 B | CUDA | 99 | pp512 | 1661.09 ± 80.06 |
| llama 13B Q8_0 | 12.12 GiB | 12.25 B | CUDA | 99 | tg128 | 49.13 ± 0.03 |

PR:

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| glm4 32B Q6_K | 24.88 GiB | 32.57 B | CUDA | 99 | pp512 | 513.01 ± 7.13 |
| glm4 32B Q6_K | 24.88 GiB | 32.57 B | CUDA | 99 | tg128 | 16.05 ± 0.19 |
| llama 13B Q8_0 | 12.12 GiB | 12.25 B | CUDA | 99 | pp512 | 1575.93 ± 65.59 |
| llama 13B Q8_0 | 12.12 GiB | 12.25 B | CUDA | 99 | tg128 | 49.48 ± 0.13 |

(Also ran test-backend-ops, seems ok)

@github-actions bot added the "Nvidia GPU" (Issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) labels on Apr 24, 2025
@Beinsezii (Contributor) commented Apr 25, 2025

On RDNA3/GFX1100, this is the first time I've ever seen actual sentences come out of a GLM 4 model.

While the text-gen delta is also non-existent, for me the prompt processing delta is absolutely horrendous: 230 t/s on this PR vs 1016 t/s on master for Gemma QAT.

I think a better solution might just be to force-enable MMQ, which produces good outputs on GLM 4 and still has usable prompt processing at 636 t/s.

I wonder if part of the terrible RDNA3 perf comes from the poor F32 library support described in #12372 (comment)?

@JohannesGaessler (Collaborator) left a comment

The decision of whether to use FP16 or FP32 compute should be made based on what is set using ggml_mul_mat_set_prec.

@city96 (Contributor, Author) commented Apr 25, 2025

Oh, interesting. I assume that's what the dst->op_params[0] == GGML_PREC_DEFAULT check in use_fp16 in the cuBLAS code above is for.

So, theoretically, if I find which tensor/op is causing the issues, I can add a case for GLM4 that forces FP32 for that specific tensor with ggml_mul_mat_set_prec(tensor, GGML_PREC_F32); and it should work.
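As a rough sketch of the mechanism (assuming the public ggml.h API; the real check in the cuBLAS path has more conditions than shown here), the precision set by ggml_mul_mat_set_prec lands in the destination tensor's op_params and can be read back when picking the GEMM path:

```cpp
// Sketch only, assuming the public ggml.h API; the real ggml-cuda check
// also looks at dtypes, batch sizes and hardware capabilities.
#include "ggml.h"

// dst is the result tensor of a ggml_mul_mat() call.
static bool mul_mat_use_fp16(const struct ggml_tensor * dst) {
    // ggml_mul_mat_set_prec(dst, GGML_PREC_F32) stores the requested
    // precision in op_params[0]; GGML_PREC_DEFAULT means "no preference".
    const enum ggml_prec prec = (enum ggml_prec) dst->op_params[0];
    return prec == GGML_PREC_DEFAULT; // only take the FP16 path if F32 was not forced
}
```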

@JohannesGaessler (Collaborator)

Not quite; the current logic around which precision to use is not very good. As it is, the code will convert src0 to FP32, which is going to be very slow. I'm just saying that a solution, which still needs to be implemented, should use per-tensor precision.

@city96 (Contributor, Author) commented Apr 25, 2025

Okay, I'm a bit sleep-deprived, but the overflow happens inside the FFN.
I've edited this PR to force F32 just for the FFN down projection, and only for GLM4, which does seem to fix it.
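The shape of the change is roughly the following; this is a minimal sketch rather than the verbatim diff, and the helper name and is_glm4 flag are stand-ins for how the graph-building code actually identifies the architecture:

```cpp
// Sketch of the idea: force FP32 precision only for the GLM4 FFN down matmul.
// Helper name and is_glm4 flag are illustrative, not the actual llama.cpp code.
#include "ggml.h"

static struct ggml_tensor * build_ffn_down(struct ggml_context * ctx,
                                           struct ggml_tensor * down, // FFN down weight
                                           struct ggml_tensor * cur,  // FFN activations
                                           bool is_glm4) {
    cur = ggml_mul_mat(ctx, down, cur); // FFN down projection
    if (is_glm4) {
        // GLM4 activations overflow FP16 accumulation in this matmul,
        // so request FP32 compute for this one tensor only.
        ggml_mul_mat_set_prec(cur, GGML_PREC_F32);
    }
    return cur;
}
```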

Should I:

  • Change the title of this PR to reflect that it only applies to GLM4, with no CUDA changes
  • Extend ggml_prec with new types (e.g. GGML_PREC_F16, GGML_PREC_F16_F32C) and rework this PR, then split the GLM4 fix into a separate PR once that's done
  • (Move the GLM-specific code somewhere else, e.g. llama-model.cpp instead of llama-graph.cpp?)

@JohannesGaessler (Collaborator) left a comment

I think this is the correct solution for a hotfix; the top priority should be to get correct results even if the performance is suboptimal. The performance can then be improved in subsequent PRs.

@city96 changed the title from "CUDA: Force FP32 compute in cuBLAS GEMM" to "Force FP32 compute in GLM4 FFN Down" on Apr 25, 2025
Co-authored-by: Johannes Gäßler <[email protected]>
@city96 (Contributor, Author) commented Apr 25, 2025

@Beinsezii Could you re-test on AMD for correctness / performance impact if you have the time? I've done a few tests and it seems correct on my end with this change, even with long-ish context input / multi-turn conversations.

@JohannesGaessler merged commit 558a764 into ggml-org:master on Apr 25, 2025 (48 checks passed)
@JohannesGaessler (Collaborator)

I'm merging this for now; if more fixes are needed, those can be done in separate PRs.

@Mushoz commented Apr 25, 2025

While this PR fixes the issue with my AMD 7900 XTX under ROCm, I am still getting the repeating characters with the same 7900 XTX under the Vulkan backend. The issue does differ slightly from the one that I previously experienced under ROCm:

  1. The issue did not manifest with very short (< 64 token) inputs under ROCm. With Vulkan it does.
  2. The issue did not manifest with a very small (< 64 token) batch size under ROCm. With Vulkan, a small batch size does not avoid it.

They do share similarities:

  1. In both cases, it was usually the letter G that was repeated endlessly.

@city96 (Contributor, Author) commented Apr 25, 2025

Hmm, I don't have a working Vulkan build setup, so I'm not sure I can help troubleshoot that part.

The fact that < 64-token inputs still worked, and that the breakage only happened on the second turn or with longer prompts on Volta/ROCm, was because it was overflowing during prompt processing, I think. The Vulkan issue almost sounds like it happens during generation instead, but I'm mostly just guessing.

@city96 deleted the glm4_cublas_fix branch on April 25, 2025 at 16:33
pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Apr 28, 2025
* Force FP32 compute in cuBLAS GEMM

* Revert "Force FP32 compute in cuBLAS GEMM"

This reverts commit 6efd872.

* Force F32 compute in GLM4 ffn down

* Edit comment to clarify issue

Co-authored-by: Johannes Gäßler <[email protected]>

---------

Co-authored-by: Johannes Gäßler <[email protected]>