Force FP32 compute in GLM4 FFN Down #13101
Conversation
RDNA3/GFX1100, and this is the first time I've ever seen actual sentences come out of a GLM 4 model. While the text gen delta is also non-existent, for me the prompt processing delta is absolutely horrendous: 230 t/s on this PR vs 1016 t/s on master for Gemma QAT. I think a better solution might just be to force-enable MMQ, which produces good outputs on GLM 4 and still has usable prompt processing at 636 t/s. I wonder if part of the terrible RDNA3 perf is due to the poor F32 library support described in #12372 (comment)?
The decision of whether to use F16 or F32 compute should be made based on what is set using ggml_mul_mat_set_prec.
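For reference, ggml_mul_mat_set_prec is called on the result node of a matmul. A minimal sketch of how a per-tensor F32 override could be wired up (the helper and parameter names here are hypothetical, not the actual llama.cpp code):

```c
#include "ggml.h"
#include <stdbool.h>

// Hypothetical helper: build a down-projection matmul and, when requested,
// pin its compute precision to F32 so backends that honor GGML_PREC_F32
// (e.g. the CUDA cuBLAS path) avoid FP16 accumulation.
static struct ggml_tensor * build_ffn_down(
        struct ggml_context * ctx,
        struct ggml_tensor  * ffn_down_w,  // down-projection weight
        struct ggml_tensor  * ffn_inp,     // activations entering the projection
        bool                  force_f32) { // e.g. true for GLM4
    struct ggml_tensor * cur = ggml_mul_mat(ctx, ffn_down_w, ffn_inp);
    if (force_f32) {
        ggml_mul_mat_set_prec(cur, GGML_PREC_F32);
    }
    return cur;
}
```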
Oh, interesting, I assume that's what the precision setting is for. So theoretically, if I find which tensor/op is causing issues, I can add a case for GLM4 to force FP32 for that specific tensor with ggml_mul_mat_set_prec?
Not quite, the current logic around which precision to use is not very good. As it is, the code will convert …
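Roughly speaking, the backend has to turn the precision requested on the node into a choice of GEMM compute type. An illustrative sketch of that decision (reading op_params[0] is an assumption based on how ggml_mul_mat_set_prec stores its argument; this is not the actual CUDA-backend code):

```c
#include <cublas_v2.h>
#include "ggml.h"

// Illustrative only: map the precision requested on a GGML_OP_MUL_MAT node
// to a cuBLAS compute type.
static cublasComputeType_t pick_compute_type(const struct ggml_tensor * dst) {
    const enum ggml_prec prec = (enum ggml_prec) dst->op_params[0];
    // FP16 accumulation is fast but saturates near 65504, which is the
    // overflow mode reported for GLM4's FFN down projection.
    return prec == GGML_PREC_DEFAULT ? CUBLAS_COMPUTE_16F : CUBLAS_COMPUTE_32F;
}
```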
This reverts commit 6efd872.
Okay, I'm a bit sleep-deprived, but the overflow happens inside the FFN. Should I:
I think this is the correct solution for a hotfix; the top priority should be to get correct results even if the performance is suboptimal. The performance can then be improved in subsequent PRs.
Co-authored-by: Johannes Gäßler <[email protected]>
@Beinsezii Could you re-test on AMD for correctness / performance impact if you have the time? I've done a few tests and it seems to be correct on my end with this, even with long-ish context input / multi-turn.
I'm merging this for now; if more fixes are needed, those can be done in separate PRs.
While this PR fixes the issue with my AMD 7900 XTX under ROCm, I am still getting the repeating characters with the same 7900 XTX under the Vulkan backend. The issue does differ slightly from the one that I previously experienced under ROCm:
They do share similarities:
Hmm, I don't have a working Vulkan build setup, so I'm not sure I can help troubleshoot that part. The fact that < 64 tokens still works, and that the breakage only happened on the second turn / with longer prompts on Volta/ROCm, was because it was overflowing during prompt processing, I think. The Vulkan issue almost sounds like it happens during generation instead, but I'm mostly just guessing.
* Force FP32 compute in cuBLAS GEMM
* Revert "Force FP32 compute in cuBLAS GEMM"
  This reverts commit 6efd872.
* Force F32 compute in GLM4 ffn down
* Edit comment to clarify issue
  Co-authored-by: Johannes Gäßler <[email protected]>
---------
Co-authored-by: Johannes Gäßler <[email protected]>
See discussion in #12946 - the new GLM4 models seem to have issues with FP16 matmul. Should be reproducible with GGML_CUDA_FORCE_CUBLAS=1 on any card that supports FP16. Based on the comment for GGML_CUDA_FORCE_MMQ in docs/build.md, this would most likely affect V100, CDNA and RDNA3+, though I only have a Volta card to test on. Unsure if anything else ever hits this specific branch of the code. Seems like it largely affects prompt processing speed (compute bound), but unsure if there's a better way to fix this specific code path other than adding more checks.
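The underlying failure mode is FP16 accumulation running out of range. A toy illustration of why a wide FFN down projection overflows in half precision (the numbers and loop width are made up, not GLM4 data; requires a compiler with _Float16 support, e.g. recent GCC/Clang on x86-64):

```c
#include <stdio.h>

int main(void) {
    // Accumulate many moderately sized products, as a wide FFN down
    // projection would. Each term fits comfortably in FP16, but the
    // running sum does not.
    float    acc_f32 = 0.0f;
    _Float16 acc_f16 = 0.0f;
    for (int i = 0; i < 12288; ++i) {   // arbitrary width, purely illustrative
        acc_f32 += 32.0f;
        acc_f16 += (_Float16) 32.0f;    // FP16 max is ~65504, so this overflows
    }
    printf("f32 accumulator: %f\n", acc_f32);           // 393216
    printf("f16 accumulator: %f\n", (float) acc_f16);   // inf
    return 0;
}
```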
Master:
PR:
(Also ran test-backend-ops, seems ok)