CUDA: batched+noncont MMQ, refactor bs>1 MoE code #13199
Conversation
I also see a good improvement on Windows:
* origin/master:
sync : ggml
whisper : add check that target name exists (whisper/3103)
ggml : suppress Windows compiler warnings (whisper/3075)
mtmd : add vision support for Mistral Small 3.1 (ggml-org#13231)
arg : remove CURLINFO_EFFECTIVE_METHOD (ggml-org#13228)
llama-model : fix the reported size class for nomic-embed-text-v2-moe (ggml-org#13223)
sync : ggml
ggml : fix ggml_gallocr_ptr type (ggml/1205)
cuda : fix unused variable compile warning (whisper/0)
CUDA: batched+noncont MMQ, refactor bs>1 MoE code (ggml-org#13199)
arg : -hf do not fail if url mismatch (ggml-org#13219)
fix typo: `n_ctx_pre_seq` -> `n_ctx_per_seq` (ggml-org#13221)
convert : improve model arch handling (ggml-org#13122)
llava : remove duplicate include (ggml-org#13207)
common : add -jf / --json-schema-file flag (ggml-org#12011)
@JohannesGaessler Sadly I'm getting:
If I revert the commit, then everything works fine. I'm using H100 and CUDA 12.6.
Using which model and which exact command?
@JohannesGaessler I used Qwen 235B in BF16 format on 8xH100s during imatrix. I think small batches are fine, but it seems to error out once the ubatch-size / physical batch size exceeds some limit.
…gml-org#13199)"" This reverts commit 41b4171.
…gml-org#13199)"" This reverts commit 2a00d34.
Oh, actually I noticed that even Qwen 30B doesn't work either:
It seems the imatrix code on master has a bug where out-of-bounds writes can occur. On my system this resulted in a segfault, but can you check whether #13286 fixes your issue before I investigate further?
There was another bug affecting model evaluation in general: https://github.com/ggml-org/llama.cpp/pull/13294/files
Wow, great job @JohannesGaessler, your recent PRs give impressive performance on fully CUDA-offloaded Qwen3-30B-A3B. I went back this morning and re-tested after the couple of fixups applied in the last 24 hours or so.
I'm presuming @danielhanchen used a command something like this from what I can glean:
I too am curious, as I'm wondering about the effects of increasing the context length from the default value of 512 when computing an imatrix, as well as the value of adding model-specific chat templates to the imatrix corpus. I have some visualizations of three different imatrices computed for Qwen3-30B-A3B that show a lot of similarity, though Unsloth's has a few different patterns in the tea leaves. But knowing the exact command for how each imatrix was created would be useful. Anyway, thanks, and I'm really impressed with how the friendly competition and collaboration across organizations is lifting up the whole community. Cheers!
This PR makes the following changes:

- `GET_ROWS` is extended to allow for type conversion during the operation. This enables `src1` to be sorted by expert via `GET_ROWS`, as well as the inverse operation on `dst`. The sorting in either direction can be done in a single kernel launch, so the dedicated kernels that have been used so far can be removed (see the sketch below).
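As an illustration of the gather-with-conversion idea, here is a minimal CUDA sketch; the kernel name, shapes, and launch parameters are assumptions for illustration, not the actual llama.cpp kernel:

```cuda
#include <cuda_fp16.h>
#include <cstdint>

// Sketch only: gather rows of an FP32 matrix into an FP16 destination
// according to a row-index array. With the index array holding token rows
// sorted by expert, one launch both reorders src1 and converts its type;
// swapping the roles of the indexed and linear side gives the inverse
// (scatter) operation for dst.
__global__ void gather_rows_f32_to_f16(
        const float   * __restrict__ src,      // [n_rows_src, n_cols]
        const int32_t * __restrict__ row_idx,  // [n_rows_dst], source row per dst row
        __half        * __restrict__ dst,      // [n_rows_dst, n_cols]
        const int n_cols) {
    const int     dst_row = blockIdx.x;        // one thread block per destination row
    const int64_t src_row = row_idx[dst_row];

    for (int col = threadIdx.x; col < n_cols; col += blockDim.x) {
        dst[(int64_t) dst_row*n_cols + col] = __float2half(src[src_row*n_cols + col]);
    }
}

// Example launch: one block per destination row, 256 threads per block.
// gather_rows_f32_to_f16<<<n_rows_dst, 256, 0, stream>>>(src, row_idx, dst, n_cols);
```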
Performance changes

Performance increases most for small batch sizes and fast GPUs, where the kernel launch overhead has more impact. I think there is still a lot of potential for optimization in the MMQ kernel. For the generic MoE code there are currently still unnecessary type conversions for FP16 and BF16; eliminating them will require some changes to the cuBLAS code. I did not try `cublasGemmGroupedBatchedEx` because, to my disappointment, it only supports `CUBLAS_COMPUTE_32F`, so no tensor cores. It may be worthwhile to instead do an implementation with regular batched GEMM by padding all `src1` matrices to the max. number of tokens per expert; on modern GPUs this may end up being faster even if some of the work is wasted.
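To make the padding idea concrete, a hedged host-side sketch (assuming FP16 weights and activations with FP32 accumulation; the function name, memory layouts, and strides are illustrative and not code from this PR) could look like this:

```cuda
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Sketch only: pad every expert's src1 block to n_tok_max columns so all
// per-expert matrices share identical strides, then issue one strided-batched
// GEMM over all experts. Some work on the padding columns is wasted, but the
// whole MoE layer becomes a single tensor-core-eligible GEMM call.
static cublasStatus_t moe_padded_batched_gemm(
        cublasHandle_t handle,
        const __half * W,        // n_expert weight matrices, each k x m, column-major
        const __half * src1_pad, // n_expert activation blocks, each k x n_tok_max, zero-padded
        float        * dst_pad,  // n_expert output blocks, each m x n_tok_max
        int m, int k, int n_tok_max, int n_expert) {
    const float alpha = 1.0f;
    const float beta  = 0.0f;

    return cublasGemmStridedBatchedEx(
        handle, CUBLAS_OP_T, CUBLAS_OP_N,
        m, n_tok_max, k,
        &alpha,
        W,        CUDA_R_16F, k, (long long) k*m,          // A_e = W_e (transposed in the GEMM)
        src1_pad, CUDA_R_16F, k, (long long) k*n_tok_max,  // B_e = padded activations
        &beta,
        dst_pad,  CUDA_R_32F, m, (long long) m*n_tok_max,  // C_e = padded outputs
        n_expert,
        CUBLAS_COMPUTE_32F,      // FP32 accumulation; FP16 inputs can still use tensor cores
        CUBLAS_GEMM_DEFAULT);
}
```

Whether such a padded batched GEMM actually beats a grouped or per-expert approach would depend on how unevenly tokens are distributed across experts, since the padding work grows with the most heavily loaded expert.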