sycl : variable sg_size support for mmvq kernels #12336

Merged: 1 commit into ggml-org:master on Mar 12, 2025

Conversation

@Alcpz (Collaborator) commented Mar 11, 2025

The MMVQ kernels were based on CUDA's mmvq kernels, which were tailored for subgroups of size 32. While bigger subgroups were considered, smaller subgroups were not. These changes allow for more flexibility in the kernels and could reduce the codebase size if we make them fast enough on Intel architectures.

The introduced changes distribute more work per thread / workitem if WARP_SIZE < 32.
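As a rough illustration of the idea (plain host-side C++, not the actual SYCL kernel; row_dot and the flat float layout are simplified stand-ins for the real block/vec_dot code), each lane strides over a row by the subgroup size, so a 16-wide subgroup simply covers twice as many blocks per lane while the total work stays the same:

```cpp
#include <cstdio>
#include <vector>

// Simulate one subgroup computing a row dot product: each lane owns the
// elements lane, lane + warp_size, lane + 2 * warp_size, ...
float row_dot(const std::vector<float> & x, const std::vector<float> & y, int warp_size) {
    std::vector<float> partial(warp_size, 0.0f);
    for (int lane = 0; lane < warp_size; ++lane) {
        for (size_t i = lane; i < x.size(); i += warp_size) {
            partial[lane] += x[i] * y[i];   // stands in for the per-block vec_dot
        }
    }
    float sum = 0.0f;                       // on the device this is a subgroup reduction
    for (float p : partial) sum += p;
    return sum;
}

int main() {
    std::vector<float> x(256, 1.0f), y(256, 2.0f);
    // Same result either way; a 16-wide subgroup just does two blocks per lane.
    std::printf("warp 32: %g, warp 16: %g\n", row_dot(x, y, 32), row_dot(x, y, 16));
}
```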

This PR doesn't affect performance.

QK_WARP_SIZE is still needed for some kernels that still require subgroup size == 32, so I am renaming the existing variable for now.

Note: To enable these changes on Intel devices, further changes are required in the entry point of the MUL_MAT op.

Note 2: This code path requires FMA intrinsics that may not be available on all architectures; even more changes are required to make it the default path on Intel devices without potentially losing support for older archs.

@github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) labels on Mar 11, 2025
@Rbiessy (Collaborator) left a comment

LGTM

@Alcpz force-pushed the Alcpz/variable-sg-mmvq branch from 403a5ad to 65ecd1a on March 11, 2025 20:43
@qnixsynapse (Collaborator) left a comment

Awesome. Thank you!
LGTM

@Alcpz merged commit 363f8c5 into ggml-org:master on Mar 12, 2025
47 checks passed
ishaangandhi pushed a commit to ishaangandhi/llama.cpp that referenced this pull request Mar 12, 2025
jpohhhh pushed a commit to Telosnex/llama.cpp that referenced this pull request Mar 14, 2025
@NeoZhangJianyu (Collaborator)

@Alcpz
Renaming QK_WARP_SIZE to WARP_SIZE is wrong for Intel GPUs.

QK_WARP_SIZE is 32, while WARP_SIZE is 16 on Intel GPUs.
This change will lead to issues on Intel GPUs.

I suggest reverting this PR, since most of its code is just the rename.
The other code could be added back in a new PR.

@Alcpz (Collaborator, Author) commented Mar 19, 2025

What do you mean? I've tested the changes on multiple Intel GPUs and there are no issues in the kernels changed here; that's the purpose of these changes.
There are still kernels untouched by this PR that keep using QK_WARP_SIZE; it's not a wholesale substitution. Have you encountered any issues on Intel GPUs, so that I can also fix the variable for those cases?

@NeoZhangJianyu (Collaborator)

QK_WARP_SIZE is 32 and WARP_SIZE is 16 for Intel GPUs.
For this reason, these two macros cannot simply replace each other.

QK_WARP_SIZE was introduced in PR a9554e2.

@luoyu-intel Could you provide more info about it?

@Alcpz (Collaborator, Author) commented Mar 19, 2025

Yes, I understand what you are saying. The purpose of these changes is just to allow the mmvq kernels (which is also not the path followed on Intel architectures) to support subgroup size 16, so we can use WARP_SIZE, which is the variable defined when compiling (GGML_SYCL_WARP_SIZE in the CMakeLists). I left QK_WARP_SIZE in the other kernels that Intel uses outside of these changes.

I checked whether entirely removing the variable was possible, but that would have introduced errors in the dmmv kernels (like dequantize_mul_mat_vec_q6_k) when the mul_mat values are reduced at the end of the kernel. Maybe that's what you are referring to?
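As a toy host-side sketch of that reduction problem (not the actual dmmv code), a reduction written for 32 per-lane partial sums silently drops half of the result if it only folds over a 16-wide subgroup:

```cpp
#include <cstdio>
#include <vector>

// Fold the first `width` per-lane partial sums, the way a subgroup-wide
// shuffle reduction only sees the lanes of the current subgroup.
float reduce_over(const std::vector<float> & partial, int width) {
    float sum = 0.0f;
    for (int lane = 0; lane < width; ++lane) {
        sum += partial[lane];
    }
    return sum;
}

int main() {
    std::vector<float> partial(32, 1.0f);                          // kernel produced 32 partials
    std::printf("reduce over 32: %g\n", reduce_over(partial, 32)); // 32 (correct)
    std::printf("reduce over 16: %g\n", reduce_over(partial, 16)); // 16 (half is lost)
}
```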

@luoyu-intel (Contributor)

As far as I can remember, QK_WARP_SIZE was created because I didn't rewrite the SYCL kernels for these QK dtypes. I rewrote some SYCL kernels that were directly translated from CUDA to suit Intel's actual GPU warp size. WARP_SIZE=16 may produce incorrect results from these QK kernels.

@NeoZhangJianyu (Collaborator)

@Alcpz
How about this suggestion?
I don't think replacing 32 with 16 is a good method. The unit tests can't catch the small error (the default error tolerance is 0.5, which is not strict enough for some user cases).

@Alcpz (Collaborator, Author) commented Mar 25, 2025

Sorry @NeoZhangJianyu, I don't understand why you are suggesting this.
It's true that the unit tests are not very good for some quantizations, but I haven't found a correctness issue.
For example, the q4_K unit test wouldn't fail for subgroup_size 16 before this patch because of the size of the input (m, n, k = 16, 1, 256): 256 is a single superblock, and every subgroup in q4_K is programmed to perform calculations over 2 superblocks. If I edited the unit test to have two superblocks (m, n, k = 16, 1, 512), the error would be detected.

I don't mind reverting these changes if they actually break something, but I still haven't found a case where this happens. If you can provide an example of this breaking models, or point me to a model that is broken due to this change, I will be more than happy to revert and go back over the implementation to ensure correctness.

@NeoZhangJianyu (Collaborator)

Why do you think using 16 to replace 32 won't impact this case?
That will make the sum result wrong.
Because the UT cases use small float values, the error in the sum is ignored by the UT (we allow an error of 0.5).

I can't provide a detailed case to prove the side effect for now.

I want to revert the code and use QK_WARP_SIZE in place of WARP_SIZE.
This change won't impact NVIDIA/AMD GPUs, since both are 32.
But it will restore the legacy behavior of Intel GPUs.

@qnixsynapse (Collaborator)

But it will restore the legacy behavior of Intel GPUs.

Using QK_WARP_SIZE is not a "legacy" behavior; it's a workaround.

@NeoZhangJianyu I have been testing non-UT cases here and this PR doesn't seem to cause any problems so far.

If it does at some point, I will open a PR to revert it.
But my suggestion is not to revert for now.
Not all kernels require a QK_WARP_SIZE of 32; only Q4_K, Q5_K and Q6_K do, I think.

@NeoZhangJianyu (Collaborator)

There are some accuracy issues with real LLMs using Q4_K and Q5_K.
I think reverting will help with them.

@Alcpz (Collaborator, Author) commented Mar 26, 2025

But how will reverting this code help if the functions inside mmvq shouldn't be executed on the Intel device path unless you change ggml_sycl_mul_mat?
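(Roughly the shape of the dispatch being discussed, with placeholder names rather than the actual ggml_sycl_mul_mat code: unless that choice is changed, Intel devices keep taking the dmmv path and never reach the mmvq kernels changed here.)

```cpp
#include <cstdio>

enum class Vendor { Intel, Nvidia, Amd };

// Hypothetical stand-in for the mul_mat_vec dispatch described above.
void mul_mat_vec_dispatch(Vendor vendor) {
    const bool use_mmvq = (vendor != Vendor::Intel);  // assumption for illustration
    if (use_mmvq) {
        std::puts("mmvq path (the kernels changed in this PR)");
    } else {
        std::puts("dmmv path (what Intel devices currently use)");
    }
}

int main() {
    mul_mat_vec_dispatch(Vendor::Intel);   // dmmv
    mul_mat_vec_dispatch(Vendor::Nvidia);  // mmvq
}
```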

This is why I am confused about reverting this and why I am asking you for a more detailed explanation. Again, if you know which LLMs have accuracy issues, we should check what is going on and fix the problems there, not revert some work to see if we get better performance/accuracy by chance.

@NeoZhangJianyu (Collaborator)

Could you explain why using 16 to replace 32 won't impact the result?
I just want to revert this operation. :(

I've found accuracy issues in at least 2 models and hope to check them from a clean base.
This issue (32 -> 16) should be fixed first.

I don't want to revert the whole PR.
Just that part of the code.

@Alcpz (Collaborator, Author) commented Mar 26, 2025

Alright, the key thing here is that switching back to 32, even if it's a small change, makes the key part of the PR, the changes to mul_mat_vec_q, redundant. I'm going to be a bit verbose explaining the PR, so sorry if you already understand most of it; I want to avoid more confusion.

Before the PR, changing from 32 to 16 would make the subgroup skip half of the quants in a superblock, due to how these kernels are designed (one subgroup per workgroup).

For Q4_K it is slightly different, as Q4_K was programmed to tackle more than one superblock per workgroup, so you end up having the first 16 threads dealing with the first superblock and the second 16 threads with the second superblock. In practice, this means that if you changed 32 to 16, unit tests would pass even though the computations were incorrect, because for size m = 256 you would only have a single superblock and the results would be correct.
If you added a test case with m = 512, to have two superblocks, the unit test would fail. This seems like a precision issue at first, but it's actually a correctness issue. I don't know if this is the cause of the precision issues you are mentioning or if there is another problem as well. Edit: This is why I think it doesn't impact the result. I also checked with some Q4_K_M models and they returned correct prompts.
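(To make the superblock arithmetic concrete, using QK_K = 256 as in ggml and leaving everything else out:)

```cpp
#include <cstdio>

int main() {
    const int QK_K = 256;                   // elements per Q4_K superblock
    const int sizes[] = {256, 512};
    for (int k : sizes) {
        std::printf("k = %d -> %d superblock(s)\n", k, k / QK_K);
    }
    // k = 256: one superblock, 16 lanes are enough, so the test passes anyway.
    // k = 512: two superblocks; the lanes that should handle the second one
    //          are missing with subgroup size 16, so the result is wrong.
}
```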

This loop was added to ensure that all the work is distributed correctly and that the number of operations remains the same.
The order of the operations is slightly different, though: instead of doing a single muladd of a vecdot per thread and then reducing, we are doing more vecdots per thread and then reducing them. But that also already happens if you calculate the vecdot for a complete model. Due to the nature of floating point, I don't entirely understand why smaller test cases are not an issue and bigger test cases are: do you mean bigger layer sizes / weights? Or bigger values?
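(As a tiny reminder of why regrouping a reduction can move the result within the test tolerance: float addition is not associative, so doing more vecdots per thread before reducing can differ in the last bits from the old order.)

```cpp
#include <cstdio>

int main() {
    float a = 1e8f, b = -1e8f, c = 1.0f;
    std::printf("(a + b) + c = %g\n", (a + b) + c);  // 1
    std::printf("a + (b + c) = %g\n", a + (b + c));  // 0: c is absorbed into b + c
}
```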

Your small change (16 -> 32) makes that loop pointless; it would simply be better to revert the PR entirely.

Can you please tell me which build has the precision issues? That will also help in the future to ensure other PRs don't introduce problems.
Also, are you sure these kernels are the source of the precision issues? Intel GPUs shouldn't be using this code path unless you modify ggml_sycl_mul_mat.
If you are right and these kernels are being called somehow, then:
a) I'd like to solve the precision issue, not revert the code.
b) We may gain performance from ensuring we go to the dmmv code path for Intel in all cases, as DMMV currently performs better.

@NeoZhangJianyu (Collaborator)

Thank you for such a detailed explanation!
I withdraw my comment.

@NeoZhangJianyu (Collaborator) commented Mar 27, 2025

#12096
I tested with the model file from Hugging Face: granite-3.0-3b-a800m-instruct-Q4_0.gguf.
The result is different from the CPU.
After disabling the Q4_0 reorder, it's the same issue.

That issue was closed because the user used 3.2 and it passed.
But this model shows there is an accuracy issue.

It would be good if you could check it.

Thank you!

@qnixsynapse (Collaborator)

@NeoZhangJianyu Seems to be working okay here:

$ build/bin/llama-cli -n 100 -ngl 99 -p "Five easy ways to build a website" -m /home/qnixsynapse/Downloads/Weights/granite-3.0-3b-a800m-instruct-q4_k_m.gguf

build: 4967 (f17a3bb4) with Intel(R) oneAPI DPC++/C++ Compiler 2025.0.4 (2025.0.4.20241205) for x86_64-unknown-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device SYCL0 (Intel(R) Arc(TM) A750 Graphics) - 7721 MiB free
llama_model_loader: loaded meta data with 37 key-value pairs and 323 tensors from /home/qnixsynapse/Downloads/Weights/granite-3.0-3b-a800m-instruct-q4_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = granitemoe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Granite 3.0 3b A800M Instruct
llama_model_loader: - kv   3:                           general.finetune str              = instruct
llama_model_loader: - kv   4:                           general.basename str              = granite-3.0
llama_model_loader: - kv   5:                         general.size_label str              = 3B-a800M
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                               general.tags arr[str,3]       = ["language", "granite-3.0", "text-gen...
llama_model_loader: - kv   8:                     granitemoe.block_count u32              = 32
llama_model_loader: - kv   9:                  granitemoe.context_length u32              = 4096
llama_model_loader: - kv  10:                granitemoe.embedding_length u32              = 1536
llama_model_loader: - kv  11:             granitemoe.feed_forward_length u32              = 512
llama_model_loader: - kv  12:            granitemoe.attention.head_count u32              = 24
llama_model_loader: - kv  13:         granitemoe.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                  granitemoe.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  15: granitemoe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  16:                    granitemoe.expert_count u32              = 40
llama_model_loader: - kv  17:               granitemoe.expert_used_count u32              = 8
llama_model_loader: - kv  18:                          general.file_type u32              = 15
llama_model_loader: - kv  19:                      granitemoe.vocab_size u32              = 49155
llama_model_loader: - kv  20:            granitemoe.rope.dimension_count u32              = 64
llama_model_loader: - kv  21:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  22:                 granitemoe.attention.scale f32              = 0.015625
llama_model_loader: - kv  23:                 granitemoe.embedding_scale f32              = 12.000000
llama_model_loader: - kv  24:                  granitemoe.residual_scale f32              = 0.220000
llama_model_loader: - kv  25:                     granitemoe.logit_scale f32              = 6.000000
llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = refact
llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,49155]   = ["<|end_of_text|>", "<fim_prefix>", "...
llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,49155]   = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,48891]   = ["Ġ Ġ", "ĠĠ ĠĠ", "ĠĠĠĠ ĠĠ...
llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  32:                tokenizer.ggml.eos_token_id u32              = 0
llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  35:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|start_of_r...
llama_model_loader: - kv  36:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   97 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 1.92 GiB (4.88 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 22
load: token to piece cache size = 0.2826 MB
print_info: arch             = granitemoe
print_info: vocab_only       = 0
print_info: n_ctx_train      = 4096
print_info: n_embd           = 1536
print_info: n_layer          = 32
print_info: n_head           = 24
print_info: n_head_kv        = 8
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 3
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 6.0e+00
print_info: f_attn_scale     = 1.6e-02
print_info: n_ff             = 512
print_info: n_expert         = 40
print_info: n_expert_used    = 8
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 4096
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 3B
print_info: model params     = 3.37 B
print_info: general.name     = Granite 3.0 3b A800M Instruct
print_info: f_embedding_scale = 12.000000
print_info: f_residual_scale  = 0.220000
print_info: f_attention_scale = 0.015625
print_info: vocab type       = BPE
print_info: n_vocab          = 49155
print_info: n_merges         = 48891
print_info: BOS token        = 0 '<|end_of_text|>'
print_info: EOS token        = 0 '<|end_of_text|>'
print_info: PAD token        = 0 '<|end_of_text|>'
print_info: LF token         = 203 'Ċ'
print_info: EOG token        = 0 '<|end_of_text|>'
print_info: max token length = 512
load_tensors: loading model tensors, this can take a while... (mmap = true)
make_cpu_buft_list: disabling extra buffer types (i.e. repacking) since a GPU device is available
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors:   CPU_Mapped model buffer size =    40.50 MiB
load_tensors:        SYCL0 model buffer size =  1921.79 MiB
...............................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
Running with Environment Variables:
  GGML_SYCL_DEBUG: 0
  GGML_SYCL_DISABLE_OPT: 1
  GGML_SYCL_DISABLE_GRAPH: 1
Build with Macros:
  GGML_SYCL_FORCE_MMQ: no
  GGML_SYCL_F16: yes
Found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A750 Graphics|  12.55|    448|    1024|   32|  8096M|     1.6.32224.500000|
SYCL Optimization Feature:
|ID|        Device Type|Reorder|
|--|-------------------|-------|
| 0| [level_zero:gpu:0]|      Y|
llama_context:  SYCL_Host  output buffer size =     0.19 MiB
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1
init:      SYCL0 KV buffer size =   256.00 MiB
llama_context: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_context:      SYCL0 compute buffer size =   212.00 MiB
llama_context:  SYCL_Host compute buffer size =    11.01 MiB
llama_context: graph nodes  = 2024
llama_context: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 4
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
*** User-specified prompt will pre-start conversation, did you mean to set --system-prompt (-sys) instead?
main: chat template example:
<|start_of_role|>system<|end_of_role|>You are a helpful assistant<|end_of_text|>
<|start_of_role|>user<|end_of_role|>Hello<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>Hi there<|end_of_text|>
<|start_of_role|>user<|end_of_role|>How are you?<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>


system_info: n_threads = 4 (n_threads_batch = 4) / 4 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

main: interactive mode on.
sampler seed: 4136895462
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = 100, n_keep = 0

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT

userFive easy ways to build a website
assistant1. Choose a website builder: There are many website builders available, such as Wix, Squarespace, and WordPress. These platforms offer user-friendly interfaces and pre-designed templates to help you create a website quickly.

2. Select a domain name: Choose a unique and memorable domain name that reflects your website's purpose. You can register a domain name through registrars like GoDaddy or Namecheap.

3. Design your website: Use

@Alcpz (Collaborator, Author) commented Mar 27, 2025

It's OK; as long as we have constructive discussions, I think it's good that you raise your concerns.

That issue was opened before I merged this code, so you may be right that we have a bug somewhere worth investigating.

@NeoZhangJianyu (Collaborator)

$ build/bin/llama-cli -n 100 -ngl 99 -p "Five easy ways to build a website" -m /home/qnixsynapse/Downloads/Weights/granite-3.0-3b-a800m-instruct-q4_k_m.

I tested it with the Q4_0 model.
Let's keep an eye on it.

Thank you!

@NeoZhangJianyu (Collaborator)

It's OK; as long as we have constructive discussions, I think it's good that you raise your concerns.

That issue was opened before I merged this code, so you may be right that we have a bug somewhere worth investigating.

Yes.
In fact, I find that some bugs in the SYCL backend are due to PRs that were not fully tested.
Hopefully CI will help with that!
