
[SYCL] Fix the sub group size of Intel #8106


Merged
11 commits merged into ggml-org:master from intel_sgsize on Jul 2, 2024

Conversation

@luoyu-intel (Contributor) commented Jun 25, 2024

Changes:

  1. Sync the WARP_SIZE macro and replace the hard-coded 32 values.
  2. Add a macro condition that changes WARP_SIZE to 16 for Intel GPUs (see the sketch after this list).
  3. Remove the hard-coded work_group_size=1024 constraint of *_norm_f32.
  4. Remove unused shared local memory.
  5. Output the correct tokens for Debug builds; previously the logits were all -nan. It's caused by a dpcpp issue: [SYCL] subgroup shuffle bug when sg_size=32 intel/llvm#14274
  6. Move *_norm_f32 to the norm.cpp file.
  7. Fix the bug of ignoring src1_ncols in mmvq.
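
For illustration, here is a minimal sketch (not the actual patch) of what items 1 and 2 amount to: a single WARP_SIZE macro driven by the GGML_SYCL_WARP_SIZE compile definition that comes up later in the thread, used by a sub-group reduction instead of a hard-coded 32. The fallback value and the helper name warp_reduce_sum are assumptions.

```cpp
#include <sycl/sycl.hpp>

// Sketch only: WARP_SIZE follows the compile-time definition
// (32 for NVIDIA targets, 16 for Intel GPUs); the fallback of 32 is assumed.
#ifndef GGML_SYCL_WARP_SIZE
#define GGML_SYCL_WARP_SIZE 32
#endif
#define WARP_SIZE GGML_SYCL_WARP_SIZE

// Hypothetical helper: a sub-group sum that iterates over WARP_SIZE lanes
// rather than a literal 32, using the SYCL 2020 permute_group_by_xor algorithm.
static float warp_reduce_sum(float x, const sycl::nd_item<3> &item) {
    auto sg = item.get_sub_group();
    for (int mask = WARP_SIZE / 2; mask > 0; mask >>= 1) {
        x += sycl::permute_group_by_xor(sg, x, mask);
    }
    return x;
}
```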

With this PR, a Debug build outputs the same tokens as Release (master runs into exceptions):

 What's your name?ggml_gallocr_needs_realloc: node inp_embd is not valid
ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve
ggml_backend_sched_alloc_splits: failed to allocate graph, reserving

 Unterscheidung between the two words "name" and "namee"

Release output:

What's your name?
 Unterscheidung between the two words "name" and "namee"

Performance benefit

Intel Arc A770
37 tokens/s to 39 tokens/s (Windows + 9600K):

 Building a website can be done in 10 simple steps: Choosing a hosting platform before you begin anything, though, should come first.
llama_print_timings:        load time =    5109.68 ms
llama_print_timings:      sample time =       0.80 ms /    16 runs   (    0.05 ms per token, 20025.03 tokens per second)
llama_print_timings: prompt eval time =     271.43 ms /    14 tokens (   19.39 ms per token,    51.58 tokens per second)
llama_print_timings:        eval time =     383.75 ms /    15 runs   (   25.58 ms per token,    39.09 tokens per second)
llama_print_timings:       total time =     666.89 ms /    29 tokens
Log end

38.9 tokens/s to 41.8 tokens/s (Linux + Xeon 4th):

 Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun in the way that little girls like you usually have. So she played outside with
llama_print_timings:        load time =   14153.51 ms
llama_print_timings:      sample time =       0.72 ms /    16 runs   (    0.04 ms per token, 22284.12 tokens per second)
llama_print_timings: prompt eval time =     391.66 ms /    33 tokens (   11.87 ms per token,    84.26 tokens per second)
llama_print_timings:        eval time =     359.29 ms /    15 runs   (   23.95 ms per token,    41.75 tokens per second)
llama_print_timings:       total time =     757.59 ms /    48 tokens

@github-actions bot added the build and SYCL labels Jun 25, 2024
@mofosyne added the Review Complexity : Low label Jun 25, 2024
@airMeng requested review from airMeng and AidanBeltonS June 26, 2024 05:12
@airMeng (Collaborator) left a comment:

Need confirmation from our Codeplay mates to verify on other HW.

@airMeng (Collaborator) commented Jun 27, 2024

@AidanBeltonS @OuadiElfarouki @joeatodd if no regression on your side, I will merge it soon

@github-actions bot added the ggml label Jun 27, 2024
@luoyu-intel (Contributor, Author) commented:

The quantize functions require WARP_SIZE to equal the block size of 32; there is remaining work for this.

@luoyu-intel marked this pull request as draft June 27, 2024 07:22
@airMeng added the Review Complexity : High label and removed the Review Complexity : Low label Jun 28, 2024
@joeatodd (Contributor) left a comment:

Hey, the changes look good; nice to see some more work splitting ggml-sycl.cpp into smaller files! 👍

Tested on our side; all fine except for the two small things I've raised.

@luoyu-intel marked this pull request as ready for review July 1, 2024 01:42
@luoyu-intel (Contributor, Author) commented:

Hi @joeatodd, please review the change.

@luoyu-intel (Contributor, Author) commented Jul 1, 2024

I've fixed the src1_ncols bug of mmvq. But there is a remaining accuracy bug when the prompt length is less than 9.

This PR branch with prompt_length=9:
Once upon a time, there is a small, remote village nestled among the rolling hills of the countryside
prompt_length=8:
Once upon a time, there is gepubliceerd a new version of the game Civilization IV called Civilization IV: Col

The master branch with prompt_length=9:
Once upon a time, there is a small village nestled in the mountains. Unterscheidung between the two is not always clear
prompt_length=8:
Once upon a time, there isiech shut shut shut shut shutoni oni shut shut shutoni oni

The master branch is worse than this PR, and I can tell this is not related to the sub-group size, so I will fix it in a follow-up PR instead of this one.

@joeatodd (Contributor) left a comment:

Looks good, thanks for the changes 😃

}
}

template<int QUANT_BLOCK_TILE>
Collaborator:

Shall there be some check on the block size? We don't want all block sizes supported, right?

Contributor Author:

This function is for Q8_1, so there is only one fixed block size.

Collaborator:

Yes, shall we add some check like

static_assert(QUANT_BLOCK_TILE == 32)

Contributor Author:

QUANT_BLOCK_TILE = QK8_1 / WARP_SIZE. It's not necessary to add this check; the kernel works for any block size satisfying BlockSize % WARP_SIZE == 0.
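
For reference, a standalone sketch of the check being discussed, written as the divisibility condition rather than a fixed 32. The constant values and the struct name are illustrative, not the repository's code; QK8_1 = 32 is ggml's Q8_1 block size.

```cpp
// Illustrative constants: QK8_1 is the Q8_1 block size in ggml; WARP_SIZE is the
// sub-group width (16 on Intel GPUs, 32 on NVIDIA targets in this PR's setup).
constexpr int QK8_1     = 32;
constexpr int WARP_SIZE = 16;

template <int QUANT_BLOCK_TILE>
struct q8_1_tile_check {
    // The kernel's real requirement: the block size is a multiple of the
    // sub-group size, and each lane handles QUANT_BLOCK_TILE elements.
    static_assert(QK8_1 % WARP_SIZE == 0, "block size must be a multiple of WARP_SIZE");
    static_assert(QUANT_BLOCK_TILE == QK8_1 / WARP_SIZE, "unexpected tile size");
};

// Instantiating with the derived tile size compiles for both 16- and 32-lane sub-groups.
inline constexpr q8_1_tile_check<QK8_1 / WARP_SIZE> k_q8_1_check{};
```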

@airMeng merged commit d08c20e into ggml-org:master Jul 2, 2024
53 checks passed
@qnixsynapse (Collaborator) commented Jul 2, 2024

I somehow missed this. With this patch, the Gemma model is broken, at least on Q4_K_S.


@luoyu-intel (Contributor, Author) commented:

@qnixsynapse Is your prompt "Hi"? The SYCL backend had this repeat issue a long time ago.

@qnixsynapse (Collaborator) commented:

This is Llama-3 8B. Not sure what went wrong, but the speed has increased.

@luoyu-intel (Contributor, Author) commented:

> I've fixed the src1_ncols bug of mmvq. But there is a remaining accuracy bug when the prompt length is less than 9.
>
> This PR branch with prompt_length=9: Once upon a time, there is a small, remote village nestled among the rolling hills of the countryside prompt_length=8: Once upon a time, there is gepubliceerd a new version of the game Civilization IV called Civilization IV: Col
>
> The master branch with prompt_length=9: Once upon a time, there is a small village nestled in the mountains. Unterscheidung between the two is not always clear prompt_length=8: Once upon a time, there isiech shut shut shut shut shutoni oni shut shut shutoni oni
>
> The main branch is worse than this PR. And I can tell that this is not related to the sub-group size. So I will fix it in the next PR instead of this one.

You can check this comment. How about a longer prompt?

@qnixsynapse (Collaborator) commented:

Llama-3 8B on a longer prompt. I have an Arc A750 GPU, if that matters.


@luoyu-intel (Contributor, Author) commented:

I think there are two issues: (a) a short prompt produces repeating tokens; (b) garbage tokens appear when the context length is larger than a certain value.

@luoyu-intel (Contributor, Author) commented:

@qnixsynapse The first one is confirmed as an existing issue of the master branch. I will look into the second one to see whether it was introduced by this PR.

@qnixsynapse (Collaborator) commented:

It's a regression: before this patch it used to work well (although a bit slower). I am still trying to debug. Sorry I couldn't test it earlier; I was busy testing Gemma models.

@luoyu-intel (Contributor, Author) commented:

I didn't test Q4_K_S models. I will test on an A770.

@qnixsynapse (Collaborator) commented Jul 2, 2024

Yup, confirmed. Works great on CPU.
Tested iQ4_XS and Q4_K_S models.

Edit: Will test on a Q4_0 model (although this is a legacy quant).

Edit 2: Broken on the Q4_0 model as well.

Edit 3: I will test with the warp size increased manually later to see if that fixes the issue. (I know it shouldn't, but still.)

if (GGML_SYCL_TARGET STREQUAL "NVIDIA")
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsycl-targets=nvptx64-nvidia-cuda")
    add_compile_definitions(GGML_SYCL_WARP_SIZE=32)
else()
    add_compile_definitions(GGML_SYCL_WARP_SIZE=16)
Contributor Author:

@qnixsynapse Could you change this size to 32 and test your model?

Collaborator:

Works perfectly on warp size of 32!!!!
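
For anyone following along, a minimal, hypothetical sketch of how a compile definition like GGML_SYCL_WARP_SIZE typically reaches a SYCL kernel; the queue setup and kernel body are placeholders, not llama.cpp's actual code, and the sub-group width is pinned with the standard SYCL 2020 attribute.

```cpp
#include <sycl/sycl.hpp>

#ifndef GGML_SYCL_WARP_SIZE
#define GGML_SYCL_WARP_SIZE 16   // the non-NVIDIA default from the CMake snippet above
#endif

int main() {
    sycl::queue q;
    // reqd_sub_group_size forces the kernel to be compiled with exactly
    // GGML_SYCL_WARP_SIZE lanes per sub-group (16 or 32 here).
    q.parallel_for(
        sycl::nd_range<1>{sycl::range<1>{256}, sycl::range<1>{64}},
        [=](sycl::nd_item<1> it) [[sycl::reqd_sub_group_size(GGML_SYCL_WARP_SIZE)]] {
            // kernel body elided in this sketch
            (void) it;
        });
    q.wait();
    return 0;
}
```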

@luoyu-intel (Contributor, Author) commented:

PR Q4_0, warp_size=32
 Once upon a time, there is a small village nestled in the rolling hills of the countryside. Unterscheidung between the two is not always clear-cut, and both terms are often used interchangeably. The village is home to a small population of people who live and work together in a close-knit community.

PR Q4_0, warp_size=16
 Once upon a time, there is a small, remote village nestled among the rolling hills of the countryside.rezzo The villagers of the village were known for their exceptional craftsmanship and artistic abilities. They were skilled in the art of woodworking, weaving, and pottery. The villagers were also

PR Q4K_S, warp_size=32
 Once upon a time, there is a small village nestled in the mountains. The villagers lived simple lives, farming the land and raising their families. But one day, a great evil descended upon the village, in the form of a powerful sorcerer.
The sorcerer was angry and resentful towards the villagers, and

PR Q4K_S, warp_size=16
 Once upon a time, there is a smalloshtztztzrtrtrtrtttt tt tt tuleuleuleuleuleuleuleuleuleuleuleule Roman Roman Roman Roman Roman Romananeaneaneaneaneaneaneaneaneaneaneaneaneaneaneaneaneaneaneaneaneaneaneaneaneaneane@{ane

I've confirmed this bug on Q4K_S.

@qnixsynapse (Collaborator) commented Jul 2, 2024

> PR Q4_0, warp_size=16
> Once upon a time, there is a small, remote village nestled among the rolling hills of the countryside.rezzo The villagers of the village were known for their exceptional craftsmanship and artistic abilities. They were skilled in the art of woodworking, weaving, and pottery. The villagers were also

> hills of the countryside.rezzo

It's also broken on Q4_0

Working great on iQ4_XS quant as well.

@airMeng (Collaborator) commented Jul 2, 2024

@qnixsynapse BTW, which UI are you using? It looks quite cool.

@qnixsynapse (Collaborator) commented:

@airMeng It's Chainlit.

@luoyu-intel (Contributor, Author) commented:

@qnixsynapse WARP_SIZE=32 works fine for me. I can change WARP_SIZE to 32 for Intel GPUs in a new PR to revert this regression. Do you agree?

@qnixsynapse (Collaborator) commented Jul 2, 2024

@luoyu-intel Sure. :)

Edit: BTW, with a warp size of 32 and the other portions of this PR, I am getting about 30 tokens/sec with iQ4_XS, whereas the earlier generation speed was 20 tokens/sec, so please don't revert anything else. :)

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 2, 2024
* use warp_size macro for all sycl kernels

* fix mask of permute_sub_group_by_xor

* fix rms_norm with correct warp number

* fix rms_norm_f32/group_norm_f32

* move norm to norm.cpp file

* fix quantize bug

* fix mmvq's batch size
@luoyu-intel deleted the intel_sgsize branch July 3, 2024 06:06