Dequant improvements rebase #8255

Merged
merged 4 commits into from
Jul 3, 2024

Conversation

AidanBeltonS (Contributor)

This PR improves the dequantize_block_q4_K kernel, focusing on its global memory access patterns.

Three main changes are implemented:

  • Use a single 32-bit load for a half2 rather than two 16-bit loads
  • Load all scales into local memory, then perform the random accesses on the local copy
  • Vectorize the q loads so that 32 bits are loaded at a time rather than 8 bits

All results below were collected on an A100 GPU.

| Metric | Without Changes | With Changes | % Change | Note |
| --- | --- | --- | --- | --- |
| llama-bench 70B PP throughput (t/s) | 503.36 | 564.04 | -11.85 | negative change is better |
| NSYS avg. kernel time (us) | 587.54 | 409.52 | 30.30 | positive change is better |

No meaningful change in the Intel GPU results has been observed.

@JohannesGaessler added the SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) label Jul 2, 2024
@OuadiElfarouki (Contributor) left a comment

Improvement observed on Nvidia A4000 & RTX 4070 as well (7B & 13B - Q4_K_*).
Thanks!

@joeatodd (Contributor) commented Jul 2, 2024

Ping @airMeng to check for regressions on Intel side

@airMeng (Collaborator) left a comment

It is weird that I don't see any performance improvements on Arc A770, no regression either.

ping our performance expert @luoyu-intel

@luoyu-intel (Contributor)

The code looks good. It can prevent cache misses, so you may not see a performance improvement if there are no cache misses in your case.

@airMeng airMeng merged commit fadde67 into ggml-org:master Jul 3, 2024
53 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 3, 2024
* Single load for half2

* Store scales in local mem

* Vec load quantized values
6 participants