Dequant improvements rebase #8255

Merged
merged 4 commits into from
Jul 3, 2024

Conversation

AidanBeltonS (Contributor)

This PR improves the dequantize_block_q4_K kernel, focusing on its global memory access patterns.

Three main changes are implemented:

  • Use a single 32-bit load for a half2 rather than two 16-bit loads
  • Load all scales into local memory, then perform the random accesses on the local copy
  • Vectorize the q loads so that 32 bits are loaded at a time rather than 8 bits

All results below were collected on an A100 GPU.

| Metric | Without Changes | With Changes | % Change | Note |
| --- | --- | --- | --- | --- |
| llama-bench 70B PP throughput (t/s) | 503.36 | 564.04 | -11.85 | negative change is better |
| NSYS avg. kernel time (us) | 587.54 | 409.52 | 30.30 | positive change is better |

No meaningful change in the Intel GPU results has been observed.

@JohannesGaessler added the SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) label Jul 2, 2024
@OuadiElfarouki (Contributor) left a comment

Improvement observed on Nvidia A4000 & RTX 4070 as well (7B & 13B - Q4_K_*).
Thanks!

@joeatodd (Contributor) commented Jul 2, 2024

Ping @airMeng to check for regressions on Intel side

@airMeng (Collaborator) left a comment

It is weird that I don't see any performance improvements on Arc A770, no regression either.

ping our performance expert @luoyu-intel

@luoyu-intel (Contributor)

The code looks good. It can prevent cache misses, so you may not see a performance improvement if there are no cache misses in your case.

@airMeng airMeng merged commit fadde67 into ggml-org:master Jul 3, 2024
53 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 3, 2024
* Single load for half2

* Store scales in local mem

* Vec load quantized values
6 participants