Just thinking about iGPU #1185

FNsi · 2023-04-26T00:42:28Z

FNsi
Apr 26, 2023

Allocate a huge amount of dedicated VRAM to an AMD iGPU.

As we know, the 680M in the 6700h is close to a 2050.

It may be the cheapest way to do anything 😅😂

SlyEcho · 2023-04-26T13:52:58Z

SlyEcho
Apr 26, 2023
Collaborator

It runs on the Steam Deck, so why not.

1 reply

FNsi Apr 26, 2023
Author

It runs on the Steam Deck, so why not.

Yep, I agree. Boards without the 4 GB limit on allocated VRAM would be awesome. Think about it: a Vega iGPU with 60 GB of VRAM... fine-tuning 13B models on a single APU, hahahaha. 😂

replete · 2023-05-16T18:13:24Z

replete
May 16, 2023

AFAIK, a maximum of 4GB of system RAM can be shared with an AMD APU's integrated GPU. If 16GB were possible, I would immediately order a Framework Ryzen laptop... please update if you learn anything different.

EDIT: 16GB VRAM (UMA Frame Buffer Size in BIOS) possible on some manufacturer BIOS with at least 680M/780M (Zen 3/Zen 4)

1 reply

FNsi May 16, 2023
Author

It depends on the BIOS, and it's easy to find a 5600G with 8 GB or more of iGPU shared memory.

FNsi · 2023-05-23T12:53:10Z

FNsi
May 23, 2023
Author

I spent some time using setup_var to change my iGPU memory size to 8 GB.

Sadly, the speed is not as good as I expected, especially considering the speed difference between OpenBLAS and CLBlast: for the 7B model, the GPU eval time with the CLBlast build and -ngl 1000 is almost the same as on the CPU. And yes, the maximum memory use is around 5.6 GB. 😂

9 replies

replete May 23, 2023

The 7840HS is roughly 21% faster than the 6800U on CPU, and the 780M is supposedly around 25% faster than the 680M, with a 3 GHz boost vs 2.4 GHz. I have a feeling the DDR5 clock will have some impact.

Even if the speed isn't that much better than the CPU, being able to load a model and have a reasonable offline LLM available while working, using just the iGPU, is appealing, especially if you can allocate it 16GB of VRAM and run a 13B model. Thanks for sharing your results!

FNsi May 23, 2023
Author

Yep, you can find some guides about setup_var, and maybe you can change your BIOS settings. In my case there was no option for 8 GB or 16 GB, so I just guessed the value and it works 😂😂😂

replete May 23, 2023

3600 MHz is on the lower end, like the original DDR5 kits that were released; 5600 is available from Crucial for OEM-type stuff. 48GB modules just landed too, so this local laptop LLM space is starting to get interesting... I would expect to see a reasonable improvement with faster RAM. How was your GPU utilization under load?

FNsi May 23, 2023
Author

3600 MHz is on the lower end, like the original DDR5 kits that were released; 5600 is available from Crucial for OEM-type stuff. 48GB modules just landed too, so this local laptop LLM space is starting to get interesting... I would expect to see a reasonable improvement with faster RAM. How was your GPU utilization under load?

100% while sampling. I checked the BIOS file and there's only Above 4G MMIO, so I guess nothing like Resizable BAR or SAM works. But at least I can use it to draw high-res AI girls 😂

FNsi May 23, 2023
Author

More detail: the problem is the eval time while printing. So I think it basically comes down to RAM speed. The 13B model takes at most 11 GB of VRAM.

The printing time goes from 40 ms for 7B to 300 ms for 13B. It must be because of the missing Resizable BAR, and it's hard to enable that on my laptop without flashing the BIOS.

ghost · 2023-05-24T03:44:27Z

ghost
May 24, 2023

If there is a lot of interest in iGPUs it might be worth creating a zero-copy GPU implementation. That's only possible on iGPUs since they share the main memory with the CPU.
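To make the zero-copy idea concrete, here is a host-side analogy only, not GPU code: in plain Python, `bytes(buf)` duplicates a buffer while `memoryview(buf)` gives a second handle onto the same memory. The function name and the tiny "weights" buffer are made up for illustration. In OpenCL the corresponding mechanism would presumably be a buffer created over existing host memory (for example with `CL_MEM_USE_HOST_PTR`), but whether a given driver really avoids the copy is not something this sketch shows.

```python
def copy_vs_zero_copy():
    """Contrast an explicit copy with a zero-copy view of the same buffer."""
    # Stand-in for model weights sitting in system RAM (hypothetical data).
    weights = bytearray(b"\x01\x02\x03\x04")

    copied = bytes(weights)       # explicit copy: a second allocation
    shared = memoryview(weights)  # zero-copy: another handle on the same bytes

    # Change the original buffer after both handles were created.
    weights[0] = 0xFF

    # The copy still holds the old value; the view sees the new one.
    return copied[0], shared[0]


if __name__ == "__main__":
    old_value, new_value = copy_vs_zero_copy()
    print("copy sees:", old_value, "view sees:", new_value)
```

The point of the analogy is only that a copy costs memory traffic and a second allocation, while a shared view costs neither.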

3 replies

FNsi May 24, 2023
Author

That would be really interesting!

replete May 28, 2023

I'm not sure if the interest has yet fully materialized from my own internet searches, but the workflow of running a 13b model on a 'spare' iGPU on a decent system very much has practical merit for real-world use and I hope people realize soon how useful this workflow can be on modern hardware

FNsi May 28, 2023
Author

I'm not sure if the interest has yet fully materialized from my own internet searches, but the workflow of running a 13b model on a 'spare' iGPU on a decent system very much has practical merit for real-world use and I hope people realize soon how useful this workflow can be on modern hardware

I have an idea: just set the VRAM to 512 MB or less, then use OpenCL. I assume that would reduce the copy time. (Using Windows.)

Btw, I'd like to wait for OpenLLaMA 3B, because it looks like

It's easy to see that scaling the training data is almost as efficient as scaling the parameters. Not sure how far that will go in the end.

FNsi · 2023-05-28T06:22:38Z

FNsi
May 28, 2023
Author

Today I did some tests on Windows, but I think it will also work on Linux. Just reducing the number of CPU threads increases the printing speed.
I finally got the same printing time as on the CPU.

Update: not the same in Linux, still too slow to use.

13B (interactive mode), printing time in ms:
Pure CPU: 80
Linux, -ngl 1000 -t 4: 150
Windows, -ngl 1000 -t 4: 80

Update 2:
Increasing the batch size increases that speed.

And I failed to reproduce the Windows speed.
They all come to around 140 ms for 13B.

I assume it's the same kind of problem as with Intel P/E cores...

30B: q4_0 with 60 layers takes only 19 GB of VRAM (plus BLAS).
Speed:
Printing time 330, the same as CPU with -t 8
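As a rough sanity check on the 19 GB figure, the weight size of a q4_0 model can be estimated from the block layout. This is a sketch under stated assumptions, not taken from the post: it assumes q4_0 packs 32 weights into an 18-byte block (a 2-byte fp16 scale plus 16 bytes of 4-bit values, i.e. 4.5 bits per weight), and that the "30B" model has about 32.5 billion parameters. Older builds used a different block layout, so the numbers may not match the exact build used here.

```python
# Assumed q4_0 block layout: 32 weights per block,
# stored as a 2-byte fp16 scale plus 16 bytes of packed 4-bit values.
Q4_0_BLOCK_WEIGHTS = 32
Q4_0_BLOCK_BYTES = 2 + 16


def q4_0_size_gb(n_params: float) -> float:
    """Estimate the q4_0 weight size in GB (1e9 bytes) for n_params weights."""
    bits_per_weight = Q4_0_BLOCK_BYTES * 8 / Q4_0_BLOCK_WEIGHTS  # 4.5 bits
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # 32.5e9 is an assumed parameter count for the "30B" model.
    print(f"30B q4_0 weights: about {q4_0_size_gb(32.5e9):.1f} GB")
```

Under these assumptions the weights alone come to a bit over 18 GB, which is in the same range as the reported 19 GB once scratch buffers are added.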

0 replies

xiangyang-95 · 2023-11-03T15:27:52Z

xiangyang-95
Nov 3, 2023

May I know whether there is currently an iGPU zero-copy implementation in llama.cpp?

7 replies

FNsi Nov 5, 2023
Author

I guess the only thing needed is a DML backend... on Windows 😅

shibe2 Nov 5, 2023

prompt processing is on par with the performance of the CPU

Which is faster?

I would be interested in working on shared memory support in the OpenCL back-end, but I don't have suitable hardware, and I'm not sure if there are iGPUs that are worth supporting.

FNsi Nov 7, 2023
Author

I can almost confirm that the speed is limited by RAM. So the best solution is to offload layers based on your memory bandwidth (e.g. 50 GB/s). In that case, as a rough guess, offloading 2 GB, which takes about 40 ms to read, would be helpful.
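The 40 ms figure is just size divided by bandwidth. A minimal sketch of that arithmetic, assuming the offloaded weights are read from RAM once per generated token and that nothing else limits the speed (the function names are made up for illustration):

```python
def read_time_ms(size_gb: float, bandwidth_gb_s: float) -> float:
    """Time in ms to stream size_gb of weights once at bandwidth_gb_s."""
    return size_gb * 1000.0 / bandwidth_gb_s


def max_tokens_per_s(size_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s if every token needs one full pass over the weights."""
    return 1000.0 / read_time_ms(size_gb, bandwidth_gb_s)


if __name__ == "__main__":
    # The example from the post: 2 GB offloaded at roughly 50 GB/s.
    print(f"{read_time_ms(2.0, 50.0):.0f} ms per pass")
    print(f"at most {max_tokens_per_s(2.0, 50.0):.0f} tokens/s")
```

The same formula suggests why faster RAM matters here: doubling the bandwidth halves the per-token read time for the same amount of offloaded weights.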

What's needed is something like 7200 MHz DDR5, but AMD laptops don't support it...

(I noticed the same thing when playing games in 4K: the small containers are okay, but the big ones run like a PowerPoint slideshow.)

xiangyang-95 Nov 7, 2023

I am running llama.cpp on Ubuntu 22.04 LTS. The hardware I am using is an Intel 11th Gen i7-11665G7 with dual-channel memory installed. The GPU is Intel Iris Xe Graphics. The results I got when running llama-bench with different numbers of offloaded layers are below:
ggml_opencl: selecting platform: 'Intel(R) OpenCL HD Graphics'
ggml_opencl: selecting device: 'Intel(R) Iris(R) Xe Graphics [0x9a49]'
ggml_opencl: device FP16 support: true

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B mostly Q4_K - Medium | 4.07 GiB | 7.24 B | OpenCL | 1 | pp 512 | 26.31 ± 0.73 |
| llama 7B mostly Q4_K - Medium | 4.07 GiB | 7.24 B | OpenCL | 1 | tg 128 | 7.33 ± 0.06 |
| llama 7B mostly Q4_K - Medium | 4.07 GiB | 7.24 B | OpenCL | 2 | pp 512 | 26.50 ± 0.34 |
| llama 7B mostly Q4_K - Medium | 4.07 GiB | 7.24 B | OpenCL | 2 | tg 128 | 6.99 ± 0.23 |
| llama 7B mostly Q4_K - Medium | 4.07 GiB | 7.24 B | OpenCL | 4 | pp 512 | 25.57 ± 2.16 |
| llama 7B mostly Q4_K - Medium | 4.07 GiB | 7.24 B | OpenCL | 4 | tg 128 | 6.55 ± 0.12 |
| llama 7B mostly Q4_K - Medium | 4.07 GiB | 7.24 B | OpenCL | 8 | pp 512 | 26.31 ± 0.78 |
| llama 7B mostly Q4_K - Medium | 4.07 GiB | 7.24 B | OpenCL | 8 | tg 128 | 6.14 ± 0.11 |
| llama 7B mostly Q4_K - Medium | 4.07 GiB | 7.24 B | OpenCL | 16 | pp 512 | 26.69 ± 0.60 |
| llama 7B mostly Q4_K - Medium | 4.07 GiB | 7.24 B | OpenCL | 16 | tg 128 | 5.31 ± 0.02 |
| llama 7B mostly Q4_K - Medium | 4.07 GiB | 7.24 B | OpenCL | 32 | pp 512 | 27.15 ± 1.25 |
| llama 7B mostly Q4_K - Medium | 4.07 GiB | 7.24 B | OpenCL | 32 | tg 128 | 4.12 ± 0.10 |
| llama 7B mostly Q4_K - Medium | 4.07 GiB | 7.24 B | OpenCL | 33 | pp 512 | 26.45 ± 1.28 |
| llama 7B mostly Q4_K - Medium | 4.07 GiB | 7.24 B | OpenCL | 33 | tg 128 | 4.16 ± 0.05 |

The CPU result I measured is around 25 t/s for pp 512 and 7 t/s for tg 128. This is why I am wondering whether a zero-copy implementation could help increase the token generation t/s when I offload all layers to the integrated GPU. I can help work on this if it is a feasible approach.
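To make the trend easier to read, the tg 128 numbers can be converted to milliseconds per token and compared with the CPU figure. This is only a small sketch over the mean values from the results above; the 7 t/s CPU baseline is the approximate number quoted here, and the helper names are made up for illustration.

```python
# tg 128 mean results (tokens/s) from the llama-bench run, keyed by ngl.
TG_128 = {1: 7.33, 2: 6.99, 4: 6.55, 8: 6.14, 16: 5.31, 32: 4.12, 33: 4.16}

# Approximate CPU-only token generation speed quoted in the post.
CPU_TG = 7.0


def ms_per_token(tokens_per_s: float) -> float:
    """Convert a tokens/s rate into milliseconds per generated token."""
    return 1000.0 / tokens_per_s


def relative_to_cpu(tokens_per_s: float, cpu_tokens_per_s: float = CPU_TG) -> float:
    """Speed as a fraction of the CPU-only speed (1.0 means equal)."""
    return tokens_per_s / cpu_tokens_per_s


if __name__ == "__main__":
    for ngl, tps in sorted(TG_128.items()):
        print(f"ngl={ngl:2d}  {ms_per_token(tps):6.1f} ms/token  "
              f"{relative_to_cpu(tps):.0%} of CPU")
```

With all 33 layers offloaded, token generation runs at roughly 60% of the CPU-only speed in this measurement, while prompt processing stays about the same.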

shibe2 Nov 7, 2023

It looks strange that ngl doesn't significantly affect pp performance. I would think that copying model parameters beforehand eliminates most of copying during inference. In this light, implementing shared buffers does not look very lucrative.

A potentially more advantageous thing would be to avoid storing de-quantized parameters in RAM. The OpenCL back-end already has kernels for quantized matrix-vector multiplication that can be extended to support matrix multiplication.

by offloading some/all layers to the integrated GPU, I could free up some of the CPU resources for some other processes

While it would free some CPU, memory would still be busy.
