How much VRAM for all 83 layers of 65B and how to split among GPUs? #2202
-
~70-72GB for a q8 65B model, I believe. You should be able to fit a q5_K_M 65B model in your VRAM, though. While you're playing with that, kick off a download of a q6_K version as a potential upgrade, although that will be incredibly tight if it fits at all on a 3,3,1 split, since it'll be in the ballpark of your full 56GB of VRAM. Now, being an AMD consumer-grade babby, I've never used split before, so don't expect any of this to work. I've probably made an incorrect assumption about how the ratio maps to your cards (3,3,1 as 3x8GB, 3x8GB, 1x8GB).
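For reference, here is a minimal sketch of how the offload and split would be passed to llama.cpp's ./main, assuming its standard -ngl/-ts flags; the model path, prompt, and token count are placeholders:

```bash
# Minimal sketch: offload all layers and split the weights proportionally across GPUs.
# -ngl = number of layers to offload, -ts = per-GPU proportions (roughly 24GB:24GB:8GB here).
# Model path and prompt are placeholders.
./main -m ./models/65B/ggml-model-q5_K_M.bin -ngl 83 -ts 3,3,1 -p "Hello" -n 64
```

The -ts values are relative proportions rather than gigabytes, so 3,3,1 and 24,24,8 should behave the same.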
-
I don't have nvtop, but nvidia-smi shows the following while llama.cpp is regenerating a chat in Silly Tavern (82/83 layers):
There's room to spare at the moment.
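For anyone else checking headroom, stock nvidia-smi can give a quick per-GPU memory summary without nvtop; these are standard query flags:

```bash
# One-shot per-GPU memory summary
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv
# Or refresh every second while a reply is generating
watch -n 1 nvidia-smi
```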
-
Nice! A 3,4,1 split did the trick. Edit: not sure if it's faster, though. Here's nvidia-smi output during a 'regen' in Silly Tavern:
Looks like 2x P40 & 1x P4 is a working combo to get all 83 layers of a 65B model onto the GPUs.
-
Speed comparison time. 3,4,1 split, 83/83 layers:
82/83 layers, no split specified:
Yeah, 3,4,1 makes it work, but it's a lot slower. Update:
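For a more repeatable comparison than eyeballing chat regens, something like the following could work. This is only a sketch, assuming llama.cpp's ./main and the llama_print_timings summary it prints on exit; the model path and prompt are placeholders:

```bash
# Rough A/B timing using the llama_print_timings summary ./main prints when it exits.
MODEL=./models/65B/ggml-model-q5_K_M.bin   # placeholder path

# Case 1: 3,4,1 split, all 83 layers offloaded
./main -m "$MODEL" -ngl 83 -ts 3,4,1 -p "Once upon a time" -n 128 2>&1 \
  | grep "llama_print_timings"

# Case 2: default split, 82 layers offloaded
./main -m "$MODEL" -ngl 82 -p "Once upon a time" -n 128 2>&1 \
  | grep "llama_print_timings"
```

The eval time line (ms per token / tokens per second) is the number to compare between the two runs.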
-
My llama.cpp setup now has the following GPUs:
2 P40 24GB
1 P4 8GB
I've tried setting the split to 4,4,1 and defining GPU0 (a P40) as the primary (this seems to be the default anyway), but the most layers I can offload to GPU without hitting an OOM is 82. It will start up with all 83 layers on the GPUs, but it always throws a CUDA OOM on the first reply. (A rough command sketch is at the end of this comment.)
That said, the little P4 is definitely making a difference. It is noticeably faster with it installed.
I'd use a bigger card, but my server hosting the P40s only has room left in the half-height/length riser bay, and only a P4 fits there. Physically, there's room for a total of 3 P4 cards, but I suspect the riser can't handle that much power being pulled from the PCIe interface alone.
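In case it's useful to anyone replicating this, here is a sketch of the launch described above, assuming ./main's -mg/-ts/-ngl flags; the model path and context size are placeholders:

```bash
# 2x P40 (24GB) + 1x P4 (8GB): stop at 82 layers to leave headroom for the
# buffers used during generation, which is presumably what OOMs at 83.
# -mg 0 keeps the main/scratch buffers on the first P40; -ts splits the weights 4:4:1.
# If the P4 enumerates before the P40s, CUDA_VISIBLE_DEVICES can reorder the indices.
./main -m ./models/65B/ggml-model-q5_K_M.bin \
  -mg 0 -ts 4,4,1 -ngl 82 -c 2048
```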