How much VRAM for all 83 layers of 65B and how to split among GPUs? #2202
-
~70-72GB for a q8 65B model, I believe. You should be able to fit a q5_K_M 65B model in your VRAM, though. While you're playing with that, kick off a download of a q6_K version as a potential upgrade, although that will be incredibly tight if it fits at all on a 3,3,1 split, since it'll be in the ballpark of your full 56GB of VRAM. Now, being an AMD consumer-grade babby, I've never used split before, so don't expect any of this to work. I've probably made an incorrect assumption about how the ratio maps to your cards (3,3,1 as 3x8GB, 3x8GB, 1x8GB).
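For reference, here is a minimal sketch of how the offload and split would be passed to llama.cpp's ./main, assuming its standard -ngl/-ts flags; the model path, prompt, and token count are placeholders:

```bash
# Minimal sketch: offload all layers and split the weights proportionally across GPUs.
# -ngl = number of layers to offload, -ts = per-GPU proportions (roughly 24GB:24GB:8GB here).
# Model path and prompt are placeholders.
./main -m ./models/65B/ggml-model-q5_K_M.bin -ngl 83 -ts 3,3,1 -p "Hello" -n 64
```

The -ts values are relative proportions rather than gigabytes, so 3,3,1 and 24,24,8 should behave the same.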
-
I don't have nvtop, but nvidia-smi shows the following while llama.cpp is regenerating a chat in Silly Tavern (82/83 layers):
There's room to spare at the moment.
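For anyone else checking headroom, stock nvidia-smi can give a quick per-GPU memory summary without nvtop; these are standard query flags:

```bash
# One-shot per-GPU memory summary
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv
# Or refresh every second while a reply is generating
watch -n 1 nvidia-smi
```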
-
Nice! A 3,4,1 split did the trick. Edit: not sure if it's faster, though. Here's nvidia-smi output during a 'regen' in Silly Tavern:
Looks like 2x P40 & 1x P4 is a working combo to get all 83 layers of a 65B model onto the GPUs.
-
Speed comparison time. 3,4,1 split, 83/83 layers:
82/83 layers, no split specified:
Yeah, 3,4,1 makes it work, but it's a lot slower. Update:
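For a more repeatable comparison than eyeballing chat regens, something like the following could work. This is only a sketch, assuming llama.cpp's ./main and the llama_print_timings summary it prints on exit; the model path and prompt are placeholders:

```bash
# Rough A/B timing using the llama_print_timings summary ./main prints when it exits.
MODEL=./models/65B/ggml-model-q5_K_M.bin   # placeholder path

# Case 1: 3,4,1 split, all 83 layers offloaded
./main -m "$MODEL" -ngl 83 -ts 3,4,1 -p "Once upon a time" -n 128 2>&1 \
  | grep "llama_print_timings"

# Case 2: default split, 82 layers offloaded
./main -m "$MODEL" -ngl 82 -p "Once upon a time" -n 128 2>&1 \
  | grep "llama_print_timings"
```

The eval time line (ms per token / tokens per second) is the number to compare between the two runs.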
-
My llama.cpp setup now has the following GPUs:
2 P40 24GB
1 P4 8GB
I've tried setting the split to 4,4,1 and defining GPU0 (a P40) as the primary (this seems to be the default anyway), but the most layers I can offload to GPU without hitting an OOM is 82. It will start up with all 83 layers on the GPUs, but it always throws a CUDA OOM on the first reply. (A rough command sketch is at the end of this comment.)
That said, the little P4 is definitely making a difference. It is noticeably faster with it installed.
I'd use a bigger card, but my server hosting the P40s only has room left in the half-height/length riser bay, and only a P4 fits there. Physically, there's room for a total of 3 P4 cards, but I suspect the riser can't handle that much power being pulled from the PCIe interface alone.
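In case it's useful to anyone replicating this, here is a sketch of the launch described above, assuming ./main's -mg/-ts/-ngl flags; the model path and context size are placeholders:

```bash
# 2x P40 (24GB) + 1x P4 (8GB): stop at 82 layers to leave headroom for the
# buffers used during generation, which is presumably what OOMs at 83.
# -mg 0 keeps the main/scratch buffers on the first P40; -ts splits the weights 4:4:1.
# If the P4 enumerates before the P40s, CUDA_VISIBLE_DEVICES can reorder the indices.
./main -m ./models/65B/ggml-model-q5_K_M.bin \
  -mg 0 -ts 4,4,1 -ngl 82 -c 2048
```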