Replies: 1 comment
You could possibly try using the
#1597 got merged fairly recently and I think it was supposed to help with this. It will only have an effect if you're using mmap, though, and things like LoRA prevent that (also, some older GGML files may not be mmap-able).
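For reference, here is roughly how that plays out on the command line (a sketch; the model path and prompt are placeholders, and the comments only restate what the reply above suggests about mmap):

```sh
# mmap is the default loading path in llama.cpp; the reply above suggests
# #1597 only helps when this path is actually in use.
./main -m ./models/7B/ggml-model-q4_0.bin -ngl 1 -p "Hello"

# Anything that disables mmap, e.g. --no-mmap or applying a LoRA with
# --lora, falls back to loading the whole model into regular buffers,
# so the savings from #1597 would not apply.
./main -m ./models/7B/ggml-model-q4_0.bin -ngl 1 --no-mmap -p "Hello"
```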
I'm trying to run inference on a virtual GPU from my cloud provider (an NVIDIA A16). As far as I can tell, llama.cpp's Makefile is hard-coded to support only native devices; I was able to get it to compile and run somewhat successfully by changing that flag to `-arch=compute_50`. When I run main with `./main -m <path> -ngl 1` I see this output: ...
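For context, the change amounts to something like the following (a sketch; the exact NVCCFLAGS line and its default value depend on the llama.cpp revision being built):

```sh
# In the Makefile, replace the native-only CUDA arch flag with one the
# vGPU driver will accept, e.g.:
#   NVCCFLAGS += -arch=native   ->   NVCCFLAGS += -arch=compute_50
# Then rebuild with cuBLAS enabled.
make clean
make LLAMA_CUBLAS=1
```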
However, after a few seconds I hit this fatal error:
I was under the assumption that with `-ngl 1`, llama.cpp would only allocate 621 MB of VRAM, which would be well within the 1024 MB on my vGPU. Does llama.cpp still allocate the full required memory on both the CPU and the GPU (5331.34 MB on each)? Or should it only allocate the VRAM used by the offloaded layers, i.e. 621 MB on the GPU? If it should only be allocating 621 MB on the GPU, then my quick hack clearly didn't work.
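One way to check empirically which of the two is happening on the vGPU (a sketch; assumes `nvidia-smi` and `watch` are available inside the VM):

```sh
# In one terminal, poll the vGPU's memory usage once per second.
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv

# In another terminal, load the model with a single layer offloaded and
# compare the peak memory.used against the ~621 MB and ~5331 MB figures.
./main -m <path> -ngl 1
```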
Has anyone gotten llama.cpp to run on virtual GPUs?