Replies: 1 comment
You could possibly try using the
#1597 got merged fairly recently and I think it was supposed to help with this. It will only have an effect if you're using mmap, though, and things like LoRA prevent that (also, some older GGML files may not be mmap-able).
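For reference, here is roughly how that plays out on the command line (a sketch; the model path and prompt are placeholders, and the comments only restate what the reply above suggests about mmap):

```sh
# mmap is the default loading path in llama.cpp; the reply above suggests
# #1597 only helps when this path is actually in use.
./main -m ./models/7B/ggml-model-q4_0.bin -ngl 1 -p "Hello"

# Anything that disables mmap, e.g. --no-mmap or applying a LoRA with
# --lora, falls back to loading the whole model into regular buffers,
# so the savings from #1597 would not apply.
./main -m ./models/7B/ggml-model-q4_0.bin -ngl 1 --no-mmap -p "Hello"
```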
I'm trying to run inference on a virtual GPU from my cloud provider (an NVIDIA A16). As far as I can tell, llama.cpp's Makefile is hard-coded to support only native devices; I was able to get it to compile and run somewhat successfully by changing that flag to `-arch=compute_50`. When I run main with `./main -m <path> -ngl 1` I see this output: ...
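For context, the change amounts to something like the following (a sketch; the exact NVCCFLAGS line and its default value depend on the llama.cpp revision being built):

```sh
# In the Makefile, replace the native-only CUDA arch flag with one the
# vGPU driver will accept, e.g.:
#   NVCCFLAGS += -arch=native   ->   NVCCFLAGS += -arch=compute_50
# Then rebuild with cuBLAS enabled.
make clean
make LLAMA_CUBLAS=1
```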
However, after a few seconds I hit this fatal error:
I was under the assumption that with `-ngl 1`, llama.cpp would only allocate 621 MB of VRAM, which would be well within the 1024 MB on my vGPU. Does llama.cpp still allocate the full required memory on both the CPU and the GPU (5331.34 MB on each)? Or should it only allocate the VRAM used by the offloaded layers, i.e. 621 MB on the GPU? If it should only be allocating 621 MB on the GPU, then my quick hack clearly didn't work.
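One way to check empirically which of the two is happening on the vGPU (a sketch; assumes `nvidia-smi` and `watch` are available inside the VM):

```sh
# In one terminal, poll the vGPU's memory usage once per second.
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv

# In another terminal, load the model with a single layer offloaded and
# compare the peak memory.used against the ~621 MB and ~5331 MB figures.
./main -m <path> -ngl 1
```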
Has anyone gotten llama.cpp to run on virtual GPUs?