vulkan: add environment variable to avoid VRAM allocation #11592
Conversation
Looks good to me. Thank you for the contribution!
Nice, but wouldn't it be simpler to just return the host buffer type when device memory is requested? i.e. something like:
static ggml_backend_buffer_type_t ggml_backend_vk_device_get_buffer_type(ggml_backend_dev_t dev) {
    ggml_backend_vk_device_context * ctx = (ggml_backend_vk_device_context *)dev->context;
    if (ctx->prefer_host_memory) { // moved from vk_device ...
        return ggml_backend_vk_device_get_host_buffer_type(dev);
    }
    return ggml_backend_vk_buffer_type(ctx->device);
}
I tested this simple change => it does not work (crashes); it needs more changes to make it work. |
If you find changes that help on your device, go ahead and submit a PR. There are probably things that can be improved about how we handle UMA devices. |
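(For readers following along: a common way to guess whether a Vulkan device is a UMA/integrated part is to look at the reported device type. The snippet below is only an illustrative sketch using vulkan-hpp; it is not necessarily the exact check ggml-vulkan performs.)
#include <vulkan/vulkan.hpp>

// Illustrative only: treat integrated GPUs as likely-UMA candidates.
// ggml-vulkan may use additional or different criteria.
static bool guess_uma(vk::PhysicalDevice physical_device) {
    vk::PhysicalDeviceProperties props = physical_device.getProperties();
    return props.deviceType == vk::PhysicalDeviceType::eIntegratedGpu;
}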
I tried some changes... but for now I need more time to see whether there is any gain. So far, no real gain on LLMs on a Ryzen 7940HS. (08/05/2025:)
static vk_buffer ggml_vk_create_buffer_device(vk_device& device, size_t size) {
    vk_buffer buf;
    try {
        /*if (device->prefer_host_memory) {
            buf = ggml_vk_create_buffer(device, size,
                vk::MemoryPropertyFlagBits::eHostVisible | vk::MemoryPropertyFlagBits::eHostCoherent,
                vk::MemoryPropertyFlagBits::eDeviceLocal);
        } else */if (device->uma) {
            // current code:
            //buf = ggml_vk_create_buffer(device, size, vk::MemoryPropertyFlagBits::eDeviceLocal, vk::MemoryPropertyFlagBits::eHostVisible | vk::MemoryPropertyFlagBits::eHostCoherent);
            // test code:
            buf = ggml_vk_create_buffer(device, size,
                vk::MemoryPropertyFlagBits::eHostCached | vk::MemoryPropertyFlagBits::eHostVisible | vk::MemoryPropertyFlagBits::eHostCoherent,
                vk::MemoryPropertyFlagBits::eHostVisible | vk::MemoryPropertyFlagBits::eHostCoherent);
        } else {
            buf = ggml_vk_create_buffer(device, size,
                // current code uses:
                // vk::MemoryPropertyFlagBits::eDeviceLocal | vk::MemoryPropertyFlagBits::eHostVisible | vk::MemoryPropertyFlagBits::eHostCoherent,
                vk::MemoryPropertyFlagBits::eDeviceLocal);
        }
    } catch (const vk::SystemError& e) {
        std::cerr << "ggml_vulkan: Device memory allocation of size " << size << " failed." << std::endl;
        std::cerr << "ggml_vulkan: " << e.what() << std::endl;
        throw e;
    }
    return buf;
}
I don't really understand the logic that was used. For me, without UMA, the alloc with ... on: ggml_vulkan: 0 = AMD Radeon RX 560 Series (RADV POLARIS11) (radv) | uma: 0 | fp16: 0 | warp size: 64 | shared memory: 65536 | matrix cores: none
On a more recent GPU (AMD Radeon RX 6900 XT - 16.0 GiB) I see the same perf in both cases. Does anyone know why it was done like that?
(Note: there is a little more than this part in the patch for that perf.) And I do not see the same gain on an AMD Radeon 780M (RADV PHOENIX) ... |
I need to find the time to publish my latest test case... Does anyone know why we use ... ? |
We only use ... |
The fact that it is available does not mean it is faster ... And for now eHostVisible | eHostCoherent is used for weight tensors here: llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp, lines 1461 to 1469 at commit 6602304. |
It always uses device-local for weight tensors, unless you force prefer_host_memory for the reasons shown in this PR. I'm not sure what you are trying to say. |
buf = ggml_vk_create_buffer(device, size,
    vk::MemoryPropertyFlagBits::eDeviceLocal | vk::MemoryPropertyFlagBits::eHostVisible | vk::MemoryPropertyFlagBits::eHostCoherent,
    vk::MemoryPropertyFlagBits::eDeviceLocal);
You first ask for memory that is on the device but visible/coherent from the host, why? Weights don't need host access. And I have devices that can provide such memory, but at the cost of a big loss of performance. Why is it defined like that? |
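(For context: the two flag sets in that call appear to act as a preferred set and a fallback set. Below is a minimal sketch of that common "preferred flags, then fallback flags" memory-type selection pattern in Vulkan; the helper name find_memory_type and its structure are assumptions for illustration, not the actual ggml_vk_create_buffer code.)
#include <vulkan/vulkan.hpp>
#include <cstdint>
#include <initializer_list>

// Illustrative sketch: pick a memory type index matching `required`,
// falling back to `fallback` if nothing matches. Not the ggml code.
static int32_t find_memory_type(vk::PhysicalDevice phys,
                                uint32_t type_bits,
                                vk::MemoryPropertyFlags required,
                                vk::MemoryPropertyFlags fallback) {
    vk::PhysicalDeviceMemoryProperties mem = phys.getMemoryProperties();
    for (vk::MemoryPropertyFlags want : { required, fallback }) {
        for (uint32_t i = 0; i < mem.memoryTypeCount; ++i) {
            if ((type_bits & (1u << i)) &&
                (mem.memoryTypes[i].propertyFlags & want) == want) {
                return (int32_t) i;
            }
        }
    }
    return -1; // no suitable memory type found
}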
See #9251. It enables faster memory transfers if a large BAR space is available through ReBAR. |
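(As a rough illustration of what "a large BAR space" means in practice: with resizable BAR, a device-local heap exposes host-visible memory well beyond the classic 256 MiB window. The check below is only a hedged sketch of that heuristic; it is not the detection llama.cpp actually performs, and on UMA devices it would also report true.)
#include <vulkan/vulkan.hpp>
#include <cstdint>

// Illustrative heuristic: does any DEVICE_LOCAL | HOST_VISIBLE memory type
// sit on a heap significantly larger than the legacy 256 MiB BAR window?
static bool looks_like_rebar(vk::PhysicalDevice phys) {
    const vk::DeviceSize legacy_bar = 256ull * 1024 * 1024;
    vk::PhysicalDeviceMemoryProperties mem = phys.getMemoryProperties();
    for (uint32_t i = 0; i < mem.memoryTypeCount; ++i) {
        const auto flags = mem.memoryTypes[i].propertyFlags;
        const bool dl_hv = (flags & vk::MemoryPropertyFlagBits::eDeviceLocal) &&
                           (flags & vk::MemoryPropertyFlagBits::eHostVisible);
        if (dl_hv && mem.memoryHeaps[mem.memoryTypes[i].heapIndex].size > legacy_bar) {
            return true;
        }
    }
    return false;
}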
Yes, it can be faster with ReBAR, but it is horrible on systems that allow this config without ReBAR. For me the problem is assuming that only systems with ReBAR can create a buffer with ... In my case I see a 3x gain from ... |
Please open an issue about it and provide benchmarks and the output of vulkaninfo for your configuration. |
With Vulkan on my PC (Ryzen 5 3400G APU, DDR4-3000, Debian 12), I noticed big performance drops (~2x or ~3x) associated with buffer allocations on VRAM.
It's easier to test with stable-diffusion.cpp: the VAE step on a 512x512 sd1.5 generation usually takes around 40 seconds with the default 2G dedicated VRAM. But if I restrict VRAM to a very small value (64M-80M), that timing drops to around 13 seconds.
I noticed a similar performance drop on LLMs, but it's harder to pinpoint. For instance: prompt processing on smaller models running nearly twice as slow as on larger ones, performance changing right after a koboldcpp restart, or inconsistent results between benchmarks and generation.
Checking with GGML_VULKAN_MEMORY_DEBUG, the slower behavior always seems to be associated with allocations on device memory, so I added this env var to confirm. And forcing host memory allocations seems to fix the performance drop.
OTOH, I don't see the original performance issue on a 4500U laptop (Ubuntu 24.04, DDR4-3200), so this would benefit from testing on different iGPU+OS combinations.
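(For reference, an environment-variable toggle like the one this PR adds can be wired up roughly as below. The variable name GGML_VK_PREFER_HOST_MEMORY and the struct it sets are assumptions based on the prefer_host_memory flag mentioned earlier in the conversation, not a copy of the actual patch.)
#include <cstdlib>

// Hypothetical sketch: read an opt-in environment variable once at device
// initialization and store the result, so later allocations can prefer
// host-visible memory instead of VRAM.
struct vk_device_sketch {
    bool prefer_host_memory = false;
};

static void init_prefer_host_memory(vk_device_sketch & dev) {
    // The variable name below is an assumption for illustration.
    dev.prefer_host_memory = (std::getenv("GGML_VK_PREFER_HOST_MEMORY") != nullptr);
}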