
vulkan: add environment variable to avoid VRAM allocation #11592


Merged: 1 commit merged into ggml-org:master on Feb 10, 2025

Conversation

@wbruna (Contributor) commented Feb 2, 2025

With Vulkan on my PC (Ryzen 5 3400G APU, DDR4-3000, Debian 12), I noticed big performance drops (~2x or ~3x) associated with buffer allocations on VRAM.

It's easier to test with stable-diffusion.cpp: the VAE step on a 512x512 sd1.5 generation usually takes around 40 seconds with the default 2G dedicated VRAM. But if I restrict VRAM to a very small value (64M-80M), that timing drops to around 13 seconds.

I noticed a similar performance drop on LLMs, but it's harder to pinpoint: for instance, prompt processing on smaller models running nearly twice as slow as on larger ones, performance changing right after a koboldcpp restart, or inconsistent results between benchmarks and generation.

Checking with GGML_VULKAN_MEMORY_DEBUG, the slower behavior seems to be always associated with allocations on device memory, so I added this env var to confirm. And forcing host memory allocations seems to fix the performance drop.

OTOH, I don't see the original performance issue on a 4500U laptop (Ubuntu 24.04, DDR4-3200), so this would benefit from testing on different iGPU+OS combinations.
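
A minimal sketch of how such a toggle could be wired up, assuming the variable is read once at device initialization. The variable name GGML_VK_PREFER_HOST_MEMORY and the struct below are illustrative assumptions; only the prefer_host_memory flag matches the code quoted later in the thread:

#include <cstdlib>

// Hypothetical sketch only: the environment variable name and the struct layout
// are assumptions for illustration; prefer_host_memory is the flag that appears
// in the snippets quoted later in this thread.
struct vk_device_sketch {
    bool prefer_host_memory = false;
    // ... rest of the device state ...
};

static void vk_device_read_memory_preference(vk_device_sketch & device) {
    // Treat any non-empty value as "on" (assumed convention).
    const char * env = std::getenv("GGML_VK_PREFER_HOST_MEMORY");
    device.prefer_host_memory = (env != nullptr && env[0] != '\0');
}

The allocation path can then branch on device.prefer_host_memory and request host-visible memory instead of device-local memory, which is what the timings above suggest helps on this APU.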

@github-actions github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Feb 2, 2025
@0cc4m 0cc4m self-requested a review February 3, 2025 09:04
@0cc4m (Collaborator) left a comment

Looks good to me. Thank you for the contribution!

@0cc4m 0cc4m merged commit b044a0f into ggml-org:master Feb 10, 2025
46 checks passed
tinglou pushed a commit to tinglou/llama.cpp that referenced this pull request Feb 13, 2025
orca-zhang pushed a commit to orca-zhang/llama.cpp that referenced this pull request Feb 26, 2025
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Feb 26, 2025
ubergarm pushed a commit to ubergarm/llama.cpp that referenced this pull request Mar 1, 2025
mglambda pushed a commit to mglambda/llama.cpp that referenced this pull request Mar 8, 2025
@Djip007 (Contributor) commented Apr 7, 2025

Nice, but might it not be more interesting to simply use the host buffer when device memory is requested?
That way is_host would return true, and we would reduce the copies needed when the CPU has to use the data.

i.e. something like:

// Suggestion: report the host buffer type as the device buffer type whenever
// prefer_host_memory is set (the flag would move from vk_device to the device context).
static ggml_backend_buffer_type_t ggml_backend_vk_device_get_buffer_type(ggml_backend_dev_t dev) {
    ggml_backend_vk_device_context * ctx = (ggml_backend_vk_device_context *)dev->context;
    if (ctx->prefer_host_memory) { // flag moved from vk_device to the device context ...
        return ggml_backend_vk_device_get_host_buffer_type(dev);
    }
    return ggml_backend_vk_buffer_type(ctx->device);
}

I did test this simple change => it does not work (crashes); it needs more changes to make it work.

@0cc4m (Collaborator) commented Apr 7, 2025

If you find changes that help on your device, go ahead and submit a PR. There are probably things that can be improved about how we handle UMA devices.

@Djip007 (Contributor) commented Apr 8, 2025

I tried some changes... but for now I need more time to see whether there is any gain.
With recent Linux kernels (>= 6.10) the AMD driver can use all of GTT RAM, so it is hard to find a gain over the default full offload...

For now, no real gain for LLMs on a Ryzen 7940HS.

(08/05/2025:)
I think I found some cases where there are gains: an old laptop with an AMD Radeon Vega 8 Graphics (Ryzen 3550H), and a Radeon RX 560 Series (RADV POLARIS11). But for now I keep crashing the laptop...

static vk_buffer ggml_vk_create_buffer_device(vk_device& device, size_t size) {
    vk_buffer buf;
    try {
        /*if (device->prefer_host_memory) {
            buf = ggml_vk_create_buffer(device, size,
             vk::MemoryPropertyFlagBits::eHostVisible | vk::MemoryPropertyFlagBits::eHostCoherent,
             vk::MemoryPropertyFlagBits::eDeviceLocal);
        } else */if (device->uma) {
            // actual (current) code:
            //buf = ggml_vk_create_buffer(device, size, vk::MemoryPropertyFlagBits::eDeviceLocal, vk::MemoryPropertyFlagBits::eHostVisible | vk::MemoryPropertyFlagBits::eHostCoherent);
            // test code: on UMA devices, request cached host memory first
            buf = ggml_vk_create_buffer(device, size,
            vk::MemoryPropertyFlagBits::eHostCached | vk::MemoryPropertyFlagBits::eHostVisible | vk::MemoryPropertyFlagBits::eHostCoherent,
            vk::MemoryPropertyFlagBits::eHostVisible | vk::MemoryPropertyFlagBits::eHostCoherent);
        } else {
            buf = ggml_vk_create_buffer(device, size,
             // actual (current) code uses this (device-local + host-visible, i.e. BAR memory):
             //  vk::MemoryPropertyFlagBits::eDeviceLocal | vk::MemoryPropertyFlagBits::eHostVisible | vk::MemoryPropertyFlagBits::eHostCoherent,
             // test code: plain device-local only
             vk::MemoryPropertyFlagBits::eDeviceLocal);
        }
    } catch (const vk::SystemError& e) {
        std::cerr << "ggml_vulkan: Device memory allocation of size " << size << " failed." << std::endl;
        std::cerr << "ggml_vulkan: " << e.what() << std::endl;
        throw e;
    }

    return buf;
}

I don't really understand the logic that was used. For me, without UMA, allocating with eDeviceLocal|eHostVisible|eHostCoherent is horrible on hardware with an "emulated" BAR. For example, I have:

ggml_vulkan: 0 = AMD Radeon RX 560 Series (RADV POLARIS11) (radv) | uma: 0 | fp16: 0 | warp size: 64 | shared memory: 65536 | matrix cores: none

| model | size | params | backend | ngl | test | actual code t/s | patch t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | pp1 | 14.85 ± 0.00 | 54.46 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | pp2 | 28.33 ± 0.00 | 88.13 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | pp4 | 49.97 ± 0.00 | 112.53 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | pp8 | 78.60 ± 0.00 | 121.21 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | pp16 | 132.05 ± 0.00 | 183.24 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | pp32 | 250.93 ± 0.00 | 331.17 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | pp64 | 323.37 ± 0.00 | 360.72 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | pp128 | 379.70 ± 0.00 | 423.15 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | pp256 | 417.00 ± 0.00 | 461.93 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | pp384 | 414.64 ± 0.00 | 453.03 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | pp512 | 224.93 ± 0.00 | 473.94 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | pp1024 | 207.79 ± 0.00 | 465.68 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | pp2048 | 180.25 ± 0.00 | 444.88 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | pp4096 | 378.38 ± 0.00 | 409.04 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | tg16 | 14.61 ± 0.00 | 50.87 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | tg256 | 14.57 ± 0.00 | 50.41 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | tg512 | 14.21 ± 0.00 | 49.06 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | tg1024 | 14.06 ± 0.00 | 47.64 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | tg2048 | 13.74 ± 0.00 | 45.81 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | pp4096+tg256 | 145.07 ± 0.00 | 255.31 ± 0.00 |

On a more recent GPU (AMD Radeon RX 6900 XT, 16.0 GiB) I see the same performance in both cases.

Does anyone know why it was done like that?

ggml_vulkan: 0 = AMD Radeon Vega 8 Graphics (RADV RAVEN) (radv) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 65536 | matrix cores: none

| model | backend | test | actual code t/s | PREFER_HOST t/s | this patch t/s |
| --- | --- | --- | --- | --- | --- |
| llama 1B Q8_0 | Vulkan | pp1 | 17.61 | 15.76 | 18.07 |
| llama 1B Q8_0 | Vulkan | pp2 | 27.05 | 28.99 | 35.75 |
| llama 1B Q8_0 | Vulkan | pp4 | 45.25 | 52.52 | 59.53 |
| llama 1B Q8_0 | Vulkan | pp8 | 72.21 | 75.35 | 80.91 |
| llama 1B Q8_0 | Vulkan | pp16 | 53.75 | 94.11 | 92.78 |
| llama 1B Q8_0 | Vulkan | pp32 | 102.53 | 180.88 | 184.13 |
| llama 1B Q8_0 | Vulkan | pp64 | 205.37 | 264.48 | 259.10 |
| llama 1B Q8_0 | Vulkan | pp128 | 217.26 | 301.16 | 304.47 |
| llama 1B Q8_0 | Vulkan | pp256 | 345.60 | 343.57 | 348.40 |
| llama 1B Q8_0 | Vulkan | pp384 | 349.22 | 345.43 | 349.46 |
| llama 1B Q8_0 | Vulkan | pp512 | 343.13 | 337.65 | 343.06 |
| llama 1B Q8_0 | Vulkan | pp1024 | 165.07 | 327.96 | 328.96 |
| llama 1B Q8_0 | Vulkan | pp2048 | 188.35 | 304.67 | 304.96 |
| llama 1B Q8_0 | Vulkan | pp4096 | 134.79 | 266.44 | 267.35 |
| llama 1B Q8_0 | Vulkan | tg16 | 19.79 | 17.59 | 20.06 |
| llama 1B Q8_0 | Vulkan | tg256 | 19.46 | 18.08 | 20.08 |
| llama 1B Q8_0 | Vulkan | tg512 | 19.17 | 17.44 | 20.10 |
| llama 1B Q8_0 | Vulkan | tg1024 | 18.86 | 16.93 | 19.45 |
| llama 1B Q8_0 | Vulkan | tg2048 | 17.69 | 16.18 | 18.45 |
| llama 1B Q8_0 | Vulkan | pp4096+tg256 | 84.89 | 121.80 | 129.11 |

(Note: the patch contains a little more than this part to reach that performance.) And I do not see the same gain on an AMD Radeon 780M (RADV PHOENIX)...

@Djip007 (Contributor) commented Apr 18, 2025

I need to find the time to publish my latest test case...
For iGPUs the results may differ for other uses of ggml; I only tested with llama.cpp.

Does anyone know why we use vk::MemoryPropertyFlagBits::eHostVisible | vk::MemoryPropertyFlagBits::eHostCoherent on dGPUs?

@0cc4m (Collaborator) commented Apr 19, 2025

We only use eHostVisible | eHostCoherent as staging buffers or pinned RAM on dGPUs, to speed up copies to the GPU. We try to use eDeviceLocal | eHostVisible | eHostCoherent if available, because that is BAR memory when resizable BAR is enabled, which also speeds up copies. If it is not available, we use just eDeviceLocal to access VRAM.
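
To make the three roles concrete, here is a rough sketch (not the backend's actual code) of how these memory types can be distinguished with vulkan.hpp; the role labels are descriptive only:

#include <vulkan/vulkan.hpp>
#include <cstdio>

// Sketch: classify a device's memory types into the roles described above.
// Uses only standard vulkan.hpp queries; the helper itself is not part of ggml.
static void print_memory_roles(vk::PhysicalDevice pd) {
    const vk::PhysicalDeviceMemoryProperties props = pd.getMemoryProperties();
    for (uint32_t i = 0; i < props.memoryTypeCount; ++i) {
        const vk::MemoryPropertyFlags flags = props.memoryTypes[i].propertyFlags;
        const double heap_gib = props.memoryHeaps[props.memoryTypes[i].heapIndex].size / 1073741824.0;
        const bool device_local = bool(flags & vk::MemoryPropertyFlagBits::eDeviceLocal);
        const bool host_visible = bool(flags & vk::MemoryPropertyFlagBits::eHostVisible);
        const char * role = "other";
        if (device_local && host_visible) {
            role = "host-visible VRAM (BAR window; covers all of VRAM only with ReBAR)";
        } else if (device_local) {
            role = "plain VRAM (eDeviceLocal only)";
        } else if (host_visible) {
            role = "host RAM, usable as staging / pinned buffers";
        }
        std::printf("memory type %u: heap %.1f GiB, %s\n", i, heap_gib, role);
    }
}

On a dGPU without ReBAR the device-local + host-visible heap is typically only a 256 MiB window, which is why the rest of the thread turns on whether that combination should be requested for large weight buffers.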

@Djip007 (Contributor) commented Apr 20, 2025

The fact that it is available does not mean it is faster...

And for now eHostVisible | eHostCoherent is requested (together with eDeviceLocal) for weight tensors here:

if (device->prefer_host_memory) {
    buf = ggml_vk_create_buffer(device, size, vk::MemoryPropertyFlagBits::eHostVisible | vk::MemoryPropertyFlagBits::eHostCoherent, vk::MemoryPropertyFlagBits::eDeviceLocal);
} else if (device->uma) {
    // Fall back to host memory type
    buf = ggml_vk_create_buffer(device, size, vk::MemoryPropertyFlagBits::eDeviceLocal, vk::MemoryPropertyFlagBits::eHostVisible | vk::MemoryPropertyFlagBits::eHostCoherent);
} else {
    // use rebar if available, otherwise fallback to device only visible memory
    buf = ggml_vk_create_buffer(device, size, vk::MemoryPropertyFlagBits::eDeviceLocal | vk::MemoryPropertyFlagBits::eHostVisible | vk::MemoryPropertyFlagBits::eHostCoherent, vk::MemoryPropertyFlagBits::eDeviceLocal);
}

@0cc4m (Collaborator) commented Apr 21, 2025

It always uses device-local for weight tensors, unless you force prefer_host_memory for the reasons shown in this PR. I'm not sure what you are trying to say.

@Djip007 (Contributor) commented May 28, 2025

     buf = ggml_vk_create_buffer(device, size, vk::MemoryPropertyFlagBits::eDeviceLocal | vk::MemoryPropertyFlagBits::eHostVisible | vk::MemoryPropertyFlagBits::eHostCoherent, vk::MemoryPropertyFlagBits::eDeviceLocal); 

You first ask for memory that is on the device but visible/synchronized from the host. Why? Weights don't need host access. And I have devices that can provide such memory, but at the cost of a big loss of performance.

Why is it defined like that?

@0cc4m (Collaborator) commented Jun 3, 2025

See #9251. It enables faster memory transfers if a large BAR space is available through ReBAR.

@Djip007 (Contributor) commented Jun 3, 2025

Yes, it can be faster with ReBAR, but it is horrible on systems that allow this configuration without ReBAR, as you can see in my test with:

ggml_vulkan: 0 = AMD Radeon RX 560 Series (RADV POLARIS11) (radv) | uma: 0 | fp16: 0 | warp size: 64 | shared memory: 65536 | matrix cores: none

For me the "problem" is assuming that only systems with ReBAR can create a buffer with eDeviceLocal | eHostVisible | eHostCoherent, or that this memory is always faster (?).

In my case I get roughly a 3x speedup going from eDeviceLocal | eHostVisible | eHostCoherent to plain eDeviceLocal.
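
One possible mitigation, sketched below purely as an illustration (the heap-size gate and the 256 MiB threshold are assumptions, not current backend behavior): only request eDeviceLocal | eHostVisible | eHostCoherent for weights when the heap backing that memory type is large enough to plausibly be a resized BAR, and otherwise fall back to plain eDeviceLocal.

#include <vulkan/vulkan.hpp>

// Sketch of a heap-size gate: prefer the host-visible VRAM type only when its
// backing heap is larger than the classic 256 MiB BAR window, i.e. when ReBAR
// is likely active. The threshold is an assumption for illustration.
static bool host_visible_vram_looks_like_rebar(vk::PhysicalDevice pd) {
    const vk::PhysicalDeviceMemoryProperties props = pd.getMemoryProperties();
    const vk::MemoryPropertyFlags wanted =
        vk::MemoryPropertyFlagBits::eDeviceLocal |
        vk::MemoryPropertyFlagBits::eHostVisible |
        vk::MemoryPropertyFlagBits::eHostCoherent;
    constexpr vk::DeviceSize small_bar_limit = 256ull * 1024 * 1024;
    for (uint32_t i = 0; i < props.memoryTypeCount; ++i) {
        if ((props.memoryTypes[i].propertyFlags & wanted) != wanted) {
            continue; // this type does not offer host-visible VRAM
        }
        if (props.memoryHeaps[props.memoryTypes[i].heapIndex].size > small_bar_limit) {
            return true; // large host-visible VRAM heap: treat as ReBAR
        }
    }
    return false; // no (or only a small) BAR window: stick to plain eDeviceLocal
}

Whether such a gate actually helps on POLARIS11-class cards would need the benchmarks and vulkaninfo output requested below.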

@0cc4m (Collaborator) commented Jun 3, 2025

Please open an issue about it and provide benchmarks and the output of vulkaninfo for your configuration.
