
vulkan: add environment variable to avoid VRAM allocation #11592


Merged: 1 commit merged into ggml-org:master on Feb 10, 2025

Conversation

@wbruna (Contributor) commented Feb 2, 2025

With Vulkan on my PC (Ryzen 5 3400G APU, DDR4-3000, Debian 12), I noticed big performance drops (~2x or ~3x) associated with buffer allocations on VRAM.

It's easier to test with stable-diffusion.cpp: the VAE step on a 512x512 sd1.5 generation usually takes around 40 seconds with the default 2G dedicated VRAM. But if I restrict VRAM to a very small value (64M-80M), that timing drops to around 13 seconds.

I noticed a similar performance drop on LLMs, but it's harder to pinpoint: for instance, prompt processing on smaller models running nearly twice as slow as on larger ones, performance changing right after a koboldcpp restart, or inconsistent results between benchmarks and generation.

Checking with GGML_VULKAN_MEMORY_DEBUG, the slower behavior seems to be always associated with allocations on device memory, so I added this env var to confirm. And forcing host memory allocations seems to fix the performance drop.

OTOH, I don't see the original performance issue on a 4500U laptop (Ubuntu 24.04, DDR4-3200), so this would benefit from testing on different iGPU+OS combinations.
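
A minimal sketch of how such a toggle could be wired up, assuming the variable is read once at device initialization. The variable name GGML_VK_PREFER_HOST_MEMORY and the struct below are illustrative assumptions; only the prefer_host_memory flag matches the code quoted later in the thread:

#include <cstdlib>

// Hypothetical sketch only: the environment variable name and the struct layout
// are assumptions for illustration; prefer_host_memory is the flag that appears
// in the snippets quoted later in this thread.
struct vk_device_sketch {
    bool prefer_host_memory = false;
    // ... rest of the device state ...
};

static void vk_device_read_memory_preference(vk_device_sketch & device) {
    // Treat any non-empty value as "on" (assumed convention).
    const char * env = std::getenv("GGML_VK_PREFER_HOST_MEMORY");
    device.prefer_host_memory = (env != nullptr && env[0] != '\0');
}

The allocation path can then branch on device.prefer_host_memory and request host-visible memory instead of device-local memory, which is what the timings above suggest helps on this APU.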

@github-actions github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Feb 2, 2025
@0cc4m 0cc4m self-requested a review February 3, 2025 09:04
@0cc4m (Collaborator) left a comment

Looks good to me. Thank you for the contribution!

@0cc4m 0cc4m merged commit b044a0f into ggml-org:master Feb 10, 2025
46 checks passed
tinglou pushed a commit to tinglou/llama.cpp that referenced this pull request Feb 13, 2025
orca-zhang pushed a commit to orca-zhang/llama.cpp that referenced this pull request Feb 26, 2025
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Feb 26, 2025
ubergarm pushed a commit to ubergarm/llama.cpp that referenced this pull request Mar 1, 2025
mglambda pushed a commit to mglambda/llama.cpp that referenced this pull request Mar 8, 2025
@Djip007 (Contributor) commented Apr 7, 2025

Nice, but might it not be more interesting to simply use the host buffer when device memory is requested?
That way is_host would return true, and we would reduce the copies needed when the CPU has to use the data.

i.e. something like:

// Suggestion: report the host buffer type as the device buffer type whenever
// prefer_host_memory is set (the flag would move from vk_device to the device context).
static ggml_backend_buffer_type_t ggml_backend_vk_device_get_buffer_type(ggml_backend_dev_t dev) {
    ggml_backend_vk_device_context * ctx = (ggml_backend_vk_device_context *)dev->context;
    if (ctx->prefer_host_memory) { // flag moved from vk_device to the device context ...
        return ggml_backend_vk_device_get_host_buffer_type(dev);
    }
    return ggml_backend_vk_buffer_type(ctx->device);
}

I did test this simple change => it does not work (crashes); it needs more changes to make it work.

@0cc4m (Collaborator) commented Apr 7, 2025

If you find changes that help on your device, go ahead and submit a PR. There are probably things that can be improved about how we handle UMA devices.

@Djip007 (Contributor) commented Apr 8, 2025

I tried some changes... but for now I need more time to see whether there is any gain.
With recent Linux kernels (>= 6.10) the AMD driver can use all of GTT RAM, so it is hard to find a gain over the default full offload...

For now, no real gain for LLMs on a Ryzen 7940HS.

(08/05/2025:)
I think I found some cases where there are gains: an old laptop with an AMD Radeon Vega 8 Graphics (Ryzen 3550H), and a Radeon RX 560 Series (RADV POLARIS11). But for now I keep crashing the laptop...

static vk_buffer ggml_vk_create_buffer_device(vk_device& device, size_t size) {
    vk_buffer buf;
    try {
        /*if (device->prefer_host_memory) {
            buf = ggml_vk_create_buffer(device, size,
             vk::MemoryPropertyFlagBits::eHostVisible | vk::MemoryPropertyFlagBits::eHostCoherent,
             vk::MemoryPropertyFlagBits::eDeviceLocal);
        } else */if (device->uma) {
            // actual (current) code:
            //buf = ggml_vk_create_buffer(device, size, vk::MemoryPropertyFlagBits::eDeviceLocal, vk::MemoryPropertyFlagBits::eHostVisible | vk::MemoryPropertyFlagBits::eHostCoherent);
            // test code: on UMA devices, request cached host memory first
            buf = ggml_vk_create_buffer(device, size,
            vk::MemoryPropertyFlagBits::eHostCached | vk::MemoryPropertyFlagBits::eHostVisible | vk::MemoryPropertyFlagBits::eHostCoherent,
            vk::MemoryPropertyFlagBits::eHostVisible | vk::MemoryPropertyFlagBits::eHostCoherent);
        } else {
            buf = ggml_vk_create_buffer(device, size,
             // actual (current) code uses this (device-local + host-visible, i.e. BAR memory):
             //  vk::MemoryPropertyFlagBits::eDeviceLocal | vk::MemoryPropertyFlagBits::eHostVisible | vk::MemoryPropertyFlagBits::eHostCoherent,
             // test code: plain device-local only
             vk::MemoryPropertyFlagBits::eDeviceLocal);
        }
    } catch (const vk::SystemError& e) {
        std::cerr << "ggml_vulkan: Device memory allocation of size " << size << " failed." << std::endl;
        std::cerr << "ggml_vulkan: " << e.what() << std::endl;
        throw e;
    }

    return buf;
}

I don't really understand the logic that was used. For me, without UMA, allocating with eDeviceLocal|eHostVisible|eHostCoherent is horrible on hardware with an "emulated" BAR. For example, I have:

ggml_vulkan: 0 = AMD Radeon RX 560 Series (RADV POLARIS11) (radv) | uma: 0 | fp16: 0 | warp size: 64 | shared memory: 65536 | matrix cores: none

| model | size | params | backend | ngl | test | actual code t/s | patch t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | pp1 | 14.85 ± 0.00 | 54.46 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | pp2 | 28.33 ± 0.00 | 88.13 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | pp4 | 49.97 ± 0.00 | 112.53 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | pp8 | 78.60 ± 0.00 | 121.21 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | pp16 | 132.05 ± 0.00 | 183.24 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | pp32 | 250.93 ± 0.00 | 331.17 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | pp64 | 323.37 ± 0.00 | 360.72 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | pp128 | 379.70 ± 0.00 | 423.15 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | pp256 | 417.00 ± 0.00 | 461.93 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | pp384 | 414.64 ± 0.00 | 453.03 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | pp512 | 224.93 ± 0.00 | 473.94 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | pp1024 | 207.79 ± 0.00 | 465.68 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | pp2048 | 180.25 ± 0.00 | 444.88 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | pp4096 | 378.38 ± 0.00 | 409.04 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | tg16 | 14.61 ± 0.00 | 50.87 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | tg256 | 14.57 ± 0.00 | 50.41 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | tg512 | 14.21 ± 0.00 | 49.06 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | tg1024 | 14.06 ± 0.00 | 47.64 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | tg2048 | 13.74 ± 0.00 | 45.81 ± 0.00 |
| llama 1B Q8_0 | 1.22 GiB | 1.24 B | Vulkan | 999 | pp4096+tg256 | 145.07 ± 0.00 | 255.31 ± 0.00 |

On a more recent GPU (AMD Radeon RX 6900 XT, 16.0 GiB) I see the same performance in both cases.

Does anyone know why it was done like that?

ggml_vulkan: 0 = AMD Radeon Vega 8 Graphics (RADV RAVEN) (radv) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 65536 | matrix cores: none

| model | backend | test | actual code t/s | PREFER_HOST t/s | this patch t/s |
| --- | --- | --- | --- | --- | --- |
| llama 1B Q8_0 | Vulkan | pp1 | 17.61 | 15.76 | 18.07 |
| llama 1B Q8_0 | Vulkan | pp2 | 27.05 | 28.99 | 35.75 |
| llama 1B Q8_0 | Vulkan | pp4 | 45.25 | 52.52 | 59.53 |
| llama 1B Q8_0 | Vulkan | pp8 | 72.21 | 75.35 | 80.91 |
| llama 1B Q8_0 | Vulkan | pp16 | 53.75 | 94.11 | 92.78 |
| llama 1B Q8_0 | Vulkan | pp32 | 102.53 | 180.88 | 184.13 |
| llama 1B Q8_0 | Vulkan | pp64 | 205.37 | 264.48 | 259.10 |
| llama 1B Q8_0 | Vulkan | pp128 | 217.26 | 301.16 | 304.47 |
| llama 1B Q8_0 | Vulkan | pp256 | 345.60 | 343.57 | 348.40 |
| llama 1B Q8_0 | Vulkan | pp384 | 349.22 | 345.43 | 349.46 |
| llama 1B Q8_0 | Vulkan | pp512 | 343.13 | 337.65 | 343.06 |
| llama 1B Q8_0 | Vulkan | pp1024 | 165.07 | 327.96 | 328.96 |
| llama 1B Q8_0 | Vulkan | pp2048 | 188.35 | 304.67 | 304.96 |
| llama 1B Q8_0 | Vulkan | pp4096 | 134.79 | 266.44 | 267.35 |
| llama 1B Q8_0 | Vulkan | tg16 | 19.79 | 17.59 | 20.06 |
| llama 1B Q8_0 | Vulkan | tg256 | 19.46 | 18.08 | 20.08 |
| llama 1B Q8_0 | Vulkan | tg512 | 19.17 | 17.44 | 20.10 |
| llama 1B Q8_0 | Vulkan | tg1024 | 18.86 | 16.93 | 19.45 |
| llama 1B Q8_0 | Vulkan | tg2048 | 17.69 | 16.18 | 18.45 |
| llama 1B Q8_0 | Vulkan | pp4096+tg256 | 84.89 | 121.80 | 129.11 |

(Note: the patch contains a little more than this part to reach that performance.) And I do not see the same gain on an AMD Radeon 780M (RADV PHOENIX)...

@Djip007 (Contributor) commented Apr 18, 2025

I need to find the time to publish my latest test case...
For iGPUs the results may differ for other uses of ggml; I only tested with llama.cpp.

Does anyone know why we use vk::MemoryPropertyFlagBits::eHostVisible | vk::MemoryPropertyFlagBits::eHostCoherent on dGPUs?

@0cc4m (Collaborator) commented Apr 19, 2025

We only use eHostVisible | eHostCoherent as staging buffers or pinned RAM on dGPUs, to speed up copies to the GPU. We try to use eDeviceLocal | eHostVisible | eHostCoherent if available, because that is BAR memory when resizable BAR is enabled, which also speeds up copies. If it is not available, we use just eDeviceLocal to access VRAM.
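
To make the three roles concrete, here is a rough sketch (not the backend's actual code) of how these memory types can be distinguished with vulkan.hpp; the role labels are descriptive only:

#include <vulkan/vulkan.hpp>
#include <cstdio>

// Sketch: classify a device's memory types into the roles described above.
// Uses only standard vulkan.hpp queries; the helper itself is not part of ggml.
static void print_memory_roles(vk::PhysicalDevice pd) {
    const vk::PhysicalDeviceMemoryProperties props = pd.getMemoryProperties();
    for (uint32_t i = 0; i < props.memoryTypeCount; ++i) {
        const vk::MemoryPropertyFlags flags = props.memoryTypes[i].propertyFlags;
        const double heap_gib = props.memoryHeaps[props.memoryTypes[i].heapIndex].size / 1073741824.0;
        const bool device_local = bool(flags & vk::MemoryPropertyFlagBits::eDeviceLocal);
        const bool host_visible = bool(flags & vk::MemoryPropertyFlagBits::eHostVisible);
        const char * role = "other";
        if (device_local && host_visible) {
            role = "host-visible VRAM (BAR window; covers all of VRAM only with ReBAR)";
        } else if (device_local) {
            role = "plain VRAM (eDeviceLocal only)";
        } else if (host_visible) {
            role = "host RAM, usable as staging / pinned buffers";
        }
        std::printf("memory type %u: heap %.1f GiB, %s\n", i, heap_gib, role);
    }
}

On a dGPU without ReBAR the device-local + host-visible heap is typically only a 256 MiB window, which is why the rest of the thread turns on whether that combination should be requested for large weight buffers.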

@Djip007 (Contributor) commented Apr 20, 2025

The fact that it is available does not mean it is faster...

And for now eHostVisible | eHostCoherent is requested (together with eDeviceLocal) for weight tensors here:

if (device->prefer_host_memory) {
    buf = ggml_vk_create_buffer(device, size, vk::MemoryPropertyFlagBits::eHostVisible | vk::MemoryPropertyFlagBits::eHostCoherent, vk::MemoryPropertyFlagBits::eDeviceLocal);
} else if (device->uma) {
    // Fall back to host memory type
    buf = ggml_vk_create_buffer(device, size, vk::MemoryPropertyFlagBits::eDeviceLocal, vk::MemoryPropertyFlagBits::eHostVisible | vk::MemoryPropertyFlagBits::eHostCoherent);
} else {
    // use rebar if available, otherwise fallback to device only visible memory
    buf = ggml_vk_create_buffer(device, size, vk::MemoryPropertyFlagBits::eDeviceLocal | vk::MemoryPropertyFlagBits::eHostVisible | vk::MemoryPropertyFlagBits::eHostCoherent, vk::MemoryPropertyFlagBits::eDeviceLocal);
}

@0cc4m (Collaborator) commented Apr 21, 2025

It always uses device-local for weight tensors, unless you force prefer_host_memory for the reasons shown in this PR. I'm not sure what you are trying to say.

@Djip007 (Contributor) commented May 28, 2025

     buf = ggml_vk_create_buffer(device, size, vk::MemoryPropertyFlagBits::eDeviceLocal | vk::MemoryPropertyFlagBits::eHostVisible | vk::MemoryPropertyFlagBits::eHostCoherent, vk::MemoryPropertyFlagBits::eDeviceLocal); 

You first ask for memory that is on the device but visible/synchronized from the host. Why? Weights don't need host access. And I have devices that can provide such memory, but at the cost of a big loss of performance.

Why is it defined like that?

@0cc4m (Collaborator) commented Jun 3, 2025

See #9251. It enables faster memory transfers if a large BAR space is available through ReBAR.

@Djip007 (Contributor) commented Jun 3, 2025

Yes, it can be faster with ReBAR, but it is horrible on systems that allow this configuration without ReBAR, as you can see in my test with:

ggml_vulkan: 0 = AMD Radeon RX 560 Series (RADV POLARIS11) (radv) | uma: 0 | fp16: 0 | warp size: 64 | shared memory: 65536 | matrix cores: none

For me the "problem" is assuming that only systems with ReBAR can create a buffer with eDeviceLocal | eHostVisible | eHostCoherent, or that this memory is always faster (?).

In my case I get roughly a 3x speedup going from eDeviceLocal | eHostVisible | eHostCoherent to plain eDeviceLocal.
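
One possible mitigation, sketched below purely as an illustration (the heap-size gate and the 256 MiB threshold are assumptions, not current backend behavior): only request eDeviceLocal | eHostVisible | eHostCoherent for weights when the heap backing that memory type is large enough to plausibly be a resized BAR, and otherwise fall back to plain eDeviceLocal.

#include <vulkan/vulkan.hpp>

// Sketch of a heap-size gate: prefer the host-visible VRAM type only when its
// backing heap is larger than the classic 256 MiB BAR window, i.e. when ReBAR
// is likely active. The threshold is an assumption for illustration.
static bool host_visible_vram_looks_like_rebar(vk::PhysicalDevice pd) {
    const vk::PhysicalDeviceMemoryProperties props = pd.getMemoryProperties();
    const vk::MemoryPropertyFlags wanted =
        vk::MemoryPropertyFlagBits::eDeviceLocal |
        vk::MemoryPropertyFlagBits::eHostVisible |
        vk::MemoryPropertyFlagBits::eHostCoherent;
    constexpr vk::DeviceSize small_bar_limit = 256ull * 1024 * 1024;
    for (uint32_t i = 0; i < props.memoryTypeCount; ++i) {
        if ((props.memoryTypes[i].propertyFlags & wanted) != wanted) {
            continue; // this type does not offer host-visible VRAM
        }
        if (props.memoryHeaps[props.memoryTypes[i].heapIndex].size > small_bar_limit) {
            return true; // large host-visible VRAM heap: treat as ReBAR
        }
    }
    return false; // no (or only a small) BAR window: stick to plain eDeviceLocal
}

Whether such a gate actually helps on POLARIS11-class cards would need the benchmarks and vulkaninfo output requested below.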

@0cc4m (Collaborator) commented Jun 3, 2025

Please open an issue about it and provide benchmarks and the output of vulkaninfo for your configuration.
