-
Hi,
Please note that I don't know what parameters I should use to get good performance.
-
What you did looks correct; I think it is just that your GPU is very old, slow, and has very little VRAM. You can try offloading a few more layers to it.
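As a rough sketch of what that could look like (the model path here is hypothetical, and the exact flags depend on your llama.cpp version and how it was built):

./main -m ./models/your-model-q4_0.bin --n-gpu-layers 10 -p "Hello"

Each offloaded layer shifts more work to the GPU but also consumes more VRAM, so on a 4 GB card increase the value gradually.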
-
...128-bit, DDR5, 80 GB/s, 4 GB - I don't know what you expect from such a card... my RAM is faster ;)
system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | <- your CPU even lacks AVX
llama_model_load_internal: total VRAM used: 550 MB <- you only used 550 MB of VRAM; you can try --n-gpu-layers 10 or even 20
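One way to check whether a higher layer count still fits (model path hypothetical; the exact log wording can differ between llama.cpp versions) is to rerun and watch the VRAM line in the load log:

./main -m ./models/your-model-q4_0.bin --n-gpu-layers 20 -p "test" 2>&1 | grep -i vram

If the reported total VRAM used stays comfortably below 4 GB, there is room to offload more layers; if loading fails, step the value back down.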
-
I can also add that I have tested the same model under Ubuntu 22.04 (via WSL on Windows 10 Pro) with CPU-only support and compared it to a CPU-only Release build done in Visual Studio 2022... performance is 10 times worse on native Windows than with the Linux build via WSL. So it seems that testing LLAMA2 under Windows does not make much sense.
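To make that comparison apples-to-apples, one option (assuming both builds print the usual llama.cpp timing summary at the end of a run) is to execute the identical command in WSL and in the native Windows build, for example:

./main -m ./models/your-model-q4_0.bin -t 4 -n 128 -p "Benchmark prompt"

and then compare the llama_print_timings lines (prompt eval time and eval time, in ms per token). Keeping -t, -n, and the prompt identical rules out thread-count and prompt-length differences between the two runs.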
-
It is slower under Windows because, for some reason, your Windows build is not using AVX2 CPU instructions.
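If that is the case, a rebuild with AVX2 explicitly enabled might help; a minimal sketch, assuming a llama.cpp checkout where the CMake options are named LLAMA_AVX2 and LLAMA_FMA (other versions may use different option names):

cmake -B build -DLLAMA_AVX2=ON -DLLAMA_FMA=ON
cmake --build build --config Release

After rebuilding, the system_info line printed at startup should report AVX2 = 1 instead of 0; note that AVX2 only helps if the CPU actually supports it, which the system_info output quoted earlier suggests may not be the case here.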