Add checks for buffer size with Metal #1706
Conversation
Thanks for making this clearer; I was getting the nil error myself. It looks like we have very similar machines: I have an M1 Max with 32 GB of RAM. If I have 32 GB of RAM, why can't the 18 GB model fit, and why is the buffer maximum reported as 17 GB? Is there a way to run 30B models on my M1 Max with 32 GB? Will future changes allow me to use the full 32 GB?
@johnrtipton I believe MTLDevice.maxBufferLength always returns 1/2 the size of total RAM.
Interesting. So we could only utilise half of the system RAM for inference?
See the discussion in #1696 (comment) for a potential fix.
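If you want to verify that figure on your own machine, here is a minimal Objective-C sketch (not part of this PR) that queries MTLDevice.maxBufferLength on the default device:

```objc
#import <Metal/Metal.h>
#import <stdio.h>

int main(void) {
    // Grab the same device llama.cpp's Metal backend would use.
    id<MTLDevice> device = MTLCreateSystemDefaultDevice();
    if (device == nil) {
        fprintf(stderr, "no Metal device available\n");
        return 1;
    }
    // On Apple Silicon this is typically about half of total system RAM.
    printf("maxBufferLength = %lu bytes (%.2f GiB)\n",
           (unsigned long) device.maxBufferLength,
           device.maxBufferLength / (1024.0 * 1024.0 * 1024.0));
    return 0;
}
```

Build with `clang example.m -framework Metal -framework Foundation` and run; on a 32 GB M1 Max the printed limit works out to 17179869184 bytes, matching the "buffer maximum" in the log below.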
This is on a 27" iMac with 128 GB of RAM and an AMD Radeon Pro 5700 XT (16 GB), built with Metal ('MPS', i.e. LLAMA_METAL=1 make).
If llama.cpp tries to allocate a Metal buffer that is larger than the device maximum, it only prints a message that the allocation failed. Inference then fails on every access, emitting the error 'ggml_metal_get_buffer: error: buffer is nil' endlessly.
This PR adds a check against the maximum buffer size, checks for a false return value from ggml_metal_add_buffer in general, and propagates the error from llama_init_from_file by returning NULL.
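A rough sketch of the checks this adds (simplified and paraphrased; the exact variable names in ggml-metal.m and llama.cpp may differ from the patch):

```objc
// In ggml_metal_add_buffer (ggml-metal.m): refuse oversized allocations
// up front instead of letting newBufferWithBytesNoCopy return nil silently.
// 'size_aligned' stands in for the page-aligned allocation size.
if (size_aligned > ctx->device.maxBufferLength) {
    fprintf(stderr, "%s: buffer '%s' size %ld is larger than buffer maximum of %ld\n",
            __func__, name, (long) size_aligned, (long) ctx->device.maxBufferLength);
    return false;
}

// In llama_init_from_file (llama.cpp): treat a false return as fatal and
// propagate the failure to the caller by returning NULL.
// ('data_ptr'/'data_size' are placeholders for the real arguments.)
if (!ggml_metal_add_buffer(ctx->ctx_metal, "data", data_ptr, data_size)) {
    fprintf(stderr, "%s: failed to add buffer\n", __func__);
    llama_free(ctx);
    return NULL;
}
```

With these checks the failure surfaces once at load time as a clear error, as shown in the "New behavior" log below, rather than flooding the inference output with nil-buffer messages.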
Existing behavior:

```
main: build = 622 (f4c55d3)
main: seed = 1
llama.cpp: loading model from /Users/spencer/ai/models/WizardLM-30B-Uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 6656
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 52
llama_model_load_internal: n_layer = 60
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 17920
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 17452.67 MB
llama_model_load_internal: mem required = 2532.68 MB (+ 3124.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size = 780.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/spencer/ai/repos/llama.cpp/build/bin/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x141e08850
ggml_metal_init: loaded kernel_mul 0x141e08e50
ggml_metal_init: loaded kernel_mul_row 0x141e09480
ggml_metal_init: loaded kernel_scale 0x141e099a0
ggml_metal_init: loaded kernel_silu 0x141e09ec0
ggml_metal_init: loaded kernel_relu 0x141e0a3e0
ggml_metal_init: loaded kernel_soft_max 0x141e0aa90
ggml_metal_init: loaded kernel_diag_mask_inf 0x141e0b0f0
ggml_metal_init: loaded kernel_get_rows_q4_0 0x141e0b770
ggml_metal_init: loaded kernel_rms_norm 0x141e0be20
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x141e0c680
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x141e0d050
ggml_metal_init: loaded kernel_rope 0x141e0d940
ggml_metal_init: loaded kernel_cpy_f32_f16 0x141e0e1d0
ggml_metal_init: loaded kernel_cpy_f32_f32 0x141e0ea60
ggml_metal_add_buffer: failed to allocate 'data ' buffer, size = 17452.67 MB
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 1280.00 MB
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 782.00 MB
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 512.00 MB
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 512.00 MB
system_info: n_threads = 6 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0
USER: Write a one paragraph summary of what happened in 1918. ASSISTANT:Inggml_metal_get_buffer: error: buffer is nil
ggml_metal_get_buffer: error: buffer is nil
ggml_metal_get_buffer: error: buffer is nil
ggml_metal_get_buffer: error: buffer is nil
ggml_metal_get_buffer: error: buffer is nil
ggml_metal_get_buffer: error: buffer is nil
...
```
New behavior:

```
main: build = 622 (827fd74)
main: seed = 1
llama.cpp: loading model from /Users/spencer/ai/models/WizardLM-30B-Uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 6656
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 52
llama_model_load_internal: n_layer = 60
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 17920
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 17452.67 MB
llama_model_load_internal: mem required = 2532.68 MB (+ 3124.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size = 780.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/spencer/ai/repos/llama.cpp/build/bin/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x11de07b40
ggml_metal_init: loaded kernel_mul 0x11de08140
ggml_metal_init: loaded kernel_mul_row 0x11de08660
ggml_metal_init: loaded kernel_scale 0x11de08b80
ggml_metal_init: loaded kernel_silu 0x11de090a0
ggml_metal_init: loaded kernel_relu 0x11de095c0
ggml_metal_init: loaded kernel_soft_max 0x11de09c70
ggml_metal_init: loaded kernel_diag_mask_inf 0x11de0a2d0
ggml_metal_init: loaded kernel_get_rows_q4_0 0x11de0a950
ggml_metal_init: loaded kernel_rms_norm 0x11de0b000
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x11de0b860
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x11de0c230
ggml_metal_init: loaded kernel_rope 0x11de0cb20
ggml_metal_init: loaded kernel_cpy_f32_f16 0x11de0d3b0
ggml_metal_init: loaded kernel_cpy_f32_f32 0x11de0dc40
ggml_metal_add_buffer: buffer 'data' size 18300452864 is larger than buffer maximum of 17179869184
llama_init_from_file: failed to add buffer
llama_init_from_gpt_params: error: failed to load model '/Users/spencer/ai/models/WizardLM-30B-Uncensored.ggmlv3.q4_0.bin'
main: error: unable to load model
```