-
That is because they are still trying to figure out how to allocate more than half of the physical memory for Metal.
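For a rough sense of whether that limit bites here, a back-of-envelope check (my own illustration, not llama.cpp code) is to sum the Metal buffer sizes reported in the trace below and compare them against half of physical RAM:

#include <stdio.h>
#include <stdint.h>
#include <sys/sysctl.h>  // macOS-only: sysctlbyname

int main(void) {
    // Query physical RAM (hw.memsize) on macOS.
    uint64_t mem_bytes = 0;
    size_t len = sizeof(mem_bytes);
    sysctlbyname("hw.memsize", &mem_bytes, &len, NULL, 0);

    // Metal buffer sizes from the trace below (MB): data + eval + kv + scr0 + scr1.
    double need_mb = 3745.52 + 768.0 + 1026.0 + 512.0 + 512.0;
    // Assumed cap of roughly half of RAM, per the comment above.
    double cap_mb = (double)mem_bytes / (1024.0 * 1024.0) / 2.0;

    printf("need ~%.0f MB, cap ~%.0f MB -> %s\n", need_mb, cap_mb,
           need_mb <= cap_mb ? "should fit" : "exceeds the cap");
    return 0;
}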
-
I tried to load and run inference on llama-7b merged with a Chinese LoRA adapter in CLI mode, but got different completions with and without the '-ngl 1' option.
Output without the '-ngl' option:
> Capital of United States
Washington, D.C.
>
The corrupted output with '-ngl 1':
> Capital of United States
W胥其中包括zat droit胥胥其中包括胥胥胥其中包括胥胥胥胥胥zat胥胥zatstronom胥zat胥胥zat胥 varying胥其中包括胥胥胥胥胥胥zat胥胥胥其中包括其中包括胥胥胥胥胥胥胥胥胥胥胥 droit胥其中包括胥胥zatzat胥 varying其中包括胥胥胥胥avanozat胥胥胥胥胥胥zat胥胥胥其中包括胥胥胥其中包括胥胥胥zat胥胥胥胥胥zat胥胥胥胥胥其中包括胥胥胥胥胥胥胥胥胥其中包括zat胥胥胥zat胥胥胥胥其中包括胥zat胥胥胥胥 droit胥其中包括 varying胥其中包括zat胥胥胥胥胥胥胥avanozat胥胥其中包括胥胥胥胥胥其中包括胥胥胥胥胥胥胥胥其中包括胥zat胥胥其中包括胥胥胥zat其中包括其中包括胥胥胥胥胥胥胥胥其中包括胥其中包括胥胥胥胥zat胥其中包括胥其中包括胥zatzat胥胥其中包括胥胥 droit其中包括胥其中包括其中包括胥zat胥其中包括胥zat胥其中包括诙胥胥胥 varying胥zat其中包括胥胥胥胥胥其中包括胥胥胥其中包括胥胥胥其中包括胥胥zat胥胥胥胥 droitzat droit胥胥其中包括zat droit
>
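For context, the same two configurations can be set up programmatically. Here is a minimal sketch against the C API of this build (llama_init_from_file is the same loader shown in the trace below; the n_gpu_layers field is an assumption based on llama.h around build 669):

#include "llama.h"
#include <stdio.h>

int main(void) {
    struct llama_context_params params = llama_context_default_params();
    params.n_ctx = 2048;
    params.n_gpu_layers = 1;  // 0 reproduces the CPU-only run; 1 matches '-ngl 1'

    // Same entry point that prints "llama_init_from_file" in the trace below.
    struct llama_context *ctx = llama_init_from_file(
        "zh-models/7B/ggml-model-q4_0.bin", params);
    if (ctx == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    // ... tokenize the prompt and run llama_eval() as usual ...

    llama_free(ctx);
    return 0;
}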
Thanks in advance for any help.
The following is the model loading trace; does it look good?
main: build = 669 (9254920)
main: seed = 1686735230
llama.cpp: loading model from zh-models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 49954
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: mem required = 5536.92 MB (+ 1026.00 MB per state)
...............................................................................................
llama_init_from_file: kv self size = 1024.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/wujianmin/bak-from-mac/Code/git/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x132904f10
ggml_metal_init: loaded kernel_mul 0x145e0aba0
ggml_metal_init: loaded kernel_mul_row 0x145e0b1e0
ggml_metal_init: loaded kernel_scale 0x145e0b700
ggml_metal_init: loaded kernel_silu 0x145e0bc20
ggml_metal_init: loaded kernel_relu 0x145e0c140
ggml_metal_init: loaded kernel_gelu 0x145e0c660
ggml_metal_init: loaded kernel_soft_max 0x145e0cd10
ggml_metal_init: loaded kernel_diag_mask_inf 0x145e0d370
ggml_metal_init: loaded kernel_get_rows_f16 0x132905790
ggml_metal_init: loaded kernel_get_rows_q4_0 0x132905f30
ggml_metal_init: loaded kernel_get_rows_q4_1 0x132906720
ggml_metal_init: loaded kernel_get_rows_q2_k 0x132906da0
ggml_metal_init: loaded kernel_get_rows_q3_k 0x145f04510
ggml_metal_init: loaded kernel_get_rows_q4_k 0x145f05360
ggml_metal_init: loaded kernel_get_rows_q5_k 0x145f059e0
ggml_metal_init: loaded kernel_get_rows_q6_k 0x145e0d8d0
ggml_metal_init: loaded kernel_rms_norm 0x145e0e0a0
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x145e0e900
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x145e0f270
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x145e0f950
ggml_metal_init: loaded kernel_mul_mat_q2_k_f32 0x145e10030
ggml_metal_init: loaded kernel_mul_mat_q3_k_f32 0x145e10730
ggml_metal_init: loaded kernel_mul_mat_q4_k_f32 0x145e10f90
ggml_metal_init: loaded kernel_mul_mat_q5_k_f32 0x145e11670
ggml_metal_init: loaded kernel_mul_mat_q6_k_f32 0x145e11d50
ggml_metal_init: loaded kernel_rope 0x145e12640
ggml_metal_init: loaded kernel_cpy_f32_f16 0x145e130d0
ggml_metal_init: loaded kernel_cpy_f32_f32 0x145e13960
ggml_metal_add_buffer: allocated 'data ' buffer, size = 3745.52 MB
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 768.00 MB
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 1026.00 MB
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 512.00 MB
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 512.00 MB
system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
main: interactive mode on.
... ...
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.200000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = 256, n_keep = 21
-Jianmin