Faster Q2_K on Metal #2297
Conversation
Newbie question: can you please explain why this white space is dangerous?
Thx!
It is not dangerous. I'm just being sarcastic about a test failing because of one forgotten trailing white space.
M1 Pro, text generation speed in ms/token:

| Model | Master | This PR |
|---|---|---|
| 7B | 38.4 | 29.5 |
| 13B | 69.48 | 51.5 |
However, the calculation seems to be incorrect.
Here is a run with this PR - the generated text is quite incoherent:
I believe the meaning of life is to find the be a friend and. do we want to be here in 201932032222222222312222122222222222222222222222
$ ▶ LLAMA_METAL=1 make -j main && ./main -m ./models/13B/ggml-model-q2_k.bin -p "I believe the meaning of life is" -c 128 --ignore-eos -n 64 -t 8 -ngl 1000
I llama.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL
I LDFLAGS: -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
I CC: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
make: `main' is up to date.
main: build = 858 (417546c)
main: seed = 1689921112
llama.cpp: loading model from ./models/13B/ggml-model-q2_k.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 128
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: freq_base = 10000,0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 10 (mostly Q2_K)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0,09 MB
llama_model_load_internal: mem required = 7055,00 MB (+ 1608,00 MB per state)
llama_new_context_with_model: kv self size = 100,00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/ggerganov/development/github/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x131f08be0
ggml_metal_init: loaded kernel_mul 0x131f091e0
ggml_metal_init: loaded kernel_mul_row 0x131f09810
ggml_metal_init: loaded kernel_scale 0x131f09d30
ggml_metal_init: loaded kernel_silu 0x131f0a250
ggml_metal_init: loaded kernel_relu 0x131f0a770
ggml_metal_init: loaded kernel_gelu 0x131f0ac90
ggml_metal_init: loaded kernel_soft_max 0x131f0b340
ggml_metal_init: loaded kernel_diag_mask_inf 0x131f0b9a0
ggml_metal_init: loaded kernel_get_rows_f16 0x131f0c020
ggml_metal_init: loaded kernel_get_rows_q4_0 0x131f0c6a0
ggml_metal_init: loaded kernel_get_rows_q4_1 0x131f0ce90
ggml_metal_init: loaded kernel_get_rows_q2_K 0x131f0d510
ggml_metal_init: loaded kernel_get_rows_q3_K 0x131f0db90
ggml_metal_init: loaded kernel_get_rows_q4_K 0x131f0e210
ggml_metal_init: loaded kernel_get_rows_q5_K 0x131f0e890
ggml_metal_init: loaded kernel_get_rows_q6_K 0x131f0ef10
ggml_metal_init: loaded kernel_rms_norm 0x131f0f5d0
ggml_metal_init: loaded kernel_norm 0x131f0fc80
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x131f10650
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x131f10d10
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x131f113d0
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x131f11a90
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x131f12310
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x131f129d0
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x131f13070
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x131f13710
ggml_metal_init: loaded kernel_rope 0x131f13e30
ggml_metal_init: loaded kernel_alibi_f32 0x131f14950
ggml_metal_init: loaded kernel_cpy_f32_f16 0x131f151e0
ggml_metal_init: loaded kernel_cpy_f32_f32 0x131f15a70
ggml_metal_init: loaded kernel_cpy_f16_f16 0x131f16300
ggml_metal_init: recommendedMaxWorkingSetSize = 21845,34 MB
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: max tensor size = 128,17 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 5253,34 MB, ( 5253,80 / 21845,34)
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 1024,00 MB, ( 6277,80 / 21845,34)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 102,00 MB, ( 6379,80 / 21845,34)
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 266,00 MB, ( 6645,80 / 21845,34)
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 512,00 MB, ( 7157,80 / 21845,34)
system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1,100000, presence_penalty = 0,000000, frequency_penalty = 0,000000, top_k = 40, tfs_z = 1,000000, top_p = 0,950000, typical_p = 1,000000, temp = 0,800000, mirostat = 0, mirostat_lr = 0,100000, mirostat_ent = 5,000000
generate: n_ctx = 128, n_batch = 512, n_predict = 64, n_keep = 0
I believe the meaning of life is to find the be a friend and. do we want to be here in 201932032222222222312222122222222222222222222222
llama_print_timings: load time = 404,54 ms
llama_print_timings: sample time = 44,56 ms / 64 runs ( 0,70 ms per token, 1436,23 tokens per second)
llama_print_timings: prompt eval time = 598,88 ms / 8 tokens ( 74,86 ms per token, 13,36 tokens per second)
llama_print_timings: eval time = 3248,99 ms / 63 runs ( 51,57 ms per token, 19,39 tokens per second)
llama_print_timings: total time = 3897,98 ms
ggml_metal_free: deallocating
For comparison, on master:
I believe the meaning of life is to find a balance between your own needs and desires while at the same time doing what's best for others.
I would say "I'm here to help" if someone asked me for some advice or guidance. I believe that a person should be happy with who they are and what they do, but
$ ▶ LLAMA_METAL=1 make -j main && ./main -m ./models/13B/ggml-model-q2_k.bin -p "I believe the meaning of life is" -c 128 --ignore-eos -n 64 -t 8 -ngl 1000
I llama.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL
I LDFLAGS: -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
I CC: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
make: `main' is up to date.
main: build = 856 (e782c9e)
main: seed = 1689921195
llama.cpp: loading model from ./models/13B/ggml-model-q2_k.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 128
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: freq_base = 10000,0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 10 (mostly Q2_K)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0,09 MB
llama_model_load_internal: mem required = 7055,00 MB (+ 1608,00 MB per state)
llama_new_context_with_model: kv self size = 100,00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/ggerganov/development/github/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x159e0aa90
ggml_metal_init: loaded kernel_mul 0x159e0b090
ggml_metal_init: loaded kernel_mul_row 0x159e0b6c0
ggml_metal_init: loaded kernel_scale 0x159e0bbe0
ggml_metal_init: loaded kernel_silu 0x159e0c100
ggml_metal_init: loaded kernel_relu 0x159e0c620
ggml_metal_init: loaded kernel_gelu 0x159e0cb40
ggml_metal_init: loaded kernel_soft_max 0x159e0d1f0
ggml_metal_init: loaded kernel_diag_mask_inf 0x159e0d850
ggml_metal_init: loaded kernel_get_rows_f16 0x159e0ded0
ggml_metal_init: loaded kernel_get_rows_q4_0 0x159e0e550
ggml_metal_init: loaded kernel_get_rows_q4_1 0x159e0ed40
ggml_metal_init: loaded kernel_get_rows_q2_K 0x159e0f3c0
ggml_metal_init: loaded kernel_get_rows_q3_K 0x159e0fa40
ggml_metal_init: loaded kernel_get_rows_q4_K 0x159e100c0
ggml_metal_init: loaded kernel_get_rows_q5_K 0x159e10740
ggml_metal_init: loaded kernel_get_rows_q6_K 0x159e10dc0
ggml_metal_init: loaded kernel_rms_norm 0x159e11480
ggml_metal_init: loaded kernel_norm 0x159e11b30
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x159e12500
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x159e12bc0
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x159e13280
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x159e13960
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x159e141e0
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x159e148a0
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x159e14f40
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x159e155e0
ggml_metal_init: loaded kernel_rope 0x159e15d00
ggml_metal_init: loaded kernel_alibi_f32 0x159e16820
ggml_metal_init: loaded kernel_cpy_f32_f16 0x159e170b0
ggml_metal_init: loaded kernel_cpy_f32_f32 0x159e17940
ggml_metal_init: loaded kernel_cpy_f16_f16 0x159e181d0
ggml_metal_init: recommendedMaxWorkingSetSize = 21845,34 MB
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: max tensor size = 128,17 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 5253,34 MB, ( 5253,80 / 21845,34)
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 1024,00 MB, ( 6277,80 / 21845,34)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 102,00 MB, ( 6379,80 / 21845,34)
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 266,00 MB, ( 6645,80 / 21845,34)
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 512,00 MB, ( 7157,80 / 21845,34)
system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1,100000, presence_penalty = 0,000000, frequency_penalty = 0,000000, top_k = 40, tfs_z = 1,000000, top_p = 0,950000, typical_p = 1,000000, temp = 0,800000, mirostat = 0, mirostat_lr = 0,100000, mirostat_ent = 5,000000
generate: n_ctx = 128, n_batch = 512, n_predict = 64, n_keep = 0
I believe the meaning of life is to find a balance between your own needs and desires while at the same time doing what's best for others.
I would say "I'm here to help" if someone asked me for some advice or guidance. I believe that a person should be happy with who they are and what they do, but
llama_print_timings: load time = 419,38 ms
llama_print_timings: sample time = 44,66 ms / 64 runs ( 0,70 ms per token, 1433,18 tokens per second)
llama_print_timings: prompt eval time = 604,51 ms / 8 tokens ( 75,56 ms per token, 13,23 tokens per second)
llama_print_timings: eval time = 4374,82 ms / 63 runs ( 69,44 ms per token, 14,40 tokens per second)
llama_print_timings: total time = 5029,98 ms
ggml_metal_free: deallocating
This is the command that I use:
make clean && LLAMA_METAL=1 make -j main && ./main -m ./models/13B/ggml-model-q2_k.bin -p "I believe the meaning of life is" -c 128 --ignore-eos -n 64 -t 8 -ngl 1000
The perplexity results also confirm that something is not OK. To make the perplexity tool run using the Metal kernels, make sure to add the `-b 1` command-line arg, like this:
make clean && LLAMA_METAL=1 make -j && ./perplexity -m ./models/7B/ggml-model-q2_k.bin -f build/wiki.test.raw -t 8 --no-mmap -ngl 100 -b 1
Here are the first three chunks on master and on this PR (note it's very slow):
# master
[1]4.9103,[2]5.5275,[3]6.3980,
# PR
[1]22.6174,[2]27.6240,[3]29.5259,
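(For context on these numbers: each chunk value is standard perplexity, i.e. the exponential of the mean negative log-likelihood over the chunk's tokens. Below is a minimal sketch of that computation; `token_logprob` is a hypothetical array of per-token log-probabilities, not an actual variable from the tool. A jump from ~5 to ~22 on identical data means the kernel is assigning far lower probability to the correct tokens, i.e. the output is wrong, not just noisy.)

```c
#include <math.h>

// Each chunk value printed by ./perplexity is standard perplexity:
// exp of the mean negative log-likelihood of the chunk's tokens.
// `token_logprob` (hypothetical) holds the model's log-probability
// of each correct token in the chunk.
double chunk_perplexity(const double *token_logprob, int n) {
    double nll = 0.0;
    for (int i = 0; i < n; ++i) {
        nll -= token_logprob[i]; // accumulate negative log-likelihood
    }
    return exp(nll / n);
}
```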
Full log of the last command:
$ ▶ make clean && LLAMA_METAL=1 make -j && ./perplexity -m ./models/7B/ggml-model-q2_k.bin -f build/wiki.test.raw -t 8 --no-mmap -ngl 100 -b 1
I llama.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS
I LDFLAGS: -framework Accelerate
I CC: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
rm -vf *.o *.so main quantize quantize-stats perplexity embedding benchmark-matmult save-load-state server simple vdot train-text-from-scratch embd-input-test build-info.h
common.o
ggml-metal.o
ggml.o
k_quants.o
llama.o
libembdinput.so
main
quantize
quantize-stats
perplexity
embedding
server
simple
vdot
train-text-from-scratch
embd-input-test
build-info.h
I llama.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL
I LDFLAGS: -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
I CC: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
cc -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL -c llama.cpp -o llama.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL -c examples/common.cpp -o common.o
cc -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -c -o k_quants.o k_quants.c
cc -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -c ggml-metal.m -o ggml-metal.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/main/main.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o main -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/quantize/quantize.cpp ggml.o llama.o k_quants.o ggml-metal.o -o quantize -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/quantize-stats/quantize-stats.cpp ggml.o llama.o k_quants.o ggml-metal.o -o quantize-stats -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/perplexity/perplexity.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o perplexity -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/embedding/embedding.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o embedding -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL pocs/vdot/vdot.cpp ggml.o k_quants.o ggml-metal.o -o vdot -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/train-text-from-scratch/train-text-from-scratch.cpp ggml.o llama.o k_quants.o ggml-metal.o -o train-text-from-scratch -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/simple/simple.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o simple -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL -Iexamples/server examples/server/server.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o server -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ --shared -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/embd-input/embd-input-lib.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o libembdinput.so -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/embd-input/embd-input-test.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o embd-input-test -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders -L. -lembdinput
==== Run ./main -h for help. ====
main: build = 858 (417546c)
main: seed = 1689922165
llama.cpp: loading model from ./models/7B/ggml-model-q2_k.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 10 (mostly Q2_K)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 2733.65 MB
llama_model_load_internal: mem required = 4303.65 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size = 256.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/ggerganov/development/github/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x1346098e0
ggml_metal_init: loaded kernel_mul 0x134609ee0
ggml_metal_init: loaded kernel_mul_row 0x13460a510
ggml_metal_init: loaded kernel_scale 0x13460aa30
ggml_metal_init: loaded kernel_silu 0x13460af50
ggml_metal_init: loaded kernel_relu 0x13460b470
ggml_metal_init: loaded kernel_gelu 0x13460b990
ggml_metal_init: loaded kernel_soft_max 0x13460c040
ggml_metal_init: loaded kernel_diag_mask_inf 0x13460c6a0
ggml_metal_init: loaded kernel_get_rows_f16 0x13460cd20
ggml_metal_init: loaded kernel_get_rows_q4_0 0x13460d3a0
ggml_metal_init: loaded kernel_get_rows_q4_1 0x13460db90
ggml_metal_init: loaded kernel_get_rows_q2_K 0x13460e210
ggml_metal_init: loaded kernel_get_rows_q3_K 0x13460e890
ggml_metal_init: loaded kernel_get_rows_q4_K 0x13460ef10
ggml_metal_init: loaded kernel_get_rows_q5_K 0x13460f590
ggml_metal_init: loaded kernel_get_rows_q6_K 0x13460fc10
ggml_metal_init: loaded kernel_rms_norm 0x1346102d0
ggml_metal_init: loaded kernel_norm 0x134610980
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x134611350
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x134611a10
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x1346120d0
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x134612790
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x134613010
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x1346136d0
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x134613d70
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x134614410
ggml_metal_init: loaded kernel_rope 0x134614b30
ggml_metal_init: loaded kernel_alibi_f32 0x134615650
ggml_metal_init: loaded kernel_cpy_f32_f16 0x134615ee0
ggml_metal_init: loaded kernel_cpy_f32_f32 0x134616770
ggml_metal_init: loaded kernel_cpy_f16_f16 0x134617000
ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: max tensor size = 102.54 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 2733.66 MB, ( 2734.11 / 21845.34)
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 770.00 MB, ( 3504.11 / 21845.34)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 258.00 MB, ( 3762.11 / 21845.34)
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 288.00 MB, ( 4050.11 / 21845.34)
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 512.00 MB, ( 4562.11 / 21845.34)
system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
perplexity: calculating perplexity over 655 chunks, batch_size=1
perplexity: 16.02 seconds per pass - ETA 2 hours 54 minutes
[1]22.6174,[2]27.6240,[3]29.5259,^C
If we confirm something is wrong, it might be worth doing the same checks for the Metal implementations of the other quantizations, to make sure we didn't overlook something.
@ggerganov Great catch, thanks! I was getting a not-too-bad answer on the meaning of life while testing. The bug was that I was always using the mins/scales of the first 128 weights in the super-block. Normally such a bug produces complete gibberish. With the last commit I now get the same perplexities as before.
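(To illustrate the failure mode: below is a minimal C sketch of Q2_K super-block dequantization, modeled on the reference `dequantize_row_q2_K` in `k_quants.c`. The struct layout and loop order are reconstructed from memory, and the fp16-to-float conversion of `d`/`dmin` is elided, so treat it as illustrative rather than the actual kernel code. The point is the scale index: it must keep advancing across both 128-weight halves of the super-block. Reusing the first half's scales/mins for the second half is exactly the kind of bug described above; it distorts magnitudes without fully scrambling the quants, which may explain why the output degraded rather than turning into complete gibberish.)

```c
#include <stdint.h>

#define QK_K 256

// Q2_K super-block layout (as in k_quants.h): 256 weights in 16 groups
// of 16; each group has a 4-bit scale and a 4-bit min packed into one
// byte of `scales`. (fp16 fields shown as uint16_t for simplicity.)
typedef struct {
    uint8_t  scales[QK_K/16]; // low nibble = scale, high nibble = min
    uint8_t  qs[QK_K/4];      // 64 bytes of 2-bit quants
    uint16_t d;               // fp16 super-block scale for the scales
    uint16_t dmin;            // fp16 super-block scale for the mins
} block_q2_K;

// Reference-style dequantization of one super-block; `d` and `dmin`
// are passed pre-converted to float. The 256 weights are processed as
// two 128-weight halves (four 2-bit shifts x two 16-weight groups per
// half), and the scale index `is` must keep advancing through all 16
// scale bytes -- it must NOT restart at the second half.
static void dequantize_q2_K(const block_q2_K *x, float d, float dmin, float *y) {
    const uint8_t *q = x->qs;
    int is = 0; // index into x->scales, runs 0..15 over the whole block
    for (int n = 0; n < QK_K; n += 128) {
        for (int shift = 0; shift < 8; shift += 2) {
            for (int g = 0; g < 2; ++g) {
                const uint8_t sc = x->scales[is++]; // buggy kernel kept reusing bytes 0..7
                const float dl = d    * (sc & 0xF);
                const float ml = dmin * (sc >>  4);
                for (int l = 0; l < 16; ++l) {
                    *y++ = dl * ((q[16*g + l] >> shift) & 3) - ml;
                }
            }
        }
        q += 32; // next 32 bytes hold the quants of the next 128 weights
    }
}
```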
Following in the footsteps of #2290 and #2294.
TG-128 in ms/t on M2 Max with 30-core GPU: