Faster Q2_K on Metal #2297

Merged: 3 commits into master, Jul 21, 2023
Conversation

ikawrakow (Contributor) commented Jul 20, 2023

Following in the footsteps of #2290 and #2294.

TG-128 in ms/t (milliseconds per token) on M2 Max with a 30-core GPU; the speedup column is master/PR - 1:

| Model | Master | This PR | Speedup |
| ----- | ------ | ------- | ------- |
| 7B    | 22.5   | 18.4    | 22.3%   |
| 13B   | 37.7   | 30.1    | 25.3%   |
| 33B   | 88.9   | 68.7    | 29.4%   |
| 65B   | 165.5  | 128.3   | 29.0%   |
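
For reference, this is the Q2_K super-block layout that the Metal mat-vec kernel reads, as defined in k_quants.h around this time (comments are mine; ggml_fp16_t comes from ggml.h — treat this as an illustrative sketch, not a normative copy):

```c
#include <stdint.h>
#include "ggml.h"            // for ggml_fp16_t

#define QK_K 256             // weights per super-block

// 2-bit quantization in super-blocks of 256 weights, organized as
// 16 sub-blocks of 16 weights. Each sub-block gets a 4-bit scale and
// a 4-bit min packed into one byte, plus two fp16 super-block scales.
// Effective size: 2.5625 bits per weight.
typedef struct {
    uint8_t     scales[QK_K/16]; // scale (low nibble), min (high nibble)
    uint8_t     qs[QK_K/4];      // the 2-bit quants, four per byte
    ggml_fp16_t d;               // super-block scale for the scales
    ggml_fp16_t dmin;            // super-block scale for the mins
} block_q2_K;

// a weight is reconstructed as  w = d*scale*q - dmin*min
```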

ikawrakow requested a review from ggerganov, July 20, 2023 16:57
shouyiwang (comment marked as duplicate)

Newbie question: can you please explain why this white space is dangerous?
Thx!

ikawrakow (Contributor, Author) replied:

It is not dangerous. I'm just being sarcastic about a test failing because of one forgotten trailing white space.

ggerganov (Member) left a comment:

M1 Pro, TG in ms/t:

| Model | Master | This PR |
| ----- | ------ | ------- |
| 7B    | 38.4   | 29.5    |
| 13B   | 69.48  | 51.5    |

However, the calculation seems to be incorrect.
Here is a run with this PR - the generated text is quite incoherent:

 I believe the meaning of life is to find the be a friend and. do we want to be here in 201932032222222222312222122222222222222222222222
$ ▶ LLAMA_METAL=1 make -j main && ./main -m ./models/13B/ggml-model-q2_k.bin -p "I believe the meaning of life is" -c 128 --ignore-eos -n 64 -t 8 -ngl 1000
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL
I LDFLAGS:   -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

make: `main' is up to date.
main: build = 858 (417546c)
main: seed  = 1689921112
llama.cpp: loading model from ./models/13B/ggml-model-q2_k.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 128
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: freq_base  = 10000,0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 10 (mostly Q2_K)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0,09 MB
llama_model_load_internal: mem required  = 7055,00 MB (+ 1608,00 MB per state)
llama_new_context_with_model: kv self size  =  100,00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/ggerganov/development/github/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x131f08be0
ggml_metal_init: loaded kernel_mul                            0x131f091e0
ggml_metal_init: loaded kernel_mul_row                        0x131f09810
ggml_metal_init: loaded kernel_scale                          0x131f09d30
ggml_metal_init: loaded kernel_silu                           0x131f0a250
ggml_metal_init: loaded kernel_relu                           0x131f0a770
ggml_metal_init: loaded kernel_gelu                           0x131f0ac90
ggml_metal_init: loaded kernel_soft_max                       0x131f0b340
ggml_metal_init: loaded kernel_diag_mask_inf                  0x131f0b9a0
ggml_metal_init: loaded kernel_get_rows_f16                   0x131f0c020
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x131f0c6a0
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x131f0ce90
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x131f0d510
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x131f0db90
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x131f0e210
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x131f0e890
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x131f0ef10
ggml_metal_init: loaded kernel_rms_norm                       0x131f0f5d0
ggml_metal_init: loaded kernel_norm                           0x131f0fc80
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x131f10650
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x131f10d10
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x131f113d0
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x131f11a90
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x131f12310
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x131f129d0
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x131f13070
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x131f13710
ggml_metal_init: loaded kernel_rope                           0x131f13e30
ggml_metal_init: loaded kernel_alibi_f32                      0x131f14950
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x131f151e0
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x131f15a70
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x131f16300
ggml_metal_init: recommendedMaxWorkingSetSize = 21845,34 MB
ggml_metal_init: hasUnifiedMemory             = true
ggml_metal_init: maxTransferRate              = built-in GPU
llama_new_context_with_model: max tensor size =   128,17 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  5253,34 MB, ( 5253,80 / 21845,34)
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =  1024,00 MB, ( 6277,80 / 21845,34)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =   102,00 MB, ( 6379,80 / 21845,34)
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   266,00 MB, ( 6645,80 / 21845,34)
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =   512,00 MB, ( 7157,80 / 21845,34)

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1,100000, presence_penalty = 0,000000, frequency_penalty = 0,000000, top_k = 40, tfs_z = 1,000000, top_p = 0,950000, typical_p = 1,000000, temp = 0,800000, mirostat = 0, mirostat_lr = 0,100000, mirostat_ent = 5,000000
generate: n_ctx = 128, n_batch = 512, n_predict = 64, n_keep = 0


 I believe the meaning of life is to find the be a friend and. do we want to be here in 201932032222222222312222122222222222222222222222
llama_print_timings:        load time =   404,54 ms
llama_print_timings:      sample time =    44,56 ms /    64 runs   (    0,70 ms per token,  1436,23 tokens per second)
llama_print_timings: prompt eval time =   598,88 ms /     8 tokens (   74,86 ms per token,    13,36 tokens per second)
llama_print_timings:        eval time =  3248,99 ms /    63 runs   (   51,57 ms per token,    19,39 tokens per second)
llama_print_timings:       total time =  3897,98 ms
ggml_metal_free: deallocating

For comparison, on master:

 I believe the meaning of life is to find a balance between your own needs and desires while at the same time doing what's best for others.
I would say "I'm here to help" if someone asked me for some advice or guidance. I believe that a person should be happy with who they are and what they do, but
$ ▶ LLAMA_METAL=1 make -j main && ./main -m ./models/13B/ggml-model-q2_k.bin -p "I believe the meaning of life is" -c 128 --ignore-eos -n 64 -t 8 -ngl 1000
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL
I LDFLAGS:   -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

make: `main' is up to date.
main: build = 856 (e782c9e)
main: seed  = 1689921195
llama.cpp: loading model from ./models/13B/ggml-model-q2_k.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 128
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: freq_base  = 10000,0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 10 (mostly Q2_K)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0,09 MB
llama_model_load_internal: mem required  = 7055,00 MB (+ 1608,00 MB per state)
llama_new_context_with_model: kv self size  =  100,00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/ggerganov/development/github/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x159e0aa90
ggml_metal_init: loaded kernel_mul                            0x159e0b090
ggml_metal_init: loaded kernel_mul_row                        0x159e0b6c0
ggml_metal_init: loaded kernel_scale                          0x159e0bbe0
ggml_metal_init: loaded kernel_silu                           0x159e0c100
ggml_metal_init: loaded kernel_relu                           0x159e0c620
ggml_metal_init: loaded kernel_gelu                           0x159e0cb40
ggml_metal_init: loaded kernel_soft_max                       0x159e0d1f0
ggml_metal_init: loaded kernel_diag_mask_inf                  0x159e0d850
ggml_metal_init: loaded kernel_get_rows_f16                   0x159e0ded0
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x159e0e550
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x159e0ed40
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x159e0f3c0
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x159e0fa40
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x159e100c0
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x159e10740
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x159e10dc0
ggml_metal_init: loaded kernel_rms_norm                       0x159e11480
ggml_metal_init: loaded kernel_norm                           0x159e11b30
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x159e12500
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x159e12bc0
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x159e13280
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x159e13960
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x159e141e0
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x159e148a0
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x159e14f40
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x159e155e0
ggml_metal_init: loaded kernel_rope                           0x159e15d00
ggml_metal_init: loaded kernel_alibi_f32                      0x159e16820
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x159e170b0
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x159e17940
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x159e181d0
ggml_metal_init: recommendedMaxWorkingSetSize = 21845,34 MB
ggml_metal_init: hasUnifiedMemory             = true
ggml_metal_init: maxTransferRate              = built-in GPU
llama_new_context_with_model: max tensor size =   128,17 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  5253,34 MB, ( 5253,80 / 21845,34)
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =  1024,00 MB, ( 6277,80 / 21845,34)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =   102,00 MB, ( 6379,80 / 21845,34)
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   266,00 MB, ( 6645,80 / 21845,34)
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =   512,00 MB, ( 7157,80 / 21845,34)

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1,100000, presence_penalty = 0,000000, frequency_penalty = 0,000000, top_k = 40, tfs_z = 1,000000, top_p = 0,950000, typical_p = 1,000000, temp = 0,800000, mirostat = 0, mirostat_lr = 0,100000, mirostat_ent = 5,000000
generate: n_ctx = 128, n_batch = 512, n_predict = 64, n_keep = 0


 I believe the meaning of life is to find a balance between your own needs and desires while at the same time doing what's best for others.
I would say "I'm here to help" if someone asked me for some advice or guidance. I believe that a person should be happy with who they are and what they do, but
llama_print_timings:        load time =   419,38 ms
llama_print_timings:      sample time =    44,66 ms /    64 runs   (    0,70 ms per token,  1433,18 tokens per second)
llama_print_timings: prompt eval time =   604,51 ms /     8 tokens (   75,56 ms per token,    13,23 tokens per second)
llama_print_timings:        eval time =  4374,82 ms /    63 runs   (   69,44 ms per token,    14,40 tokens per second)
llama_print_timings:       total time =  5029,98 ms
ggml_metal_free: deallocating

This is the command that I use:

make clean && LLAMA_METAL=1 make -j main && ./main -m ./models/13B/ggml-model-q2_k.bin -p "I believe the meaning of life is" -c 128 --ignore-eos -n 64 -t 8 -ngl 1000

The perplexity results also confirm that something is not OK. To make the perplexity tool run through the Metal kernels, add the -b 1 command-line argument, like this:

make clean && LLAMA_METAL=1 make -j && ./perplexity -m ./models/7B/ggml-model-q2_k.bin -f build/wiki.test.raw -t 8 --no-mmap -ngl 100 -b 1

Here are the first three chunks on master and on this PR (note it's very slow):

# master
[1]4.9103,[2]5.5275,[3]6.3980,

# PR
[1]22.6174,[2]27.6240,[3]29.5259,

Full log of the last command:

$ ▶ make clean && LLAMA_METAL=1 make -j && ./perplexity -m ./models/7B/ggml-model-q2_k.bin -f build/wiki.test.raw -t 8 --no-mmap -ngl 100 -b 1
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

rm -vf *.o *.so main quantize quantize-stats perplexity embedding benchmark-matmult save-load-state server simple vdot train-text-from-scratch embd-input-test build-info.h
common.o
ggml-metal.o
ggml.o
k_quants.o
llama.o
libembdinput.so
main
quantize
quantize-stats
perplexity
embedding
server
simple
vdot
train-text-from-scratch
embd-input-test
build-info.h
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL
I LDFLAGS:   -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

cc  -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG   -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL -c llama.cpp -o llama.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL -c examples/common.cpp -o common.o
cc -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG   -c -o k_quants.o k_quants.c
cc -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -c ggml-metal.m -o ggml-metal.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/main/main.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o main  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/quantize/quantize.cpp ggml.o llama.o k_quants.o ggml-metal.o -o quantize  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/quantize-stats/quantize-stats.cpp ggml.o llama.o k_quants.o ggml-metal.o -o quantize-stats  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/perplexity/perplexity.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o perplexity  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/embedding/embedding.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o embedding  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL pocs/vdot/vdot.cpp ggml.o k_quants.o ggml-metal.o -o vdot  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/train-text-from-scratch/train-text-from-scratch.cpp ggml.o llama.o k_quants.o ggml-metal.o -o train-text-from-scratch  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/simple/simple.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o simple  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL -Iexamples/server examples/server/server.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o server  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ --shared -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/embd-input/embd-input-lib.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o libembdinput.so  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL examples/embd-input/embd-input-test.cpp ggml.o llama.o common.o k_quants.o ggml-metal.o -o embd-input-test  -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders -L. -lembdinput

====  Run ./main -h for help.  ====

main: build = 858 (417546c)
main: seed  = 1689922165
llama.cpp: loading model from ./models/7B/ggml-model-q2_k.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 10 (mostly Q2_K)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 2733.65 MB
llama_model_load_internal: mem required  = 4303.65 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size  =  256.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/ggerganov/development/github/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x1346098e0
ggml_metal_init: loaded kernel_mul                            0x134609ee0
ggml_metal_init: loaded kernel_mul_row                        0x13460a510
ggml_metal_init: loaded kernel_scale                          0x13460aa30
ggml_metal_init: loaded kernel_silu                           0x13460af50
ggml_metal_init: loaded kernel_relu                           0x13460b470
ggml_metal_init: loaded kernel_gelu                           0x13460b990
ggml_metal_init: loaded kernel_soft_max                       0x13460c040
ggml_metal_init: loaded kernel_diag_mask_inf                  0x13460c6a0
ggml_metal_init: loaded kernel_get_rows_f16                   0x13460cd20
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x13460d3a0
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x13460db90
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x13460e210
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x13460e890
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x13460ef10
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x13460f590
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x13460fc10
ggml_metal_init: loaded kernel_rms_norm                       0x1346102d0
ggml_metal_init: loaded kernel_norm                           0x134610980
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x134611350
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x134611a10
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x1346120d0
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x134612790
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x134613010
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x1346136d0
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x134613d70
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x134614410
ggml_metal_init: loaded kernel_rope                           0x134614b30
ggml_metal_init: loaded kernel_alibi_f32                      0x134615650
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x134615ee0
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x134616770
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x134617000
ggml_metal_init: recommendedMaxWorkingSetSize = 21845.34 MB
ggml_metal_init: hasUnifiedMemory             = true
ggml_metal_init: maxTransferRate              = built-in GPU
llama_new_context_with_model: max tensor size =   102.54 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  2733.66 MB, ( 2734.11 / 21845.34)
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =   770.00 MB, ( 3504.11 / 21845.34)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =   258.00 MB, ( 3762.11 / 21845.34)
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   288.00 MB, ( 4050.11 / 21845.34)
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =   512.00 MB, ( 4562.11 / 21845.34)

system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
perplexity: calculating perplexity over 655 chunks, batch_size=1
perplexity: 16.02 seconds per pass - ETA 2 hours 54 minutes
[1]22.6174,[2]27.6240,[3]29.5259,^C

If we confirm something is wrong, it might be worth doing the same checks for the Metal implementations of the other quantizations, to make sure we didn't overlook something.
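
For example, the same -b 1 perplexity run could be repeated for the other k-quants; a Q3_K check (the model path here is hypothetical) would look like:

make clean && LLAMA_METAL=1 make -j && ./perplexity -m ./models/7B/ggml-model-q3_k.bin -f build/wiki.test.raw -t 8 --no-mmap -ngl 100 -b 1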

ikawrakow (Contributor, Author)

@ggerganov Great catch, thanks!

I was getting a not-too-bad answer on the meaning of life while testing. The bug was that I was always using the mins/scales of the first 128 weights in the super-block. Normally such a bug produces complete gibberish. With the last commit I now get the same perplexities as before.
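
To make the bug concrete, here is a condensed sketch of how one Q2_K super-block is dequantized, simplified from the CPU reference dequantize_row_q2_K in k_quants.c (this is not the Metal kernel itself, and it assumes the block_q2_K layout shown earlier). The point is that the scale/min index must walk through all 16 packed bytes; the bug was equivalent to never advancing it past the 8 bytes that cover the first 128 weights:

```c
// Dequantize one Q2_K super-block (256 weights) -- condensed from the
// CPU reference in k_quants.c, not the Metal code.
static void dequantize_q2_K_block(const block_q2_K * x, float * y) {
    const float d   = ggml_fp16_to_fp32(x->d);    // super-block scale
    const float min = ggml_fp16_to_fp32(x->dmin); // super-block min scale
    const uint8_t * q = x->qs;
    int is = 0;                                   // scale/min byte index, 0..15
    for (int n = 0; n < QK_K; n += 128) {         // two halves of 128 weights
        for (int shift = 0; shift < 8; shift += 2) {  // 4 bit-planes per byte
            for (int half = 0; half < 2; ++half) {    // 2 groups of 16 weights
                const uint8_t sc = x->scales[is++];   // must advance per group
                const float dl = d   * (sc & 0xF);    // sub-block scale
                const float ml = min * (sc >>  4);    // sub-block min
                for (int l = 0; l < 16; ++l) {
                    *y++ = dl * ((q[16*half + l] >> shift) & 3) - ml;
                }
            }
        }
        q += 32;  // next 32 quant bytes for the second half
    }
}
```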

ikawrakow merged commit e68c96f into master on Jul 21, 2023
ikawrakow deleted the ik/metal_faster_q2k branch on July 21, 2023 07:44