
Fix scalar version of Q5_K when QK_K = 64 #2362

Merged · 1 commit into master on Jul 24, 2023

Conversation

ikawrakow
Contributor

When developing the 64-weight block size for k-quants, at some point I switched Q5_K to use "type-0" quantization. Apparently I forgot to adjust the scalar version of the code.

Thanks to @katsu560 and @netrunnereve for noticing the problem.
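
For context, here is a minimal sketch in C of the two conventions (illustrative only: the struct and field names below are hypothetical, not the actual block_q5_K layout in k_quants.c). In a "type-1" k-quant block each sub-block stores a scale and a minimum, so a weight is reconstructed as w = d*scale*q - dmin*min; in a "type-0" block only a scale is stored and w = d*scale*q. After the QK_K = 64 format moved to type-0, the scalar code was still using the old type-1 reconstruction.

#include <stdint.h>

// Illustrative sketch only -- hypothetical names, simplified types (plain
// float instead of the fp16 used in the real format). A 64-weight, 5-bit
// "type-0" block: per-sub-block scales, but no minimums.
typedef struct {
    float   d;          // block scale
    int8_t  scales[2];  // one scale per 32-weight sub-block
    uint8_t qh[8];      // high (5th) bit of each quant, 8 per byte
    uint8_t qs[32];     // low 4 bits of each quant, 2 per byte
} block_q5_64;

// Type-0 reconstruction: no minimum is stored, so none may be subtracted.
void dequantize_q5_64(const block_q5_64 *b, float *y) {
    for (int j = 0; j < 64; ++j) {
        const int lo = (b->qs[j/2] >> 4*(j & 1)) & 0xF; // low 4 bits
        const int hi = (b->qh[j/8] >> (j & 7)) & 1;     // 5th bit
        const int q  = (hi << 4) | lo;                  // 0..31
        y[j] = b->d * b->scales[j/32] * (q - 16);       // w = d*scale*q, centered
    }
}

The fix brings the scalar Q5_K dot product in line with this type-0 convention, which the vectorized implementations had already been switched to.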

Performance is actually not too bad. On a Ryzen 7950X with AVX/AVX2 disabled, the scalar version is less than a factor of 2 slower than the AVX2 implementation:

make clean && LLAMA_FAST=1 LLAMA_QKK_64=1 make -j
I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS
I LDFLAGS:  
I CC:       cc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0
I CXX:      g++ (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0

rm -vf *.o *.so *.dll main quantize quantize-stats perplexity embedding benchmark-matmult save-load-state server simple vdot train-text-from-scratch embd-input-test build-info.h tests/test-double-float tests/test-grad0 tests/test-opt tests/test-quantize-fns tests/test-quantize-perf tests/test-sampling tests/test-tokenizer-0
removed 'common.o'
removed 'ggml.o'
removed 'grammar-parser.o'
removed 'k_quants.o'
removed 'llama.o'
removed 'libembdinput.so'
removed 'main'
removed 'quantize'
removed 'quantize-stats'
removed 'perplexity'
removed 'embedding'
removed 'server'
removed 'simple'
removed 'vdot'
removed 'train-text-from-scratch'
removed 'embd-input-test'
removed 'build-info.h'
I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -Ofast -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64
I CXXFLAGS: -I. -I./examples -Ofast -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64
I LDFLAGS:  
I CC:       cc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0
I CXX:      g++ (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0

cc  -I.              -Ofast -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64   -c ggml.c -o ggml.o
g++ -I. -I./examples -Ofast -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64 -c llama.cpp -o llama.o
g++ -I. -I./examples -Ofast -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64 -c examples/common.cpp -o common.o
g++ -I. -I./examples -Ofast -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64 -c examples/grammar-parser.cpp -o grammar-parser.o
cc -I.              -Ofast -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64   -c -o k_quants.o k_quants.c
g++ -I. -I./examples -Ofast -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64 examples/main/main.cpp ggml.o llama.o common.o grammar-parser.o k_quants.o -o main 
g++ -I. -I./examples -Ofast -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64 examples/quantize/quantize.cpp ggml.o llama.o k_quants.o -o quantize 
g++ -I. -I./examples -Ofast -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64 examples/quantize-stats/quantize-stats.cpp ggml.o llama.o k_quants.o -o quantize-stats 
g++ -I. -I./examples -Ofast -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64 examples/perplexity/perplexity.cpp ggml.o llama.o common.o k_quants.o -o perplexity 
g++ -I. -I./examples -Ofast -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64 examples/embedding/embedding.cpp ggml.o llama.o common.o k_quants.o -o embedding 
g++ -I. -I./examples -Ofast -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64 pocs/vdot/vdot.cpp ggml.o k_quants.o -o vdot 
g++ -I. -I./examples -Ofast -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64 examples/train-text-from-scratch/train-text-from-scratch.cpp ggml.o llama.o k_quants.o -o train-text-from-scratch 
g++ -I. -I./examples -Ofast -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64 examples/simple/simple.cpp ggml.o llama.o common.o k_quants.o -o simple 
g++ -I. -I./examples -Ofast -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64 -Iexamples/server examples/server/server.cpp ggml.o llama.o common.o k_quants.o -o server  
g++ --shared -I. -I./examples -Ofast -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64 examples/embd-input/embd-input-lib.cpp ggml.o llama.o common.o k_quants.o -o libembdinput.so 
g++ -I. -I./examples -Ofast -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64 examples/embd-input/embd-input-test.cpp ggml.o llama.o common.o k_quants.o -o embd-input-test  -L. -lembdinput

====  Run ./main -h for help.  ====

./main -m q5_64.bin -p "I believe the meaning of life is" --ignore-eos -n 128 -s 1234 -t 16
main: build = 895 (84e09a7)
main: seed  = 1234
llama.cpp: loading model from cuda/q5_64.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000,0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 16 (mostly Q5_K - Small)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0,08 MB
llama_model_load_internal: mem required  = 4937,42 MB (+  256,00 MB per state)
llama_new_context_with_model: kv self size  =  256,00 MB

system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1,100000, presence_penalty = 0,000000, frequency_penalty = 0,000000, top_k = 40, tfs_z = 1,000000, top_p = 0,950000, typical_p = 1,000000, temp = 0,800000, mirostat = 0, mirostat_lr = 0,100000, mirostat_ent = 5,000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0


 I believe the meaning of life is to enjoy every minute.
I also believe that we should live by our own values and not those set down in a book written by others, as we’re all different.
That being said, religion does teach many values which are important. But it can also be used to justify hatred and evil actions.
Saying that you are against killing but then going out there every day wearing clothes that were made from the leather of dead animals is hypocritical. If your goal is to save every animal in the world, stop eating meat for starters. You’d be amazed at how many
llama_print_timings:        load time =   260,60 ms
llama_print_timings:      sample time =    45,44 ms /   128 runs   (    0,35 ms per token,  2817,15 tokens per second)
llama_print_timings: prompt eval time =  1194,63 ms /     8 tokens (  149,33 ms per token,     6,70 tokens per second)
llama_print_timings:        eval time = 19893,37 ms /   127 runs   (  156,64 ms per token,     6,38 tokens per second)
llama_print_timings:       total time = 21152,60 ms

@ikawrakow ikawrakow merged commit 42f70cb into master Jul 24, 2023
@ikawrakow ikawrakow deleted the ik/fix_scalar_q5k_64 branch July 24, 2023 09:55