
Fix scalar version of Q5_K when QK_K = 64 #2362

Merged · 1 commit into master on Jul 24, 2023

Conversation

ikawrakow
Contributor

When developing the 64-weight block size for k-quants, at some point I switched Q5_K to use "type-0" quantization. Apparently I forgot to adjust the scalar version of the code.

Thanks to @katsu560 and @netrunnereve for noticing the problem.
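
For context, here is a minimal sketch in C of the two conventions (illustrative only: the struct and field names below are hypothetical, not the actual block_q5_K layout in k_quants.c). In a "type-1" k-quant block each sub-block stores a scale and a minimum, so a weight is reconstructed as w = d*scale*q - dmin*min; in a "type-0" block only a scale is stored and w = d*scale*q. After the QK_K = 64 format moved to type-0, the scalar code was still using the old type-1 reconstruction.

#include <stdint.h>

// Illustrative sketch only -- hypothetical names, simplified types (plain
// float instead of the fp16 used in the real format). A 64-weight, 5-bit
// "type-0" block: per-sub-block scales, but no minimums.
typedef struct {
    float   d;          // block scale
    int8_t  scales[2];  // one scale per 32-weight sub-block
    uint8_t qh[8];      // high (5th) bit of each quant, 8 per byte
    uint8_t qs[32];     // low 4 bits of each quant, 2 per byte
} block_q5_64;

// Type-0 reconstruction: no minimum is stored, so none may be subtracted.
void dequantize_q5_64(const block_q5_64 *b, float *y) {
    for (int j = 0; j < 64; ++j) {
        const int lo = (b->qs[j/2] >> 4*(j & 1)) & 0xF; // low 4 bits
        const int hi = (b->qh[j/8] >> (j & 7)) & 1;     // 5th bit
        const int q  = (hi << 4) | lo;                  // 0..31
        y[j] = b->d * b->scales[j/32] * (q - 16);       // w = d*scale*q, centered
    }
}

The fix brings the scalar Q5_K dot product in line with this type-0 convention, which the vectorized implementations had already been switched to.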

Performance is actually not too bad. On a Ryzen 7950X with AVX/AVX2 disabled, the scalar version is less than a factor of 2 slower than the AVX2 implementation:

make clean && LLAMA_FAST=1 LLAMA_QKK_64=1 make -j
I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS
I LDFLAGS:  
I CC:       cc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0
I CXX:      g++ (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0

rm -vf *.o *.so *.dll main quantize quantize-stats perplexity embedding benchmark-matmult save-load-state server simple vdot train-text-from-scratch embd-input-test build-info.h tests/test-double-float tests/test-grad0 tests/test-opt tests/test-quantize-fns tests/test-quantize-perf tests/test-sampling tests/test-tokenizer-0
removed 'common.o'
removed 'ggml.o'
removed 'grammar-parser.o'
removed 'k_quants.o'
removed 'llama.o'
removed 'libembdinput.so'
removed 'main'
removed 'quantize'
removed 'quantize-stats'
removed 'perplexity'
removed 'embedding'
removed 'server'
removed 'simple'
removed 'vdot'
removed 'train-text-from-scratch'
removed 'embd-input-test'
removed 'build-info.h'
I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -Ofast -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64
I CXXFLAGS: -I. -I./examples -Ofast -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64
I LDFLAGS:  
I CC:       cc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0
I CXX:      g++ (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0

cc  -I.              -Ofast -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64   -c ggml.c -o ggml.o
g++ -I. -I./examples -Ofast -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64 -c llama.cpp -o llama.o
g++ -I. -I./examples -Ofast -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64 -c examples/common.cpp -o common.o
g++ -I. -I./examples -Ofast -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64 -c examples/grammar-parser.cpp -o grammar-parser.o
cc -I.              -Ofast -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64   -c -o k_quants.o k_quants.c
g++ -I. -I./examples -Ofast -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64 examples/main/main.cpp ggml.o llama.o common.o grammar-parser.o k_quants.o -o main 
g++ -I. -I./examples -Ofast -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64 examples/quantize/quantize.cpp ggml.o llama.o k_quants.o -o quantize 
g++ -I. -I./examples -Ofast -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64 examples/quantize-stats/quantize-stats.cpp ggml.o llama.o k_quants.o -o quantize-stats 
g++ -I. -I./examples -Ofast -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64 examples/perplexity/perplexity.cpp ggml.o llama.o common.o k_quants.o -o perplexity 
g++ -I. -I./examples -Ofast -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64 examples/embedding/embedding.cpp ggml.o llama.o common.o k_quants.o -o embedding 
g++ -I. -I./examples -Ofast -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64 pocs/vdot/vdot.cpp ggml.o k_quants.o -o vdot 
g++ -I. -I./examples -Ofast -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64 examples/train-text-from-scratch/train-text-from-scratch.cpp ggml.o llama.o k_quants.o -o train-text-from-scratch 
g++ -I. -I./examples -Ofast -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64 examples/simple/simple.cpp ggml.o llama.o common.o k_quants.o -o simple 
g++ -I. -I./examples -Ofast -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64 -Iexamples/server examples/server/server.cpp ggml.o llama.o common.o k_quants.o -o server  
g++ --shared -I. -I./examples -Ofast -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64 examples/embd-input/embd-input-lib.cpp ggml.o llama.o common.o k_quants.o -o libembdinput.so 
g++ -I. -I./examples -Ofast -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -DGGML_QKK_64 examples/embd-input/embd-input-test.cpp ggml.o llama.o common.o k_quants.o -o embd-input-test  -L. -lembdinput

====  Run ./main -h for help.  ====

./main -m q5_64.bin -p "I believe the meaning of life is" --ignore-eos -n 128 -s 1234 -t 16
main: build = 895 (84e09a7)
main: seed  = 1234
llama.cpp: loading model from cuda/q5_64.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000,0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 16 (mostly Q5_K - Small)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0,08 MB
llama_model_load_internal: mem required  = 4937,42 MB (+  256,00 MB per state)
llama_new_context_with_model: kv self size  =  256,00 MB

system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1,100000, presence_penalty = 0,000000, frequency_penalty = 0,000000, top_k = 40, tfs_z = 1,000000, top_p = 0,950000, typical_p = 1,000000, temp = 0,800000, mirostat = 0, mirostat_lr = 0,100000, mirostat_ent = 5,000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0


 I believe the meaning of life is to enjoy every minute.
I also believe that we should live by our own values and not those set down in a book written by others, as we’re all different.
That being said, religion does teach many values which are important. But it can also be used to justify hatred and evil actions.
Saying that you are against killing but then going out there every day wearing clothes that were made from the leather of dead animals is hypocritical. If your goal is to save every animal in the world, stop eating meat for starters. You’d be amazed at how many
llama_print_timings:        load time =   260,60 ms
llama_print_timings:      sample time =    45,44 ms /   128 runs   (    0,35 ms per token,  2817,15 tokens per second)
llama_print_timings: prompt eval time =  1194,63 ms /     8 tokens (  149,33 ms per token,     6,70 tokens per second)
llama_print_timings:        eval time = 19893,37 ms /   127 runs   (  156,64 ms per token,     6,38 tokens per second)
llama_print_timings:       total time = 21152,60 ms

@ikawrakow ikawrakow merged commit 42f70cb into master Jul 24, 2023
@ikawrakow ikawrakow deleted the ik/fix_scalar_q5k_64 branch July 24, 2023 09:55