more perfo with llamafile tinyblas on x86_64. #10714


Merged

merged 5 commits into ggml-org:master from perfo/tinyblas on Dec 24, 2024

Conversation

Djip007
Contributor

@Djip007 Djip007 commented Dec 8, 2024

ikawrakow/ik_llama.cpp#71 has a good idea.

I figured out how to add it to the llamafile/tinyblas sgemm (and a little more), and it works great (see the illustrative sketch after the benchmark tables below):

  • AMD Ryzen 9 7940HS (zen4)

Mistral-Nemo-Instruct-2407.BF16.gguf +kv@bf16

Model Test t/s master t/s perfo/tinyblas Speedup
llama 13B BF16 pp1 2.52 2.51 1.00
llama 13B BF16 pp2 5.00 4.74 0.95
llama 13B BF16 pp3 7.44 7.23 0.97
llama 13B BF16 pp4 9.75 9.91 1.02
llama 13B BF16 pp5 11.95 12.37 1.04
llama 13B BF16 pp6 13.92 14.78 1.06
llama 13B BF16 pp7 15.64 17.09 1.09
llama 13B BF16 pp8 17.24 19.41 1.13
llama 13B BF16 pp9 18.35 21.63 1.18
llama 13B BF16 pp10 19.47 24.02 1.23
llama 13B BF16 pp11 20.48 26.30 1.28
llama 13B BF16 pp12 21.04 28.43 1.35
llama 13B BF16 pp13 21.49 29.41 1.37
llama 13B BF16 pp14 23.10 31.56 1.37
llama 13B BF16 pp15 23.65 33.51 1.42
llama 13B BF16 pp16 23.99 35.87 1.50
llama 13B BF16 pp30 24.19 51.09 2.11
llama 13B BF16 pp32 24.56 51.04 2.08
llama 13B BF16 pp64 25.66 57.23 2.23
llama 13B BF16 pp65 24.57 57.33 2.33
llama 13B BF16 pp120 25.73 65.51 2.55
llama 13B BF16 pp128 25.77 65.81 2.55
llama 13B BF16 pp130 26.86 66.31 2.47
llama 13B BF16 pp240 27.76 70.40 2.54
llama 13B BF16 pp255 26.03 70.32 2.70
llama 13B BF16 pp256 26.04 70.34 2.70
llama 13B BF16 pp510 25.73 68.26 2.65
llama 13B BF16 pp512 25.74 67.97 2.64
llama 13B BF16 pp1024 25.27 66.76 2.64
llama 13B BF16 pp1025 25.04 64.84 2.59
llama 13B BF16 pp2048 24.63 63.96 2.60
llama 13B BF16 tg128 2.52 2.52 1.00

Mistral-Nemo-Instruct-2407.FP16.gguf +kv@fp16

Model Test t/s master t/s perfo/tinyblas Speedup
llama 13B F16 pp1 2.50 2.50 1.00
llama 13B F16 pp2 4.94 4.81 0.97
llama 13B F16 pp3 7.41 7.19 0.97
llama 13B F16 pp4 9.81 9.82 1.00
llama 13B F16 pp5 12.23 12.25 1.00
llama 13B F16 pp6 7.92 14.58 1.84
llama 13B F16 pp7 9.19 16.52 1.80
llama 13B F16 pp8 10.47 18.76 1.79
llama 13B F16 pp9 11.76 20.86 1.77
llama 13B F16 pp10 22.58 22.80 1.01
llama 13B F16 pp11 14.19 24.90 1.75
llama 13B F16 pp12 15.41 26.70 1.73
llama 13B F16 pp13 16.66 26.99 1.62
llama 13B F16 pp14 17.88 28.74 1.61
llama 13B F16 pp15 31.97 29.67 0.93
llama 13B F16 pp16 19.79 31.23 1.58
llama 13B F16 pp30 38.31 36.86 0.96
llama 13B F16 pp32 29.11 36.60 1.26
llama 13B F16 pp64 32.15 38.95 1.21
llama 13B F16 pp65 38.72 38.91 1.00
llama 13B F16 pp120 39.14 40.36 1.03
llama 13B F16 pp128 35.44 40.19 1.13
llama 13B F16 pp130 39.49 40.24 1.02
llama 13B F16 pp240 36.90 40.76 1.10
llama 13B F16 pp255 35.87 40.66 1.13
llama 13B F16 pp256 33.51 40.43 1.21
llama 13B F16 pp510 27.96 40.09 1.43
llama 13B F16 pp512 27.41 40.08 1.46
llama 13B F16 pp1024 27.27 39.03 1.43
llama 13B F16 pp1025 25.91 38.50 1.49
llama 13B F16 pp2048 26.75 37.95 1.42
llama 13B F16 tg128 2.50 2.51 1.00
  • AMD Ryzen 9 5950X 16-Core Processor (znver3, AVX2)

Mistral-Nemo-Instruct-2407.BF16.gguf +kv@bf16

Model Test t/s master t/s perfo/tinyblas Speedup
llama 13B BF16 pp1 2.21 2.21 1.00
llama 13B BF16 pp2 4.36 4.31 0.99
llama 13B BF16 pp3 6.44 6.44 1.00
llama 13B BF16 pp4 8.42 8.58 1.02
llama 13B BF16 pp5 10.29 10.71 1.04
llama 13B BF16 pp6 12.00 12.78 1.07
llama 13B BF16 pp7 13.53 14.86 1.10
llama 13B BF16 pp8 14.72 16.92 1.15
llama 13B BF16 pp9 15.61 18.93 1.21
llama 13B BF16 pp10 16.30 20.92 1.28
llama 13B BF16 pp11 16.93 23.01 1.36
llama 13B BF16 pp12 17.35 24.89 1.43
llama 13B BF16 pp13 17.69 26.94 1.52
llama 13B BF16 pp14 17.95 28.78 1.60
llama 13B BF16 pp15 18.21 30.64 1.68
llama 13B BF16 pp16 18.37 32.45 1.77
llama 13B BF16 pp30 19.20 42.87 2.23
llama 13B BF16 pp32 19.36 43.14 2.23
llama 13B BF16 pp64 19.85 45.05 2.27
llama 13B BF16 pp65 19.46 44.94 2.31
llama 13B BF16 pp120 19.98 46.27 2.32
llama 13B BF16 pp128 20.14 46.11 2.29
llama 13B BF16 pp130 19.97 45.93 2.30
llama 13B BF16 pp240 20.23 46.50 2.30
llama 13B BF16 pp255 20.24 46.54 2.30
llama 13B BF16 pp256 20.19 46.40 2.30
llama 13B BF16 pp510 20.09 46.01 2.29
llama 13B BF16 pp512 20.17 45.81 2.27
llama 13B BF16 pp1024 19.94 45.05 2.26
llama 13B BF16 pp1025 19.74 44.18 2.24
llama 13B BF16 pp2048 19.48 43.68 2.24
llama 13B BF16 tg128 2.21 2.21 1.00

Mistral-Nemo-Instruct-2407.FP16.gguf +kv@fp16

Model Test t/s master t/s perfo/tinyblas Speedup
llama 13B F16 pp1 2.19 2.19 1.00
llama 13B F16 pp2 4.30 4.28 1.00
llama 13B F16 pp3 6.46 6.41 0.99
llama 13B F16 pp4 4.84 8.53 1.76
llama 13B F16 pp5 6.01 10.64 1.77
llama 13B F16 pp6 12.90 12.71 0.99
llama 13B F16 pp7 8.50 14.81 1.74
llama 13B F16 pp8 9.64 16.88 1.75
llama 13B F16 pp9 19.25 18.90 0.98
llama 13B F16 pp10 12.25 20.88 1.70
llama 13B F16 pp11 13.39 22.94 1.71
llama 13B F16 pp12 25.40 24.89 0.98
llama 13B F16 pp13 15.89 26.87 1.69
llama 13B F16 pp14 17.02 28.74 1.69
llama 13B F16 pp15 30.82 30.66 0.99
llama 13B F16 pp16 19.45 32.55 1.67
llama 13B F16 pp30 34.23 54.23 1.58
llama 13B F16 pp32 27.34 55.37 2.03
llama 13B F16 pp64 30.66 58.46 1.91
llama 13B F16 pp65 31.03 57.47 1.85
llama 13B F16 pp120 35.31 58.01 1.64
llama 13B F16 pp128 33.09 57.76 1.75
llama 13B F16 pp130 33.05 58.28 1.76
llama 13B F16 pp240 35.19 58.66 1.67
llama 13B F16 pp255 35.24 58.62 1.66
llama 13B F16 pp256 33.98 58.57 1.72
llama 13B F16 pp510 33.76 57.94 1.72
llama 13B F16 pp512 33.34 57.51 1.72
llama 13B F16 pp1024 33.03 56.41 1.71
llama 13B F16 pp1025 32.59 53.96 1.66
llama 13B F16 pp2048 32.08 54.93 1.71
llama 13B F16 tg128 2.18 2.19 1.00
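
To make the idea concrete, here is a minimal sketch of a register-blocked GEMM with dispatch on the remaining tile size. This is not the actual tinyblas code; the function names and the 4x3 tile shape are invented for illustration. It shows the two points echoed in the commit messages later in the thread: pick the widest micro-kernel that still fits the remaining rows/columns (the dispatch strategy), and accumulate in registers so each C element is written once (reduced memory bandwidth).

#include <cstddef>

// Illustrative micro-kernel: accumulates an RM x RN block of C in registers.
// A rows and B columns are K-major panels; C is column-major.
template <int RM, int RN>
static void gemm_tile(const float *A, const float *B, float *C,
                      size_t k, size_t lda, size_t ldb, size_t ldc) {
    float acc[RM][RN] = {};
    for (size_t l = 0; l < k; ++l)
        for (int i = 0; i < RM; ++i)
            for (int j = 0; j < RN; ++j)
                acc[i][j] += A[i * lda + l] * B[j * ldb + l];
    for (int j = 0; j < RN; ++j)        // one store per C element:
        for (int i = 0; i < RM; ++i)    // keeps memory traffic low
            C[j * ldc + i] = acc[i][j];
}

// Illustrative dispatch: use the widest tile that fits, then fall back to
// narrower tiles for the tails instead of computing (and writing) past them.
static void gemm(const float *A, const float *B, float *C,
                 size_t m, size_t n, size_t k,
                 size_t lda, size_t ldb, size_t ldc) {
    size_t i = 0;
    for (; i + 4 <= m; i += 4) {
        size_t j = 0;
        for (; j + 3 <= n; j += 3)
            gemm_tile<4, 3>(A + i * lda, B + j * ldb, C + j * ldc + i, k, lda, ldb, ldc);
        for (; j < n; ++j)              // column tail
            gemm_tile<4, 1>(A + i * lda, B + j * ldb, C + j * ldc + i, k, lda, ldb, ldc);
    }
    for (; i < m; ++i)                  // row tail
        for (size_t j = 0; j < n; ++j)
            gemm_tile<1, 1>(A + i * lda, B + j * ldb, C + j * ldc + i, k, lda, ldb, ldc);
}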

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Dec 8, 2024
@Djip007 Djip007 marked this pull request as draft December 8, 2024 03:25
@Djip007 Djip007 force-pushed the perfo/tinyblas branch 5 times, most recently from f7c5a68 to b1c72b9 Compare December 9, 2024 22:34
@Djip007
Contributor Author

Djip007 commented Dec 10, 2024

Some perplexity results with the new code (vs. master, BF16/zen3).

#> zen3:
./build/bin/./llama-perplexity -ctk bf16 -ctv bf16 --kl-divergence-base ~/LLM/Mistral-Nemo-Instruct-2407.BF16.kld --kl-divergence -s 31337 -m ~/LLM/Mistral-Nemo-Instruct-2407.BF16.gguf
chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1       3.9452 ±    0.5268       0.00070 ±    0.00062       0.00003 ±    0.00000     0.186 ±  0.018 %    99.608 ±  0.392 %
   2       5.4448 ±    0.6044       0.00187 ±    0.00151       0.00004 ±    0.00000     0.183 ±  0.013 %    99.804 ±  0.196 %
   3       4.6848 ±    0.4030       0.00093 ±    0.00104       0.00006 ±    0.00000     0.255 ±  0.030 %    99.869 ±  0.131 %
   4       5.0051 ±    0.3673       0.00039 ±    0.00080       0.00005 ±    0.00000     0.243 ±  0.024 %    99.902 ±  0.098 %
   5       5.2917 ±    0.3433       0.00004 ±    0.00067       0.00006 ±    0.00000     0.243 ±  0.020 %    99.922 ±  0.078 %
   6       5.8289 ±    0.3542       0.00000 ±    0.00057       0.00006 ±    0.00000     0.233 ±  0.017 %    99.869 ±  0.092 %
   7       6.2242 ±    0.3544       0.00025 ±    0.00054       0.00006 ±    0.00000     0.228 ±  0.015 %    99.832 ±  0.097 %
   8       6.4312 ±    0.3454       0.00041 ±    0.00048       0.00006 ±    0.00000     0.229 ±  0.014 %    99.755 ±  0.110 %
   9       6.8865 ±    0.3580       0.00036 ±    0.00043       0.00006 ±    0.00000     0.227 ±  0.013 %    99.739 ±  0.107 %
  10       7.2362 ±    0.3590       0.00026 ±    0.00039       0.00006 ±    0.00000     0.224 ±  0.012 %    99.765 ±  0.096 %
  11       7.2572 ±    0.3420       0.00018 ±    0.00036       0.00005 ±    0.00000     0.218 ±  0.011 %    99.750 ±  0.094 %
  12       7.2827 ±    0.3297       0.00015 ±    0.00033       0.00007 ±    0.00001     0.230 ±  0.011 %    99.673 ±  0.103 %
  13       7.4379 ±    0.3228       0.00011 ±    0.00031       0.00007 ±    0.00001     0.226 ±  0.011 %    99.608 ±  0.109 %
  14       7.3367 ±    0.3061       0.00016 ±    0.00030       0.00007 ±    0.00001     0.224 ±  0.010 %    99.636 ±  0.101 %
  15       7.1258 ±    0.2859       0.00012 ±    0.00028       0.00007 ±    0.00001     0.222 ±  0.010 %    99.634 ±  0.098 %
  16       7.1695 ±    0.2792       0.00012 ±    0.00026       0.00007 ±    0.00001     0.223 ±  0.009 %    99.657 ±  0.092 %
  17       6.8048 ±    0.2538       0.00008 ±    0.00025       0.00007 ±    0.00001     0.223 ±  0.009 %    99.677 ±  0.086 %
  18       6.8631 ±    0.2517       0.00016 ±    0.00024       0.00007 ±    0.00001     0.221 ±  0.008 %    99.651 ±  0.087 %
  19       6.9983 ±    0.2515       0.00016 ±    0.00023       0.00007 ±    0.00001     0.220 ±  0.008 %    99.670 ±  0.082 %
  20       6.7969 ±    0.2383       0.00013 ±    0.00022       0.00007 ±    0.00001     0.229 ±  0.008 %    99.667 ±  0.081 %

./build/bin/./llama-perplexity --kl-divergence-base ~/LLM/Mistral-Nemo-Instruct-2407.BF16.kld --kl-divergence -s 31337 -m ~/LLM/Mistral-Nemo-Instruct-2407.BF16.gguf
chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1       3.9430 ±    0.5263       0.00016 ±    0.00050       0.00003 ±    0.00000     0.159 ±  0.014 %    100.000 ±  0.000 %
   2       5.4442 ±    0.6042       0.00177 ±    0.00153       0.00003 ±    0.00000     0.155 ±  0.009 %    99.804 ±  0.196 %
   3       4.6852 ±    0.4029       0.00103 ±    0.00104       0.00003 ±    0.00000     0.179 ±  0.010 %    99.869 ±  0.131 %
   4       5.0069 ±    0.3673       0.00074 ±    0.00079       0.00003 ±    0.00000     0.178 ±  0.009 %    99.902 ±  0.098 %
   5       5.2939 ±    0.3435       0.00046 ±    0.00064       0.00003 ±    0.00000     0.171 ±  0.008 %    99.922 ±  0.078 %
   6       5.8312 ±    0.3544       0.00039 ±    0.00053       0.00003 ±    0.00000     0.165 ±  0.007 %    99.935 ±  0.065 %
   7       6.2260 ±    0.3545       0.00055 ±    0.00050       0.00003 ±    0.00000     0.165 ±  0.006 %    99.888 ±  0.079 %
   8       6.4317 ±    0.3454       0.00047 ±    0.00044       0.00003 ±    0.00000     0.169 ±  0.007 %    99.853 ±  0.085 %
   9       6.8872 ±    0.3580       0.00047 ±    0.00039       0.00003 ±    0.00000     0.168 ±  0.006 %    99.826 ±  0.087 %
  10       7.2376 ±    0.3590       0.00045 ±    0.00036       0.00003 ±    0.00000     0.167 ±  0.006 %    99.804 ±  0.088 %
  11       7.2592 ±    0.3421       0.00045 ±    0.00033       0.00003 ±    0.00000     0.166 ±  0.005 %    99.822 ±  0.080 %
  12       7.2841 ±    0.3298       0.00033 ±    0.00031       0.00003 ±    0.00000     0.172 ±  0.005 %    99.837 ±  0.073 %
  13       7.4398 ±    0.3229       0.00036 ±    0.00029       0.00003 ±    0.00000     0.171 ±  0.005 %    99.849 ±  0.067 %
  14       7.3379 ±    0.3062       0.00033 ±    0.00027       0.00003 ±    0.00000     0.168 ±  0.005 %    99.860 ±  0.063 %
  15       7.1275 ±    0.2859       0.00035 ±    0.00025       0.00003 ±    0.00000     0.167 ±  0.005 %    99.843 ±  0.064 %
  16       7.1714 ±    0.2793       0.00039 ±    0.00024       0.00003 ±    0.00000     0.171 ±  0.005 %    99.828 ±  0.065 %
  17       6.8067 ±    0.2539       0.00036 ±    0.00023       0.00003 ±    0.00000     0.169 ±  0.004 %    99.839 ±  0.061 %
  18       6.8643 ±    0.2518       0.00033 ±    0.00022       0.00003 ±    0.00000     0.168 ±  0.004 %    99.804 ±  0.065 %
  19       6.9991 ±    0.2515       0.00027 ±    0.00021       0.00003 ±    0.00000     0.166 ±  0.004 %    99.814 ±  0.062 %
  20       6.7977 ±    0.2383       0.00026 ±    0.00020       0.00003 ±    0.00000     0.168 ±  0.004 %    99.824 ±  0.059 %

./build/bin/./llama-perplexity --kl-divergence-base ~/LLM/Mistral-Nemo-Instruct-2407.BF16.kld --kl-divergence -s 31337 -m ~/LLM/Mistral-Nemo-Instruct-2407.F16.gguf
chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1       3.9425 ±    0.5261       0.00002 ±    0.00006       0.00000 ±    0.00000     0.022 ±  0.002 %    100.000 ±  0.000 %
   2       5.4427 ±    0.6040       0.00148 ±    0.00151       0.00000 ±    0.00000     0.023 ±  0.002 %    100.000 ±  0.000 %
   3       4.6851 ±    0.4029       0.00100 ±    0.00101       0.00000 ±    0.00000     0.029 ±  0.003 %    100.000 ±  0.000 %
   4       5.0068 ±    0.3674       0.00073 ±    0.00075       0.00000 ±    0.00000     0.029 ±  0.002 %    100.000 ±  0.000 %
   5       5.2945 ±    0.3436       0.00057 ±    0.00060       0.00000 ±    0.00000     0.028 ±  0.002 %    100.000 ±  0.000 %
   6       5.8317 ±    0.3545       0.00048 ±    0.00050       0.00000 ±    0.00000     0.027 ±  0.002 %    100.000 ±  0.000 %
   7       6.2264 ±    0.3545       0.00061 ±    0.00047       0.00000 ±    0.00000     0.027 ±  0.001 %    100.000 ±  0.000 %
   8       6.4321 ±    0.3454       0.00054 ±    0.00041       0.00000 ±    0.00000     0.027 ±  0.001 %    100.000 ±  0.000 %
   9       6.8873 ±    0.3580       0.00049 ±    0.00036       0.00000 ±    0.00000     0.027 ±  0.001 %    100.000 ±  0.000 %
  10       7.2376 ±    0.3591       0.00045 ±    0.00033       0.00000 ±    0.00000     0.027 ±  0.001 %    99.961 ±  0.039 %
  11       7.2589 ±    0.3421       0.00041 ±    0.00030       0.00000 ±    0.00000     0.026 ±  0.001 %    99.964 ±  0.036 %
  12       7.2846 ±    0.3299       0.00040 ±    0.00027       0.00000 ±    0.00000     0.027 ±  0.001 %    99.967 ±  0.033 %
  13       7.4399 ±    0.3229       0.00038 ±    0.00025       0.00000 ±    0.00000     0.026 ±  0.001 %    99.970 ±  0.030 %
  14       7.3381 ±    0.3062       0.00035 ±    0.00023       0.00000 ±    0.00000     0.026 ±  0.001 %    99.972 ±  0.028 %
  15       7.1273 ±    0.2860       0.00033 ±    0.00022       0.00000 ±    0.00000     0.026 ±  0.001 %    99.974 ±  0.026 %
  16       7.1709 ±    0.2793       0.00031 ±    0.00021       0.00000 ±    0.00000     0.026 ±  0.001 %    99.975 ±  0.025 %
  17       6.8063 ±    0.2539       0.00030 ±    0.00019       0.00000 ±    0.00000     0.027 ±  0.001 %    99.977 ±  0.023 %
  18       6.8639 ±    0.2518       0.00028 ±    0.00018       0.00000 ±    0.00000     0.027 ±  0.001 %    99.956 ±  0.031 %
  19       6.9991 ±    0.2515       0.00027 ±    0.00017       0.00000 ±    0.00000     0.027 ±  0.001 %    99.959 ±  0.029 %
  20       6.7978 ±    0.2384       0.00026 ±    0.00016       0.00000 ±    0.00000     0.027 ±  0.001 %    99.961 ±  0.028 %

./build/bin/./llama-perplexity --kl-divergence-base ~/LLM/Mistral-Nemo-Instruct-2407.BF16.kld --kl-divergence -s 31337 -m ~/LLM/Mistral-Nemo-Instruct-2407.Q8_0.gguf
chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1       3.9483 ±    0.5259       0.00149 ±    0.00252       0.00062 ±    0.00005     0.837 ±  0.073 %    99.216 ±  0.554 %
   2       5.4523 ±    0.6051       0.00324 ±    0.00270       0.00075 ±    0.00006     0.820 ±  0.050 %    99.020 ±  0.437 %
   3       4.6906 ±    0.4032       0.00217 ±    0.00211       0.00088 ±    0.00005     0.945 ±  0.056 %    98.824 ±  0.390 %
   4       5.0117 ±    0.3678       0.00170 ±    0.00173       0.00085 ±    0.00004     0.935 ±  0.047 %    98.824 ±  0.338 %
   5       5.3001 ±    0.3441       0.00161 ±    0.00147       0.00085 ±    0.00004     0.928 ±  0.040 %    98.745 ±  0.312 %
   6       5.8416 ±    0.3554       0.00217 ±    0.00130       0.00085 ±    0.00003     0.902 ±  0.036 %    98.889 ±  0.268 %
   7       6.2372 ±    0.3555       0.00234 ±    0.00124       0.00086 ±    0.00003     0.932 ±  0.040 %    98.824 ±  0.255 %
   8       6.4396 ±    0.3462       0.00171 ±    0.00114       0.00087 ±    0.00003     0.923 ±  0.036 %    98.578 ±  0.262 %
   9       6.8928 ±    0.3586       0.00129 ±    0.00106       0.00089 ±    0.00003     0.927 ±  0.036 %    98.562 ±  0.249 %
  10       7.2420 ±    0.3596       0.00106 ±    0.00099       0.00088 ±    0.00002     0.905 ±  0.033 %    98.549 ±  0.237 %
  11       7.2616 ±    0.3424       0.00079 ±    0.00092       0.00085 ±    0.00002     0.887 ±  0.031 %    98.610 ±  0.221 %
  12       7.2866 ±    0.3302       0.00068 ±    0.00087       0.00086 ±    0.00002     0.888 ±  0.029 %    98.529 ±  0.218 %
  13       7.4416 ±    0.3232       0.00061 ±    0.00084       0.00087 ±    0.00002     0.878 ±  0.027 %    98.431 ±  0.216 %
  14       7.3400 ±    0.3065       0.00061 ±    0.00080       0.00087 ±    0.00002     0.869 ±  0.026 %    98.403 ±  0.210 %
  15       7.1295 ±    0.2862       0.00063 ±    0.00076       0.00086 ±    0.00002     0.863 ±  0.025 %    98.431 ±  0.201 %
  16       7.1739 ±    0.2795       0.00074 ±    0.00074       0.00088 ±    0.00002     0.879 ±  0.024 %    98.407 ±  0.196 %
  17       6.8092 ±    0.2541       0.00074 ±    0.00071       0.00086 ±    0.00002     0.873 ±  0.024 %    98.431 ±  0.189 %
  18       6.8646 ±    0.2519       0.00038 ±    0.00069       0.00085 ±    0.00002     0.869 ±  0.023 %    98.453 ±  0.182 %
  19       6.9997 ±    0.2516       0.00035 ±    0.00067       0.00085 ±    0.00002     0.865 ±  0.022 %    98.431 ±  0.179 %
  20       6.7990 ±    0.2385       0.00044 ±    0.00065       0.00085 ±    0.00002     0.877 ±  0.021 %    98.451 ±  0.173 %

#> zen4:
./build/bin/./llama-perplexity -ctk bf16 -ctv bf16 --kl-divergence-base ~/LLM/Mistral-Nemo-Instruct-2407.BF16.kld --kl-divergence -s 31337 -m ~/LLM/Mistral-Nemo-Instruct-2407.BF16.gguf
chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1       3.9443 ±    0.5267       0.00048 ±    0.00053       0.00004 ±    0.00001     0.169 ±  0.019 %    99.608 ±  0.392 %
   2       5.4419 ±    0.6039       0.00133 ±    0.00153       0.00005 ±    0.00001     0.167 ±  0.012 %    99.412 ±  0.339 %
   3       4.6835 ±    0.4027       0.00066 ±    0.00105       0.00006 ±    0.00001     0.236 ±  0.021 %    99.608 ±  0.226 %
   4       5.0057 ±    0.3672       0.00051 ±    0.00080       0.00005 ±    0.00000     0.231 ±  0.017 %    99.608 ±  0.196 %
   5       5.2931 ±    0.3434       0.00030 ±    0.00065       0.00005 ±    0.00000     0.220 ±  0.014 %    99.686 ±  0.157 %
   6       5.8307 ±    0.3543       0.00030 ±    0.00055       0.00005 ±    0.00000     0.216 ±  0.012 %    99.739 ±  0.131 %
   7       6.2255 ±    0.3544       0.00047 ±    0.00052       0.00005 ±    0.00000     0.210 ±  0.011 %    99.664 ±  0.137 %
   8       6.4316 ±    0.3454       0.00047 ±    0.00046       0.00005 ±    0.00000     0.218 ±  0.010 %    99.657 ±  0.130 %
   9       6.8874 ±    0.3580       0.00050 ±    0.00041       0.00005 ±    0.00000     0.213 ±  0.010 %    99.608 ±  0.130 %
  10       7.2365 ±    0.3589       0.00030 ±    0.00038       0.00005 ±    0.00000     0.209 ±  0.009 %    99.569 ±  0.130 %
  11       7.2584 ±    0.3420       0.00034 ±    0.00035       0.00005 ±    0.00000     0.205 ±  0.008 %    99.572 ±  0.123 %
  12       7.2835 ±    0.3298       0.00025 ±    0.00032       0.00005 ±    0.00000     0.211 ±  0.009 %    99.542 ±  0.122 %
  13       7.4389 ±    0.3228       0.00025 ±    0.00030       0.00005 ±    0.00000     0.209 ±  0.009 %    99.578 ±  0.113 %
  14       7.3370 ±    0.3061       0.00021 ±    0.00029       0.00005 ±    0.00000     0.208 ±  0.008 %    99.524 ±  0.115 %
  15       7.1270 ±    0.2859       0.00028 ±    0.00027       0.00005 ±    0.00000     0.209 ±  0.008 %    99.529 ±  0.111 %
  16       7.1706 ±    0.2792       0.00027 ±    0.00026       0.00005 ±    0.00000     0.216 ±  0.007 %    99.510 ±  0.109 %
  17       6.8060 ±    0.2538       0.00026 ±    0.00025       0.00006 ±    0.00000     0.215 ±  0.007 %    99.539 ±  0.103 %
  18       6.8641 ±    0.2517       0.00030 ±    0.00024       0.00006 ±    0.00000     0.213 ±  0.007 %    99.521 ±  0.102 %
  19       6.9992 ±    0.2515       0.00028 ±    0.00023       0.00006 ±    0.00000     0.214 ±  0.007 %    99.546 ±  0.097 %
  20       6.7980 ±    0.2384       0.00030 ±    0.00022       0.00006 ±    0.00000     0.217 ±  0.007 %    99.569 ±  0.092 %

./build/bin/./llama-perplexity --kl-divergence-base ~/LLM/Mistral-Nemo-Instruct-2407.BF16.kld --kl-divergence -s 31337 -m ~/LLM/Mistral-Nemo-Instruct-2407.BF16.gguf
chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1       3.9433 ±    0.5262       0.00024 ±    0.00045       0.00003 ±    0.00000     0.166 ±  0.028 %    99.216 ±  0.554 %
   2       5.4430 ±    0.6041       0.00154 ±    0.00158       0.00003 ±    0.00000     0.167 ±  0.016 %    99.216 ±  0.391 %
   3       4.6865 ±    0.4030       0.00130 ±    0.00108       0.00004 ±    0.00001     0.205 ±  0.024 %    99.477 ±  0.261 %
   4       5.0071 ±    0.3673       0.00079 ±    0.00082       0.00003 ±    0.00000     0.193 ±  0.019 %    99.510 ±  0.219 %
   5       5.2951 ±    0.3436       0.00068 ±    0.00066       0.00003 ±    0.00000     0.185 ±  0.016 %    99.608 ±  0.175 %
   6       5.8319 ±    0.3544       0.00051 ±    0.00056       0.00003 ±    0.00000     0.178 ±  0.014 %    99.608 ±  0.160 %
   7       6.2266 ±    0.3545       0.00064 ±    0.00051       0.00003 ±    0.00000     0.174 ±  0.012 %    99.608 ±  0.148 %
   8       6.4322 ±    0.3455       0.00056 ±    0.00045       0.00003 ±    0.00000     0.177 ±  0.011 %    99.657 ±  0.130 %
   9       6.8885 ±    0.3581       0.00066 ±    0.00041       0.00003 ±    0.00000     0.175 ±  0.010 %    99.608 ±  0.130 %
  10       7.2386 ±    0.3592       0.00059 ±    0.00037       0.00003 ±    0.00000     0.172 ±  0.009 %    99.647 ±  0.117 %
  11       7.2603 ±    0.3422       0.00060 ±    0.00034       0.00003 ±    0.00000     0.170 ±  0.009 %    99.679 ±  0.107 %
  12       7.2852 ±    0.3299       0.00049 ±    0.00031       0.00003 ±    0.00000     0.173 ±  0.008 %    99.673 ±  0.103 %
  13       7.4408 ±    0.3230       0.00049 ±    0.00029       0.00003 ±    0.00000     0.172 ±  0.008 %    99.698 ±  0.095 %
  14       7.3386 ±    0.3063       0.00041 ±    0.00028       0.00003 ±    0.00000     0.170 ±  0.007 %    99.692 ±  0.093 %
  15       7.1278 ±    0.2860       0.00040 ±    0.00026       0.00003 ±    0.00000     0.167 ±  0.007 %    99.686 ±  0.090 %
  16       7.1714 ±    0.2793       0.00038 ±    0.00024       0.00003 ±    0.00000     0.166 ±  0.006 %    99.706 ±  0.085 %
  17       6.8064 ±    0.2539       0.00032 ±    0.00023       0.00003 ±    0.00000     0.164 ±  0.006 %    99.723 ±  0.080 %
  18       6.8641 ±    0.2518       0.00031 ±    0.00022       0.00003 ±    0.00000     0.162 ±  0.006 %    99.695 ±  0.081 %
  19       6.9994 ±    0.2515       0.00032 ±    0.00021       0.00003 ±    0.00000     0.161 ±  0.006 %    99.711 ±  0.077 %
  20       6.7979 ±    0.2384       0.00028 ±    0.00020       0.00003 ±    0.00000     0.165 ±  0.005 %    99.706 ±  0.076 %

./build/bin/./llama-perplexity --kl-divergence-base ~/LLM/Mistral-Nemo-Instruct-2407.BF16.kld --kl-divergence -s 31337 -m ~/LLM/Mistral-Nemo-Instruct-2407.F16.gguf
chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1       3.9432 ±    0.5262       0.00021 ±    0.00007       0.00000 ±    0.00000     0.023 ±  0.003 %    100.000 ±  0.000 %
   2       5.4435 ±    0.6041       0.00163 ±    0.00150       0.00000 ±    0.00000     0.025 ±  0.002 %    100.000 ±  0.000 %
   3       4.6856 ±    0.4029       0.00111 ±    0.00100       0.00000 ±    0.00000     0.030 ±  0.002 %    100.000 ±  0.000 %
   4       5.0072 ±    0.3674       0.00081 ±    0.00075       0.00000 ±    0.00000     0.029 ±  0.002 %    100.000 ±  0.000 %
   5       5.2951 ±    0.3437       0.00067 ±    0.00060       0.00000 ±    0.00000     0.030 ±  0.002 %    100.000 ±  0.000 %
   6       5.8323 ±    0.3545       0.00057 ±    0.00050       0.00000 ±    0.00000     0.029 ±  0.002 %    100.000 ±  0.000 %
   7       6.2269 ±    0.3546       0.00069 ±    0.00047       0.00000 ±    0.00000     0.028 ±  0.001 %    100.000 ±  0.000 %
   8       6.4324 ±    0.3455       0.00059 ±    0.00041       0.00000 ±    0.00000     0.028 ±  0.001 %    100.000 ±  0.000 %
   9       6.8876 ±    0.3581       0.00053 ±    0.00036       0.00000 ±    0.00000     0.027 ±  0.001 %    100.000 ±  0.000 %
  10       7.2379 ±    0.3591       0.00049 ±    0.00033       0.00000 ±    0.00000     0.027 ±  0.001 %    99.961 ±  0.039 %
  11       7.2590 ±    0.3421       0.00043 ±    0.00030       0.00000 ±    0.00000     0.027 ±  0.001 %    99.964 ±  0.036 %
  12       7.2847 ±    0.3299       0.00041 ±    0.00027       0.00000 ±    0.00000     0.028 ±  0.001 %    99.935 ±  0.046 %
  13       7.4400 ±    0.3229       0.00039 ±    0.00025       0.00000 ±    0.00000     0.028 ±  0.001 %    99.940 ±  0.043 %
  14       7.3381 ±    0.3062       0.00035 ±    0.00023       0.00000 ±    0.00000     0.028 ±  0.001 %    99.888 ±  0.056 %
  15       7.1273 ±    0.2860       0.00032 ±    0.00022       0.00000 ±    0.00000     0.028 ±  0.001 %    99.895 ±  0.052 %
  16       7.1708 ±    0.2793       0.00030 ±    0.00021       0.00000 ±    0.00000     0.030 ±  0.001 %    99.902 ±  0.049 %
  17       6.8062 ±    0.2539       0.00029 ±    0.00019       0.00000 ±    0.00000     0.030 ±  0.001 %    99.908 ±  0.046 %
  18       6.8639 ±    0.2518       0.00027 ±    0.00018       0.00000 ±    0.00000     0.030 ±  0.001 %    99.891 ±  0.049 %
  19       6.9991 ±    0.2515       0.00027 ±    0.00017       0.00000 ±    0.00000     0.030 ±  0.001 %    99.897 ±  0.046 %
  20       6.7978 ±    0.2384       0.00027 ±    0.00016       0.00000 ±    0.00000     0.030 ±  0.001 %    99.882 ±  0.048 %

./build/bin/./llama-perplexity --kl-divergence-base ~/LLM/Mistral-Nemo-Instruct-2407.BF16.kld --kl-divergence -s 31337 -m ~/LLM/Mistral-Nemo-Instruct-2407.Q8_0.gguf
chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1       3.9525 ±    0.5268       0.00257 ±    0.00200       0.00062 ±    0.00006     0.780 ±  0.079 %    98.824 ±  0.677 %
   2       5.4551 ±    0.6050       0.00375 ±    0.00244       0.00074 ±    0.00005     0.790 ±  0.052 %    99.020 ±  0.437 %
   3       4.6895 ±    0.4028       0.00195 ±    0.00192       0.00094 ±    0.00007     0.974 ±  0.066 %    98.824 ±  0.390 %
   4       5.0104 ±    0.3675       0.00144 ±    0.00155       0.00089 ±    0.00005     0.958 ±  0.054 %    98.922 ±  0.324 %
   5       5.2958 ±    0.3435       0.00082 ±    0.00133       0.00089 ±    0.00004     0.924 ±  0.046 %    98.902 ±  0.292 %
   6       5.8327 ±    0.3544       0.00064 ±    0.00118       0.00087 ±    0.00004     0.897 ±  0.040 %    98.824 ±  0.276 %
   7       6.2283 ±    0.3547       0.00092 ±    0.00114       0.00088 ±    0.00003     0.888 ±  0.036 %    98.768 ±  0.261 %
   8       6.4327 ±    0.3453       0.00063 ±    0.00107       0.00088 ±    0.00003     0.877 ±  0.033 %    98.529 ±  0.267 %
   9       6.8834 ±    0.3577      -0.00008 ±    0.00100       0.00088 ±    0.00003     0.869 ±  0.030 %    98.562 ±  0.249 %
  10       7.2338 ±    0.3588      -0.00007 ±    0.00094       0.00087 ±    0.00002     0.859 ±  0.028 %    98.431 ±  0.246 %
  11       7.2576 ±    0.3420       0.00023 ±    0.00088       0.00086 ±    0.00002     0.843 ±  0.026 %    98.503 ±  0.229 %
  12       7.2853 ±    0.3300       0.00051 ±    0.00084       0.00088 ±    0.00002     0.851 ±  0.025 %    98.529 ±  0.218 %
  13       7.4384 ±    0.3229       0.00018 ±    0.00080       0.00087 ±    0.00002     0.841 ±  0.024 %    98.522 ±  0.210 %
  14       7.3364 ±    0.3062       0.00011 ±    0.00076       0.00087 ±    0.00002     0.838 ±  0.023 %    98.543 ±  0.201 %
  15       7.1259 ±    0.2859       0.00013 ±    0.00073       0.00086 ±    0.00002     0.833 ±  0.022 %    98.614 ±  0.189 %
  16       7.1687 ±    0.2792       0.00001 ±    0.00070       0.00086 ±    0.00002     0.841 ±  0.021 %    98.603 ±  0.184 %
  17       6.8051 ±    0.2538       0.00013 ±    0.00067       0.00085 ±    0.00002     0.838 ±  0.020 %    98.570 ±  0.180 %
  18       6.8615 ±    0.2516      -0.00007 ±    0.00066       0.00084 ±    0.00002     0.839 ±  0.020 %    98.540 ±  0.177 %
  19       6.9961 ±    0.2514      -0.00016 ±    0.00064       0.00084 ±    0.00002     0.838 ±  0.019 %    98.514 ±  0.174 %
  20       6.7965 ±    0.2384       0.00008 ±    0.00062       0.00085 ±    0.00002     0.855 ±  0.020 %    98.529 ±  0.169 %

@Djip007
Contributor Author

Djip007 commented Dec 10, 2024

Looks good to me.
@ggerganov @slaren, what do you think about the 3 failed checks?

@Djip007 Djip007 marked this pull request as ready for review December 10, 2024 00:52
@slaren
Member

slaren commented Dec 10, 2024

@ggerganov @slaren what do you think with the 3 failed?

Not sure, try merging the current master to see if it is some issue in the server that has already been fixed.

@Djip007
Contributor Author

Djip007 commented Dec 11, 2024

--------------------------- Captured stdout teardown ---------------------------
Stopping server with pid=4213
=========================== short test summary info ============================
FAILED unit/test_completion.py::test_consistent_result_same_seed[2] - AssertionError: assert ' making. Eve...hen, they saw' == ' making. Eve...ining and dan'
  
     making. Everyone are very hungry.
  - One day, it is time to go to the park with his mom. They had a quiet window. They were shining and dan
  + One day, it is time to go to the park with his mom. They had a talking eye to rest. But then, they saw
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
======================== 1 failed, 32 passed in 22.85s =========================

Looks like a small diff in the result.

unit/test_completion.py::test_consistent_result_same_seed[1] PASSED      [ 34%]
unit/test_completion.py::test_consistent_result_same_seed[2] FAILED      [ 35%]
@pytest.mark.parametrize("n_slots", [1, 2])
def test_consistent_result_same_seed(n_slots: int):
    global server
    server.n_slots = n_slots

what is n_slots?

I have to check some elements in my code tomorrow...

@slaren
Member

slaren commented Dec 11, 2024

what is n_slots?

I am not sure what's the effect of increasing the number of slots for this test. I suspect that this error might indicate there is a buffer overflow somewhere, and random data beyond the tensor buffer may be causing it to generate different sequences despite using the same seed.

@Djip007
Contributor Author

Djip007 commented Dec 11, 2024

I suspect that this error might indicate there is a buffer overflow somewhere, and random data beyond the tensor buffer may be causing it to generate different sequences despite using the same seed.

That's what I was thinking last night, but it was too late. I have an idea of what it might be, but I was too tired to check and correct it.

@Djip007 Djip007 marked this pull request as draft December 11, 2024 18:58
@ggerganov
Member

The failing test seems to be using 2 slots. With 2 slots, the KV cache buffer is shared among the two generations. Initially, the buffer is empty:

..................................................................

Then the first request is processed by slot 0 and thus the beginning of the buffer is occupied:

000000000000000000000000000000000000..............................

The second request is processed on slot 1, so the old data remains in the buffer:

0000000000000000000000000000000000001111111111111111111111111111..

Because we compute the attention on the entire buffer by masking out the cross-sequence values, it is actually possible to get different results between the 2 generations. This happens due to floating-point summation across the length of the KV buffer. In the next example, even though the data in the buffer is the same, it can lead to different numerical results during the V*QK matrix multiplication simply because the data occupies different cells and the SIMD groups would produce different results:

000000000000000000000000000000000000..1111111111111111111111111111

I'm thinking that maybe there isn't a bug in the implementation in this PR, and it's a side-effect of the unified KV cache. Probably this test for n_slots > 1 should be disabled for now.
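
A minimal standalone illustration of this effect (not llama.cpp code): summing the same values with a different grouping, as SIMD lane reductions do, can give a different floating-point result.

#include <cstdio>

int main() {
    // Same four values, two summation orders.
    const float v[4] = {1e8f, 1.0f, -1e8f, 1.0f};

    float seq = 0.0f;                   // left-to-right: (1e8 + 1) rounds
    for (int i = 0; i < 4; ++i)         // back to 1e8 in float
        seq += v[i];

    float lanes = (v[0] + v[2]) + (v[1] + v[3]);  // pairwise, like 2 SIMD lanes

    printf("sequential: %g\n", seq);    // prints 1
    printf("pairwise:   %g\n", lanes);  // prints 2
    return 0;
}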

@Djip007
Contributor Author

Djip007 commented Dec 11, 2024

@ggerganov
Great! Thanks for these explanations. Very clear, and very nice to learn how it works.

On the other hand, going over my code step by step, there are a small number of cases (2 to ~5?) where I do too much calculation and write an incorrect value out of bounds (possibly overwriting correct data that I just calculated...)

So I corrected that. It remains to be seen whether the test passes.
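
The class of bug being described looks roughly like this (a hypothetical sketch, not the actual sgemm code): a fixed-size micro-kernel applied at the matrix edge writes past the last valid row unless the tail is clamped.

#include <cstddef>

// Hypothetical: storing a fixed 4-row accumulator column at row offset i0.
void store_col_buggy(float *C, const float acc[4], size_t i0, size_t m) {
    for (int i = 0; i < 4; ++i)
        C[i0 + i] = acc[i];             // writes past row m when i0 + 4 > m
}

void store_col_fixed(float *C, const float acc[4], size_t i0, size_t m) {
    const size_t rows = (m - i0 < 4) ? (m - i0) : 4;
    for (size_t i = 0; i < rows; ++i)   // clamp to the rows that exist
        C[i0 + i] = acc[i];
}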

@Djip007
Contributor Author

Djip007 commented Dec 11, 2024

Well, that wasn't enough, but at least I fixed a small bug.

I'm doing another pass on the perplexity to be sure of my last correction.

Member

@slaren slaren left a comment


I have been running random tests with test-backend-ops and I haven't seen any failure, so I am fairly confident that this is correct. Let's just disable the server test for 2 slots.

@Djip007
Contributor Author

Djip007 commented Dec 11, 2024

I have been running random tests with test-backend-ops and I haven't seen any failure, so I am fairly confident that this is correct. Let's just disable the server test for 2 slots.

Not sure how to do it:

# replace
# @pytest.mark.parametrize("n_slots", [1, 2])
# with this?
@pytest.mark.parametrize("n_slots", [1])
def test_consistent_result_same_seed(n_slots: int):
    global server
    server.n_slots = n_slots
    server.start()
    last_res = None
    for _ in range(4):
        res = server.make_request("POST", "/completion", data={
            "prompt": "I believe the meaning of life is",
            "seed": 42,
            "temperature": 1.0,
            "cache_prompt": False,  # TODO: remove this once test_cache_vs_nocache_prompt is fixed
        })
        if last_res is not None:
            assert res.body["content"] == last_res.body["content"]
        last_res = res

@github-actions github-actions bot added examples python python script changes server labels Dec 11, 2024
@ggerganov
Member

A different test is failing now. Add:

--- a/examples/server/tests/unit/test_completion.py
+++ b/examples/server/tests/unit/test_completion.py
@@ -116,6 +116,7 @@ def test_different_result_different_seed(n_slots: int):
 def test_consistent_result_different_batch_size(n_batch: int, temperature: float):
     global server
     server.n_batch = n_batch
+    server.n_slots = 1
     server.start()
     last_res = None
     for _ in range(4):

@slaren
Member

slaren commented Dec 11, 2024

On my system (Intel 13900K) I see better performance with BF16, but worse with F16 in some cases:

Model Test t/s master t/s perfo/tinyblas Speedup
llama 7B BF16 pp32 33.94 42.27 1.25
llama 7B BF16 pp64 34.82 43.27 1.24
llama 7B BF16 pp128 33.22 43.69 1.32
llama 7B BF16 tg32 6.75 6.41 0.95
llama 7B F16 pp32 41.45 28.85 0.70
llama 7B F16 pp64 42.88 26.34 0.61
llama 7B F16 pp128 43.69 29.75 0.68
llama 7B F16 tg32 6.82 6.47 0.95

With different numbers of threads:

Model Threads Test t/s master t/s perfo/tinyblas Speedup
llama 7B F16 8 pp64 51.98 59.14 1.14
llama 7B F16 16 pp64 35.20 28.97 0.82
llama 7B F16 24 pp64 63.18 43.40 0.69
llama 7B F16 32 pp64 75.45 54.18 0.72

@Djip007
Contributor Author

Djip007 commented Dec 11, 2024

On my system (Intel 13900K)

Is it AVX512 or AVX2?
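
(For reference: on Linux you can check with grep -o 'avx512[a-z]*' /proc/cpuinfo | sort -u, or with a small probe using the GCC/Clang builtins, sketched below. The 13900K is a Raptor Lake part, which has AVX2 but ships with AVX-512 disabled.)

#include <cstdio>

int main() {
    __builtin_cpu_init();   // GCC/Clang builtin: populate CPU feature flags
    printf("AVX2:    %d\n", __builtin_cpu_supports("avx2"));
    printf("AVX512F: %d\n", __builtin_cpu_supports("avx512f"));
    return 0;
}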

@Djip007 Djip007 force-pushed the perfo/tinyblas branch 3 times, most recently from b2dab60 to 30ae0d2 Compare December 14, 2024 21:06
- add bf16 support
- change dispatch strategy (thanks: ikawrakow/ik_llama.cpp#71)
- reduce memory bandwidth

simple tinyblas dispatch and more cache friendly
- show-progress is not part of GNU Wget2
@Djip007
Contributor Author

Djip007 commented Dec 24, 2024

OK, the code looks good and I get good performance with the Ryzen 9 5950X and 7945HS.

I need to "remove" the non-working test in the "Server" check.

@Djip007
Contributor Author

Djip007 commented Dec 24, 2024

Some last benchmarks ("without" u-batch, i.e. -ub 2048 -b 2048):

./scripts/compare-commits.sh master perfo/tinyblas -ctk bf16 -ctv bf16 -r 3 -ub 2048 -b 2048 -m "Mistral-Nemo-Instruct-2407.BF16.gguf,Mistral-7B-Instruct-v0.3-BF16.gguf" -p "1,1,1,2,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,30,31,32,33,34,35,60,61,62,63,64,65,127,128,129,130,131,132,255,256,257,258,259,260,510,511,512,513,514,515,1023,1024,1025,1026,1027,1028,2048"
./scripts/compare-commits.sh master perfo/tinyblas -ctk  f16 -ctv  f16 -r 3 -ub 2048 -b 2048 -m "Mistral-Nemo-Instruct-2407.F16.gguf,Mistral-7B-Instruct-v0.3-FP16.gguf"  -p "1,1,1,2,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,30,31,32,33,34,35,60,61,62,63,64,65,127,128,129,130,131,132,255,256,257,258,259,260,510,511,512,513,514,515,1023,1024,1025,1026,1027,1028,2048"

Do not compare directly with previous results; I made some changes to the BIOS config (PBO / max TDP...).

  • AMD Ryzen 9 7940HS
Model Test t/s master t/s perfo/tinyblas Speedup
llama 13B BF16 pp1 2.50 2.50 1.00
llama 13B BF16 pp2 4.93 4.92 1.00
llama 13B BF16 pp3 7.29 7.36 1.01
llama 13B BF16 pp4 9.43 9.77 1.04
llama 13B BF16 pp5 11.52 12.19 1.06
llama 13B BF16 pp6 13.38 14.52 1.08
llama 13B BF16 pp7 14.93 16.82 1.13
llama 13B BF16 pp8 16.25 19.12 1.18
llama 13B BF16 pp9 16.62 21.29 1.28
llama 13B BF16 pp10 17.23 23.61 1.37
llama 13B BF16 pp11 17.79 25.73 1.45
llama 13B BF16 pp12 18.14 27.73 1.53
llama 13B BF16 pp13 18.42 29.29 1.59
llama 13B BF16 pp14 18.72 31.34 1.67
llama 13B BF16 pp15 18.99 32.94 1.73
llama 13B BF16 pp16 19.20 34.85 1.82
llama 13B BF16 pp30 20.39 46.59 2.28
llama 13B BF16 pp31 20.49 42.57 2.08
llama 13B BF16 pp32 20.56 42.90 2.09
llama 13B BF16 pp33 20.87 44.29 2.12
llama 13B BF16 pp34 19.86 44.94 2.26
llama 13B BF16 pp35 20.01 46.20 2.31
llama 13B BF16 pp60 21.04 47.46 2.26
llama 13B BF16 pp61 21.08 46.64 2.21
llama 13B BF16 pp62 21.12 47.06 2.23
llama 13B BF16 pp63 21.17 47.59 2.25
llama 13B BF16 pp64 21.20 47.77 2.25
llama 13B BF16 pp65 20.49 47.88 2.34
llama 13B BF16 pp127 21.41 54.12 2.53
llama 13B BF16 pp128 21.46 54.14 2.52
llama 13B BF16 pp129 22.27 54.93 2.47
llama 13B BF16 pp130 20.71 53.92 2.60
llama 13B BF16 pp131 21.08 53.82 2.55
llama 13B BF16 pp132 20.99 54.07 2.58
llama 13B BF16 pp255 23.21 55.18 2.38
llama 13B BF16 pp256 22.95 55.09 2.40
llama 13B BF16 pp257 22.02 55.01 2.50
llama 13B BF16 pp258 22.19 55.11 2.48
llama 13B BF16 pp259 22.24 54.95 2.47
llama 13B BF16 pp260 22.31 55.03 2.47
llama 13B BF16 pp510 22.73 54.50 2.40
llama 13B BF16 pp511 22.82 54.48 2.39
llama 13B BF16 pp512 22.82 54.48 2.39
llama 13B BF16 pp513 22.23 54.56 2.45
llama 13B BF16 pp514 22.24 54.64 2.46
llama 13B BF16 pp515 22.24 54.62 2.46
llama 13B BF16 pp1023 20.64 52.60 2.55
llama 13B BF16 pp1024 20.78 52.32 2.52
llama 13B BF16 pp1025 20.37 52.84 2.59
llama 13B BF16 pp1026 20.44 52.78 2.58
llama 13B BF16 pp1027 20.46 52.77 2.58
llama 13B BF16 pp1028 20.47 52.63 2.57
llama 13B BF16 pp2048 19.91 49.12 2.47
llama 13B BF16 tg128 2.50 2.50 1.00
llama 7B BF16 pp1 4.06 4.05 1.00
llama 7B BF16 pp2 8.03 8.01 1.00
llama 7B BF16 pp3 11.85 11.93 1.01
llama 7B BF16 pp4 15.41 15.77 1.02
llama 7B BF16 pp5 18.63 19.70 1.06
llama 7B BF16 pp6 21.46 23.47 1.09
llama 7B BF16 pp7 23.87 27.15 1.14
llama 7B BF16 pp8 26.02 30.93 1.19
llama 7B BF16 pp9 27.46 34.42 1.25
llama 7B BF16 pp10 28.66 38.16 1.33
llama 7B BF16 pp11 29.45 41.58 1.41
llama 7B BF16 pp12 30.16 44.81 1.49
llama 7B BF16 pp13 30.69 47.66 1.55
llama 7B BF16 pp14 31.27 50.85 1.63
llama 7B BF16 pp15 31.66 53.77 1.70
llama 7B BF16 pp16 32.04 56.74 1.77
llama 7B BF16 pp30 33.27 68.29 2.05
llama 7B BF16 pp31 30.95 73.07 2.36
llama 7B BF16 pp32 31.91 74.05 2.32
llama 7B BF16 pp33 30.81 75.33 2.45
llama 7B BF16 pp34 30.98 75.20 2.43
llama 7B BF16 pp35 31.13 77.35 2.48
llama 7B BF16 pp60 32.62 77.80 2.38
llama 7B BF16 pp61 32.70 75.81 2.32
llama 7B BF16 pp62 32.78 75.54 2.30
llama 7B BF16 pp63 32.83 76.46 2.33
llama 7B BF16 pp64 32.90 77.16 2.35
llama 7B BF16 pp65 31.68 76.93 2.43
llama 7B BF16 pp127 35.38 83.55 2.36
llama 7B BF16 pp128 35.02 74.07 2.12
llama 7B BF16 pp129 33.97 73.78 2.17
llama 7B BF16 pp130 34.04 73.64 2.16
llama 7B BF16 pp131 34.09 73.67 2.16
llama 7B BF16 pp132 34.13 73.72 2.16
llama 7B BF16 pp255 35.70 86.47 2.42
llama 7B BF16 pp256 35.73 87.25 2.44
llama 7B BF16 pp257 34.33 87.09 2.54
llama 7B BF16 pp258 34.41 87.25 2.54
llama 7B BF16 pp259 34.51 86.82 2.52
llama 7B BF16 pp260 34.60 86.93 2.51
llama 7B BF16 pp510 32.46 85.43 2.63
llama 7B BF16 pp511 32.47 85.47 2.63
llama 7B BF16 pp512 32.51 85.47 2.63
llama 7B BF16 pp513 31.65 85.64 2.71
llama 7B BF16 pp514 31.49 85.62 2.72
llama 7B BF16 pp515 31.75 85.63 2.70
llama 7B BF16 pp1023 31.20 69.76 2.24
llama 7B BF16 pp1024 31.25 69.64 2.23
llama 7B BF16 pp1025 30.55 70.17 2.30
llama 7B BF16 pp1026 30.61 69.97 2.29
llama 7B BF16 pp1027 30.60 69.98 2.29
llama 7B BF16 pp1028 30.68 69.88 2.28
llama 7B BF16 pp2048 29.64 64.58 2.18
llama 7B BF16 tg128 4.07 4.07 1.00
Model Test t/s master t/s perfo/tinyblas Speedup
llama 13B F16 pp1 2.47 2.48 1.00
llama 13B F16 pp2 4.90 4.86 0.99
llama 13B F16 pp3 7.27 7.21 0.99
llama 13B F16 pp4 9.69 9.63 0.99
llama 13B F16 pp5 12.01 11.94 0.99
llama 13B F16 pp6 7.79 14.23 1.83
llama 13B F16 pp7 9.03 16.41 1.82
llama 13B F16 pp8 10.32 18.58 1.80
llama 13B F16 pp9 11.52 20.50 1.78
llama 13B F16 pp10 22.45 22.34 1.00
llama 13B F16 pp11 13.82 24.07 1.74
llama 13B F16 pp12 14.94 25.39 1.70
llama 13B F16 pp13 16.22 25.89 1.60
llama 13B F16 pp14 17.37 26.51 1.53
llama 13B F16 pp15 28.39 26.82 0.94
llama 13B F16 pp16 18.73 27.45 1.47
llama 13B F16 pp30 31.39 30.91 0.98
llama 13B F16 pp31 24.75 30.41 1.23
llama 13B F16 pp32 25.41 30.70 1.21
llama 13B F16 pp33 25.97 30.76 1.18
llama 13B F16 pp34 26.37 30.95 1.17
llama 13B F16 pp35 31.53 31.17 0.99
llama 13B F16 pp60 31.42 32.06 1.02
llama 13B F16 pp61 26.56 31.79 1.20
llama 13B F16 pp62 26.87 31.76 1.18
llama 13B F16 pp63 27.19 31.72 1.17
llama 13B F16 pp64 25.92 31.28 1.21
llama 13B F16 pp65 29.19 32.07 1.10
llama 13B F16 pp127 27.97 31.86 1.14
llama 13B F16 pp128 28.12 32.41 1.15
llama 13B F16 pp129 29.07 32.59 1.12
llama 13B F16 pp130 31.41 32.36 1.03
llama 13B F16 pp131 28.84 31.41 1.09
llama 13B F16 pp132 28.93 32.42 1.12
llama 13B F16 pp255 30.65 32.66 1.07
llama 13B F16 pp256 29.24 32.66 1.12
llama 13B F16 pp257 28.49 32.60 1.14
llama 13B F16 pp258 28.40 32.60 1.15
llama 13B F16 pp259 28.37 32.56 1.15
llama 13B F16 pp260 28.85 32.60 1.13
llama 13B F16 pp510 23.79 32.06 1.35
llama 13B F16 pp511 23.37 32.15 1.38
llama 13B F16 pp512 23.30 32.13 1.38
llama 13B F16 pp513 23.65 31.86 1.35
llama 13B F16 pp514 23.40 32.12 1.37
llama 13B F16 pp515 23.58 32.14 1.36
llama 13B F16 pp1023 16.67 30.88 1.85
llama 13B F16 pp1024 16.46 30.87 1.88
llama 13B F16 pp1025 16.83 30.78 1.83
llama 13B F16 pp1026 16.55 30.81 1.86
llama 13B F16 pp1027 16.51 30.83 1.87
llama 13B F16 pp1028 16.77 30.83 1.84
llama 13B F16 pp2048 14.40 29.17 2.03
llama 13B F16 tg128 2.48 2.48 1.00
llama 7B F16 pp1 4.03 4.01 0.99
llama 7B F16 pp2 7.96 7.82 0.98
llama 7B F16 pp3 11.82 11.63 0.98
llama 7B F16 pp4 15.73 15.44 0.98
llama 7B F16 pp5 19.48 19.25 0.99
llama 7B F16 pp6 12.65 22.85 1.81
llama 7B F16 pp7 14.67 26.31 1.79
llama 7B F16 pp8 16.70 29.88 1.79
llama 7B F16 pp9 18.61 32.89 1.77
llama 7B F16 pp10 36.35 35.95 0.99
llama 7B F16 pp11 22.44 38.63 1.72
llama 7B F16 pp12 24.33 40.96 1.68
llama 7B F16 pp13 26.21 41.44 1.58
llama 7B F16 pp14 27.96 42.89 1.53
llama 7B F16 pp15 46.01 43.46 0.94
llama 7B F16 pp16 30.31 44.67 1.47
llama 7B F16 pp30 49.97 48.86 0.98
llama 7B F16 pp31 39.56 48.18 1.22
llama 7B F16 pp32 40.60 48.55 1.20
llama 7B F16 pp33 41.30 48.53 1.18
llama 7B F16 pp34 42.17 49.00 1.16
llama 7B F16 pp35 50.29 49.02 0.97
llama 7B F16 pp60 50.50 50.52 1.00
llama 7B F16 pp61 42.49 50.14 1.18
llama 7B F16 pp62 43.11 50.26 1.17
llama 7B F16 pp63 43.70 50.44 1.15
llama 7B F16 pp64 44.22 50.64 1.15
llama 7B F16 pp65 50.40 50.32 1.00
llama 7B F16 pp127 46.20 50.67 1.10
llama 7B F16 pp128 46.47 48.38 1.04
llama 7B F16 pp129 46.58 48.64 1.04
llama 7B F16 pp130 49.76 49.00 0.98
llama 7B F16 pp131 45.91 49.13 1.07
llama 7B F16 pp132 46.23 49.10 1.06
llama 7B F16 pp255 46.41 50.86 1.10
llama 7B F16 pp256 45.00 50.82 1.13
llama 7B F16 pp257 44.22 50.70 1.15
llama 7B F16 pp258 44.25 50.72 1.15
llama 7B F16 pp259 43.94 50.64 1.15
llama 7B F16 pp260 45.03 50.68 1.13
llama 7B F16 pp510 37.94 50.18 1.32
llama 7B F16 pp511 37.36 50.13 1.34
llama 7B F16 pp512 37.27 50.13 1.35
llama 7B F16 pp513 37.42 49.84 1.33
llama 7B F16 pp514 37.26 50.24 1.35
llama 7B F16 pp515 38.03 50.25 1.32
llama 7B F16 pp1023 29.16 45.71 1.57
llama 7B F16 pp1024 28.57 45.47 1.59
llama 7B F16 pp1025 28.08 45.52 1.62
llama 7B F16 pp1026 27.90 45.59 1.63
llama 7B F16 pp1027 28.40 45.51 1.60
llama 7B F16 pp1028 29.58 45.46 1.54
llama 7B F16 pp2048 21.20 42.23 1.99
llama 7B F16 tg128 4.02 4.03 1.00
  • AMD Ryzen 9 5950X
Model Test t/s master t/s perfo/tinyblas Speedup
llama 13B BF16 pp1 2.13 2.13 1.00
llama 13B BF16 pp2 4.23 4.15 0.98
llama 13B BF16 pp3 6.30 6.21 0.99
llama 13B BF16 pp4 8.27 8.29 1.00
llama 13B BF16 pp5 10.14 10.33 1.02
llama 13B BF16 pp6 11.87 12.37 1.04
llama 13B BF16 pp7 13.43 14.45 1.08
llama 13B BF16 pp8 14.75 16.49 1.12
llama 13B BF16 pp9 15.77 18.52 1.17
llama 13B BF16 pp10 16.59 20.51 1.24
llama 13B BF16 pp11 17.26 22.56 1.31
llama 13B BF16 pp12 17.88 24.53 1.37
llama 13B BF16 pp13 18.35 26.66 1.45
llama 13B BF16 pp14 18.79 28.53 1.52
llama 13B BF16 pp15 19.12 30.48 1.59
llama 13B BF16 pp16 19.49 32.59 1.67
llama 13B BF16 pp30 20.52 46.80 2.28
llama 13B BF16 pp32 20.74 46.93 2.26
llama 13B BF16 pp64 21.39 49.16 2.30
llama 13B BF16 pp65 20.72 49.68 2.40
llama 13B BF16 pp66 20.78 49.94 2.40
llama 13B BF16 pp127 21.70 50.44 2.32
llama 13B BF16 pp128 21.66 50.56 2.33
llama 13B BF16 pp129 21.34 50.60 2.37
llama 13B BF16 pp255 21.87 50.76 2.32
llama 13B BF16 pp256 21.91 50.62 2.31
llama 13B BF16 pp257 21.40 50.55 2.36
llama 13B BF16 pp510 21.83 50.18 2.30
llama 13B BF16 pp511 21.90 50.07 2.29
llama 13B BF16 pp512 21.92 50.12 2.29
llama 13B BF16 pp513 21.66 50.19 2.32
llama 13B BF16 pp1023 21.33 48.41 2.27
llama 13B BF16 pp1024 21.40 48.46 2.26
llama 13B BF16 pp1025 21.25 48.37 2.28
llama 13B BF16 pp2048 20.46 45.49 2.22
llama 13B BF16 tg128 2.13 2.13 1.00
llama 7B BF16 pp1 3.48 3.48 1.00
llama 7B BF16 pp2 6.87 6.74 0.98
llama 7B BF16 pp3 10.17 10.09 0.99
llama 7B BF16 pp4 13.38 13.45 1.01
llama 7B BF16 pp5 16.39 16.77 1.02
llama 7B BF16 pp6 19.07 20.10 1.05
llama 7B BF16 pp7 21.51 23.44 1.09
llama 7B BF16 pp8 23.70 26.76 1.13
llama 7B BF16 pp9 25.33 30.11 1.19
llama 7B BF16 pp10 26.74 33.45 1.25
llama 7B BF16 pp11 27.85 36.79 1.32
llama 7B BF16 pp12 28.54 39.95 1.40
llama 7B BF16 pp13 29.35 43.38 1.48
llama 7B BF16 pp14 29.98 46.44 1.55
llama 7B BF16 pp15 30.44 49.54 1.63
llama 7B BF16 pp16 30.88 52.88 1.71
llama 7B BF16 pp30 32.19 74.83 2.32
llama 7B BF16 pp32 32.63 74.31 2.28
llama 7B BF16 pp64 33.74 78.01 2.31
llama 7B BF16 pp65 32.78 78.36 2.39
llama 7B BF16 pp66 32.86 78.76 2.40
llama 7B BF16 pp127 34.04 79.20 2.33
llama 7B BF16 pp128 33.99 79.43 2.34
llama 7B BF16 pp129 33.53 79.34 2.37
llama 7B BF16 pp255 34.28 79.33 2.31
llama 7B BF16 pp256 34.35 79.11 2.30
llama 7B BF16 pp257 33.62 78.86 2.35
llama 7B BF16 pp510 34.10 78.20 2.29
llama 7B BF16 pp511 34.18 78.01 2.28
llama 7B BF16 pp512 34.20 77.96 2.28
llama 7B BF16 pp513 33.85 77.83 2.30
llama 7B BF16 pp1023 33.14 74.47 2.25
llama 7B BF16 pp1024 33.09 74.55 2.25
llama 7B BF16 pp1025 32.89 74.52 2.27
llama 7B BF16 pp2048 31.33 69.52 2.22
llama 7B BF16 tg128 3.48 3.48 1.00
Model Test t/s master t/s perfo/tinyblas Speedup
llama 13B F16 pp1 2.12 2.12 1.00
llama 13B F16 pp2 4.14 4.12 1.00
llama 13B F16 pp3 6.23 6.19 0.99
llama 13B F16 pp4 4.80 8.23 1.71
llama 13B F16 pp5 5.93 10.28 1.73
llama 13B F16 pp6 12.43 12.30 0.99
llama 13B F16 pp7 8.38 14.34 1.71
llama 13B F16 pp8 9.47 16.39 1.73
llama 13B F16 pp9 18.64 18.39 0.99
llama 13B F16 pp10 12.13 20.43 1.69
llama 13B F16 pp11 13.25 22.45 1.69
llama 13B F16 pp12 24.73 24.41 0.99
llama 13B F16 pp13 15.87 26.46 1.67
llama 13B F16 pp14 16.95 28.43 1.68
llama 13B F16 pp15 30.41 30.38 1.00
llama 13B F16 pp16 19.48 32.39 1.66
llama 13B F16 pp30 36.04 57.18 1.59
llama 13B F16 pp32 27.90 58.94 2.11
llama 13B F16 pp64 33.24 72.24 2.17
llama 13B F16 pp65 33.64 72.34 2.15
llama 13B F16 pp66 37.38 72.77 1.95
llama 13B F16 pp127 36.07 74.40 2.06
llama 13B F16 pp128 36.18 74.69 2.06
llama 13B F16 pp129 37.63 74.48 1.98
llama 13B F16 pp255 37.67 75.16 2.00
llama 13B F16 pp256 36.81 74.99 2.04
llama 13B F16 pp257 36.91 74.86 2.03
llama 13B F16 pp510 36.38 74.30 2.04
llama 13B F16 pp511 35.98 74.15 2.06
llama 13B F16 pp512 36.04 74.07 2.05
llama 13B F16 pp513 36.37 73.95 2.03
llama 13B F16 pp1023 33.32 70.80 2.12
llama 13B F16 pp1024 33.02 70.94 2.15
llama 13B F16 pp1025 32.96 70.97 2.15
llama 13B F16 pp2048 28.15 65.91 2.34
llama 13B F16 tg128 2.12 2.12 1.00
llama 7B F16 pp1 3.42 3.46 1.01
llama 7B F16 pp2 6.66 6.69 1.00
llama 7B F16 pp3 9.99 10.05 1.01
llama 7B F16 pp4 8.00 13.33 1.67
llama 7B F16 pp5 9.86 16.66 1.69
llama 7B F16 pp6 19.72 19.97 1.01
llama 7B F16 pp7 13.79 23.26 1.69
llama 7B F16 pp8 15.62 26.52 1.70
llama 7B F16 pp9 28.72 29.79 1.04
llama 7B F16 pp10 19.63 33.07 1.68
llama 7B F16 pp11 21.45 36.34 1.69
llama 7B F16 pp12 36.86 39.47 1.07
llama 7B F16 pp13 25.34 42.86 1.69
llama 7B F16 pp14 27.18 45.88 1.69
llama 7B F16 pp15 46.50 49.01 1.05
llama 7B F16 pp16 30.97 52.40 1.69
llama 7B F16 pp30 57.24 91.35 1.60
llama 7B F16 pp32 45.19 95.17 2.11
llama 7B F16 pp64 52.39 113.43 2.16
llama 7B F16 pp65 52.47 113.70 2.17
llama 7B F16 pp66 57.30 114.33 2.00
llama 7B F16 pp127 56.24 117.03 2.08
llama 7B F16 pp128 57.08 117.17 2.05
llama 7B F16 pp129 59.33 116.79 1.97
llama 7B F16 pp255 59.08 118.12 2.00
llama 7B F16 pp256 57.56 117.96 2.05
llama 7B F16 pp257 57.57 117.57 2.04
llama 7B F16 pp510 56.74 114.89 2.02
llama 7B F16 pp511 56.17 114.73 2.04
llama 7B F16 pp512 56.04 114.67 2.05
llama 7B F16 pp513 56.68 114.29 2.02
llama 7B F16 pp1023 52.08 108.43 2.08
llama 7B F16 pp1024 51.50 108.26 2.10
llama 7B F16 pp1025 51.55 108.39 2.10
llama 7B F16 pp2048 45.31 99.35 2.19
llama 7B F16 tg128 3.46 3.46 1.00

@Djip007
Contributor Author

Djip007 commented Dec 24, 2024

Perplexity looks good.

./build/bin/./llama-perplexity -ctk bf16 -ctv bf16 --kl-divergence-base ~/LLM/Mistral-Nemo-Instruct-2407.BF16.kld --kl-divergence -s 31337 -m ~/LLM/Mistral-Nemo-Instruct-2407.BF16.gguf
chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1       3.9443 ±    0.5267       0.00048 ±    0.00053       0.00004 ±    0.00001     0.169 ±  0.019 %    99.608 ±  0.392 %
   2       5.4419 ±    0.6039       0.00133 ±    0.00153       0.00005 ±    0.00001     0.167 ±  0.012 %    99.412 ±  0.339 %
   3       4.6835 ±    0.4027       0.00066 ±    0.00105       0.00006 ±    0.00001     0.236 ±  0.021 %    99.608 ±  0.226 %
   4       5.0057 ±    0.3672       0.00051 ±    0.00080       0.00005 ±    0.00000     0.231 ±  0.017 %    99.608 ±  0.196 %
   5       5.2931 ±    0.3434       0.00030 ±    0.00065       0.00005 ±    0.00000     0.220 ±  0.014 %    99.686 ±  0.157 %
   6       5.8307 ±    0.3543       0.00030 ±    0.00055       0.00005 ±    0.00000     0.216 ±  0.012 %    99.739 ±  0.131 %
   7       6.2255 ±    0.3544       0.00047 ±    0.00052       0.00005 ±    0.00000     0.210 ±  0.011 %    99.664 ±  0.137 %
   8       6.4316 ±    0.3454       0.00047 ±    0.00046       0.00005 ±    0.00000     0.218 ±  0.010 %    99.657 ±  0.130 %
   9       6.8874 ±    0.3580       0.00050 ±    0.00041       0.00005 ±    0.00000     0.213 ±  0.010 %    99.608 ±  0.130 %
  10       7.2365 ±    0.3589       0.00030 ±    0.00038       0.00005 ±    0.00000     0.209 ±  0.009 %    99.569 ±  0.130 %
./build/bin/./llama-perplexity                     --kl-divergence-base ~/LLM/Mistral-Nemo-Instruct-2407.BF16.kld --kl-divergence -s 31337 -m ~/LLM/Mistral-Nemo-Instruct-2407.F16.gguf
chunk             PPL               ln(PPL(Q)/PPL(base))          KL Divergence              Δp RMS            Same top p
   1       3.9432 ±    0.5262       0.00021 ±    0.00007       0.00000 ±    0.00000     0.023 ±  0.003 %    100.000 ±  0.000 %
   2       5.4435 ±    0.6041       0.00163 ±    0.00150       0.00000 ±    0.00000     0.025 ±  0.002 %    100.000 ±  0.000 %
   3       4.6856 ±    0.4029       0.00111 ±    0.00100       0.00000 ±    0.00000     0.030 ±  0.002 %    100.000 ±  0.000 %
   4       5.0072 ±    0.3674       0.00081 ±    0.00075       0.00000 ±    0.00000     0.029 ±  0.002 %    100.000 ±  0.000 %
   5       5.2951 ±    0.3437       0.00067 ±    0.00060       0.00000 ±    0.00000     0.030 ±  0.002 %    100.000 ±  0.000 %
   6       5.8323 ±    0.3545       0.00057 ±    0.00050       0.00000 ±    0.00000     0.029 ±  0.002 %    100.000 ±  0.000 %
   7       6.2269 ±    0.3546       0.00069 ±    0.00047       0.00000 ±    0.00000     0.028 ±  0.001 %    100.000 ±  0.000 %
   8       6.4324 ±    0.3455       0.00059 ±    0.00041       0.00000 ±    0.00000     0.028 ±  0.001 %    100.000 ±  0.000 %
   9       6.8876 ±    0.3581       0.00053 ±    0.00036       0.00000 ±    0.00000     0.027 ±  0.001 %    100.000 ±  0.000 %
  10       7.2379 ±    0.3591       0.00049 ±    0.00033       0.00000 ±    0.00000     0.027 ±  0.001 %    99.961 ±  0.039 %

@Djip007 Djip007 marked this pull request as ready for review December 24, 2024 14:40
@Djip007 Djip007 requested a review from ngxson as a code owner December 24, 2024 14:40
@slaren slaren merged commit 2cd43f4 into ggml-org:master Dec 24, 2024
50 checks passed
@Djip007
Contributor Author

Djip007 commented Dec 24, 2024

Thanks!

tinglou pushed a commit to tinglou/llama.cpp that referenced this pull request Feb 13, 2025
* more perfo with llamafile tinyblas on x86_64.

- add bf16 support
- change dispatch strategy (thanks: ikawrakow/ik_llama.cpp#71)
- reduce memory bandwidth

simple tinyblas dispatch and more cache friendly

* tinyblas dynamic dispatching

* sgemm: add M blocks.

* - git 2.47 uses short ids of length 9.
- show-progress is not part of GNU Wget2

* remove unstable test
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Feb 26, 2025 (same commit message as above)

mglambda pushed a commit to mglambda/llama.cpp that referenced this pull request Mar 8, 2025 (same commit message as above)