more perfo with llamafile tinyblas on x86_64. #10714
Conversation
Force-pushed f7c5a68 to b1c72b9
Some perplexity results with the new code (vs master, BF16/Zen 3).
|
Looks good to me. |
Not sure, try merging the current |
Force-pushed b1c72b9 to d4a2a20
Looks like a small diff in the results.

```python
@pytest.mark.parametrize("n_slots", [1, 2])
def test_consistent_result_same_seed(n_slots: int):
    global server
    server.n_slots = n_slots
```

What is n_slots? I'll have to check a few things in my code tomorrow... |
I am not sure what's the effect of increasing the number of slots for this test. I suspect that this error might indicate there is a buffer overflow somewhere, and random data beyond the tensor buffer may be causing it to generate different sequences despite using the same seed. |
That's what I was thinking last night, but it was too late. I have a little idea, but I was too tired to check/correct it. |
The failing test seems to be using 2 slots. With 2 slots, the KV cache buffer is shared among the two generations. Initially, the buffer is empty:
Then the first request is processed by slot 0 and thus the beginning of the buffer is occupied:
The second request is processed on slot 1, so the old data remains in the buffer:
Because we compute the attention over the entire buffer, masking out the cross-sequence values, it is actually possible to get different results between the 2 generations. This happens due to summing floating-point values across the length of the KV buffer. In the next example, even though the data in the buffer is the same, it can lead to different numerical results during the
I'm thinking that maybe there isn't a bug in the implementation in this PR, and it's a side effect of the unified KV cache. Probably this test for |
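The order-sensitivity of floating-point sums described above is easy to demonstrate. A minimal Python sketch (illustrative only, not llama.cpp code): when a sequence sits at a different offset in a shared KV buffer, the same values can end up being added in a different grouping, and floating-point addition is not associative.

```python
# Floating-point addition is not associative: regrouping the same
# three values changes the result. The same effect across a long KV
# buffer can make two generations with identical seeds diverge.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a)        # 0.6000000000000001
print(b)        # 0.6
print(a == b)   # False
```

Over thousands of attention positions these tiny differences can flip a sampled token, after which the sequences diverge completely.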
Force-pushed d4a2a20 to 6c398db
@ggerganov On the other hand, going over my code step by step, there are a small number of cases (2 to ~5?) where I do too much computation and write the wrong value out of bounds (possibly overwriting correct data that I had just calculated...). So I corrected that. It remains to be seen whether the test passes. |
Well, that wasn't enough. I'm doing another pass on the perplexity to be sure with my last correction. |
Force-pushed 6c398db to 01ba9f5
I have been running random tests with test-backend-ops and I haven't seen any failure, so I am fairly confident that this is correct. Let's just disable the server test for 2 slots.
Not sure how to do it; like this?

```python
# replace
# @pytest.mark.parametrize("n_slots", [1, 2])
# with:
@pytest.mark.parametrize("n_slots", [1])
def test_consistent_result_same_seed(n_slots: int):
    global server
    server.n_slots = n_slots
    server.start()
    last_res = None
    for _ in range(4):
        res = server.make_request("POST", "/completion", data={
            "prompt": "I believe the meaning of life is",
            "seed": 42,
            "temperature": 1.0,
            "cache_prompt": False,  # TODO: remove this once test_cache_vs_nocache_prompt is fixed
        })
        if last_res is not None:
            assert res.body["content"] == last_res.body["content"]
        last_res = res
```
|
A different test is failing now. Add:

```diff
--- a/examples/server/tests/unit/test_completion.py
+++ b/examples/server/tests/unit/test_completion.py
@@ -116,6 +116,7 @@ def test_different_result_different_seed(n_slots: int):
 def test_consistent_result_different_batch_size(n_batch: int, temperature: float):
     global server
     server.n_batch = n_batch
+    server.n_slots = 1
     server.start()
     last_res = None
     for _ in range(4):
```
|
On my system (Intel 13900K) I see better performance with BF16, but worse with F16 in some cases:
With different numbers of threads:
|
Is it AVX512 or AVX2? |
Force-pushed b2dab60 to 30ae0d2
- add bf16 support
- change dispatch strategy (thanks: ikawrakow/ik_llama.cpp#71)
- reduce memory bandwidth: simpler tinyblas dispatch and more cache-friendly
Force-pushed 30ae0d2 to 94cd488
- `--show-progress` is not part of GNU Wget2
Force-pushed 94cd488 to 4bf8cd9
OK, the code looks good and I get good performance on the Ryzen 9 5950X and 7945HS. We need to "remove" the failing test in the "Server" check. |
Force-pushed 4bf8cd9 to 7b9119b
Some last benchmarks ("without" u-batch):
Do not compare directly with previous results; I changed some BIOS settings (PBO / max TDP...).
|
Perplexity looks good.

```
./build/bin/./llama-perplexity -ctk bf16 -ctv bf16 --kl-divergence-base ~/LLM/Mistral-Nemo-Instruct-2407.BF16.kld --kl-divergence -s 31337 -m ~/LLM/Mistral-Nemo-Instruct-2407.BF16.gguf

chunk  PPL              ln(PPL(Q)/PPL(base))  KL Divergence      Δp RMS           Same top p
1      3.9443 ± 0.5267  0.00048 ± 0.00053     0.00004 ± 0.00001  0.169 ± 0.019 %  99.608 ± 0.392 %
2      5.4419 ± 0.6039  0.00133 ± 0.00153     0.00005 ± 0.00001  0.167 ± 0.012 %  99.412 ± 0.339 %
3      4.6835 ± 0.4027  0.00066 ± 0.00105     0.00006 ± 0.00001  0.236 ± 0.021 %  99.608 ± 0.226 %
4      5.0057 ± 0.3672  0.00051 ± 0.00080     0.00005 ± 0.00000  0.231 ± 0.017 %  99.608 ± 0.196 %
5      5.2931 ± 0.3434  0.00030 ± 0.00065     0.00005 ± 0.00000  0.220 ± 0.014 %  99.686 ± 0.157 %
6      5.8307 ± 0.3543  0.00030 ± 0.00055     0.00005 ± 0.00000  0.216 ± 0.012 %  99.739 ± 0.131 %
7      6.2255 ± 0.3544  0.00047 ± 0.00052     0.00005 ± 0.00000  0.210 ± 0.011 %  99.664 ± 0.137 %
8      6.4316 ± 0.3454  0.00047 ± 0.00046     0.00005 ± 0.00000  0.218 ± 0.010 %  99.657 ± 0.130 %
9      6.8874 ± 0.3580  0.00050 ± 0.00041     0.00005 ± 0.00000  0.213 ± 0.010 %  99.608 ± 0.130 %
10     7.2365 ± 0.3589  0.00030 ± 0.00038     0.00005 ± 0.00000  0.209 ± 0.009 %  99.569 ± 0.130 %
```

```
./build/bin/./llama-perplexity --kl-divergence-base ~/LLM/Mistral-Nemo-Instruct-2407.BF16.kld --kl-divergence -s 31337 -m ~/LLM/Mistral-Nemo-Instruct-2407.F16.gguf

chunk  PPL              ln(PPL(Q)/PPL(base))  KL Divergence      Δp RMS           Same top p
1      3.9432 ± 0.5262  0.00021 ± 0.00007     0.00000 ± 0.00000  0.023 ± 0.003 %  100.000 ± 0.000 %
2      5.4435 ± 0.6041  0.00163 ± 0.00150     0.00000 ± 0.00000  0.025 ± 0.002 %  100.000 ± 0.000 %
3      4.6856 ± 0.4029  0.00111 ± 0.00100     0.00000 ± 0.00000  0.030 ± 0.002 %  100.000 ± 0.000 %
4      5.0072 ± 0.3674  0.00081 ± 0.00075     0.00000 ± 0.00000  0.029 ± 0.002 %  100.000 ± 0.000 %
5      5.2951 ± 0.3437  0.00067 ± 0.00060     0.00000 ± 0.00000  0.030 ± 0.002 %  100.000 ± 0.000 %
6      5.8323 ± 0.3545  0.00057 ± 0.00050     0.00000 ± 0.00000  0.029 ± 0.002 %  100.000 ± 0.000 %
7      6.2269 ± 0.3546  0.00069 ± 0.00047     0.00000 ± 0.00000  0.028 ± 0.001 %  100.000 ± 0.000 %
8      6.4324 ± 0.3455  0.00059 ± 0.00041     0.00000 ± 0.00000  0.028 ± 0.001 %  100.000 ± 0.000 %
9      6.8876 ± 0.3581  0.00053 ± 0.00036     0.00000 ± 0.00000  0.027 ± 0.001 %  100.000 ± 0.000 %
10     7.2379 ± 0.3591  0.00049 ± 0.00033     0.00000 ± 0.00000  0.027 ± 0.001 %  99.961 ± 0.039 %
```
|
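For reference, the KL divergence column above measures how much the tested model's next-token probability distribution diverges from the base model's. A minimal Python sketch of the quantity (illustrative only; not the llama-perplexity implementation):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for two discrete token probability distributions.

    p: probabilities from the base (reference) model
    q: probabilities from the model under test
    Both must sum to 1, with q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Identical distributions give 0; a slight shift gives a small positive value,
# like the ~0.00005 averages in the tables above.
p = [0.7, 0.2, 0.1]
q = [0.68, 0.21, 0.11]
print(kl_divergence(p, p))  # 0.0
print(kl_divergence(p, q))  # small positive number
```

A value near zero, as in both runs above, means the new kernels barely perturb the model's output distribution.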
Force-pushed 7b9119b to 2c6864a
Thanks! |
* more perfo with llamafile tinyblas on x86_64.
  - add bf16 support
  - change dispatch strategy (thanks: ikawrakow/ik_llama.cpp#71)
  - reduce memory bandwidth: simpler tinyblas dispatch and more cache-friendly
* tinyblas dynamic dispatching
* sgemm: add M blocks.
* git 2.47 uses short ids of length 9; `--show-progress` is not part of GNU Wget2
* remove unstable test
ikawrakow/ik_llama.cpp#71 has a good idea.
I figured out how to add it to the llamafile/tinyblas sgemm (and a little more), and it works great:
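The "add M blocs" idea in the commits is classic cache blocking of the GEMM loops. A minimal Python sketch of the concept (illustrative only; the real tinyblas kernels are templated C++ with SIMD register tiles, and the block sizes here are made up for the example):

```python
def gemm_blocked(A, B, M, N, K, MB=4):
    """C = A @ B, processing the M dimension in blocks of MB rows.

    A is M x K, B is K x N, as row-major nested lists. Working on a
    small block of rows of A at a time keeps those rows and their
    accumulators hot in cache while streaming over B, which reduces
    memory bandwidth, the motivation stated in the commit messages.
    """
    C = [[0.0] * N for _ in range(M)]
    for m0 in range(0, M, MB):              # block over rows of A / C
        for k in range(K):                  # stream over the shared dimension
            for m in range(m0, min(m0 + MB, M)):
                a = A[m][k]
                rowB = B[k]
                rowC = C[m]
                for n in range(N):
                    rowC[n] += a * rowB[n]  # accumulate one rank-1 slice
    return C

print(gemm_blocked([[1, 2], [3, 4]], [[5, 6], [7, 8]], 2, 2, 2))
# [[19.0, 22.0], [43.0, 50.0]]
```

The blocking only reorders the accumulation; the result matches a plain triple-loop matmul, which is also why reordering can shift the last bits of the floating-point sums.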