bugfix: Respect n_predict=-2 in server (#12264) #12323
Conversation
Could you add a small test case for it?

@ngxson Done!

I think this may break another test case, could you check it @ishaangandhi?
Force-pushed from 8826cf5 to 8511ec5
@ngxson Got it. I rebased and allowed `n_predict` to be infinity for that test. The issue was that the inferred text matched on the first token, and the server was only producing 1 token. They should now pass. Can you re-run the CI tests?

Restarted the tests - let's see how it goes.
```diff
@@ -143,13 +143,15 @@ def test_consistent_result_same_seed(n_slots: int):
 def test_different_result_different_seed(n_slots: int):
     global server
     server.n_slots = n_slots
+    server.n_predict = -1
```
I don't get why it is needed here. The default value of `n_predict` is already `-1`. You should not change a test case that is unrelated to the current change; that's the whole idea of having tests: to make sure that your change does not break existing use cases.
There is a bit of a readability and maintainability issue in this file: the server object is global and, as a test fixture, is reused across tests, even though every test changes its parameters.
```diff
-    if (params.n_predict != -1) {
-        n_remaining = params.n_predict - n_decoded;
-    } else if (global_params.n_predict != -1) {
-        n_remaining = global_params.n_predict - n_decoded;
-    }
+    // The request or server have specified limits on the number of tokens to generate.
+    if ((params.n_predict >= 0) || (global_params.n_predict >= 0)) {
+        n_remaining = std::min(n_remaining, params.n_predict - n_decoded);
+    }
+
+    // The request or server have limits based on the context window.
+    if (params.n_predict == -2 || global_params.n_predict == -2) {
+        n_remaining = std::min(n_remaining, n_ctx - n_decoded);
+    }
```
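To make the budget logic under discussion easier to reason about, here is a hypothetical restatement of it as a standalone helper. The free-function form, the function name, and the `INT_MAX` "unbounded" sentinel are illustration choices of this sketch, not code from the PR; the `n_predict` conventions (`-1` = no limit, `-2` = until the context is filled, `>= 0` = explicit cap) follow the thread above.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>

// Hypothetical restatement of the PR's token-budget logic.
// n_predict: the request's limit; global_n_predict: the server default.
// -1 = no limit, -2 = generate until the context window is full,
// >= 0 = explicit cap on completion tokens.
static int remaining_tokens(int n_predict, int global_n_predict,
                            int n_ctx, int n_decoded) {
    int n_remaining = INT_MAX; // start unbounded

    // The request or server have specified limits on the number of tokens.
    if (n_predict >= 0 || global_n_predict >= 0) {
        n_remaining = std::min(n_remaining, n_predict - n_decoded);
    }

    // The request or server have limits based on the context window.
    if (n_predict == -2 || global_n_predict == -2) {
        n_remaining = std::min(n_remaining, n_ctx - n_decoded);
    }
    return n_remaining;
}
```

For example, with `n_predict == -2`, a 4096-token context, and 100 tokens already decoded, the remaining budget is 3996, i.e. generation stops only once the context is filled.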
Tbh this code has become a bit hard to understand now (and probably that's why the CI fails).

What I recommend here is: instead of having this double `if` check, it should be an `if..else` like before.
Indeed, my idea is much simpler: why risk modifying all of this when we can just set `n_predict = n_ctx - n_prompt` when `n_predict == -2`?
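The reviewer's alternative can be sketched as a one-time normalization at request setup. This is a hypothetical illustration only: the function name and signature are assumptions, not code from the repository.

```cpp
#include <cassert>

// Hypothetical sketch of the suggested approach: resolve the special
// value -2 once, before generation starts, so the stopping loop only
// ever sees -1 (no limit) or a concrete cap.
static int resolve_n_predict(int n_predict, int n_ctx, int n_prompt) {
    if (n_predict == -2) {
        return n_ctx - n_prompt; // generate until the context is filled
    }
    return n_predict; // -1 and explicit caps pass through unchanged
}
```

The appeal of this design is that the original `if..else` in the stopping path stays untouched; only the request-parsing side changes.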
superseded by #12364
This pull request fixes issue #12264: Eval bug: server API endpoint not respecting n_predict with -2 (until context filled).
Previously, if you set `n_predict` to -2, the server would ignore its special meaning and immediately stop producing tokens with `reason` as `length`.

After the change, we get this (correct) output:
We now indeed stop for `length`, but only after producing 4096 completion tokens.