
bugfix: Respect n_predict=-2 in server (#12264) #12323

Closed

Conversation

ishaangandhi
Contributor

This pull request fixes issue #12264: Eval bug: server API endpoint not respecting n_predict with -2 (until context filled).

Previously, if you set n_predict to -2, the server ignored its special meaning ("generate until the context is filled") and immediately stopped producing tokens with finish_reason "length". For example, with this request:

curl --location 'http://localhost:8080/v1/chat/completions' \
    --header 'Content-Type: application/json' \
    --header 'Authorization: Bearer no-key' \
    --data '{
    "messages": [
        {
            "role": "user",
            "content": "Write a minimum 5,000 word essay (30+ pages) on the history of the United States, starting with the American Revolution."
        }
    ],
    "n_predict": -2
    }'

After the change, we get this (correct) output:

{
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "<think>\nAlright, I'm supposed to write a minimum 5,000-word essay on the history of the United States,...His leadership was a symbol of unity and cooperation, and he was also seen as a symbol of unity that changed the way the colonies behaved.\n\n"
      }
    }
  ],
  "created": 1741646088,
  "model": "gpt-3.5-turbo",
  "system_fingerprint": "b4869-2c9f833d",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 4096,
    "prompt_tokens": 34,
    "total_tokens": 4130
  },
  "id": "chatcmpl-LFedbjPLZLJyrIP4mc9ax0uuXLMaX4a9",
  "timings": {
    "prompt_n": 32,
    "prompt_ms": 165.535,
    "prompt_per_token_ms": 5.17296875,
    "prompt_per_second": 193.31259250309603,
    "predicted_n": 4096,
    "predicted_ms": 79655.512,
    "predicted_per_token_ms": 19.447146484375,
    "predicted_per_second": 51.42142580164446
  }
}

We still stop with finish_reason "length", but only after producing 4096 completion tokens.
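For context, the immediate stop in the old behavior follows from the previous budget computation (visible as the removed lines in the diff further down): with n_predict == -2, the != -1 branch yields a negative remaining budget. Below is a minimal, self-contained sketch of that arithmetic; it is not the server's actual code, and the assumption that a non-positive budget is treated as exhausted is mine, though it matches the observed immediate "length" stop.

#include <cstdio>

// Sketch only: mirrors the old branch shown in the removed lines of the diff
// below. Assumes a non-positive remaining budget means "stop generating".
int main() {
    int n_predict   = -2;  // request value whose special meaning was ignored
    int n_decoded   =  0;  // tokens generated so far

    int n_remaining = -1;  // -1 = unlimited
    if (n_predict != -1) {
        n_remaining = n_predict - n_decoded;  // -2 - 0 = -2
    }
    std::printf("n_remaining = %d\n", n_remaining); // negative, so generation stops at once
    return 0;
}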

@ngxson
Collaborator

ngxson commented Mar 11, 2025

Could you add a small test case for it? See server/tests/test_completion.py

@ishaangandhi
Contributor Author

@ngxson Done!

github-actions bot added the python (python script changes) label Mar 11, 2025
ngxson previously approved these changes Mar 12, 2025
@ngxson
Collaborator

ngxson commented Mar 12, 2025

I think this may break another test case, could you check it @ishaangandhi ?

ngxson dismissed their stale review March 12, 2025 10:26

CI didn't pass

@ishaangandhi
Contributor Author

@ngxson Got it. I rebased and allowed n_predict to be unlimited for that test.

The issue was that the generated texts matched on the first token, since the server was only producing one token.

They should now pass. Can you re-run the CI tests?

@ggerganov
Member

Restarted the tests - let's see how it goes.

@@ -143,13 +143,15 @@ def test_consistent_result_same_seed(n_slots: int):
 def test_different_result_different_seed(n_slots: int):
     global server
     server.n_slots = n_slots
+    server.n_predict = -1
Collaborator

I don't get why it is needed here. The default value of n_predict is already -1; you should not change a test case that is unrelated to the current change. That's the whole idea of having tests: to make sure that your change does not break existing use cases.

Contributor Author

There is a bit of a readability and maintainability issue in this file: the server object is global and, as a test fixture, is reused across tests, even though every test changes its parameters.

Comment on lines +1334 to +1341

+    // The request or server have specified limits on the number of tokens to generate.
+    if ((params.n_predict >= 0) || (global_params.n_predict >= 0)) {
+        n_remaining = std::min(n_remaining, params.n_predict - n_decoded);
+    }
+
-    if (params.n_predict != -1) {
-        n_remaining = params.n_predict - n_decoded;
-    } else if (global_params.n_predict != -1) {
-        n_remaining = global_params.n_predict - n_decoded;
+    // The request or server have limits based on the context window.
+    if (params.n_predict == -2 || global_params.n_predict == -2) {
+        n_remaining = std::min(n_remaining, n_ctx - n_decoded);
Collaborator

ngxson commented Mar 13, 2025

Tbh this code becomes a bit hard to understand now (and that's probably why the CI fails).

What I recommend here: instead of having this double if check, it should be an if..else like before.

Indeed, my idea is much simpler: why risk modifying all of this when we can just set n_predict = n_ctx - n_prompt when n_predict == -2?
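To make that suggestion concrete, here is a rough, self-contained sketch of the shape being described; the function name and parameter list are illustrative and do not come from server.cpp. The idea is to resolve the -2 sentinel once, after which the original single if..else is enough:

// Illustrative sketch only: resolves the "until context filled" sentinel (-2)
// up front, as suggested above, so the rest of the logic only ever sees -1
// (unlimited) or a concrete cap. Names mirror the discussion, not the server's code.
int remaining_budget(int n_predict_req, int n_predict_global,
                     int n_ctx, int n_prompt, int n_decoded) {
    if (n_predict_req    == -2) n_predict_req    = n_ctx - n_prompt;
    if (n_predict_global == -2) n_predict_global = n_ctx - n_prompt;

    if (n_predict_req != -1) {
        return n_predict_req - n_decoded;      // per-request cap
    } else if (n_predict_global != -1) {
        return n_predict_global - n_decoded;   // server default cap
    }
    return -1;                                 // unlimited
}

Keeping the special case in one place means the stopping check itself does not need to change.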

@ngxson
Collaborator

ngxson commented Mar 13, 2025

superseded by #12364

ngxson closed this Mar 13, 2025
Labels: examples, python (python script changes), server