
server : fix speculative decoding with context shift #10641


Merged: 3 commits merged into master from gg/server-fix-spec-ctx-shift on Dec 4, 2024

Conversation

@ggerganov (Member) commented on Dec 3, 2024:

fix #10547

Make sure the speculative batch does not exceed the slot's context.
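The actual fix lives in the C++ server code; as a minimal sketch of the idea in Python, with illustrative names (none of these identifiers are from the PR itself), the clamp looks roughly like this:

def clamp_draft_size(n_ctx: int, n_past: int, n_draft_max: int) -> int:
    # positions still free in this slot's context window
    n_free = n_ctx - n_past
    # the speculative batch must fit into the remaining context,
    # otherwise evaluating the draft would overflow the slot
    return max(0, min(n_draft_max, n_free))

With the numbers discussed later in this thread (context of 64, 56 positions used, draft limit of 16), this yields a draft of 8 tokens instead of an overflowing batch of 16.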

@ngxson (Collaborator) commented on Dec 3, 2024:

Do you think we should add a test case for this? Something like:

# test_speculative.py

def test_with_ctx_shift():
    global server
    # use a small context so the draft batch can run out of room
    server.n_ctx = 64
    server.start()
    res = server.make_request("POST", "/completion", data={
        # a prompt long enough to fill most of the 64-token context
        "prompt": "Hello " * 64,
        # greedy sampling so the test is deterministic
        "temperature": 0.0,
        "top_k": 1,
    })
    assert res.status_code == 200
    assert len(res.body["content"]) > 0

@ggerganov (Member, Author) commented on Dec 4, 2024:

Yes, the error can be triggered with "prompt": "Hello " * 56 and with speculative.p_min = 0 set so that a full draft batch of 16 tokens is always generated. The server would then try to evaluate a speculative batch of 16 tokens when only 8 positions are left in the slot's context. With the changes in this PR, this no longer fails.
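To make the arithmetic concrete (a sketch using the numbers from the comment above, assuming the prompt tokenizes to roughly one token per repetition):

n_ctx = 64                 # slot context size from the proposed test
n_prompt = 56              # "Hello " * 56, about one token per repetition
n_draft = 16               # full draft batch, since speculative.p_min = 0
n_free = n_ctx - n_prompt  # 8 positions left in the slot
assert n_draft > n_free    # 16 > 8: the unclamped draft would overflow
# with the clamp from this PR the draft is cut to min(16, 8) = 8 tokens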

@ggerganov force-pushed the gg/server-fix-spec-ctx-shift branch from 05837cf to b436eda on Dec 4, 2024 at 09:00
@github-actions bot added the python (python script changes) label on Dec 4, 2024
@ggerganov requested a review from ngxson on Dec 4, 2024 at 11:12
@unclemusclez commented:

I've been running this PR for about an hour. It seems stable.

@josharian commented:
Works well for me. Thanks.

@ggerganov merged commit 1da7b76 into master on Dec 4, 2024 (47 checks passed)
tinglou pushed a commit to tinglou/llama.cpp that referenced this pull request on Dec 7, 2024, carrying the three commits from this PR:

* server : fix speculative decoding with context shift
* server : take into account speculative limits
* server : add tests

arthw pushed a commit to arthw/llama.cpp that referenced this pull request on Dec 20, 2024, with the same three commits.
Labels: examples, python (python script changes), server

Successfully merging this pull request may close this issue:

* Eval bug: issues with draft model and Cline+VSCode (#10547)