
server : fix speculative decoding with context shift #10641


Merged: 3 commits merged into master from gg/server-fix-spec-ctx-shift on Dec 4, 2024

Conversation

@ggerganov (Member) commented on Dec 3, 2024:

fix #10547

Make sure the speculative batch does not exceed the slot's context.
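The actual fix lives in the C++ server code; as a minimal sketch of the idea in Python, with illustrative names (none of these identifiers are from the PR itself), the clamp looks roughly like this:

def clamp_draft_size(n_ctx: int, n_past: int, n_draft_max: int) -> int:
    # positions still free in this slot's context window
    n_free = n_ctx - n_past
    # the speculative batch must fit into the remaining context,
    # otherwise evaluating the draft would overflow the slot
    return max(0, min(n_draft_max, n_free))

With the numbers discussed later in this thread (context of 64, 56 positions used, draft limit of 16), this yields a draft of 8 tokens instead of an overflowing batch of 16.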

@ngxson (Collaborator) commented on Dec 3, 2024:

Do you think we should add a test case for this? Something like:

# test_speculative.py

def test_with_ctx_shift():
    global server
    # use a small context so the draft batch can run out of room
    server.n_ctx = 64
    server.start()
    res = server.make_request("POST", "/completion", data={
        # a prompt long enough to fill most of the 64-token context
        "prompt": "Hello " * 64,
        # greedy sampling so the test is deterministic
        "temperature": 0.0,
        "top_k": 1,
    })
    assert res.status_code == 200
    assert len(res.body["content"]) > 0

@ggerganov (Member, Author) commented on Dec 4, 2024:

Yes, the error can be triggered with "prompt": "Hello " * 56 and with speculative.p_min = 0 set so that a full draft batch of 16 tokens is always generated. The server would then try to evaluate a speculative batch of 16 tokens when only 8 positions are left in the slot's context. With the changes in this PR, this no longer fails.
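To make the arithmetic concrete (a sketch using the numbers from the comment above, assuming the prompt tokenizes to roughly one token per repetition):

n_ctx = 64                 # slot context size from the proposed test
n_prompt = 56              # "Hello " * 56, about one token per repetition
n_draft = 16               # full draft batch, since speculative.p_min = 0
n_free = n_ctx - n_prompt  # 8 positions left in the slot
assert n_draft > n_free    # 16 > 8: the unclamped draft would overflow
# with the clamp from this PR the draft is cut to min(16, 8) = 8 tokens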

@ggerganov force-pushed the gg/server-fix-spec-ctx-shift branch from 05837cf to b436eda on Dec 4, 2024 at 09:00
@github-actions bot added the python (python script changes) label on Dec 4, 2024
@ggerganov requested a review from ngxson on Dec 4, 2024 at 11:12
@unclemusclez commented:

I've been running this PR for about an hour. It seems stable.

@josharian commented:
Works well for me. Thanks.

@ggerganov merged commit 1da7b76 into master on Dec 4, 2024 (47 checks passed)
tinglou pushed a commit to tinglou/llama.cpp that referenced this pull request on Dec 7, 2024, carrying the three commits from this PR:

* server : fix speculative decoding with context shift
* server : take into account speculative limits
* server : add tests

arthw pushed a commit to arthw/llama.cpp that referenced this pull request on Dec 20, 2024, with the same three commits.
Labels: examples, python (python script changes), server

Successfully merging this pull request may close this issue:

* Eval bug: issues with draft model and Cline+VSCode (#10547)