
server: Ensure batches are either all embed or all completion (#8076) #8420

Merged
2 commits merged into ggml-org:master on Jul 12, 2024

Conversation

@iamlemec (Collaborator) commented Jul 10, 2024

fix #8076

This ensures that batches are never mixed between embedding and completion tasks. It also adds a check in completion handlers ensuring the server was not run in embedding mode (otherwise it would crash for embedding-only models). Thus to allow for completions and embeddings on the same server, one must run it without the --embeddings flag.
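As a rough sketch of the second point, a guard like the following could sit at the top of the completion handlers; the helper names (ctx_server.params.embedding, res_error, format_error_response, ERROR_TYPE_NOT_SUPPORTED) are assumptions about the surrounding server code, not a quote of this PR:

// sketch (assumed helper names): refuse completion requests when the server
// was started with --embeddings, since an embedding-only model has no logits
if (ctx_server.params.embedding) {
    res_error(res, format_error_response(
        "This server does not support completions. Start it without `--embeddings`",
        ERROR_TYPE_NOT_SUPPORTED));
    return;
}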

Currently, this does not work well with n_parallel > 1 since we are changing the embedding flag on a shared context, which will lead to missing logits and a failed assertion in many cases. Not sure if that's a deal-breaker.

@iamlemec changed the title from "server: Ensure batches are either all embed or all completion" to "server: Ensure batches are either all embed or all completion (#8076)" on Jul 10, 2024
Comment on lines +2180 to +2186
// check that we are in the right batch_type, if not defer the slot
bool slot_type = slot.embedding ? 1 : 0;
if (batch_type == -1) {
batch_type = slot_type;
} else if (batch_type != slot_type) {
continue;
}
Collaborator

Should this be done earlier? There are some things, like slot.n_past, which should probably not be changed when the batch_type is not the right one.

I feel like this check should be done at least before tokens are put in the batch, but the multitude of loops over slots does make it hard to find one right spot.

Collaborator Author

Indeed. But we also want to make sure we don't assign the batch to a particular type but then bail out later on (say, because the prompt is bigger than the batch size).

My current read is that everything before where the check is currently is tokenization, and it's fine to do that, bail, and then pick up the slot on the next go around. That's why it's okay to have the conditional continue just above it (2173-2178) too.

Collaborator

> everything before where the check is currently is tokenization and it's fine to do that, bail, and then pick up the slot on the next go around.

Right, and slot.n_past is recalculated anyway, so you're right.

But I think if batch.n_tokens is non-zero, the initial batch_type should be non-embedding, or else this could potentially attempt continuous batching with embeddings and previous text completions.
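A minimal sketch of that initialization, reusing the 0 / -1 convention from the snippet above (an illustration, not a quote of the actual fix):

// sketch: if the batch already holds sampled completion tokens, start the
// batch off as a completion (non-embedding) batch instead of leaving it unset
int32_t batch_type = batch.n_tokens > 0 ? 0 : -1;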

Collaborator Author

Ah I see, just pushed the fix!

@ggerganov (Member)

> Currently, this does not work well with n_parallel > 1 since we are changing the embedding flag on a shared context, which will lead to missing logits and a failed assertion in many cases

I don't see why it wouldn't work. n_parallel > 1 means more slots and this new logic with the batch type now correctly handles multiple slots. Can you provide a fail case?

@iamlemec (Collaborator, Author)

> I don't see why it wouldn't work. n_parallel > 1 means more slots and this new logic with the batch type now correctly handles multiple slots. Can you provide a fail case?

You know, it was crashing before when I tried requesting a completion and an embedding at the same time (with -np 2), but now it's running fine. It's possible the initial batch_type fix discussed above is what resolved it.

@ggerganov merged commit c3ebcfa into ggml-org:master on Jul 12, 2024
52 of 53 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 13, 2024
…org#8420)

* make sure batches are all embed or all non-embed

* non-embedding batch for sampled tokens; fix unused params warning
@okigan (Contributor) commented Jul 20, 2024

With this change, it appears that either the completion endpoint or the embeddings endpoint will work, but not both simultaneously.

Just to clarify, did both endpoints ever work simultaneously before?

@iamlemec (Collaborator, Author)

@okigan If you're using a generative (non-embedding) model, you can run it as usual, and it will be able to serve both types of endpoints. In that case, you'll probably want to provide it with a --pooling option to tell it how to make sequence embeddings. If you're using an embedding-only model, you need to pass --embeddings to the server when you launch it, and it will only be able to provide embeddings (by design).

@okigan (Contributor) commented Jul 21, 2024

@iamlemec So it seems like the "--embeddings" flag should only be used with embedding-only models, right? How would one identify whether a model file is embeddings-only? And if an embeddings-only model is passed, would llama-server require the flag?

I also made a script to test this with a Mistral model (the third response fails with 501):

./test_llama_server.sh
Model already exists. Skipping download.
Running server WITHOUT --embeddings flag...
Waiting for server to start...
Server is ready.
Testing chat completion...
> POST /v1/chat/completions HTTP/1.1
< HTTP/1.1 200 OK


Testing embeddings...
> POST /v1/embeddings HTTP/1.1
< HTTP/1.1 200 OK

#################

Running server with --embeddings flag...
Waiting for server to start...
Server is ready.
Testing chat completion...
> POST /v1/chat/completions HTTP/1.1
< HTTP/1.1 501 Not Implemented                 <<<<<---- here


Testing embeddings...
> POST /v1/embeddings HTTP/1.1
< HTTP/1.1 200 OK

So, if the above is expected, I guess the docs are confusing for the --embeddings flag. Would you agree?

@iamlemec (Collaborator, Author)

@okigan Yeah, --embeddings would make most sense for embedding-only models, or perhaps if you want to ensure that completions are not served.

Funnily enough, despite the many configuration flags seen in the HF world, there isn't one that cleanly identifies whether a model is an embedding model or not. Usually embedding models are clearly identified as such in the README. If you want to do it programmatically, the best way right now is to look at the hparams.causal_attn flag. There are probably some exceptions, but embedding models typically have causal_attn = False while generative models have causal_attn = True. I think there was talk at some point of auto-detecting based on this heuristic, but it hasn't been done.
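As a hedged illustration (not part of this PR), this is one way that check could be done from a GGUF file using ggml's gguf API; the metadata key name "<arch>.attention.causal" and the default of causal attention when the key is absent are assumptions:

#include <cstdio>
#include <string>
#include "gguf.h"

// sketch: guess whether a GGUF model is embedding-only by reading the
// "<arch>.attention.causal" key (assumed name; a missing key is treated as causal)
int main(int argc, char ** argv) {
    if (argc < 2) { return 1; }

    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ nullptr };
    struct gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (!ctx) { return 1; }

    // architecture name, e.g. "bert" for many embedding models, "llama" for generative ones
    const auto arch_id = gguf_find_key(ctx, "general.architecture");
    const std::string arch = arch_id >= 0 ? gguf_get_val_str(ctx, arch_id) : "";

    const auto key_id = gguf_find_key(ctx, (arch + ".attention.causal").c_str());
    const bool causal = key_id < 0 ? true : gguf_get_val_bool(ctx, key_id);

    printf("%s: %s\n", argv[1], causal ? "likely generative" : "likely embedding-only");
    gguf_free(ctx);
    return 0;
}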

The main reason this slightly awkward setup is needed is that if you try to get a completion from an embedding model it'll actually crash. Basically, it looks for logits in the output, doesn't find them, and gives up. So you want to protect embedding models from crashing themselves if you ask for a completion.

I agree, the docs are not really accurate anymore (at least what's written in the command line help). It should probably now say something like "embedding only mode" rather than "enable embedding endpoint".

@okigan (Contributor) commented Jul 30, 2024

@iamlemec @ggerganov To try to clarify this, I created #8763; please review and comment there. If we have a clear path forward, I can propose a PR.

@a-h mentioned this pull request Jul 30, 2024

Successfully merging this pull request may close these issues.

Bug: llama-server crashes when started with --embeddings
4 participants