server: Ensure batches are either all embed or all completion (#8076) #8420
Conversation
```cpp
// check that we are in the right batch_type, if not defer the slot
bool slot_type = slot.embedding ? 1 : 0;
if (batch_type == -1) {
    batch_type = slot_type;
} else if (batch_type != slot_type) {
    continue;
}
```
Should this be done earlier? There are some things like `slot.n_past` which should probably not be changed when the `batch_type` is not the right one.

I feel like this check should be done at least before tokens are put in the batch, but the multitude of loops over slots does make it hard to find one right spot.
Indeed. But we also want to make sure we don't assign the batch to a particular type but then bail out later on (say, because the prompt is bigger than the batch size).

My current read is that everything before where the check is currently is tokenization, and it's fine to do that, bail, and then pick up the slot on the next go-around. That's why it's okay to have the conditional `continue` just above it (2173-2178) too.
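To make that placement concrete, here is a rough sketch of the loop shape being described (simplified and with assumed names such as `server_slot` and `n_batch` in the style of examples/server/server.cpp; not the actual code):

```cpp
for (server_slot & slot : slots) {
    // idempotent prep: tokenize the prompt, recompute slot.n_past, etc.
    // nothing here touches the shared batch, so bailing out is safe

    // pre-existing deferral (the 2173-2178 case): prompt bigger than the batch
    if (slot.n_prompt_tokens > n_batch) {
        continue; // defer; the slot is picked up again on the next go-around
    }

    // the new check: defer slots whose type does not match the current batch
    bool slot_type = slot.embedding ? 1 : 0;
    if (batch_type == -1) {
        batch_type = slot_type;
    } else if (batch_type != slot_type) {
        continue;
    }

    // only after both checks are the slot's tokens pushed into the batch
}
```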
> everything before where the check is currently is tokenization and it's fine to do that, bail, and then pick up the slot on the next go around.

Right, and `slot.n_past` is recalculated anyway, so you're right.

But I think if `batch.n_tokens` is non-zero, the initial `batch_type` should be non-embedding, or else this could potentially attempt continuous batching with embeddings and previous text completions.
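A minimal sketch of that suggestion, using the names from the snippet above (`-1` marking an as-yet-undecided batch):

```cpp
// if sampled tokens from ongoing completions were already added to the batch,
// it must stay a non-embedding batch; otherwise leave the type undecided
int32_t batch_type = batch.n_tokens > 0 ? 0 : -1; // 0: non-embedding, 1: embedding
```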
Ah I see, just pushed the fix!
I don't see why it wouldn't work.

You know, it was crashing before when I tried requesting a completion and an embedding at the same time (with
…org#8420)

* make sure batches are all embed or all non-embed
* non-embedding batch for sampled tokens; fix unused params warning
With this change, it appears that either the completion endpoint or the embeddings endpoint will work, but not both simultaneously. Just to clarify, did both endpoints ever work simultaneously before?
@okigan If you're using a generative (non-embedding) model, you can run it as usual, and it will be able to serve both types of endpoints. In that case, you'll probably want to provide it with a
@iamlemec So it seems like the `--embeddings` flag should be used only with embedding-only models, right? How would one identify whether a model file is embeddings-only? And if an embeddings-only model is passed, would llama-server require the flag? I also made a script to test this with a Mistral model (the 3rd response fails with 501):
So, if the above is expected, I guess the docs are confusing for the `--embeddings` flag, would you agree?
@okigan Yeah. Funnily enough, despite the many configuration flags seen in the HF world, there isn't one that cleanly identifies whether it's an embedding model or not. Usually embedding models are clearly identified as such in the README. If you want to do it programmatically, the best way right now is to look at the

The main reason this slightly awkward setup is needed is that if you try to get a completion from an embedding model it'll actually crash. Basically, it looks for logits in the output, doesn't find them, and gives up. So you want to protect embedding models from crashing themselves if you ask for a completion.

I agree, the docs are not really accurate anymore (at least what's written in the command-line help). It should probably now say something like "embedding only mode" rather than "enable embedding endpoint".
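For illustration, a guard along these lines in the completion handler would produce the 501 seen above. This is a sketch only; helper names such as `ctx_server`, `res_error`, `format_error_response`, and `ERROR_TYPE_NOT_SUPPORTED` are assumed from the style of examples/server/server.cpp:

```cpp
// reject completion requests when the server was started with --embeddings:
// an embedding-only model produces no logits, so completing would crash
if (ctx_server.params.embedding) {
    res_error(res, format_error_response(
        "This server does not support completions. Start it without `--embeddings`",
        ERROR_TYPE_NOT_SUPPORTED)); // surfaced to the client as HTTP 501
    return;
}
```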
@iamlemec @ggerganov Trying to clarify this, I created #8763; please review and comment there. If we have a clear path forward, I can propose a PR.
fix #8076
This ensures that batches are never mixed between embedding and completion tasks. It also adds a check in completion handlers ensuring the server was not run in embedding mode (otherwise it would crash for embedding-only models). Thus, to allow for completions and embeddings on the same server, one must run it without the `--embeddings` flag.

Currently, this does not work well with `n_parallel > 1`, since we are changing the `embedding` flag on a shared context, which will lead to missing logits and a failed assertion in many cases. Not sure if that's a deal-breaker.
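For context, the shared-context toggle mentioned above amounts to something like the following sketch (assuming the `llama_set_embeddings()` API from llama.h; error handling elided):

```cpp
// switch the single shared llama_context into the mode of the current batch
// before decoding; with n_parallel > 1, multiple slots share this context,
// so flipping the flag mid-stream can leave completion slots without logits
llama_set_embeddings(ctx, batch_type == 1);

if (llama_decode(ctx, batch) != 0) {
    // decode failed (sketch: real code would report or retry)
}
```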