
server: Ensure batches are either all embed or all completion (#8076) #8420

Merged
2 commits merged into ggml-org:master on Jul 12, 2024

Conversation

@iamlemec (Collaborator) commented Jul 10, 2024

fix #8076

This ensures that batches are never mixed between embedding and completion tasks. It also adds a check in completion handlers ensuring the server was not run in embedding mode (otherwise it would crash for embedding-only models). Thus to allow for completions and embeddings on the same server, one must run it without the --embeddings flag.
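As a rough sketch of the second point, a guard like the following could sit at the top of the completion handlers; the helper names (ctx_server.params.embedding, res_error, format_error_response, ERROR_TYPE_NOT_SUPPORTED) are assumptions about the surrounding server code, not a quote of this PR:

// sketch (assumed helper names): refuse completion requests when the server
// was started with --embeddings, since an embedding-only model has no logits
if (ctx_server.params.embedding) {
    res_error(res, format_error_response(
        "This server does not support completions. Start it without `--embeddings`",
        ERROR_TYPE_NOT_SUPPORTED));
    return;
}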

Currently, this does not work well with n_parallel > 1 since we are changing the embedding flag on a shared context, which will lead to missing logits and a failed assertion in many cases. Not sure if that's a deal-breaker.

@iamlemec changed the title from "server: Ensure batches are either all embed or all completion" to "server: Ensure batches are either all embed or all completion (#8076)" on Jul 10, 2024
Comment on lines +2180 to +2186
// check that we are in the right batch_type, if not defer the slot
bool slot_type = slot.embedding ? 1 : 0;
if (batch_type == -1) {
batch_type = slot_type;
} else if (batch_type != slot_type) {
continue;
}
Collaborator

Should this be done earlier? There are some things, like slot.n_past, which should probably not be changed when the batch_type is not the right one.

I feel like this check should be done at least before tokens are put in the batch, but the multitude of loops over slots does make it hard to find one right spot.

Collaborator Author

Indeed. But we also want to make sure we don't assign the batch to a particular type but then bail out later on (say, because the prompt is bigger than the batch size).

My current read is that everything before where the check is currently is tokenization, and it's fine to do that, bail, and then pick up the slot on the next go around. That's why it's okay to have the conditional continue just above it (2173-2178) too.

Collaborator

> everything before where the check is currently is tokenization and it's fine to do that, bail, and then pick up the slot on the next go around.

Right, and slot.n_past is recalculated anyway, so you're right.

But I think if batch.n_tokens is non-zero, the initial batch_type should be non-embedding, or else this could potentially attempt continuous batching with embeddings and previous text completions.
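A minimal sketch of that initialization, reusing the 0 / -1 convention from the snippet above (an illustration, not a quote of the actual fix):

// sketch: if the batch already holds sampled completion tokens, start the
// batch off as a completion (non-embedding) batch instead of leaving it unset
int32_t batch_type = batch.n_tokens > 0 ? 0 : -1;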

Collaborator Author

Ah I see, just pushed the fix!

@ggerganov (Member)

> Currently, this does not work well with n_parallel > 1 since we are changing the embedding flag on a shared context, which will lead to missing logits and a failed assertion in many cases

I don't see why it wouldn't work. n_parallel > 1 means more slots and this new logic with the batch type now correctly handles multiple slots. Can you provide a fail case?

@iamlemec (Collaborator, Author)

> I don't see why it wouldn't work. n_parallel > 1 means more slots and this new logic with the batch type now correctly handles multiple slots. Can you provide a fail case?

You know, it was crashing before when I tried requesting a completion and an embedding at the same time (with -np 2), but now it's running fine. It's possible the initial batch_type fix discussed above is what resolved it.

@ggerganov merged commit c3ebcfa into ggml-org:master on Jul 12, 2024
52 of 53 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 13, 2024
…org#8420)

* make sure batches are all embed or all non-embed

* non-embedding batch for sampled tokens; fix unused params warning
@okigan (Contributor) commented Jul 20, 2024

With this change, it appears that either the completion endpoint or the embeddings endpoint will work, but not both simultaneously.

Just to clarify, did both endpoints ever work simultaneously before?

@iamlemec (Collaborator, Author)

@okigan If you're using a generative (non-embedding) model, you can run it as usual, and it will be able to serve both types of endpoints. In that case, you'll probably want to provide it with a --pooling option to tell it how to make sequence embeddings. If you're using an embedding-only model, you need to pass --embeddings to the server when you launch it, and it will only be able to provide embeddings (by design).

@okigan (Contributor) commented Jul 21, 2024

@iamlemec So it seems like the "--embeddings" flag should only be used with embedding-only models, right? How would one identify whether a model file is embeddings-only? And if an embeddings-only model is passed, would llama-server require the flag?

I also made a script to test this with a Mistral model (the third response fails with 501):

./test_llama_server.sh
Model already exists. Skipping download.
Running server WITHOUT --embeddings flag...
Waiting for server to start...
Server is ready.
Testing chat completion...
> POST /v1/chat/completions HTTP/1.1
< HTTP/1.1 200 OK


Testing embeddings...
> POST /v1/embeddings HTTP/1.1
< HTTP/1.1 200 OK

#################

Running server with --embeddings flag...
Waiting for server to start...
Server is ready.
Testing chat completion...
> POST /v1/chat/completions HTTP/1.1
< HTTP/1.1 501 Not Implemented                 <<<<<---- here


Testing embeddings...
> POST /v1/embeddings HTTP/1.1
< HTTP/1.1 200 OK

So, if the above is expected, I guess the docs are confusing for the --embeddings flag. Would you agree?

@iamlemec (Collaborator, Author)

@okigan Yeah, --embeddings would make most sense for embedding-only models, or perhaps if you want to ensure that completions are not served.

Funnily enough, despite the many configuration flags seen in the HF world, there isn't one that cleanly identifies whether a model is an embedding model or not. Usually embedding models are clearly identified as such in the README. If you want to do it programmatically, the best way right now is to look at the hparams.causal_attn flag. There are probably some exceptions, but embedding models typically have causal_attn = False while generative models have causal_attn = True. I think there was talk at some point of auto-detecting based on this heuristic, but it hasn't been done.
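As a hedged illustration (not part of this PR), this is one way that check could be done from a GGUF file using ggml's gguf API; the metadata key name "<arch>.attention.causal" and the default of causal attention when the key is absent are assumptions:

#include <cstdio>
#include <string>
#include "gguf.h"

// sketch: guess whether a GGUF model is embedding-only by reading the
// "<arch>.attention.causal" key (assumed name; a missing key is treated as causal)
int main(int argc, char ** argv) {
    if (argc < 2) { return 1; }

    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ nullptr };
    struct gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (!ctx) { return 1; }

    // architecture name, e.g. "bert" for many embedding models, "llama" for generative ones
    const auto arch_id = gguf_find_key(ctx, "general.architecture");
    const std::string arch = arch_id >= 0 ? gguf_get_val_str(ctx, arch_id) : "";

    const auto key_id = gguf_find_key(ctx, (arch + ".attention.causal").c_str());
    const bool causal = key_id < 0 ? true : gguf_get_val_bool(ctx, key_id);

    printf("%s: %s\n", argv[1], causal ? "likely generative" : "likely embedding-only");
    gguf_free(ctx);
    return 0;
}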

The main reason this slightly awkward setup is needed is that if you try to get a completion from an embedding model it'll actually crash. Basically, it looks for logits in the output, doesn't find them, and gives up. So you want to protect embedding models from crashing themselves if you ask for a completion.

I agree, the docs are not really accurate anymore (at least what's written in the command line help). It should probably now say something like "embedding only mode" rather than "enable embedding endpoint".

@okigan (Contributor) commented Jul 30, 2024

@iamlemec @ggerganov To try to clarify this, I created #8763; please review and comment there. If we have a clear path forward, I can propose a PR.

@a-h mentioned this pull request Jul 30, 2024

Successfully merging this pull request may close these issues.

Bug: llama-server crashes when started with --embeddings
4 participants