fix: fix CohereForAI/c4ai-command-r-plus #1707
Conversation
@@ -43,7 +43,7 @@ def __init__(
         ]
         self.free_block_mask = torch.ones(num_blocks, dtype=torch.int32, device="cpu")
         self.slots = torch.arange(
-            0, num_blocks * self.block_size, dtype=torch.int32
+            0, num_blocks * self.block_size, dtype=torch.int64
Quick question: is there a case where num_blocks is really, really big? Or maybe there are very large block_indices sometimes? Just trying to understand the type change.
It's because the vllm kernel now asks for this dtype. I don't know why they changed it.
Slots is a very small tensor anyway.
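For context, a minimal sketch of what the changed allocation looks like in isolation (the `num_blocks`/`block_size` values and the simplified surroundings are assumptions for illustration; only the dtype change itself comes from the diff above):

```python
import torch

# Hypothetical, simplified version of the block/slot bookkeeping touched by the diff.
# Assume num_blocks and block_size come from the KV-cache configuration.
num_blocks = 1024
block_size = 16

free_block_mask = torch.ones(num_blocks, dtype=torch.int32, device="cpu")

# After this change the slot indices are allocated as int64, since the updated
# vllm paged-attention kernel expects that dtype for the slot mapping.
slots = torch.arange(0, num_blocks * block_size, dtype=torch.int64)
```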
Resolved review comment (outdated) on server/text_generation_server/models/custom_modeling/flash_mistral_modeling.py.
Force-pushed from e0e96d2 to 26da6bf.
Hello @OlivierDehaene, @drbh, this pull request slightly changes decoding and so breaks my integration pipeline (I'm testing multiple inputs on my models and asserting their outputs don't change with do_sample=False). Do you know why this change was needed and if it's going to stay this way?
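For reference, a minimal sketch of the kind of deterministic-output check described above (the model id, prompts, and expected strings are placeholders, not from this PR):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical regression check: greedy decoding (do_sample=False) is deterministic,
# so any change in the generated text signals a change in the decode path.
model_id = "my-org/my-model"                              # placeholder model
prompts = ["Hello, world", "Translate to French: cat"]    # placeholder inputs
expected = ["...", "..."]                                 # previously recorded greedy outputs

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

for prompt, ref in zip(prompts, expected):
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, do_sample=False, max_new_tokens=32)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    assert text == ref, f"decoding changed for prompt: {prompt!r}"
```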
@Narsil @drbh this will update flash attention v2 and vllm.
You will need to re-install them.