
fix: fix CohereForAI/c4ai-command-r-plus #1707

Merged: 17 commits merged into main from chore/update_flash on Apr 10, 2024
Conversation

OlivierDehaene (Contributor):

@Narsil @drbh this will update flash attention v2 and vllm.
You will need to re-install them.

@@ -43,7 +43,7 @@ def __init__(
         ]
         self.free_block_mask = torch.ones(num_blocks, dtype=torch.int32, device="cpu")
         self.slots = torch.arange(
-            0, num_blocks * self.block_size, dtype=torch.int32
+            0, num_blocks * self.block_size, dtype=torch.int64
drbh (Collaborator) commented on Apr 4, 2024:
Quick question: is there a case where num_blocks gets really big, or are there sometimes very large block indices?

Just trying to understand the type change.

OlivierDehaene (Contributor, Author) replied:

It's because the vLLM kernel now asks for this dtype; I don't know why they changed it.
The slots tensor is very small anyway.
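
For context on the question above, here is a small illustrative sketch of the overflow arithmetic (the block size and the numbers are assumptions for illustration, not taken from this PR):

```python
import torch

# Illustrative only: with an assumed 16-token block size, int32 slot indices would
# only overflow past ~2.1 billion slots, i.e. roughly 134 million blocks, far beyond
# any realistic KV cache. So the switch to int64 is about matching the dtype the
# updated vLLM kernel expects, not about avoiding overflow.
BLOCK_SIZE = 16  # assumed block size
int32_max = torch.iinfo(torch.int32).max   # 2_147_483_647
print(int32_max // BLOCK_SIZE)             # ~134 million blocks before overflow

# The slots tensor from the diff, with the new dtype:
num_blocks = 1024
slots = torch.arange(0, num_blocks * BLOCK_SIZE, dtype=torch.int64)
assert slots.dtype == torch.int64
assert slots.numel() == num_blocks * BLOCK_SIZE
```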

@OlivierDehaene OlivierDehaene merged commit ad9d628 into main Apr 10, 2024
@OlivierDehaene OlivierDehaene deleted the chore/update_flash branch April 10, 2024 15:20
kdamaszk pushed a commit to kdamaszk/tgi-gaudi that referenced this pull request Apr 29, 2024
thomas-schillaci (Contributor):

Hello @OlivierDehaene, @drbh, this pull request slightly changes decoding and so breaks my integration pipeline (I test multiple inputs on my models and assert that their outputs don't change with do_sample=False). Do you know why this change was needed and whether it's going to stay this way?
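
For anyone running a similar pipeline, here is a hedged sketch of the kind of greedy-decoding regression check described above; the endpoint URL, prompt, expected string, and parameters are placeholders, not taken from this thread:

```python
import requests

# Hypothetical regression check: with do_sample=False, decoding is greedy and
# therefore deterministic, so generated text for a fixed prompt set can be compared
# against previously recorded outputs. All values below are placeholders.
TGI_URL = "http://localhost:8080/generate"

EXPECTED = {
    "What is Deep Learning?": "Deep Learning is a subfield of machine learning ...",
}

def generate(prompt: str) -> str:
    resp = requests.post(
        TGI_URL,
        json={
            "inputs": prompt,
            "parameters": {"do_sample": False, "max_new_tokens": 32},
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["generated_text"]

for prompt, expected in EXPECTED.items():
    got = generate(prompt)
    assert got == expected, f"decoding changed for {prompt!r}: {got!r} != {expected!r}"
```

One plausible reason such exact-match assertions start failing after a kernel update like this one is that a changed floating-point accumulation order can flip a near-tie at some decoding step, even with greedy decoding.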

Nilabhra pushed a commit to TII-AI-Research-Center/text-generation-inference that referenced this pull request May 14, 2024