Improve support for GPUs with capability < 8 #2575
Conversation
Force-pushed from 491c026 to 5bd79b8.
- For models that cannot use flashinfer, use flash-attn v1 + paged attention when the compute capability is older than 8.
- Disable prefix caching when using paged attention.
- When using flash-attn v1, pass the key/value rather than the cache, since v1 cannot use block tables.
Force-pushed from 5bd79b8 to 8c0f931.
@@ -65,6 +66,7 @@ fn get_config(
}

fn resolve_attention(config: &Option<Config>, lora_adapters: &Option<String>) -> (String, String) {
    let compute_capability = *gpu::COMPUTE_CAPABILITY;
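For context, the cached static being discussed presumably looks something like the sketch below (the LazyLock, the get_cuda_capability body, and the (usize, usize) return type are assumptions for illustration; the real lookup calls into Python):

```rust
use std::sync::LazyLock;

// Sketch: the compute capability is resolved at most once and then cached.
// Dereferencing the static afterwards is only a cheap initialization check
// in Rust, with no repeated Python calls (and no GIL acquisition).
pub static COMPUTE_CAPABILITY: LazyLock<Option<(usize, usize)>> =
    LazyLock::new(get_cuda_capability);

// Placeholder for the real lookup, which calls into Python land (and thus
// has to take the GIL); that cost is what the cached static avoids paying
// at every call site.
fn get_cuda_capability() -> Option<(usize, usize)> {
    Some((7, 5)) // e.g. a T4; the real code queries the CUDA driver
}

fn main() {
    if let Some((major, minor)) = *COMPUTE_CAPABILITY {
        println!("compute capability: {major}.{minor}");
    }
}
```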
Why not just get_cuda_capability()?
No need to take the (slow) GIL if it's not needed.
I am not sure I understand. This will only evaluate the compute capability once; it's just a cheap Rust lock. If we called get_cuda_capability directly, we'd call into Python land every time it's called (it's only called once now).
Yes, yes, I understand, but since we're only calling it once there's no need. It's really not important, though.
launcher/src/main.rs (Outdated)
let prefix_caching = if attention == "paged"
    && prefix_caching.is_none()
    && compute_capability.is_some()
{
    tracing::info!("Disabling prefix caching because it is not supported with 'flashinfer'");
    "false".to_string()
} else {
    prefix_caching.unwrap_or("true".to_string())
};
Can't we move this upstairs? We're not allowed to change anything if the user sets it directly, so this line must always be unwrap_or. And in the previous code, every modification must be protected by a check that the value is not None (there might even be a clearer method for it on Option).
Do we need the unwrap_or? The first branch of the conditional only fires when prefix_caching is None, so in the else branch the user has set something. I don't know why I added the compute_capability.is_some() though.
I'll reread the preceding code to see if we check this everywhere.
I don't know how to make it the most obvious, tbh. But what I was really aiming for was:
- User signal takes precedence over everything else.
- If not user-specified, resolve attention (prefix + attention) as simply as possible:
  - When some conditions are met, we use non-defaults.
  - If nothing else applies, we use the defaults (so the defaults are more easily seen, imho).

I wasn't super happy with the design, but that's the reason for the big ugly block on top: it is where I hide the 'non-default resolutions'. So the spirit of my comment is just to try to keep the same rationale and have the messy thing in only one place (with the hidden agenda that this mess should disappear whenever flashinfer supports all the features we hope it does).
Moved the additional messy stuff into the messy block, so there are now two unwrap_ors again.
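For illustration, here is a minimal sketch of that layout (the signature, the default strings, and the capability threshold are assumptions, not the actual launcher code): the messy block computes non-default candidates in one place, and user-set flags win via Option::or before the final unwrap_or defaults.

```rust
// Sketch of the intent described above: one messy block for non-default
// resolutions, user-provided values taking precedence, defaults last.
fn resolve_attention(
    user_attention: Option<String>,
    user_prefix_caching: Option<String>,
    compute_capability: Option<(usize, usize)>,
) -> (String, String) {
    // Non-default resolutions all live in this one block.
    let (attention, prefix_caching) = match compute_capability {
        // flashinfer needs compute capability >= 8; older GPUs fall back to
        // paged attention, which does not support prefix caching.
        Some((major, _)) if major < 8 => {
            (Some("paged".to_string()), Some("false".to_string()))
        }
        _ => (None, None),
    };

    (
        // User signal takes precedence; defaults only apply at the very end.
        user_attention
            .or(attention)
            .unwrap_or_else(|| "flashinfer".to_string()),
        user_prefix_caching
            .or(prefix_caching)
            .unwrap_or_else(|| "true".to_string()),
    )
}

fn main() {
    // User set nothing; an sm_75 GPU resolves to paged attention, no caching.
    let (attention, caching) = resolve_attention(None, None, Some((7, 5)));
    assert_eq!((attention.as_str(), caching.as_str()), ("paged", "false"));
}
```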
-    kv_cache[0] if SYSTEM != "ipex" else key,
-    kv_cache[1] if SYSTEM != "ipex" else value,
+    kv_cache[0] if PREFILL_IN_KV_CACHE else key,
+    kv_cache[1] if PREFILL_IN_KV_CACHE else value,
Another option would be to send both to every implementation: let V2/flashinfer use the cache directly with block tables, and let V1 and other backends use the raw values. Let's keep this for now, but it's food for thought if this logic gets more complex.
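As a rough sketch of that alternative (all names and signatures here are invented for illustration, not the project's actual API), every backend would receive both the raw key/value tensors and the paged cache, and pick whichever representation it supports:

```python
# Hypothetical stubs; only the dispatch shape matters for this sketch.
def paged_attention(query, kv_cache, block_tables):
    ...  # v2/flashinfer path: read prefill K/V from the cache via block tables


def contiguous_attention(query, key, value):
    ...  # v1 path: no block-table support, use the raw contiguous tensors


def attention(query, key, value, kv_cache, block_tables, *, supports_block_tables):
    """Send both representations to every implementation and let each choose."""
    if supports_block_tables:
        # flash-attn v2 / flashinfer can consume the paged KV cache directly.
        return paged_attention(query, kv_cache, block_tables)
    # flash-attn v1 and other backends fall back to the raw key/value.
    return contiguous_attention(query, key, value)
```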
Yeah, that crossed my mind as well, certainly worth considering in the future.
LGTM
What does this PR do?
Improve support for GPUs with capability < 8
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.