fix: attempt forward on flash attn2 to check hardware support #2335
Conversation
@@ -173,6 +174,41 @@ def paged_attention(
    try:
        import flash_attn_2_cuda

        # try forwarding to see if it works with all dummy inputs
Wouldn't it be easier to require the minimum needed CUDA capability? We could even skip the import altogether if the hardware doesn't have the right capability.
I agree, it would be simpler. We already have is_sm75, which was probably there for that reason.
Yeah, I agree that's a better method. Updated to use is_ampere_or_newer = major >= 8 and minor >= 0 in the latest commits and avoid trying the forward pass.
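For illustration, a minimal sketch of the capability-based gate described above. torch.cuda.get_device_capability() is a real PyTorch API and the variable names mirror the comments, but the surrounding import wiring is an assumption, not the exact PR code.

```python
import torch

# Sketch only: gate the flash_attn_2_cuda import on compute capability
# instead of attempting a dummy forward pass.
V2 = False
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    is_sm75 = major == 7 and minor == 5
    is_ampere_or_newer = major >= 8 and minor >= 0
    if is_ampere_or_newer:
        try:
            import flash_attn_2_cuda
            V2 = True
        except ImportError:
            V2 = False
```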
@@ -254,9 +290,11 @@ def attention(
    softcap=None,
):
    if window_size_left != -1:
        raise NotImplementedError(
            "window_size_left is only available with flash attn v2"
        warnings.warn(
Can you also not change that? It's important to hard crash rather than silently ignore it.
Updated in the latest commit to still throw when an invalid value is passed.
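For reference, a minimal sketch of the hard-crash behavior the reviewer asked to keep. The standalone helper name is illustrative (in the PR the check sits inside attention()); the error message comes from the diff above.

```python
def check_window_size_left(window_size_left: int) -> None:
    # Raise instead of warning: an unsupported window size should hard crash
    # rather than be silently ignored.
    if window_size_left != -1:
        raise NotImplementedError(
            "window_size_left is only available with flash attn v2"
        )
```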
    if (
        (sliding_window is not None and sliding_window != -1)
        and not SUPPORTS_WINDOWING
        and max_input_tokens > sliding_window
        and is_max_input_within_sliding_window
Maybe I'm misreading this, but shouldn't the exception be raised when the max input is not inside the window size?
Yes, this line was correct before. The only thing that needs to happen is that if max_input_tokens <= sliding_window, we can set sliding_window to -1 and forget about the windowing. The code changes can be much smaller.
Updated to set sliding_window to -1 if max_input_tokens <= sliding_window in the latest commit.
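A minimal sketch of the agreed-upon logic, assuming a hypothetical helper resolve_sliding_window; the actual PR applies this inline where the model configuration is validated.

```python
def resolve_sliding_window(sliding_window, max_input_tokens, supports_windowing):
    # Sketch only (names are illustrative): if the longest possible input fits
    # inside the window, windowing can never kick in, so disable it instead of
    # erroring out on backends without windowing support.
    if sliding_window is None or sliding_window == -1:
        return -1
    if max_input_tokens is not None and max_input_tokens <= sliding_window:
        return -1
    if not supports_windowing:
        raise ValueError("Sliding window attention is not supported by this backend")
    return sliding_window
```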
        self.max_past_tensor = (
            torch.tensor(config.sliding_window, device=weights.device)
            if self.max_past is not None
            torch.tensor(self.max_past, device=weights.device)
Doesn't this try to create a tensor with the None value if windowing is not supported? Same below.
Ahh yes, you are right 😅. Fixed in the latest changes to avoid this altogether.
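A small sketch of the guarded construction discussed here; the helper name and the explicit device argument are illustrative (in the model this lives in __init__ and uses weights.device).

```python
import torch

def make_max_past_tensor(max_past, device):
    # Sketch only: build the tensor solely when a sliding window is configured;
    # torch.tensor(None) would raise otherwise.
    return torch.tensor(max_past, device=device) if max_past is not None else None

print(make_max_past_tensor(4096, "cpu"))  # tensor(4096)
print(make_max_past_tensor(None, "cpu"))  # None
```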
This PR attempts to execute a forward pass to validate that flash attention 2 works on the current hardware. Prior to this change, import flash_attn_2_cuda could load but flash_attn_2_cuda.varlen_fwd would fail with "FlashAttention only supports Ampere GPUs or newer". This change surfaces the runtime error when the library is loaded, which sets V2 to False and avoids using flash attn 2 in the forward pass.
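A rough sketch of the loading pattern the description refers to; the dummy forward call itself is elided because the exact flash_attn_2_cuda.varlen_fwd signature is not shown in this excerpt.

```python
# Sketch only: set V2 based on whether flash attention 2 actually works at
# import time. On pre-Ampere GPUs the import may succeed, so the PR also
# attempts a forward pass with dummy inputs (elided) to trigger the error.
V2 = True
try:
    import flash_attn_2_cuda
    # ... run flash_attn_2_cuda.varlen_fwd on dummy inputs here; on unsupported
    # hardware it raises "FlashAttention only supports Ampere GPUs or newer".
except (ImportError, RuntimeError):
    V2 = False
```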