feat: support flash attention 2 in qwen2 vl vision blocks #2721

drbh · 2024-11-04T16:29:54Z

This PR adds support for flash attention within the vision blocks of Qwen2-VL. This improves latency and reduces the memory spikes caused by SDPA attention, which helps avoid OOMs on large images.
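To give a rough sense of why large images can cause memory spikes, here is a back-of-the-envelope sketch. It assumes the worst case, where the attention path materializes a full `[num_heads, seq_len, seq_len]` score tensor; whether that actually happens depends on the SDPA backend in use. The helper name, the head count of 16, and the 2-byte element size are illustrative assumptions, not values taken from this PR.

```python
def dense_scores_bytes(seq_len: int, num_heads: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed for a fully materialized [num_heads, seq_len, seq_len] tensor.

    Hypothetical helper for illustration only; not part of the PR.
    """
    return num_heads * seq_len * seq_len * bytes_per_elem


if __name__ == "__main__":
    # The patch counts below are made-up examples of increasingly large images.
    for patches in (1024, 4096, 16384):
        mib = dense_scores_bytes(patches, num_heads=16) / 2**20
        print(f"{patches:>6} patches -> {mib:,.0f} MiB of attention scores")
```

The cost grows quadratically with the number of patches, so doubling the patch count quadruples this term. Flash attention kernels avoid materializing the full score matrix, which is consistent with the reduced spikes described above.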

Running

text-generation-launcher --model-id Qwen/Qwen2-VL-7B-Instruct

on a single L4 with a small model shows a ~5% performance improvement: previously ~67ms per token, now ~63ms per token.

PS: the performance improvement likely scales with the size and number of layers; however, I have not benchmarked any larger models to measure the difference.

OlivierDehaene · 2024-11-18T14:47:03Z

server/text_generation_server/models/custom_modeling/qwen2_vl.py

-        )
-        attn_output = attn_output.transpose(0, 1)
+        # calc maximum sequence length for any batch
+        max_seqlen = torch.max(cu_seqlens[1:] - cu_seqlens[:-1])


It may be interesting to compute this value once in the Qwen2VisionModel.forward instead.

Great point. Updated in the latest commit to calculate max_seqlen only once!
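The suggestion above can be sketched in plain Python. `cu_seqlens` holds cumulative sequence boundaries, so the difference between adjacent entries is each sequence's length, and the largest difference is `max_seqlen`. The sketch below uses lists instead of tensors so it runs without PyTorch, and `ToyBlock` and `ToyVisionModel` are hypothetical stand-ins, not the classes from `qwen2_vl.py`.

```python
from typing import List


def max_seqlen_from_cu(cu_seqlens: List[int]) -> int:
    """Longest sequence length implied by cumulative boundaries.

    List-based equivalent of torch.max(cu_seqlens[1:] - cu_seqlens[:-1]).
    """
    return max(end - start for start, end in zip(cu_seqlens, cu_seqlens[1:]))


class ToyBlock:
    """Hypothetical stand-in for a vision block that receives max_seqlen."""

    def forward(self, hidden: list, cu_seqlens: List[int], max_seqlen: int) -> list:
        # A real block would hand cu_seqlens and max_seqlen to a
        # variable-length attention kernel; here the input passes through.
        return hidden


class ToyVisionModel:
    """Hypothetical stand-in for the model that owns the blocks."""

    def __init__(self, depth: int) -> None:
        self.blocks = [ToyBlock() for _ in range(depth)]

    def forward(self, hidden: list, cu_seqlens: List[int]) -> list:
        # Computed once per forward pass instead of once per block.
        max_seqlen = max_seqlen_from_cu(cu_seqlens)
        for block in self.blocks:
            hidden = block.forward(hidden, cu_seqlens, max_seqlen)
        return hidden
```

With boundaries `[0, 4, 10, 13]` the sequence lengths are 4, 6, and 3, so `max_seqlen` is 6. Hoisting the computation means the reduction runs once per forward pass rather than once per block.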

drbh · 2024-11-18T17:46:03Z

Optimistically merging: the max_seqlen comment was addressed, these changes are needed in an in-progress PR, and the only failing tests are unrelated to this PR (they fail elsewhere, in unrelated code).

Watching for regressions and will revert if required!

feat: support flash attention 2 in qwen2 vl vision blocks

41dff31

OlivierDehaene reviewed Nov 18, 2024

fix: calc max_seqlen once and small refactors

70409f0

drbh merged commit 38cff84 into main Nov 18, 2024
10 of 12 checks passed

drbh deleted the support-flash-qwen2-vl branch November 18, 2024 17:46
