Auto max prefill #2797

Narsil · 2024-12-03T03:08:15Z

What does this PR do?

Fixes # (issue)

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline,
Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link
to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

HuggingFaceDocBuilderDev · 2024-12-04T20:36:14Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

danieldk · 2024-12-06T08:38:26Z

launcher/src/main.rs

+fn compute_optimal(config: Option<&Config>, compute: Option<&ComputeType>) -> Option<usize> {
+    if let (Some(config), Some(compute)) = (config, compute) {
+        if let (Some(f16_max_compute), Some(model_compute)) = (compute.f16_flop(), config.flop()) {
+            tracing::debug!("MAx compute {f16_max_compute} model compute {model_compute}");


Nit:

tracing::debug!("Max compute {f16_max_compute} model compute {model_compute}");

danieldk · 2024-12-06T08:49:58Z

launcher/src/main.rs

+        let q_flops = 2 * num_heads * head_dim * hidden_size;
+        let k_flops = 2 * num_kv_heads * head_dim * hidden_size;
+        let v_flops = 2 * num_kv_heads * head_dim * hidden_size;
+        let attn_flops = 2 * num_heads * head_dim * hidden_size;
+        let o_flops = 2 * num_heads * head_dim * hidden_size;


We should probably adjust this for FP8 on compute capability >= 9.0. (For < 9.0 this is still fine, since FP8-Marlin computes in FP16.)

Narsil · 2024-12-06

No, it should be OK, since I already use the hardware FLOPs in f16.

Both are divided by 2 for fp8, so the calculation should hold; it's just expressed in f16 units.
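
A minimal sketch of the argument in this thread, under stated assumptions. `AttnShape`, `attn_projection_flops`, and `token_budget` are hypothetical names, not the launcher's actual API, and the budget formula (hardware FLOPs divided by per-token model FLOPs) is an assumed simplification of what `compute_optimal` does. The per-projection terms are copied from the diff above. The point is only that when both quantities are scaled by the same factor, the ratio is unchanged, so staying in f16 units is harmless.

```rust
/// Hypothetical, simplified model shape; field names mirror the diff above.
pub struct AttnShape {
    pub num_heads: u64,
    pub num_kv_heads: u64,
    pub head_dim: u64,
    pub hidden_size: u64,
}

/// Per-token FLOPs of the attention block, using the same terms as the diff.
pub fn attn_projection_flops(s: &AttnShape) -> u64 {
    let q_flops = 2 * s.num_heads * s.head_dim * s.hidden_size;
    let k_flops = 2 * s.num_kv_heads * s.head_dim * s.hidden_size;
    let v_flops = 2 * s.num_kv_heads * s.head_dim * s.hidden_size;
    let attn_flops = 2 * s.num_heads * s.head_dim * s.hidden_size;
    let o_flops = 2 * s.num_heads * s.head_dim * s.hidden_size;
    q_flops + k_flops + v_flops + attn_flops + o_flops
}

/// Assumed budget: how many tokens fit into the available compute.
/// Returns None when the model cost is zero.
pub fn token_budget(hardware_flops: u64, flops_per_token: u64) -> Option<u64> {
    hardware_flops.checked_div(flops_per_token)
}
```

With an illustrative shape (32 heads, 8 KV heads, head dimension 128, hidden size 4096), scaling both the hardware figure and the model cost by the same factor gives the same budget, which is the invariance the reply relies on.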

danieldk · 2024-12-06T08:55:20Z

launcher/src/main.rs

+        let head_dim = self.head_dim? as u64;
+        let hidden_size = self.hidden_size? as u64;
+        let intermediate_size = if let Some(experts) = self.experts {
+            (self.intermediate_size? * experts) as u64


In the case of Deepseek v2, we should do: (num_experts_per_tok + n_shared_experts) * moe_intermediate_size

(Or put differently, Phi 3.5 MoE and Mixtral are the special case where n_shared_experts == 0 && moe_intermediate_size == intermediate_size.)

To make it more fun, the first layer in Deepseek v2 does not use MoE (but the layer is practically the same size, since moe_intermediate_size is intermediate_size / (num_experts_per_tok + n_shared_experts)).
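
The suggestion above could be sketched roughly as follows. `MoeConfig` and `active_intermediate_size` are hypothetical names for illustration, not the launcher's `Config`, and the fallback behaviour (a missing `n_shared_experts` counts as 0, a missing `moe_intermediate_size` falls back to `intermediate_size`) is an assumption drawn from the comment about Phi 3.5 MoE and Mixtral.

```rust
/// Hypothetical subset of a model config; field names follow the review comment.
pub struct MoeConfig {
    pub intermediate_size: u64,
    pub moe_intermediate_size: Option<u64>,
    pub num_experts_per_tok: Option<u64>,
    pub n_shared_experts: Option<u64>,
}

/// Intermediate size actually exercised per token:
/// (num_experts_per_tok + n_shared_experts) * moe_intermediate_size.
/// Dense models (no num_experts_per_tok) just use intermediate_size.
pub fn active_intermediate_size(c: &MoeConfig) -> u64 {
    match c.num_experts_per_tok {
        Some(routed) => {
            // Assumption: absent fields reduce to the Mixtral-style special case.
            let shared = c.n_shared_experts.unwrap_or(0);
            let per_expert = c.moe_intermediate_size.unwrap_or(c.intermediate_size);
            (routed + shared) * per_expert
        }
        None => c.intermediate_size,
    }
}
```

With illustrative numbers (not read from any real config file), a config with 6 routed experts per token, 2 shared experts, and a per-expert size of 1536 yields 12288, which equals an `intermediate_size` of 12288. That matches the remark that the non-MoE first layer is practically the same size.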

Narsil added 17 commits December 3, 2024 04:06

Attempt at automatic max batch prefill.

54d3c81

Taking into account number of shards.

fa91244

Adding more cards.

23c0a20

Adding A100 + H100

5bcb3e6

Adding a few more cards.

e85dc0a

Logprobs cost too much.

96ad65b

h100 better name, and keep factor of 2

748dce6

Damn inflated sparse tflops.

3a53e8c

Typo in h100.

3ec9259

Updated the flops calculation (checked with fvcore).

9fab7c6

chunking by default.

db11149

Fix prefix caching for chat completion since we removed logprobs.

1352f70

More tests.

13e6d52

Dropping all the prefill logprobs.

f6998f8

Add a flag that enables users to get logprobs back.

3a86afc

Repairing prompt token counting.

3ed703c

Fixing a few tests.

a78b6fd

Narsil added 2 commits December 4, 2024 21:54

Remove some scaffolding.

ca8a115

Attempting to reduces the issues (workarounds for now).

f022ecf

Narsil requested review from danieldk December 6, 2024 04:49

Narsil merged commit 5df8059 into main Dec 6, 2024
11 of 13 checks passed

Narsil deleted the auto_max_prefill branch December 6, 2024 04:52

danieldk reviewed Dec 6, 2024


2016bgeyer mentioned this pull request Dec 20, 2024

Can't run llama3.1-70b at full context #2301

Open

4 tasks
