Improving mamba runtime by using updates #1552

Merged

merged 4 commits into main from mamba_graphs on Feb 14, 2024

Conversation

@Narsil (Collaborator) commented Feb 13, 2024

  • Move float16 to bfloat16, which has fewer precision issues (the load
    tests were failing with the update kernels + f16, while everything
    works under bf16); a minimal dtype comparison is sketched after this
    list.

    Another note: we are not respecting the layer norm in f32 defined in
    the configuration (this is OK in my book, but it could impact the
    f16 precision).

  • Moved to the update kernels. The Triton launch overhead is very
    high; removing it by switching to CUDA graphs works great (an update
    CUDA graph is available in TRT-LLM if needed; it seems exactly like
    the regular SSM kernel). See the capture/replay sketch after this
    list.

  • Restructured the inference_params struct so that it holds only 2
    tensors, to reduce the overhead of copying back and forth to the
    CUDA graphs (also sketched after this list).

  • The leftover overhead seems to be entirely in the tokenization step
    (4 copies are still paid before launching the graph).
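
For context, a minimal sketch (not part of the PR) of the f16/bf16 trade-off: bfloat16 keeps float32's exponent range at the cost of mantissa precision, which is why it is far less prone to the overflow instabilities seen with the update kernels in f16.

```python
import torch

# float16: 10 mantissa bits but a narrow exponent range (max ~6.5e4),
# so intermediate values can overflow. bfloat16: only 7 mantissa bits,
# but the same exponent range as float32, so it rarely overflows.
for dtype in (torch.float16, torch.bfloat16, torch.float32):
    info = torch.finfo(dtype)
    print(f"{str(dtype):>15}  max={info.max:.3e}  eps={info.eps:.3e}")
```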
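
A minimal sketch of the CUDA-graph capture/replay pattern that removes the per-step launch overhead; `step_fn` and the tensor shapes here are hypothetical stand-ins for the mamba decode step, not the PR's actual code.

```python
import torch

def step_fn(x, state):
    # Hypothetical stand-in for one mamba decode step (conv + SSM update).
    return x * 2.0 + state.sum(dim=-1)

# A captured graph replays on fixed memory addresses, so all inputs and
# outputs must live in static buffers that are reused across steps.
static_x = torch.zeros(1, 16, device="cuda")
static_state = torch.zeros(1, 16, 4, device="cuda")

# Warm up on a side stream before capture, as PyTorch recommends.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        step_fn(static_x, static_state)
torch.cuda.current_stream().wait_stream(s)

# Capture: records the kernels once instead of launching them each step.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_out = step_fn(static_x, static_state)

# Per token: copy new inputs into the static buffers (these are the
# copies the PR tries to minimize), then replay in a single launch.
static_x.copy_(torch.randn(1, 16, device="cuda"))
graph.replay()
result = static_out.clone()
```

After capture, each decode step is a single `graph.replay()` call, so the per-kernel launch overhead disappears from the decode loop.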
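
And a hedged sketch of what "only 2 tensors" could look like: the per-layer conv and SSM caches stacked into two contiguous buffers. The names and shapes are assumptions for illustration, not the PR's actual layout.

```python
from dataclasses import dataclass

import torch

@dataclass
class InferenceParams:
    # One stacked buffer per cache kind instead of a dict of per-layer
    # tensors: only 2 tensors must be copied in and out of the CUDA
    # graph's static memory per step.
    conv_states: torch.Tensor  # (n_layers, batch, d_inner, d_conv)
    ssm_states: torch.Tensor   # (n_layers, batch, d_inner, d_state)

def make_inference_params(n_layers, batch, d_inner, d_conv, d_state,
                          device="cuda", dtype=torch.bfloat16):
    return InferenceParams(
        conv_states=torch.zeros(n_layers, batch, d_inner, d_conv,
                                device=device, dtype=dtype),
        ssm_states=torch.zeros(n_layers, batch, d_inner, d_state,
                               device=device, dtype=dtype),
    )

# Each layer then indexes its own slice, e.g.:
# conv_state = params.conv_states[layer_idx]
```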

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@Narsil requested a review from drbh on February 13, 2024 11:28

@drbh (Collaborator) commented Feb 13, 2024

wooooo this is awesome, thank you for the optimizations @Narsil! Just pulled it down and am getting a huge performance increase: ~50ms down to 15ms 🙏

LGTM, I just think the snapshots for mamba are out of date

@Narsil (Collaborator, Author) commented Feb 13, 2024

They are updated; it just seems the noise is quite high in the update kernels (which could explain the f16 instabilities).

drbh previously approved these changes Feb 13, 2024

@drbh (Collaborator) left a comment:

LGTM

@Narsil merged commit d6b0fb9 into main on Feb 14, 2024
@Narsil deleted the mamba_graphs branch on February 14, 2024 08:54
kdamaszk pushed a commit to kdamaszk/tgi-gaudi that referenced this pull request Apr 29, 2024