ROCm AWQ support #1514

Merged: 27 commits into main on Feb 9, 2024
Conversation

@IlyasMoutawwakil (Member) commented Feb 1, 2024

What does this PR do?

This PR adds the possibility to run AWQ models with Exllama/GPTQ kernels, specifically for ROCm devices that support Exllama kernels but not AWQ's GEMM.

This is done by:

  • un-packing, reordering and re-packing the AWQ weights when --quantize gptq is used but the model's quant_method is awq (a rough sketch of this repacking is given below);
  • avoiding overflows when adding 1 to the zero points in the exllama and triton kernels.

Ref: casper-hansen/AutoAWQ#313
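To make the first bullet concrete, here is a minimal, hypothetical sketch of the unpack/reorder/repack idea, not the PR's actual conversion code: the pack orders, tensor layout, and function names are assumptions for illustration. Both formats pack eight 4-bit values into each int32, but in different element orders, so the conversion boils down to unpack, permute, repack.

import torch

BITS = 4
PACK_NUM = 32 // BITS                       # eight 4-bit values per int32
AWQ_PACK_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]   # assumed AWQ interleaved order
GPTQ_PACK_ORDER = list(range(PACK_NUM))     # assumed sequential GPTQ order

def unpack(packed, order):
    # Split each int32 into 8 nibbles; the nibble at bit slot `pos`
    # holds logical element `order[pos]` of the group.
    rows, cols = packed.shape
    out = torch.empty(rows, cols, PACK_NUM, dtype=torch.int32)
    for pos, logical in enumerate(order):
        out[:, :, logical] = (packed >> (BITS * pos)) & 0xF
    return out.reshape(rows, cols * PACK_NUM)

def pack(values, order):
    # Inverse of unpack: place logical element `order[pos]` at bit slot `pos`.
    rows = values.shape[0]
    values = values.reshape(rows, -1, PACK_NUM).to(torch.int32)
    packed = torch.zeros(rows, values.shape[1], dtype=torch.int32)
    for pos, logical in enumerate(order):
        packed |= (values[:, :, logical] & 0xF) << (BITS * pos)
    return packed

def awq_qweight_to_gptq(qweight_awq):
    # Unpack with the assumed AWQ order, then repack sequentially (GPTQ-style).
    return pack(unpack(qweight_awq, AWQ_PACK_ORDER), GPTQ_PACK_ORDER)

The real conversion also has to handle the scales/qzeros layout and the zero-point convention the exllama/triton kernels expect, which is where the second bullet's overflow guard comes in.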

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@IlyasMoutawwakil (Member, Author) commented Feb 1, 2024

Tested with Llama-7b + AWQ + sharding on 4 MI250s (8 devices):

docker run --device /dev/dri/ --device /dev/kfd/ --shm-size 1g -p 9999:80 -v $volume:/data tgi-rocm-awq:latest --model-id TheBloke/Llama-2-7B-AWQ --quantize awq  --sharded true --num-shard 2

@IlyasMoutawwakil (Member, Author)

Also tested with the Triton kernel:

docker run --env DISABLE_EXLLAMA="True" --device /dev/dri/ --device /dev/kfd/ --shm-size 1g -p 9999:80 -v $volume:/data tgi-rocm-awq:latest --model-id TheBloke/Llama-2-7B-AWQ --quantize awq --sharded true --num-shard 2

@Narsil (Collaborator) left a comment

I love the idea of adding more support for ROCm, but the way it is achieved right now is misleading to users and does not really accomplish what we are supposed to.

Adding tooling to do AWQ->GPTQ is great, but we should not do it on behalf of users.

@@ -0,0 +1,146 @@
import torch
(Collaborator)

This is great; it should probably be a standalone conversion tool somewhere, no?

scales=scales,
bias=bias is not None,
)
if HAS_AWQ:
(Collaborator)

I'm really against that kind of hidden control flow.

Let's make it trivial for users to convert from AWQ to GPTQ externally, and then actually use GPTQ.
Making unasked-for, on-the-fly conversions is really not OK.

(Member, Author)

I see. Does using exllama as a backend for AWQ exclusively on ROCm make more sense?

(Collaborator)

I'm not happy about introducing a ROCm vs Nvidia difference either.

Don't you think we could instead have a good error message when trying to load AWQ on ROCm, with the error message including a trivial way to use that model anyway?

text-generation-launcher --model-id XXX-awq --quantize awq
# Error! AWQ on ROCm is not supported directly; you can use the GPTQ quantization to run these models
text-generation-launcher --model-id XXX-awq --quantize gptq

For instance. Wdyt? It seems quite obvious to users. It might be useful on Nvidia targets too, and we can keep the almost transparent feeling.

(Collaborator)

We do need to explicitly log that the conversion is happening, though.
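A minimal, hypothetical sketch of these two suggestions combined (the helper name, arguments, and messages are invented for illustration, not the launcher's or server's actual code):

# Hypothetical illustration only; names, arguments, and messages are made up.
def check_awq_support(quantize, quant_method, is_rocm):
    if quantize == "awq" and is_rocm:
        raise NotImplementedError(
            "AWQ kernels are not available on ROCm. This AWQ checkpoint can "
            "still be served by repacking it to the GPTQ format: relaunch "
            "with --quantize gptq."
        )
    if quantize == "gptq" and quant_method == "awq":
        # If the repacking does happen, say so explicitly rather than
        # converting silently.
        print("Repacking AWQ weights into the GPTQ format (exllama/triton kernels).")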

@Narsil (Collaborator) commented Feb 1, 2024

[Screenshots: benchmark plots, 2024-02-01]

Here are initial benchmarks on Mistral 7b. There is a latency improvement at BS=1, but throughput is reduced with this PR.

@IlyasMoutawwakil (Member, Author)

Thank you @Narsil for the comments and the benchmark.
Can I get some details about the benchmark you shared? Is it a comparison between llm-awq's GEMM (the default behavior) and Exllama (when llm-awq is unavailable) or Triton? Or GEMM on main vs this PR?

@Narsil (Collaborator) commented Feb 1, 2024

> Can I get some details about the benchmark you shared? Is it a comparison between llm-awq's GEMM (the default behavior) and Exllama (when llm-awq is unavailable) or Triton? Or GEMM on main vs this PR?

Yes, it's mistral-instruct-v0.2 7B, llm-awq vs your branch (without AWQ, so running the conversion).
(On Nvidia, of course :) )

@Narsil (Collaborator) commented Feb 1, 2024

Ignore the failing test; it's unrelated to this PR.

@IlyasMoutawwakil (Member, Author)

Yes, the numbers make sense, since they fall in the 10-90 total_seq_len (batch_size * seq_len) range.
This benchmark, AutoGPTQ/AutoGPTQ#484 (comment), shows where each kernel is best.

Narsil previously approved these changes Feb 8, 2024

@Narsil (Collaborator) left a comment

LGTM.
Let's wait for the tests (the failures are still quite odd).

@IlyasMoutawwakil (Member, Author)

Thank you @Narsil for updating the test snapshots!
@fxmarty FYI, it seems there are in fact some slight changes in logits between the gptq+exllama kernels with and without the bit-overflow correction. Some GPTQ models do overflow sometimes.
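For context, a toy illustration of the 4-bit wraparound that the overflow correction guards against (plain arithmetic, not TGI or kernel code):

# Adding 1 to a 4-bit zero point of 15 wraps to 0 if the addition stays
# within 4 bits, which corrupts dequantization for that group.
stored_zero = 15                     # largest representable 4-bit value
restored = (stored_zero + 1) & 0xF   # 4-bit wraparound: 16 -> 0
print(restored)                      # prints 0, not the intended 16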

@Narsil merged commit a4e5801 into main on Feb 9, 2024
@Narsil deleted the rocm-awq-support branch on February 9, 2024 at 09:45
kdamaszk pushed a commit to kdamaszk/tgi-gaudi that referenced this pull request Apr 29, 2024