ROCm AWQ support #1514

Merged: 27 commits into main on Feb 9, 2024
Conversation

@IlyasMoutawwakil (Member) commented Feb 1, 2024

What does this PR do?

This PR adds the possibility to run AWQ models with Exllama/GPTQ kernels, specifically for ROCm devices that support Exllama kernels but not AWQ's GEMM.

This is done by:

  • un-packing, reordering and re-packing the AWQ weights when --quantize gptq is used but the model's quant_method is awq (a rough sketch of this repacking is given below);
  • avoiding overflows when adding 1 to the zero points in the exllama and triton kernels.

Ref: casper-hansen/AutoAWQ#313
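To make the first bullet concrete, here is a minimal, hypothetical sketch of the unpack/reorder/repack idea, not the PR's actual conversion code: the pack orders, tensor layout, and function names are assumptions for illustration. Both formats pack eight 4-bit values into each int32, but in different element orders, so the conversion boils down to unpack, permute, repack.

import torch

BITS = 4
PACK_NUM = 32 // BITS                       # eight 4-bit values per int32
AWQ_PACK_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]   # assumed AWQ interleaved order
GPTQ_PACK_ORDER = list(range(PACK_NUM))     # assumed sequential GPTQ order

def unpack(packed, order):
    # Split each int32 into 8 nibbles; the nibble at bit slot `pos`
    # holds logical element `order[pos]` of the group.
    rows, cols = packed.shape
    out = torch.empty(rows, cols, PACK_NUM, dtype=torch.int32)
    for pos, logical in enumerate(order):
        out[:, :, logical] = (packed >> (BITS * pos)) & 0xF
    return out.reshape(rows, cols * PACK_NUM)

def pack(values, order):
    # Inverse of unpack: place logical element `order[pos]` at bit slot `pos`.
    rows = values.shape[0]
    values = values.reshape(rows, -1, PACK_NUM).to(torch.int32)
    packed = torch.zeros(rows, values.shape[1], dtype=torch.int32)
    for pos, logical in enumerate(order):
        packed |= (values[:, :, logical] & 0xF) << (BITS * pos)
    return packed

def awq_qweight_to_gptq(qweight_awq):
    # Unpack with the assumed AWQ order, then repack sequentially (GPTQ-style).
    return pack(unpack(qweight_awq, AWQ_PACK_ORDER), GPTQ_PACK_ORDER)

The real conversion also has to handle the scales/qzeros layout and the zero-point convention the exllama/triton kernels expect, which is where the second bullet's overflow guard comes in.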

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@IlyasMoutawwakil (Member, Author) commented Feb 1, 2024

Tested with Llama-7b + AWQ + sharding on 4 MI250s (8 devices):

docker run --device /dev/dri/ --device /dev/kfd/ --shm-size 1g -p 9999:80 -v $volume:/data tgi-rocm-awq:latest --model-id TheBloke/Llama-2-7B-AWQ --quantize awq  --sharded true --num-shard 2

@IlyasMoutawwakil (Member, Author)

Also tested with the Triton kernel:

docker run --env DISABLE_EXLLAMA="True" --device /dev/dri/ --device /dev/kfd/ --shm-size 1g -p 9999:80 -v $volume:/data tgi-rocm-awq:latest --model-id TheBloke/Llama-2-7B-AWQ --quantize awq --sharded true --num-shard 2

@Narsil (Collaborator) left a comment

I love the idea of adding more support for ROCm, but the way it is achieved right now is misleading to users and does not really accomplish what we are supposed to.

Adding tooling to do AWQ->GPTQ is great, but we should not do it on behalf of users.

@@ -0,0 +1,146 @@
import torch
(Collaborator)

This is great; it should probably be a standalone conversion tool somewhere, no?

scales=scales,
bias=bias is not None,
)
if HAS_AWQ:
(Collaborator)

I'm really against that kind of hidden control flow.

Let's make it trivial for users to convert from AWQ to GPTQ externally, and then actually use GPTQ.
Making unasked-for, on-the-fly conversions is really not OK.

(Member, Author)

I see. Does using exllama as a backend for AWQ exclusively on ROCm make more sense?

(Collaborator)

I'm not happy about introducing a ROCm vs Nvidia difference either.

Don't you think we could instead have a good error message when trying to load AWQ on ROCm, with the error message including a trivial way to use that model anyway?

text-generation-launcher --model-id XXX-awq --quantize awq
# Error! AWQ on ROCm is not supported directly; you can use the GPTQ quantization to run these models
text-generation-launcher --model-id XXX-awq --quantize gptq

For instance. Wdyt? It seems quite obvious to users. It might be useful on Nvidia targets too, and we can keep the almost transparent feeling.

(Collaborator)

We do need to explicitly log that the conversion is happening, though.
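A minimal, hypothetical sketch of these two suggestions combined (the helper name, arguments, and messages are invented for illustration, not the launcher's or server's actual code):

# Hypothetical illustration only; names, arguments, and messages are made up.
def check_awq_support(quantize, quant_method, is_rocm):
    if quantize == "awq" and is_rocm:
        raise NotImplementedError(
            "AWQ kernels are not available on ROCm. This AWQ checkpoint can "
            "still be served by repacking it to the GPTQ format: relaunch "
            "with --quantize gptq."
        )
    if quantize == "gptq" and quant_method == "awq":
        # If the repacking does happen, say so explicitly rather than
        # converting silently.
        print("Repacking AWQ weights into the GPTQ format (exllama/triton kernels).")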

@Narsil (Collaborator) commented Feb 1, 2024

[Screenshots: benchmark plots, 2024-02-01]

Here are initial benchmarks on Mistral 7b. There is a latency improvement at BS=1, but throughput is reduced with this PR.

@IlyasMoutawwakil (Member, Author)

Thank you @Narsil for the comments and the benchmark.
Can I get some details about the benchmark you shared? Is it a comparison between llm-awq's GEMM (the default behavior) and Exllama (when llm-awq is unavailable) or Triton? Or GEMM on main vs this PR?

@Narsil (Collaborator) commented Feb 1, 2024

> Can I get some details about the benchmark you shared? Is it a comparison between llm-awq's GEMM (the default behavior) and Exllama (when llm-awq is unavailable) or Triton? Or GEMM on main vs this PR?

Yes, it's mistral-instruct-v0.2 7B, llm-awq vs your branch (without AWQ, so running the conversion).
(On Nvidia, of course :) )

@Narsil (Collaborator) commented Feb 1, 2024

Ignore the failing test; it's unrelated to this PR.

@IlyasMoutawwakil (Member, Author)

Yes, the numbers make sense, since they fall in the 10-90 total_seq_len (batch_size * seq_len) range.
This benchmark, AutoGPTQ/AutoGPTQ#484 (comment), shows where each kernel is best.

Narsil previously approved these changes Feb 8, 2024

@Narsil (Collaborator) left a comment

LGTM.
Let's wait for the tests (the failures are still quite odd).

@IlyasMoutawwakil (Member, Author)

Thank you @Narsil for updating the test snapshots!
@fxmarty FYI, it seems there are in fact some slight changes in logits between the gptq+exllama kernels with and without the bit-overflow correction. Some GPTQ models do overflow sometimes.
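For context, a toy illustration of the 4-bit wraparound that the overflow correction guards against (plain arithmetic, not TGI or kernel code):

# Adding 1 to a 4-bit zero point of 15 wraps to 0 if the addition stays
# within 4 bits, which corrupts dequantization for that group.
stored_zero = 15                     # largest representable 4-bit value
restored = (stored_zero + 1) & 0xF   # 4-bit wraparound: 16 -> 0
print(restored)                      # prints 0, not the intended 16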

@Narsil merged commit a4e5801 into main on Feb 9, 2024
@Narsil deleted the rocm-awq-support branch on February 9, 2024 at 09:45
kdamaszk pushed a commit to kdamaszk/tgi-gaudi that referenced this pull request Apr 29, 2024