Removing all references to HQQ #869

Merged
7 commits merged on Jun 21, 2024
89 changes: 0 additions & 89 deletions .github/workflows/hqq-dtype.yml

This file was deleted.

6 changes: 0 additions & 6 deletions .github/workflows/pull.yml
@@ -555,12 +555,6 @@ jobs:
# python export.py --quant '{"linear:int4" : {"groupsize": 32}}' --tokenizer-path ${TOKENIZER_PATH} --gguf-path ${GGUF_PATH} --output-pte-path ${MODEL_DIR}/${MODEL_NAME}.pte
# python3 torchchat.py generate --tokenizer-path ${TOKENIZER_PATH} --gguf-path ${GGUF_PATH} --temperature 0 --pte-path ${MODEL_DIR}/${MODEL_NAME}.pte

echo "******************************************"
echo "******** HQQ group-wise quantized *******"
echo "******************************************"
# python export.py --quant '{"linear:hqq" : {"groupsize": 32}}' --tokenizer-path ${TOKENIZER_PATH} --gguf-path ${GGUF_PATH} --output-pte-path ${MODEL_DIR}/${MODEL_NAME}.pte
# python3 torchchat.py generate --tokenizer-path ${TOKENIZER_PATH} --gguf-path ${GGUF_PATH} --temperature 0 --pte-path ${MODEL_DIR}/${MODEL_NAME}.pte

echo "tests complete"
echo "******************************************"

11 changes: 0 additions & 11 deletions .github/workflows/runner-cuda-dtype.yml
@@ -60,17 +60,6 @@ jobs:

./cmake-out/aoti_run /tmp/model.so -d CUDA -z ${MODEL_DIR}/tokenizer.model -i "${PROMPT}"

echo "**********************************************"
echo "******** INT4 HQQ group-wise quantized *******"
echo "**********************************************"
python generate.py --dtype ${DTYPE} --device cuda --quant '{"linear:hqq" : {"groupsize": 32}}' --checkpoint-path ${MODEL_PATH} --temperature 0 > ./output_eager
cat ./output_eager
python generate.py --dtype ${DTYPE} --device cuda --compile --quant '{"linear:hqq" : {"groupsize": 32}}' --checkpoint-path ${MODEL_PATH} --temperature 0 > ./output_compiled
cat ./output_compiled
python export.py --dtype ${DTYPE} --device cuda --quant '{"linear:hqq" : {"groupsize": 32}}' --checkpoint-path ${MODEL_PATH} --output-dso-path ${MODEL_DIR}/${MODEL_NAME}.so
python generate.py --dtype ${DTYPE} --device cuda --checkpoint-path ${MODEL_PATH} --temperature 0 --dso-path ${MODEL_DIR}/${MODEL_NAME}.so > ./output_aoti
cat ./output_aoti

done

echo "tests complete"
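
For context, the same eager, compiled, and AOTI sequence can still be exercised with a scheme that remains after this PR. The sketch below simply substitutes linear:int4 (groupsize 32) for the removed linear:hqq configuration and reuses the variables defined in this workflow; it is a template, not an assertion that every step is supported for that scheme.

# Illustrative only: linear:int4 stands in for the removed HQQ configuration.
python generate.py --dtype ${DTYPE} --device cuda --quant '{"linear:int4" : {"groupsize": 32}}' --checkpoint-path ${MODEL_PATH} --temperature 0 > ./output_eager
python generate.py --dtype ${DTYPE} --device cuda --compile --quant '{"linear:int4" : {"groupsize": 32}}' --checkpoint-path ${MODEL_PATH} --temperature 0 > ./output_compiled
python export.py --dtype ${DTYPE} --device cuda --quant '{"linear:int4" : {"groupsize": 32}}' --checkpoint-path ${MODEL_PATH} --output-dso-path ${MODEL_DIR}/${MODEL_NAME}.so
python generate.py --dtype ${DTYPE} --device cuda --checkpoint-path ${MODEL_PATH} --temperature 0 --dso-path ${MODEL_DIR}/${MODEL_NAME}.so > ./output_aoti
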
3 changes: 0 additions & 3 deletions README.md
@@ -458,9 +458,6 @@ awesome libraries and tools you've built around local LLM inference.
Fast!](https://github.com/pytorch-labs/gpt-fast), which we have
directly adopted (both ideas and code) from his repo.

-* Mobius Labs as the authors of the HQQ quantization algorithms
-included in this distribution.
-

## License

2 changes: 1 addition & 1 deletion cli.py
@@ -198,7 +198,7 @@ def add_arguments_for_verb(parser, verb: str):
default="{ }",
help=(
'Quantization options. pass in as \'{"<mode>" : {"<argname1>" : <argval1>, "<argname2>" : <argval2>,...},}\' '
+ "modes are: embedding, linear:int8, linear:int4, linear:gptq, linear:hqq, linear:a8w4dq, precision."
+ "modes are: embedding, linear:int8, linear:int4, linear:gptq, linear:a8w4dq, precision."
),
)
parser.add_argument(
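
For orientation, the --quant flag shown above takes a small JSON dictionary mapping each mode to its options. A minimal sketch of an eager-mode run using two of the remaining modes follows; the checkpoint path and the bitwidth/groupsize values are illustrative, drawn from the ranges documented in docs/quantization.md.

# Hypothetical invocation: 4-bit grouped embedding table plus 8-bit grouped linear weights.
python3 generate.py --quant '{"embedding": {"bitwidth": 4, "groupsize": 32}, "linear:int8": {"groupsize": 256}}' --checkpoint-path ${MODEL_PATH} --temperature 0
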
3 changes: 0 additions & 3 deletions docs/ACKNOWLEDGEMENTS.md
@@ -27,6 +27,3 @@
Fast!](https://github.com/pytorch-labs/gpt-fast), which we have
directly adopted (both ideas and code) from his repo.

-* Mobius Labs as the authors of the HQQ quantization algorithms
-included in this distribution.
-
7 changes: 0 additions & 7 deletions docs/ADVANCED-USERS.md
@@ -376,16 +376,12 @@ To compress models, torchchat offers a variety of strategies:

* dynamic activation quantization with weight quantization: a8w4dq

-In addition, we support GPTQ and HQQ for improving the quality of 4b
-weight-only quantization. Support for HQQ is a work in progress.
-
| compression | FP precision | weight quantization | dynamic activation quantization |
|--|--|--|--|
embedding table (symmetric) | fp32, fp16, bf16 | 8b (group/channel), 4b (group/channel) | n/a |
linear operator (symmetric) | fp32, fp16, bf16 | 8b (group/channel) | n/a |
linear operator (asymmetric) | n/a | 4b (group), a6w4dq | a8w4dq (group) |
linear operator (asymmetric) with GPTQ | n/a | 4b (group) | n/a |
-linear operator (asymmetric) with HQQ | n/a | work in progress | n/a |

## Model precision (dtype precision setting)
On top of quantizing models with quantization schemes mentioned above, models can be converted
@@ -450,9 +446,6 @@ strategies:

* dynamic activation quantization with weight quantization: a8w4dq

-In addition, we support GPTQ and HQQ for improving the quality of 4b
-weight-only quantization. Support for HQQ is a work in progress.
-
You can find instructions for quantizing models in
[docs/quantization.md](quantization.md). Advantageously,
quantization is available in eager mode as well as during export,
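
Because quantization is available in eager mode as well as during export, one of the remaining strategies from the table above can also be applied while exporting, as in the sketch below; the groupsize, paths, and flag combination are illustrative placeholders modeled on the workflow commands earlier in this PR.

# Hypothetical export-time quantization with 8-bit dynamic activations and 4-bit grouped weights (a8w4dq).
python export.py --quant '{"linear:a8w4dq" : {"groupsize": 128}}' --checkpoint-path ${MODEL_PATH} --output-pte-path ${MODEL_DIR}/${MODEL_NAME}.pte
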
16 changes: 0 additions & 16 deletions docs/quantization.md
@@ -18,7 +18,6 @@ While quantization can potentially degrade the model's performance, the methods
|--|--|--|--|--|--|--|--|
| linear (asymmetric) | [8, 4]* | [32, 64, 128, 256]** | | ✅ | ✅ | 🚧 |
| linear with GPTQ*** (asymmetric) | |[32, 64, 128, 256]** | | ✅ | ✅ | ❌ |
-| linear with HQQ*** (asymmetric) | |[32, 64, 128, 256]** | | ✅ | ✅ | ❌ |
| linear with dynamic activations (symmetric) | | [32, 64, 128, 256]* | a8w4dq | 🚧 |🚧 | ✅ |

### Embedding Quantization
@@ -40,20 +39,6 @@ on-device usecases.
model quality and accuracy, and larger groupsize for further
improving performance. Set 0 for channelwise quantization.

-*** [GPTQ](https://arxiv.org/abs/2210.17323) and
-[HQQ](https://mobiusml.github.io/hqq_blog/) are two different
-algorithms to address accuracy loss when using lower bit
-quantization. Due to HQQ relying on data/calibration free
-quantization, it tends to take less time to quantize model.
-HQQ is currently enabled with axis=1 configuration.
-
-Presently, torchchat includes a subset of the HQQ distribution in
-the hqq subdirectory, but HQQ is not installed by default with torchchat,
-due to dependence incompatibilities between torchchat and the hqq
-project. We may integrate hqq via requirements.txt in the future.
-(As a result, there's presently no upstream path for changes and/or
-improvements to HQQ.)
-
+ Should support non-power-of-2-groups as well.

## Quantization Profiles
@@ -96,7 +81,6 @@ for valid `bitwidth` and `groupsize` values.
| linear (asymmetric) | `'{"linear:int<bitwidth>" : {"groupsize" : <groupsize>}}'` |
| linear with dynamic activations (symmetric) | `'{"linear:a8w4dq" : {"groupsize" : <groupsize>}}'`|
| linear with GPTQ (asymmetric) | `'{"linear:int4-gptq" : {"groupsize" : <groupsize>}}'`|
-| linear with HQQ (asymmetric) |`'{"linear:hqq" : {"groupsize" : <groupsize>}}'`|
| embedding | `'{"embedding": {"bitwidth": <bitwidth>, "groupsize":<groupsize>}}'` |

See the available quantization schemes [here](https://github.com/pytorch/torchchat/blob/main/quantization/quantize.py#L1260-L1266).
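
As a concrete instance of the scheme strings above, a GPTQ run in eager mode could look like the sketch below; the checkpoint path is a placeholder and the groupsize is one of the documented values.

# Hypothetical: 4-bit GPTQ weight quantization with groups of 32, eager generation.
python3 generate.py --quant '{"linear:int4-gptq" : {"groupsize": 32}}' --checkpoint-path ${MODEL_PATH} --temperature 0
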