
Commit 2086fac

Jack-Khuu authored and malfet committed
Removing all references to HQQ (#869)
* Removing all references to HQQ
* Updating lm_eval version (#865): fixing CI related to EleutherAI/wikitext_document_level changing requirements from using HF Datasets
* Pinning numpy to under 2.0 (#867)
* Update Quant call using llama.cpp (#868): llama.cpp did a BC-breaking refactor (ggml-org/llama.cpp@1c641e6), resulting in some of our CI breaking; this updates our CI to match llama.cpp's schema
* Updating torch nightly to pick up aoti improvements in 128339 (#862); update the torch version to 2.5
* Creating an initial Quantization Directory (#863): initial creation of a quantization directory, moving qops, updating import
1 parent 09d9896 commit 2086fac

File tree

19 files changed (+1, -2409 lines)


.github/workflows/hqq-dtype.yml

Lines changed: 0 additions & 89 deletions
This file was deleted.

.github/workflows/pull.yml

Lines changed: 0 additions & 6 deletions
@@ -555,12 +555,6 @@ jobs:
 # python export.py --quant '{"linear:int4" : {"groupsize": 32}}' --tokenizer-path ${TOKENIZER_PATH} --gguf-path ${GGUF_PATH} --output-pte-path ${MODEL_DIR}/${MODEL_NAME}.pte
 # python3 torchchat.py generate --tokenizer-path ${TOKENIZER_PATH} --gguf-path ${GGUF_PATH} --temperature 0 --pte-path ${MODEL_DIR}/${MODEL_NAME}.pte
 
-echo "******************************************"
-echo "******** HQQ group-wise quantized *******"
-echo "******************************************"
-# python export.py --quant '{"linear:hqq" : {"groupsize": 32}}' --tokenizer-path ${TOKENIZER_PATH} --gguf-path ${GGUF_PATH} --output-pte-path ${MODEL_DIR}/${MODEL_NAME}.pte
-# python3 torchchat.py generate --tokenizer-path ${TOKENIZER_PATH} --gguf-path ${GGUF_PATH} --temperature 0 --pte-path ${MODEL_DIR}/${MODEL_NAME}.pte
-
 echo "tests complete"
 echo "******************************************"

.github/workflows/runner-cuda-dtype.yml

Lines changed: 0 additions & 11 deletions
@@ -60,17 +60,6 @@ jobs:
 
 ./cmake-out/aoti_run /tmp/model.so -d CUDA -z ${MODEL_DIR}/tokenizer.model -i "${PROMPT}"
 
-echo "**********************************************"
-echo "******** INT4 HQQ group-wise quantized *******"
-echo "**********************************************"
-python generate.py --dtype ${DTYPE} --device cuda --quant '{"linear:hqq" : {"groupsize": 32}}' --checkpoint-path ${MODEL_PATH} --temperature 0 > ./output_eager
-cat ./output_eager
-python generate.py --dtype ${DTYPE} --device cuda --compile --quant '{"linear:hqq" : {"groupsize": 32}}' --checkpoint-path ${MODEL_PATH} --temperature 0 > ./output_compiled
-cat ./output_compiled
-python export.py --dtype ${DTYPE} --device cuda --quant '{"linear:hqq" : {"groupsize": 32}}' --checkpoint-path ${MODEL_PATH} --output-dso-path ${MODEL_DIR}/${MODEL_NAME}.so
-python generate.py --dtype ${DTYPE} --device cuda --checkpoint-path ${MODEL_PATH} --temperature 0 --dso-path ${MODEL_DIR}/${MODEL_NAME}.so > ./output_aoti
-cat ./output_aoti
-
 done
 
 echo "tests complete"

README.md

Lines changed: 0 additions & 3 deletions
@@ -458,9 +458,6 @@ awesome libraries and tools you've built around local LLM inference.
 Fast!](https://github.com/pytorch-labs/gpt-fast), which we have
 directly adopted (both ideas and code) from his repo.
 
-* Mobius Labs as the authors of the HQQ quantization algorithms
-included in this distribution.
-
 
 ## License

cli.py

Lines changed: 1 addition & 1 deletion
@@ -198,7 +198,7 @@ def add_arguments_for_verb(parser, verb: str):
         default="{ }",
         help=(
             'Quantization options. pass in as \'{"<mode>" : {"<argname1>" : <argval1>, "<argname2>" : <argval2>,...},}\' '
-            + "modes are: embedding, linear:int8, linear:int4, linear:gptq, linear:hqq, linear:a8w4dq, precision."
+            + "modes are: embedding, linear:int8, linear:int4, linear:gptq, linear:a8w4dq, precision."
         ),
     )
     parser.add_argument(
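The retained help text spells out the --quant schema: a JSON object mapping each mode to its arguments. As a minimal sketch, assuming default dtype/device and a placeholder checkpoint path that is not part of this commit, an eager-mode run with one of the remaining modes could look like:

python generate.py --quant '{"linear:int8": {"groupsize": 32}}' --checkpoint-path ${MODEL_PATH} --temperature 0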

docs/ACKNOWLEDGEMENTS.md

Lines changed: 0 additions & 3 deletions
@@ -27,6 +27,3 @@
 Fast!](https://github.com/pytorch-labs/gpt-fast), which we have
 directly adopted (both ideas and code) from his repo.
 
-* Mobius Labs as the authors of the HQQ quantization algorithms
-included in this distribution.
-

docs/ADVANCED-USERS.md

Lines changed: 0 additions & 7 deletions
@@ -376,16 +376,12 @@ To compress models, torchchat offers a variety of strategies:
 
 * dynamic activation quantization with weight quantization: a8w4dq
 
-In addition, we support GPTQ and HQQ for improving the quality of 4b
-weight-only quantization. Support for HQQ is a work in progress.
-
 | compression | FP precision | weight quantization | dynamic activation quantization |
 |--|--|--|--|
 embedding table (symmetric) | fp32, fp16, bf16 | 8b (group/channel), 4b (group/channel) | n/a |
 linear operator (symmetric) | fp32, fp16, bf16 | 8b (group/channel) | n/a |
 linear operator (asymmetric) | n/a | 4b (group), a6w4dq | a8w4dq (group) |
 linear operator (asymmetric) with GPTQ | n/a | 4b (group) | n/a |
-linear operator (asymmetric) with HQQ | n/a | work in progress | n/a |
 
 ## Model precision (dtype precision setting)
 On top of quantizing models with quantization schemes mentioned above, models can be converted
@@ -450,9 +446,6 @@ strategies:
 
 * dynamic activation quantization with weight quantization: a8w4dq
 
-In addition, we support GPTQ and HQQ for improving the quality of 4b
-weight-only quantization. Support for HQQ is a work in progress.
-
 You can find instructions for quantizing models in
 [docs/quantization.md](file:///./quantization.md). Advantageously,
 quantization is available in eager mode as well as during export,
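As that doc passage notes, the same quantization spec works both in eager mode and during export. A rough sketch mirroring the workflow commands elsewhere in this commit, with placeholder paths and an assumed linear:int8 scheme (not verified against the CI):

# eager-mode generation with the quantization applied on the fly
python generate.py --dtype bf16 --device cuda --quant '{"linear:int8": {"groupsize": 32}}' --checkpoint-path ${MODEL_PATH} --temperature 0
# export with the same spec, then generate from the compiled artifact
python export.py --dtype bf16 --device cuda --quant '{"linear:int8": {"groupsize": 32}}' --checkpoint-path ${MODEL_PATH} --output-dso-path ${MODEL_DIR}/model.so
python generate.py --dtype bf16 --device cuda --checkpoint-path ${MODEL_PATH} --temperature 0 --dso-path ${MODEL_DIR}/model.so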

docs/quantization.md

Lines changed: 0 additions & 16 deletions
@@ -18,7 +18,6 @@ While quantization can potentially degrade the model's performance, the methods
 |--|--|--|--|--|--|--|--|
 | linear (asymmetric) | [8, 4]* | [32, 64, 128, 256]** | ||| 🚧 |
 | linear with GPTQ*** (asymmetric) | |[32, 64, 128, 256]** | ||||
-| linear with HQQ*** (asymmetric) | |[32, 64, 128, 256]** | ||||
 | linear with dynamic activations (symmetric) | | [32, 64, 128, 256]* | a8w4dq | 🚧 |🚧 ||
 
 ### Embedding Quantization
@@ -40,20 +39,6 @@ on-device usecases.
 model quality and accuracy, and larger groupsize for further
 improving performance. Set 0 for channelwise quantization.
 
-*** [GPTQ](https://arxiv.org/abs/2210.17323) and
-[HQQ](https://mobiusml.github.io/hqq_blog/) are two different
-algorithms to address accuracy loss when using lower bit
-quantization. Due to HQQ relying on data/calibration free
-quantization, it tends to take less time to quantize model.
-HQQ is currently enabled with axis=1 configuration.
-
-Presently, torchchat includes a subset of the HQQ distribution in
-the hqq subdirectory, but HQQ is not installed by default with torchchat,
-due to dependence incompatibilities between torchchat and the hqq
-project. We may integrate hqq via requirements.txt in the future.
-(As a result, there's presently no upstream path for changes and/or
-improvements to HQQ.)
-
 + Should support non-power-of-2-groups as well.
 
 ## Quantization Profiles
@@ -96,7 +81,6 @@ for valid `bitwidth` and `groupsize` values.
 | linear (asymmetric) | `'{"linear:int<bitwidth>" : {"groupsize" : <groupsize>}}'` |
 | linear with dynamic activations (symmetric) | `'{"linear:a8w4dq" : {"groupsize" : <groupsize>}}'`|
 | linear with GPTQ (asymmetric) | `'{"linear:int4-gptq" : {"groupsize" : <groupsize>}}'`|
-| linear with HQQ (asymmetric) |`'{"linear:hqq" : {"groupsize" : <groupsize>}}'`|
 | embedding | `'{"embedding": {"bitwidth": <bitwidth>, "groupsize":<groupsize>}}'` |
 
 See the available quantization schemes [here](https://github.com/pytorch/torchchat/blob/main/quantization/quantize.py#L1260-L1266).
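The remaining rows compose directly into the --quant argument. A sketch reusing the GGUF export/generate commands from the pull.yml diff above, swapping in the embedding scheme from this table; the bitwidth and groupsize values are illustrative assumptions, not taken from this commit:

python export.py --quant '{"embedding": {"bitwidth": 4, "groupsize": 32}}' --tokenizer-path ${TOKENIZER_PATH} --gguf-path ${GGUF_PATH} --output-pte-path ${MODEL_DIR}/model.pte
python3 torchchat.py generate --tokenizer-path ${TOKENIZER_PATH} --gguf-path ${GGUF_PATH} --temperature 0 --pte-path ${MODEL_DIR}/model.pte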
