Removing all references to HQQ #869

Merged
7 commits merged on Jun 21, 2024
89 changes: 0 additions & 89 deletions .github/workflows/hqq-dtype.yml

This file was deleted.

6 changes: 0 additions & 6 deletions .github/workflows/pull.yml
@@ -555,12 +555,6 @@ jobs:
# python export.py --quant '{"linear:int4" : {"groupsize": 32}}' --tokenizer-path ${TOKENIZER_PATH} --gguf-path ${GGUF_PATH} --output-pte-path ${MODEL_DIR}/${MODEL_NAME}.pte
# python3 torchchat.py generate --tokenizer-path ${TOKENIZER_PATH} --gguf-path ${GGUF_PATH} --temperature 0 --pte-path ${MODEL_DIR}/${MODEL_NAME}.pte

echo "******************************************"
echo "******** HQQ group-wise quantized *******"
echo "******************************************"
# python export.py --quant '{"linear:hqq" : {"groupsize": 32}}' --tokenizer-path ${TOKENIZER_PATH} --gguf-path ${GGUF_PATH} --output-pte-path ${MODEL_DIR}/${MODEL_NAME}.pte
# python3 torchchat.py generate --tokenizer-path ${TOKENIZER_PATH} --gguf-path ${GGUF_PATH} --temperature 0 --pte-path ${MODEL_DIR}/${MODEL_NAME}.pte

echo "tests complete"
echo "******************************************"

11 changes: 0 additions & 11 deletions .github/workflows/runner-cuda-dtype.yml
@@ -60,17 +60,6 @@ jobs:

./cmake-out/aoti_run /tmp/model.so -d CUDA -z ${MODEL_DIR}/tokenizer.model -i "${PROMPT}"

echo "**********************************************"
echo "******** INT4 HQQ group-wise quantized *******"
echo "**********************************************"
python generate.py --dtype ${DTYPE} --device cuda --quant '{"linear:hqq" : {"groupsize": 32}}' --checkpoint-path ${MODEL_PATH} --temperature 0 > ./output_eager
cat ./output_eager
python generate.py --dtype ${DTYPE} --device cuda --compile --quant '{"linear:hqq" : {"groupsize": 32}}' --checkpoint-path ${MODEL_PATH} --temperature 0 > ./output_compiled
cat ./output_compiled
python export.py --dtype ${DTYPE} --device cuda --quant '{"linear:hqq" : {"groupsize": 32}}' --checkpoint-path ${MODEL_PATH} --output-dso-path ${MODEL_DIR}/${MODEL_NAME}.so
python generate.py --dtype ${DTYPE} --device cuda --checkpoint-path ${MODEL_PATH} --temperature 0 --dso-path ${MODEL_DIR}/${MODEL_NAME}.so > ./output_aoti
cat ./output_aoti

done

echo "tests complete"
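
For context, the same eager, compiled, and AOTI sequence can still be exercised with a scheme that remains after this PR. The sketch below simply substitutes linear:int4 (groupsize 32) for the removed linear:hqq configuration and reuses the variables defined in this workflow; it is a template, not an assertion that every step is supported for that scheme.

# Illustrative only: linear:int4 stands in for the removed HQQ configuration.
python generate.py --dtype ${DTYPE} --device cuda --quant '{"linear:int4" : {"groupsize": 32}}' --checkpoint-path ${MODEL_PATH} --temperature 0 > ./output_eager
python generate.py --dtype ${DTYPE} --device cuda --compile --quant '{"linear:int4" : {"groupsize": 32}}' --checkpoint-path ${MODEL_PATH} --temperature 0 > ./output_compiled
python export.py --dtype ${DTYPE} --device cuda --quant '{"linear:int4" : {"groupsize": 32}}' --checkpoint-path ${MODEL_PATH} --output-dso-path ${MODEL_DIR}/${MODEL_NAME}.so
python generate.py --dtype ${DTYPE} --device cuda --checkpoint-path ${MODEL_PATH} --temperature 0 --dso-path ${MODEL_DIR}/${MODEL_NAME}.so > ./output_aoti
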
3 changes: 0 additions & 3 deletions README.md
@@ -458,9 +458,6 @@ awesome libraries and tools you've built around local LLM inference.
Fast!](https://github.com/pytorch-labs/gpt-fast), which we have
directly adopted (both ideas and code) from his repo.

-* Mobius Labs as the authors of the HQQ quantization algorithms
-included in this distribution.
-

## License

2 changes: 1 addition & 1 deletion cli.py
@@ -198,7 +198,7 @@ def add_arguments_for_verb(parser, verb: str):
default="{ }",
help=(
'Quantization options. pass in as \'{"<mode>" : {"<argname1>" : <argval1>, "<argname2>" : <argval2>,...},}\' '
+ "modes are: embedding, linear:int8, linear:int4, linear:gptq, linear:hqq, linear:a8w4dq, precision."
+ "modes are: embedding, linear:int8, linear:int4, linear:gptq, linear:a8w4dq, precision."
),
)
parser.add_argument(
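
For orientation, the --quant flag shown above takes a small JSON dictionary mapping each mode to its options. A minimal sketch of an eager-mode run using two of the remaining modes follows; the checkpoint path and the bitwidth/groupsize values are illustrative, drawn from the ranges documented in docs/quantization.md.

# Hypothetical invocation: 4-bit grouped embedding table plus 8-bit grouped linear weights.
python3 generate.py --quant '{"embedding": {"bitwidth": 4, "groupsize": 32}, "linear:int8": {"groupsize": 256}}' --checkpoint-path ${MODEL_PATH} --temperature 0
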
3 changes: 0 additions & 3 deletions docs/ACKNOWLEDGEMENTS.md
@@ -27,6 +27,3 @@
Fast!](https://github.com/pytorch-labs/gpt-fast), which we have
directly adopted (both ideas and code) from his repo.

-* Mobius Labs as the authors of the HQQ quantization algorithms
-included in this distribution.
-
7 changes: 0 additions & 7 deletions docs/ADVANCED-USERS.md
@@ -376,16 +376,12 @@ To compress models, torchchat offers a variety of strategies:

* dynamic activation quantization with weight quantization: a8w4dq

-In addition, we support GPTQ and HQQ for improving the quality of 4b
-weight-only quantization. Support for HQQ is a work in progress.
-
| compression | FP precision | weight quantization | dynamic activation quantization |
|--|--|--|--|
embedding table (symmetric) | fp32, fp16, bf16 | 8b (group/channel), 4b (group/channel) | n/a |
linear operator (symmetric) | fp32, fp16, bf16 | 8b (group/channel) | n/a |
linear operator (asymmetric) | n/a | 4b (group), a6w4dq | a8w4dq (group) |
linear operator (asymmetric) with GPTQ | n/a | 4b (group) | n/a |
-linear operator (asymmetric) with HQQ | n/a | work in progress | n/a |

## Model precision (dtype precision setting)
On top of quantizing models with quantization schemes mentioned above, models can be converted
@@ -450,9 +446,6 @@ strategies:

* dynamic activation quantization with weight quantization: a8w4dq

-In addition, we support GPTQ and HQQ for improving the quality of 4b
-weight-only quantization. Support for HQQ is a work in progress.
-
You can find instructions for quantizing models in
[docs/quantization.md](quantization.md). Advantageously,
quantization is available in eager mode as well as during export,
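
Because quantization is available in eager mode as well as during export, one of the remaining strategies from the table above can also be applied while exporting, as in the sketch below; the groupsize, paths, and flag combination are illustrative placeholders modeled on the workflow commands earlier in this PR.

# Hypothetical export-time quantization with 8-bit dynamic activations and 4-bit grouped weights (a8w4dq).
python export.py --quant '{"linear:a8w4dq" : {"groupsize": 128}}' --checkpoint-path ${MODEL_PATH} --output-pte-path ${MODEL_DIR}/${MODEL_NAME}.pte
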
16 changes: 0 additions & 16 deletions docs/quantization.md
@@ -18,7 +18,6 @@ While quantization can potentially degrade the model's performance, the methods
|--|--|--|--|--|--|--|--|
| linear (asymmetric) | [8, 4]* | [32, 64, 128, 256]** | | ✅ | ✅ | 🚧 |
| linear with GPTQ*** (asymmetric) | |[32, 64, 128, 256]** | | ✅ | ✅ | ❌ |
-| linear with HQQ*** (asymmetric) | |[32, 64, 128, 256]** | | ✅ | ✅ | ❌ |
| linear with dynamic activations (symmetric) | | [32, 64, 128, 256]* | a8w4dq | 🚧 |🚧 | ✅ |

### Embedding Quantization
@@ -40,20 +39,6 @@ on-device usecases.
model quality and accuracy, and larger groupsize for further
improving performance. Set 0 for channelwise quantization.

-*** [GPTQ](https://arxiv.org/abs/2210.17323) and
-[HQQ](https://mobiusml.github.io/hqq_blog/) are two different
-algorithms to address accuracy loss when using lower bit
-quantization. Due to HQQ relying on data/calibration free
-quantization, it tends to take less time to quantize model.
-HQQ is currently enabled with axis=1 configuration.
-
-Presently, torchchat includes a subset of the HQQ distribution in
-the hqq subdirectory, but HQQ is not installed by default with torchchat,
-due to dependence incompatibilities between torchchat and the hqq
-project. We may integrate hqq via requirements.txt in the future.
-(As a result, there's presently no upstream path for changes and/or
-improvements to HQQ.)
-
+ Should support non-power-of-2-groups as well.

## Quantization Profiles
@@ -96,7 +81,6 @@ for valid `bitwidth` and `groupsize` values.
| linear (asymmetric) | `'{"linear:int<bitwidth>" : {"groupsize" : <groupsize>}}'` |
| linear with dynamic activations (symmetric) | `'{"linear:a8w4dq" : {"groupsize" : <groupsize>}}'`|
| linear with GPTQ (asymmetric) | `'{"linear:int4-gptq" : {"groupsize" : <groupsize>}}'`|
-| linear with HQQ (asymmetric) |`'{"linear:hqq" : {"groupsize" : <groupsize>}}'`|
| embedding | `'{"embedding": {"bitwidth": <bitwidth>, "groupsize":<groupsize>}}'` |

See the available quantization schemes [here](https://github.com/pytorch/torchchat/blob/main/quantization/quantize.py#L1260-L1266).
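
As a concrete instance of the scheme strings above, a GPTQ run in eager mode could look like the sketch below; the checkpoint path is a placeholder and the groupsize is one of the documented values.

# Hypothetical: 4-bit GPTQ weight quantization with groups of 32, eager generation.
python3 generate.py --quant '{"linear:int4-gptq" : {"groupsize": 32}}' --checkpoint-path ${MODEL_PATH} --temperature 0
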