
Commit c980472

mikekgfb, shoumikhin, metascroy, malfet, and lucylq committed
Quantization, fp acceleration, and testing (#572)
* code beautification
* code beautification, move functions together
* make --device fast the default (#515)
* make --device fast the default
* Update iOS.md (#517)
* Update iOS.md
* Update iOS.md
* Pip to pip3 (#504)
* remove macos-12 test
* pip to pip3
* break aoti CI jobs separately (#500)
* init
* fixes
* more fixes
* fixes
* fix
* fix
* bug fix
* add objcopy update
* suppress int8
* undefined variable

---------

Co-authored-by: Michael Gschwind <[email protected]>

* Support llama3 in chat in run.cpp (#486)
* refactor chat runner in preparation for llama3
* add sketch for llama3 prompt template and move to returning tokens
* fix tiktoken
* fixes to chat
* add default llama_ver
* Add tests for quantize json, add cuda device specification and precision to cuda.json (#519)
* remove code for no KV Cache path (#527)
* Update ADVANCED-USERS.md (#529): update the Advanced Users description to reflect changes in the repo since the description was initially created
* runner-aoti on cuda (#531)
* runner-aoti on cuda
* transfer results back to CPU
* transfer results back to CPU
* runner-aoti on cuda
* Update runner_build.md (#530): update the description of the runner and build process in runner_build.md
* clean up runner code a little (#532)
* clean up runner code a little
* update
* update
* pull out generate loop in chat
* updates
* edit docs
* typo
* move int8 linear class and function into qops.py (#534)
* add dtype tests for runner-aoti + runner-et (#539)
* add dtype tests for runner-aoti + runner-et
* typo
* Quantized embedding (#536)
* move int8 linear class and function into qops.py
* move Quantized Embedding to qops.py
* Move Linear int4 to qops (#537)
* move int8 linear class and function into qops.py
* move Quantized Embedding to qops.py
* move int4 linear to qops
* Revert "add dtype tests for runner-aoti + runner-et (#539)" (#548): this reverts commit a7a24577a65be67ac9ae4dc05452f35d9c49e5d1
* fix generate for llama3 (#538)
* fix generate for llama3
* switch more things to C
* remove C++ header
* add delegation visualization instructions (#551)
* Add dtype runner aoti (#552)
* add dtype tests for runner-aoti + runner-et
* typo
* add dtype test runner-aoti
* test sdpa with fp16 (#553)
* test sdpa with fp16
* kv cache fp32
* typo
* update (#560)
* Only support newest versions of lm-eval (#556): remove support for lm-eval 0.3 to reduce the options we have (test plan: CI)
* split cpu eval CI by dtype (#554)
* split cpu eval CI by dtype
* fix
* differentiate names with checks
* keep one name the same as old
* fix
* Removing duplicate HF issue message from README (#559) (Co-authored-by: Michael Gschwind <[email protected]>)
* doc updates (#567)
* Add VM-safe MPS check

---------

Co-authored-by: Anthony Shoumikhin <[email protected]>
Co-authored-by: metascroy <[email protected]>
Co-authored-by: Nikita Shulga <[email protected]>
Co-authored-by: lucylq <[email protected]>
Co-authored-by: Jerry Zhang <[email protected]>
Co-authored-by: Jack-Khuu <[email protected]>

* add unpacking support (#525)
* add unpacking support
* fix typos and linter
* perform parallel prefill when possible (#568)
* perform parallel prefill when possible
* typo
* disable hack
* remove print
* remove debug messages which prevent export
* fixes
* stream results in generate.py (#571)
* remove logging interfering with export

---------

Co-authored-by: Anthony Shoumikhin <[email protected]>
Co-authored-by: metascroy <[email protected]>
Co-authored-by: Nikita Shulga <[email protected]>
Co-authored-by: lucylq <[email protected]>
Co-authored-by: Jerry Zhang <[email protected]>
Co-authored-by: Jack-Khuu <[email protected]>
1 parent e677fe8 commit c980472

File tree

7 files changed: +134, -23 lines


.github/workflows/more-tests.yml

Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@
+name: Run parallel prefill
+
+on:
+  pull_request:
+  push:
+    branches:
+      - main
+  workflow_dispatch:
+
+jobs:
+  test-cuda:
+    uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
+    with:
+      runner: linux.g5.4xlarge.nvidia.gpu
+      gpu-arch-type: cuda
+      gpu-arch-version: "12.1"
+      script: |
+        echo "::group::Print machine info"
+        uname -a
+        echo "::endgroup::"
+
+        echo "::group::Install newer objcopy that supports --set-section-alignment"
+        yum install -y devtoolset-10-binutils
+        export PATH=/opt/rh/devtoolset-10/root/usr/bin/:$PATH
+        echo "::endgroup::"
+
+        echo "::group::Install requirements"
+        pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121
+        pip3 install -r requirements.txt
+        pip3 list
+        python3 -c 'import torch;print(f"torch: {torch.__version__, torch.version.git_version}")'
+        echo "::endgroup::"
+
+        echo "::group::Download checkpoints"
+        mkdir -p checkpoints/stories15M
+        pushd checkpoints/stories15M
+        wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.pt
+        wget https://github.com/karpathy/llama2.c/raw/master/tokenizer.model
+        popd
+        echo "::endgroup::"
+
+        echo "::group::Run inference"
+        export MODEL_PATH=checkpoints/stories15M/stories15M.pt
+        export MODEL_NAME=stories15M
+        export MODEL_DIR=/tmp
+
+        for DTYPE in bfloat16 float16 float32; do
+          ###################################################################
+          # group with different temperatures and top-k values
+          python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --temperature 0
+          python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --temperature 0.9
+          python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --temperature 1.0
+          python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --top-k 100
+          python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --top-k 200
+          python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --top-k 500
+          ###################################################################
+          # group with different temperatures and top-k values, with compile and compile-prefill
+          python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --temperature 0 --compile --compile-prefill
+          python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --temperature 0.9 --compile --compile-prefill
+          python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --temperature 1.0 --compile --compile-prefill
+          python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --top-k 100 --compile --compile-prefill
+          python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --top-k 200 --compile --compile-prefill
+          python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --top-k 500 --compile --compile-prefill
+          ###################################################################
+          # group with different temperatures and top-k values, with sequential prefill
+          python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --temperature 0 --sequential-prefill
+          python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --temperature 0.9 --sequential-prefill
+          python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --temperature 1.0 --sequential-prefill
+          python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --top-k 100 --sequential-prefill
+          python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --top-k 200 --sequential-prefill
+          python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --top-k 500 --sequential-prefill
+          ###################################################################
+          # group with different temperatures and top-k values, with sequential prefill and compile
+          python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --temperature 0 --sequential-prefill --compile
+          python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --temperature 0.9 --sequential-prefill --compile
+          python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --temperature 1.0 --sequential-prefill --compile
+          python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --top-k 100 --sequential-prefill --compile
+          python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --top-k 200 --sequential-prefill --compile
+          python generate.py --checkpoint-path ${MODEL_PATH} --device cpu --dtype ${DTYPE} --top-k 500 --sequential-prefill --compile
+        done
+
+        echo "tests complete"
+        echo "******************************************"
+        echo "::endgroup::"
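
The sweep above can also be reproduced locally without the CI wrapper. The following is a minimal sketch in Python, assuming generate.py and the stories15M checkpoint are present at the same paths the workflow downloads them to; the flags are taken verbatim from the commands above.

# Minimal local sketch of the dtype/prefill sweep in the workflow above.
# Assumes generate.py and checkpoints/stories15M/stories15M.pt exist as in the CI script.
import itertools
import subprocess

MODEL_PATH = "checkpoints/stories15M/stories15M.pt"

prefill_variants = [
    [],                                    # default parallel prefill
    ["--compile", "--compile-prefill"],    # compiled decode and compiled prefill
    ["--sequential-prefill"],              # sequential prefill
    ["--sequential-prefill", "--compile"]  # sequential prefill with compiled decode
]

for dtype, extra in itertools.product(["bfloat16", "float16", "float32"], prefill_variants):
    cmd = [
        "python3", "generate.py",
        "--checkpoint-path", MODEL_PATH,
        "--device", "cpu",
        "--dtype", dtype,
        "--temperature", "0",
        *extra,
    ]
    subprocess.run(cmd, check=True)  # stop on the first failing configuration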

build/builder.py

Lines changed: 9 additions & 1 deletion
@@ -117,6 +117,14 @@ def from_args(cls, args):  # -> BuilderArgs:
         if "chat" in path_basename or "instruct" in path_basename:
             is_chat_model = True
 
+        if args.output_pte_path and args.dtype.startswith("fast"):
+            if args.dtype == "fast":
+                dtype = torch.float32
+            else:
+                dtype = torch.float16
+        else:
+            dtype = name_to_dtype(args.dtype)
+
         return cls(
             checkpoint_dir=checkpoint_dir,
             checkpoint_path=checkpoint_path,
@@ -127,7 +135,7 @@ def from_args(cls, args):  # -> BuilderArgs:
             dso_path=args.dso_path,
             pte_path=args.pte_path,
             device=args.device,
-            precision=name_to_dtype(args.dtype),
+            precision=dtype,
             setup_caches=(args.output_dso_path or args.output_pte_path),
             use_tp=False,
             is_chat_model=is_chat_model,
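
Taken on its own, the precision selection added here behaves like the sketch below: when exporting an ExecuTorch .pte artifact, the platform-dependent "fast" aliases are pinned to concrete dtypes so the exported file does not depend on the machine that produced it. This is an illustrative rewrite only, and the import path for name_to_dtype is assumed.

# Illustrative sketch of the dtype resolution above; not the actual builder code.
import torch
from build.utils import name_to_dtype  # assumed import path

def resolve_precision(dtype_name: str, output_pte_path=None) -> torch.dtype:
    if output_pte_path and dtype_name.startswith("fast"):
        # .pte export pins "fast" to float32 and "fast16" to float16
        return torch.float32 if dtype_name == "fast" else torch.float16
    # every other path resolves the name normally, including the platform-dependent aliases
    return name_to_dtype(dtype_name)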

build/utils.py

Lines changed: 14 additions & 1 deletion
@@ -130,7 +130,17 @@ def get_precision():
 
 ##########################################################################
 ### dtype name to torch.dtype mapping ###
+
+
 def name_to_dtype(name):
+    if (name == "fast") or (name == "fast16"):
+        import platform
+
+        if platform.processor() == "arm":
+            return torch.float16
+        else:
+            return torch.bfloat16
+
     if name in name_to_dtype_dict:
         return name_to_dtype_dict[name]
     else:
@@ -150,6 +160,8 @@ def allowable_dtype_names() -> List[str]:
     "float32": torch.float,
     "float16": torch.float16,
     "bfloat16": torch.bfloat16,
+    "fast": None,
+    "fast16": None,
 }
 
 
@@ -208,6 +220,7 @@ def state_dict_device(d, device="cpu") -> Dict:
 #########################################################################
 ### move state dict to specified device ###
 
+
 def is_mps_available() -> bool:
     if not torch.backends.mps.is_available():
         return False
@@ -219,7 +232,7 @@ def is_mps_available() -> bool:
     except:
         return False
 
-# MPS, is that you?
+    # MPS, is that you?
     return True
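
A quick way to see what the new aliases resolve to on a given machine is the snippet below. It is illustrative only: the import path is assumed, and the result depends on the host CPU.

# Illustrative check of the "fast"/"fast16" aliases added above.
import platform
from build.utils import name_to_dtype  # assumed import path

print(platform.processor())       # "arm" on Apple Silicon, typically "x86_64" elsewhere
print(name_to_dtype("fast"))      # torch.float16 on arm, torch.bfloat16 otherwise
print(name_to_dtype("fast16"))    # resolved the same way as "fast" in this change
print(name_to_dtype("float32"))   # explicit names still go through name_to_dtype_dict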

cli.py

Lines changed: 2 additions & 3 deletions
@@ -210,11 +210,10 @@ def _add_arguments_common(parser):
         help="Use the specified ExecuTorch .pte model file",
     )
     parser.add_argument(
-        "-d",
         "--dtype",
-        default="float32",
+        default="fast",
         choices=allowable_dtype_names(),
-        help="Override the dtype of the model (default is the checkpoint dtype). Options: bf16, fp16, fp32",
+        help="Override the dtype of the model (default is the checkpoint dtype). Options: bf16, fp16, fp32, fast16, fast",
     )
     parser.add_argument(
         "-v",

generate.py

Lines changed: 1 addition & 1 deletion
@@ -172,7 +172,7 @@ def prefill(
     sequential_prefill=True,
     **sampling_kwargs,
 ) -> torch.Tensor:
-    logging.debug(f"x: {x}, input_pos: {input_pos}")
+    # logging.debug(f"x: {x}, input_pos: {input_pos}")
     width = x.size(1)
     assert input_pos.size(0) == width

qops.py

Lines changed: 18 additions & 0 deletions
@@ -305,3 +305,21 @@ def forward(self, input: torch.Tensor) -> torch.Tensor:
     @classmethod
     def _check_k(cls, *, k, groupsize=1, inner_k_tiles=1):
         return k % groupsize == 0 and k % (inner_k_tiles * 16) == 0
+
+    @classmethod
+    def _prepare_weight_and_scales_and_zeros(
+        cls, weight_bf16, groupsize, inner_k_tiles
+    ):
+        from quantize import group_quantize_tensor
+
+        weight_int32, scales_and_zeros = group_quantize_tensor(
+            weight_bf16, n_bit=4, groupsize=groupsize
+        )
+        weight_int4pack = torch.ops.aten._convert_weight_to_int4pack(
+            weight_int32, inner_k_tiles
+        )
+        return weight_int4pack, scales_and_zeros
+
+    @classmethod
+    def _calc_padded_size(cls, *, k, groupsize=1, innner_k_tiles=1):
+        return find_multiple(k, 1024)
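
The relocated helper can be exercised directly, as in the sketch below. This is illustrative only: the shapes and groupsize are made up, the import path is assumed, and torch.ops.aten._convert_weight_to_int4pack requires a PyTorch build that provides the int4 packing kernel for the target device.

# Illustrative use of the relocated int4 packing helper; not part of the commit.
import torch
from qops import WeightOnlyInt4Linear  # assumed import path

groupsize, inner_k_tiles = 128, 8
weight = torch.randn(4096, 4096)  # example linear weight, out_features x in_features

if WeightOnlyInt4Linear._check_k(k=weight.shape[1], groupsize=groupsize, inner_k_tiles=inner_k_tiles):
    packed, scales_and_zeros = WeightOnlyInt4Linear._prepare_weight_and_scales_and_zeros(
        weight, groupsize, inner_k_tiles
    )
    print(packed.shape, scales_and_zeros.shape)  # packed int4 weight and per-group scales/zeros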

quantize.py

Lines changed: 1 addition & 17 deletions
@@ -595,22 +595,6 @@ def quantized_model(self) -> nn.Module:
 ##### weight only int4 per channel groupwise quantized code ######
 
 
-def _int4_prepare_int4_weight_and_scales_and_zeros(
-    weight_bf16, groupsize, inner_k_tiles
-):
-    weight_int32, scales_and_zeros = group_quantize_tensor(
-        weight_bf16, n_bit=4, groupsize=groupsize
-    )
-    weight_int4pack = torch.ops.aten._convert_weight_to_int4pack(
-        weight_int32, inner_k_tiles
-    )
-    return weight_int4pack, scales_and_zeros
-
-
-def _int4_calc_padded_size(k, groupsize=1, innner_k_tiles=1):
-    return find_multiple(k, 1024)
-
-
 def replace_linear_int4(
     module,
     device,
@@ -705,7 +689,7 @@ def create_quantized_state_dict(self):
                     )
                     continue
                 weight_int4pack, scales_and_zeros = (
-                    _int4_prepare_int4_weight_and_scales_and_zeros(
+                    WeightOnlyInt4Linear._prepare_weight_and_scales_and_zeros(
                         weight.to(torch.float), self.groupsize, self.inner_k_tiles
                     )
                 )
