make --device fast the default #515

mikekgfb · 2024-04-27T02:47:23Z

make --device fast the default

* Update iOS.md * Update iOS.md

* remove macos-12 test * pip to pip3

* init * fixes * more fixes * fixes * fix * fix * bug fix * add objcopy update * suppress int8 * undefined variable --------- Co-authored-by: Michael Gschwind <[email protected]>

* refactor chat runner in preparation for llama3 * add sketch for llama3 prompt template and move to returning tokens * fix tiktoken * fixes to chat * add default llama_ver

…ion to cuda.json (#519)

Update Advanced Users description to reflect changes in the repo since the description was initially created.

* runner-aoti on cuda * transfer results back to CPU * transfer results back to CPU * runner-aoti on cuda

Update description of runner and build process in runner_build.md

* clean up runner code a little * update * update * pull out generate loop in chat * updates * edit docs * typo

* add dtype tests for runner-aoti + runner-et * typo

* move int8 linear class and function into qops.py * move Quantized Embedding to qops.py

* move int8 linear class and function into qops.py * move Quantized Embedding to qops.py * move int4 linear to qops

This reverts commit a7a2457.

malfet

Please remove unrelated changes, but if CI is green this looks ok to me.

Though from design point of view, I dislike the concept of "fast" device as it moves users away from a notion of PyTorch device.

Instead we should change the arguments parsing/ArgBuilder to detect best device if none were specified via CLI. (I.e. make device argument optional, rather than a mandatory with default value equal to "fast")

malfet · 2024-04-29T19:05:40Z

quantize.py

@@ -17,6 +17,7 @@
 import torch.nn.functional as F
 from build.utils import (
    find_multiple,
+    get_device_str,


Why this is needed?

we need to translate device name "fast" into a real device.

And yes, we could have tied that to no input for device, which was my first version -- but that failed, because at present some systems report MPS when they don't. That's why this is a separate PR enabling the "fast" device which is failing at the moment.

x-ref: pytorch/pytorch#125197

One more reason: The config files can specify a device, so this provides the ability to specify "fast" device in a config file, no matter where the config file qill be accessed.

malfet · 2024-04-29T19:05:47Z

generate.py

@@ -210,7 +210,7 @@ def decode_n_tokens(
 ):
    new_tokens, new_probs = [], []
    encountered_eos = False
-    for i in range(
+    for _i in range(


Same as above

* fix generate for llama3 * switch more things to C * remove C++ header

* add dtype tests for runner-aoti + runner-et * typo * add dtype test runner-aoti

* test sdpa with fp16 * kv cache fp32 * typo

Summary: remove support for lm-eval 0.3 to reduce the options we have Test Plan: CI Reviewers: Subscribers: Tasks: Tags:

* split cpu eval CI by dtype * fix * differentiate names with checks * keep one name the same as old * fix

Co-authored-by: Michael Gschwind <[email protected]>

…hchat into default_device_fast

* code beautification * code beautification, move functions together * make --device fast the default (#515) * make --device fast the default * Update iOS.md (#517) * Update iOS.md * Update iOS.md * Pip to pip3 (#504) * remove macos-12 test * pip to pip3 * break aoti CI jobs separately (#500) * init * fixes * more fixes * fixes * fix * fix * bug fix * add objcopy update * suppress int8 * undefined variable --------- Co-authored-by: Michael Gschwind <[email protected]> * Support llama3 in chat in run.cpp (#486) * refactor chat runner in preparation for llama3 * add sketch for llama3 prompt template and move to returning tokens * fix tiktoken * fixes to chat * add default llama_ver * Add tests for quantize json, add cuda device specification and precision to cuda.json (#519) * remove code for no KV Cache path (#527) * Update ADVANCED-USERS.md (#529) Update Advanced Users description to reflect changes in the repo since the description was initially created. * runner-aoti on cuda (#531) * runner-aoti on cuda * transfer results back to CPU * transfer results back to CPU * runner-aoti on cuda * Update runner_build.md (#530) Update description of runner and build process in runner_build.md * clean up runner code a little (#532) * clean up runner code a little * update * update * pull out generate loop in chat * updates * edit docs * typo * move int8 linear class and function into qops.py (#534) * add dtype tests for runner-aoti + runner-et (#539) * add dtype tests for runner-aoti + runner-et * typo * Quantized embedding (#536) * move int8 linear class and function into qops.py * move Quantized Embedding to qops.py * Move Linear int4 to qops (#537) * move int8 linear class and function into qops.py * move Quantized Embedding to qops.py * move int4 linear to qops * Revert "add dtype tests for runner-aoti + runner-et (#539)" (#548) This reverts commit a7a2457. * fix generate for llama3 (#538) * fix generate for llama3 * switch more things to C * remove C++ header * add delegation visualization instructions (#551) * Add dtype runner aoti (#552) * add dtype tests for runner-aoti + runner-et * typo * add dtype test runner-aoti * test sdpa with fp16 (#553) * test sdpa with fp16 * kv cache fp32 * typo * update (#560) * Only support newest versions of lm-eval (#556) Summary: remove support for lm-eval 0.3 to reduce the options we have Test Plan: CI Reviewers: Subscribers: Tasks: Tags: * split cpu eval CI by dtype (#554) * split cpu eval CI by dtype * fix * differentiate names with checks * keep one name the same as old * fix * Removing duplicate HF issue message from README (#559) Co-authored-by: Michael Gschwind <[email protected]> * doc updates (#567) * Add VM-safe MPS check --------- Co-authored-by: Anthony Shoumikhin <[email protected]> Co-authored-by: metascroy <[email protected]> Co-authored-by: Nikita Shulga <[email protected]> Co-authored-by: lucylq <[email protected]> Co-authored-by: Jerry Zhang <[email protected]> Co-authored-by: Jack-Khuu <[email protected]> * add unpacking support (#525) * add unpacking support * fix typos and linter * perform parallel prefill when possible (#568) * perform parallel prefill when possible * typo * disable hack * remove print * remove debug messages which prevent export * fixes * stream results in generate.py (#571) * remove logging interfering with export --------- Co-authored-by: Anthony Shoumikhin <[email protected]> Co-authored-by: metascroy <[email protected]> Co-authored-by: Nikita Shulga <[email protected]> Co-authored-by: lucylq <[email protected]> Co-authored-by: Jerry Zhang <[email protected]> Co-authored-by: Jack-Khuu <[email protected]>

* make --device fast the default * Update iOS.md (#517) * Update iOS.md * Update iOS.md * Pip to pip3 (#504) * remove macos-12 test * pip to pip3 * break aoti CI jobs separately (#500) * init * fixes * more fixes * fixes * fix * fix * bug fix * add objcopy update * suppress int8 * undefined variable --------- Co-authored-by: Michael Gschwind <[email protected]> * Support llama3 in chat in run.cpp (#486) * refactor chat runner in preparation for llama3 * add sketch for llama3 prompt template and move to returning tokens * fix tiktoken * fixes to chat * add default llama_ver * Add tests for quantize json, add cuda device specification and precision to cuda.json (#519) * remove code for no KV Cache path (#527) * Update ADVANCED-USERS.md (#529) Update Advanced Users description to reflect changes in the repo since the description was initially created. * runner-aoti on cuda (#531) * runner-aoti on cuda * transfer results back to CPU * transfer results back to CPU * runner-aoti on cuda * Update runner_build.md (#530) Update description of runner and build process in runner_build.md * clean up runner code a little (#532) * clean up runner code a little * update * update * pull out generate loop in chat * updates * edit docs * typo * move int8 linear class and function into qops.py (#534) * add dtype tests for runner-aoti + runner-et (#539) * add dtype tests for runner-aoti + runner-et * typo * Quantized embedding (#536) * move int8 linear class and function into qops.py * move Quantized Embedding to qops.py * Move Linear int4 to qops (#537) * move int8 linear class and function into qops.py * move Quantized Embedding to qops.py * move int4 linear to qops * Revert "add dtype tests for runner-aoti + runner-et (#539)" (#548) This reverts commit a7a2457. * fix generate for llama3 (#538) * fix generate for llama3 * switch more things to C * remove C++ header * add delegation visualization instructions (#551) * Add dtype runner aoti (#552) * add dtype tests for runner-aoti + runner-et * typo * add dtype test runner-aoti * test sdpa with fp16 (#553) * test sdpa with fp16 * kv cache fp32 * typo * update (#560) * Only support newest versions of lm-eval (#556) Summary: remove support for lm-eval 0.3 to reduce the options we have Test Plan: CI Reviewers: Subscribers: Tasks: Tags: * split cpu eval CI by dtype (#554) * split cpu eval CI by dtype * fix * differentiate names with checks * keep one name the same as old * fix * Removing duplicate HF issue message from README (#559) Co-authored-by: Michael Gschwind <[email protected]> * doc updates (#567) * Add VM-safe MPS check --------- Co-authored-by: Anthony Shoumikhin <[email protected]> Co-authored-by: metascroy <[email protected]> Co-authored-by: Nikita Shulga <[email protected]> Co-authored-by: lucylq <[email protected]> Co-authored-by: Jerry Zhang <[email protected]> Co-authored-by: Jack-Khuu <[email protected]>