fix generate for llama3 #538

Conversation
```cpp
prompt_tokens = tokenizer->encode(prompt, 0, 0);
prompt_tokens.insert(
    prompt_tokens.begin(),
    tokenizer->encode("<|begin_of_text|>", 0, 0)[0]);
```
Two issues. First, do we actually need the decorators for llama3? In talking with @JacobSzwejbka @kartikayk @ebsmothers, my takeaway was that llama3 doesn't need the decorators?

Second, for the decorators, why don't we just init a format string once at the beginning, like so:

```cpp
if (model_type == llama3)
  global_prompt_format = "<|begin_of_text|>%s<|end_of_text|><|eot_id|>";
else
  global_prompt_format = "<s>%s</s>";
```

And then do this:

```cpp
snprintf(buffer, sizeof(buffer), global_prompt_format, user_input);
tokenizer->encode(buffer, 0, 0);
```
We're using what feels like a very expensive approach involving lots of expensive C++. The goal was to have an extremely simple framework based on Andrej's extremely simple code (which would make it easy to port...).

C++ is not an advantage when C features do the job, too; simple loops are much easier to port into other languages/environments...

A second message was that this can be integrated into "any application without much change", and we picked Andrej's app -- but it feels like we've gone a bit far: rather than adding the changes we need, we are refactoring the app. That is decidedly not the goal. How can we achieve this with minimal invasiveness?
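For concreteness, a self-contained sketch of that approach (the helper name and buffer size are illustrative, not from this PR):

```cpp
#include <stdio.h>

// Hypothetical helper: decorate the raw user input with the
// per-model prompt format chosen once at startup.
static void build_prompt(char* out, size_t out_size,
                         const char* global_prompt_format,
                         const char* user_input) {
  snprintf(out, out_size, global_prompt_format, user_input);
}

// Usage at the call site stays a two-liner:
//   char buffer[4096];
//   build_prompt(buffer, sizeof(buffer), global_prompt_format, prompt);
//   prompt_tokens = tokenizer->encode(buffer, 0, 0);
```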
I'll let @JacobSzwejbka and others respond on whether llama3 needs the decorators. For chat, they appear to be necessary because the output looks way off without them, and even for generate, the output for llama3 on a prompt like "Once upon a time" looks funny without these changes IMO. You can try it yourself on main to see what I mean.

On using a global prompt, we could potentially do that. But chat and generate use different prompts AFAIK, so we would have two sets of global prompts, and at that point, why not just define the prompts in the chat/generate functions?

One smaller complication for these clean "the only difference between llama2/3 is the tokenizer + the prompt" rewrites: we cannot include `<s>` or `</s>` in prompts for llama2 because the tokenizer does not handle special tokens correctly (it will encode them with something like 3 tokens instead of one special token). The only way to pass the `<s>` and `</s>` tokens to the llama2 tokenizer is to specify them with the 2nd/3rd arguments to encode (`tokenizer->encode(prompt_without_special_tokens, bos=1, eos=1)`).
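To illustrate the difference (a sketch against the `encode(text, bos, eos)` interface used in this PR; variable names are illustrative):

```cpp
std::string prompt = "Once upon a time";

// Broken for llama2: the literal "<s>"/"</s>" text gets split into
// several ordinary tokens rather than the single BOS/EOS specials.
auto bad = tokenizer->encode("<s>" + prompt + "</s>", 0, 0);

// Works: request BOS/EOS via the 2nd/3rd arguments instead.
auto good = tokenizer->encode(prompt, /*bos=*/1, /*eos=*/1);
```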
Yeah I think llama3 requires the format specified in https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/
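For reference, the instruct template documented there looks roughly like this (quoted from memory -- treat the linked page as authoritative):

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```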
runner/run.cpp (outdated)

```diff
@@ -396,7 +410,7 @@ void safe_printf(const char* piece) {
   // piece might be a raw byte token, and we only want to print printable chars
   // or whitespace because some of the other bytes can be various control codes,
   // backspace, etc.
-  if (piece == NULL) {
+  if (piece == nullptr) {
```
Let's keep this C, not C++. If you feel strongly about doing a C++ version, we should do that separately. The talking point for this was "the exported model is as easy to use as llama2.c". (This used to be in the docs, but in the name of streamlining the doc, Jesse & team dropped a lot of that during the README rewrite....)
Happy to switch this back to NULL and move `throw std::runtime_error` back to `exit(EXIT_FAILURE)`. That's a pretty small change. Most of the code is still C-ish.

After the tokenizer was introduced with virtual function calls returning std::vector, it became a bit unclear to me whether this was still supposed to be C, and Nikita started leaving comments on previous PRs about why we were using NULL and calling exit outside of main. So I split the difference and tried to target C++11, but left a lot of the C stuff like printf.
runner/run.cpp (outdated)

```diff
       "No chat template implemnation for model type %d",
       model_type);
-  exit(EXIT_FAILURE);
+  throw std::runtime_error(
```
See my comment about keeping this mostly C. (To the point that I was thinking we might push the forward call out into forward.cpp with an `extern "C"` interface and make run.cpp -> run.c...)

As I mentioned above, maybe we want to keep one mostly C and do one with the full C++ monty. And I'm not going purist on that with "oh, then we have to rewrite tiktoken as C". There's a difference between opportunistically using what exists and rewriting everything into C++. I deliberately left aside the C++ refactor that Mengwei had done for the ET runner. The only change from Bert's adaptation of Andrej's runner was a very few lines.

Maybe the right way to go is to have both a more C++ runner and one that stays closer to Andrej's original code. The point is to convince the community that integrating exported models doesn't have to upend your app -- yet that's exactly what we're doing, reworking everything. I understand the desire to make this "ours" -- and there's totally a space for that: put it into a C++-in-spirit runner, which is what this has become a good example of. (If we are thinking of going this route, I'll check with Soumith re: his feedback.) A more C-style version might do something like my example above.

BTW, on the tokenizer change from yesterday -- I liked that a lot because it sort of followed Andrej's philosophy and extended it to tokenizers. One more thing: my opinion is not the one true word... and there may even be a generational thing going on as well. I definitely think of C as my home field, and I'm quite sure that you're looking at it and thinking, OMG, get this old stuff out of my face. Another reason why diversity is important: have one C-in-spirit implementation (even if it uses some C++ here and there...) and a more authentic C++ implementation. I'll get some consultation from some of our senior SWEs too, because it's important to have the broader perspective. My sense would be to do two versions: a "minimal change" version (mostly that of yesterday, probably without fancy namespaces etc. -- i.e., is_llama3 = 0/1 instead of ModelType::, tweaking prompts as needed, using something like my example above, or another mostly C-flavored version without fancy language features; this doesn't need to become fundamentalist "no C++ allowed" code) and a "here's a full C++ implementation" version (with your code of today). Let's chat tomorrow.
@mikekgfb I guess there are two questions:

The code snippet you have above is mostly C, but it doesn't look that similar to Andrej's original code IMO. I can change the code in this PR to be mostly C (other than the tokenizer returning std::vector). We can also look at trying to minimize the differences with Andrej's original code in a follow-up PR (perhaps in a separate file?).
(force-pushed from 70cc8bc to 72eb661)
@mikekgfb, I think I got all the C++ stuff removed except the stuff related to the tokenizer and the std::vector it returns.
(force-pushed from 2d6280e to ce79311)
```cpp
enum ModelType {
  UNKNOWN_MODEL = 0,
  llama2 = 2,
  llama3 = 3,
};
```
0, 2, 3? Is it just because it's nice that llama2 is 2 and llama3 is 3?
Yes
```cpp
prompt_tokens.insert(
    prompt_tokens.begin(),
    tokenizer->encode("<|begin_of_text|>", 0, 0)[0]);
stop_tokens.push_back(tokenizer->encode("<|end_of_text|>", 0, 0)[0]);
```
TIL this works. In Python I go through `.special_tokens` or something like that to get the ids for special tokens like these, instead of actually calling the tokenizer.
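A rough sketch of how those stop tokens might then be consulted in the generation loop (illustrative; `sample_next_token` and `emit_token` are hypothetical helpers, and `std::find` needs `<algorithm>`):

```cpp
for (int pos = num_prompt_tokens; pos < steps; pos++) {
  uint64_t next = sample_next_token();  // hypothetical sampling step
  // Stop as soon as we sample <|end_of_text|> (or any other stop token).
  if (std::find(stop_tokens.begin(), stop_tokens.end(), next) !=
      stop_tokens.end()) {
    break;
  }
  emit_token(next);  // hypothetical output step
}
```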
Thank you!
Commits in this PR:

* fix generate for llama3
* switch more things to C
* remove C++ header
This PR:
Issues with llama3 chat were fixed in a previous PR, and the documentation for generate has already been updated to reflect this PR.
Here are samples for chat/generate for llama2/llama3 from this PR: