fix generate for llama3 #538

Merged
merged 3 commits into from
Apr 29, 2024

Conversation

metascroy
Contributor

@metascroy metascroy commented Apr 29, 2024

This PR:

  • Consolidates chat + generate to share the same code via generate_from_prompt_tokens
  • Fixes issues for llama3 generate related to the new prompt structure.

Issues with llama3 chat were fixed in a previous PR. The documentation for generate is already updated to reflect this PR.

Here are samples for chat/generate for llama2/llama3 from this PR:

  • llama2 chat
./cmake-out/et_run ./llama2/model.pte -z ./llama2/tokenizer.bin -l 2 -m chat
Enter system prompt (optional): Be brief in your replies.
User: How many oceans are there?
Assistant:   There are 5 oceans:

1. Pacific Ocean
2. Atlantic Ocean
3. Indian Ocean
4. Arctic Ocean
5. Southern Ocean (also known as the Antarctic Ocean)
User: Which one is the biggest?
Assistant:   The biggest ocean is the Pacific Ocean, which covers an area of approximately 155.6 million square kilometers (60.1 million square miles).
User:
  • llama3 chat
./cmake-out/et_run ./llama3/model.pte -z ./llama3/tiktokenizer.bin -l 3 -m chat
Enter system prompt (optional): Be brief in your replies and answer like mario.
User: What is a good place to eat?
Assistant: "It's-a me, Mario! Ah, you're lookin' for a good place to eat, eh? Well, I got just the spot for ya! Luigi's Kitchen, it's-a the best! He's-a got all sorts of delicious pasta, pizza, and even some-a goomba-gobbling goodness! Trust me, you won't be disappointed!"
User: How do I get there?
Assistant: "Whoa, careful there! You gotta hop on the Warp Pipe, buddy! Just kidding, sorta... You can take the Mushroom Kingdom Highway, exit at World 1-2, and follow the signs to Toad Town. Luigi's Kitchen is on the left, can't miss it! Just watch out for Goombas and Bullet Bills, yeah!"
User:
  • llama2 generate
./cmake-out/et_run ./llama2/model.pte -z ./llama2/tokenizer.bin -l 2 -i "Once upon a time"
Once upon a time in a far-off kingdom, there was a young prince named Leo.eq. Leo was the eldest son of the king and queen, and he was known throughout the kingdom for his kindness, bravery, and wisdom.

One day, while out for a walk in the palace gardens, Leo stumbled upon a hidden path he had never seen before. Curious, he decided to follow it, and it led him to a beautiful and mysterious garden. In the center of the garden was a majestic tree, its branches heavy with golden fruit. Leo was drawn to the tree and decided to pluck one of the fruit. As soon as he took a bite, he felt a strange sensation wash over him.

Suddenly, Leo found himself in a strange and unfamiliar place. He was standing in a vast plain, surrounded by towering mountains in the distance. He had no idea how he got there or how to get back to his kingdom.

Leo began to wander the plain, searching for any sign of civilization. As he walked, he encountered various creatures, some friendly and some not so friendly. He met a talking rabbit who offered to guide

achieved tok/s: 0.828172
  • llama3 generate
./cmake-out/et_run ./llama3/model.pte -z ./llama3/tiktokenizer.bin -l 3 -i "Once upon a time"
Once upon a time, there were four best friends – Nikhil, Nidhi, Karthik and Ritika. They all studied in the same school and were inseparable. Their bond grew stronger as the years went by, and they often spent their holidays together.

One day, they received an invitation from their school to participate in an inter-school quiz competition. The competition was being held at a prestigious venue in the city, and the team that won would receive a cash prize and the prestigious title of "Quiz Champions."

The friends were thrilled at the opportunity and started preparing for the competition. They devoted all their free time to study, and their hard work paid off. They found themselves well-prepared for the quiz competition.

The day of the competition arrived, and the friends made their way to the venue. They were the first team to take the stage, and their performance was impressive. They answered question after question correctly, impressing the judges with their knowledge.

As they reached the final question, they were tied for the first place with another team. The judges announced that the two teams would be competing in a tie-breaker question to determine the winner.

The final question was "What is the world's largest living structure?" The friends looked at each other, confident

achieved tok/s: 1.021667

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 29, 2024
@metascroy metascroy requested a review from mikekgfb April 29, 2024 02:06
prompt_tokens = tokenizer->encode(prompt, 0, 0);
prompt_tokens.insert(
    prompt_tokens.begin(),
    tokenizer->encode("<|begin_of_text|>", 0, 0)[0]);
Contributor

Two issues -- do we actually need the decorators for llama3? In talking with @JacobSzwejbka @kartikayk @ebsmothers, my takeaway was that llama3 doesn't need the decorators?

Also, for the decorators, why don't we just init a control once at the beginning, like so:

if (model_type == llama3)
    global_prompt_format = "<|begin_of_text|>%s<|end_of_text|><|eot_id|>";
else
    global_prompt_format = "<s>%s</s>";

And then do this:

snprintf(buffer, sizeof(buffer), global_prompt_format, user_input);
tokenizer->encode(buffer);

We're using what feels like a very expensive approach involving lots of expensive C++. The goal was to have an extremely simple framework based on Andrej's extremely simple code (which would make it easy to port...).

C++ is not an advantage when plain C features will do, with simple loops, because C is much easier to port into other languages/environments...

A second message was that this can be integrated into "any application without much change", and we picked Andrej's app; but it feels like we've gone a bit far: rather than adding the changes we need, we're refactoring the app. That is decidedly not the goal. How can we achieve this with minimal invasiveness?

Contributor Author

@metascroy metascroy Apr 29, 2024

I'll let @JacobSzwejbka and others respond on whether llama3 needs the decorators. For chat, they appear to be necessary because the output looks way off without them, and even for generate, the output for llama3 on a prompt like "Once upon a time" looks funny without these changes IMO. You can try yourself on main to see what I mean.

On using a global prompt, we could potentially do that. But chat and generate use different prompts AFAIK, so we would have two sets of global prompts, and at that point, why not just define the prompts in the chat/generate functions?

One smaller complication on these clean "the only difference between llama2/3 is the tokenizer + the prompt" rewrites: we cannot include <s> or </s> in prompts for llama2 because the tokenizer does not handle special tokens correctly (it will encode them with something like 3 tokens instead of one special token). The only way to pass the <s> and </s> tokens to the llama2 tokenizer is to specify them with the 2nd/3rd arguments to encode (tokenizer->encode(prompt_without_special_tokens, bos=1, eos=1)).
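
To make that tokenizer constraint concrete, here is a minimal sketch (the `Llama2TokenizerSketch` type and all token ids are invented for illustration, not the runner's real interface or the real llama2 vocabulary) of why the bos/eos flags are the only reliable way to get the special ids:

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the runner's encode(text, bos, eos) signature:
// the flags prepend/append the single special BOS/EOS ids, while special-token
// text in the input is treated as ordinary characters.
struct Llama2TokenizerSketch {
  static constexpr int kBos = 1; // <s>
  static constexpr int kEos = 2; // </s>

  std::vector<int> encode(const std::string& text, int bos, int eos) const {
    std::vector<int> out;
    if (bos) out.push_back(kBos);
    // A SentencePiece-style tokenizer sees "<s>" in the text as plain
    // characters, so it comes back as several ordinary tokens (modeled
    // here as one made-up id per byte):
    for (char c : text) out.push_back(1000 + static_cast<unsigned char>(c));
    if (eos) out.push_back(kEos);
    return out;
  }
};
```

With this model, `encode("hi", 1, 1)` yields the single special ids at each end, while `encode("<s>hi</s>", 0, 0)` yields only ordinary tokens for every character of the literal markers -- the failure mode described above.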

Contributor

Yeah I think llama3 requires the format specified in https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/
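
For reference, that format wraps each turn in header/eot markers. A rough sketch of building a single prompt with snprintf (the function name and buffer size are mine, and the template string should be double-checked against the model card):

```cpp
#include <cstdio>
#include <string>

// Builds one llama3-instruct prompt following the format documented in the
// meta-llama-3 model card: <|begin_of_text|>, then role headers, then <|eot_id|>
// after each turn, ending with an open assistant header for generation.
std::string format_llama3_turn(const std::string& system_prompt,
                               const std::string& user_prompt) {
  char buffer[4096];
  std::snprintf(buffer, sizeof(buffer),
                "<|begin_of_text|>"
                "<|start_header_id|>system<|end_header_id|>\n\n%s<|eot_id|>"
                "<|start_header_id|>user<|end_header_id|>\n\n%s<|eot_id|>"
                "<|start_header_id|>assistant<|end_header_id|>\n\n",
                system_prompt.c_str(), user_prompt.c_str());
  return std::string(buffer);
}
```

The resulting string would then be tokenized (with the special tokens resolved to their single ids, per the discussion above) before being fed to the model.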

runner/run.cpp Outdated
@@ -396,7 +410,7 @@ void safe_printf(const char* piece) {
   // piece might be a raw byte token, and we only want to print printable chars
   // or whitespace because some of the other bytes can be various control codes,
   // backspace, etc.
-  if (piece == NULL) {
+  if (piece == nullptr) {
Contributor

Let's keep this C, not C++. If you feel strongly about doing a C++ version, we should do that separately. The talking point for this was "the exported model is as easy to use as llama2.c". (This used to be in the docs, but in the name of streamlining the doc, Jesse & team dropped a lot of that during the README rewrite....)

Contributor Author

Happy to switch this back to NULL and move throw std::runtime_error back to exit(EXIT_FAILURE). That's a pretty small change. Most of the code is still C-ish.

After the tokenizer was introduced with virtual function calls returning std::vector, it became a bit unclear to me whether this was still supposed to be C, and Nikita started leaving comments on previous PRs about why we were using NULL and calling exit outside of main. So I split the difference and tried to target C++11, but left a lot of the C stuff like printf.

runner/run.cpp Outdated
-  exit(EXIT_FAILURE);
+  throw std::runtime_error(
       "No chat template implemnation for model type %d",
       model_type);
Contributor

See my comment about keeping this mostly C. (To the point that I was thinking we might push the forward call out into forward.cpp with an extern "C" interface and make run.cpp -> run.c ...)

As I mentioned above, maybe we want to keep one mostly C and do one with the full C++ monty. And I'm not going purist on that with "oh, then we have to rewrite tiktoken as C". There's a difference between opportunistically using what exists and rewriting everything into C++. I deliberately left aside the C++ refactor that mengwei had done for the ET runner. The only change from Bert's adaptation of Andrej's runner was a very few lines.

@mikekgfb
Contributor

Maybe the right way to go is to have both a more C++ runner and one that stays closer to Andrej's original code. The point is to convince the community that integrating exported models doesn't have to upend your app -- yet that's exactly what we're doing: reworking everything.

I understand the desire to make this "ours" -- and there's totally a space for that: put it into a C++-in-spirit runner, which is what this has become a good example of. (If we're thinking of going this route, I'll check with Soumith re: his feedback.)

On a more C-style version we might do something like

// IDK whether the C++ compiler will love this struct, because struct has morphed
// into a class that lives in the typedef namespace...
typedef struct chat_control {
    char *user_input;    // prompt shown to the user, or NULL in generate mode
    char *prompt_format; // printf-style format for decorating the input
} chat_control;

chat_control llama2_system = { "System prompt:\n", "<format for decorating input>%s<more format>" };
chat_control llama2_user   = { "User input: ", "<s>%s</s>" };

chat_control llama3_system = { "System prompt:\n", "<...>%s</...>" };
chat_control llama3_user   = { "User input: ", "<...>%s<...>" };

void generate_n_chat(char *input) {
    chat_control active, ongoing;

    if (!chat) {
        active = (chat_control){ NULL, "%s" };
        ongoing = active;
    } else if (llama2) {
        active = llama2_system;
        ongoing = llama2_user;
    } else { // must be llama3, we only support these
        active = llama3_system;
        ongoing = llama3_user;
    }

    do {
        static char buffer[MAX_BUF];
        if (active.user_input) {
            static char input_buffer[MAX_BUF];
            printf("%s", active.user_input);
            scanf("%s", input_buffer);
            input = input_buffer;
        }
        snprintf(buffer, MAX_BUF, active.prompt_format, input);

        tokens = tokenizer->encode(buffer);

        // iterate over input tokens...

        // get output tokens...

        active = ongoing;
    } while (active.user_input);
}

BTW, on the tokenizer change from yesterday -- I liked that a lot because it sort of followed Andrej's philosophy and extended it to tokenizers.

One more thing: my opinion is not the one true word... and there may even be a generational thing going on as well; I definitely think of C as my home field, and I'm quite sure you're looking at it and thinking, OMG, get this old stuff out of my face. That's another reason why diversity is important: have one C-in-spirit implementation (even if it uses some C++ here and there...) and a more authentic C++ implementation.

I'll get some consultation from some of our senior SWEs too, because it's important to have the broader perspective. My sense would be to do two versions: a "minimal change" version (mostly that of yesterday, probably without fancy namespaces etc. -- i.e., is_llama3 = 0/1 instead of ModelType::, tweaking prompts as needed, using something like my example above or another mostly-C flavor without fancy language features; this doesn't need to become fundamentalist no-C++-allowed code) and a "here's a full C++ implementation" version (with your code of today).

Let's chat tomorrow.

@metascroy
Contributor Author

@mikekgfb I guess there are two questions:

  1. Do we want to target C/C++, or perhaps both?
  2. For the C implementation, do we want to minimize the diff with Andrej's original code?

The code snippet you have above is mostly C, but it doesn't look that similar to Andrej's original code IMO. I can change the code in this PR to be mostly C (other than the tokenizer returning std::vector). We can also look at minimizing the differences with Andrej's original code in a follow-up PR (perhaps in a separate file?).

@metascroy metascroy force-pushed the fix-run-generate-llama3 branch from 70cc8bc to 72eb661 Compare April 29, 2024 16:46
@metascroy
Contributor Author

@mikekgfb, I think I got all the C++ stuff removed, except for the parts related to the tokenizer and the std::vector it returns:

  • nullptr -> NULL
  • throw -> exit(EXIT_FAILURE)
  • enum class -> enum
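
The three conversions are mechanical; here is a small sketch of what the C-flavored versions look like (the ModelType values match the snippet under review; the helper functions and message text are illustrative, not the PR's actual code):

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Plain enum instead of enum class (values from the reviewed snippet).
enum ModelType { UNKNOWN_MODEL = 0, llama2 = 2, llama3 = 3 };

// NULL instead of nullptr as the "no value" sentinel.
const char* model_name(int model_type) {
  switch (model_type) {
    case llama2: return "llama2";
    case llama3: return "llama3";
    default: return NULL; // caller must check before using
  }
}

// fprintf + exit(EXIT_FAILURE) instead of throw std::runtime_error(...).
void die_if_unknown(int model_type) {
  if (model_name(model_type) == NULL) {
    fprintf(stderr, "No chat template implementation for model type %d\n",
            model_type);
    exit(EXIT_FAILURE);
  }
}
```

Nothing here requires a C++ compiler except the surrounding tokenizer code, which is the point of keeping the runner C-ish.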

@metascroy metascroy force-pushed the fix-run-generate-llama3 branch from 2d6280e to ce79311 Compare April 29, 2024 17:46
@metascroy metascroy requested a review from JacobSzwejbka April 29, 2024 18:21
enum ModelType {
  UNKNOWN_MODEL = 0,
  llama2 = 2,
  llama3 = 3,
Contributor

0, 2, 3? Is it just because it's nice that llama2 is 2 and llama3 is 3?

Contributor Author

Yes

prompt_tokens.insert(
    prompt_tokens.begin(),
    tokenizer->encode("<|begin_of_text|>", 0, 0)[0]);
stop_tokens.push_back(tokenizer->encode("<|end_of_text|>", 0, 0)[0]);
Contributor

TIL this works. In Python I go through .special_tokens or something like that to get the token instead of actually calling the tokenizer.
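
Both routes end at the same ids; a sketch of the lookup-table alternative (the struct is illustrative, not the repo's tokenizer API, though 128000/128001 are the ids commonly reported for these llama3 special tokens):

```cpp
#include <string>
#include <unordered_map>

// Sketch: tiktoken-style tokenizers typically carry a special-token table,
// so the ids can be looked up directly instead of round-tripping through
// encode(). This type and its contents are illustrative.
struct SpecialTokenTable {
  std::unordered_map<std::string, int> special_tokens;

  // Returns the id for a special token name, or -1 if unknown.
  int id_of(const std::string& name) const {
    auto it = special_tokens.find(name);
    return it == special_tokens.end() ? -1 : it->second;
  }
};
```

With such a table, `table.id_of("<|end_of_text|>")` would replace the `tokenizer->encode("<|end_of_text|>", 0, 0)[0]` call when building stop_tokens.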

Contributor

@mikekgfb mikekgfb left a comment

Thank you!

@mikekgfb mikekgfb merged commit af38291 into main Apr 29, 2024
mikekgfb added a commit that referenced this pull request Apr 30, 2024
* make --device fast the default

* Update iOS.md (#517)

* Update iOS.md

* Update iOS.md

* Pip to pip3 (#504)

* remove macos-12 test

* pip to pip3

* break aoti CI jobs separately (#500)

* init

* fixes

* more fixes

* fixes

* fix

* fix

* bug fix

* add objcopy update

* suppress int8

* undefined variable

---------

Co-authored-by: Michael Gschwind <[email protected]>

* Support llama3 in chat in run.cpp  (#486)

* refactor chat runner in preparation for llama3

* add sketch for llama3 prompt template and move to returning tokens

* fix tiktoken

* fixes to chat

* add default llama_ver

* Add tests for quantize json, add cuda device specification and precision to cuda.json (#519)

* remove code for no KV Cache path (#527)

* Update ADVANCED-USERS.md (#529)

Update Advanced Users description to reflect changes in the repo since the description was initially created.

* runner-aoti on cuda (#531)

* runner-aoti on cuda

* transfer results back to CPU

* transfer results back to CPU

* runner-aoti on cuda

* Update runner_build.md (#530)

Update description of runner and build process in runner_build.md

* clean up runner code a little (#532)

* clean up runner code a little

* update

* update

* pull out generate loop in chat

* updates

* edit docs

* typo

* move int8 linear class and function into qops.py (#534)

* add dtype tests for runner-aoti + runner-et (#539)

* add dtype tests for runner-aoti + runner-et

* typo

* Quantized embedding (#536)

* move int8 linear class and function into qops.py

* move Quantized Embedding to qops.py

* Move Linear int4 to qops (#537)

* move int8 linear class and function into qops.py

* move Quantized Embedding to qops.py

* move int4 linear to qops

* Revert "add dtype tests for runner-aoti + runner-et (#539)" (#548)

This reverts commit a7a2457.

* fix generate for llama3 (#538)

* fix generate for llama3

* switch more things to C

* remove C++ header

* add delegation visualization instructions (#551)

* Add dtype runner aoti (#552)

* add dtype tests for runner-aoti + runner-et

* typo

* add dtype test runner-aoti

* test sdpa with fp16 (#553)

* test sdpa with fp16

* kv cache fp32

* typo

* update (#560)

* Only support newest versions of lm-eval (#556)

Summary:
remove support for lm-eval 0.3 to reduce the options we have

Test Plan:
CI

Reviewers:

Subscribers:

Tasks:

Tags:

* split cpu eval CI by dtype (#554)

* split cpu eval CI by dtype

* fix

* differentiate names with checks

* keep one name the same as old

* fix

* Removing duplicate HF issue message from README (#559)

Co-authored-by: Michael Gschwind <[email protected]>

* doc updates (#567)

* Add VM-safe MPS check

---------

Co-authored-by: Anthony Shoumikhin <[email protected]>
Co-authored-by: metascroy <[email protected]>
Co-authored-by: Nikita Shulga <[email protected]>
Co-authored-by: lucylq <[email protected]>
Co-authored-by: Jerry Zhang <[email protected]>
Co-authored-by: Jack-Khuu <[email protected]>
mikekgfb added a commit that referenced this pull request Apr 30, 2024
* code beautification

* code beautification, move functions together

* make --device fast the default (#515)


* add unpacking support (#525)

* add unpacking support

* fix typos and linter

* perform parallel prefill when possible (#568)

* perform parallel prefill when possible

* typo

* disable hack

* remove print

* remove debug messages which prevent export

* fixes

* stream results in generate.py (#571)

* remove logging interfering with export

---------

Co-authored-by: Anthony Shoumikhin <[email protected]>
Co-authored-by: metascroy <[email protected]>
Co-authored-by: Nikita Shulga <[email protected]>
Co-authored-by: lucylq <[email protected]>
Co-authored-by: Jerry Zhang <[email protected]>
Co-authored-by: Jack-Khuu <[email protected]>
@malfet malfet deleted the fix-run-generate-llama3 branch April 30, 2024 16:50
malfet pushed a commit that referenced this pull request Jul 17, 2024
malfet added a commit that referenced this pull request Jul 17, 2024
malfet added a commit that referenced this pull request Jul 17, 2024
malfet pushed a commit that referenced this pull request Jul 17, 2024
malfet added a commit that referenced this pull request Jul 17, 2024
malfet added a commit that referenced this pull request Jul 17, 2024
* code beautification

* code beautification, move functions together

* make --device fast the default (#515)

* make --device fast the default

* Update iOS.md (#517)

* Update iOS.md

* Update iOS.md

* Pip to pip3 (#504)

* remove macos-12 test

* pip to pip3

* break aoti CI jobs separately (#500)

* init

* fixes

* more fixes

* fixes

* fix

* fix

* bug fix

* add objcopy update

* suppress int8

* undefined variable

---------

Co-authored-by: Michael Gschwind <[email protected]>

* Support llama3 in chat in run.cpp  (#486)

* refactor chat runner in preparation for llama3

* add sketch for llama3 prompt template and move to returning tokens

* fix tiktoken

* fixes to chat

* add default llama_ver

* Add tests for quantize json, add cuda device specification and precision to cuda.json (#519)

* remove code for no KV Cache path (#527)

* Update ADVANCED-USERS.md (#529)

Update Advanced Users description to reflect changes in the repo since the description was initially created.

* runner-aoti on cuda (#531)

* runner-aoti on cuda

* transfer results back to CPU

* transfer results back to CPU

* runner-aoti on cuda

* Update runner_build.md (#530)

Update description of runner and build process in runner_build.md

* clean up runner code a little (#532)

* clean up runner code a little

* update

* update

* pull out generate loop in chat

* updates

* edit docs

* typo

* move int8 linear class and function into qops.py (#534)

* add dtype tests for runner-aoti + runner-et (#539)

* add dtype tests for runner-aoti + runner-et

* typo

* Quantized embedding (#536)

* move int8 linear class and function into qops.py

* move Quantized Embedding to qops.py

* Move Linear int4 to qops (#537)

* move int8 linear class and function into qops.py

* move Quantized Embedding to qops.py

* move int4 linear to qops

* Revert "add dtype tests for runner-aoti + runner-et (#539)" (#548)

This reverts commit a7a2457.

* fix generate for llama3 (#538)

* fix generate for llama3

* switch more things to C

* remove C++ header

* add delegation visualization instructions (#551)

* Add dtype runner aoti (#552)

* add dtype tests for runner-aoti + runner-et

* typo

* add dtype test runner-aoti

* test sdpa with fp16 (#553)

* test sdpa with fp16

* kv cache fp32

* typo

* update (#560)

* Only support newest versions of lm-eval (#556)

Summary:
remove support for lm-eval 0.3 to reduce the options we have

Test Plan:
CI

Reviewers:

Subscribers:

Tasks:

Tags:

* split cpu eval CI by dtype (#554)

* split cpu eval CI by dtype

* fix

* differentiate names with checks

* keep one name the same as old

* fix

* Removing duplicate HF issue message from README (#559)

Co-authored-by: Michael Gschwind <[email protected]>

* doc updates (#567)

* Add VM-safe MPS check

---------

Co-authored-by: Anthony Shoumikhin <[email protected]>
Co-authored-by: metascroy <[email protected]>
Co-authored-by: Nikita Shulga <[email protected]>
Co-authored-by: lucylq <[email protected]>
Co-authored-by: Jerry Zhang <[email protected]>
Co-authored-by: Jack-Khuu <[email protected]>

* add unpacking support (#525)

* add unpacking support

* fix typos and linter

* perform parallel prefill when possible (#568)

* perform parallel prefill when possible

* typo

* disable hack

* remove print

* remove debug messages which prevent export

* fixes

* stream results in generate.py (#571)

* remove logging interfering with export

---------

Co-authored-by: Anthony Shoumikhin <[email protected]>
Co-authored-by: metascroy <[email protected]>
Co-authored-by: Nikita Shulga <[email protected]>
Co-authored-by: lucylq <[email protected]>
Co-authored-by: Jerry Zhang <[email protected]>
Co-authored-by: Jack-Khuu <[email protected]>
malfet pushed a commit that referenced this pull request Jul 17, 2024
* fix generate for llama3

* switch more things to C

* remove C++ header
malfet added a commit that referenced this pull request Jul 17, 2024
malfet added a commit that referenced this pull request Jul 17, 2024
malfet pushed a commit that referenced this pull request Jul 17, 2024
malfet added a commit that referenced this pull request Jul 17, 2024
malfet added a commit that referenced this pull request Jul 17, 2024
malfet pushed a commit that referenced this pull request Jul 17, 2024
malfet added a commit that referenced this pull request Jul 17, 2024
malfet added a commit that referenced this pull request Jul 17, 2024
malfet pushed a commit that referenced this pull request Jul 17, 2024
malfet added a commit that referenced this pull request Jul 17, 2024
malfet added a commit that referenced this pull request Jul 17, 2024