Add Yandex instruct model template support #12621

Merged · 7 commits merged on Mar 30, 2025

Conversation

vorobyov01 (Contributor) commented Mar 28, 2025

I have read the contributing guidelines

Self-reported review complexity:

  • Low

Hello!
We at Yandex are planning to release our 8B instruct model to open-source. The pre-trained model can be found here: YandexGPT-5-Lite-8B-pretrain.
I created this pull request to add support for our custom chat template in llama.cpp. Could you please take a look and review my changes? @ggerganov

I ran test-chat-template locally — it works as expected. I believe my changes should not affect other parts of the project.

Thank you!

The github-actions bot added the testing (Everything test related) label on Mar 28, 2025
vorobyov01 (Contributor, Author) commented:

I encountered an issue related to the difference between Jinja and C++ templates. For some reason, the system prompt "You are a helpful assistant" in Jinja was included in the first user message. I suspect this is due to some peculiarities in Jinja templating. As a workaround, I hardcoded it in the tests:

/* .expected_output= */ "<s> Пользователь: Hello\n\n Ассистент: Hi there\n\n Пользователь: Who are you\n\n Ассистент:    I am an assistant   \n\n Пользователь: Another question\n\n Ассистент:[SEP]",

/* .expected_output_jinja= */ "<s> Пользователь: You are a helpful assistant\nHello\n\n Ассистент: Hi there\n\n Пользователь: Who are you\n\n Ассистент:    I am an assistant   \n\n Пользователь: Another question\n\n Ассистент:[SEP]",
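
For reference, here is a minimal sketch of how a formatter for this kind of template could apply a conversation (the struct and function names below are illustrative only, not the actual code this PR adds to llama-chat.cpp):

```cpp
#include <sstream>
#include <string>
#include <vector>

struct chat_msg { std::string role, content; };

// Illustrative sketch, not the PR's implementation. System-prompt handling is
// exactly where the two code paths above diverge: the Jinja template folds the
// system prompt into the first user message, hence the second expected string.
static std::string format_yandex_chat(const std::vector<chat_msg> & chat, bool add_assistant_prefix) {
    std::ostringstream ss;
    ss << "<s>";
    for (const auto & m : chat) {
        if (m.role == "assistant") {
            ss << " Ассистент: " << m.content << "\n\n";
        } else if (m.role == "user") {
            ss << " Пользователь: " << m.content << "\n\n";
        }
        // system messages are omitted here for brevity; see the note above
    }
    if (add_assistant_prefix) {
        ss << " Ассистент:[SEP]"; // the model generates one reply after this prefix
    }
    return ss.str();
}
```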

vorobyov01 (Contributor, Author) commented:

@compilade @ngxson Could you please take a look when you have time?

ngxson (Collaborator) commented Mar 30, 2025

This template doesn't seem to use an EOT (end of turn) token. Have you tried with llama-cli to make sure the generation correctly stops after each turn?

vorobyov01 (Contributor, Author) commented Mar 30, 2025

Yes, it works fine. We used the EOS token as EOT during alignment:

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT


> Hello! 
 Hello! How can I assist you?

> Write a haiku about tokenizer
 A tokenizer splits,
Breaking words into parts,
Streams flow in lines.

> 

We have a prefix space in the first token of the reply, which is a consequence of converting our tokenizer to the HF infra.

ngxson previously approved these changes Mar 30, 2025
ngxson (Collaborator) left a comment:

Seems good to me.

Btw if you're from the Yandex team, please do not use \n\n as the EOT next time you make an instruction-tuned model. The reason is that this prevents the model from generating long code or markdown content, which makes it completely useless for real-life use cases.

vorobyov01 (Contributor, Author) commented Mar 30, 2025

It uses the </s> token as an EOT. There will indeed be \n\n</s> at the end of each turn, but not every \n\n sequence should end with </s>.

vorobyov01 (Contributor, Author) commented:

Thanks a lot for reviewing!

ngxson (Collaborator) commented Mar 30, 2025

> It uses the </s> token as an EOT. There will indeed be \n\n</s> at the end of each turn, but not every \n\n sequence should end with </s>.

Then I think the current template is wrong:

ss << " Пользователь: " << chat[i]->content << "\n\n";

It must be "\n\n</s>" after each turn, not just "\n\n".

ngxson (Collaborator) commented Mar 30, 2025

In any case, for the future I would recommend using a template like ChatML or Llama 3, as they are very easy to work with and you don't have to deal with trailing-space issues.

ngxson dismissed their stale review March 30, 2025 14:56

Dismiss until EOS/EOT is confirmed

vorobyov01 (Contributor, Author) commented:

Our model was trained to respond with only one turn after Ассистент:[SEP], so Ассистент:[SEP] appears only before the very last reply. I might be mistaken (please correct me if I'm wrong), but when calling llama-cli, it reconstructs this dialog each time, meaning our last reply will always start with Ассистент:[SEP], no matter how long the sequence we generate. It always ends with the EOS token </s>. We don't actually have a concept of an end_of_turn token; sorry for the confusion.

I agree that it would be much better and more convenient to use special tokens for this. We realized this as we began the open-source process. We will definitely switch to special tokens in future model updates :)

ngxson (Collaborator) commented Mar 30, 2025

> when calling llama-cli, it reconstructs this dialog each time

No, that is not the case; the way llama-cli works is to keep already-processed tokens as-is.

From what you said, I imagine the expected behavior is as follows. For the first turn, we will have:

... Ассистент:[SEP]

When it is done generating:

... Ассистент: the response</s>

Then for the next turn, we have:

... Ассистент: the response ... Ассистент:[SEP]

But in reality, this is what you will get in llama-cli, because we can't go back and modify past tokens:

... Ассистент:[SEP] the response</s> ... Ассистент:[SEP]

Currently, only llama-server can handle this case, but it will invalidate the KV cache for the last assistant response, because we now have to go back and modify a single token (which is very inefficient).

So, please, just use an existing chat template next time. If you don't know how chat templates work and try to invent a new one like this, you risk degrading both performance and quality a lot!

ngxson (Collaborator) commented Mar 30, 2025

I will approve & merge this for now, but please note that the quality and performance will be degraded compared to other models (as explained above).

Let's hope that next time we can reuse one of the existing templates, so your model will "just work" out of the box with the best performance.

For example, with ChatML, you can simply do:

<|im_start|>Пользователь\n{user_message}<|im_end|>\n<|im_start|>Ассистент\n{assistant_message}<|im_end|>...
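
To make that concrete, here is a minimal sketch of a ChatML-style formatter (illustrative only; it reuses the chat_msg struct from the sketch earlier in the thread, and standard English ChatML role names are assumed, though Пользователь/Ассистент would work the same way):

```cpp
// Illustrative sketch of a ChatML formatter, not llama.cpp's implementation.
static std::string format_chatml(const std::vector<chat_msg> & chat, bool add_generation_prompt) {
    std::ostringstream ss;
    for (const auto & m : chat) {
        ss << "<|im_start|>" << m.role << "\n" << m.content << "<|im_end|>\n";
    }
    if (add_generation_prompt) {
        // no trailing space to worry about, and <|im_end|> gives a clean EOT
        ss << "<|im_start|>assistant\n";
    }
    return ss.str();
}
```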

ngxson merged commit 7242dd9 into ggml-org:master on Mar 30, 2025
48 checks passed
vorobyov01 (Contributor, Author) commented:

Ok thanks a lot!
We will inform users that performance may suffer due to a non-standard template.
