
Add option to ignore tokens with 2+ English characters #8279


Closed

Conversation

hopto-dot

This PR adds an option called ignore_english_tokens. When enabled, it attempts to prevent the model from sampling tokens that contain two or more English characters, unless they include angle brackets (for example <EOS>).

The intention behind this option is to stop multilingual LLMs from generating text with English mixed in.

For example, when prompted to respond in Japanese, instead of generating:

彼は milk を買いに店に行った。 (He went to the shop to buy *milk*.)

The LLM should now correctly generate:

彼はミルクを買いに店に行った。 (He went to the shop to buy milk.)

I am not well-versed in C++ and haven't tested whether the code works, as I'm unsure how to do this on Windows. I mainly made this PR to propose the feature and tried my best.

That said, I did test a workaround for this option not existing by generating a value for the logit_bias argument. I wrote a Python script that reads through a model's tokenizer.json and assigns every English token ID a bias of -100 to ban it. This fixes the mixed-language issue, but no frontend can do this automatically for an arbitrary model, because this is a backend-level problem by nature.
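The workaround can be sketched roughly as follows. This is a hypothetical `english_token_bias` helper, not the author's actual script; the tokenizer.json vocabulary layout and the exact bias handling are assumptions, and a toy vocabulary stands in for a real file:

```python
def english_token_bias(vocab, bias=-100.0):
    """Build a logit_bias list banning tokens with 2+ ASCII letters.

    vocab: mapping of token string -> token id, as found under
    model.vocab in a typical tokenizer.json (assumed layout).
    """
    banned = []
    for token, token_id in vocab.items():
        if "<" in token and ">" in token:
            continue  # keep special tokens such as <EOS>
        letters = sum(1 for c in token if ("a" <= c <= "z") or ("A" <= c <= "Z"))
        if letters >= 2:
            banned.append([token_id, bias])
    return banned

# Toy vocabulary in place of a real tokenizer.json:
vocab = {"彼": 0, "milk": 1, "<EOS>": 2, "ミルク": 3, "to": 4, "a": 5}
print(english_token_bias(vocab))  # → [[1, -100.0], [4, -100.0]]
```

Single-letter tokens like "a" survive the filter, which matches the PR's "2+ English characters" rule.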

@ngxson
Collaborator

ngxson commented Jul 3, 2024

Good idea, but I suspect it may not work in all cases (e.g. Unicode?)

Maybe you should try using grammar?

@hopto-dot
Author

hopto-dot commented Jul 3, 2024

Thank you for your suggestion!

I wrote a gbnf file to ban English words:

root ::= (non-english | angle-bracket-token)+
non-english ::= [^\u0041-\u005A\u0061-\u007A]+
angle-bracket-token ::= "<" [^>]* ">"

Then tested it on llama-cli.exe with this command:

llama-cli.exe -ngl 43 -m "...\gemma-2-9b-it-Q8_0_L.gguf" --grammar-file "...\non-english.gbnf" -p "What is your name?<end_of_turn>\n<start_of_turn>model"

and it works: no English words, primarily Chinese and Japanese.
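For anyone following along, the character classes in the grammar above can be sanity-checked with an equivalent Python regex. This only illustrates the intended language of the grammar; it is not how llama.cpp evaluates GBNF:

```python
import re

# root ::= (non-english | angle-bracket-token)+, where
#   non-english matches any run of characters outside A-Z / a-z, and
#   angle-bracket-token matches "<" ... ">" special tokens.
GRAMMAR_RE = re.compile(r"^(?:[^A-Za-z]+|<[^>]*>)+$")

assert GRAMMAR_RE.match("彼はミルクを買いに店に行った。")       # allowed
assert GRAMMAR_RE.match("<end_of_turn>")                        # special token allowed
assert not GRAMMAR_RE.match("彼は milk を買いに店に行った。")   # English rejected
```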

However, when I run the same command (without the prompt argument) for llama-server.exe and then interact with the model through my frontend, the outputs aren't affected by the grammar file at all.

What do you suggest I do? I wanted this to work on the llama-server build with any frontend.

@ExtReMLapin
Contributor

Why not just use a grammar?

@ngxson
Collaborator

ngxson commented Jul 4, 2024

@hopto-dot I don't think there is a way to set a global grammar for the server. You need to add the grammar to each request:

{
  "messages": ...,
  "grammar": "root ::= (non-english | angle-bracket-token)+\nnon-english ::= [^\\u0041-\\u005A\\u0061-\\u007A]+\nangle-bracket-token ::= \"<\" [^>]* \">\""
}
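A minimal sketch of such a per-request grammar from Python, assuming a llama-server instance on localhost:8080 and its /completion endpoint; the send step is commented out so the snippet runs offline:

```python
import json

GRAMMAR = (
    'root ::= (non-english | angle-bracket-token)+\n'
    'non-english ::= [^\\u0041-\\u005A\\u0061-\\u007A]+\n'
    'angle-bracket-token ::= "<" [^>]* ">"'
)

# The grammar travels with every request body:
payload = json.dumps({
    "prompt": "What is your name?",
    "n_predict": 64,
    "grammar": GRAMMAR,
})
print(payload)

# To actually send it (assuming a local server):
#   import urllib.request
#   req = urllib.request.Request(
#       "http://localhost:8080/completion",
#       data=payload.encode(),
#       headers={"Content-Type": "application/json"},
#   )
#   print(urllib.request.urlopen(req).read().decode())
```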

@hopto-dot
Author

@hopto-dot I don't think there is a way to set a global grammar for the server. You need to add the grammar to each request:

@ngxson In that case, wouldn't it be a good idea to make --grammar-file actually do something for llama-server, since the argument exists? Otherwise it shouldn't exist.

@ngxson
Collaborator

ngxson commented Jul 4, 2024

llama-server does read the grammar if you specify one; it's just not picked up by the slots.

You can patch launch_slot_with_task so that the slot takes the grammar from server_context.params

@ggerganov ggerganov added the demo Demonstrate some concept or idea, not intended to be merged label Jul 4, 2024
@kristaller486

I think it would be better to increase the probability of non-English tokens instead of decreasing that of the English tokens. And it would be a good idea to do this per language, e.g. separately for Japanese, Chinese, Spanish, Russian, French, etc. I suggest finding a subset of tokens for each language by analyzing clean multilingual datasets such as Uncyclopedia/Uncyclosource.

bool has_angle_bracket = false;

for (char c : token_str) {
    if (c >= 'a' && c <= 'z') {
Collaborator

@HanClinto HanClinto Jul 9, 2024

I agree that grammars are the proper way to address this task, but a quick comment to note that if one is going to take this approach, one should probably also check for upper-case characters.

Suggested change
-    if (c >= 'a' && c <= 'z') {
+    if ((c >= 'a' && c <= 'z') ||
+        (c >= 'A' && c <= 'Z')) {

Edit: Just realized that the grammar file that hopto-dot created already takes care of this. Never mind me! :)

@HanClinto
Collaborator

I think it would be better to increase the probability of non-English tokens instead of decreasing that of the English tokens. And it would be a good idea to do this per language, e.g. separately for Japanese, Chinese, Spanish, Russian, French, etc. I suggest finding a subset of tokens for each language by analyzing clean multilingual datasets such as Uncyclopedia/Uncyclosource.

Are you talking about constraining a model's output at the time of inference, or fine-tuning an LLM's weights?

This solution's approach is primarily character-based. For languages that share much of their alphabet with English (Spanish, French, German, etc.), I don't think this token-disqualification approach will work, because those languages too often use letters from the ASCII range (which is all this solution can look for).

That said, I think you have a VERY good idea regarding building grammars for each of the specified alphabets. Various alphabets often have well-structured Unicode character ranges, and one could define a grammar to require that an output be Korean-only, or to generate Japanese without Kanji, etc. I think having some stock grammars for that in our example grammar folder could be really helpful!
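As a sketch of what one such per-alphabet constraint could look like, here is a hypothetical Python check built on the standard Unicode Hangul blocks (Hangul Syllables U+AC00–U+D7A3, Hangul Jamo U+1100–U+11FF); a real stock grammar would encode the same code-point ranges as GBNF character classes instead:

```python
def is_hangul_only(text, allow=" .,!?"):
    """Return True if text uses only Hangul plus a small punctuation allowlist."""
    for ch in text:
        if ch in allow:
            continue
        cp = ord(ch)
        if 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF:
            continue  # Hangul Syllables or Hangul Jamo
        return False
    return True

print(is_hangul_only("안녕하세요!"))   # True
print(is_hangul_only("안녕 hello"))    # False
```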

@HanClinto
Collaborator

llama-server does read the grammar if you specify one; it's just not picked up by the slots.

You can patch launch_slot_with_task so that the slot takes the grammar from server_context.params

This is a great idea! I wasn't entirely sure what it would take to do this, but I gave it a shot in #8402 and -- unless I missed something -- it wound up being simpler than I expected. Is that what you were imagining?

@hopto-dot, would that PR work for your usage?

@hopto-dot
Author

hopto-dot commented Jul 10, 2024

This is a great idea! I wasn't entirely sure what it would take to do this, but I gave it a shot in #8402 and -- unless I missed something -- it wound up being simpler than I expected. Is that what you were imagining?

@hopto-dot, would that PR work for your usage?

@HanClinto Hello, sorry for the inactivity in this issue. Yes, the solution you proposed works perfectly and I see it has been merged. Thank you!

@HanClinto
Collaborator

@HanClinto Hello, sorry for the inactivity in this issue. Yes, the solution you proposed works perfectly and I see it has been merged. Thank you!

No problem at all -- thank you for bringing this up -- I think this is going to be a good change! :)

@hopto-dot hopto-dot closed this Jul 10, 2024