
Add option to ignore tokens with 2+ English characters #8279


Closed

Conversation

hopto-dot

This PR adds an option called ignore_english_tokens. When enabled, it attempts to prevent the model from sampling tokens that contain two or more English characters, unless they include angle brackets (for example <EOS>).

The intention behind this option is to stop multilingual LLMs from generating text with English mixed in.

For example, when prompted to respond in Japanese, instead of generating:

彼は milk を買いに店に行った。 (He went to the shop to buy *milk*.)

The LLM should now correctly generate:

彼はミルクを買いに店に行った。 (He went to the shop to buy milk.)

I am not well-versed in C++ and haven't tested whether the code works, as I'm unsure how to do this on Windows. I mainly made this PR to propose the feature and tried my best.

That said, I did test a workaround for this option not existing by generating a value for the logit_bias argument. I wrote a Python script that reads through a model's tokenizer.json and assigns every English token ID a bias of -100 to ban it. This fixes the mixed-language issue, but no frontend can do this automatically for an arbitrary model, because this is a backend-level problem by nature.
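The workaround can be sketched roughly as follows. This is a hypothetical `english_token_bias` helper, not the author's actual script; the tokenizer.json vocabulary layout and the exact bias handling are assumptions, and a toy vocabulary stands in for a real file:

```python
def english_token_bias(vocab, bias=-100.0):
    """Build a logit_bias list banning tokens with 2+ ASCII letters.

    vocab: mapping of token string -> token id, as found under
    model.vocab in a typical tokenizer.json (assumed layout).
    """
    banned = []
    for token, token_id in vocab.items():
        if "<" in token and ">" in token:
            continue  # keep special tokens such as <EOS>
        letters = sum(1 for c in token if ("a" <= c <= "z") or ("A" <= c <= "Z"))
        if letters >= 2:
            banned.append([token_id, bias])
    return banned

# Toy vocabulary in place of a real tokenizer.json:
vocab = {"彼": 0, "milk": 1, "<EOS>": 2, "ミルク": 3, "to": 4, "a": 5}
print(english_token_bias(vocab))  # → [[1, -100.0], [4, -100.0]]
```

Single-letter tokens like "a" survive the filter, which matches the PR's "2+ English characters" rule.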

@ngxson
Collaborator

ngxson commented Jul 3, 2024

Good idea, but I suspect it may not work in all cases (e.g. Unicode?)

Maybe you should try using grammar?

@hopto-dot
Author

hopto-dot commented Jul 3, 2024

Thank you for your suggestion!

I wrote a gbnf file to ban English words:

root ::= (non-english | angle-bracket-token)+
non-english ::= [^\u0041-\u005A\u0061-\u007A]+
angle-bracket-token ::= "<" [^>]* ">"

Then tested it on llama-cli.exe with this command:

llama-cli.exe -ngl 43 -m "...\gemma-2-9b-it-Q8_0_L.gguf" --grammar-file "...\non-english.gbnf" -p "What is your name?<end_of_turn>\n<start_of_turn>model"

and it works: no English words, primarily Chinese and Japanese.
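For anyone following along, the character classes in the grammar above can be sanity-checked with an equivalent Python regex. This only illustrates the intended language of the grammar; it is not how llama.cpp evaluates GBNF:

```python
import re

# root ::= (non-english | angle-bracket-token)+, where
#   non-english matches any run of characters outside A-Z / a-z, and
#   angle-bracket-token matches "<" ... ">" special tokens.
GRAMMAR_RE = re.compile(r"^(?:[^A-Za-z]+|<[^>]*>)+$")

assert GRAMMAR_RE.match("彼はミルクを買いに店に行った。")       # allowed
assert GRAMMAR_RE.match("<end_of_turn>")                        # special token allowed
assert not GRAMMAR_RE.match("彼は milk を買いに店に行った。")   # English rejected
```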

However, when I run the same command (without the prompt argument) for llama-server.exe and then interact with the model through my frontend, the outputs aren't affected by the grammar file at all.

What do you suggest I do? I wanted this to work on the llama-server build with any frontend.

@ExtReMLapin
Contributor

Why not just use a grammar?

@ngxson
Collaborator

ngxson commented Jul 4, 2024

@hopto-dot I don't think there is a way to set a global grammar for the server. You need to add the grammar to each request:

{
  "messages": ...,
  "grammar": "root ::= (non-english | angle-bracket-token)+\nnon-english ::= [^\\u0041-\\u005A\\u0061-\\u007A]+\nangle-bracket-token ::= \"<\" [^>]* \">\""
}
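A minimal sketch of such a per-request grammar from Python, assuming a llama-server instance on localhost:8080 and its /completion endpoint; the send step is commented out so the snippet runs offline:

```python
import json

GRAMMAR = (
    'root ::= (non-english | angle-bracket-token)+\n'
    'non-english ::= [^\\u0041-\\u005A\\u0061-\\u007A]+\n'
    'angle-bracket-token ::= "<" [^>]* ">"'
)

# The grammar travels with every request body:
payload = json.dumps({
    "prompt": "What is your name?",
    "n_predict": 64,
    "grammar": GRAMMAR,
})
print(payload)

# To actually send it (assuming a local server):
#   import urllib.request
#   req = urllib.request.Request(
#       "http://localhost:8080/completion",
#       data=payload.encode(),
#       headers={"Content-Type": "application/json"},
#   )
#   print(urllib.request.urlopen(req).read().decode())
```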

@hopto-dot
Author

@hopto-dot I don't think there is a way to set a global grammar for the server. You need to add the grammar to each request:

@ngxson In that case, wouldn't it be a good idea to make --grammar-file actually do something for llama-server, since the argument exists? Otherwise it shouldn't exist.

@ngxson
Collaborator

ngxson commented Jul 4, 2024

llama-server does read the grammar if you specify one; it's just not picked up by the slots.

You can patch launch_slot_with_task so that the slot takes the grammar from server_context.params

@ggerganov ggerganov added the demo Demonstrate some concept or idea, not intended to be merged label Jul 4, 2024
@kristaller486

I think it would be better to increase the probability of non-English tokens instead of decreasing that of the English tokens. And it would be a good idea to do this per language, e.g. separately for Japanese, Chinese, Spanish, Russian, French, etc. I suggest finding a subset of tokens for each language by analyzing clean multilingual datasets such as Uncyclopedia/Uncyclosource.

bool has_angle_bracket = false;

for (char c : token_str) {
    if (c >= 'a' && c <= 'z') {
Collaborator

@HanClinto HanClinto Jul 9, 2024

I agree that grammars are the proper way to address this task, but a quick comment to note that if one is going to take this approach, one should probably also check for upper-case characters.

Suggested change
-    if (c >= 'a' && c <= 'z') {
+    if ((c >= 'a' && c <= 'z') ||
+        (c >= 'A' && c <= 'Z')) {

Edit: Just realized that the grammar file that hopto-dot created already takes care of this. Never mind me! :)

@HanClinto
Collaborator

I think it would be better to increase the probability of non-English tokens instead of decreasing that of the English tokens. And it would be a good idea to do this per language, e.g. separately for Japanese, Chinese, Spanish, Russian, French, etc. I suggest finding a subset of tokens for each language by analyzing clean multilingual datasets such as Uncyclopedia/Uncyclosource.

Are you talking about constraining a model's output at the time of inference, or fine-tuning an LLM's weights?

This solution's approach is primarily character-based. For languages that share much of their alphabet with English (Spanish, French, German, etc.), I don't think this token-disqualification approach will work, because those languages too often use letters from the ASCII range (which is all this solution can look for).

That said, I think you have a VERY good idea regarding building grammars for each of the specified alphabets. Various alphabets often have well-structured Unicode character ranges, and one could define a grammar to require that an output be Korean-only, or to generate Japanese without Kanji, etc. I think having some stock grammars for that in our example grammar folder could be really helpful!
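As a sketch of what one such per-alphabet constraint could look like, here is a hypothetical Python check built on the standard Unicode Hangul blocks (Hangul Syllables U+AC00–U+D7A3, Hangul Jamo U+1100–U+11FF); a real stock grammar would encode the same code-point ranges as GBNF character classes instead:

```python
def is_hangul_only(text, allow=" .,!?"):
    """Return True if text uses only Hangul plus a small punctuation allowlist."""
    for ch in text:
        if ch in allow:
            continue
        cp = ord(ch)
        if 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF:
            continue  # Hangul Syllables or Hangul Jamo
        return False
    return True

print(is_hangul_only("안녕하세요!"))   # True
print(is_hangul_only("안녕 hello"))    # False
```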

@HanClinto
Collaborator

llama-server does read the grammar if you specify one; it's just not picked up by the slots.

You can patch launch_slot_with_task so that the slot takes the grammar from server_context.params

This is a great idea! I wasn't entirely sure what it would take to do this, but I gave it a shot in #8402 and -- unless I missed something -- it wound up being simpler than I expected. Is that what you were imagining?

@hopto-dot, would that PR work for your usage?

@hopto-dot
Author

hopto-dot commented Jul 10, 2024

This is a great idea! I wasn't entirely sure what it would take to do this, but I gave it a shot in #8402 and -- unless I missed something -- it wound up being simpler than I expected. Is that what you were imagining?

@hopto-dot, would that PR work for your usage?

@HanClinto Hello, sorry for the inactivity in this issue. Yes, the solution you proposed works perfectly and I see it has been merged. Thank you!

@HanClinto
Collaborator

@HanClinto Hello, sorry for the inactivity in this issue. Yes, the solution you proposed works perfectly and I see it has been merged. Thank you!

No problem at all -- thank you for bringing this up -- I think this is going to be a good change! :)

@hopto-dot hopto-dot closed this Jul 10, 2024