Add option to ignore tokens with 2+ English characters #8279
Conversation
Good idea, but I suspect it may not work in all cases (e.g. Unicode?). Maybe you should try using a grammar?
Thank you for your suggestion! I wrote a GBNF file to ban English words.
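A minimal sketch of what such a file might look like (the original grammar is not shown in the thread, so this is only an illustration, relying on GBNF's support for negated character classes):

```gbnf
# Illustrative only: accept any character except ASCII letters,
# which effectively bans English words from the output.
root    ::= non-eng+
non-eng ::= [^a-zA-Z]
```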
Then I tested it, and it works: no English words, primarily Chinese and Japanese. However, when I try the same command (without the prompt argument) for llama-server, it has no effect. What do you suggest I do? I want this to work on the llama-server build with any frontend.
Why not just use a grammar?
@hopto-dot I don't think there is a way to set a global grammar for the server. You need to add the grammar to each request:
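For illustration, llama-server's `/completion` endpoint accepts a `grammar` field in the request body (the prompt and grammar string below are placeholders, not the snippet from the original comment):

```json
{
  "prompt": "日本語で自己紹介してください。",
  "n_predict": 128,
  "grammar": "root ::= [^a-zA-Z]+"
}
```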
@ngxson In that case, wouldn't it be a good idea to make the grammar command-line option apply to llama-server as well?
llama-server does read the grammar if you specify one; it's just not picked up by each slot. You can patch the server so the slots use it.
I think it would be better to increase the probability of non-English tokens instead of decreasing the English ones. It would also be a good idea to do this per selected language, e.g. separately for Japanese, Chinese, Spanish, Russian, French, etc. I suggest finding a subset of tokens for each language by analyzing clean multilingual datasets such as Uncyclopedia/Uncyclosource.
```cpp
bool has_angle_bracket = false;

for (char c : token_str) {
    if (c >= 'a' && c <= 'z') {
        // ...
```
I agree that grammars are the proper way to address this task, but a quick note that if one is going to take this approach, one should probably check for upper-case characters as well.
Suggested change:

```diff
-    if (c >= 'a' && c <= 'z') {
+    if ((c >= 'a' && c <= 'z') ||
+        (c >= 'A' && c <= 'Z')) {
```
Edit: Just realized that the grammar file that hopto-dot created already takes care of this. Nevermind me! :)
Are you talking about constraining a model's output at inference time, or fine-tuning an LLM's weights? This solution's approach is primarily character-based. For languages that share much of an alphabet with English (languages like Spanish, French, or German), I don't think this token-disqualification approach will work, because those languages too often use letters from the ASCII range (which is really all that this solution can look for). That said, I think you have a VERY good idea re: building grammars for each of the specified alphabets. Alphabets often occupy well-structured Unicode character ranges, and one could define a grammar to require that an output be Korean-only, or to generate Japanese without Kanji, etc. I think having some stock grammars for that in our example grammar folder could be really helpful!
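As an illustration of the stock-grammar idea, a "Japanese without Kanji" grammar might look like the sketch below (this is not an existing file; the character ranges are the standard Unicode blocks for hiragana, katakana, and CJK punctuation):

```gbnf
# Illustrative "Japanese without Kanji" grammar.
root        ::= jp-char+
jp-char     ::= hiragana | katakana | punctuation
hiragana    ::= [ぁ-ゟ]   # U+3041 to U+309F
katakana    ::= [ァ-ヿ]   # U+30A1 to U+30FF
punctuation ::= [、-〾]   # U+3001 to U+303E
```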
This is a great idea! I wasn't entirely sure what it would take to do this, but I gave it a shot in #8402 and -- unless I missed something -- it wound up being simpler than I expected. Is that what you were imagining? @hopto-dot, would that PR work for your usage?
@HanClinto Hello, sorry for the inactivity in this issue. Yes, the solution you proposed works perfectly, and I see it has been merged. Thank you!
No problem at all -- thank you for bringing this up -- I think this is going to be a good change! :)
This PR adds an option called `ignore_english_tokens`. When enabled, it attempts to prevent the model from sampling tokens that contain two or more English characters, unless they include angle brackets (for example `<EOS>`). The intention behind this option is to stop multilingual LLMs from generating text with English mixed in.
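In outline, the described check amounts to something like the following (a sketch of the idea, not the PR's exact code; the function name is made up):

```cpp
#include <string>

// Returns true if the token should be suppressed: it contains two or more
// ASCII letters and no angle bracket (so control tokens like <EOS> survive).
static bool should_ignore_token(const std::string & token_str) {
    int  n_english         = 0;
    bool has_angle_bracket = false;
    for (char c : token_str) {
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) {
            n_english++;
        } else if (c == '<' || c == '>') {
            has_angle_bracket = true;
        }
    }
    return n_english >= 2 && !has_angle_bracket;
}
```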
For example, when prompted to respond in Japanese, instead of generating a reply with English words mixed in, the LLM should now generate purely Japanese text.
I am not well-versed in C++ and haven't tested whether the code works, as I'm unsure how to do this on Windows; I mainly made this PR to propose the feature and tried my best.
That said, I did test a workaround to this option not existing by generating a value for the `logit_bias` argument. I wrote a Python script to read through the `tokenizer.json` of a model, then assign every English token ID a bias of -100 to ban it. This stops the mixed-language issue, but no frontend can do this automatically for any model, because it is inherently a backend-side problem.
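The script itself is not included in the thread; a minimal sketch of the approach (assuming a Hugging Face style `tokenizer.json` and llama-server's `logit_bias` request format, with the angle-bracket exemption from the option described above) might look like:

```python
import json
import re

# Illustrative reconstruction of the workaround: read a model's
# tokenizer.json and build a logit_bias list that bans every token
# containing two or more ASCII letters, sparing angle-bracket tokens.
with open("tokenizer.json", encoding="utf-8") as f:
    tokenizer = json.load(f)

vocab = tokenizer["model"]["vocab"]  # maps token string -> token id

logit_bias = [
    [token_id, -100]
    for token, token_id in vocab.items()
    if len(re.findall(r"[A-Za-z]", token)) >= 2
    and "<" not in token and ">" not in token
]

# The list can then be passed as the "logit_bias" field of a
# llama-server /completion request.
print(json.dumps({"logit_bias": logit_bias}))
```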