-
Notifications
You must be signed in to change notification settings - Fork 250
Rm tiktoken flag #424
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rm tiktoken flag #424
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not manage which tokenizer is required in the models.json file?
If we detect a mismatch we can message the user and suggest the tokenizer to use along with details on how to fix their command/request
raise RuntimeError( | ||
f"model-specified tokenizer ({tokenizer_setting_to_name(use_tiktoken)} does not match provided tokenizer ({tokenizer_setting_to_name(is_tiktoken)} for {model_description}" | ||
"test" # f"model-specified tokenizer ({tokenizer_setting_to_name(model.config.use_tiktoken)} does not match provided tokenizer ({tokenizer_setting_to_name(self.is_tiktoken)} for {model_description}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add details to tell the user how to resolve the issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'll uncomment this in a followup PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put another way: "you gave me the wrong tokenizer, give me the right one. you must specify the tokenizer that corresponds to this model, not some random file you picked up from the side of the road.". It's hard to put it clearly, and politely
I wrote that last night. That's #409. It's an unmitigated mess because it forces everything to be done in the same spot. The linter lights up red, it drags modification through the entire stack, unsavoury. This way doesn't force us to delay loading the the tokenizer during model load. |
try: | ||
from tokenizer.tiktoken import Tokenizer as TiktokenTokenizer | ||
|
||
self.t = TiktokenTokenizer( | ||
model_path=str(self.tokenizer_path) | ||
) | ||
self.is_tiktoken = True | ||
self.is_sentencepiece = False | ||
return | ||
except: | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will indeed work, but this will be somewhat slow. Is there a way to check some magic from a file or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, is tiktoken is a simple space separate base64 token + token number
For example, token 125355 is encoded as INGF0LDRgNCw0LrRgtC10YDQuNGB0YLQuNC60Lg=
and
% echo INGF0LDRgNCw0LrRgtC10YDQuNGB0YLQuNC60Lg= | base64 -d
характеристики%
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's very charakterisiki of modern formats lolz. Or uunet of the days long gone.
Def was my first choice, but we're running out of time. I wish file $filename
worked on these things!
Def should write a checker, although - the risk is that the format changes and we end up having to chase implementations. It's optimized for llama3 (there's no overhead for that one).
* remove need for toktoken flag * can't pass self to a function
* remove need for toktoken flag * can't pass self to a function
* remove need for toktoken flag * can't pass self to a function
* remove need for toktoken flag * can't pass self to a function
* remove need for toktoken flag * can't pass self to a function
* remove need for toktoken flag * can't pass self to a function
remove the need for tiktoken flag