Rm tiktoken flag #424


Merged
merged 2 commits into from
Apr 23, 2024

Conversation

mikekgfb
Contributor

Remove the need for the tiktoken flag.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 23, 2024
Contributor

@byjlw left a comment


Why not manage which tokenizer is required in the models.json file?
If we detect a mismatch, we can message the user and suggest the tokenizer to use, along with details on how to fix their command/request.
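A minimal sketch of that suggestion, assuming a hypothetical models.json schema with a per-model `tokenizer` field (the field name and model entries here are illustrative, not the repo's actual config):

```python
import json

# Hypothetical models.json content; the real schema in the repo may differ.
MODELS_JSON = """
{
  "llama3": {"tokenizer": "tiktoken"},
  "llama2": {"tokenizer": "sentencepiece"}
}
"""


def required_tokenizer(model_name: str) -> str:
    """Look up which tokenizer a model declares in models.json."""
    models = json.loads(MODELS_JSON)
    return models[model_name]["tokenizer"]


def check_tokenizer(model_name: str, provided: str) -> None:
    """Fail early with an actionable message when the provided tokenizer
    does not match what the model declares."""
    expected = required_tokenizer(model_name)
    if provided != expected:
        raise RuntimeError(
            f"{model_name} requires a {expected} tokenizer, but a {provided} "
            f"tokenizer was provided; pass the tokenizer file that shipped "
            f"with {model_name}."
        )
```

The error message doubles as the "how to fix it" hint the review asks for.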

raise RuntimeError(
    f"model-specified tokenizer ({tokenizer_setting_to_name(use_tiktoken)}) does not match provided tokenizer ({tokenizer_setting_to_name(is_tiktoken)}) for {model_description}"
    "test"  # f"model-specified tokenizer ({tokenizer_setting_to_name(model.config.use_tiktoken)}) does not match provided tokenizer ({tokenizer_setting_to_name(self.is_tiktoken)}) for {model_description}"
)
Contributor


can we add details to tell the user how to resolve the issue?

Contributor Author


Yes, I'll uncomment this in a followup PR.

Contributor Author

@mikekgfb mikekgfb Apr 23, 2024


Put another way: "you gave me the wrong tokenizer; give me the right one. You must specify the tokenizer that corresponds to this model, not some random file you picked up from the side of the road." It's hard to put that clearly, and politely.

@mikekgfb
Contributor Author

If we detect a mismatch we can message the user and suggest the tokenizer to use along with details on how to fix their command/request

I wrote that last night; that's #409. It's an unmitigated mess because it forces everything to be done in the same spot: the linter lights up red, and it drags modifications through the entire stack. Unsavoury.
I already had that code, so just pushing it would have been an easier solution than writing a new one.

This way doesn't force us to delay loading the tokenizer during model load.

Comment on lines +147 to +157
try:
    from tokenizer.tiktoken import Tokenizer as TiktokenTokenizer

    self.t = TiktokenTokenizer(
        model_path=str(self.tokenizer_path)
    )
    self.is_tiktoken = True
    self.is_sentencepiece = False
    return
except Exception:
    # Not a tiktoken vocab file; fall through to the sentencepiece path.
    pass
Contributor


This will indeed work, but it will be somewhat slow. Is there a way to check some magic bytes in the file or something?
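A magic-style check is possible without constructing the tokenizer at all: since (per the next comment) a tiktoken vocab file is just space-separated `base64-token rank` lines, a sniffer can validate the first few lines. A sketch under that assumption, not the repo's actual detection code:

```python
import base64


def looks_like_tiktoken(lines: list[str]) -> bool:
    """Heuristic sniff of the tiktoken text format: every non-empty line
    should be '<base64 token> <integer rank>'."""
    if not lines:
        return False
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            return False
        token_b64, rank = parts
        try:
            # validate=True rejects non-base64 characters outright.
            base64.b64decode(token_b64, validate=True)
            int(rank)
        except ValueError:
            return False
    return True
```

Reading only the first handful of lines keeps the check O(1) regardless of vocab size, which addresses the "somewhat slow" concern.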

Contributor

@malfet malfet Apr 23, 2024


Wow, tiktoken is just a simple space-separated base64 token + token number format.

For example, token 125355 is encoded as INGF0LDRgNCw0LrRgtC10YDQuNGB0YLQuNC60Lg= and

% echo INGF0LDRgNCw0LrRgtC10YDQuNGB0YLQuNC60Lg= | base64  -d  
 характеристики%             
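The same round-trip in Python, assuming only that a tiktoken vocab line is `base64(token) rank` as shown above:

```python
import base64

line = "INGF0LDRgNCw0LrRgtC10YDQuNGB0YLQuNC60Lg= 125355"
token_b64, rank = line.split()

# Decode the base64 payload back into the raw token bytes;
# the token carries a leading space, as is common for BPE vocabularies.
token = base64.b64decode(token_b64).decode("utf-8")
print(rank, repr(token))
```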

Contributor Author


That's very charakteristiki of modern formats, lolz. Or uuencode of the days long gone.
Checking magic was definitely my first choice, but we're running out of time. I wish file $filename worked on these things!

We definitely should write a checker, although the risk is that the format changes and we end up having to chase implementations. It's optimized for llama3 (there's no overhead for that one).

@mikekgfb mikekgfb merged commit 1c44ad1 into main Apr 23, 2024
@malfet malfet deleted the rm_tiktoken_flag branch April 30, 2024 16:50
malfet pushed a commit that referenced this pull request Jul 17, 2024
* remove need for toktoken flag

* can't pass self to a function