
common : allow raw byte in SPM vocabs; don't crash if newline token is not found #5478


Merged (2 commits) Feb 13, 2024

Conversation

@akx (Contributor) commented Feb 13, 2024

This makes, for example, a GGUF quantization of Finnish-NLP/llama-7b-finnish-instruct-v0.2 load without crashing.

Sure, it would be better if the model's vocabulary had the <0x0A> token to begin with...

#5477 is related (written while I was hunting this down).

@akx akx marked this pull request as draft February 13, 2024 11:00
@akx akx changed the title from "common : don't crash if newline token is not found" to "common : allow raw byte in SPM vocabs; don't crash if newline token is not found" Feb 13, 2024
Comment on lines +7722 to +7760
    // Try to fall back to just the byte as a string
    const char buf2[2] = { (char)ch, 0 };
    return vocab.token_to_id.at(buf2);
Member
Does this fallback work for the Finnish model that you are trying?

@akx (Contributor, Author) Feb 13, 2024

Yep! @R4ZZ3, who's been involved with training the model, pointed out that tokenizer.json has

    {
      "id": 64261,
      "content": "\n",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    }

@akx akx marked this pull request as ready for review February 13, 2024 13:30
@akx akx force-pushed the spm-no-linefeed-hack branch from dce9bc9 to 72b353f February 13, 2024 15:20
@akx akx requested a review from ggerganov February 13, 2024 15:21
@ggerganov ggerganov merged commit c4e6dd5 into ggml-org:master Feb 13, 2024
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
…#5478)

* common : don't crash if newline token is not found

* common : llama_byte_to_token: allow falling back to finding just the token byte in SPM vocabs

hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
…#5478)

* common : don't crash if newline token is not found

* common : llama_byte_to_token: allow falling back to finding just the token byte in SPM vocabs
2 participants