
common : allow raw byte in SPM vocabs; don't crash if newline token is not found #5478


Merged (2 commits) Feb 13, 2024

Conversation

@akx (Contributor) commented Feb 13, 2024

This makes, for example, a GGUF quantization of Finnish-NLP/llama-7b-finnish-instruct-v0.2 load without crashing.

Sure, it would be better if the model's vocabulary had the <0x0A> token to begin with...

#5477 is related (written while I was hunting this down).

@akx akx marked this pull request as draft February 13, 2024 11:00
@akx akx changed the title from "common : don't crash if newline token is not found" to "common : allow raw byte in SPM vocabs; don't crash if newline token is not found" Feb 13, 2024
Comment on lines +7722 to +7760
    // Try to fall back to just the byte as a string
    const char buf2[2] = { (char)ch, 0 };
    return vocab.token_to_id.at(buf2);
Member
Does this fallback work for the Finnish model that you are trying?

@akx (Contributor, Author) Feb 13, 2024

Yep! @R4ZZ3, who's been involved with training the model, pointed out that tokenizer.json has

    {
      "id": 64261,
      "content": "\n",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    }

@akx akx marked this pull request as ready for review February 13, 2024 13:30
@akx akx force-pushed the spm-no-linefeed-hack branch from dce9bc9 to 72b353f February 13, 2024 15:20
@akx akx requested a review from ggerganov February 13, 2024 15:21
@ggerganov ggerganov merged commit c4e6dd5 into ggml-org:master Feb 13, 2024
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
…#5478)

* common : don't crash if newline token is not found

* common : llama_byte_to_token: allow falling back to finding just the token byte in SPM vocabs

hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
…#5478)

* common : don't crash if newline token is not found

* common : llama_byte_to_token: allow falling back to finding just the token byte in SPM vocabs
2 participants