llama : sanitize tokens in the upper bound #9359


Merged

slaren merged 1 commit into master on Sep 8, 2024

Conversation

@slaren (Member) commented Sep 7, 2024

cont #9357

@julmb commented Sep 8, 2024

I don't know if it is within the scope of this PR, but it might also be worth looking at this:
https://github.com/ggerganov/llama.cpp/blob/a5b5d9a1014248f385939e6739e9db9de7147e55/src/llama-vocab.cpp#L1700-L1712

It seems that the cache lookup is not protected by a bounds check. At least, I am getting segmentation faults when passing negative or too-large tokens to llama_token_to_piece.
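
For concreteness, this is the kind of call being described (a hypothetical repro sketch; model loading is omitted, model is assumed to be a valid llama_model pointer, and it assumes the llama_token_to_piece signature in llama.h at the time of this PR):

// hypothetical repro of the report above: an out-of-range token id
char buf[64];
llama_token bad = llama_n_vocab(model) + 100; // or any negative value
int32_t n = llama_token_to_piece(model, bad, buf, sizeof(buf), /*lstrip=*/0, /*special=*/true);
// without a bounds check inside the library, calls like this can read
// out of bounds and segfault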

@julmb commented Sep 8, 2024

Actually, I think it might be llama_token_to_piece_impl calling llama_token_get_attr_impl (also not protected by bounds checking) that is causing the segfault, and not the cache lookup:

llama_token_attr llama_token_get_attr_impl(const struct llama_vocab & vocab, llama_token token) {
    GGML_ASSERT(vocab.type != LLAMA_VOCAB_TYPE_NONE);
    // operator[] performs no bounds check, so a negative or too-large
    // token id reads out of bounds here
    return vocab.id_to_token[token].attr;
}
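
For comparison, a bounds-checked variant could reject invalid token ids before indexing. This is only a sketch of the idea, not necessarily the exact change this PR series makes (LLAMA_TOKEN_ATTR_UNDEFINED is the zero value of the llama_token_attr enum in llama.h):

llama_token_attr llama_token_get_attr_impl(const struct llama_vocab & vocab, llama_token token) {
    GGML_ASSERT(vocab.type != LLAMA_VOCAB_TYPE_NONE);
    // sketch: validate the id so negative or too-large tokens cannot
    // index past the end of id_to_token
    if (token < 0 || (size_t) token >= vocab.id_to_token.size()) {
        return LLAMA_TOKEN_ATTR_UNDEFINED;
    }
    return vocab.id_to_token[token].attr;
}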

The cache lookup uses std::vector<T,Allocator>::at, which does bounds checking (although I am not sure what happens to C++ exceptions when the function is called from C).
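
On the exception question: an exception that escapes through the C API into a C caller cannot be caught there, and in practice the process terminates via std::terminate. A defensive pattern, shown here as a self-contained sketch rather than code from llama.cpp, is to catch std::out_of_range at the boundary and return an error code instead:

#include <cstdint>
#include <stdexcept>
#include <vector>

static std::vector<int32_t> cache = {10, 20, 30}; // stand-in for the piece cache

// hypothetical C-facing lookup: vector::at throws on a bad index (a negative
// token converts to a huge size_t), and the catch keeps the exception from
// crossing into C code
extern "C" int32_t lookup_checked(int32_t token) {
    try {
        return cache.at(token);
    } catch (const std::out_of_range &) {
        return -1; // report failure to the C caller instead of terminating
    }
}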

@slaren (Member, Author) commented Sep 8, 2024

I think it would be good to add bounds checking to more functions. The positions in embedding models also need to be bounds-checked; currently it will still crash if the sequence is longer than the model supports. However, let's do that in separate PRs.
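
As an illustration of the position check described here (a hypothetical helper, not part of this PR; llama_batch and its pos array come from llama.h, and n_ctx_train would come from llama_n_ctx_train):

// hypothetical validation: reject batches whose positions exceed the
// context length the model was trained with (assumes batch.pos is populated)
bool batch_positions_ok(const llama_batch & batch, int32_t n_ctx_train) {
    for (int32_t i = 0; i < batch.n_tokens; ++i) {
        if (batch.pos[i] < 0 || batch.pos[i] >= n_ctx_train) {
            return false; // out-of-range position for this model
        }
    }
    return true;
}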

slaren merged commit eae5971 into master on Sep 8, 2024 (59 checks passed)
slaren deleted the sl/llama-token-hi branch on September 8, 2024 at 10:41
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024