Commit 2d219b3

vocab : ignore invalid UTF-8 input in the BPE tokenizer (#11729)
Silently insert one or more U+FFFD (Unicode replacement character) code points instead, until the next valid code point can be found. This fixes `llama_tokenize` throwing an exception across the C API boundary or libllama's module boundary (the caller's runtime might be incompatible!). Returning a proper error code might be desirable; however, the signature of `llama_tokenize` doesn't allow it, as all return values already have an existing meaning.
1 parent 333820d commit 2d219b3

1 file changed: +8 -1 lines changed
src/unicode.cpp

Lines changed: 8 additions & 1 deletion
@@ -618,7 +618,14 @@ std::vector<uint32_t> unicode_cpts_from_utf8(const std::string & utf8) {
     result.reserve(utf8.size());
     size_t offset = 0;
     while (offset < utf8.size()) {
-        result.push_back(unicode_cpt_from_utf8(utf8, offset));
+        try {
+            result.push_back(unicode_cpt_from_utf8(utf8, offset));
+        }
+        catch (const std::invalid_argument & /*ex*/) {
+            // Silently ignore invalid UTF-8 input to avoid leaking the exception beyond llama_tokenize
+            ++offset;
+            result.emplace_back(0xFFFD); // replacement character
+        }
     }
     return result;
 }
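
The effect of this change can be sketched in isolation. The block below is a minimal, self-contained illustration and not the libllama code: `decode_one` is a hypothetical, simplified stand-in for `unicode_cpt_from_utf8` (it does not reject overlong encodings or surrogates), and `decode_lossy` is a hypothetical name for a loop shaped like the patched `unicode_cpts_from_utf8`. It assumes, as the patch does, that the decoder leaves `offset` unchanged when it throws.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical simplified decoder (illustration only, not libllama's).
// Decodes one code point starting at `offset` and advances `offset` past it.
// Throws std::invalid_argument on malformed input, leaving `offset` untouched.
static uint32_t decode_one(const std::string & s, size_t & offset) {
    const unsigned char c0 = static_cast<unsigned char>(s[offset]);
    size_t   len = 0;
    uint32_t cpt = 0;
    if (c0 < 0x80)                { len = 1; cpt = c0; }
    else if ((c0 & 0xE0) == 0xC0) { len = 2; cpt = c0 & 0x1F; }
    else if ((c0 & 0xF0) == 0xE0) { len = 3; cpt = c0 & 0x0F; }
    else if ((c0 & 0xF8) == 0xF0) { len = 4; cpt = c0 & 0x07; }
    else throw std::invalid_argument("invalid lead byte");

    if (offset + len > s.size()) {
        throw std::invalid_argument("truncated sequence");
    }
    for (size_t i = 1; i < len; ++i) {
        const unsigned char c = static_cast<unsigned char>(s[offset + i]);
        if ((c & 0xC0) != 0x80) {
            throw std::invalid_argument("invalid continuation byte");
        }
        cpt = (cpt << 6) | (c & 0x3F);
    }
    offset += len; // only reached when the whole sequence was valid
    return cpt;
}

// Same loop shape as the patched function: on a decode failure, skip one
// byte, emit U+FFFD, and retry from the next byte.
static std::vector<uint32_t> decode_lossy(const std::string & s) {
    std::vector<uint32_t> result;
    result.reserve(s.size());
    size_t offset = 0;
    while (offset < s.size()) {
        try {
            result.push_back(decode_one(s, offset));
        } catch (const std::invalid_argument &) {
            ++offset;
            result.push_back(0xFFFD); // replacement character
        }
    }
    return result;
}
```

With this sketch, the three bytes `'a'`, `0xFF`, `'b'` decode to `0x61, 0xFFFD, 0x62`, and the truncated sequence `0xE2 0x82` decodes to two U+FFFD code points, one per skipped byte. Because the exception is caught inside the loop, nothing propagates to the caller.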
