
Commit f2315b7

Remove llama-related stuff from bpe_tokenizer
Pull Request resolved: #4235

We don't need to initialize `vocab_`, `vocab_scores_`, etc. in the constructor; they are initialized anyway while loading the tokenizer binary. A benefit of removing them is that we can drop these llama-related default values and make `bpe_tokenizer` agnostic to the model.

ghstack-source-id: 233578697
Differential Revision: [D59664556](https://our.internmc.facebook.com/intern/diff/D59664556/)
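
For illustration, here is a minimal, hypothetical sketch of why the constructor defaults are redundant: every model-specific member gets populated when the tokenizer binary is loaded. The header layout (vocab size followed by BOS/EOS ids) and the `LoadedTokenizer` type are assumptions for this sketch, not the actual ExecuTorch loader.

```cpp
#include <cstdint>
#include <fstream>
#include <memory>
#include <string>

// Stand-in for the TokenIndex struct declared in bpe_tokenizer.h.
struct TokenIndex {
  const char* str;
  int32_t id;
};

// Hypothetical loader type, for illustration only.
struct LoadedTokenizer {
  int32_t vocab_size_ = 0;
  uint64_t bos_tok_ = 0;
  uint64_t eos_tok_ = 0;
  std::unique_ptr<char*[]> vocab_;
  std::unique_ptr<float[]> vocab_scores_;
  std::unique_ptr<TokenIndex[]> sorted_vocab_;

  bool load(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
      return false;
    }
    // Assumed header layout: vocab size, then BOS/EOS token ids.
    in.read(reinterpret_cast<char*>(&vocab_size_), sizeof(vocab_size_));
    in.read(reinterpret_cast<char*>(&bos_tok_), sizeof(bos_tok_));
    in.read(reinterpret_cast<char*>(&eos_tok_), sizeof(eos_tok_));
    // Buffers are sized from the file, so no model-specific default
    // (e.g. llama2's 32000) is ever needed in the constructor.
    vocab_ = std::make_unique<char*[]>(vocab_size_);
    vocab_scores_ = std::make_unique<float[]>(vocab_size_);
    sorted_vocab_ = std::make_unique<TokenIndex[]>(vocab_size_);
    // ... continue reading the vocab_size_ scores and token strings.
    return static_cast<bool>(in);
  }
};
```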
1 parent: 17d6229

2 files changed: +0 −11 lines

examples/models/llama2/tokenizer/bpe_tokenizer.cpp

Lines changed: 0 additions & 6 deletions
@@ -24,12 +24,6 @@ static int compare_tokens(const void* a, const void* b) {
 }
 
 BPETokenizer::BPETokenizer() : Tokenizer() {
-  vocab_size_ = kDefaultVocabSize;
-  vocab_ = std::make_unique<char*[]>(kDefaultVocabSize);
-  vocab_scores_ = std::make_unique<float[]>(kDefaultVocabSize);
-  sorted_vocab_ = std::make_unique<TokenIndex[]>(kDefaultVocabSize);
-  bos_tok_ = kDefaultBosTokenId;
-  eos_tok_ = kDefaultEosTokenId;
   for (int i = 0; i < 256; i++) {
     byte_pieces_[i * 2] = (unsigned char)i;
     byte_pieces_[i * 2 + 1] = '\0';
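
For context on the loop that remains in the constructor: it builds `byte_pieces_`, a model-independent table of 256 single-byte, NUL-terminated strings (in llama2.c-style tokenizers this backs raw-byte token decoding), which is presumably why it stays while the llama2 defaults go. A standalone sketch of the same idea:

```cpp
#include <cstdio>

int main() {
  // Same layout as byte_pieces_ in the diff above: 2 bytes per entry,
  // the byte value followed by a NUL terminator.
  unsigned char byte_pieces[256 * 2];
  for (int i = 0; i < 256; i++) {
    byte_pieces[i * 2] = (unsigned char)i;
    byte_pieces[i * 2 + 1] = '\0';
  }
  // Entry i is then usable as a one-character C string:
  printf("%s\n", (char*)&byte_pieces['A' * 2]); // prints "A"
  return 0;
}
```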

examples/models/llama2/tokenizer/bpe_tokenizer.h

Lines changed: 0 additions & 5 deletions
@@ -14,11 +14,6 @@
 namespace torch {
 namespace executor {
 
-// Default values for llama2
-constexpr int32_t kDefaultVocabSize = 32000;
-constexpr uint64_t kDefaultBosTokenId = 1;
-constexpr uint64_t kDefaultEosTokenId = 2;
-
 struct TokenIndex {
   const char* str;
   int32_t id;
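
One consequence worth noting: with `kDefaultVocabSize`, `kDefaultBosTokenId`, and `kDefaultEosTokenId` gone, a default-constructed `BPETokenizer` carries no usable vocabulary or special-token ids, so callers must load a tokenizer binary before use. A caller-side sketch; the include path matches the file touched by this commit, but the `load()` signature is an assumption here.

```cpp
// Assumes building inside the ExecuTorch tree.
#include <executorch/examples/models/llama2/tokenizer/bpe_tokenizer.h>

#include <memory>
#include <string>

void run_with_tokenizer(const std::string& tokenizer_path) {
  auto tokenizer = std::make_unique<torch::executor::BPETokenizer>();
  // vocab_, vocab_scores_, sorted_vocab_, bos_tok_, and eos_tok_ are all
  // populated here from the binary, rather than from compile-time defaults.
  const auto status = tokenizer->load(tokenizer_path);
  (void)status; // error handling elided in this sketch
}
```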
