convert: handle when model's tokenization method relies on Mecab #13830
Closed
9 commits
036b5d6 convert: add support for Japanese Bert model (huydt-bti)
c484802 remove auto install, only throw error if fugashi is missing (huydt-bti)
4e0f769 Only skip pre_tokenizer print for mecab tokenizer type (huydt-bti)
547b380 restore download_file_with_auth (huydt-bti)
0192cab small import lint restore (huydt-bti)
94184ae Merge branch 'master' into huydt/bert-ja-support (huydt-bti)
f256169 Merge branch 'master' into huydt/bert-ja-support (huydt-bti)
a6b9bde update convert_hf_to_gguf to include ruri-large (huydt-bti)
a7fef9c fix typecheck and lint (huydt-bti)
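The behavior described in commit c484802 (raise an error when fugashi is missing instead of installing it automatically) could look roughly like the sketch below. This is not the code from the PR: `require_module` and its `hint` parameter are hypothetical names chosen for illustration, and the commented call site is an assumption about where such a check would sit.

```python
import importlib
from types import ModuleType


def require_module(name: str, hint: str) -> ModuleType:
    """Import an optional dependency, or fail with an actionable message.

    Hypothetical helper, not part of the conversion script.
    """
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        # Do not attempt to install anything; only report what is missing.
        raise ImportError(f"{name} is required for this tokenizer. {hint}") from exc


# Assumed call site, reached only when the tokenizer relies on MeCab:
# fugashi = require_module("fugashi", "Install it with: pip install fugashi")
```

Failing loudly keeps the conversion script from modifying the user's environment, which appears to be the motivation for dropping the auto-install step.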
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,112 @@ | ||
ied 4 ½ months | ||
__ggml_vocab_test__ | ||
Äpfel | ||
__ggml_vocab_test__ | ||
|
||
__ggml_vocab_test__ | ||
|
||
__ggml_vocab_test__ | ||
|
||
__ggml_vocab_test__ | ||
|
||
__ggml_vocab_test__ | ||
|
||
__ggml_vocab_test__ | ||
|
||
|
||
__ggml_vocab_test__ | ||
|
||
|
||
|
||
__ggml_vocab_test__ | ||
|
||
|
||
|
||
|
||
__ggml_vocab_test__ | ||
|
||
|
||
__ggml_vocab_test__ | ||
Hello world | ||
__ggml_vocab_test__ | ||
Hello world | ||
__ggml_vocab_test__ | ||
Hello World | ||
__ggml_vocab_test__ | ||
Hello World | ||
__ggml_vocab_test__ | ||
Hello World! | ||
__ggml_vocab_test__ | ||
Hello, world! | ||
__ggml_vocab_test__ | ||
Hello, world! | ||
__ggml_vocab_test__ | ||
this is 🦙.cpp | ||
__ggml_vocab_test__ | ||
w048 7tuijk dsdfhu | ||
__ggml_vocab_test__ | ||
нещо на Български | ||
__ggml_vocab_test__ | ||
កាន់តែពិសេសអាចខលចេញ | ||
__ggml_vocab_test__ | ||
🚀 (normal) 😶🌫️ (multiple emojis concatenated) ✅ (only emoji that has its own token) | ||
__ggml_vocab_test__ | ||
Hello | ||
__ggml_vocab_test__ | ||
Hello | ||
__ggml_vocab_test__ | ||
Hello | ||
__ggml_vocab_test__ | ||
Hello | ||
__ggml_vocab_test__ | ||
Hello | ||
__ggml_vocab_test__ | ||
Hello | ||
Hello | ||
__ggml_vocab_test__ | ||
( | ||
__ggml_vocab_test__ | ||
|
||
= | ||
__ggml_vocab_test__ | ||
' era | ||
__ggml_vocab_test__ | ||
Hello, y'all! How are you 😁 ?我想在apple工作1314151天~ | ||
__ggml_vocab_test__ | ||
!!!!!! | ||
__ggml_vocab_test__ | ||
3 | ||
__ggml_vocab_test__ | ||
33 | ||
__ggml_vocab_test__ | ||
333 | ||
__ggml_vocab_test__ | ||
3333 | ||
__ggml_vocab_test__ | ||
33333 | ||
__ggml_vocab_test__ | ||
333333 | ||
__ggml_vocab_test__ | ||
3333333 | ||
__ggml_vocab_test__ | ||
33333333 | ||
__ggml_vocab_test__ | ||
333333333 | ||
__ggml_vocab_test__ | ||
Cửa Việt | ||
__ggml_vocab_test__ | ||
discards | ||
__ggml_vocab_test__ | ||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
🚀 (normal) 😶🌫️ (multiple emojis concatenated) ✅ 🦙🦙 3 33 333 3333 33333 333333 3333333 33333333 3.3 3..3 3...3 កាន់តែពិសេសអាច😁 ?我想在apple工作1314151天~ ------======= нещо на Български ''''''```````""""......!!!!!!?????? I've been 'told he's there, 'RE you sure? 'M not sure I'll make it, 'D you like some tea? We'Ve a'lL | ||
__ggml_vocab_test__ |
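The input file above is a sequence of test strings, each followed by a `__ggml_vocab_test__` line. A minimal sketch of splitting such a file back into individual cases follows. It assumes the separator always sits on its own line and that the file ends with a separator, as the diff above suggests; `split_vocab_inputs` is a hypothetical name, not a function from the repository.

```python
SEPARATOR = "\n__ggml_vocab_test__\n"


def split_vocab_inputs(raw: str) -> list[str]:
    """Split the contents of a vocab test input file into test strings."""
    cases = raw.split(SEPARATOR)
    # A trailing separator leaves one empty string at the end; drop only
    # that one, because an empty string elsewhere may be a real test case.
    if cases and cases[-1] == "":
        cases.pop()
    return cases


sample = "Hello world" + SEPARATOR + " Hello world" + SEPARATOR + "\n\n" + SEPARATOR
print(split_vocab_inputs(sample))
```

Whitespace is preserved deliberately, since several of the cases differ only in leading spaces or in the number of newlines.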
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,46 @@ | ||
88 13247 35 32 1 33 92 18336 7095 7045 | ||
1 | ||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
32004 29944 102 28789 | ||
32004 29944 102 28789 | ||
32004 29944 18520 | ||
32004 29944 18520 | ||
32004 29944 18520 16 | ||
32004 29944 27 102 28789 16 | ||
32004 29944 27 102 28789 16 | ||
14152 12741 23274 1 29 82 16003 | ||
102 16435 7187 38 99 7069 25460 7099 83 7045 7094 7222 7095 7069 | ||
1 1 1 | ||
1 | ||
1 23 31304 7048 21907 7071 24 1 1 1 23 92 19760 14698 12835 84 7073 7075 32061 7045 26430 30214 16061 23624 16061 7094 24 1 23 18446 16157 84 7073 7075 32061 14152 12648 22106 7045 21801 7045 94 7070 7044 17253 20903 7044 24 | ||
32004 29944 | ||
32004 29944 | ||
32004 29944 | ||
32004 29944 | ||
32004 29944 | ||
32004 29944 32004 29944 | ||
23 | ||
44 | ||
22 84 14469 | ||
32004 29944 27 104 22 28187 16 55 13544 21369 7084 23418 1 46 2366 2263 1448 80 16003 12835 17228 22230 17880 23055 1589 109 | ||
16 16 16 16 16 16 | ||
34 | ||
13590 | ||
13590 7083 | ||
13590 17209 | ||
13590 17209 7083 | ||
13590 17209 17209 | ||
13590 17209 17209 7083 | ||
13590 17209 17209 17209 | ||
13590 17209 17209 17209 7083 | ||
1 1 | ||
23283 23637 14194 7045 | ||
1 23 31304 7048 21907 7071 24 1 1 1 23 92 19760 14698 12835 84 7073 7075 32061 7045 26430 30214 16061 23624 16061 7094 24 1 1 34 13590 13590 7083 13590 17209 13590 17209 7083 13590 17209 17209 13590 17209 17209 7083 13590 17209 17209 17209 34 29 34 34 29 29 34 34 29 29 29 34 1 46 2366 2263 1448 80 16003 12835 17228 22230 17880 23055 1589 109 26810 7509 7509 7509 7509 8741 8741 8741 8741 8741 8741 8741 1 1 1 22 22 22 22 22 22 79 79 79 79 79 8669 8669 7329 7329 7329 7329 28042 28042 28042 7508 7508 7508 7508 7508 7508 8134 8134 8134 8134 8134 8134 56 22 101 7084 24992 12620 22 17253 16344 30334 22 98 13891 12940 27 22 18898 23418 31114 12940 46 22 60 31304 7046 31114 12940 56 22 91 7071 92 19897 21801 27 22 51 23418 91 26183 98 19851 30713 7043 46 21452 22 69 7084 80 22 91 7159 |
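The output file above appears to hold one line of space-separated token IDs per test case. Under that assumption, the expected IDs can be read back as shown in this sketch; `parse_token_lines` is a hypothetical helper, and the sample values are copied from lines in the diff.

```python
def parse_token_lines(raw: str) -> list[list[int]]:
    """Parse one whitespace-separated list of token IDs per line."""
    # A blank line yields an empty list, so line positions stay aligned
    # with the order of the test cases.
    return [[int(tok) for tok in line.split()] for line in raw.splitlines()]


expected = parse_token_lines("32004 29944 102 28789\n32004 29944 18520 16\n1 1 1\n")
print(expected)
```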
If you add it here, you must also run the script so that it updates `convert_hf_to_gguf`, and include the change in this PR.
And btw, do we even have the C++ code to handle this? Has this already been tested?
I tested that model and similar models (ruri-*) locally on an embedding task, and it worked.
I'm sorry about this. As I said before, I don't have access to many of the models in the list, so it's hard for me to run all the listed models to update `convert_hf_to_gguf`. Can you do that for me? If not, how do you think we should handle this (for example, by leaving a comment saying that some Japanese models require `vocab.txt`)?
When #13847 is merged, you can run the script again, and this time it will only process the newly added model.