llama-model : support Qwen2 embedding models and pooling_mode_lasttoken #13245
Conversation
convert_hf_to_gguf.py (Outdated)
dir_model: Path,
ftype: gguf.LlamaFileType,
fname_out: Path,
hf_arch: str,
btw, if we update it here, we should also update convert_lora_to_gguf; that's why I think it's better not to add too many input arguments to this class. hf_arch can indeed be implied from hparams, so could we remove it from this list?
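For illustration, here is a minimal sketch of how hf_arch could be derived from hparams inside the converter instead of being passed in; the helper names and the config.json loading are assumptions, not the actual convert_hf_to_gguf.py code:

```python
import json
from pathlib import Path

def load_hparams(dir_model: Path) -> dict:
    # HF models ship their hyperparameters in config.json.
    with open(dir_model / "config.json", encoding="utf-8") as f:
        return json.load(f)

def infer_hf_arch(hparams: dict) -> str:
    # HF configs list the model class under "architectures", e.g. ["Qwen2Model"].
    archs = hparams.get("architectures", [])
    return archs[0] if archs else "unknown"
```

Deriving the architecture this way keeps the constructor signature unchanged, which is also what avoids having to touch convert_lora_to_gguf.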
Btw, the master branch fails mypy, so I feel like I'm flying blind with PRs like this which touch Python.
Would be nice if you can also update convert_lora_to_gguf; otherwise I can do that in a follow-up PR.
With the current revision I don't think any change to convert_lora_to_gguf.py should be necessary, unless there is some other fix you have in mind.
I thought convert_lora_to_gguf uses get_model_architecture, but I could be wrong (I'm not in front of a computer right now). But if it doesn't, then all is good 👍
It uses …
* GraniteMoEShared:
  * fix: Fix the input to the shared experts
  * fix: Cleaner (maybe more correct?) splitting for gate/up
  * feat: First WIP cut at model arch in cpp
  * fix: Split MoE fused tensors for shared experts in conversion
  * feat: hparam and arch plumbing for granitemoeshared
  * feat: Add GGUF conversion for granitemoeshared
* llama-model : support Qwen2 embedding models and pooling_mode_lasttoken (ggml-org#13245)
* convert : use correct context length for nomic-embed-text-v2 (ggml-org#13216)
These changes are necessary to support nomic-embed-code, a Qwen2-7B-based embedding model for code retrieval.
Without these changes it is still possible to run the GGUFs converted using this PR, but you have to explicitly specify the pooling mode when the model is loaded, e.g. `--pooling last` for llama-server. See the model card and GGUFs here: https://huggingface.co/nomic-ai/nomic-embed-code-GGUF
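For context, last-token pooling simply takes the embedding from the hidden state of the final non-padded token instead of averaging over all tokens. A minimal NumPy sketch of the idea (not the llama.cpp implementation; the array shapes and the L2 normalization are assumptions):

```python
import numpy as np

def last_token_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    # hidden_states: [batch, seq_len, n_embd]
    # attention_mask: [batch, seq_len], 1 for real tokens, 0 for padding
    last_idx = attention_mask.sum(axis=1).astype(int) - 1                # last real token per sequence
    pooled = hidden_states[np.arange(hidden_states.shape[0]), last_idx]  # [batch, n_embd]
    # Embedding models typically L2-normalize the pooled vector.
    return pooled / np.linalg.norm(pooled, axis=-1, keepdims=True)
```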