llama-model : support Qwen2 embedding models and pooling_mode_lasttoken #13245

Merged: 4 commits from jared/nomic-embed-code into master on May 2, 2025

Conversation

@cebtenzzre (Collaborator) commented May 1, 2025

These changes are necessary to support nomic-embed-code, a Qwen2-7B-based embedding model for code retrieval.

Without these changes it is still possible to run the GGUFs converted using this PR, but you have to explicitly specify the pooling mode when the model is loaded, e.g. `--pooling last` for llama-server.

See the model card and GGUFs here: https://huggingface.co/nomic-ai/nomic-embed-code-GGUF
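For example, a minimal invocation with the explicit flag (the model filename is a placeholder for one of the GGUFs in that repo, not taken from this PR; `--embeddings` enables the server's embedding endpoint):

```sh
# Hypothetical invocation: serve embeddings from a converted GGUF, passing
# the pooling mode explicitly (required without this PR's changes).
./llama-server -m nomic-embed-code.Q4_0.gguf --embeddings --pooling last
```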

@cebtenzzre requested a review from @ggerganov May 1, 2025
@github-actions bot added the python (python script changes) label May 1, 2025
@CISC linked an issue May 1, 2025 that may be closed by this pull request
Review thread on the parameter list in convert_hf_to_gguf.py:

```python
dir_model : Path,
ftype : gguf.LlamaFileType,
fname_out : Path,
hf_arch : str,
```

@ngxson (Collaborator)
btw, if we update it here, we should also update convert_lora_to_gguf; that's why I think it's better not to add too many input arguments to this class. hf_arch can indeed be implied from hparams, so could we remove it from this list?
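A minimal sketch of that suggestion, assuming `hparams` is the dict parsed from the model's config.json (the discussion below confirms convert_lora_to_gguf already reads this field directly):

```python
# Derive the HF architecture from hparams instead of threading it
# through as an extra constructor argument.
hf_arch: str = hparams["architectures"][0]  # e.g. "Qwen2ForCausalLM"
```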

@cebtenzzre (Collaborator, Author)

Btw, the master branch fails mypy, so I feel like I'm flying blind with PRs like this that touch Python:

Errors:

```text
convert_hf_to_gguf.py:113: error: Cannot assign to a method  [method-assign]
                self.get_tensors = get_remote_tensors
                ^~~~~~~~~~~~~~~~
convert_hf_to_gguf.py:498: error: Name "fname_default" already defined on line 496  [no-redef]
                    fname_default: str = gguf.naming_convention(self.metadata.name, self.metadata.basename, self.metadata.finetune, self.metadata.version, size_label=None, output_type=None, model_type="vocab")
                    ^
convert_hf_to_gguf.py:1468: error: If x = b'abc' then f"{x}" or "{}".format(x) produces "b'abc'", not "abc". If this is desired behavior, use f"{x!r}" or "{!r}".format(x). Otherwise, decode the bytes  [str-bytes-safe]
                    token_text = f"<{token_text}>".encode('utf-8')
                                    ^~~~~~~~~~~~
convert_hf_to_gguf.py:2186: error: Attribute "_num_kv_heads" already defined on line 2114  [no-redef]
                    self._num_kv_heads: list[int] = self.hparams["num_key_value_heads_per_layer"]
                    ^
convert_lora_to_gguf.py:187: error: Missing type parameters for generic type "Callable"  [type-arg]
        def __torch_function__(cls, func: Callable, types, args=(), kwargs=None):
                                          ^
convert_lora_to_gguf.py:316: error: Incompatible types in assignment (expression has type "str", variable has type "Path")  [assignment]
            input_model = os.path.join(dir_lora, "adapter_model.bin")
                          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
convert_lora_to_gguf.py:352: error: Variable "convert_lora_to_gguf.model_class" is not valid as a type  [valid-type]
            class LoraModel(model_class):
                            ^
convert_lora_to_gguf.py:352: note: See https://mypy.readthedocs.io/en/stable/common_issues.html#variables-vs-type-aliases
convert_lora_to_gguf.py:352: error: Invalid base class "model_class"  [misc]
            class LoraModel(model_class):
                            ^
convert_lora_to_gguf.py:412: error: Incompatible types in assignment (expression has type "PartialLoraTensor", variable has type "Tensor")  [assignment]
                    for name, tensor in tensor_map.items():
                    ^
convert_lora_to_gguf.py:413: error: "Tensor" has no attribute "A"  [attr-defined]
                        assert tensor.A is not None
                               ^~~~~~~~
convert_lora_to_gguf.py:414: error: "Tensor" has no attribute "B"  [attr-defined]
                        assert tensor.B is not None
                               ^~~~~~~~
convert_lora_to_gguf.py:415: error: "Tensor" has no attribute "A"  [attr-defined]
                        yield (name, cast(torch.Tensor, LoraTorchTensor(tensor.A, tensor.B)))
                                                                        ^~~~~~~~
convert_lora_to_gguf.py:415: error: "Tensor" has no attribute "B"  [attr-defined]
                        yield (name, cast(torch.Tensor, LoraTorchTensor(tensor.A, tensor.B)))
                                                                                  ^~~~~~~~
Found 13 errors in 2 files (checked 2 source files)
```
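For context, the first error ([method-assign]) is mypy's standard objection to monkey-patching a bound method on an instance; a minimal reproduction, not the actual convert_hf_to_gguf.py code:

```python
class Model:
    def get_tensors(self) -> list:
        return []

def get_remote_tensors() -> list:
    return []

m = Model()
# mypy: error: Cannot assign to a method  [method-assign]
m.get_tensors = get_remote_tensors
# The usual suppression when the monkey-patch is intentional:
# m.get_tensors = get_remote_tensors  # type: ignore[method-assign]
```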

@ngxson (Collaborator) left a comment

It would be nice if you could also update convert_lora_to_gguf; otherwise I can do that in a follow-up PR.

@cebtenzzre (Collaborator, Author)

> It would be nice if you could also update convert_lora_to_gguf; otherwise I can do that in a follow-up PR.

With the current revision I don't think any change to convert_lora_to_gguf.py should be necessary. Unless there is some other fix you have in mind.

@ngxson (Collaborator) commented May 1, 2025

I thought convert_lora_to_gguf uses get_model_architecture, but I could be wrong (I'm not in front of a computer right now).

But if it doesn't, then all is good 👍

@cebtenzzre (Collaborator, Author)

> I thought convert_lora_to_gguf uses get_model_architecture

It uses `hparams["architectures"][0]` directly.

@cebtenzzre merged commit 2f56761 into master May 2, 2025
56 checks passed
@cebtenzzre deleted the jared/nomic-embed-code branch May 2, 2025 15:42
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request May 2, 2025
```text
* GraniteMoEShared:
  fix: Fix the input to the shared experts
  fix: Cleaner (maybe more correct?) splitting for gate/up
  feat: First WIP cut at model arch in cpp
  fix: Split MoE fused tensors for shared experts in conversion
  feat: hparam and arch plumbing for granitemoeshared
  feat: Add GGUF conversion for granitemoeshared
  llama-model : support Qwen2 embedding models and pooling_mode_lasttoken (ggml-org#13245)
  convert : use correct context length for nomic-embed-text-v2 (ggml-org#13216)
```
Labels: python (python script changes)

Successfully merging this pull request may close: Eval bug: Cannot convert nomic-embed-code to gguf

3 participants