llama-model : support Qwen2 embedding models and pooling_mode_lasttoken #13245
Conversation
convert_hf_to_gguf.py (Outdated)
dir_model: Path,
ftype: gguf.LlamaFileType,
fname_out: Path,
hf_arch: str,
btw, if we update it here, we should also update convert_lora_to_gguf; that's why I think it's better not to add too many input arguments to this class. hf_arch can indeed be implied from hparams, so could we remove it from this list?
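For illustration, here is a minimal sketch of how hf_arch could be derived from hparams inside the converter instead of being passed in; the helper names and the config.json loading are assumptions, not the actual convert_hf_to_gguf.py code:

```python
import json
from pathlib import Path

def load_hparams(dir_model: Path) -> dict:
    # HF models ship their hyperparameters in config.json.
    with open(dir_model / "config.json", encoding="utf-8") as f:
        return json.load(f)

def infer_hf_arch(hparams: dict) -> str:
    # HF configs list the model class under "architectures", e.g. ["Qwen2Model"].
    archs = hparams.get("architectures", [])
    return archs[0] if archs else "unknown"
```

Deriving the architecture this way keeps the constructor signature unchanged, which is also what avoids having to touch convert_lora_to_gguf.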
Btw, the master branch fails mypy, so I feel like I'm flying blind with PRs like this which touch Python.
Would be nice if you can also update convert_lora_to_gguf; otherwise I can do that in a follow-up PR.
With the current revision I don't think any change to convert_lora_to_gguf.py should be necessary, unless there is some other fix you have in mind.
I thought convert_lora_to_gguf uses get_model_architecture, but I could be wrong (I'm not in front of a computer right now). But if it doesn't, then all is good 👍
It uses …
* GraniteMoEShared:
  * fix: Fix the input to the shared experts
  * fix: Cleaner (maybe more correct?) splitting for gate/up
  * feat: First WIP cut at model arch in cpp
  * fix: Split MoE fused tensors for shared experts in conversion
  * feat: hparam and arch plumbing for granitemoeshared
  * feat: Add GGUF conversion for granitemoeshared
* llama-model : support Qwen2 embedding models and pooling_mode_lasttoken (ggml-org#13245)
* convert : use correct context length for nomic-embed-text-v2 (ggml-org#13216)
These changes are necessary to support nomic-embed-code, a Qwen2-7B-based embedding model for code retrieval.
Without these changes it is still possible to run the GGUFs converted using this PR, but you have to explicitly specify the pooling mode when the model is loaded, e.g. `--pooling last` for llama-server. See the model card and GGUFs here: https://huggingface.co/nomic-ai/nomic-embed-code-GGUF
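For context, last-token pooling simply takes the embedding from the hidden state of the final non-padded token instead of averaging over all tokens. A minimal NumPy sketch of the idea (not the llama.cpp implementation; the array shapes and the L2 normalization are assumptions):

```python
import numpy as np

def last_token_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    # hidden_states: [batch, seq_len, n_embd]
    # attention_mask: [batch, seq_len], 1 for real tokens, 0 for padding
    last_idx = attention_mask.sum(axis=1).astype(int) - 1                # last real token per sequence
    pooled = hidden_states[np.arange(hidden_states.shape[0]), last_idx]  # [batch, n_embd]
    # Embedding models typically L2-normalize the pooled vector.
    return pooled / np.linalg.norm(pooled, axis=-1, keepdims=True)
```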