falcon arch fix for tied output embeddings #4978

Merged

Conversation

cmp-nct (Contributor) commented Jan 16, 2024

The latest Falcon finetune from OpenBuddy ties the input/output embedding tensors; previously the output was typically a separate lm_head tensor.
Here it is: https://huggingface.co/OpenBuddy/openbuddy-falcon-40b-v16.1-4k

This PR falls back to the input embeddings when no output embeddings are available.
Tested on WizardLM 40B (separate output tensor) and OpenBuddy 40B (shared); both work.
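
For context, a minimal sketch of what the fallback amounts to in llm_load_tensors for the Falcon arch. The presence check via gguf_find_tensor is my assumption about how the test could be done; the else branch mirrors the diff reviewed further down.

```cpp
// Sketch, not the verbatim PR code: if the GGUF file has a dedicated output.weight,
// load it as before; otherwise reuse token_embd.weight as the output projection.
// The gguf_find_tensor presence check is an assumption.
if (gguf_find_tensor(ml.ctx_gguf, tn(LLM_TENSOR_OUTPUT, "weight").c_str()) >= 0) {
    model.output = ml.create_tensor(ctx_output_split, tn(LLM_TENSOR_OUTPUT, "weight"), {n_embd, n_vocab});
} else {
    // tied embeddings: no separate lm_head in the file
    model.output = ml.create_tensor(ctx_output_split, tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab});
}
```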

cmp-nct (Contributor, Author) commented Jan 16, 2024

One thing is strange, though, and not super simple to debug:
ggml-ehartford-WizardLM-Uncensored-Falcon-40b-Q2_K.gguf -> inference is 35 tk/sec
openbuddy-falcon-40b-v16.1-4k\ggml-model-Q2_K.gguf -> inference is 25 tk/sec

That's despite the OpenBuddy variant being slightly smaller in total size.
Maybe the tensor sizes are simply no longer optimal for the llama.cpp kernels; something is slowing it down despite the same quantization and architecture (except for the output tensor).

slaren (Member) commented Jan 16, 2024

tok_embd is always allocated on the CPU, so doing it this way means the output layer cannot be offloaded. To avoid the performance degradation, the tensor would need to be copied to the GPU backend instead, which can be done with ggml_backend_tensor_copy.
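
To illustrate that suggestion, here is a rough sketch only; the context name ctx_gpu is an assumption, and the PR ultimately went with split-mode allocation instead (see the follow-up below).

```cpp
// Hypothetical sketch: give the output layer its own GPU-resident copy of the token
// embeddings instead of pointing it at the CPU-allocated tok_embd directly.
// ctx_gpu stands for an assumed ggml context whose tensors are allocated in a GPU backend buffer.
struct ggml_tensor * output = ggml_dup_tensor(ctx_gpu, model.tok_embd); // same type and shape
ggml_backend_tensor_copy(model.tok_embd, output);                       // copy the weights to the device
model.output = output;
```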

cmp-nct (Contributor, Author) commented Jan 16, 2024

> tok_embd is always allocated on the CPU, so doing it this way means the output layer cannot be offloaded. To avoid the performance degradation, the tensor would need to be copied to the GPU backend instead, which can be done with ggml_backend_tensor_copy.

Thanks, I didn't think about that!
I now create the tensor in split mode.
I had to adapt the n_tensors counter, since this results in one additional tensor.

Performance is now at 35 tokens/sec!

ggerganov (Member):

Hm cool! Will wait for slaren to confirm this is fine before merging

cebtenzzre (Collaborator):

Does this mean that we can revive #3626?

llama.cpp (outdated)
Comment on lines 3440 to 3443

} else {
    model.output = ml.create_tensor(ctx_output_split, tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab}); // needs to be on GPU
    ml.n_tensors++; // artificial tensor
}
Member:

This should also work, but I would prefer decreasing ml.n_created instead of increasing ml.n_tensors.
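
With that adjustment, the else branch from the diff above would presumably read as follows (the comment wording is mine):

```cpp
} else {
    model.output = ml.create_tensor(ctx_output_split, tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab}); // needs to be on GPU
    ml.n_created--; // artificial tensor: the reused token_embd has no separate entry in the file
}
```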

cmp-nct (Contributor, Author) commented Jan 17, 2024

I was thinking that, since we actually have one more tensor in use than in the model file, it's better to increase that variable. But we can change it to ml.n_created--;
Should I change it? You can do it as well, of course :)

Member:

Let's change it as suggested and merge then

cmp-nct (Contributor, Author):

Changed and tested it.

slaren (Member) commented Jan 17, 2024

@cebtenzzre yep, this should be doable now.

ggerganov merged commit 57e2a7a into ggml-org:master on Jan 18, 2024
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024
falcon arch fix for tied output embeddings

Co-authored-by: Georgi Gerganov <[email protected]>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024