falcon arch fix for tied output embeddings #4978

Merged

Conversation

cmp-nct (Contributor) commented Jan 16, 2024

The latest Falcon finetune from OpenBuddy ties the input/output embedding tensors; previously the output was typically a separate lm_head tensor.
Here it is: https://huggingface.co/OpenBuddy/openbuddy-falcon-40b-v16.1-4k

This PR falls back to the input embeddings when no output embeddings are available.
Tested on WizardLM 40B (separate output tensor) and OpenBuddy 40B (shared); both work.
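
For context, a minimal sketch of what the fallback amounts to in llm_load_tensors for the Falcon arch. The presence check via gguf_find_tensor is my assumption about how the test could be done; the else branch mirrors the diff reviewed further down.

```cpp
// Sketch, not the verbatim PR code: if the GGUF file has a dedicated output.weight,
// load it as before; otherwise reuse token_embd.weight as the output projection.
// The gguf_find_tensor presence check is an assumption.
if (gguf_find_tensor(ml.ctx_gguf, tn(LLM_TENSOR_OUTPUT, "weight").c_str()) >= 0) {
    model.output = ml.create_tensor(ctx_output_split, tn(LLM_TENSOR_OUTPUT, "weight"), {n_embd, n_vocab});
} else {
    // tied embeddings: no separate lm_head in the file
    model.output = ml.create_tensor(ctx_output_split, tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab});
}
```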

cmp-nct (Contributor, Author) commented Jan 16, 2024

One thing is strange, though, and not super simple to debug:
ggml-ehartford-WizardLM-Uncensored-Falcon-40b-Q2_K.gguf -> inference is 35 tk/sec
openbuddy-falcon-40b-v16.1-4k\ggml-model-Q2_K.gguf -> inference is 25 tk/sec

That's despite the OpenBuddy variant being slightly smaller in total size.
Maybe the tensor sizes are simply no longer optimal for the llama.cpp kernels; something is slowing it down despite the same quantization and architecture (except for the output tensor).

slaren (Member) commented Jan 16, 2024

tok_embd is always allocated on the CPU, so doing it this way means the output layer cannot be offloaded. To avoid the performance degradation, the tensor would need to be copied to the GPU backend instead, which can be done with ggml_backend_tensor_copy.
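
To illustrate that suggestion, here is a rough sketch only; the context name ctx_gpu is an assumption, and the PR ultimately went with split-mode allocation instead (see the follow-up below).

```cpp
// Hypothetical sketch: give the output layer its own GPU-resident copy of the token
// embeddings instead of pointing it at the CPU-allocated tok_embd directly.
// ctx_gpu stands for an assumed ggml context whose tensors are allocated in a GPU backend buffer.
struct ggml_tensor * output = ggml_dup_tensor(ctx_gpu, model.tok_embd); // same type and shape
ggml_backend_tensor_copy(model.tok_embd, output);                       // copy the weights to the device
model.output = output;
```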

cmp-nct (Contributor, Author) commented Jan 16, 2024

> tok_embd is always allocated on the CPU, so doing it this way means the output layer cannot be offloaded. To avoid the performance degradation, the tensor would need to be copied to the GPU backend instead, which can be done with ggml_backend_tensor_copy.

Thanks, I didn't think about that!
I now create the tensor in split mode.
I had to adapt the n_tensors counter, since this results in one additional tensor.

Performance is now at 35 tokens/sec!

ggerganov (Member):

Hm cool! Will wait for slaren to confirm this is fine before merging

cebtenzzre (Collaborator):

Does this mean that we can revive #3626?

llama.cpp (outdated)
Comment on lines 3440 to 3443

} else {
    model.output = ml.create_tensor(ctx_output_split, tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab}); // needs to be on GPU
    ml.n_tensors++; // artificial tensor
}
Member:

This should also work, but I would prefer decreasing ml.n_created instead of increasing ml.n_tensors.
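
With that adjustment, the else branch from the diff above would presumably read as follows (the comment wording is mine):

```cpp
} else {
    model.output = ml.create_tensor(ctx_output_split, tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab}); // needs to be on GPU
    ml.n_created--; // artificial tensor: the reused token_embd has no separate entry in the file
}
```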

cmp-nct (Contributor, Author) commented Jan 17, 2024

I was thinking that, since we actually have one more tensor in use than in the model file, it's better to increase that variable. But we can change it to ml.n_created--;
Should I change it? You can do it as well, of course :)

Member:

Let's change it as suggested and merge then

cmp-nct (Contributor, Author):

Changed and tested it.

slaren (Member) commented Jan 17, 2024

@cebtenzzre yep, this should be doable now.

ggerganov merged commit 57e2a7a into ggml-org:master on Jan 18, 2024
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024
falcon arch fix for tied output embeddings

Co-authored-by: Georgi Gerganov <[email protected]>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024