Is there a CPU equivalent implementation of the _flash_attention_forward function in llama.cpp? #12193
Unanswered
guoguo1314 asked this question in Q&A
Hello everyone, I would like to ask whether there is a CPU implementation of the _flash_attention_forward function in llama.cpp. For reference, the function is here:
https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_flash_attention_utils.py#L231
Of course, any of the following would also work for me:
1) The actual implementations of the core sub-functions of _flash_attention_forward, i.e. self-implemented versions of _upad_input, flash_attn_varlen_func, and pad_input.
2) Alternatively, equivalent implementations of these three sub-functions, particularly flash_attn_varlen_func; equivalents for all three would be even better.
3) Or any other ideas you can suggest (see the rough sketch after this list for the kind of thing I have in mind).
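To make the request more concrete, here is a rough, untested sketch (assuming PyTorch and the flash-attn packed layout of `(total_tokens, num_heads, head_dim)` with `cu_seqlens`) of the kind of naive CPU reference for flash_attn_varlen_func I am looking for. The function name `naive_varlen_attention` and its exact signature are just placeholders, not anything that exists in llama.cpp or flash-attn:

```python
# Hypothetical sketch only: a naive CPU reference for variable-length attention,
# loosely mirroring the interface of flash_attn_varlen_func. Names and shapes
# here are assumptions, not an existing API.
import math
import torch

def naive_varlen_attention(q, k, v, cu_seqlens, causal=True):
    """q, k, v: (total_tokens, num_heads, head_dim) packed (unpadded) sequences.
    cu_seqlens: (batch + 1,) cumulative sequence lengths, e.g. [0, 5, 12].
    Returns the attention output with the same shape as q."""
    out = torch.empty_like(q)
    scale = 1.0 / math.sqrt(q.shape[-1])
    for i in range(cu_seqlens.numel() - 1):
        s, e = cu_seqlens[i].item(), cu_seqlens[i + 1].item()
        # View one unpadded sequence as (heads, seq, dim)
        qi = q[s:e].transpose(0, 1)
        ki = k[s:e].transpose(0, 1)
        vi = v[s:e].transpose(0, 1)
        scores = torch.matmul(qi, ki.transpose(-1, -2)) * scale
        if causal:
            seq = e - s
            # Mask out future positions for causal attention
            mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
            scores = scores.masked_fill(mask, float("-inf"))
        probs = torch.softmax(scores, dim=-1)
        out[s:e] = torch.matmul(probs, vi).transpose(0, 1)
    return out
```

As far as I understand, _upad_input / pad_input would then essentially reduce to gathering the non-padded token positions indicated by the attention mask before the call and scattering the outputs back into the padded layout afterwards.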
Thanks!