-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Fixing frequency penalty #1811
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixing frequency penalty #1811
Conversation
Hey I noticed the links to the generation examples were broken, so I have updated them! |
Hey @drbh I saw you created a new branch off of this one, why is that? Btw, I have rebased and fixed the styling! Could you guys maybe take a look and give me your feedback? @drbh @Narsil @OlivierDehaene This is the commit with my changes 949b889 Are you guys comfortable with merging this branch? Thank you! |
We create branches to run the CI because our secrets will not be available on forks (and we need them to run the integration tests unfortunately). The rebase created a lot of bogus commits, any way you could remove them ? I'm happy to help with the rebase if you want. Thanks a lot for the fix ! |
… when apply freq penalty
fa9713d
to
21ec539
Compare
I took the liberty of doing the rebase so your fix could be included in the upcoming release. |
Thank you so much @Narsil ! Okay now I understand why the fork happened, keep up the great work you guys are doing with TGI! (and sorry for the mess I created with my rebase :/ ) |
No worries :) Cheers |
Thank you so much for the work you are doing, this is my little contribution to this great thing you have built. I hope it is useful and helpful, please don't hesitate to discuss any matters that are not clear! I am basing my implementation of frequency penalty on OpenAI's implementation: https://platform.openai.com/docs/guides/text-generation/parameter-details The problem I see with TGI's current implementation is that is not taking into account the frequency of tokens which have already been sampled in the current generation stream. Also, the scaling is of the adjusted token logits is done differently for positive and negative logits. While in OpenAI's implementation token frequency is taking into account and the scaling is always done with a subtraction (if penalty is positive) or add operation (if penalty is negative). This leads to corrupt generations as I mentioned in issue huggingface#1810 . Moreover, after my tests, other issues are also gone like the one about some request's with ``penalty_frequency = 1.0`` overruling other requests (with ``frequency_penalty = 0.0``) in the same batch and therefore corrupting all generations in the batch. Basically, padding does not affect this implementation so I believe this ``score *= input_ids.ne(0)`` is not needed anymore. Frequency penalty | -1.0 | 0.0 | 1.0 -- | -- | -- | -- Before my change | https://paste.mozilla.org/JxqGJkWY | https://paste.mozilla.org/hrztJ56h | https://paste.mozilla.org/pBSEH2zw After my change | https://paste.mozilla.org/7gXCi7zo | https://paste.mozilla.org/ZR9rJ92g | https://paste.mozilla.org/gHaD2YnC --------- Co-authored-by: martini <[email protected]>
Thank you so much for the work you are doing, this is my little contribution to this great thing you have built. I hope it is useful and helpful, please don't hesitate to discuss any matters that are not clear! I am basing my implementation of frequency penalty on OpenAI's implementation: https://platform.openai.com/docs/guides/text-generation/parameter-details The problem I see with TGI's current implementation is that is not taking into account the frequency of tokens which have already been sampled in the current generation stream. Also, the scaling is of the adjusted token logits is done differently for positive and negative logits. While in OpenAI's implementation token frequency is taking into account and the scaling is always done with a subtraction (if penalty is positive) or add operation (if penalty is negative). This leads to corrupt generations as I mentioned in issue huggingface#1810 . Moreover, after my tests, other issues are also gone like the one about some request's with ``penalty_frequency = 1.0`` overruling other requests (with ``frequency_penalty = 0.0``) in the same batch and therefore corrupting all generations in the batch. Basically, padding does not affect this implementation so I believe this ``score *= input_ids.ne(0)`` is not needed anymore. Frequency penalty | -1.0 | 0.0 | 1.0 -- | -- | -- | -- Before my change | https://paste.mozilla.org/JxqGJkWY | https://paste.mozilla.org/hrztJ56h | https://paste.mozilla.org/pBSEH2zw After my change | https://paste.mozilla.org/7gXCi7zo | https://paste.mozilla.org/ZR9rJ92g | https://paste.mozilla.org/gHaD2YnC --------- Co-authored-by: martini <[email protected]>
Thank you so much for the work you are doing, this is my little contribution to this great thing you have built. I hope it is useful and helpful, please don't hesitate to discuss any matters that are not clear!
I am basing my implementation of frequency penalty on OpenAI's implementation: https://platform.openai.com/docs/guides/text-generation/parameter-details
The problem I see with TGI's current implementation is that is not taking into account the frequency of tokens which have already been sampled in the current generation stream. Also, the scaling is of the adjusted token logits is done differently for positive and negative logits. While in OpenAI's implementation token frequency is taking into account and the scaling is always done with a subtraction (if penalty is positive) or add operation (if penalty is negative).
This leads to corrupt generations as I mentioned in issue #1810 . Moreover, after my tests, other issues are also gone like the one about some request's with
penalty_frequency = 1.0
overruling other requests (withfrequency_penalty = 0.0
) in the same batch and therefore corrupting all generations in the batch. Basically, padding does not affect this implementation so I believe thisscore *= input_ids.ne(0)
is not needed anymore.