Add CLI arg generate_until_token to support reasoning and CoT models #617
As noted in #8 and #513, LightEval expects models to follow a question with an immediate answer, but chain-of-thought and reasoning models (such as DeepSeek) generate many tokens to arrive at a more accurate / thought-out result before answering.
This PR would add `--generate-until-token '</think>'` as the syntax to support these models. It must be run with `--use-chat-template` and a `TransformerModel` model, or it will raise an Exception.
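As a rough illustration, an invocation might look like the sketch below. The model name, task spec, and overall command layout are placeholders that depend on your lighteval version; only `--use-chat-template` and `--generate-until-token` are the flags this PR concerns.

```bash
# Hypothetical invocation: model name, task spec, and CLI layout are
# illustrative placeholders, not part of this PR.
lighteval accelerate \
    "pretrained=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B" \
    "lighteval|gsm8k|0|0" \
    --use-chat-template \
    --generate-until-token '</think>'
```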
I have a CoLab notebook running a BigBench task which I didn't run to the end, but I used `logger.info` to confirm it was generating reasoning text. In a previous test linked in #513 I confirmed this method works on a short task.

Notes:
- Should `do_sample=True` be set when generating the reasoning text? Is that reproducible?
- Does `logger.debug()` show up when calling lighteval from the command line? I can remove the logging of reasoning text if it isn't helpful.
- Some evals' targets may need to change from `["A", "B", ...]` to `["The answer is A", ...]` - thoughts about using the template string to set post-reasoning text and be compatible with more evals?