Quantifying factual correctness with multiple modes at the same time

**Describe the Feature**
Rather than choosing a single mode for quantifying factual correctness, a user should be free to request multiple modes at the same time.

**Why is the feature important for you?**
Each mode (f1_score, precision, recall) lets you interpret correctness from a different angle. For a clear overview of answer correctness, I think a user should be able to look at all three results.
At the same time, calling the factual correctness metric once per mode is redundant, considering that each mode is scored from the same statements processed by an LLM.
By providing this feature, we can save computational time and API costs. 
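
To illustrate the idea, here is a minimal sketch of how all requested modes could be derived from one set of verified statements. It assumes the metric reduces to counts of true positives, false positives, and false negatives after the LLM has processed the statements. The function name `scores_for_modes` and its parameters are hypothetical and are not part of the existing API.

```python
from typing import Dict, Iterable


def scores_for_modes(
    tp: int,
    fp: int,
    fn: int,
    modes: Iterable[str] = ("precision", "recall", "f1_score"),
) -> Dict[str, float]:
    """Return every requested mode from one shared set of counts.

    tp, fp, fn are assumed to come from a single LLM verification pass,
    so no extra LLM calls are needed per mode (hypothetical helper).
    """
    requested = list(modes)

    # Guard against empty denominators (e.g. no statements were extracted).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )

    available = {"precision": precision, "recall": recall, "f1_score": f1}

    unknown = [m for m in requested if m not in available]
    if unknown:
        raise ValueError(f"Unsupported mode(s): {unknown}")

    return {m: available[m] for m in requested}


if __name__ == "__main__":
    # One verification pass, three scores.
    print(scores_for_modes(tp=3, fp=1, fn=2))
```

Because the expensive step (statement extraction and verification) happens once, requesting extra modes only adds a few arithmetic operations.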

**Additional context**

Papers such as https://arxiv.org/abs/2307.16877 and https://arxiv.org/pdf/2503.16161 discuss the trade-offs between the modes.

If the proposal is approved, I would like to be the one who implements it. 

Quantifying factual correctness with multiple modes at the same time #2065