Open
Description
Describe the Feature
It would be nice to do something like evaluate(..., num_replicates=30)
so I can calculate mean/std dev of accuracy on a benchmark EvaluationDataset
.
What I mean by replicates is basically running the task N times in parallel, and computing aggregate metrics across the parallel runs.
Why is the feature important for you?
Statistical significance is important.
Additional context
I have a custom task, and am trying to compared trained models' performance on that task.