`evaluate` supporting replicates

**Describe the Feature**

It would be nice to do something like `evaluate(..., num_replicates=30)` so I can calculate mean/std dev of accuracy on a benchmark `EvaluationDataset`.

What I mean by replicates is basically running the task N times in parallel, and computing aggregate metrics across the parallel runs.
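To make the request concrete, here is a minimal sketch of the behavior described above, not the library's actual API. `evaluate_with_replicates`, `run_once`, and `fake_run` are hypothetical names used only for illustration. `run_once` stands in for one full pass of the task over the benchmark that returns an accuracy. The replicates run in a thread pool, and the mean and sample standard deviation are computed across them.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def evaluate_with_replicates(
    run_once: Callable[[int], float],  # hypothetical: one full run -> accuracy
    num_replicates: int = 30,
    max_workers: int = 8,
) -> dict[str, float]:
    """Run `run_once` num_replicates times in parallel and aggregate accuracy."""
    if num_replicates < 2:
        raise ValueError("need at least 2 replicates to compute a std dev")
    # Each replicate receives its index, which can double as a random seed.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        accuracies = list(pool.map(run_once, range(num_replicates)))
    return {
        "mean": statistics.mean(accuracies),
        "std_dev": statistics.stdev(accuracies),  # sample std dev (n - 1)
    }


def fake_run(seed: int) -> float:
    """Stand-in for a real evaluation: 200 examples, ~80% chance of success each."""
    rng = random.Random(seed)
    outcomes = [rng.random() < 0.8 for _ in range(200)]
    return sum(outcomes) / len(outcomes)


summary = evaluate_with_replicates(fake_run, num_replicates=30)
print(summary)
```

A thread pool is assumed here because evaluation runs are typically I/O-bound (model calls); a process pool or async gather would serve the same purpose if the task is CPU-bound or already async.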

**Why is the feature important for you?**

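Statistical significance is important: a single run yields one accuracy number with no estimate of run-to-run variance, so it's hard to tell whether a difference between two models is real or just noise.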

**Additional context**

I have a custom task, and am trying to compare trained models' performance on that task.

`evaluate` supporting replicates #2052

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

evaluate supporting replicates #2052

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

`evaluate` supporting replicates #2052