
Commit 127ecf9

Update on "add option to run mmlu with 5 shots"
This PR does the following changes:
- add a `--num_fewshot` option, which is required for running the MMLU task with 5 shots
- set the default value of `--limit` to None so that we can actually run all examples
- update `eval_llama` to call `simple_evaluate`, which is a wrapper of `evaluate` and does some extra work for us, like getting the task dict

Test Plan:
- Make sure WikiText perplexity for Llama 3.2 1B stays the same before and after the change.

Before, run eval_llama for Llama 3.2 1B with limit set to None:

```
wikitext: {'word_perplexity,none': 12.78246428138387, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.610432252171856, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.6874479705552373, 'bits_per_byte_stderr,none': 'N/A', 'alias': 'wikitext'}
```

After, run eval_llama for Llama 3.2 1B:

```
wikitext: {'word_perplexity,none': 12.78246428138387, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.610432252171856, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.6874479705552373, 'bits_per_byte_stderr,none': 'N/A', 'alias': 'wikitext'}
```

- Make sure that lm_eval (v0.4.2, which is used by eval_llama) and eval_llama report similar numbers for Llama 3.2 1B and 3B BF16 on the MMLU task with 5 shots.

Example command for lm_eval:

```
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-3.2-1B-Instruct \
  --tasks mmlu \
  --device cuda \
  -f 5 \
  --batch_size auto
```

Example command for eval_llama:

```
python -m examples.models.llama2.eval_llama \
  -c /home/lunwenh/models/1B_Instruct/consolidated.00.pth \
  -p /home/lunwenh/models/1B_Instruct/params.json \
  -t /home/lunwenh/models/1B_Instruct/tokenizer.model \
  -kv \
  -d bf16 \
  --tasks mmlu \
  -f 5 \
  --max_seq_length 2048
```

Differential Revision: [D64215268](https://our.internmc.facebook.com/intern/diff/D64215268)

[ghstack-poisoned]
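As a rough illustration of the `simple_evaluate` change described above, here is a minimal sketch (assumptions, not the actual eval_llama code): it presumes lm_eval v0.4.2 is installed, that `model` is an lm_eval-compatible `LM` wrapper, and uses a hypothetical `run_eval` helper name.

```python
# Minimal sketch of calling lm_eval's simple_evaluate (assumes lm_eval v0.4.2).
# `model` is assumed to be an lm_eval-compatible LM wrapper; `run_eval` is a
# hypothetical helper, not a function from eval_llama_lib.
from lm_eval.evaluator import simple_evaluate


def run_eval(model, tasks, num_fewshot=None, limit=None):
    # simple_evaluate wraps evaluate(): it builds the task dict from task names
    # and forwards num_fewshot / limit, so the caller no longer has to do it.
    output = simple_evaluate(
        model=model,
        tasks=tasks,              # e.g. ["mmlu"] or ["wikitext"]
        num_fewshot=num_fewshot,  # e.g. 5 for MMLU with 5 shots
        limit=limit,              # None -> evaluate all samples
    )
    return output["results"]
```

With `limit=None`, every example in the selected tasks is evaluated, which is what the new `--limit` default enables.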
1 parent 42aaf6a commit 127ecf9

File tree

1 file changed: +4 −1 lines changed


examples/models/llama2/eval_llama_lib.py

Lines changed: 4 additions & 1 deletion
```diff
@@ -247,7 +247,10 @@ def build_args_parser() -> argparse.ArgumentParser:
         help="list of lm-eluther tasks to evaluate usage: --tasks task1 task2",
     )
     parser.add_argument(
-        "--limit", type=int, default=None, help="number of samples to evalulate"
+        "--limit",
+        type=int,
+        default=None,
+        help="number of samples to evalulate. If not set, evaluate all samples",
     )
     parser.add_argument(
         "-f",
```
