add instructions about getting mmlu score for instruct models #6173


Closed · wants to merge 3 commits
28 changes: 24 additions & 4 deletions examples/models/llama2/README.md
@@ -49,7 +49,7 @@ We employed 4-bit groupwise per token dynamic quantization of all the linear lay

We evaluated WikiText perplexity using [LM Eval](https://github.com/EleutherAI/lm-evaluation-harness). Please note that LM Eval reports perplexity normalized by word count instead of token count. You may see different perplexity for WikiText from other sources if they implement it differently. More details can be found [here](https://github.com/EleutherAI/lm-evaluation-harness/issues/2301).

- Below are the results for two different groupsizes, with max_seq_len 2048, and 1000 samples.
+ Below are the results for two different groupsizes, with max_seq_length 2048 and limit 1000.

|Model | Baseline (FP32) | Groupwise 4-bit (128) | Groupwise 4-bit (256)
|--------|-----------------| ---------------------- | ---------------
@@ -280,12 +280,32 @@ tokenizer.path=<path_to_checkpoint_folder>/tokenizer.model

> Forewarning: Model evaluation without a GPU may take a long time, especially on larger models.

- Using the same arguments from above
+ We use [LM Eval](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate model accuracy.

+ Use the following example command to calculate the model's perplexity on WikiText.
```
- python -m examples.models.llama2.eval_llama -c <checkpoint.pth> -p <params.json> -t <tokenizer.model/bin> -d fp32 --max_seq_len <max sequence length> --limit <number of samples>
+ python -m examples.models.llama2.eval_llama \
+   -c <checkpoint.pth> \
+   -p <params.json> \
+   -t <tokenizer.model/bin> \
+   -kv \
+   -d <checkpoint dtype> \
+   --max_seq_len <max sequence length> \
+   --limit <number of samples>
```
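As the note earlier in this README explains, LM Eval reports WikiText perplexity normalized by word count rather than by token count. The following minimal sketch, with made-up numbers, illustrates why the two normalizations give different values; it is not LM Eval's actual implementation.

```python
import math

# Made-up per-token negative log-likelihoods for a short passage
# (illustrative only, not real model output).
token_nlls = [2.1, 3.4, 0.9, 1.7, 2.8]  # one entry per token
num_tokens = len(token_nlls)
num_words = 3  # the same passage split on whitespace has fewer words than tokens

total_nll = sum(token_nlls)

# Token-normalized perplexity: exp(average NLL per token).
token_ppl = math.exp(total_nll / num_tokens)

# Word-normalized perplexity: exp(average NLL per word). Dividing by a smaller
# count gives a larger exponent, so the reported number comes out higher.
word_ppl = math.exp(total_nll / num_words)

print(f"token-normalized perplexity: {token_ppl:.2f}")
print(f"word-normalized perplexity:  {word_ppl:.2f}")
```

This is why the same checkpoint can show a noticeably different WikiText perplexity here than in reports that normalize by token count.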

- The Wikitext results generated above used: `{max_seq_len: 2048, limit: 1000}`
+ For instruct models, you can use the following example command to calculate the model's MMLU score.
```
+ python -m examples.models.llama2.eval_llama \
+   -c <checkpoint.pth> \
+   -p <params.json> \
+   -t <tokenizer.model/bin> \
+   -kv \
+   -d <checkpoint dtype> \
+   --tasks mmlu \
+   --num_fewshot 5 \
+   --max_seq_len <max sequence length>
```
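To see roughly what comes back from that command: `eval_llama` iterates over `eval_results["results"]` (see the `eval_llama_lib.py` hunk below), so the MMLU score is read out of a per-task dictionary. The sketch below is hypothetical; the metric key and the numbers are made up and the actual keys depend on the lm-evaluation-harness version.

```python
# Hypothetical result dictionary in the shape returned by simple_evaluate();
# the "acc,none" key and the numbers are illustrative assumptions only.
eval_results = {
    "results": {
        "mmlu": {"acc,none": 0.62, "acc_stderr,none": 0.004},
    }
}

# Mirrors the per-task loop in eval_llama_lib.py: print every reported metric,
# which for the command above would include the 5-shot MMLU accuracy.
for task, res in eval_results["results"].items():
    print(f"{task}: {res}")
```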

## Step 4: Run on your computer to validate

6 changes: 3 additions & 3 deletions examples/models/llama2/eval_llama_lib.py
@@ -295,9 +295,9 @@ def eval_llama(
with torch.no_grad():
eval_results = simple_evaluate(
model=eval_wrapper,
- tasks=args.tasks,
- num_fewshot=args.num_fewshot,
- limit=args.limit,
+ tasks=args.tasks,  # pyre-ignore: Undefined attribute [16]: `argparse.ArgumentParser` has no attribute `tasks`
+ num_fewshot=args.num_fewshot,  # pyre-ignore: Undefined attribute [16]: `argparse.ArgumentParser` has no attribute `num_fewshot`
+ limit=args.limit,  # pyre-ignore: Undefined attribute [16]: `argparse.ArgumentParser` has no attribute `limit`
)

for task, res in eval_results["results"].items():
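A note on the pyre-ignore comments added above: the error text suggests that `args` is statically typed as `argparse.ArgumentParser`, which declares no `tasks`, `num_fewshot`, or `limit` attributes, while the object available at runtime is the parsed namespace that does carry them. The sketch below reproduces that situation in isolation; the parser and the flag defaults are assumptions modeled on the README commands, not the repo's actual argument handling.

```python
import argparse

def build_args_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; the flag names mirror the README commands above.
    parser = argparse.ArgumentParser()
    parser.add_argument("--tasks", nargs="+", default=["wikitext"])
    parser.add_argument("--num_fewshot", type=int, default=None)
    parser.add_argument("--limit", type=int, default=None)
    return parser

def eval_sketch(args: argparse.ArgumentParser) -> None:
    # A static checker only sees the ArgumentParser annotation, which has no
    # `tasks` / `num_fewshot` / `limit` attributes, so it reports the same
    # "Undefined attribute [16]" errors that the pyre-ignore comments silence.
    # At runtime the parsed namespace is passed in, so the lookups succeed.
    print(args.tasks, args.num_fewshot, args.limit)

args = build_args_parser().parse_args(["--tasks", "mmlu", "--num_fewshot", "5"])
eval_sketch(args)  # prints: ['mmlu'] 5 None
```

Suppressing the diagnostic therefore changes nothing at runtime; tightening the annotation on `args` would be the alternative fix.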