
tests : add test-model-random #14139

Draft · wants to merge 7 commits into master

Conversation

@compilade (Collaborator) commented Jun 12, 2025

This generates random models and then tests different batch concurrencies to check whether the output is consistent.

This can detect when e.g. the recurrent cache has been broken (which has been a problem in recent months, see #13834 (comment)), or anything else that would affect the consistency of the output when running inference on multiple distinct sequences.

More architectures will be added, but for now this starts with Mamba.

Eventually, consistency of pooled embeddings will also be tested.

The goal is to reduce accidental regressions by making it easy to quickly test a lot of edge cases on the supported architectures, without having to download any model.

Draft for now because it's very much a work-in-progress, although it's kind of usable.

Example output

$ ./bin/test-model-random
.............
Comparing output for 'Mamba', with shuffle=0, n_seq_max=1, n_ctx=1024, n_ubatch=1: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=1, n_ctx=1024, n_ubatch=2: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=1, n_ctx=1024, n_ubatch=512: OK
Comparing output for 'Mamba', with shuffle=1, n_seq_max=1, n_ctx=1024, n_ubatch=1: OK
Comparing output for 'Mamba', with shuffle=1, n_seq_max=1, n_ctx=1024, n_ubatch=2: OK
Comparing output for 'Mamba', with shuffle=1, n_seq_max=1, n_ctx=1024, n_ubatch=512: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=2, n_ctx=2048, n_ubatch=1: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=2, n_ctx=2048, n_ubatch=2: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=2, n_ctx=2048, n_ubatch=512: OK
Comparing output for 'Mamba', with shuffle=1, n_seq_max=2, n_ctx=2048, n_ubatch=1: OK
Comparing output for 'Mamba', with shuffle=1, n_seq_max=2, n_ctx=2048, n_ubatch=2: OK
Comparing output for 'Mamba', with shuffle=1, n_seq_max=2, n_ctx=2048, n_ubatch=512: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=13, n_ctx=13312, n_ubatch=1: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=13, n_ctx=13312, n_ubatch=2: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=13, n_ctx=13312, n_ubatch=512: OK
Comparing output for 'Mamba', with shuffle=1, n_seq_max=13, n_ctx=13312, n_ubatch=1: OK
Comparing output for 'Mamba', with shuffle=1, n_seq_max=13, n_ctx=13312, n_ubatch=2: OK
Comparing output for 'Mamba', with shuffle=1, n_seq_max=13, n_ctx=13312, n_ubatch=512: OK

(takes around 11 seconds on a fast laptop, which means the sizes might need to be reduced once more architectures are added to the tests)
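
To make the comparison above concrete, here is a rough sketch (not the PR's actual code) of the kind of consistency check being run: the same token sequence is decoded with different ubatch sizes and the resulting logits are compared against a reference run. The decode_fn_t callback is a hypothetical stand-in for the real llama.cpp decode path, and the actual test's comparison may be exact rather than tolerance-based.

// Sketch of the consistency check (hypothetical helpers, not the PR's code):
// decode the same token sequence with different ubatch sizes and require the
// per-token logits to match a reference run within a small tolerance.
#include <cmath>
#include <cstdio>
#include <functional>
#include <vector>

using logits_t    = std::vector<std::vector<float>>; // one logit vector per output token
using decode_fn_t = std::function<logits_t(const std::vector<int> & tokens, int n_ubatch)>;

static bool logits_match(const logits_t & a, const logits_t & b, float tol = 1e-6f) {
    if (a.size() != b.size()) { return false; }
    for (size_t i = 0; i < a.size(); ++i) {
        if (a[i].size() != b[i].size()) { return false; }
        for (size_t j = 0; j < a[i].size(); ++j) {
            if (std::fabs(a[i][j] - b[i][j]) > tol) { return false; }
        }
    }
    return true;
}

// Compare larger ubatch sizes against a reference decoded one token at a time.
static void check_consistency(const std::vector<int> & tokens, const decode_fn_t & decode) {
    const logits_t ref = decode(tokens, /*n_ubatch=*/1);
    for (int n_ubatch : {2, 512}) {
        const bool ok = logits_match(decode(tokens, n_ubatch), ref);
        std::printf("n_ubatch=%d: %s\n", n_ubatch, ok ? "OK" : "FAILED");
    }
}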

TODO

  • Configure which arch to test from command-line args
    • Will be more useful once initializations for more architectures are implemented
  • Clear temporary model file on exit (a possible approach is sketched after this list)
    • A temporary file is created because the public API can only load models by filename.
      • Can't use tmpfile() because the file needs to have a known name.
  • Print "usage" of command-line args
  • Test seq_cp and seq_rm
  • More modular test cases, maybe like in tests/test-backend-ops.cpp
  • Test embeddings
  • Support testing the "important" architectures, aka the common ones and/or the very different ones.
  • Figure out how to test that sliding windows are implemented properly
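
Regarding the temporary-model-file item above, one possible approach (a sketch assuming POSIX mkstemp() is acceptable, not necessarily what this PR will end up doing) is to create the file with a known name and register a cleanup handler so it is removed on normal exit; it does not cover abnormal termination (signals, crashes).

// Sketch: a named temporary model file that is deleted on normal exit.
// mkstemp() yields a real path (unlike tmpfile()), which matters because the
// public llama.cpp API loads models by filename.
#include <cstdio>    // std::remove
#include <cstdlib>   // std::atexit
#include <string>
#include <unistd.h>  // mkstemp, close (POSIX; Windows would need a different mechanism)

static std::string g_tmp_model_path;

static void cleanup_tmp_model() {
    if (!g_tmp_model_path.empty()) {
        std::remove(g_tmp_model_path.c_str());
    }
}

static std::string create_tmp_model_file() {
    char path_template[] = "/tmp/test-model-random-XXXXXX";
    const int fd = mkstemp(path_template); // replaces XXXXXX and creates the file
    if (fd < 0) {
        return "";
    }
    close(fd); // the GGUF writer can reopen the file by name
    g_tmp_model_path = path_template;
    std::atexit(cleanup_tmp_model);
    return g_tmp_model_path;
}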


@compilade added the labels "help wanted" (Extra attention is needed) and "testing" (Everything test related) on Jun 12, 2025
@ngxson (Collaborator) commented Jun 12, 2025

I had a discussion a while ago with @ggerganov about a similar idea, it's nice to see you doing this!

My idea was to have a full pipeline of:

  1. convert HF --> GGUF
  2. run inference with both llama.cpp and transformers
  3. compare the logits
  4. optionally, rerun the same test with different configs

The random model can be stored on ggml-org. For example, we can easily generate a tiny random HF model like this one

I think this PR can be an interesting step toward that idea. Lmk if I can help!

@compilade (Collaborator, Author) commented Jun 12, 2025

> My idea was to have a full pipeline

@ngxson
Ooh, that could be great! Making comparison with transformers easier would be useful when adding support for new architectures.

I was going to say that I feel like regressions are less likely to happen with conversion, but that's not true given the recent change to using AutoConfig in convert_hf_to_gguf.py, which had unintended effects that required workarounds (#13881, #14103, #13859, and maybe others).

> Lmk if I can help!

We might want to start adding a way to list links to models of a given architecture, so that they can either be tested directly or have their configs used to generate random models to test.

Not sure yet how to make the link lists look less like an endorsement and more like "here are some reference models for this architecture, which are expected to always convert properly".

We might need a way to make sparse model files (with holes where the tensor data would be) and load them with mmap, to at least make sure the shapes load properly, or some other way to test that.
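
As an illustration of the sparse-file idea (an assumption on my part about how it could be done on POSIX, not something this PR implements): write only the GGUF header/metadata and then ftruncate() the file to its full size, so the tensor-data region becomes a hole that reads (and mmap) see as zeros without taking up disk space.

// Sketch: write a model file whose tensor-data region is a filesystem hole.
// Only the header/metadata bytes are actually written; extending the file with
// ftruncate() leaves the rest unallocated on most POSIX filesystems.
#include <cstddef>
#include <fcntl.h>     // open
#include <sys/types.h> // off_t, ssize_t
#include <unistd.h>    // write, ftruncate, close

// `header`/`header_size` stand for the serialized GGUF metadata; `total_size`
// is the full model size including the (never written) tensor data.
static bool write_sparse_model(const char * path, const void * header,
                               size_t header_size, off_t total_size) {
    const int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) {
        return false;
    }
    bool ok = write(fd, header, header_size) == (ssize_t) header_size;
    ok = ok && ftruncate(fd, total_size) == 0; // tensor-data region stays a hole
    close(fd);
    return ok;
}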

> The random model can be stored on ggml-org.

I would prefer the random models to be generated locally, so that the tests don't need fast network on every new environment they are put on.

But I agree testing convert_hf_to_gguf.py could be nice, and comparing logits with transformers would be great for compute graph correctness tests.

Since that would depend on Python and transformers, that could be done in an eventual tests/test-model-random.py, which could probably be made to interface with tests/test-model-random.cpp given proper CLI argument handling.

Some things are more convenient to test separately, though, like tokenizers, chat templates, quantization, backend ops, etc.

My intention here is mainly to test batch splits and consistency of llama_memory operations, and it just also happens to test model loading and graph validity (shape compatibility between ops, etc.).

Since this will test that multi-user batches work correctly, testing the correctness of model graphs will really only need to handle trivial single-user batches. In a way, this will reduce the scope of logits correctness testing (and make it simpler).

I will keep this in mind, but for now the scope of tests/test-model-random.cpp will be about consistency of the outputs, because that's already broad enough as a scope (e.g. nothing really tests multi-user pooled embeddings with weird batches yet). This should also make it easier to verify that changing the batch split strategy doesn't break anything.
