
tests : add test-model-random #14139

Draft · wants to merge 7 commits into master

Conversation

@compilade (Collaborator) commented Jun 12, 2025

This generates random models and then tests different batch concurrencies to check whether the output is consistent.

This can detect when e.g. the recurrent cache has been broken (which has been a problem in recent months, see #13834 (comment)), or anything else that would affect the consistency of the output when running inference on multiple distinct sequences.

More architectures will be added, but for now this starts with Mamba.

Eventually, consistency of pooled embeddings will also be tested.

The goal is to reduce accidental regressions by making it easy to quickly test a lot of edge cases on the supported architectures, without having to download any model.

Draft for now because it's very much a work-in-progress, although it's kind of usable.

Example output

$ ./bin/test-model-random
.............
Comparing output for 'Mamba', with shuffle=0, n_seq_max=1, n_ctx=1024, n_ubatch=1: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=1, n_ctx=1024, n_ubatch=2: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=1, n_ctx=1024, n_ubatch=512: OK
Comparing output for 'Mamba', with shuffle=1, n_seq_max=1, n_ctx=1024, n_ubatch=1: OK
Comparing output for 'Mamba', with shuffle=1, n_seq_max=1, n_ctx=1024, n_ubatch=2: OK
Comparing output for 'Mamba', with shuffle=1, n_seq_max=1, n_ctx=1024, n_ubatch=512: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=2, n_ctx=2048, n_ubatch=1: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=2, n_ctx=2048, n_ubatch=2: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=2, n_ctx=2048, n_ubatch=512: OK
Comparing output for 'Mamba', with shuffle=1, n_seq_max=2, n_ctx=2048, n_ubatch=1: OK
Comparing output for 'Mamba', with shuffle=1, n_seq_max=2, n_ctx=2048, n_ubatch=2: OK
Comparing output for 'Mamba', with shuffle=1, n_seq_max=2, n_ctx=2048, n_ubatch=512: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=13, n_ctx=13312, n_ubatch=1: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=13, n_ctx=13312, n_ubatch=2: OK
Comparing output for 'Mamba', with shuffle=0, n_seq_max=13, n_ctx=13312, n_ubatch=512: OK
Comparing output for 'Mamba', with shuffle=1, n_seq_max=13, n_ctx=13312, n_ubatch=1: OK
Comparing output for 'Mamba', with shuffle=1, n_seq_max=13, n_ctx=13312, n_ubatch=2: OK
Comparing output for 'Mamba', with shuffle=1, n_seq_max=13, n_ctx=13312, n_ubatch=512: OK

(takes around 11 seconds on a fast laptop, which means the sizes might need to be reduced once more architectures are added to the tests)
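
To make the comparison above concrete, here is a rough sketch (not the PR's actual code) of the kind of consistency check being run: the same token sequence is decoded with different ubatch sizes and the resulting logits are compared against a reference run. The decode_fn_t callback is a hypothetical stand-in for the real llama.cpp decode path, and the actual test's comparison may be exact rather than tolerance-based.

// Sketch of the consistency check (hypothetical helpers, not the PR's code):
// decode the same token sequence with different ubatch sizes and require the
// per-token logits to match a reference run within a small tolerance.
#include <cmath>
#include <cstdio>
#include <functional>
#include <vector>

using logits_t    = std::vector<std::vector<float>>; // one logit vector per output token
using decode_fn_t = std::function<logits_t(const std::vector<int> & tokens, int n_ubatch)>;

static bool logits_match(const logits_t & a, const logits_t & b, float tol = 1e-6f) {
    if (a.size() != b.size()) { return false; }
    for (size_t i = 0; i < a.size(); ++i) {
        if (a[i].size() != b[i].size()) { return false; }
        for (size_t j = 0; j < a[i].size(); ++j) {
            if (std::fabs(a[i][j] - b[i][j]) > tol) { return false; }
        }
    }
    return true;
}

// Compare larger ubatch sizes against a reference decoded one token at a time.
static void check_consistency(const std::vector<int> & tokens, const decode_fn_t & decode) {
    const logits_t ref = decode(tokens, /*n_ubatch=*/1);
    for (int n_ubatch : {2, 512}) {
        const bool ok = logits_match(decode(tokens, n_ubatch), ref);
        std::printf("n_ubatch=%d: %s\n", n_ubatch, ok ? "OK" : "FAILED");
    }
}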

TODO

  • Configure which arch to test from command-line args
    • Will be more useful once initializations for more architectures are implemented
  • Clear temporary model file on exit (a possible approach is sketched after this list)
    • A temporary file is created because the public API can only load models by filename.
      • Can't use tmpfile() because the file needs to have a known name.
  • Print "usage" of command-line args
  • Test seq_cp and seq_rm
  • More modular test cases, maybe like in tests/test-backend-ops.cpp
  • Test embeddings
  • Support testing the "important" architectures, aka the common ones and/or the very different ones.
  • Figure out how to test that sliding windows are implemented properly
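
Regarding the temporary-model-file item above, one possible approach (a sketch assuming POSIX mkstemp() is acceptable, not necessarily what this PR will end up doing) is to create the file with a known name and register a cleanup handler so it is removed on normal exit; it does not cover abnormal termination (signals, crashes).

// Sketch: a named temporary model file that is deleted on normal exit.
// mkstemp() yields a real path (unlike tmpfile()), which matters because the
// public llama.cpp API loads models by filename.
#include <cstdio>    // std::remove
#include <cstdlib>   // std::atexit
#include <string>
#include <unistd.h>  // mkstemp, close (POSIX; Windows would need a different mechanism)

static std::string g_tmp_model_path;

static void cleanup_tmp_model() {
    if (!g_tmp_model_path.empty()) {
        std::remove(g_tmp_model_path.c_str());
    }
}

static std::string create_tmp_model_file() {
    char path_template[] = "/tmp/test-model-random-XXXXXX";
    const int fd = mkstemp(path_template); // replaces XXXXXX and creates the file
    if (fd < 0) {
        return "";
    }
    close(fd); // the GGUF writer can reopen the file by name
    g_tmp_model_path = path_template;
    std::atexit(cleanup_tmp_model);
    return g_tmp_model_path;
}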


@compilade added the labels "help wanted" (Extra attention is needed) and "testing" (Everything test related) on Jun 12, 2025
@ngxson (Collaborator) commented Jun 12, 2025

I had a discussion a while ago with @ggerganov about a similar idea, it's nice to see you doing this!

My idea was to have a full pipeline of:

  1. convert HF --> GGUF
  2. run inference with both llama.cpp and transformers
  3. compare the logits
  4. optionally, rerun the same test with different configs

The random model can be stored on ggml-org. For example, we can easily generate a tiny random HF model like this one

I think this PR can be an interesting step toward that idea. Lmk if I can help!

@compilade (Collaborator, Author) commented Jun 12, 2025

> My idea was to have a full pipeline

@ngxson
Ooh, that could be great! Making comparison with transformers easier would be useful when adding support for new architectures.

I was going to say that I feel like regressions are less likely to happen with conversion, but that's not true given the recent change to using AutoConfig in convert_hf_to_gguf.py, which had unintended effects that required workarounds (#13881, #14103, #13859, and maybe others).

> Lmk if I can help!

We might want to start adding a way to list links to models of a given architecture, so that they can either be tested directly or have their configs used to generate random models to test.

Not sure yet how to make the link lists look less like an endorsement and more like "here are some reference models for this architecture, which are expected to always convert properly".

We might need a way to make sparse model files (with holes where the tensor data would be) and load them with mmap, to at least make sure the shapes load properly, or some other way to test that.
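
As an illustration of the sparse-file idea (an assumption on my part about how it could be done on POSIX, not something this PR implements): write only the GGUF header/metadata and then ftruncate() the file to its full size, so the tensor-data region becomes a hole that reads (and mmap) see as zeros without taking up disk space.

// Sketch: write a model file whose tensor-data region is a filesystem hole.
// Only the header/metadata bytes are actually written; extending the file with
// ftruncate() leaves the rest unallocated on most POSIX filesystems.
#include <cstddef>
#include <fcntl.h>     // open
#include <sys/types.h> // off_t, ssize_t
#include <unistd.h>    // write, ftruncate, close

// `header`/`header_size` stand for the serialized GGUF metadata; `total_size`
// is the full model size including the (never written) tensor data.
static bool write_sparse_model(const char * path, const void * header,
                               size_t header_size, off_t total_size) {
    const int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) {
        return false;
    }
    bool ok = write(fd, header, header_size) == (ssize_t) header_size;
    ok = ok && ftruncate(fd, total_size) == 0; // tensor-data region stays a hole
    close(fd);
    return ok;
}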

> The random model can be stored on ggml-org.

I would prefer the random models to be generated locally, so that the tests don't need fast network on every new environment they are put on.

But I agree testing convert_hf_to_gguf.py could be nice, and comparing logits with transformers would be great for compute graph correctness tests.

Since that would depend on Python and transformers, that could be done in an eventual tests/test-model-random.py, which could probably be made to interface with tests/test-model-random.cpp given proper CLI argument handling.

Some things are more convenient to test separately, though, like tokenizers, chat templates, quantization, backend ops, etc.

My intention here is mainly to test batch splits and consistency of llama_memory operations, and it just also happens to test model loading and graph validity (shape compatibility between ops, etc.).

Since this will test that multi-user batches work correctly, testing the correctness of model graphs will really only need to handle trivial single-user batches. In a way, this will reduce the scope of logits correctness testing (and make it simpler).

I will keep this in mind, but for now the scope of tests/test-model-random.cpp will be about consistency of the outputs, because that's already broad enough as a scope (e.g. nothing really tests multi-user pooled embeddings with weird batches yet). This should also make it easier to verify that changing the batch split strategy doesn't break anything.
