-
Notifications
You must be signed in to change notification settings - Fork 12.2k
speculative: add --n-gpu-layers-draft option #3063
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
--n-gpu-layers
is not the only option for which you may want to set different values for the draft and target model. In fact, I would argue that for all performance options you may want different values. I don't think duplicating all performance options would be a good solution. As such I think it's better to instead extend the syntax of existing CLI arguments in such a way that they allow you to specify values for both the target and the draft model at. For example, you could specify something like -ngl 20,99
to mean that 20 layers of the target model and 99 layers of the draft model should be offloaded.
Caveat: we should think of what to do with CLI arguments like --tensor-split
that accept multiple comma-separated values ahead of time. We could either use different separators (but there are few special characters that don't cause syntax issues in some shell) or we could specify that the first few values are used for the target model and the following values for the draft model.
An alternate approach would be to do a prefix match for those kinds of arguments and then find the target in the rest of the option name. I.E. |
One more alternative: @JohannesGaessler feel free to close this PR in favor of another implementing requested changes - I'm not familiar enough with cpp to do this in short time (and not sure which of possible alternatives is better). |
This seems like it might be a "don't let perfect be the enemy of good" type of situation. It's definitely better to be able to specify some draft specific options than none. |
I can add other options the same way as I've done with n-gpu-layers, but I'm not sure for which of them it makes sense to have different values for draft and target models. |
Good points, maybe doing it like it has been done in this PR would be the better approach after all.
It's just that I would prefer to minimize the number of times that the user-facing interface is changed. So I would rather spend some extra time discussing alternatives than deal with potential breakage later. |
It's fine if you only add the functionality for |
I like this approach. The commands would be consistent and changes to the code would be minimal. It's also flexible. I'm not sure about the idea of taking a sensible default since it's not transparent to the user, maybe it's ok to copy the value from the parameter without Do you know if there's a standard way to describe this kind of parameter in commandline interface documentation? |
Without that "sensible default" (target, to be clear) you couldn't do I'm thinking of an approach where all the existing options could continue working.
One approach would just to not do anything special. You could either just duplicate the descriptions of the commands or say something like "Same as Or you could say something like:
(Just a very simple example, I didn't make an attempt to polish it to the standard of real documentation.) |
For actually implementing this, I think it would be really simple: // Untested, should be pretty close to working though.
bool match_arg_prefix(const std::string & arg, const std::string & prefix, std::string & target) {
target.clear();
if (arg.compare(0, prefix.size(), prefix) != 0) {
// Doesn't match.
return false;
}
if (arg.size() == prefix.size()) {
// No prefix specified.
return true;
}
if (arg.size() >= prefix.size() + 2 && arg[prefix.size()] == '-') {
target.assign(arg, prefix.size() + 1, std::string::npos);
return true;
}
// Prefix doesn't start with - or is empty.
return false;
} Then you can just replace something like: } else if (arg == "--gpu-layers" || arg == "-ngl" || arg == "--n-gpu-layers") {
if (++i >= argc) {
invalid_param = true;
break;
}
params.n_gpu_layers = std::stoi(argv[i]);
} else if // draft layers stuff [...] with // Earlier in the code:
// std::string target;
} else if (match_arg_prefix(arg, "--gpu-layers", target) || match_arg_prefix(arg, "--n-gpu-layers", target) || match_arg_prefix(arg, "-ngl", target)) {
if (++i >= argc) {
invalid_param = true;
break;
}
if (target.empty() || target == "main") {
params.n_gpu_layers = std::stoi(argv[i]);
} else if (target == "draft") {
params.n_gpu_layers_draft = std::stoi(argv[i]);
} else {
invalid_param = true;
break;
}
} // else if [...] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After thinking about it some more I think adding separate CLI arguments is probably better than extending the syntax of existing CLI arguments. Merging this PR as-is should be fine; we can add more elaborate suffix logic later. @ggerganov just to check, is the approach in this PR where -draft
is simply appended to existing CLI arguments fine with you?
From a user perspective, the approach I suggested is the same. Internally, it's just a way to reduce boilerplate code. (Certainly something that can be added later though, if there's interest.) |
@JohannesGaessler Yes, it's fine as proposed - we will improve later if there is too much duplication. p.s. Sorry for the delayed responses. Focusing for a few days on |
This PR adds an option to the speculative example to specify number of layers to offload to GPU for the draft model.
May help people to play with speculative with larger models and without a lot of VRAM.
Some test runs with ROCm on RX 6850M XT (12GB):
I've found it unexpected that acceptance rate went from 68.489% with draft on CPU to 63.772% with draft on GPU (tried bunch of different seeds, acceptance with draft on GPU stays at 63.772%).