
speculative: add --n-gpu-layers-draft option #3063

Merged: 1 commit merged into ggml-org:master on Sep 13, 2023

Conversation

sozforex (Contributor) commented Sep 7, 2023

This PR adds an option to the speculative example to specify the number of layers of the draft model to offload to the GPU. It may help people experiment with speculative decoding on larger models without a lot of VRAM.
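
For context, a rough sketch of the idea follows; it is not taken from the PR diff, and the exact field names and wiring (e.g. an n_gpu_layers_draft field in gpt_params) are assumptions based on the option name:

#include "common.h"
#include "llama.h"

#include <tuple>

// Hypothetical sketch only; the actual change in this PR may differ.
// Assumes gpt_params gains an n_gpu_layers_draft field set by -ngld / --n-gpu-layers-draft.
int main(int argc, char ** argv) {
    gpt_params params;
    if (!gpt_params_parse(argc, argv, params)) {
        return 1;
    }

    // load the target model with its own offload count (params.n_gpu_layers)
    llama_model   * model_tgt = NULL;
    llama_context * ctx_tgt   = NULL;
    std::tie(model_tgt, ctx_tgt) = llama_init_from_gpt_params(params);

    // load the draft model, reusing the same params but overriding the draft-specific values
    llama_model   * model_dft = NULL;
    llama_context * ctx_dft   = NULL;
    params.model        = params.model_draft;
    params.n_gpu_layers = params.n_gpu_layers_draft; // assumed new field
    std::tie(model_dft, ctx_dft) = llama_init_from_gpt_params(params);

    // ... speculative decoding loop ...
    return 0;
}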

Some test runs with ROCm on RX 6850M XT (12GB):

(Full commands and logs for each run are below.)

| Run | `time` output |
| --- | --- |
| CPU, only codellama-34b.Q8_0.gguf | 759,34s user 1,08s system 395% cpu 3:12,19 total |
| CPU+GPU, only codellama-34b.Q8_0.gguf (16/51 layers offloaded) | 549,70s user 1,07s system 391% cpu 2:20,71 total |
| CPU, speculative codellama-34b.Q8_0.gguf with codellama-7b.Q4_K_M.gguf as draft | 594,54s user 1,39s system 392% cpu 2:31,80 total |
| CPU+GPU, speculative codellama-34b.Q8_0.gguf (16/51 offloaded) with codellama-7b.Q4_K_M.gguf (0/35 offloaded) as draft | 465,69s user 1,40s system 387% cpu 2:00,60 total |
| CPU+GPU, speculative codellama-34b.Q8_0.gguf (0/51 offloaded) with codellama-7b.Q4_K_M.gguf (35/35 offloaded) as draft | 522,74s user 1,39s system 390% cpu 2:14,21 total |
| CPU+GPU, speculative codellama-34b.Q8_0.gguf (7/51 offloaded) with codellama-7b.Q4_K_M.gguf (35/35 offloaded) as draft | 462,27s user 1,38s system 387% cpu 1:59,54 total |



## CPU, only codellama-34b.Q8_0.gguf

./build/bin/main \
-m ../models/codellama/codellama-34b.Q8_0.gguf \
-p "// Quick-sort implementation in C (4 spaces indentation + detailed comments) and sample usage:\n\n#include" \
-e -t 4 -n 256 -c 4096 -s 8 --top_k 1 --temp 0.1 --top-p 0.95 \
-ngl 0

...

llama_print_timings:        load time =  1386,57 ms
llama_print_timings:      sample time =    94,60 ms /   256 runs   (    0,37 ms per token,  2706,13 tokens per second)
llama_print_timings: prompt eval time =  6737,07 ms /    25 tokens (  269,48 ms per token,     3,71 tokens per second)
llama_print_timings:        eval time = 182709,83 ms /   255 runs   (  716,51 ms per token,     1,40 tokens per second)
llama_print_timings:       total time = 189669,78 ms
Log end

759,34s user 1,08s system 395% cpu 3:12,19 total



## CPU+GPU only codellama-34b.Q8_0.gguf (with 16/51 layers offloaded)

./build/bin/main \
-m ../models/codellama/codellama-34b.Q8_0.gguf \
-p "// Quick-sort implementation in C (4 spaces indentation + detailed comments) and sample usage:\n\n#include" \
-e -t 4 -n 256 -c 4096 -s 8 --top_k 1 --temp 0.1 --top-p 0.95 \
-ngl 16

...

llama_print_timings:        load time =  2595,30 ms
llama_print_timings:      sample time =    95,45 ms /   256 runs   (    0,37 ms per token,  2681,92 tokens per second)
llama_print_timings: prompt eval time =  4795,47 ms /    25 tokens (  191,82 ms per token,     5,21 tokens per second)
llama_print_timings:        eval time = 131956,09 ms /   255 runs   (  517,47 ms per token,     1,93 tokens per second)
llama_print_timings:       total time = 136967,01 ms
Log end

549,70s user 1,07s system 391% cpu 2:20,71 total



## CPU, speculative codellama-34b.Q8_0.gguf with codellama-7b.Q4_K_M.gguf as draft

./build/bin/speculative \
-m ../models/codellama/codellama-34b.Q8_0.gguf \
-md ../models/codellama/codellama-7b.Q4_K_M.gguf \
-p "// Quick-sort implementation in C (4 spaces indentation + detailed comments) and sample usage:\n\n#include" \
-e -t 4 -n 256 -c 4096 -s 8 --top_k 1 --temp 0.1 --top-p 0.95 --draft 16 \
-ngl 0 -ngld 0

...

encoded   25 tokens in    8.518 seconds, speed:    2.935 t/s
decoded  258 tokens in  140.315 seconds, speed:    1.839 t/s

n_draft   = 16
n_predict = 258
n_drafted = 311
n_accept  = 213
accept    = 68.489%

draft:

llama_print_timings:        load time =   388.96 ms
llama_print_timings:      sample time =   620.51 ms /     1 runs   (  620.51 ms per token,     1.61 tokens per second)
llama_print_timings: prompt eval time =  1266.42 ms /    25 tokens (   50.66 ms per token,    19.74 tokens per second)
llama_print_timings:        eval time = 31189.23 ms /   345 runs   (   90.40 ms per token,    11.06 tokens per second)
llama_print_timings:       total time = 148833.25 ms

target:

llama_print_timings:        load time =  1381.40 ms
llama_print_timings:      sample time =    97.55 ms /   258 runs   (    0.38 ms per token,  2644.72 tokens per second)
llama_print_timings: prompt eval time = 109174.09 ms /   371 tokens (  294.27 ms per token,     3.40 tokens per second)
llama_print_timings:        eval time =  6444.34 ms /     9 runs   (  716.04 ms per token,     1.40 tokens per second)
llama_print_timings:       total time = 149227.66 ms

594,54s user 1,39s system 392% cpu 2:31,80 total



## CPU+GPU, speculative codellama-34b.Q8_0.gguf (16/51 offloaded) with codellama-7b.Q4_K_M.gguf (0/35 offloaded) as draft

./build/bin/speculative \
-m ../models/codellama/codellama-34b.Q8_0.gguf \
-md ../models/codellama/codellama-7b.Q4_K_M.gguf \
-p "// Quick-sort implementation in C (4 spaces indentation + detailed comments) and sample usage:\n\n#include" \
-e -t 4 -n 256 -c 4096 -s 8 --top_k 1 --temp 0.1 --top-p 0.95 --draft 16 \
-ngl 16 -ngld 0

...

encoded   25 tokens in    6.325 seconds, speed:    3.952 t/s
decoded  258 tokens in  110.069 seconds, speed:    2.344 t/s

n_draft   = 16
n_predict = 258
n_drafted = 311
n_accept  = 213
accept    = 68.489%

draft:

llama_print_timings:        load time =   390.08 ms
llama_print_timings:      sample time =   625.54 ms /     1 runs   (  625.54 ms per token,     1.60 tokens per second)
llama_print_timings: prompt eval time =  1268.08 ms /    25 tokens (   50.72 ms per token,    19.71 tokens per second)
llama_print_timings:        eval time = 31142.22 ms /   345 runs   (   90.27 ms per token,    11.08 tokens per second)
llama_print_timings:       total time = 116394.27 ms

target:

llama_print_timings:        load time =  2602.36 ms
llama_print_timings:      sample time =    98.55 ms /   258 runs   (    0.38 ms per token,  2618.04 tokens per second)
llama_print_timings: prompt eval time = 78556.04 ms /   371 tokens (  211.74 ms per token,     4.72 tokens per second)
llama_print_timings:        eval time =  4653.90 ms /     9 runs   (  517.10 ms per token,     1.93 tokens per second)
llama_print_timings:       total time = 116789.71 ms

465,69s user 1,40s system 387% cpu 2:00,60 total



## CPU+GPU, speculative codellama-34b.Q8_0.gguf (0/51 offloaded) with codellama-7b.Q4_K_M.gguf (35/35 offloaded) as draft

./build/bin/speculative \
-m ../models/codellama/codellama-34b.Q8_0.gguf \
-md ../models/codellama/codellama-7b.Q4_K_M.gguf \
-p "// Quick-sort implementation in C (4 spaces indentation + detailed comments) and sample usage:\n\n#include" \
-e -t 4 -n 256 -c 4096 -s 8 --top_k 1 --temp 0.1 --top-p 0.95 --draft 16 \
-ngl 0 -ngld 35

...

encoded   25 tokens in    7.430 seconds, speed:    3.365 t/s
decoded  258 tokens in  123.224 seconds, speed:    2.094 t/s

n_draft   = 16
n_predict = 258
n_drafted = 334
n_accept  = 213
accept    = 63.772%

draft:

llama_print_timings:        load time =   988.47 ms
llama_print_timings:      sample time =   660.38 ms /     1 runs   (  660.38 ms per token,     1.51 tokens per second)
llama_print_timings: prompt eval time =   269.49 ms /    25 tokens (   10.78 ms per token,    92.77 tokens per second)
llama_print_timings:        eval time =  7218.84 ms /   367 runs   (   19.67 ms per token,    50.84 tokens per second)
llama_print_timings:       total time = 130654.42 ms

target:

llama_print_timings:        load time =  1377.21 ms
llama_print_timings:      sample time =    97.19 ms /   258 runs   (    0.38 ms per token,  2654.48 tokens per second)
llama_print_timings: prompt eval time = 115224.91 ms /   393 tokens (  293.19 ms per token,     3.41 tokens per second)
llama_print_timings:        eval time =  7148.35 ms /    10 runs   (  714.84 ms per token,     1.40 tokens per second)
llama_print_timings:       total time = 131649.47 ms

522,74s user 1,39s system 390% cpu 2:14,21 total



## CPU+GPU, speculative codellama-34b.Q8_0.gguf (7/51 offloaded) with codellama-7b.Q4_K_M.gguf (35/35 offloaded) as draft

./build/bin/speculative \
-m ../models/codellama/codellama-34b.Q8_0.gguf \
-md ../models/codellama/codellama-7b.Q4_K_M.gguf \
-p "// Quick-sort implementation in C (4 spaces indentation + detailed comments) and sample usage:\n\n#include" \
-e -t 4 -n 256 -c 4096 -s 8 --top_k 1 --temp 0.1 --top-p 0.95 --draft 16 \
-ngl 7 -ngld 35

...

encoded   25 tokens in    6.520 seconds, speed:    3.834 t/s
decoded  258 tokens in  108.936 seconds, speed:    2.368 t/s

n_draft   = 16
n_predict = 258
n_drafted = 334
n_accept  = 213
accept    = 63.772%

draft:

llama_print_timings:        load time =   989.95 ms
llama_print_timings:      sample time =   659.58 ms /     1 runs   (  659.58 ms per token,     1.52 tokens per second)
llama_print_timings: prompt eval time =   256.32 ms /    25 tokens (   10.25 ms per token,    97.53 tokens per second)
llama_print_timings:        eval time =  7149.95 ms /   367 runs   (   19.48 ms per token,    51.33 tokens per second)
llama_print_timings:       total time = 115455.20 ms

target:

llama_print_timings:        load time =  1898.06 ms
llama_print_timings:      sample time =    97.65 ms /   258 runs   (    0.38 ms per token,  2642.01 tokens per second)
llama_print_timings: prompt eval time = 100969.41 ms /   393 tokens (  256.92 ms per token,     3.89 tokens per second)
llama_print_timings:        eval time =  6287.35 ms /    10 runs   (  628.74 ms per token,     1.59 tokens per second)
llama_print_timings:       total time = 116450.65 ms

462,27s user 1,38s system 387% cpu 1:59,54 total

I found it unexpected that the acceptance rate went from 68.489% with the draft model on the CPU to 63.772% with the draft model on the GPU (I tried a bunch of different seeds; acceptance with the draft model on the GPU stays at 63.772%).

JohannesGaessler (Collaborator) left a comment


--n-gpu-layers is not the only option for which you may want to set different values for the draft and target models. In fact, I would argue that for all performance options you may want different values. I don't think duplicating all performance options would be a good solution. As such, I think it's better to instead extend the syntax of existing CLI arguments so that they allow you to specify values for both the target and the draft model at once. For example, you could specify something like -ngl 20,99 to mean that 20 layers of the target model and 99 layers of the draft model should be offloaded.

Caveat: we should think ahead of time about what to do with CLI arguments like --tensor-split that already accept multiple comma-separated values. We could either use different separators (but there are few special characters that don't cause syntax issues in at least some shells) or we could specify that the first few values are used for the target model and the following values for the draft model.
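
For illustration only (not part of the original review comment), a minimal sketch of how a combined value such as -ngl 20,99 could be split into target and draft counts; the helper name and field names are hypothetical:

#include <string>
#include <utility>

// Hypothetical helper: split a value like "20,99" into (target, draft).
// If no comma is present, the draft value falls back to the target value.
static std::pair<int, int> parse_target_draft(const std::string & value) {
    const size_t pos = value.find(',');
    if (pos == std::string::npos) {
        const int v = std::stoi(value);
        return std::make_pair(v, v);
    }
    return std::make_pair(std::stoi(value.substr(0, pos)), std::stoi(value.substr(pos + 1)));
}

// Usage sketch (field names hypothetical):
//   std::pair<int, int> ngl = parse_target_draft(argv[i]);
//   params.n_gpu_layers       = ngl.first;
//   params.n_gpu_layers_draft = ngl.second;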

KerfuffleV2 (Collaborator)

> For example, you could specify something like -ngl 20,99

An alternate approach would be to do a prefix match for those kinds of arguments and then find the target in the rest of the option name, i.e. -ngl-main, -ngl-draft, etc., with a sensible default if it's blank. That way you wouldn't have to mess with separators and such. Also, something like -ngl 20,99 requires the user to know which position is which, and it isn't going to work if there's ever a different type of thing you'd want to be able to specify the number of layers for.

sozforex (Contributor, Author) commented Sep 8, 2023

One more alternative:
[COMMON_MODEL_OPTIONS] --main-opts [MAIN_MODEL_OPTIONS] --draft-opts [DRAFT_MODEL_OPTIONS]
How the [*_MODEL_OPTIONS] are processed would depend on where they are placed relative to --main-opts and --draft-opts: options placed before both markers would be common to both models.

@JohannesGaessler feel free to close this PR in favor of another one implementing the requested changes; I'm not familiar enough with C++ to do this in a short time (and I'm not sure which of the possible alternatives is better).
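
For illustration only (not part of the original comment), a minimal sketch of how arguments could be routed to the two models based on the proposed --main-opts / --draft-opts markers; all names here are hypothetical:

#include <string>
#include <vector>

// Hypothetical routing: arguments before either marker apply to both models,
// arguments after --main-opts apply only to the target model, and arguments
// after --draft-opts apply only to the draft model.
struct split_args {
    std::vector<std::string> common;
    std::vector<std::string> main_only;
    std::vector<std::string> draft_only;
};

static split_args split_by_marker(int argc, char ** argv) {
    split_args out;
    std::vector<std::string> * cur = &out.common;
    for (int i = 1; i < argc; ++i) {
        const std::string arg = argv[i];
        if (arg == "--main-opts") {
            cur = &out.main_only;
        } else if (arg == "--draft-opts") {
            cur = &out.draft_only;
        } else {
            cur->push_back(arg);
        }
    }
    return out;
}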

KerfuffleV2 (Collaborator)

This seems like it might be a "don't let perfect be the enemy of good" type of situation. It's definitely better to be able to specify some draft specific options than none.

sozforex (Contributor, Author) commented Sep 8, 2023

I can add other options the same way as I've done with n-gpu-layers, but I'm not sure for which of them it makes sense to have different values for the draft and target models.

JohannesGaessler (Collaborator)

> An alternate approach would be to do a prefix match for those kinds of arguments and then find the target in the rest of the option name, i.e. -ngl-main, -ngl-draft, etc., with a sensible default if it's blank. That way you wouldn't have to mess with separators and such. Also, something like -ngl 20,99 requires the user to know which position is which, and it isn't going to work if there's ever a different type of thing you'd want to be able to specify the number of layers for.

Good points, maybe doing it like it has been done in this PR would be the better approach after all.

> This seems like it might be a "don't let perfect be the enemy of good" type of situation. It's definitely better to be able to specify some draft specific options than none.

It's just that I would prefer to minimize the number of times that the user-facing interface is changed. So I would rather spend some extra time discussing alternatives than deal with potential breakage later.

JohannesGaessler (Collaborator)

> I can add other options the same way as I've done with n-gpu-layers, but I'm not sure for which of them it makes sense to have different values for the draft and target models.

It's fine if you only add the functionality for --n-gpu-layers in this PR. I just want to reach a consensus regarding the way to do it in general ahead of time.

Azeirah (Contributor) commented Sep 9, 2023

> For example, you could specify something like -ngl 20,99

> An alternate approach would be to do a prefix match for those kinds of arguments and then find the target in the rest of the option name, i.e. -ngl-main, -ngl-draft, etc., with a sensible default if it's blank. That way you wouldn't have to mess with separators and such. Also, something like -ngl 20,99 requires the user to know which position is which, and it isn't going to work if there's ever a different type of thing you'd want to be able to specify the number of layers for.

I like this approach. The commands would be consistent, and the changes to the code would be minimal. It's also flexible. I'm not sure about the idea of taking a sensible default, since it's not transparent to the user; maybe it's OK to copy the value from the parameter without -draft, and if that value doesn't make sense you can always set it explicitly anyway.

Do you know if there's a standard way to describe this kind of parameter in command-line interface documentation?

KerfuffleV2 (Collaborator)

> I'm not sure about the idea of taking a sensible default, since it's not transparent to the user,

Without that "sensible default" (the target, to be clear) you couldn't do -ngl anymore; you'd have to do -ngl-main or whatever. The sensible default target for -ngl would be the main model, not the speculation draft one. That kind of thing.

I'm thinking of an approach where all the existing options could continue working.

> Do you know if there's a standard way to describe this kind of parameter in command-line interface documentation?

One approach would be to just not do anything special. You could either duplicate the descriptions of the commands or say something like "Same as -ngl-main except it affects the speculation model".

Or you could say something like:

-ngl[-TARGET] or --num-gpu-layers[-TARGET] - number of layers to offload. If not specified, TARGET defaults to main. Possible targets: main - the main model, draft - the draft model used for speculation. Examples: -ngl 10, -ngl-draft 5

(Just a very simple example, I didn't make an attempt to polish it to the standard of real documentation.)

KerfuffleV2 (Collaborator)

For actually implementing this, I think it would be really simple:

// Untested, should be pretty close to working though.
bool match_arg_prefix(const std::string & arg, const std::string & prefix, std::string & target) {
    target.clear();
    if (arg.compare(0, prefix.size(), prefix) != 0) {
        // Doesn't match.
        return false;
    }
    if (arg.size() == prefix.size()) {
        // No target suffix specified; fall back to the default target.
        return true;
    }
    if (arg.size() >= prefix.size() + 2 && arg[prefix.size()] == '-') {
        target.assign(arg, prefix.size() + 1, std::string::npos);
        return true;
    }
    // The suffix after the prefix doesn't start with '-' or is empty.
    return false;
}

Then you can just replace something like:

} else if (arg == "--gpu-layers" || arg == "-ngl" || arg == "--n-gpu-layers") {
    if (++i >= argc) {
        invalid_param = true;
        break;
    }
    params.n_gpu_layers = std::stoi(argv[i]);
} else if // draft layers stuff [...]

with

// Earlier in the code:
// std::string target;

} else if (match_arg_prefix(arg, "--gpu-layers", target) || match_arg_prefix(arg, "--n-gpu-layers", target) || match_arg_prefix(arg, "-ngl", target)) {        
    if (++i >= argc) {
        invalid_param = true;
        break;
    }
    if (target.empty() || target == "main") {
        params.n_gpu_layers = std::stoi(argv[i]);
    } else if (target == "draft") {
        params.n_gpu_layers_draft = std::stoi(argv[i]);
    } else {
        invalid_param = true;
        break;
    }
} // else if [...]

JohannesGaessler (Collaborator) left a comment


After thinking about it some more, I think adding separate CLI arguments is probably better than extending the syntax of existing CLI arguments. Merging this PR as-is should be fine; we can add more elaborate suffix logic later. @ggerganov just to check, is the approach in this PR, where -draft is simply appended to existing CLI arguments, fine with you?

KerfuffleV2 (Collaborator)

> I think adding separate CLI arguments is probably better than extending the syntax of existing CLI arguments.

From a user perspective, the approach I suggested is the same. Internally, it's just a way to reduce boilerplate code. (Certainly something that can be added later though, if there's interest.)

JohannesGaessler merged commit 84e7236 into ggml-org:master on Sep 13, 2023
ggerganov (Member)

@JohannesGaessler Yes, it's fine as proposed - we will improve later if there is too much duplication.

P.S. Sorry for the delayed responses. I'm focusing on whisper.cpp for a few days, hence I'm a bit slow to respond. Thanks for helping out.

pkrmf pushed a commit to morlockstudios-com/llama.cpp that referenced this pull request Sep 26, 2023