
docker: add perplexity and bench commands to full image #11438


Merged: 1 commit into ggml-org:master on Jan 28, 2025

Conversation

rare-magma (Contributor):

This adds arguments to the tools.sh script to run the llama-bench and llama-perplexity tools in the Docker full image.
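
For context, the full image's entrypoint script dispatches on its first argument and execs the matching binary with the remaining arguments. Below is a minimal sketch of the kind of dispatch involved; the surrounding structure, flag spellings, and binary paths are illustrative, not the exact diff:

#!/bin/bash
set -e

arg1="$1"   # tool selector, e.g. --bench or --perplexity
shift       # the remaining arguments are forwarded to the tool

if [[ "$arg1" == '--bench' ]]; then
    # benchmark the model given via -m with llama-bench
    exec ./llama-bench "$@"
elif [[ "$arg1" == '--perplexity' ]]; then
    # compute perplexity over an input text with llama-perplexity
    exec ./llama-perplexity "$@"
else
    echo "Unknown command: $arg1" >&2
    exit 1
fi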

run log:

podman run --privileged --name llama --replace -it --rm -v /var/home/user/downloads/models:/models --net host --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card1:/dev/dri/card1 --init localhost/llama-cpp-vulkan:full --bench -m /models/qwen2.5-coder-3b-q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 590 Series (RADV POLARIS10) (radv) | uma: 0 | fp16: 0 | warp size: 64 | matrix cores: none
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| qwen2 3B Q8_0                  |   3.05 GiB |     3.09 B | Vulkan     |  99 |         pp512 |        515.60 ± 0.64 |
| qwen2 3B Q8_0                  |   3.05 GiB |     3.09 B | Vulkan     |  99 |         tg128 |         49.57 ± 0.36 |

build: 178a7eb9 (4562)
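
Assuming the usual pass-through of the remaining arguments to the underlying binary, llama-bench's own flags can be appended as well; for example (flag values are illustrative):

podman run --privileged --name llama --replace -it --rm -v /var/home/user/downloads/models:/models --net host --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card1:/dev/dri/card1 --init localhost/llama-cpp-vulkan:full --bench -m /models/qwen2.5-coder-3b-q8_0.gguf -ngl 99 -p 512 -n 128
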
podman run --privileged --name llama --replace -it --rm -v /var/home/user/downloads/models:/models --net host --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card1:/dev/dri/card1 --init localhost/llama-cpp-vulkan:full --perplexity -m /models/qwen2.5-coder-3b-q8_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 590 Series (RADV POLARIS10) (radv) | uma: 0 | fp16: 0 | warp size: 64 | matrix cores: none
build: 4562 (178a7eb9) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 590 Series (RADV POLARIS10)) - 8192 MiB free
llama_model_loader: loaded meta data with 34 key-value pairs and 434 tensors from /models/qwen2.5-coder-3b-q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2.5 Coder 3B
llama_model_loader: - kv   3:                           general.basename str              = Qwen2.5-Coder
llama_model_loader: - kv   4:                         general.size_label str              = 3B
llama_model_loader: - kv   5:                            general.license str              = other
llama_model_loader: - kv   6:                       general.license.name str              = qwen-research
llama_model_loader: - kv   7:                       general.license.link str              = https://huggingface.co/Qwen/Qwen2.5-C...
llama_model_loader: - kv   8:                   general.base_model.count u32              = 1
llama_model_loader: - kv   9:                  general.base_model.0.name str              = Qwen2.5 3B
llama_model_loader: - kv  10:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  11:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-3B
llama_model_loader: - kv  12:                               general.tags arr[str,5]       = ["code", "qwen", "qwen-coder", "codeq...
llama_model_loader: - kv  13:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  14:                          qwen2.block_count u32              = 36
llama_model_loader: - kv  15:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv  16:                     qwen2.embedding_length u32              = 2048
llama_model_loader: - kv  17:                  qwen2.feed_forward_length u32              = 11008
llama_model_loader: - kv  18:                 qwen2.attention.head_count u32              = 16
llama_model_loader: - kv  19:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  20:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  21:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  22:                          general.file_type u32              = 7
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  31:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  33:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  181 tensors
llama_model_loader: - type q8_0:  253 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 3.05 GiB (8.50 BPW) 
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 2048
print_info: n_layer          = 36
print_info: n_head           = 16
print_info: n_head_kv        = 2
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 256
print_info: n_embd_v_gqa     = 256
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 11008
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 3B
print_info: model params     = 3.09 B
print_info: general.name     = Qwen2.5 Coder 3B
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151643 '<|endoftext|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 148848 'ÄĬ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/37 layers to GPU
load_tensors:   CPU_Mapped model buffer size =  3127.59 MiB
llama_init_from_model: n_seq_max     = 4
llama_init_from_model: n_ctx         = 2048
llama_init_from_model: n_ctx_per_seq = 512
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 1000000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (512) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 36, can_shift = 1
llama_kv_cache_init:        CPU KV buffer size =    72.00 MiB
llama_init_from_model: KV self size  =   72.00 MiB, K (f16):   36.00 MiB, V (f16):   36.00 MiB
llama_init_from_model:        CPU  output buffer size =     2.32 MiB
llama_init_from_model:    Vulkan0 compute buffer size =   616.05 MiB
llama_init_from_model: Vulkan_Host compute buffer size =     8.01 MiB
llama_init_from_model: graph nodes  = 1266
llama_init_from_model: graph splits = 508 (with bs=512), 1 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 
perplexity: tokenizing the input ..
perplexity: tokenization took 0.004 ms
perplexity: you need at least 1024 tokens to evaluate perplexity with a context of 512
perplexity: the data file you provided tokenizes to only 0 tokens

llama_perf_context_print:        load time =     503.39 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =       0.12 ms /     2 tokens
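
Note that the perplexity run above was started without an input corpus, which is why the log reports that the data tokenizes to 0 tokens; a real measurement needs a text file passed through to llama-perplexity via -f, for example (the wikitext path below is only a placeholder):

podman run --privileged --name llama --replace -it --rm -v /var/home/user/downloads/models:/models --net host --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card1:/dev/dri/card1 --init localhost/llama-cpp-vulkan:full --perplexity -m /models/qwen2.5-coder-3b-q8_0.gguf -f /models/wiki.test.raw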

github-actions bot added the devops label (improvements to build systems and github actions) on Jan 26, 2025
ericcurtin (Collaborator):

@rare-magma you are doing something very similar to the RamaLama project, FWIW. In RamaLama it would be:

ramalama bench qwen2.5-coder

ericcurtin (Collaborator):

I would certainly appreciate a PR in RamaLama that does:

ramalama perplexity qwen2.5-coder

https://github.com/containers/ramalama

ericcurtin merged commit f643120 into ggml-org:master on Jan 28, 2025
2 checks passed
ericcurtin (Collaborator):

containers/ramalama#637

tinglou pushed a commit to tinglou/llama.cpp that referenced this pull request Feb 13, 2025
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Feb 26, 2025
mglambda pushed a commit to mglambda/llama.cpp that referenced this pull request Mar 8, 2025
Labels: devops (improvements to build systems and github actions)
3 participants