-
Confusion regarding bin file in README example

I want to merge my finetuned LoRA adapters into a base model - great, I can just use llama-export-lora.
But what confuses me is that the example in the README file uses a .bin file for the LoRA adapter. (Right now I'm using an older version.) So if a .bin version of the LoRA adapter is mandatory - how do I make a .bin file? Would really appreciate a full example if possible! What I have right now is the following:
-
Sorry, the guide has a typo. The LoRA adapter must always be GGUF:

```sh
./bin/llama-export-lora \
    -m open-llama-3b-v2-q8_0.gguf \
    -o open-llama-3b-v2-q8_0-english2tokipona-chat.gguf \
    --lora lora-open-llama-3b-v2-q8_0-english2tokipona-chat-LATEST.gguf
```

Multiple LoRA adapters can be applied by passing multiple `--lora FNAME` or `--lora-scaled FNAME S` command line parameters:

```sh
./bin/llama-export-lora \
    -m your_base_model.gguf \
    -o your_merged_model.gguf \
    --lora-scaled lora_task_A.gguf 0.5 \
    --lora-scaled lora_task_B.gguf 0.5
```

It's fixed in #8669
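If it helps, the merged output can then be loaded like any regular GGUF model; a quick sketch (the prompt and -n value here are just example placeholders):

```sh
# Run the merged model like any ordinary GGUF model
# (prompt text and token count are only illustrative).
./bin/llama-cli \
    -m open-llama-3b-v2-q8_0-english2tokipona-chat.gguf \
    -p "Translate to Toki Pona: good morning" \
    -n 64
```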
-
Thanks @ngxson - unfortunately I get an error at the end - maybe you can spot what's wrong.

So I converted my LoRA adapter to GGUF [lora-adapters here]
full text: conversion-of-lora-to-gguf.txt

Then I tried merging lora_adapter.gguf into the base model, but unfortunately get an error.
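For reference, the two steps above look roughly like this - a sketch with hypothetical paths, assuming llama.cpp's convert_lora_to_gguf.py script and its --base / --outfile options:

```sh
# Step 1: convert a PEFT-style LoRA checkpoint directory to a GGUF adapter
# (directory names and the output filename are hypothetical placeholders).
python convert_lora_to_gguf.py ./my-lora-checkpoint \
    --base ./my-base-model-hf \
    --outfile lora_adapter.gguf

# Step 2: merge the GGUF adapter into the base model with llama-export-lora.
./bin/llama-export-lora \
    -m base_model.gguf \
    -o merged_model.gguf \
    --lora lora_adapter.gguf
```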
-
One last thing @ngxson - it seems like my merged model only generates "GGGGGGGGG" - hmm, odd.
But if I only run the LoRA separately with the base model, then it works.
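For clarity, "running the LoRA separately" here means applying the adapter at inference time instead of merging it - a sketch with hypothetical filenames:

```sh
# Apply the GGUF LoRA adapter at runtime on top of the base model
# (filenames and prompt are hypothetical placeholders).
./bin/llama-cli \
    -m base_model.gguf \
    --lora lora_adapter.gguf \
    -p "Hello"
```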