
An easy python script to create quantized (k-bit support) GGML models from local HF Transformer LLaMA models (haven't tested Llama-2 yet) #2311


Merged: 2 commits merged into ggml-org:master on Jul 21, 2023

Conversation

richardr1126 (Contributor) commented Jul 21, 2023

This is a slightly tweaked version of TheBloke's make_ggml.py script. It makes it very easy to quantize Hugging Face LLaMA-based models into GGML. Models are downloaded from HF if the given file path can't be found.

I did not add any new requirements to requirements.txt; the huggingface-hub pip package is installed with a subprocess command. The only change is a new examples/make_ggml.py file.

Note: Llama-based models only
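
The description above says the model is downloaded from HF when the given path does not exist locally, and that huggingface-hub is installed at runtime via a subprocess call rather than through requirements.txt. A minimal sketch of how that fallback might look; the function name, `models_dir` default, and directory naming are illustrative and not taken from the merged script:

```python
import os
import subprocess
import sys

def resolve_model(model: str, models_dir: str = "../models") -> str:
    """Return a local directory for `model`, downloading it from the HF Hub if needed."""
    if os.path.isdir(model):
        return model  # already a local HF Transformers directory

    # The PR installs huggingface-hub at runtime instead of adding it to requirements.txt.
    subprocess.run([sys.executable, "-m", "pip", "install", "huggingface-hub"], check=True)
    from huggingface_hub import snapshot_download  # imported after the install

    local_dir = os.path.join(models_dir, model.split("/")[-1])
    snapshot_download(repo_id=model, local_dir=local_dir)
    return local_dir
```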

Usage:
python make_ggml.py --model {model_dir_or_hf_repo_name} [--outname {output_name} (Optional)] [--outdir {output_directory} (Optional)] [--quants {quant_types} (Optional)] [--keep_fp16 (Optional)]

Arguments:
- --model: (Required) The directory of the downloaded Hugging Face model or the name of the Hugging Face model repository. If the model directory does not exist, it will be downloaded from the Hugging Face model hub.
- --outname: (Optional) The name of the output model. If not specified, the last part of the model directory path or the Hugging Face model repo name will be used.
- --outdir: (Optional) The directory where the output model(s) will be stored. If not specified, '../models/{outname}' will be used.
- --quants: (Optional) The types of quantization to apply. This should be a space-separated list. The default is 'Q4_K_M Q5_K_S'.
- --keep_fp16: (Optional) If specified, the FP16 model will not be deleted after the quantized models are created.
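
For reference, a minimal argparse sketch that matches the flags documented above; the help strings and default handling are assumptions, not copied from the merged script:

```python
import argparse

parser = argparse.ArgumentParser(
    description="Convert a HF LLaMA-based model to GGML and quantize it.")
parser.add_argument("--model", required=True,
                    help="Local model directory or Hugging Face repo name")
parser.add_argument("--outname", default=None,
                    help="Output model name (defaults to the last part of --model)")
parser.add_argument("--outdir", default=None,
                    help="Output directory (defaults to ../models/{outname})")
parser.add_argument("--quants", nargs="*", default=["Q4_K_M", "Q5_K_S"],
                    help="Quantization types to produce")
parser.add_argument("--keep_fp16", action="store_true",
                    help="Keep the intermediate FP16 GGML file")
args = parser.parse_args()
```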

Quant types:
- Q4_0: small, very high quality loss - legacy, prefer using Q3_K_M
- Q4_1: small, substantial quality loss - legacy, prefer using Q3_K_L
- Q5_0: medium, balanced quality - legacy, prefer using Q4_K_M
- Q5_1: medium, low quality loss - legacy, prefer using Q5_K_M
- Q2_K: smallest, extreme quality loss - not recommended
- Q3_K: alias for Q3_K_M
- Q3_K_S: very small, very high quality loss
- Q3_K_M: very small, very high quality loss
- Q3_K_L: small, substantial quality loss
- Q4_K: alias for Q4_K_M
- Q4_K_S: small, significant quality loss
- Q4_K_M: medium, balanced quality - recommended
- Q5_K: alias for Q5_K_M
- Q5_K_S: large, low quality loss - recommended
- Q5_K_M: large, very low quality loss - recommended
- Q6_K: very large, extremely low quality loss
- Q8_0: very large, extremely low quality loss - not recommended
- F16: extremely large, virtually no quality loss - not recommended
- F32: absolutely huge, lossless - not recommended
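
The overall flow is: convert the HF model to an FP16 GGML file with convert.py, then run the quantize binary once per requested type, and delete the FP16 file unless --keep_fp16 is given. A rough sketch under those assumptions; the convert.py invocation matches the command visible in the error report below, while the quantize call and file naming are illustrative rather than a verbatim copy of the merged script:

```python
import os
import subprocess

def make_ggml(model_dir, outname, outdir, quants, keep_fp16):
    os.makedirs(outdir, exist_ok=True)
    fp16 = f"{outdir}/{outname}.ggmlv3.fp16.bin"

    # 1. Convert the HF model to an FP16 GGML file.
    subprocess.run(f"python3 ../convert.py {model_dir} --outtype f16 --outfile {fp16}",
                   shell=True, check=True)

    # 2. Quantize the FP16 file once per requested quant type.
    for qtype in quants:
        outfile = f"{outdir}/{outname}.ggmlv3.{qtype}.bin"
        subprocess.run(f"../quantize {fp16} {outfile} {qtype}", shell=True, check=True)

    # 3. Drop the large FP16 intermediate unless asked to keep it.
    if not keep_fp16:
        os.remove(fp16)
```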

@ggerganov ggerganov merged commit 7d5f184 into ggml-org:master Jul 21, 2023
first-leon (Contributor) commented Jul 23, 2023

I checked this script with this model: https://huggingface.co/ai-forever/ruGPT-3.5-13B
But it does not work. I get this error:

Loading model file ../../../models/ruGPT-3.5-13B/pytorch_model-00001-of-00006.bin
Traceback (most recent call last):
  File "/media/leon/a006f28a-3d90-45c9-a583-e083cbf37840/AI/soft/llama.cpp/examples/../convert.py", line 1264, in <module>
    main()
  File "/media/leon/a006f28a-3d90-45c9-a583-e083cbf37840/AI/soft/llama.cpp/examples/../convert.py", line 1244, in main
    model_plus = load_some_model(args.model)
  File "/media/leon/a006f28a-3d90-45c9-a583-e083cbf37840/AI/soft/llama.cpp/examples/../convert.py", line 1165, in load_some_model
    models_plus.append(lazy_load_file(path))
  File "/media/leon/a006f28a-3d90-45c9-a583-e083cbf37840/AI/soft/llama.cpp/examples/../convert.py", line 955, in lazy_load_file
    return lazy_load_torch_file(fp, path)
  File "/media/leon/a006f28a-3d90-45c9-a583-e083cbf37840/AI/soft/llama.cpp/examples/../convert.py", line 826, in lazy_load_torch_file
    model = unpickler.load()
  File "/media/leon/a006f28a-3d90-45c9-a583-e083cbf37840/AI/soft/llama.cpp/examples/../convert.py", line 815, in find_class
    return self.CLASSES[(module, name)]
KeyError: ('torch', 'BoolStorage')
Traceback (most recent call last):
  File "/media/leon/a006f28a-3d90-45c9-a583-e083cbf37840/AI/soft/llama.cpp/examples/make-ggml.py", line 92, in <module>
    main(args.model, args.outname, args.outdir, args.quants, args.keep_fp16)
  File "/media/leon/a006f28a-3d90-45c9-a583-e083cbf37840/AI/soft/llama.cpp/examples/make-ggml.py", line 69, in main
    subprocess.run(f"python3 ../convert.py {model} --outtype f16 --outfile {fp16}", shell=True, check=True)
  File "/usr/lib/python3.10/subprocess.py", line 524, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python3 ../convert.py ../../../models/ruGPT-3.5-13B/ --outtype f16 --outfile ../models/None/None.ggmlv3.fp16.bin' returned non-zero exit status 1.

TheBloke (Contributor) commented

@first-leon RuGPT is a GPT-2-based model, which is not supported by llama.cpp. You would need to use https://github.com/ggerganov/ggml, although be aware that GPT-2 isn't very well supported by GGML at the moment, and you might need to make manual changes to the convert script in that repo (at least I saw that being discussed in an issue there recently).

@richardr1126 richardr1126 changed the title An easy python script to create quantized (k-bit support) GGML models from local HF Transformer models. An easy python script to create quantized (k-bit support) GGML models from local HF Transformer LLaMA models (haven't tested Llama-2 yet) Jul 25, 2023