
An easy python script to create quantized (k-bit support) GGML models from local HF Transformer LLaMA models (haven't tested Llama-2 yet) #2311


Merged: 2 commits merged into ggml-org:master on Jul 21, 2023

Conversation

richardr1126 (Contributor) commented Jul 21, 2023

This is a slightly tweaked version of TheBloke's make_ggml.py script. It makes it very easy to quantize Hugging Face LLaMA-based models into GGML. Models are downloaded from HF if the given file path can't be found.

I did not add any new requirements to requirements.txt; the huggingface-hub pip package is installed with a subprocess command. The only change is a new examples/make_ggml.py file.

Note: Llama-based models only
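
The description above says the model is downloaded from HF when the given path does not exist locally, and that huggingface-hub is installed at runtime via a subprocess call rather than through requirements.txt. A minimal sketch of how that fallback might look; the function name, `models_dir` default, and directory naming are illustrative and not taken from the merged script:

```python
import os
import subprocess
import sys

def resolve_model(model: str, models_dir: str = "../models") -> str:
    """Return a local directory for `model`, downloading it from the HF Hub if needed."""
    if os.path.isdir(model):
        return model  # already a local HF Transformers directory

    # The PR installs huggingface-hub at runtime instead of adding it to requirements.txt.
    subprocess.run([sys.executable, "-m", "pip", "install", "huggingface-hub"], check=True)
    from huggingface_hub import snapshot_download  # imported after the install

    local_dir = os.path.join(models_dir, model.split("/")[-1])
    snapshot_download(repo_id=model, local_dir=local_dir)
    return local_dir
```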

Usage:
python make_ggml.py --model {model_dir_or_hf_repo_name} [--outname {output_name} (Optional)] [--outdir {output_directory} (Optional)] [--quants {quant_types} (Optional)] [--keep_fp16 (Optional)]

Arguments:
- --model: (Required) The directory of the downloaded Hugging Face model or the name of the Hugging Face model repository. If the model directory does not exist, it will be downloaded from the Hugging Face model hub.
- --outname: (Optional) The name of the output model. If not specified, the last part of the model directory path or the Hugging Face model repo name will be used.
- --outdir: (Optional) The directory where the output model(s) will be stored. If not specified, '../models/{outname}' will be used.
- --quants: (Optional) The types of quantization to apply. This should be a space-separated list. The default is 'Q4_K_M Q5_K_S'.
- --keep_fp16: (Optional) If specified, the FP16 model will not be deleted after the quantized models are created.
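
For reference, a minimal argparse sketch that matches the flags documented above; the help strings and default handling are assumptions, not copied from the merged script:

```python
import argparse

parser = argparse.ArgumentParser(
    description="Convert a HF LLaMA-based model to GGML and quantize it.")
parser.add_argument("--model", required=True,
                    help="Local model directory or Hugging Face repo name")
parser.add_argument("--outname", default=None,
                    help="Output model name (defaults to the last part of --model)")
parser.add_argument("--outdir", default=None,
                    help="Output directory (defaults to ../models/{outname})")
parser.add_argument("--quants", nargs="*", default=["Q4_K_M", "Q5_K_S"],
                    help="Quantization types to produce")
parser.add_argument("--keep_fp16", action="store_true",
                    help="Keep the intermediate FP16 GGML file")
args = parser.parse_args()
```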

Quant types:
- Q4_0: small, very high quality loss - legacy, prefer using Q3_K_M
- Q4_1: small, substantial quality loss - legacy, prefer using Q3_K_L
- Q5_0: medium, balanced quality - legacy, prefer using Q4_K_M
- Q5_1: medium, low quality loss - legacy, prefer using Q5_K_M
- Q2_K: smallest, extreme quality loss - not recommended
- Q3_K: alias for Q3_K_M
- Q3_K_S: very small, very high quality loss
- Q3_K_M: very small, very high quality loss
- Q3_K_L: small, substantial quality loss
- Q4_K: alias for Q4_K_M
- Q4_K_S: small, significant quality loss
- Q4_K_M: medium, balanced quality - recommended
- Q5_K: alias for Q5_K_M
- Q5_K_S: large, low quality loss - recommended
- Q5_K_M: large, very low quality loss - recommended
- Q6_K: very large, extremely low quality loss
- Q8_0: very large, extremely low quality loss - not recommended
- F16: extremely large, virtually no quality loss - not recommended
- F32: absolutely huge, lossless - not recommended
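
The overall flow is: convert the HF model to an FP16 GGML file with convert.py, then run the quantize binary once per requested type, and delete the FP16 file unless --keep_fp16 is given. A rough sketch under those assumptions; the convert.py invocation matches the command visible in the error report below, while the quantize call and file naming are illustrative rather than a verbatim copy of the merged script:

```python
import os
import subprocess

def make_ggml(model_dir, outname, outdir, quants, keep_fp16):
    os.makedirs(outdir, exist_ok=True)
    fp16 = f"{outdir}/{outname}.ggmlv3.fp16.bin"

    # 1. Convert the HF model to an FP16 GGML file.
    subprocess.run(f"python3 ../convert.py {model_dir} --outtype f16 --outfile {fp16}",
                   shell=True, check=True)

    # 2. Quantize the FP16 file once per requested quant type.
    for qtype in quants:
        outfile = f"{outdir}/{outname}.ggmlv3.{qtype}.bin"
        subprocess.run(f"../quantize {fp16} {outfile} {qtype}", shell=True, check=True)

    # 3. Drop the large FP16 intermediate unless asked to keep it.
    if not keep_fp16:
        os.remove(fp16)
```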

@ggerganov ggerganov merged commit 7d5f184 into ggml-org:master Jul 21, 2023
first-leon (Contributor) commented Jul 23, 2023

I checked this script with this model: https://huggingface.co/ai-forever/ruGPT-3.5-13B
But it does not work. I get this error:

Loading model file ../../../models/ruGPT-3.5-13B/pytorch_model-00001-of-00006.bin
Traceback (most recent call last):
  File "/media/leon/a006f28a-3d90-45c9-a583-e083cbf37840/AI/soft/llama.cpp/examples/../convert.py", line 1264, in <module>
    main()
  File "/media/leon/a006f28a-3d90-45c9-a583-e083cbf37840/AI/soft/llama.cpp/examples/../convert.py", line 1244, in main
    model_plus = load_some_model(args.model)
  File "/media/leon/a006f28a-3d90-45c9-a583-e083cbf37840/AI/soft/llama.cpp/examples/../convert.py", line 1165, in load_some_model
    models_plus.append(lazy_load_file(path))
  File "/media/leon/a006f28a-3d90-45c9-a583-e083cbf37840/AI/soft/llama.cpp/examples/../convert.py", line 955, in lazy_load_file
    return lazy_load_torch_file(fp, path)
  File "/media/leon/a006f28a-3d90-45c9-a583-e083cbf37840/AI/soft/llama.cpp/examples/../convert.py", line 826, in lazy_load_torch_file
    model = unpickler.load()
  File "/media/leon/a006f28a-3d90-45c9-a583-e083cbf37840/AI/soft/llama.cpp/examples/../convert.py", line 815, in find_class
    return self.CLASSES[(module, name)]
KeyError: ('torch', 'BoolStorage')
Traceback (most recent call last):
  File "/media/leon/a006f28a-3d90-45c9-a583-e083cbf37840/AI/soft/llama.cpp/examples/make-ggml.py", line 92, in <module>
    main(args.model, args.outname, args.outdir, args.quants, args.keep_fp16)
  File "/media/leon/a006f28a-3d90-45c9-a583-e083cbf37840/AI/soft/llama.cpp/examples/make-ggml.py", line 69, in main
    subprocess.run(f"python3 ../convert.py {model} --outtype f16 --outfile {fp16}", shell=True, check=True)
  File "/usr/lib/python3.10/subprocess.py", line 524, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python3 ../convert.py ../../../models/ruGPT-3.5-13B/ --outtype f16 --outfile ../models/None/None.ggmlv3.fp16.bin' returned non-zero exit status 1.

TheBloke (Contributor) commented

@first-leon RuGPT is a GPT-2-based model, which is not supported by llama.cpp. You would need to use https://github.com/ggerganov/ggml, although be aware that GPT-2 isn't very well supported by GGML at the moment, and you might need to make manual changes to the convert script in that repo (at least I saw that being discussed in an issue there recently).

@richardr1126 richardr1126 changed the title An easy python script to create quantized (k-bit support) GGML models from local HF Transformer models. An easy python script to create quantized (k-bit support) GGML models from local HF Transformer LLaMA models (haven't tested Llama-2 yet) Jul 25, 2023