Help: Where/how dequantize happens and how to create new quantization formats? #1796
Thank you for this amazing software and community. I am trying to better understand where/how the dequantizing happens and would appreciate any pointers, so that I can try out new quantization formats. When I look through the code, it seems like there are various dequantize functions for the different quantization types, but I don't quite understand how these get used in inference. It seems like the main place these are used by … In further digging, it seems like possibly the place to look might be in … My main question is: "If I want to try out a new quantization method, which functions should I change to pack/unpack the quantized values?" Thanks in advance for any pointers.
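For context, here is my rough mental model of what one of these quantized block formats and its dequantize function look like. This is only a simplified sketch: the real definitions in `ggml.c` differ in the details (fp16 scale, exact bit layout, block size), and the names below are made up.

```c
/* Simplified sketch of a Q4_0-style block and its dequantize routine,
 * for illustration only. The real structs and functions live in ggml.c,
 * and the exact field names, scale type, and packing may differ. */
#include <stdint.h>

#define QK 32                      /* number of weights per block */

typedef struct {
    float   d;                     /* per-block scale (fp16 in the real code) */
    uint8_t qs[QK / 2];            /* 32 x 4-bit quants, two per byte */
} block_q4_sketch;

/* Unpack one row of blocks back to float: y = d * (q - 8) */
void dequantize_row_sketch(const block_q4_sketch *x, float *y, int k) {
    const int nb = k / QK;         /* number of blocks in the row */
    for (int i = 0; i < nb; i++) {
        const float d = x[i].d;
        for (int j = 0; j < QK / 2; j++) {
            const int q0 = (x[i].qs[j] & 0x0F) - 8;  /* low nibble  */
            const int q1 = (x[i].qs[j] >> 4)   - 8;  /* high nibble */
            y[i*QK + j*2 + 0] = q0 * d;
            y[i*QK + j*2 + 1] = q1 * d;
        }
    }
}
```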
Replies: 1 comment 1 reply
You'll probably get a better answer from someone else, but:

For the most part, the packing (quantizing) part happens when the model file is created. Look at `examples/quantize` for the tool used to quantize models. Each tensor saved in the model has an `ftype`, which is the type of the tensor: it could be `GGML_TYPE_F16`, it could be `GGML_TYPE_Q5_0`, etc.

When the file is loaded, the tensors are created with the type that was saved (generally). Getting ready for inference involves building a graph of the various operations that will be performed on the tensors. Some operations support working on quantized tensors and some don't, so trying to perform an operation with the wrong type of tensor may just fail at graph build time. Once you've built the graph and run it, that is when the operations (and any dequantizing they involve) actually happen.

A bunch of new quantizations were added in this pull request: #1684. You can possibly look at that and get an idea of which operations you will need to support your new quantization in.
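To make that a bit more concrete, here is a rough outline of the pieces a new quantization type tends to need on the ggml side. The names below (`block_qX`, `quantize_row_qX`, and so on) are placeholders rather than real identifiers, and the registration mechanism (a per-type table of function pointers in `ggml.c`) may look different in the current source, so treat this as a sketch and check the code.

```c
/* Sketch only: placeholder names for the pieces a new ggml quantization
 * format typically needs. Check ggml.c for the real per-type function
 * table and signatures before wiring anything up. */
#include <stdint.h>

#define QKX 32                       /* weights per block for the hypothetical format */

typedef struct {
    float   d;                       /* per-block scale */
    uint8_t qs[QKX / 2];             /* packed 4-bit quants (example packing) */
} block_qX;

/* pack: float weights -> blocks (what examples/quantize would invoke) */
void quantize_row_qX(const float *x, block_qX *y, int k);

/* unpack: blocks -> float (used when an op needs full-precision values) */
void dequantize_row_qX(const block_qX *x, float *y, int k);

/* dot product on packed data (the hot path used by matrix multiplication) */
void vec_dot_qX(int n, float *s, const block_qX *x, const void *y);

/* The format is then hooked up through a per-type table of function
 * pointers, roughly along these lines (field names illustrative): */
typedef struct {
    void (*quantize_row)  (const float *x, void *y, int k);
    void (*dequantize_row)(const void *x, float *y, int k);
    void (*vec_dot)       (int n, float *s, const void *x, const void *y);
} qX_fns_sketch;
```

On top of that, you would also add a new `GGML_TYPE_*` enum value and fill in its block size and byte size so tensors of the new type can be allocated and saved with the right `ftype`.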