Help: Where/how dequantize happens and how to create new quantization formats? #1796
Thank you for this amazing software and community. I am trying to better understand where/how the dequantizing happens and would appreciate any pointers, so that I can try out new quantization formats. When I look through the code, it seems like there are various dequantize functions for the different quantization types, but I don't quite understand how these get used in inference. It seems like the main place these are used by … In further digging, it seems like possibly the place to look might be in … My main question is: "If I want to try out a new quantization method, which functions should I change to pack/unpack the quantized values?" Thanks in advance for any pointers.
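For context, here is my rough mental model of what one of these quantized block formats and its dequantize function look like. This is only a simplified sketch: the real definitions in `ggml.c` differ in the details (fp16 scale, exact bit layout, block size), and the names below are made up.

```c
/* Simplified sketch of a Q4_0-style block and its dequantize routine,
 * for illustration only. The real structs and functions live in ggml.c,
 * and the exact field names, scale type, and packing may differ. */
#include <stdint.h>

#define QK 32                      /* number of weights per block */

typedef struct {
    float   d;                     /* per-block scale (fp16 in the real code) */
    uint8_t qs[QK / 2];            /* 32 x 4-bit quants, two per byte */
} block_q4_sketch;

/* Unpack one row of blocks back to float: y = d * (q - 8) */
void dequantize_row_sketch(const block_q4_sketch *x, float *y, int k) {
    const int nb = k / QK;         /* number of blocks in the row */
    for (int i = 0; i < nb; i++) {
        const float d = x[i].d;
        for (int j = 0; j < QK / 2; j++) {
            const int q0 = (x[i].qs[j] & 0x0F) - 8;  /* low nibble  */
            const int q1 = (x[i].qs[j] >> 4)   - 8;  /* high nibble */
            y[i*QK + j*2 + 0] = q0 * d;
            y[i*QK + j*2 + 1] = q1 * d;
        }
    }
}
```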
Replies: 1 comment 1 reply
You'll probably get a better answer from someone else, but:

For the most part, the packing (quantizing) part happens when the model file is created. Look at `examples/quantize` for the tool used to quantize models. Each tensor saved in the model has an `ftype`, which is the type of the tensor: it could be `GGML_TYPE_F16`, it could be `GGML_TYPE_Q5_0`, etc.

When the file is loaded, the tensors are created with the type that was saved (generally). Getting ready for inference involves building a graph of the various operations that will be performed on the tensors. Some operations support working on quantized tensors and some don't, so trying to perform an operation with the wrong type of tensor may just fail at graph build time. Once you've built the graph and run it, that is when the operations (and any dequantizing they involve) actually happen.

A bunch of new quantizations were added in this pull request: #1684. You can possibly look at that and get an idea of which operations you will need to support your new quantization in.
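To make that a bit more concrete, here is a rough outline of the pieces a new quantization type tends to need on the ggml side. The names below (`block_qX`, `quantize_row_qX`, and so on) are placeholders rather than real identifiers, and the registration mechanism (a per-type table of function pointers in `ggml.c`) may look different in the current source, so treat this as a sketch and check the code.

```c
/* Sketch only: placeholder names for the pieces a new ggml quantization
 * format typically needs. Check ggml.c for the real per-type function
 * table and signatures before wiring anything up. */
#include <stdint.h>

#define QKX 32                       /* weights per block for the hypothetical format */

typedef struct {
    float   d;                       /* per-block scale */
    uint8_t qs[QKX / 2];             /* packed 4-bit quants (example packing) */
} block_qX;

/* pack: float weights -> blocks (what examples/quantize would invoke) */
void quantize_row_qX(const float *x, block_qX *y, int k);

/* unpack: blocks -> float (used when an op needs full-precision values) */
void dequantize_row_qX(const block_qX *x, float *y, int k);

/* dot product on packed data (the hot path used by matrix multiplication) */
void vec_dot_qX(int n, float *s, const block_qX *x, const void *y);

/* The format is then hooked up through a per-type table of function
 * pointers, roughly along these lines (field names illustrative): */
typedef struct {
    void (*quantize_row)  (const float *x, void *y, int k);
    void (*dequantize_row)(const void *x, float *y, int k);
    void (*vec_dot)       (int n, float *s, const void *x, const void *y);
} qX_fns_sketch;
```

On top of that, you would also add a new `GGML_TYPE_*` enum value and fill in its block size and byte size so tensors of the new type can be allocated and saved with the right `ftype`.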