# Quantization Flow in Executorch

The high-level flow for the short-term quantization flow in Executorch is described here: https://docs.google.com/document/d/1UuktDffiMH0rXRuiL0e8bQaS3X0XfkIHEF1VtpFmA5A/edit#heading=h.mywdosyratgh

## 1. Capture the model with `exir.capture`
### Process
The flow uses PyTorch 2.0 Export Quantization, which works on a model captured by `exir.capture`. If the model is not traceable, please follow the [User Guide](TBD) to make changes to the model, and see [here](https://pytorch.org/docs/main/generated/exportdb/index.html) for the constructs supported by `exir.capture`.

```
# program capture
from executorch import exir
from executorch.exir import CaptureConfig
from executorch.exir.tracer import ExirDynamoConfig

# Whether to capture the model with dynamic shapes enabled
enable_dynamic_shape = False

dynamo_config = ExirDynamoConfig(
    capture_scalar_outputs=True,
    guard_nn_modules=True,
    dynamic_shapes=enable_dynamic_shape,
    specialize_int=True,
    verbose=True,
)
capture_config = CaptureConfig(
    pt2_mode=True,
    _dynamo_config=dynamo_config,
    enable_dynamic_shape=enable_dynamic_shape,
)
# m is the model to capture (an nn.Module), example_inputs is a tuple of example inputs
exported_program = exir.capture(m, example_inputs, capture_config)
m = exported_program.graph_module
```
### Result
The result of this step is a captured model (an `fx.GraphModule`).
## 2. Quantization
### Process
Please take a look at the [PyTorch 2.0 Export post training static quantization tutorial](https://pytorch.org/tutorials/prototype/fx_graph_mode_ptq_static.html) to learn about all the steps of quantization. The main APIs used to quantize the model are:
* `prepare_pt2e`: inserts observers into the model; it takes a backend-specific `Quantizer` as an argument, which annotates the nodes with the information needed to quantize the model properly for that backend
* (not an API) calibration: run the model on some sample data so the observers can record statistics
* `convert_pt2e`: converts an observed model to a quantized model; we have a special representation for selected ops (e.g. quantized linear), other ops are represented as (dq -> float32_op -> q), and q/dq are decomposed into more primitive operators

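To build intuition for what the observers inserted by `prepare_pt2e` produce during calibration, here is a minimal pure-Python sketch of turning an observed min/max range into an asymmetric `uint8` scale and zero point. This is an illustration only, not the actual PT2E implementation; the function name and defaults are made up, and the real observers handle many more cases (symmetric schemes, per-channel parameters, etc.):

```python
def choose_qparams(min_val: float, max_val: float, qmin: int = 0, qmax: int = 255):
    """Sketch: compute affine quantization parameters from an observed min/max range."""
    # Ensure zero is exactly representable in the quantized domain
    min_val = min(min_val, 0.0)
    max_val = max(max_val, 0.0)
    scale = (max_val - min_val) / (qmax - qmin)
    if scale == 0.0:
        # Degenerate range (e.g. all-zero activations); fall back to a dummy scale
        scale = 1.0
    zero_point = qmin - round(min_val / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point
```

Each observer records statistics like these during calibration; `convert_pt2e` then bakes the resulting scale/zero point pairs into the quantize/dequantize ops it emits.
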
### Result
The result after these steps is a reference quantized model, with the quantize/dequantize operators further decomposed. Example:

(TODO): update
```
# Reference quantized pattern for quantized add
x = torch.ops.quantized_decomposed.dequantize_per_tensor(x, x_scale, x_zero_point, x_qmin, x_qmax, torch.uint8)
y = torch.ops.quantized_decomposed.dequantize_per_tensor(y, y_scale, y_zero_point, y_qmin, y_qmax, torch.uint8)
out = x + y
out = torch.ops.quantized_decomposed.quantize_per_tensor(out, out_scale, out_zero_point, out_qmin, out_qmax, torch.uint8)
```
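
As an illustration of the arithmetic behind this pattern, here is a hedged pure-Python sketch of the per-tensor quantize/dequantize math on scalars. The real `torch.ops.quantized_decomposed` ops operate on tensors and take explicit qmin/qmax/dtype arguments; the helper names and signatures below are simplified for exposition:

```python
def quantize_per_tensor(x, scale, zero_point, qmin=0, qmax=255):
    # float -> uint8: rescale, round to nearest (Python's round is
    # round-half-to-even, matching PyTorch), shift, clamp to range
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))

def dequantize_per_tensor(q, scale, zero_point):
    # uint8 -> float: subtract zero_point, rescale
    return (q - zero_point) * scale

def quantized_add(qx, qy, x_scale, x_zp, y_scale, y_zp, out_scale, out_zp):
    # The reference pattern above: dq both inputs -> float add -> re-quantize
    x = dequantize_per_tensor(qx, x_scale, x_zp)
    y = dequantize_per_tensor(qy, y_scale, y_zp)
    out = x + y
    return quantize_per_tensor(out, out_scale, out_zp)
```

For example, with all scales at 0.01 and zero points at 0, adding quantized values 50 and 30 (i.e. 0.5 + 0.3) produces the quantized value 80.
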

See https://docs.google.com/document/d/17h-OEtD4o_hoVuPqUFsdm5uo7psiNMY8ThN03F9ZZwg/edit#heading=h.ov8z39149wy8 for some operators that have integer operator representations.

## 3. Lowering to Executorch
TODO: link