Commit bd1d08c

XNNPACK Delegate Overview

mcr229 authored and facebook-github-bot committed

Differential Revision: D49945478
fbshipit-source-id: 620f6749c55443b01025c748ccdea17ee4647f17

1 parent 94119f6 commit bd1d08c
4 files changed: +141 −0 lines changed
docs/source/index.rst (8 additions, 0 deletions)
@@ -156,6 +156,14 @@ Topics in this section will help you get started with ExecuTorch.
    kernel-library-overview
    kernel-library-custom-aten-kernel
 
+.. toctree::
+   :glob:
+   :maxdepth: 1
+   :caption: Native Delegates
+   :hidden:
+
+   native-delegates-executorch-xnnpack-delegate
+
 .. toctree::
    :glob:
    :maxdepth: 1
New documentation file (133 additions, 0 deletions):
# ExecuTorch XNNPACK Delegate

This is a high-level overview of the ExecuTorch XNNPACK backend delegate. This high-performance delegate is aimed at reducing CPU inference latency for ExecuTorch models. We will provide a brief introduction to the XNNPACK library and explore the delegate's overall architecture and intended use cases.

::::{note}
The XNNPACK Delegate is currently under active development and may change in the future.
::::

### What is XNNPACK?
XNNPACK is a library of highly optimized neural network operators for ARM, x86, and WebAssembly architectures in Android, iOS, Windows, Linux, and macOS environments. It is an open-source project; you can find more information about it on [GitHub](https://github.com/google/XNNPACK).

### What are Delegates?
A delegate is an entry point for backends to process and execute ExecuTorch programs. The XNNPACK Delegate is one of many delegates available in ExecuTorch. It leverages the third-party XNNPACK library to accelerate PyTorch programs efficiently across a variety of CPUs. More information on delegates and on developing your own is available [here](compiler-delegate-and-partitioner.md).

It is recommended that you get familiar with the content of the "Backend and Delegate" page before continuing on to the Architecture section.

## Architecture
![](./xnnpack-delegate-architecture.png)

### Ahead-of-time
![](./xnnpack-et-flow-diagram.png)
In the ExecuTorch export flow, lowering to the XNNPACK Delegate happens at the `to_backend()` stage. In this stage, the model is partitioned by the `XnnpackPartitioner`, and the partitions are then serialized via flatbuffer. The serialized flatbuffer is later deserialized and executed by the XNNPACK backend at runtime. A minimal sketch of this lowering flow is shown below.

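The following is a rough sketch rather than the canonical flow: it assumes the `torch.export` and `executorch.exir.to_edge` entry points and the `XnnpackPartitioner` import path, which may differ between ExecuTorch versions; `MyModel` and the example input shape are placeholders.

```python
import torch
from torch.export import export
from executorch.exir import to_edge
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

model = MyModel().eval()                         # placeholder eager-mode model
example_inputs = (torch.randn(1, 3, 224, 224),)  # placeholder example inputs

exported = export(model, example_inputs)         # capture the program
edge = to_edge(exported)                         # convert to the Edge dialect
edge = edge.to_backend(XnnpackPartitioner())     # partition and lower XNNPACK-supported subgraphs

exec_prog = edge.to_executorch()                 # emit the final ExecuTorch program
with open("model_xnnpack.pte", "wb") as f:
    f.write(exec_prog.buffer)                    # serialized program, including the XNNPACK blobs
```

Nodes that the partitioner does not claim remain in the default ExecuTorch flow, so models that are only partially supported by XNNPACK can still be exported.
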
#### Partitioner
The partitioner is implemented by backend delegates to mark nodes suitable for lowering. The `XnnpackPartitioner` lowers using both node targets and module metadata. Some more references for partitioners can be found [here](compiler-delegate-and-partitioner.md).

##### Module-based partitioning

`source_fn` is embedded in the node's metadata and gives information on where these nodes come from. For example, modules like `torch.nn.Linear`, when captured and exported with `to_edge`, generate groups of nodes for their computation. The group of nodes associated with computing the linear module then has a `source_fn` of `torch.nn.Linear`. Partitioning based on `source_fn` allows us to identify groups of nodes which are lowerable via XNNPACK.

For example, after capturing `torch.nn.Linear`, you would find the following key in the metadata for the addmm node associated with linear:
```
'source_fn': ('fn', <class 'torch.nn.modules.linear.Linear'>)
```

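As an illustration (an assumption about how to inspect this metadata, not part of the official flow), you can dump it yourself from a captured program; `exported_program` here is assumed to be the output of `torch.export.export`:

```python
# a minimal sketch: print the source_fn metadata recorded on each node of a captured graph
for node in exported_program.graph_module.graph.nodes:
    source_fn = node.meta.get("source_fn")  # typically absent on placeholder/output nodes
    if source_fn is not None:
        print(f"{node.name}: {source_fn}")
```
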
##### Op-based partitioning

The `XnnpackPartitioner` also partitions using op targets. It traverses the graph and identifies individual nodes which are lowerable to XNNPACK. A drawback of module-based partitioning is that operators which come from decompositions may be skipped. For example, an operator like `torch.nn.Hardsigmoid` is decomposed into adds, muls, divs, and clamps. While hardsigmoid itself is not lowerable, we can lower the decomposed ops. Relying on `source_fn` metadata would skip these lowerable ops because they belong to a non-lowerable module, so in order to improve model performance, we greedily lower operators based on op targets as well as `source_fn`.

#### Serialization
After partitioning the lowerable subgraphs from the model, the XNNPACK Delegate pre-processes these subgraphs and serializes them via flatbuffer for the XNNPACK backend.

##### Passes

Before any serialization, we apply passes on the subgraphs to prepare the graph. These passes perform a variety of functions, but overall they help to improve the performance of the delegate. We give an overview of a few of the passes and their functions below; for all passes and their functions, see [here](https://github.com/pytorch/executorch/tree/main/backends/xnnpack/passes):

* Channels Last Reshape
    * Minimizes the number of permutation operators inserted to correctly manage memory format
* Conv1d to Conv2d
    * Allows us to delegate Conv1d nodes by transforming them to Conv2d (see the sketch after this list)
* Conv and BN Fusion
    * Fuses batch norm operations with the preceding convolution node

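To make the Conv1d-to-Conv2d idea concrete, here is a small standalone sketch of the underlying equivalence (this only illustrates the math the pass relies on; it is not the pass implementation itself):

```python
import torch

conv1d = torch.nn.Conv1d(8, 16, kernel_size=3)
conv2d = torch.nn.Conv2d(8, 16, kernel_size=(3, 1))

# reuse the Conv1d parameters: weight (out, in, k) becomes (out, in, k, 1)
with torch.no_grad():
    conv2d.weight.copy_(conv1d.weight.unsqueeze(-1))
    conv2d.bias.copy_(conv1d.bias)

x = torch.randn(1, 8, 32)                 # (N, C, L)
y1 = conv1d(x)                            # (N, 16, 30)
y2 = conv2d(x.unsqueeze(-1)).squeeze(-1)  # add, then remove, a dummy width dimension
assert torch.allclose(y1, y2, atol=1e-6)
```
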
##### Serialization Schema

The XNNPACK Delegate uses flatbuffer for serialization. In order to improve runtime performance, the XNNPACK Delegate's flatbuffer [schema](https://github.com/pytorch/executorch/blob/main/backends/xnnpack/serialization/schema.fbs) mirrors the XNNPACK library's graph-level API calls. The serialized data are arguments to XNNPACK's APIs, so that at runtime the XNNPACK execution graph can be created efficiently with successive calls to XNNPACK's APIs.

### Runtime
The XNNPACK backend's runtime interfaces with the ExecuTorch runtime through the custom `init` and `execute` functions. When the model is initialized, ExecuTorch calls `init` on all of the serialized XNNPACK blobs. Afterwards, when the model is executed, the subgraphs are executed via the backend through the custom `execute` function. To read more about how delegate runtimes interface with ExecuTorch, take a look at this [resource](compiler-delegate-and-partitioner.md).

#### XNNPACK Library
The XNNPACK library currently used by the delegate is pinned to the following [version](https://github.com/google/XNNPACK/tree/51a987591a6fc9f0fc0707077f53d763ac132cbf). The XNNPACK Delegate supports multiple platforms and CPU architectures; more information on the supported hardware can be found in the XNNPACK library's [README](https://github.com/google/XNNPACK).

#### Init
When calling the XNNPACK Delegate's `init`, we deserialize the preprocessed blobs via flatbuffer. We define the nodes (operators) and edges (intermediate tensors) to build the XNNPACK execution graph using the information we serialized ahead-of-time. As mentioned earlier, the majority of the processing has been done ahead-of-time, so that at runtime we can just call the XNNPACK APIs with the serialized arguments in succession. Additionally, while we define the static data like weights and biases in the XNNPACK graph, XNNPACK packs this data to prepare it for efficient execution. After creating the execution graph, we create the runtime object and pass it on to `execute`.

The preprocessed XNNPACK blob is a freeable buffer, which means that after `init` is finished, the blob is freed to decrease memory usage.

#### Execute
When executing the XNNPACK subgraphs, we prepare the tensor inputs and outputs and feed them to the XNNPACK runtime graph. After executing the runtime graph, the output pointers are filled with the computed tensors.

#### Profiling
We have enabled basic profiling for the XNNPACK delegate, which can be turned on with the compiler flag `-DENABLE_XNNPACK_PROFILING`. After running the model, it will produce basic per-op and total timings. We provide an example of this profiling output below. The timings listed are the average across runs, and the units are microseconds.

```
Fully Connected (NC, F32) GEMM: 109.510002
Total Time: 109.510002
```

## Quantization
The XNNPACK Delegate is a backend for executing symmetrically quantized models. We can lower models quantized using the `XNNPACKQuantizer`. Quantizers are backend specific, which means the `XNNPACKQuantizer` is configured to quantize models so that they can leverage the quantized operators offered by the XNNPACK library. We will not go over the details of how to implement your own custom quantizer; you can follow the docs [here](https://pytorch.org/tutorials/prototype/pt2e_quantizer.html) to do so. Instead, we provide a brief overview of how to quantize a model to leverage the quantized execution of the XNNPACK Delegate.

### Configuring the XNNPACKQuantizer

```python
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

quantizer = XNNPACKQuantizer()
quantizer.set_global(get_symmetric_quantization_config())
```
Here we initialize the `XNNPACKQuantizer` and set the quantization config to be symmetrically quantized. Symmetric quantization means that weights are symmetrically quantized with `qmin = -127` and `qmax = 127`, which forces the quantization zero points to be zero. `get_symmetric_quantization_config()` can be configured with the following arguments (see the example after this list):
* `is_per_channel`
    * Weights are quantized across channels
* `is_qat`
    * Quantization-aware training
* `is_dynamic`
    * Dynamic quantization

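For example, assuming these flags map directly onto keyword arguments of `get_symmetric_quantization_config()`, a per-channel, dynamically quantized configuration could be selected like this:

```python
# a minimal sketch: pick a per-channel, dynamic quantization config
per_channel_dynamic_config = get_symmetric_quantization_config(
    is_per_channel=True,
    is_dynamic=True,
)
quantizer.set_global(per_channel_dynamic_config)
```
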
We can then further configure the `XNNPACKQuantizer` as we wish. We set the following configs below as an example (`qconfig_opt` stands for any optional quantization config):
```python
quantizer.set_global(qconfig_opt)  # qconfig_opt is an optional quantization config
quantizer.set_object_type(torch.nn.Conv2d, qconfig_opt)  # can be a module type
quantizer.set_object_type(torch.nn.functional.linear, qconfig_opt)  # or a torch functional op
quantizer.set_module_name("foo.bar", qconfig_opt)  # or a fully qualified module name
```

### Quantizing your model with the XNNPACKQuantizer
After configuring our quantizer, we are now ready to quantize our model:
```python
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import prepare_pt2e

exported_model = capture_pre_autograd_graph(model_to_quantize, example_inputs)
prepared_model = prepare_pt2e(exported_model, quantizer)
print(prepared_model.graph)
```
Prepare performs some Conv2d-BN fusion and inserts quantization observers in the appropriate places. For post-training quantization, we generally calibrate our model after this step: we run sample examples through the `prepared_model` to observe the statistics of the tensors and to calculate the quantization parameters.

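As a minimal illustration, a calibration loop could look like the following; `calibration_inputs` is a hypothetical iterable of representative input tuples:

```python
# run representative inputs through the observed model so the observers can
# record the tensor statistics used to compute quantization parameters
for sample_inputs in calibration_inputs:
    prepared_model(*sample_inputs)
```
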
Finally, we convert our model here:
```python
from torch.ao.quantization.quantize_pt2e import convert_pt2e

quantized_model = convert_pt2e(prepared_model)
print(quantized_model)
```
You will now see the Q/DQ representation of the model, which means that `torch.ops.quantized_decomposed.dequantize_per_tensor` nodes are inserted at quantized operator inputs and `torch.ops.quantized_decomposed.quantize_per_tensor` nodes are inserted at operator outputs. [Example](https://github.com/pytorch/pytorch/blob/main/torch/ao/quantization/pt2e/representation/rewrite.py#L40)

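To actually run the quantized model through the XNNPACK Delegate, it would then be lowered with the same flow sketched in the Ahead-of-time section; a rough recap, under the same API assumptions as that sketch:

```python
from torch.export import export
from executorch.exir import to_edge
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

# re-export the quantized module, then hand the XNNPACK-supported partitions to the delegate
edge = to_edge(export(quantized_model, example_inputs))
exec_prog = edge.to_backend(XnnpackPartitioner()).to_executorch()
```
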
## See Also
- Lowering to XNNPACK Tutorial (TBD)
- [Integrating XNNPACK Delegate Android App](https://github.com/pytorch/executorch/blob/main/examples/demo-apps/android/ExecuTorchDemo/README.md)
Two binary image files added (97.7 KB and 9.26 KB).