To squeeze out a little bit more performance, you can also compile the
prefill with `--compile-prefill`. This will increase compilation times,
though. The `--compile-prefill` option requires `--parallel-prefill`,
neither of which is available for exported DSO and PTE models.
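
For reference, here is a hedged sketch of combining these flags with a `generate.py` invocation like the ones used elsewhere in this document; the base `--compile` flag is an assumption and flag availability may vary by version.

```
# Hedged sketch: assumes --compile is the base compilation flag that
# --compile-prefill and --parallel-prefill build on; check
# python3 generate.py --help for the flags in your version.
python3 generate.py --checkpoint-path ${MODEL_PATH} --compile --compile-prefill --parallel-prefill --prompt "Once upon a time"
```
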
## Eval

For an introduction to the model evaluation tool `eval`, please see the
introductory README.

In addition to running eval on models in eager mode (optionally
compiled with `torch.compile()`), you can also load DSO and PTE models
back into the `generate.py` tool. This will allow you to run any tests
and evaluations that you want to run on the exported models without
requiring changes to your test harnesses and evaluation scripts.
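
For instance, here is a hedged sketch of loading an exported PTE model back through `generate.py` for a quick check, mirroring the ExecuTorch example later in this document; adapt paths and prompts to your own harness.

```
# Hedged sketch: exercise an exported PTE model through generate.py so existing
# tests and evaluations can run against it unchanged.
python3 generate.py --checkpoint-path ${MODEL_PATH} --pte ${MODEL_OUT}/model.pte --prompt "Hello, my name is"
```
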
## Export
### ExecuTorch mobile compilation

We export the model with the `export.py` script. Running this script
requires that you first install ExecuTorch with pybindings; see
[here](#setting-up-executorch-and-runner-et). At present, when
exporting a model, the export command always uses the XNNPACK delegate.
(Future versions of torchchat will support additional delegates such as
Vulkan, CoreML, MPS, and HTP in addition to XNNPACK as they are
released for ExecuTorch.)

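A hedged sketch of such an export invocation, reusing the `${MODEL_PATH}` and `${MODEL_OUT}` variables from the example below; the exact name of the output flag is an assumption and may differ between versions.

```
# Hedged sketch: the output flag name (--output-pte-path) is an assumption;
# check python3 export.py --help for the exact spelling in your version.
python3 export.py --checkpoint-path ${MODEL_PATH} --output-pte-path ${MODEL_OUT}/model.pte
```
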
### Running the model

With the model exported, you can now generate text with the ExecuTorch
runtime pybindings. Feel free to play around with the prompt.

```
python3 generate.py --checkpoint-path ${MODEL_PATH} --pte ${MODEL_OUT}/model.pte --device cpu --prompt "Once upon a time"
```

You can also run the model with the runner-et. See below under
"Standalone Execution".

While we have shown the export and execution of a small model on a
mobile/edge device supported by ExecuTorch, most models need to be
compressed to fit in the target device's memory. We use quantization
to achieve this.

### Visualizing the backend delegate on ExecuTorch export

By default, export will lower to the XNNPACK delegate for improved performance. ExecuTorch export
provides APIs to visualize what happens after the `to_backend()` call in the lowering process
(a usage sketch follows the list below).

- `get_delegation_info()`: provides a summary of the model after the `to_backend()` call, including the total number of delegated subgraphs, the number of delegated nodes, and the number of non-delegated nodes.
- `format_delegated_graph()`: returns a formatted string of the whole graph, as well as the subgraph(s) consumed by the backend.
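
Below is a minimal sketch of how these helpers might be called, not the torchchat export code path itself: the import location is an assumption (these utilities have moved between ExecuTorch releases), and `edge_program` stands in for the edge program manager produced by `to_edge(...).to_backend(...)` during export.

```
# Hedged sketch: the import path and the `edge_program` argument are assumptions;
# adjust for the ExecuTorch release you have installed.
from executorch.exir.backend.utils import format_delegated_graph, get_delegation_info


def summarize_delegation(edge_program) -> None:
    """Print delegation statistics for an edge program after to_backend()."""
    graph_module = edge_program.exported_program().graph_module

    # Summary counts: delegated subgraphs, delegated nodes, non-delegated nodes.
    delegation_info = get_delegation_info(graph_module)
    print(delegation_info.get_summary())

    # Formatted string of the whole graph plus the subgraph(s) owned by the delegate.
    print(format_delegated_graph(graph_module))
```
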
linear operator (asymmetric) with GPTQ | n/a | 4b (group) | n/a |
linear operator (asymmetric) with HQQ | n/a | work in progress | n/a |

## Model precision (dtype precision setting)

On top of quantizing models with the quantization schemes mentioned above, models can be converted to lower-bit floating point precision to reduce the memory bandwidth requirement and take advantage of the higher-density compute available. For example, many GPUs and some CPUs have good support for bfloat16 and float16. This can be taken advantage of via the `--dtype` argument, as shown below.
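
An illustrative, hedged invocation; the accepted value spellings (shown here as bf16) are an assumption, so check the tool's help output for the exact choices.

```
# Hedged sketch: assumed dtype value spelling; see python3 generate.py --help.
python3 generate.py --dtype bf16 --checkpoint-path ${MODEL_PATH} --prompt "Once upon a time"
```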