
Commit d125a06

lucylq authored and malfet committed
add delegation visualization instructions (#551)
1 parent aee04d7 commit d125a06

File tree

2 files changed (+65, -58 lines)


docs/ADVANCED-USERS.md

Lines changed: 58 additions & 58 deletions
@@ -70,7 +70,7 @@ Generate text | `torchchat --generate` |`generate` | ✅ |
 Evaluate model | `torchchat --eval` | `eval` | 🚧 |
 Export model | `torchchat --export` | `export` | ✅ |
 Exported model test (dso,pte) | `torchchat --chat` | n/a | 🚧 |
-exported model test (dso,pte) | `torchchat --generate` |`generate` | ✅ |
+Exported model test (dso,pte) | `torchchat --generate` |`generate` | ✅ |
 Evaluate exported model (dso,pte) | `torchchat --eval` | `eval` | 🚧 |
 Server C++ runtime | n/a | run.cpp model.so | ✅ |
 Server C++ runtime | n/a | run.cpp model.pte | ✅ |
@@ -97,15 +97,15 @@ ExecuTorch-compiled .pte files on iOS, Android and Raspberry Pi 5.
 You can download any LLM model that fits the model.py model
 architecture, provided you have the model weights in llama-format, the
 model parameters and the tokenizer model used by your language model.
-For models not specified not in the list of known configurations, you
+For model parameters not specified in the list of known configurations, you
 can construct the model by initializing the `ModelArgs` dataclass that
 controls model construction from a parameter json using the
 `params-path ${PARAMS_PATH}` containing the appropriate model
-parameters to initialize the ModelArgs for the model. (We use the
+parameters to initialize the `ModelArgs` for the model. (We use the
 model constructor `Transformer.from_params()`).
 
-The parameter file will should be in JSON format specifying thee
-parameters. You can find the Model Args data class in
+The parameter file should be in JSON format specifying these
+parameters. You can find the `ModelArgs` data class in
 [`model.py`](https://github.com/pytorch/torchchat/blob/main/model.py#L22).
 
 The final way to initialize a torchchat model is from GGUF. You load a
@@ -115,10 +115,6 @@ native torchchat models.
 
 You may also dequantize GGUF models with the GGUF quantize tool, and
 then load and requantize with torchchat native quantization options.
-(Please note that quantizing and dequantizing is a lossy process, and
-you will get the best results by starting with the original
-unquantized model checkpoint, not a previsouly quantized and thend
-equantized model.)
 
 | GGUF Model | Tested | Eager | torch.compile | AOT Inductor | ExecuTorch | Fits on Mobile |
 |-----|--------|-------|-----|-----|-----|-----|
@@ -129,9 +125,8 @@ then load and requantize with torchchat native quantization options.
 
 **Please note that quantizing and dequantizing is a lossy process, and
 you will get the best results by starting with the original
-unquantized model checkpoint, not a previsoul;y quantized and thend
-equantized model.**
-
+unquantized model checkpoint, not a previously quantized and then
+dequantized model.**
 
 ## Chat
 
@@ -150,23 +145,23 @@ preparatory step:
 `checkpoints/${MODEL_NAME}` or any other directory you already use
 to store model information.
 
-* `MODEL_PATH` describes the location of the model. Throughput the
-description herein, we will assume that MODEL_PATH starts with a
+* `MODEL_PATH` describes the location of the model. Throughout the
+description herein, we will assume that `MODEL_PATH` starts with a
 subdirectory of the torchchat repo named checkpoints, and that it
-will contain the actual model. In this case, the MODEL_PATH will
-thus be of the form ${MODEL_OUT}/model.{pt,pth}. (Both the
+will contain the actual model. In this case, the `MODEL_PATH` will
+thus be of the form `${MODEL_OUT}/model.{pt,pth}`. (Both the
 extensions `pt` and `pth` are used to describe checkpoints. In
 addition, model may be replaced with the name of the model.)
 
-The generate.py sequence generator will load the tokenizer from the
-directory specified by the MODEL_PATH variable, by replacing the
-modelname with the name of the tokenizer model which is expected to
+The `generate.py` sequence generator will load the tokenizer from the
+directory specified by the `MODEL_PATH` variable, by replacing the
+model name with the name of the tokenizer model which is expected to
 be named `tokenizer.model`.
 
 * `MODEL_OUT` is a location for outputs from export for server/desktop
 and/or mobile/edge execution. We store exported artifacts here,
-with extensions .pte for Executorch models, .so for AOT Inductor
-generated models, and .bin for tokenizers prepared for use with the
+with extensions `.pte` for Executorch models, `.so` for AOT Inductor
+generated models, and `.bin` for tokenizers prepared for use with the
 C++ tokenizers user by `runner-aoti` and `runner-et`.
 
 You can set these variables as follows for the exemplary model15M
@@ -180,10 +175,10 @@ MODEL_OUT=~/torchchat-exports
 ```
 
 When we export models with AOT Inductor for servers and desktops, and
-Executorch for mobile and edge devices, we will save them in the
+ExecuTorch for mobile and edge devices, we will save them in the
 specified directory (`${MODEL_OUT}` in our example below) as a DSO
 under the name `${MODEL_NAME}.so` (for AOTI-generated dynamic
-libraries), or as Executorch model under the name `${MODEL_NAME}.pte`
+libraries), or as ExecuTorch model under the name `${MODEL_NAME}.pte`
 (for Executorch-generated mobile/edge models).
 
 We use `[ optional input ]` to indicate optional inputs, and `[ choice
@@ -192,7 +187,7 @@ We use `[ optional input ]` to indicate optional inputs, and `[ choice
 
 ## Generate
 
-Model definition in model.py, generation code in generate.py. The
+Model definition in `model.py`, generation code in `generate.py`. The
 model checkpoint may have extensions `pth` (checkpoint and model
 definition) or `pt` (model checkpoint). At present, we always use the
 torchchat model for export and import the checkpoint into this model
@@ -204,21 +199,21 @@ python3 generate.py --compile --checkpoint-path ${MODEL_PATH} --prompt "Hello, m
 ```
 
 To squeeze out a little bit more performance, you can also compile the
-prefill with --compile_prefill. This will increase compilation times
-though. The --compile-prefill option requires --parallel-prefill,
+prefill with `--compile_prefill`. This will increase compilation times
+though. The `--compile-prefill` option requires `--parallel-prefill`,
 which are not available for exported DSO and PTE models.
 
 
 ## Eval
 
-For an introduction to the model evaluation tool `eval`, please see the introductury
+For an introduction to the model evaluation tool `eval`, please see the introductory
 README.
 
 In addition to running eval on models in eager mode (optionally
-compiled with `torch.compile()`, you can also load dso and pte models
+compiled with `torch.compile()`), you can also load dso and pte models
 back into the generate.py tool. This will allow you to run any tests
 and evaluations that you want to run on the exported models without
-requiring changes to your test harnesses and evaluation scripts,
+requiring changes to your test harnesses and evaluation scripts.
 
 
 ## Export
@@ -254,31 +249,42 @@ quantization to achieve this, as described below.
 
 ### ExecuTorch mobile compilation
 
-We export the model with the export.py script. Running this script
+We export the model with the `export.py` script. Running this script
 requires you first install executorch with pybindings, see
-[here](#setting-up-executorch-and-runner-et).  At present, when
-exporting a model, the export command always uses the xnnpack delegate
-to export.  (Future versions of torchchat will support additional
-delegates such as Vulkan, CoreML, MPS, HTP in addition to Xnnpack as
-they are released for Executorch.)
+[here](#setting-up-executorch-and-runner-et). At present, when
+exporting a model, the export command always uses the XNNPACK delegate
+to export. (Future versions of torchchat will support additional
+delegates such as Vulkan, CoreML, MPS, HTP in addition to XNNPACK as
+they are released for ExecuTorch.)
 
 ### Running the model
 
 With the model exported, you can now generate text with the executorch
-runtime pybindings.  Feel free to play around with the prompt.
+runtime pybindings. Feel free to play around with the prompt.
 
 ```
 python3 generate.py --checkpoint-path ${MODEL_PATH} --pte ${MODEL_OUT}/model.pte --device cpu --prompt "Once upon a time"
 ```
 
-You can also run the model with the runner-et.  See below under
+You can also run the model with the runner-et. See below under
 "Standalone Execution".
 
 While we have shown the export and execution of a small model to a
-mobile/edge device supported by Executorch, most models need to be
+mobile/edge device supported by ExecuTorch, most models need to be
 compressed to fit in the target device's memory. We use quantization
 to achieve this.
 
+### Visualizing the backend delegate on ExecuTorch export
+
+By default, export will lower to the XNNPACK delegate for improved performance. ExecuTorch export
+provides APIs to visualize what happens after the `to_backend()` call in the lowering process.
+
+- `get_delegation_info()`: provide a summary of the model after the `to_backend()` call, including the total delegated subgraphs, number of delegated nodes and number of non-delegated nodes.
+- `format_delegated_graph`: a formatted str of the whole graph, as well as the subgraph/s consumed by the backend.
+
+See the
+[debug backend delegate documentation](https://pytorch.org/executorch/main/debug-backend-delegate.html)
+for more details.
 
 
 ## Optimizing your model for server, desktop and mobile devices
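
The two APIs listed in the added section above are called on the lowered program right after `to_backend()`. Below is a minimal sketch (not part of the diff) of how they might be wired up; it assumes `edge_manager` is the `EdgeProgramManager` produced by torchchat's export flow, as in the `export_et.py` change further down in this commit.

```python
# Minimal sketch, not part of the diff: inspect XNNPACK delegation after to_backend().
# `edge_manager` is assumed to be the EdgeProgramManager built during export
# (see the commented-out lines added to export_et.py in this commit).
from executorch.exir.backend.utils import format_delegated_graph, get_delegation_info


def show_delegation(edge_manager) -> None:
    graph_module = edge_manager.exported_program().graph_module
    delegation_info = get_delegation_info(graph_module)
    # Summary: total delegated subgraphs, delegated vs. non-delegated node counts.
    print(delegation_info.get_summary())
    # Full graph as a formatted string, including the subgraph(s) consumed by the backend.
    print(format_delegated_graph(graph_module))
```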
@@ -295,7 +301,7 @@ To compress models, torchchat offers a variety of strategies:
 * dynamic activation quantization with weight quantization: a8w4dq
 
 In addition, we support GPTQ and HQQ for improving the quality of 4b
-weight-only quantization.  Support for HQQ is a work in progress.
+weight-only quantization. Support for HQQ is a work in progress.
 
 | compression | FP precision | weight quantization | dynamic activation quantization |
 |--|--|--|--|
@@ -305,11 +311,8 @@ linear operator (asymmetric) | n/a | 4b (group), a6w4dq | a8w4dq (group) |
 linear operator (asymmetric) with GPTQ | n/a | 4b (group) | n/a |
 linear operator (asymmetric) with HQQ | n/a | work in progress | n/a |
 
-
 ## Model precision (dtype precision setting)
-
-You can generate models (for both export and generate, with eager, torch.compile, AOTI, ET, for all backends - mobile at present will primarily support fp32, with all options)
-specify the precision of the model with
+On top of quantizing models with quantization schemes mentioned above, models can be converted to lower bit floating point precision to reduce the memory bandwidth requirement and take advantage of higher density compute available. For example, many GPUs and some of the CPUs have good support for bfloat16 and float16. This can be taken advantage of via `--dtype arg` as shown below.
 
 ```
 python3 generate.py --dtype [bf16 | fp16 | fp32] ...
@@ -327,7 +330,7 @@ quantization is available in eager mode as well as during export,
 enabling you to do an early exploration of your quantization setttings
 in eager mode. However, final accuracy should always be confirmed on
 the actual execution target, since all targets have different build
-processes, compilers, amd kernel implementations with potentially
+processes, compilers, and kernel implementations with potentially
 significant impact on accuracy.
 
 
@@ -341,11 +344,11 @@ into native torchchat models by using the load-gguf option:
 python3 [ export.py | generate.py | ... ] --gguf-path <gguf_filename>
 ```
 
-Ypu may then apply the standard quantization options, e.g., to add
+You may then apply the standard quantization options, e.g., to add
 embedding table quantization as described under quantization. (You
-cannot directly requantize already quantized formats.  However, you
+cannot directly requantize already quantized formats. However, you
 may dequantize them using GGUF tools, and then laod the model into
-torchchat to quantize wqith torchchat's quantization workflow.)
+torchchat to quantize with torchchat's quantization workflow.)
 
 
 ## Loading unsupported GGUF formats in torchchat
@@ -375,23 +378,22 @@ ${GGUF}/quantize --allow-requantize your_quantized_model.gguf fake_unquantized_m
 
 ## Mobile and Edge Execution Test (x86)
 
-You can also run the model with the runner-et. This requires you
-first build the runner.  See instructions
-[here](#setting-up-executorch-and-runner-et).  After this is done, you
-can run runner-et with
+You can also run the model with the `runner-et`. This requires you
+first build the runner. See instructions
+[here](#setting-up-executorch-and-runner-et). After this is done, you
+can run `runner-et` with
 
 ```
 ./build/cmake-out/runner_et ${MODEL_OUT}/model.pte -z ${MODEL_OUT}/tokenizer.bin -i "Once upon a time in a land far away"
 ```
 
 While we have shown the export and execution of a small model to a
-mobile/edge device supported by Executorch, most models need to be
+mobile/edge device supported by ExecuTorch, most models need to be
 compressed to fit in the target device's memory. We use quantization
 to achieve this.
 
-
 This has been shown to run on x86. with the proper IDE environment,
-you can compile for your specific target.  For a GUI integration in
+you can compile for your specific target. For a GUI integration in
 iOS and Android, please refer to "Running on a mobile/edge system" in
 the section below.
 
@@ -426,12 +428,11 @@ simulator for Android. `scripts/android_example.sh` for running a
 model on an Android simulator (on Mac), and in `docs/Android.md`.
 
 
-
 ### iOS
 
 Open the iOS Llama Xcode project at
 https://github.com/pytorch/executorch/tree/main/examples/demo-apps/apple_ios/LLaMA/LLaMA.xcodeproj
-in Xcode and click Run.  You will need to provide a provisioning
+in Xcode and click Run. You will need to provide a provisioning
 profile (similar to what's expected for any iOS dev).
 
 Once you can run the app on you device,
@@ -516,7 +517,6 @@ in a python-free environment with AOT Inductor and ExecuTorch.
 | ARM 32b (up to v7) | any | | ? | ? | ? | ? |
 
 
-
 # Setting up ExecuTorch and runner-et
 
 Set up ExecuTorch by following the instructions [here](https://pytorch.org/executorch/stable/getting-started-setup.html#setting-up-executorch).
@@ -538,7 +538,7 @@ cmake -S ./runner-et -B build/cmake-out -G Ninja
 cmake --build ./build/cmake-out
 ```
 
-The built executable is located at ./build/cmake-out/runner-et.
+The built executable is located at `./build/cmake-out/runner-et`.
 
 
 # Contributing to torchchat

export_et.py

Lines changed: 7 additions & 0 deletions
@@ -95,6 +95,13 @@ def export_model(model, device, output_path, args=None) -> str: # noqa: C901
         edge_compile_config=edge_config,
     )
     edge_manager = edge_manager.to_backend(XnnpackDynamicallyQuantizedPartitioner())
+    # Delegation visualization APIs: https://pytorch.org/executorch/main/debug-backend-delegate.html
+    # from executorch.exir.backend.utils import get_delegation_info, format_delegated_graph
+    # from tabulate import tabulate
+    # graph_module = edge_manager.exported_program().graph_module
+    # delegation_info = get_delegation_info(graph_module)
+    # print(delegation_info.get_summary())
+    # print(format_delegated_graph(graph_module))
     export_program = edge_manager.to_executorch(
         ExecutorchBackendConfig(
             extract_constant_segment=True,
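
The commented-out block above also imports `tabulate`, which the linked debug-backend-delegate page uses to print a per-operator delegation table. A hedged sketch of that extra step is shown below; the `get_operator_delegation_dataframe()` helper comes from that documentation page rather than from this diff.

```python
# Hedged sketch based on the debug-backend-delegate documentation linked above:
# print a per-operator delegation breakdown. get_operator_delegation_dataframe()
# is taken from that page, not from this commit's diff.
from executorch.exir.backend.utils import get_delegation_info
from tabulate import tabulate


def show_operator_delegation(graph_module) -> None:
    delegation_info = get_delegation_info(graph_module)
    df = delegation_info.get_operator_delegation_dataframe()
    print(tabulate(df, headers="keys", tablefmt="fancy_grid"))
```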
