@@ -238,46 +238,58 @@ export TORCHCHAT_ROOT=${PWD}
./scripts/install_et.sh
```
- ### Test it out using our ExecuTorch runner
+
+ ### Export for mobile
+ Similar to AOTI, to deploy onto a device, we first export the PTE artifact, then load that artifact for inference.
+
+ The following example uses the Llama3 8B Instruct model.
+ ```
+ # Export
+ python3 torchchat.py export llama3 --quantize config/data/mobile.json --output-pte-path llama3.pte
+ ```
+
+ > [!NOTE]
+ > We use `--quantize config/data/mobile.json` to quantize the
+ llama3 model to reduce model size and improve performance for
+ on-device use cases.
+
+ For more details on quantization and what settings to use for your use
+ case, visit our [Quantization documentation](docs/quantization.md).
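+
+ For reference, a mobile config like this is a small JSON file of
+ quantization settings. The sketch below is illustrative only (the exact
+ contents of `config/data/mobile.json` may differ; see the quantization
+ docs for the real schema): it pairs 4-bit grouped embedding
+ quantization with 8-bit activation / 4-bit weight linear quantization.
+ ```json
+ {
+     "embedding": {"bitwidth": 4, "groupsize": 32},
+     "linear:a8w4dq": {"groupsize": 256}
+ }
+ ```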
+
+ ### Deploy and run on Desktop
While ExecuTorch does not focus on desktop inference, it is capable
- of building a runner to do so. This is handy for testing out PTE
+ of doing so. This is handy for testing out PTE
models without sending them to a physical device.
- Build the runner
- ```bash
- scripts/build_native.sh et
- ```
+ Specifically, there are two ways of doing so: pure Python, or via a runner.
+
+ <details>
+ <summary>Deploying via Python</summary>
- Get a PTE file if you don't have one already
```
- python3 torchchat.py export llama3 --quantize config/data/mobile.json --output-pte-path llama3.pte
+ # Execute
+ python3 torchchat.py generate llama3 --device cpu --pte-path llama3.pte --prompt "Hello my name is"
```
- Execute using the runner
- ```bash
- cmake-out/et_run llama3.pte -z `python3 torchchat.py where llama3`/tokenizer.model -i "Once upon a time"
- ```
+ </details>
- ### Export for mobile
- The following example uses the Llama3 8B Instruct model.
+ <details>
+ <summary>Deploying via a Runner</summary>
+
+ Build the runner
+ ```bash
+ scripts/build_native.sh et
```
- # Export
- python3 torchchat.py export llama3 --quantize config/data/mobile.json --output-pte-path llama3.pte
- # Execute
- python3 torchchat.py generate llama3 --device cpu --pte-path llama3.pte --prompt "Hello my name is"
+ Execute using the runner
+ ```bash
+ cmake-out/et_run llama3.pte -z `python3 torchchat.py where llama3`/tokenizer.model -i "Once upon a time"
```
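+
+ The same runner invocation, unpacked as a sketch for clarity:
+ `python3 torchchat.py where llama3` prints the directory the model was
+ downloaded to, `-z` passes the tokenizer path, and `-i` the prompt.
+ ```bash
+ # Resolve the model's download directory, then run the exported PTE
+ MODEL_DIR="$(python3 torchchat.py where llama3)"
+ cmake-out/et_run llama3.pte -z "${MODEL_DIR}/tokenizer.model" -i "Once upon a time"
+ ```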
- > [!NOTE]
- > We use `--quantize config/data/mobile.json` to quantize the
- llama3 model to reduce model size and improve performance for
- on-device use cases.
+ </details>
- For more details on quantization and what settings to use for your use
- case visit our [Quantization documentation](docs/quantization.md) or
- run `python3 torchchat.py export`
[end default]: end