*Key:* ✅ works correctly; 🚧 work in progress; ❌ not supported; ❹ requires 4bit groupwise quantization; 📵 not on mobile (may fit some high-end devices such as tablets);
## Get Started
Torchchat lets you access LLMs through an interactive interface, prompted single-use generation, model export (for use by AOT Inductor and ExecuTorch), and standalone C++ runtimes.
| Function | Torchchat Command | Direct Command | Tested |
|---|----|----|-----|
| Mobile C++ runtime | n/a | app + AOTI | 🚧 |
**Getting help:** Each command implements the --help option to give additional information about available options:
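For instance, to see the options accepted by the generation entry point (using the `generate.py` command that appears later in this README):

```
python3 generate.py --help
```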
tests against the exported model with the same interface, and support additional experiments to confirm model quality and speed.
```
python3 generate.py --device [ cuda | cpu ] --dso-path ${MODEL_NAME}.so --prompt "Once upon a time"
```
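For example, with a model exported to a hypothetical `stories15M.so`, a single-use generation on CUDA would be:

```
python3 generate.py --device cuda --dso-path stories15M.so --prompt "Once upon a time"
```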
linear operator (asymmetric) with GPTQ | n/a | 4b (group) | n/a |
linear operator (asymmetric) with HQQ | n/a | work in progress | n/a |
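These rows come from a larger table of quantization schemes. As a sketch of how such a scheme might be requested at generation time (the `--quantize` flag and the config keys shown here are assumptions, not confirmed by this section; check the quantization documentation for the exact spellings):

```
python3 generate.py --device cuda --quantize '{"linear:int4": {"groupsize": 256}}' --prompt "Once upon a time"
```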
## Model precision (dtype precision setting)
On top of quantizing models with the quantization schemes mentioned above, models can be converted to lower precision floating point representations to reduce the memory bandwidth requirement and take advantage of higher density compute. For example, many GPUs and some CPUs have good support for bfloat16 and float16. You can take advantage of this via the `--dtype` argument, as shown below.
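For instance, a generation run in bfloat16 might look like the following (a sketch: the `bf16` spelling is an assumption and the model-loading flags are omitted; run `--help` for the exact choices):

```
python3 generate.py --dtype bf16 --device cuda --prompt "Once upon a time"
```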