See the [documentation on GGUF](docs/GGUF.md) to learn how to use GGUF files.
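
As a quick illustration, a GGUF checkpoint can be passed directly on the command line (a hedged sketch; the path below is a placeholder, and docs/GGUF.md lists the exact flags your version supports):

```bash
# Illustrative only: point --gguf-path at a real GGUF checkpoint.
python3 torchchat.py generate --gguf-path path/to/model.gguf --prompt "Hello, my name is"
```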
## Running via PyTorch / Python
### Chat
Designed for interactive and conversational use.
In chat mode, the LLM engages in a back-and-forth dialogue with the user. It responds to queries, participates in discussions, provides explanations, and can adapt to the flow of conversation.
**Examples**
```bash
python3 torchchat.py chat llama3
```
For more information run `python3 torchchat.py chat --help`
### Generate
Aimed at producing content based on specific prompts or instructions.
In generate mode, the LLM focuses on creating text based on a detailed prompt or instruction. This mode is often used for generating written content like articles, stories, reports, or even creative writing like poetry.
**Examples**
```bash
python3 torchchat.py generate llama3
```
For more information run `python3 torchchat.py generate --help`
### Browser
Designed for interactive graphical conversations in a familiar web-browser GUI. The browser command provides a GUI-based experience in which the LLM engages in a back-and-forth dialogue with the user: it responds to queries, participates in discussions, provides explanations, and can adapt to the flow of conversation.
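
A typical invocation mirrors the chat and generate commands (a hedged sketch; run `python3 torchchat.py browser --help` to confirm the options your version supports):

```bash
python3 torchchat.py browser llama3
```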
## Quantizing your model (suggested for mobile)
Quantization is the process of converting a model into a more memory-efficient representation. It is particularly important for accelerators, where it both exploits the available memory bandwidth and lets models fit into often limited high-speed memory, and for mobile devices, where memory is typically very constrained.
With quantization, 32-bit floating-point numbers can be represented with as few as 8 or even 4 bits, plus a scale shared by a group of these weights. This transformation is lossy and modifies the behavior of models. While research continues into how to efficiently quantize large language models for use on mobile devices, this transformation invariably results in both quality loss and reduced control over the models' output, leading to an increased risk of undesirable responses, hallucinations, and stuttering.
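
The idea can be shown in a minimal, self-contained sketch. This is illustrative only, not torchchat's actual quantization code: each group of weights shares a single scale, and individual values are rounded to small signed integers.

```python
import torch

def quantize_groupwise(w: torch.Tensor, group_size: int = 32, bits: int = 4):
    """Quantize a 1-D float tensor to signed integers with one scale per group."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 7 for 4-bit signed values
    groups = w.reshape(-1, group_size)         # assumes len(w) % group_size == 0
    scales = groups.abs().amax(dim=1, keepdim=True) / qmax
    q = torch.clamp(torch.round(groups / scales), -qmax - 1, qmax).to(torch.int8)
    return q, scales

def dequantize_groupwise(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    """Reconstruct approximate float weights from integers and per-group scales."""
    return (q.float() * scales).reshape(-1)

w = torch.randn(128)
q, scales = quantize_groupwise(w)
w_hat = dequantize_groupwise(q, scales)
print("max abs error:", (w - w_hat).abs().max().item())  # nonzero: the transform is lossy
```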
In effect, a developer quantizing a model has considerable control over the process, and a corresponding responsibility to quantify and mitigate these effects.
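
In torchchat, quantization is applied with the `--quantize` option, which takes a JSON configuration. A hedged example (the scheme name and group size are illustrative; run `python3 torchchat.py generate --help` to see the schemes your version supports):

```bash
python3 torchchat.py generate llama3 --quantize '{"linear:int4": {"groupsize": 256}}' --prompt "Hello, my name is"
```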
## Exporting your model
Compiles a model and saves it to run later.
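
For example (a hedged sketch; the flag and output path are illustrative, see `python3 torchchat.py export --help`):

```bash
# Export the model to a shared library that can be loaded and run later.
python3 torchchat.py export llama3 --output-dso-path exportedModels/llama3.so
```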
Run a chatbot in your browser that’s supported by the model you specify in the command.

*Running on http://127.0.0.1:5000* should be printed to the terminal. Click the link or go to [http://127.0.0.1:5000](http://127.0.0.1:5000) in your browser to start interacting with it. If port 5000 is already in use, run the command again with `--port`, e.g. `--port 5001`.
Enter some text in the input box, then hit the enter key or click the “SEND” button. After a second or two, the text you entered together with the generated text will be displayed. Repeat to have a conversation.