Commit f00a416

metascroy authored and malfet committed
clean up runner code a little (#532)
* clean up runner code a little
* update
* update
* pull out generate loop in chat
* updates
* edit docs
* typo
1 parent 1695a80 commit f00a416

File tree

4 files changed: +221 -145 lines changed


README.md

Lines changed: 2 additions & 2 deletions

````diff
@@ -173,7 +173,7 @@ scripts/build_native.sh aoti
 Run:
 
 ```bash
-cmake-out/aoti_run model.so -z tokenizer.model -i "Once upon a time"
+cmake-out/aoti_run model.so -z tokenizer.model -l 3 -i "Once upon a time"
 ```
 
 ### ExecuTorch
@@ -218,7 +218,7 @@ scripts/build_native.sh et
 Run:
 
 ```bash
-cmake-out/et_run llama3.pte -z tokenizer.model -i "Once upon a time"
+cmake-out/et_run llama3.pte -z tokenizer.model -l 3 -i "Once upon a time"
 ```
 
 ## Fine-tuned models from torchtune
````
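For quick reference, the two updated README invocations differ only in runtime and model format; the flags mean the same thing in both. A hedged annotation of the commands above (file names are the README's placeholders, and the flag glosses come from the runner options table in docs/runner_build.md below):

```bash
# aoti_run executes a model compiled ahead-of-time by AOT Inductor (.so);
# et_run executes an ExecuTorch program (.pte). In both:
#   -z  path to the tokenizer
#   -l  llama architecture version (3 here, since these are llama3 examples)
#   -i  input prompt
cmake-out/aoti_run model.so   -z tokenizer.model -l 3 -i "Once upon a time"
cmake-out/et_run   llama3.pte -z tokenizer.model -l 3 -i "Once upon a time"
```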

docs/runner_build.md

Lines changed: 23 additions & 17 deletions
````diff
@@ -1,11 +1,11 @@
 # Native Execution
 
-While Python offers a great environment for training models and for experimentation and research with models, developers
-often look for a native execution environment, either to achieve a certain performance level, or when
+While Python offers a great environment for training models and for experimentation and research with models, developers
+often look for a native execution environment, either to achieve a certain performance level, or when
 including a Python environment is undesirable (e.g., in a game application that wants to use an LLM for user interaction) or impossible (devices
 with limited functionality and memory capacity).
 
-The 'llama runner' is a native standalone application capable of running a model exported and compiled ahead-of-time with either Executorch (ET) or AOT Inductor (AOTI). Which model format to use depends on your requirements and preferences. Executorch models are optimized for portability across a range of devices, including mobile and edge devices. AOT Inductor models are optimized for a particular target architecture, which may result in better performance and efficiency.
+The 'llama runner' is a native standalone application capable of running a model exported and compiled ahead-of-time with either Executorch (ET) or AOT Inductor (AOTI). Which model format to use depends on your requirements and preferences. Executorch models are optimized for portability across a range of devices, including mobile and edge devices. AOT Inductor models are optimized for a particular target architecture, which may result in better performance and efficiency.
 
 Building the runners is straightforward with the included cmake build files and is covered in the next sections. We will showcase the runners using ~~stories15M~~ llama2 7B and llama3.
````
````diff
@@ -31,14 +31,16 @@ The runners accept the following command-line arguments:
 
 ```
 Options:
-  -t <float>  temperature in [0,inf], default 1.0
-  -p <float>  p value in top-p (nucleus) sampling in [0,1] default 0.9
-  -s <int>    random seed, default time(NULL)
-  -n <int>    number of steps to run for, default 256. 0 = max_seq_len
-  -i <string> input prompt
-  -z <string> optional path to custom tokenizer
-  -m <string> mode: generate|chat, default: generate
-  -y <string> (optional) system prompt in chat mode
+  -t <float>  temperature in [0,inf], default 1.0
+  -p <float>  p value in top-p (nucleus) sampling in [0,1], default 0.9
+  -s <int>    random seed, default time(NULL)
+  -n <int>    number of steps to run for, default 256. 0 = max_seq_len
+  -i <string> input prompt
+  -z <string> path to tokenizer
+  -m <string> mode: generate|chat, default: generate
+  -y <string> (optional) system prompt in chat mode
+  -v <int>    (optional) vocab size, default is model-specific.
+  -l <int>    (optional) llama version (2 or 3), default 2.
 ```
 
 ## Building and running runner-aoti
````
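As a quick sketch of how the documented options compose (hedged: `model.so` and `tokenizer.bin` are placeholders for your own exported model and tokenizer, and only flags from the table above are used):

```bash
# Sketch only: a reproducible generate run combining the documented flags.
#   -l 2    llama version (2 or 3)       -t 0.8  temperature in [0,inf]
#   -p 0.9  top-p cutoff in [0,1]        -s 42   fixed seed, repeatable output
#   -n 128  steps (0 = max_seq_len)      -m      mode: generate|chat
./cmake-out/aoti_run model.so -z tokenizer.bin -l 2 \
  -t 0.8 -p 0.9 -s 42 -n 128 -m generate \
  -i "Once upon a time"
```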
````diff
@@ -50,7 +52,7 @@ git submodule sync
 git submodule update --init
 
 cmake -S . -B ./cmake-out -G Ninja -DCMAKE_PREFIX_PATH=`python3 -c 'import torch;print(torch.utils.cmake_prefix_path)'`
-cmake --build ./cmake-out --target et_run
+cmake --build ./cmake-out --target aoti_run
 ```
 
 After running these, the runner-aoti binary is located at ./cmake-out/aoti_run.
````
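Since the fix above corrects the build target name, here is the full configure-and-build sequence in one place (the same commands as the hunk, with the `et_run` target from the runner-et section added for contrast):

```bash
# Configure once, then build each runner target as needed.
cmake -S . -B ./cmake-out -G Ninja \
  -DCMAKE_PREFIX_PATH=`python3 -c 'import torch;print(torch.utils.cmake_prefix_path)'`
cmake --build ./cmake-out --target aoti_run  # AOTI runner (this section)
cmake --build ./cmake-out --target et_run    # ExecuTorch runner (next section)
```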
````diff
@@ -67,9 +69,11 @@ We can now execute the runner with:
 
 ```
 wget -O ./tokenizer.bin https://github.com/karpathy/llama2.c/raw/master/tokenizer.bin
-./cmake-out/aoti_run ./model.so -z ./tokenizer.bin -i "Once upon a time"
+./cmake-out/aoti_run ./model.so -z ./tokenizer.bin -l 2 -i "Once upon a time"
 ```
 
+The `-l 2` indicates that the model and tokenizer use the llama2 architecture. If your model is based on llama3, use `-l 3`.
+
 ## Building and running runner-et
 Before building runner-et, you must first set up ExecuTorch by following [setup ExecuTorch steps](executorch_setup.md).
 
````

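The `-l 2` note added in this hunk pairs with the tokenizer appendix later in the file: llama2-style models typically ship SentencePiece tokenizers (converted to `tokenizer.bin` for the native runner), while llama3 uses Tiktoken's `tokenizer.model`. A hedged illustration (paths are placeholders):

```bash
# llama2-family model: SentencePiece tokenizer converted to tokenizer.bin
./cmake-out/aoti_run ./model.so -z ./tokenizer.bin -l 2 -i "Once upon a time"

# llama3-family model: Tiktoken tokenizer.model, selected with -l 3
./cmake-out/aoti_run ./model.so -z ./tokenizer.model -l 3 -i "Once upon a time"
```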
````diff
@@ -100,13 +104,15 @@ We can now execute the runner with:
 
 ```
 wget -O ./tokenizer.bin https://github.com/karpathy/llama2.c/raw/master/tokenizer.bin
-./cmake-out/et_run ./model.pte -z ./tokenizer.bin -i "Once upon a time"
+./cmake-out/et_run ./model.pte -z ./tokenizer.bin -l 2 -i "Once upon a time"
 ```
 
+The `-l 2` indicates that the model and tokenizer use the llama2 architecture. If your model is based on llama3, use `-l 3`.
+
 ## Appendix: Llama runner tokenizers
 
-Tokenizers are essential tools in Natural Language Processing (NLP) that convert text into smaller units, such as words or phrases, known as tokens. Two popular tokenizers are SentencePiece and Tiktoken. [SentencePiece](https://github.com/google/sentencepiece) is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. Llama2-style models typically use the SentencePiece tokenizer. Tiktoken is a newer tokenizer originally developed by OpenAI that allows you to see how many tokens a text string will use without making an API call. Llama3 uses the Tiktoken tokenizer.
-Torchchat includes both Python and C/C++ implementations of the SentencePiece and Tiktoken tokenizers, which may be used with the Python and native execution environments, respectively.
+Tokenizers are essential tools in Natural Language Processing (NLP) that convert text into smaller units, such as words or phrases, known as tokens. Two popular tokenizers are SentencePiece and Tiktoken. [SentencePiece](https://github.com/google/sentencepiece) is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. Llama2-style models typically use the SentencePiece tokenizer. Tiktoken is a newer tokenizer originally developed by OpenAI that allows you to see how many tokens a text string will use without making an API call. Llama3 uses the Tiktoken tokenizer.
+Torchchat includes both Python and C/C++ implementations of the SentencePiece and Tiktoken tokenizers, which may be used with the Python and native execution environments, respectively.
 
 The SentencePiece tokenizer implementations for Python (developed by Google) and the C/C++ implementation (developed by Andrej Karpathy) use different input formats. The Python implementation reads a tokenizer specification in `tokenizer.model` format. The C/C++ tokenizer reads the tokenizer instructions from a file in `tokenizer.bin` format. We include Andrej's SentencePiece converter, which translates a SentencePiece tokenizer in `tokenizer.model` format to `tokenizer.bin`, in the utils subdirectory:
 ```
````
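The converter invocation itself is cut off by the hunk boundary above. As a purely hypothetical sketch (the script name and flag are assumptions, not shown in this diff; check the utils subdirectory for the real entry point):

```bash
# HYPOTHETICAL: assumed converter entry point and flag, not taken from the
# diff above. Verify against the utils subdirectory before relying on it.
python3 utils/tokenizer.py --tokenizer-model=tokenizer.model
# expected output: a tokenizer.bin that the C/C++ SentencePiece runner loads
```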
````diff
@@ -120,4 +126,4 @@ Both can be achieved with the Python environment. All torchchat Python commands
 
 The `eval` tool evaluates model quality using metrics such as 'perplexity' that are commonly used in the NLP community to evaluate output quality. Load your exported model to evaluate its quality metrics. You can find an introduction to the eval tool in the [README](../README.md) file.
 
-The `generate`, `chat` and `browser` tools enable you to verify that the exported model works correctly, as a debugging aid if you are developing your own native execution environment based on the llama runner provided with torchchat.
+The `generate`, `chat` and `browser` tools enable you to verify that the exported model works correctly, as a debugging aid if you are developing your own native execution environment based on the llama runner provided with torchchat.
````
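Hedged examples of the Python tools this paragraph names (subcommands follow the torchchat CLI; exact flags may differ between versions, so treat these as sketches and check the README):

```bash
# Sanity-check an exported model end to end, then score it.
# 'llama3' and the prompt are placeholders.
python3 torchchat.py generate llama3 --prompt "Once upon a time"
python3 torchchat.py eval llama3
```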
