
clean up runner code a little #532


Merged (9 commits) on Apr 28, 2024
4 changes: 2 additions & 2 deletions README.md
@@ -173,7 +173,7 @@ scripts/build_native.sh aoti
Run:

```bash
cmake-out/aoti_run model.so -z tokenizer.model -l 3 -i "Once upon a time"
```

### ExecuTorch
@@ -218,7 +218,7 @@ scripts/build_native.sh et
Run:

```bash
cmake-out/et_run llama3.pte -z tokenizer.model -l 3 -i "Once upon a time"
```

## Fine-tuned models from torchtune
40 changes: 23 additions & 17 deletions docs/runner_build.md
@@ -1,11 +1,11 @@
# Native Execution

While Python offers a great environment for training models and for experimentation and research, developers
often want a native execution environment instead, either to achieve a certain performance level, or when
including a Python environment is undesirable (e.g., in a game application that wants to use an LLM for user interaction) or impossible (devices
with limited functionality and memory capacity).

The 'llama runner' is a native standalone application capable of running a model exported and compiled ahead-of-time with either ExecuTorch (ET) or AOT Inductor (AOTI). Which model format to use depends on your requirements and preferences. ExecuTorch models are optimized for portability across a range of devices, including mobile and edge devices. AOT Inductor models are optimized for a particular target architecture, which may result in better performance and efficiency.

Building the runners is straightforward with the included cmake build files and is covered in the next sections. We will showcase the runners using llama2 7B and llama3.

@@ -31,14 +31,16 @@ The runners accept the following command-line arguments:

```
Options:
-t <float>   temperature in [0,inf], default 1.0
-p <float>   p value in top-p (nucleus) sampling in [0,1], default 0.9
-s <int>     random seed, default time(NULL)
-n <int>     number of steps to run for, default 256. 0 = max_seq_len
-i <string>  input prompt
-z <string>  path to tokenizer
-m <string>  mode: generate|chat, default: generate
-y <string>  (optional) system prompt in chat mode
-v <int>     (optional) vocab size, default is model-specific.
-l <int>     (optional) llama version (2 or 3), default 2.
```
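As an illustration, the sampling flags can be combined in a single run; the sketch below assumes a llama2 model already exported to `model.so` and a matching `tokenizer.bin` (file names are placeholders for your own artifacts):

```bash
# Fixed seed, lower temperature, and tighter top-p, capped at 128 steps.
./cmake-out/aoti_run ./model.so -z ./tokenizer.bin -l 2 \
  -t 0.8 -p 0.95 -s 42 -n 128 -i "Once upon a time"
```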

## Building and running runner-aoti
@@ -50,7 +52,7 @@ git submodule sync
git submodule update --init

cmake -S . -B ./cmake-out -G Ninja -DCMAKE_PREFIX_PATH=`python3 -c 'import torch;print(torch.utils.cmake_prefix_path)'`
cmake --build ./cmake-out --target aoti_run
```

After running these, the runner-aoti binary is located at ./cmake-out/aoti_run.
@@ -67,9 +69,11 @@ We can now execute the runner with:

```
wget -O ./tokenizer.bin https://github.com/karpathy/llama2.c/raw/master/tokenizer.bin
./cmake-out/aoti_run ./model.so -z ./tokenizer.bin -l 2 -i "Once upon a time"
```

The `-l 2` flag indicates that the model and tokenizer use the llama2 architecture. If your model is based on llama3, use `-l 3`.
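For example, a llama3 invocation would pass the Tiktoken `tokenizer.model` directly and select the llama3 architecture; the file names in this sketch are placeholders:

```bash
# Llama3-based model: use the tokenizer.model file and -l 3.
./cmake-out/aoti_run ./llama3-model.so -z ./tokenizer.model -l 3 -i "Once upon a time"
```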

## Building and running runner-et
Before building runner-et, you must first set up ExecuTorch by following the [ExecuTorch setup steps](executorch_setup.md).

@@ -100,13 +104,15 @@ We can now execute the runner with:

```
wget -O ./tokenizer.bin https://github.com/karpathy/llama2.c/raw/master/tokenizer.bin
./cmake-out/et_run ./model.pte -z ./tokenizer.bin -l 2 -i "Once upon a time"
```

The `-l 2` flag indicates that the model and tokenizer use the llama2 architecture. If your model is based on llama3, use `-l 3`.
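The mode flag can likewise be combined with a system prompt; a minimal chat-mode sketch, again with placeholder file names:

```bash
# Interactive chat mode with a system prompt, using the documented -m and -y flags.
./cmake-out/et_run ./model.pte -z ./tokenizer.bin -l 2 -m chat \
  -y "You are a helpful storyteller" -i "Tell me a short story"
```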

## Appendix: Llama runner tokenizers

Tokenizers are essential tools in Natural Language Processing (NLP) that convert text into smaller units, such as words or subword pieces, known as tokens. Two popular tokenizers are SentencePiece and Tiktoken. [SentencePiece](https://github.com/google/sentencepiece) is an unsupervised text tokenizer and detokenizer mainly intended for neural-network-based text generation systems where the vocabulary size is predetermined prior to model training. Llama2-style models typically use the SentencePiece tokenizer. Tiktoken is a newer tokenizer, originally developed by OpenAI, that lets you determine how many tokens a text string will use without making an API call. Llama3 uses the Tiktoken tokenizer.
Torchchat includes both Python and C/C++ implementations of the SentencePiece and Tiktoken tokenizers, for use with the Python and native execution environments, respectively.
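As a quick illustration of the difference, both tokenizers can be exercised from their Python packages. This sketch assumes `pip install sentencepiece tiktoken` and a local llama2-style `tokenizer.model`; the `cl100k_base` encoding is used purely as a stand-in for a Tiktoken-style vocabulary, not the exact llama3 one:

```bash
# SentencePiece: load a llama2-style tokenizer.model and count tokens.
python3 -c 'import sentencepiece as spm; sp = spm.SentencePieceProcessor(model_file="tokenizer.model"); print(len(sp.encode("Once upon a time")))'

# Tiktoken: count tokens with a BPE encoding (cl100k_base is illustrative only).
python3 -c 'import tiktoken; enc = tiktoken.get_encoding("cl100k_base"); print(len(enc.encode("Once upon a time")))'
```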

The SentencePiece tokenizer implementations for Python (developed by Google) and for C/C++ (developed by Andrej Karpathy) use different input formats. The Python implementation reads a tokenizer specification in `tokenizer.model` format, while the C/C++ tokenizer reads its instructions from a file in `tokenizer.bin` format. We include Andrej's SentencePiece converter, which translates a SentencePiece tokenizer in `tokenizer.model` format to `tokenizer.bin`, in the utils subdirectory:
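A minimal sketch of the conversion step, assuming the converter script is `utils/tokenizer.py` and follows the flag convention of Andrej Karpathy's llama2.c converter; verify the actual path and flags in your checkout:

```bash
# Hypothetical path and flag; converts tokenizer.model to tokenizer.bin
# for use with the C/C++ runner.
python3 utils/tokenizer.py --tokenizer-model=tokenizer.model
```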
@@ -120,4 +126,4 @@ Both can be achieved with the Python environment. All torchchat Python commands

The `eval` tool evaluates model quality using metrics such as 'perplexity' that are commonly used in the NLP community to evaluate output quality. Load your exported model to evaluate its quality metrics. You can find an introduction to the eval tool in the [README](../README.md) file.
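A hypothetical eval invocation against an AOTI-exported model; the model name and the `--dso-path` flag are assumptions about the torchchat CLI, not confirmed by this document:

```bash
# Evaluate perplexity of an exported model (flags are illustrative).
python3 torchchat.py eval llama3 --dso-path ./model.so
```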

The `generate`, `chat` and `browser` tools enable you to verify that the exported model works correctly, and serve as a debugging aid if you are developing your own native execution environment based on the llama runner provided with torchchat.
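Similarly, a hypothetical sanity check with the `generate` tool; the `--pte-path` and `--prompt` flags are assumptions about the torchchat CLI:

```bash
# Compare Python-side output against the native runner's (flags are illustrative).
python3 torchchat.py generate llama3 --pte-path ./model.pte --prompt "Once upon a time"
```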