
Commit b978075

mikekgfb authored and malfet committed
Update runner_build.md (#530)
Update description of runner and build process in runner_build.md
1 parent 68b53f3 commit b978075

File tree

1 file changed: +48 −3 lines changed


docs/runner_build.md

Lines changed: 48 additions & 3 deletions
@@ -1,7 +1,33 @@
-# Building runner-aoti and runner-et
-Building the runners is straightforward and is covered in the next sections. We will showcase the runners using stories15M.
+# Native Execution

-The runners accept the following CLI arguments:
+While Python offers a great environment for training, experimentation, and research with models, developers
+are often looking to use a native execution environment, either to achieve a certain performance level, or when
+including Python is undesirable (e.g., in a game application that wants to use an LLM for user interaction) or impossible (devices
+with limited functionality and memory capacity).
+
+The 'llama runner' is a native standalone application capable of running a model exported and compiled ahead-of-time with either Executorch (ET) or AOT Inductor (AOTI). Which model format to use depends on your requirements and preferences. Executorch models are optimized for portability across a range of devices, including mobile and edge devices. AOT Inductor models are optimized for a particular target architecture, which may result in better performance and efficiency.
+
+Building the runners is straightforward with the included cmake build files and is covered in the next sections. We will showcase the runners using ~~stories15M~~ llama2 7B and llama3.
+
+## What can you do with torchchat's llama runner for native execution?
+
+* Run models natively:
+  * [Chat](#chat)
+  * [Generate](#generate)
+  * ~~[Run via Browser](#browser)~~
+* [Building and using llama runner for exported .so files](#run-server)
+  * in Chat mode
+  * in Generate mode
+* [Building and using llama runner for exported .pte files](#run-portable)
+  * in Chat mode
+  * in Generate mode
+* [Building and using llama runner on mobile devices](#run-mobile)
+* Appendix:
+  * [Tokenizers](#tokenizers)
+  * [Validation](#validation)
+
+
+The runners accept the following command-line arguments:

```
Options:
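
For orientation, the cmake build the updated document refers to follows the standard configure-then-build pattern; a minimal sketch under assumptions is shown below. The `./cmake-out` build directory and the `et_run` invocation are taken from the example later in this diff, while the build type and any runner-specific cmake options are assumptions, not taken from the repo.

```
# Configure the runner build into ./cmake-out (torchchat-specific cmake options may be required).
cmake -S . -B ./cmake-out -DCMAKE_BUILD_TYPE=Release

# Build the configured targets; the ExecuTorch runner binary is expected at ./cmake-out/et_run.
cmake --build ./cmake-out

# Smoke-test the runner on an exported .pte model (flags as in the example below).
./cmake-out/et_run ./model.pte -z ./tokenizer.bin -i "Once upon a time"
```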
@@ -76,3 +102,22 @@ We can now execute the runner with:
wget -O ./tokenizer.bin https://github.com/karpathy/llama2.c/raw/master/tokenizer.bin
./cmake-out/et_run ./model.pte -z ./tokenizer.bin -i "Once upon a time"
```
+
+## Appendix: Llama runner tokenizers
+
+Tokenizers are essential tools in Natural Language Processing (NLP) that convert text into smaller units, such as words or phrases, known as tokens. Two popular tokenizers are SentencePiece and Tiktoken. [SentencePiece](https://github.com/google/sentencepiece) is an unsupervised text tokenizer and detokenizer, mainly for neural-network-based text generation systems where the vocabulary size is predetermined prior to training the neural model. Llama2-style models typically use the SentencePiece tokenizer. Tiktoken is a newer tokenizer, originally developed by OpenAI, that allows you to see how many tokens a text string will use without making an API call. Llama3 uses the Tiktoken tokenizer.
+Torchchat includes both Python and C/C++ implementations of the SentencePiece and Tiktoken tokenizers, which may be used with the Python and native execution environments, respectively.
+
+The SentencePiece tokenizer implementation for Python (developed by Google) and the C/C++ implementation (developed by Andrej Karpathy) use different input formats. The Python implementation reads a tokenizer specification in `tokenizer.model` format, while the C/C++ tokenizer reads the tokenizer instructions from a file in `tokenizer.bin` format. We include Andrej's SentencePiece converter, which translates a SentencePiece tokenizer in `tokenizer.model` format to `tokenizer.bin`, in the utils subdirectory:
+```
+python3 utils/tokenizer.py --tokenizer-model=${MODEL_DIR}/tokenizer.model
+```
+
+## Appendix: Native model verification using the Python environment
+
+After exporting a model, you will want to verify that the model delivers output of high quality and works as expected.
+Both can be achieved with the Python environment. All torchchat Python commands can work with exported models. Instead of loading the model from a checkpoint or GGUF file, use the `--dso-path model.so` or `--pte-path model.pte` option to load the corresponding exported model. This enables you to verify the quality of the exported models and to run any tests you may have developed against exported models to validate model quality.
+
+The `eval` tool evaluates model quality using metrics such as 'perplexity' that are commonly used in the NLP community to evaluate output quality. Load your exported model to evaluate quality metrics for exported models. You can find an introduction to the eval tool in the [README](../README.md) file.
+
+The `generate`, `chat` and `browser` tools enable you to verify that the exported model works correctly, and serve as a debugging aid if you are developing your own native execution environment based on the llama runner provided with torchchat.
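
To make the verification workflow in the appendix concrete, here is a minimal sketch. The `--dso-path` and `--pte-path` options come from the text above; the `torchchat.py` entry point, the `--prompt` flag, and the exact `eval` arguments are assumptions based on the torchchat README rather than this file.

```
# Verify an AOT Inductor export (.so) by generating from it in the Python environment (entry point assumed).
python3 torchchat.py generate --dso-path model.so --prompt "Once upon a time"

# Verify an Executorch export (.pte) the same way.
python3 torchchat.py generate --pte-path model.pte --prompt "Once upon a time"

# Evaluate quality metrics (e.g., perplexity) for the exported model with the eval tool (arguments assumed).
python3 torchchat.py eval --pte-path model.pte
```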
