Commit 82bca22

readme : add option, update default value, fix formatting (ggml-org#10271)
* readme : document --no-display-prompt
* readme : update default prompt context size
* readme : remove unnecessary indentation

  Indenting a line with four spaces makes Markdown treat that section as plain text.

* readme : indent commands under bullets
* readme : indent commands in lettered list
1 parent 0115df2 commit 82bca22

File tree: 4 files changed (+165, -164 lines)


docs/build.md

Lines changed: 8 additions & 8 deletions
@@ -26,17 +26,17 @@ cmake --build build --config Release

 1. Single-config generators (e.g. default = `Unix Makefiles`; note that they just ignore the `--config` flag):

-```bash
-cmake -B build -DCMAKE_BUILD_TYPE=Debug
-cmake --build build
-```
+```bash
+cmake -B build -DCMAKE_BUILD_TYPE=Debug
+cmake --build build
+```

 2. Multi-config generators (`-G` param set to Visual Studio, XCode...):

-```bash
-cmake -B build -G "Xcode"
-cmake --build build --config Debug
-```
+```bash
+cmake -B build -G "Xcode"
+cmake --build build --config Debug
+```

 For more details and a list of supported generators, see the [CMake documentation](https://cmake.org/cmake/help/latest/manual/cmake-generators.7.html).

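For comparison, with a multi-config generator the build type is chosen at build time rather than at configure time, so switching to a Release build of the same tree only changes the `--config` value. A minimal sketch (the generator name depends on your platform):

```bash
# Multi-config generator: one configure step can serve several build types,
# because the type is selected when building, not when configuring.
cmake -B build -G "Xcode"
cmake --build build --config Release
```
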
examples/infill/README.md

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ In this section, we cover the most commonly used options for running the `infill
 - `-m FNAME, --model FNAME`: Specify the path to the LLaMA model file (e.g., `models/7B/ggml-model.bin`).
 - `-i, --interactive`: Run the program in interactive mode, allowing you to provide input directly and receive real-time responses.
 - `-n N, --n-predict N`: Set the number of tokens to predict when generating text. Adjusting this value can influence the length of the generated text.
-- `-c N, --ctx-size N`: Set the size of the prompt context. The default is 512, but LLaMA models were built with a context of 2048, which will provide better results for longer input/inference.
+- `-c N, --ctx-size N`: Set the size of the prompt context. The default is 4096, but if a LLaMA model was built with a longer context, increasing this value will provide better results for longer input/inference.
 - `--spm-infill`: Use Suffix/Prefix/Middle pattern for infill (instead of Prefix/Suffix/Middle) as some models prefer this.

 ## Input Prompts

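For reference, an infill invocation that raises the context size above the new 4096 default might look like the sketch below (the model path and code fragments are placeholders, and it assumes the `llama-infill` binary built from this repository):

```bash
# Hypothetical model path; -c raises the prompt context above the 4096 default
# for longer prefix/suffix inputs.
./llama-infill -m models/7B/ggml-model.gguf -c 8192 \
    --in-prefix "def fibonacci(n):\n    " --in-suffix "\n    return result\n"
```
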
examples/main/README.md

Lines changed: 3 additions & 2 deletions
@@ -66,7 +66,7 @@ In this section, we cover the most commonly used options for running the `llama-
 - `-mu MODEL_URL --model-url MODEL_URL`: Specify a remote http url to download the file (e.g [https://huggingface.co/ggml-org/gemma-1.1-7b-it-Q4_K_M-GGUF/resolve/main/gemma-1.1-7b-it.Q4_K_M.gguf?download=true](https://huggingface.co/ggml-org/gemma-1.1-7b-it-Q4_K_M-GGUF/resolve/main/gemma-1.1-7b-it.Q4_K_M.gguf?download=true)).
 - `-i, --interactive`: Run the program in interactive mode, allowing you to provide input directly and receive real-time responses.
 - `-n N, --n-predict N`: Set the number of tokens to predict when generating text. Adjusting this value can influence the length of the generated text.
-- `-c N, --ctx-size N`: Set the size of the prompt context. The default is 512, but LLaMA models were built with a context of 2048, which will provide better results for longer input/inference.
+- `-c N, --ctx-size N`: Set the size of the prompt context. The default is 4096, but if a LLaMA model was built with a longer context, increasing this value will provide better results for longer input/inference.
 - `-mli, --multiline-input`: Allows you to write or paste multiple lines without ending each in '\'
 - `-t N, --threads N`: Set the number of threads to use during generation. For optimal performance, it is recommended to set this value to the number of physical CPU cores your system has.
 - `-ngl N, --n-gpu-layers N`: When compiled with GPU support, this option allows offloading some layers to the GPU for computation. Generally results in increased performance.

@@ -131,7 +131,7 @@ During text generation, LLaMA models have a limited context size, which means th

 ### Context Size

-- `-c N, --ctx-size N`: Set the size of the prompt context (default: 0, 0 = loaded from model). The LLaMA models were built with a context of 2048-8192, which will yield the best results on longer input/inference.
+- `-c N, --ctx-size N`: Set the size of the prompt context (default: 4096, 0 = loaded from model). If a LLaMA model was built with a longer context, increasing this value will yield the best results on longer input/inference.

 ### Extended Context Size

@@ -348,6 +348,7 @@ These options provide extra functionality and customization when running the LLa

 - `-h, --help`: Display a help message showing all available options and their default values. This is particularly useful for checking the latest options and default values, as they can change frequently, and the information in this document may become outdated.
 - `--verbose-prompt`: Print the prompt before generating text.
+- `--no-display-prompt`: Don't print prompt at generation.
 - `-mg i, --main-gpu i`: When using multiple GPUs this option controls which GPU is used for small tensors for which the overhead of splitting the computation across all GPUs is not worthwhile. The GPU in question will use slightly more VRAM to store a scratch buffer for temporary results. By default GPU 0 is used.
 - `-ts SPLIT, --tensor-split SPLIT`: When using multiple GPUs this option controls how large tensors should be split across all GPUs. `SPLIT` is a comma-separated list of non-negative values that assigns the proportion of data that each GPU should get in order. For example, "3,2" will assign 60% of the data to GPU 0 and 40% to GPU 1. By default the data is split in proportion to VRAM but this may not be optimal for performance.
 - `-hfr URL --hf-repo URL`: The url to the Hugging Face model repository. Used in conjunction with `--hf-file` or `-hff`. The model is downloaded and stored in the file provided by `-m` or `--model`. If `-m` is not provided, the model is auto-stored in the path specified by the `LLAMA_CACHE` environment variable or in an OS-specific local cache.

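Putting the two documented changes together, a run that relies on the larger default context and suppresses the prompt echo might look like this sketch (the model path and prompt are placeholders):

```bash
# Hypothetical model path; -c 4096 matches the new documented default, and
# --no-display-prompt keeps the input prompt out of the generated output.
./llama-cli -m models/7B/ggml-model.gguf -c 4096 -n 256 \
    --no-display-prompt -p "Write a haiku about autumn."
```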