This subtree contains libraries and utils for running generative AI, including Large Language Models (LLM).

Below is a list of subfolders.

## export
Model preparation code lives in the _export_ folder. The main entry point is the _LLMEdgeManager_ class. It hosts a _torch.nn.Module_ and provides a list of methods that can be used to prepare an LLM model for the ExecuTorch runtime.

Note that ExecuTorch supports two [quantization APIs](https://pytorch.org/docs/stable/quantization.html#quantization-api-summary): eager mode quantization (aka source transform based quantization) and PyTorch 2 Export based quantization (aka pt2e quantization).

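For a flavor of the pt2e path, here is a minimal sketch. The capture API and the `XNNPACKQuantizer` import path below reflect the torch 2.x era of this README and are assumptions; `TinyModel` is a stand-in module.

```python
import torch
from torch._export import capture_pre_autograd_graph  # torch 2.x-era capture API (assumed)
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)


class TinyModel(torch.nn.Module):
    """Stand-in module for illustration."""

    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 16)

    def forward(self, x):
        return self.linear(x)


model = TinyModel().eval()
example_inputs = (torch.randn(1, 16),)

# Capture the model as a graph, annotate it with a quantizer,
# run calibration, then convert to the quantized graph.
graph = capture_pre_autograd_graph(model, example_inputs)
quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
prepared = prepare_pt2e(graph, quantizer)
prepared(*example_inputs)  # calibration with example data
quantized = convert_pt2e(prepared)
```
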
Commonly used methods in this class include:
- _set_output_dir_: where users want to save the exported .pte file.
- _to_dtype_: override the data type of the module.
- _source_transform_: execute a series of source transform passes. Some transform passes include …

Some usages of _LLMEdgeManager_ can be found in executorch/examples/models/llama2 and executorch/examples/models/llava.

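As a hedged sketch of how the methods above chain together (the import path and argument values are assumptions for illustration; the examples above show the real invocations):

```python
# Import path and argument values are assumptions; the real construction of
# LLMEdgeManager is shown in executorch/examples/models/llama2.
from executorch.extension.llm.export.builder import LLMEdgeManager


def prepare(manager: LLMEdgeManager) -> LLMEdgeManager:
    """Chain the commonly used preparation steps on an existing manager."""
    return (
        manager
        .set_output_dir("./out")   # directory for the exported .pte file
        .to_dtype(None)            # dtype override; None keeps the module dtype (assumed)
        .source_transform([])      # list of source transform passes to apply
    )
```
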
When the .pte file is exported and saved, we can load and run it in a runner (see below).
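
For a quick sanity check from Python, the portable pybindings can also load and run the exported file; a minimal sketch (assuming the pybindings are built; `model.pte` and the input shape are placeholders):

```python
import torch

# The portable pybindings ship with ExecuTorch; building them is a
# prerequisite. "model.pte" and the input shape below are placeholders.
from executorch.extension.pybindings.portable_lib import _load_for_executorch

program = _load_for_executorch("model.pte")       # load the exported program
outputs = program.forward((torch.randn(1, 16),))  # run with a tuple of input tensors
```
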
## tokenizer
Currently, we support two types of tokenizers: sentencepiece and Tiktoken.

In Python:
- _tokenizer.py_: rewrite a sentencepiece tokenizer model to a serialization format that the runtime can load.

In C++:
- _tokenizer.h_: a simple tokenizer interface. Actual tokenizer classes can be implemented based on this. In this folder, we provide two tokenizer implementations:
  - _bpe_tokenizer_: note that we need the rewritten version of the tokenizer artifact (see _tokenizer.py_ above) for the bpe tokenizer to work.
  - _tiktoken_: used for llama3 and llama3.1.

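Conceptually, the Python-side rewrite flattens the sentencepiece model into a simple table that the C++ side can read; a sketch of the idea (this is an illustrative format, not the actual layout produced by _tokenizer.py_):

```python
import struct

from sentencepiece import SentencePieceProcessor


def serialize_vocab(model_path: str, out_path: str) -> None:
    """Dump (score, piece) pairs from a sentencepiece model to a flat binary.

    Illustrative format only: the real on-disk layout is defined by
    tokenizer.py and the C++ bpe_tokenizer that loads it.
    """
    sp = SentencePieceProcessor(model_file=model_path)
    with open(out_path, "wb") as f:
        f.write(struct.pack("<I", sp.vocab_size()))  # header: vocab size
        for i in range(sp.vocab_size()):
            piece = sp.id_to_piece(i).encode("utf-8")
            f.write(struct.pack("<fI", sp.get_score(i), len(piece)))
            f.write(piece)
```
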
## sampler
A sampler class in C++ to sample tokens from the logits given some hyperparameters.
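
For intuition, temperature plus top-p (nucleus) sampling over a logits vector looks roughly like this (a Python sketch of the same idea the C++ class implements; hyperparameter defaults here are arbitrary):

```python
import torch


def sample(logits: torch.Tensor, temperature: float = 0.8, top_p: float = 0.9) -> int:
    """Sample a token id from a 1-D logits tensor with temperature + top-p."""
    if temperature <= 0:
        return int(torch.argmax(logits))  # greedy decoding
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Drop tokens whose preceding cumulative mass already exceeds top_p.
    sorted_probs[cumulative - sorted_probs > top_p] = 0.0
    sorted_probs /= sorted_probs.sum()  # renormalize the kept nucleus
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return int(sorted_ids[choice])
```
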
## custom_ops
Contains custom ops, such as:
- custom sdpa: implements CPU flash attention and avoids copies by taking the kv cache as one of its arguments.
  - _sdpa_with_kv_cache.py_, _op_sdpa_aot.cpp_: custom op definition in PyTorch with C++ registration.
  - _op_sdpa.cpp_: the optimized operator implementation and registration of _sdpa_with_kv_cache.out_.

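The copy-avoidance idea can be sketched in plain PyTorch: new keys and values are written in place into preallocated caches and attention runs over views of the cache, with no per-step concatenation (a conceptual sketch, not the actual signature of the custom op):

```python
import torch
import torch.nn.functional as F


def sdpa_with_kv_cache_sketch(
    q: torch.Tensor,        # [batch, heads, seq_len, head_dim]
    k: torch.Tensor,        # new keys for this step, same layout as q
    v: torch.Tensor,        # new values for this step
    k_cache: torch.Tensor,  # preallocated [batch, heads, max_seq_len, head_dim]
    v_cache: torch.Tensor,
    start_pos: int,
) -> torch.Tensor:
    """Update the kv cache in place and attend over views of it (no concat copies)."""
    seq_len = k.shape[2]
    k_cache[:, :, start_pos : start_pos + seq_len] = k  # in-place cache update
    v_cache[:, :, start_pos : start_pos + seq_len] = v
    keys = k_cache[:, :, : start_pos + seq_len]         # views into the cache
    values = v_cache[:, :, : start_pos + seq_len]
    # Causal masking matters for prefill; single-token decode attends to all
    # cached positions. Chunked prefill would need an explicit mask.
    return F.scaled_dot_product_attention(q, keys, values, is_causal=(start_pos == 0))
```
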
## runner
It hosts the library components used in a C++ LLM runner. Currently, it hosts _stats.h_, which records runtime status like token counts and latency.

With the components above, an actual runner can be built for a model or a series of models. An example is in //executorch/examples/models/llama2/runner, where the C++ runner code is built to run Llama 2, 3, 3.1, and other models using the same architecture.

Example usage can also be found in the [torchchat repo](https://github.com/pytorch/torchchat/tree/main/runner).