Release docs proofreading #5909

Closed · wants to merge 1 commit
examples/models/phi-3-mini-lora/README.md (2 changes: 1 addition & 1 deletion)
@@ -11,7 +11,7 @@ To see how you can use the model exported for training in a fully involved finet
- `./examples/models/phi-3-mini-lora/install_requirements.sh`

### Step 3: Export and run the model
1. Export the inferenace and training models to ExecuTorch.
1. Export the inference and training models to ExecuTorch.
```
python export_model.py
```
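
If the export succeeds, the resulting `.pte` program can be smoke-tested from Python via the ExecuTorch pybindings. The sketch below is only a sanity check under assumptions: the pybindings are built, the output file name is hypothetical, and the input shape is a guess; `export_model.py` defines the real names and signature.

```python
import torch
from executorch.extension.pybindings.portable_lib import _load_for_executorch

# Hypothetical file name; export_model.py decides what it actually writes out.
program = _load_for_executorch("phi3_mini_lora.pte")

# Dummy token ids purely to exercise the graph; real shapes depend on the export.
tokens = torch.randint(0, 32000, (1, 8), dtype=torch.int64)
outputs = program.forward((tokens,))
print(outputs[0].shape)
```
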
extension/llm/README.md (20 changes: 11 additions & 9 deletions)
@@ -2,8 +2,9 @@ This subtree contains libraries and utils of running generative AI, including La
Below is a list of subfolders.
## export
Model preparation code is in the _export_ folder. The main entry point is the _LLMEdgeManager_ class. It hosts a _torch.nn.Module_ and provides a list of methods that can be used to prepare the LLM model for the ExecuTorch runtime.
Note that ExecuTorch supports two [quantization APIs](https://pytorch.org/docs/stable/quantization.html#quantization-api-summary): eager mode quantization (aka source transform based quantization), and PyTorch 2 Export based quantization (aka pt2e quantization).
Typical methods include:
Note that ExecuTorch supports two [quantization APIs](https://pytorch.org/docs/stable/quantization.html#quantization-api-summary): eager mode quantization (aka source transform based quantization) and PyTorch 2 Export based quantization (aka pt2e quantization).

Commonly used methods in this class include:
- _set_output_dir_: where users want to save the exported .pte file.
- _to_dtype_: override the data type of the module.
- _source_transform_: execute a series of source transform passes. Some transform passes include
@@ -19,7 +20,7 @@ Typical methods include:

Some usage of LLMEdgeManager can be found in executorch/examples/models/llama2, and executorch/examples/models/llava.
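
To make that flow concrete, here is a rough sketch of how such a manager is typically driven. Only `set_output_dir`, `to_dtype`, and `source_transform` come from the list above; the import path, the constructor arguments, and the `export` / `pt2e_quantize` / `to_executorch` / `save_to_pte` steps are assumptions for illustration, so treat the llama2 and llava examples as the source of truth.

```python
import torch

# Assumed import path; the real builder is wired up by the example export scripts.
from executorch.extension.llm.export.builder import LLMEdgeManager

model = torch.nn.Linear(8, 8)  # stand-in for an eager LLM such as a Llama variant
example_inputs = (torch.zeros(1, 8),)

manager = (
    LLMEdgeManager(model=model, modelname="demo", example_inputs=example_inputs)  # assumed constructor
    .set_output_dir("./out")      # documented above
    .to_dtype(torch.float16)      # documented above; the exact dtype argument type is an assumption
    .source_transform([])         # documented above; pass source-transform callables here
    .export()                     # assumption: capture the module with torch.export
    .pt2e_quantize(None)          # assumption: optional pt2e quantization given a quantizer list
    .to_executorch()              # assumption: lower the captured graph to an ExecuTorch program
)
manager.save_to_pte("demo")       # assumption: writes <output_dir>/demo.pte
```
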

When the .pte file is exported and saved, we can prepare a load and run it in a runner.
When the .pte file is exported and saved, we can load and run it in a runner (see below).

## tokenizer
Currently, we support two types of tokenizers: sentencepiece and Tiktoken.
@@ -28,20 +29,21 @@ Currently, we support two types of tokenizers: sentencepiece and Tiktoken.
- _tokenizer.py_: rewrite a sentencepiece tokenizer model to a serialization format that the runtime can load.
- In C++:
- _tokenizer.h_: a simple tokenizer interface. Actual tokenizer classes can be implemented based on this. In this folder, we provide two tokenizer implementations:
- _bpe_tokenizer_. We need the rewritten version of tokenizer artifact (refer to _tokenizer.py_ above), for bpe tokenizer to work.
- _tiktokern_. It's for llama3 and llama3.1.
- _bpe_tokenizer_. Note: we need the rewritten version of the tokenizer artifact (refer to _tokenizer.py_ above) for the bpe tokenizer to work.
- _tiktoken_. For llama3 and llama3.1.
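
To illustrate what the Python-side rewrite amounts to, the sketch below dumps a sentencepiece model into a simple flat binary: a small header followed by one (score, length, bytes) record per token. The layout here is invented for illustration only; the actual format consumed by _bpe_tokenizer_ is defined by _tokenizer.py_ and _tokenizer.h_.

```python
import struct

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

# Illustrative layout only: vocab size, bos/eos ids, then (score, length, bytes) per token.
with open("tokenizer.bin", "wb") as f:
    f.write(struct.pack("<iii", sp.vocab_size(), sp.bos_id(), sp.eos_id()))
    for i in range(sp.vocab_size()):
        piece = sp.id_to_piece(i).encode("utf-8")
        f.write(struct.pack("<fi", sp.get_score(i), len(piece)))
        f.write(piece)
```
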

## sampler
A sampler class in C++ to sample the logits given some hyperparameters.
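
For intuition, the core of such a sampler is temperature scaling plus top-p (nucleus) filtering over the output logits. A Python sketch of that logic (the C++ class's actual interface, defaults, and edge-case handling may differ):

```python
import torch

def sample(logits: torch.Tensor, temperature: float = 0.8, top_p: float = 0.9) -> int:
    """Pick the next token id from a 1-D logits tensor."""
    if temperature == 0:
        return int(torch.argmax(logits).item())  # greedy decoding
    probs = torch.softmax(logits / temperature, dim=-1)
    # Nucleus filtering: keep the smallest set of tokens whose cumulative
    # probability reaches top_p, zero out the rest, then renormalize.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    sorted_probs[cumulative - sorted_probs > top_p] = 0.0
    sorted_probs /= sorted_probs.sum()
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return int(sorted_idx[choice].item())
```

With `temperature=0` this degenerates to greedy decoding.
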

## custom_ops
It hosts a custom sdpa operator. This sdpa operator implements CPU flash attention, it avoids copies by taking the kv cache as one of the arguments to this custom operator.
- _sdpa_with_kv_cache.py_, _op_sdpa_aot.cpp_: custom op definition in PyTorch with C++ registration.
- _op_sdpa.cpp_: the optimized operator implementation and registration of _sdpa_with_kv_cache.out_.
Contains custom ops, such as:
- custom sdpa: implements CPU flash attention and avoids copies by taking the kv cache as one of its arguments.
- _sdpa_with_kv_cache.py_, _op_sdpa_aot.cpp_: custom op definition in PyTorch with C++ registration.
- _op_sdpa.cpp_: the optimized operator implementation and registration of _sdpa_with_kv_cache.out_.
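
Conceptually, fusing the cache update into the attention call is what avoids the copies: new keys and values are written into the caches in place, and attention reads straight out of them. A plain-PyTorch sketch of that behavior, assuming a [batch, heads, seq, head_dim] layout (the real op's signature and cache layout are defined in _sdpa_with_kv_cache.py_ and may differ):

```python
import torch
import torch.nn.functional as F

def sdpa_with_kv_cache_sketch(q, k, v, k_cache, v_cache, start_pos: int):
    """q/k/v: [batch, heads, seq, head_dim]; caches: [batch, heads, max_seq, head_dim]."""
    seq_len = q.shape[2]
    # Write the new keys/values into the caches in place; nothing is copied out.
    k_cache[:, :, start_pos:start_pos + seq_len] = k
    v_cache[:, :, start_pos:start_pos + seq_len] = v
    # Attend over everything cached so far. Masking is simplified: causal during
    # prefill, no mask for single-token decode steps.
    keys = k_cache[:, :, : start_pos + seq_len]
    values = v_cache[:, :, : start_pos + seq_len]
    return F.scaled_dot_product_attention(q, keys, values, is_causal=(seq_len > 1))
```
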

## runner
It hosts the library components used in a C++ LLM runner. Currently, it hosts _stats.h_, which tracks runtime status such as token counts and latency.
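
As a rough picture of that bookkeeping, here is a Python equivalent of the kind of counters a runner keeps; the actual field names and units in _stats.h_ will differ.

```python
import time

class GenerationStats:
    """Tracks token counts and latency, mirroring what a runner's stats record."""

    def __init__(self) -> None:
        self.start = time.perf_counter()
        self.first_token_time: float | None = None
        self.generated_tokens = 0

    def on_token(self) -> None:
        if self.first_token_time is None:
            self.first_token_time = time.perf_counter()  # time-to-first-token
        self.generated_tokens += 1

    def report(self) -> str:
        elapsed = time.perf_counter() - self.start
        rate = self.generated_tokens / elapsed if elapsed > 0 else 0.0
        return f"{self.generated_tokens} tokens in {elapsed:.2f}s ({rate:.1f} tok/s)"
```
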

With the components above, an actual runner can be built for a model or a series of models. An exmaple is in //executorch/examples/models/llama2/runner, where a C++ runner code is built to run Llama 2, 3, 3.1 and other models using the same architecture.
With the components above, an actual runner can be built for a model or a series of models. An example is in //executorch/examples/models/llama2/runner, where C++ runner code is built to run Llama 2, 3, 3.1, and other models using the same architecture.

Usages can also be found in the [torchchat repo](https://github.com/pytorch/torchchat/tree/main/runner).