Update base for Update on "[ET-VK] Clean up api::vTensor class"
## Context
Now that we have forked the `api/` directory from PyTorch Vulkan, we can clean up the `vTensor` class and remove functionality that is not necessary for the ExecuTorch Vulkan delegate.
The following changes are made:
* Remove unused member variables and member functions from `vTensor` and `vTensorStorage`
* Remove all quantization related member variables, member functions, and the `vTensor` constructor for quantized tensors. The Quantization API will be reworked from the ground up.
* Rename `view_` (which is an instance of `vTensorStorage`) to `storage_`
Finally, the critical change introduced is that `storage_` is now a direct `vTensorStorage` member variable of `vTensor`, instead of being stored as a `std::shared_ptr<vTensorStorage>`.
For context, `storage_` was previously held behind a shared pointer for parity with ATen Tensors, which must support copy construction to enable the following:
```
at::Tensor b = at::rand(...);
// Oftentimes this creates a "view" of the tensor: a and b point to the same underlying storage, but may carry different metadata.
at::Tensor a = b;
```
However, in the ExecuTorch delegate this is no longer necessary. Each Tensor is associated with its own independent storage and is responsible for managing its own memory. **By getting rid of `std::shared_ptr`, we avoid a heap allocation and avoid chasing a pointer whenever we need to access the resources of a `vTensor`.**
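To illustrate the layout change, here is a minimal C++ sketch; `vTensorStorage` is reduced to a stub, and the `vTensorOld`/`vTensorNew` names are hypothetical stand-ins for the before/after versions of the class:

```
#include <memory>

// Simplified stand-in for the real vTensorStorage, which owns the
// Vulkan buffer/image resources backing a tensor.
struct vTensorStorage {
  // ... resource handles, sizes, metadata, etc. ...
};

// Before: storage lived behind a shared_ptr (named view_), so constructing
// a vTensor required a separate heap allocation, and every access to the
// storage chased a pointer.
class vTensorOld {
  std::shared_ptr<vTensorStorage> view_;
};

// After: storage_ is a direct member. It is constructed in place with the
// vTensor and accessed with no extra indirection. ATen-style copy
// construction (shared views) is no longer supported, which is fine because
// each ExecuTorch tensor owns its storage exclusively.
class vTensorNew {
  vTensorStorage storage_;
};
```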
Differential Revision: [D55811279](https://our.internmc.facebook.com/intern/diff/D55811279/)
[ghstack-poisoned]
## Files changed

`examples/models/llama2/README.md` (20 additions, 4 deletions)
```
@@ -17,9 +17,9 @@ Please note that the models are subject to the [acceptable use policy](https://g
 # Results
 
-Since 7B Llama2 model needs at least 4-bit quantization to fit even within some of the highend phones, results presented here correspond to 4-bit groupwise post-training quantized model.
+Since 7B Llama2 model needs at least 4-bit quantization to fit even within some of the highend phones, results presented here correspond to 4-bit groupwise post-training quantized model.
 
-For Llama3, we can use the same process. Note that it's only supported in the ExecuTorch main branch.
+For Llama3, we can use the same process. Note that it's only supported in the ExecuTorch main branch.
 
 ## Quantization:
 We employed 4-bit groupwise per token dynamic quantization of all the linear layers of the model. Dynamic quantization refers to quantizating activations dynamically, such that quantization parameters for activations are calculated, from min/max range, at runtime. Here we quantized activations with 8bits (signed integer). Furthermore, weights are statically quantized. In our case weights were per-channel groupwise quantized with 4bit signed integer. For more information refer to this [page](https://github.com/pytorch-labs/ao/).
```
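As a minimal, illustrative C++ sketch of the symmetric 4-bit groupwise weight quantization described in the hunk above, assuming each group shares one scale derived from its max-magnitude weight; the function name and signature are hypothetical, and the actual kernels live in the linked pytorch-labs/ao repository:

```
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Groupwise symmetric int4 quantization: each group of `group_size`
// consecutive weights shares one scale, chosen so the max-magnitude weight
// in the group maps onto the signed 4-bit range [-8, 7].
void quantize_groupwise_int4(const std::vector<float>& weights,
                             int group_size,
                             std::vector<int8_t>& q,      // int4 values stored in int8
                             std::vector<float>& scales) // one scale per group
{
  const size_t num_groups = (weights.size() + group_size - 1) / group_size;
  q.resize(weights.size());
  scales.resize(num_groups);
  for (size_t g = 0; g < num_groups; ++g) {
    const size_t begin = g * group_size;
    const size_t end = std::min(begin + (size_t)group_size, weights.size());
    float max_abs = 0.f;
    for (size_t i = begin; i < end; ++i)
      max_abs = std::max(max_abs, std::fabs(weights[i]));
    const float scale = max_abs > 0.f ? max_abs / 7.f : 1.f;
    scales[g] = scale;
    for (size_t i = begin; i < end; ++i) {
      // Round to nearest, then clamp to the signed 4-bit range.
      int v = (int)std::lround(weights[i] / scale);
      q[i] = (int8_t)std::clamp(v, -8, 7);
    }
  }
}
```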
```
@@ -57,7 +57,7 @@ Performance was measured on Samsung Galaxy S22, S24, One Plus 12 and iPhone 15 m
 - For Llama7b, your device may require at least 32GB RAM. If this is a constraint for you, please try the smaller stories model.
 
 ## Step 1: Setup
-1. Follow the [tutorial](https://pytorch.org/executorch/main/getting-started-setup) to set up ExecuTorch
+1. Follow the [tutorial](https://pytorch.org/executorch/main/getting-started-setup) to set up ExecuTorch. For installation run `./install_requirements.sh --pybind xnnpack`
 2. Run `examples/models/llama2/install_requirements.sh` to install a few dependencies.
 
 ## Step 2: Prepare model
```
```
@@ -103,6 +103,16 @@ If you want to deploy and run a smaller model for educational purposes. From `ex
         help="Use PT2E quantization. Comma separated options. e.g. xnnpack_dynamic (for per channel 8 bit weight), xnnpack_dynamic_qc4 (for per channel 4 bit weight), embedding.",
```