Update base for Update on "[ET-VK] Clean up shader library and introduce some new conventions"
## Context
This changeset makes some fairly mechanical improvements to the Vulkan compute graph shader library and introduces some new conventions.
**Note that backwards compatibility with existing shader authoring methods is preserved**.
### Only list `VALUE` in the `.yaml` files
Previously, to generate variants for a combination of values, the YAML file would contain
```
PACKING:
- VALUE: CHANNELS_PACKED
SUFFIX: C_packed
- VALUE: WIDTH_PACKED
SUFFIX: W_packed
- VALUE: HEIGHT_PACKED
SUFFIX: H_packed
```
However, the shader code generation script uses the `VALUE` as the `SUFFIX` if no `SUFFIX` is provided.
Therefore, only the following is needed:
```
PACKING:
- VALUE: C_packed
- VALUE: W_packed
- VALUE: H_packed
```
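For illustration, the fallback behaves as if each variant's suffix were resolved roughly like this (a minimal Python sketch, not the actual generation script; the `variant` dictionary shape is an assumption):
```python
# Hypothetical sketch of the VALUE -> SUFFIX fallback described above;
# the real shader codegen script may structure this differently.
def variant_suffix(variant: dict) -> str:
    # Use an explicit SUFFIX when the YAML entry provides one,
    # otherwise fall back to the VALUE itself.
    return variant.get("SUFFIX", variant["VALUE"])

print(variant_suffix({"VALUE": "CHANNELS_PACKED", "SUFFIX": "C_packed"}))  # C_packed
print(variant_suffix({"VALUE": "C_packed"}))                               # C_packed
```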
### Change indexing utility macros to lowercase
Indexing utility macros are now lowercase, and the packing identifiers have been updated to match the new names used in the YAML files.
The switch to lowercase makes macro invocations read more like function calls (which is how they are typically used), which helps keep the shader code readable.
```
POS_TO_COORD_${PACKING} -> pos_to_coord_${PACKING}
```
### Use a convention of defining macros to reduce Python code block usage
Previously, Python code blocks were used in the GLSL code itself to vary the shader between different settings. However, Python code blocks negatively impact code readability, so this diff introduces a convention of defining macros near the top of the shader to reduce their usage, i.e.
```
#define pos_to_coord pos_to_coord_${PACKING}
#define get_packed_dim get_packed_dim_${PACKING}
#define get_packed_stride get_packed_stride_${PACKING}
```
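With this convention, most per-variant specialization reduces to plain placeholder substitution rather than embedded Python logic in the shader body; a rough sketch of the idea (not the actual codegen implementation):
```python
# Rough illustration only: once the variation is captured in #define lines,
# generating a variant is essentially a string substitution over ${PACKING}.
SHADER_HEADER = """\
#define pos_to_coord pos_to_coord_${PACKING}
#define get_packed_dim get_packed_dim_${PACKING}
#define get_packed_stride get_packed_stride_${PACKING}
"""

def specialize(template: str, packing: str) -> str:
    return template.replace("${PACKING}", packing)

print(specialize(SHADER_HEADER, "C_packed"))
```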
### Improve GLSL type definitions
Previously, the following Python code blocks were used to determine appropriate vectorized and scalar types:
```
${VEC4_T[DTYPE]} texel = ...
${T[DTYPE]} scalar = ...
```
This changeset replaces that with:
```
#define BUF_T ${buffer_scalar_type(DTYPE)}
#define VEC4_T ${texel_type(DTYPE)}
#define SCALAR_T ${texel_component_type(DTYPE)}
layout(set = 0, binding = 1) buffer PRECISION restrict readonly Buffer {
BUF_T data[];
}
buffer_in;
VEC4_T texel = ...
SCALAR_T scalar = ...
```
The main differences are as follows:
* `buffer_scalar_type()` produces the same result as `T[DTYPE]`
* `texel_type()` is not determined from a direct mapping with `DTYPE`; instead, it is derived indirectly from the image format that is associated with the `DTYPE`.
* `texel_component_type()` is based on the result of `texel_type(DTYPE)`
Essentially, the mapping is more in line with what actually happens in the code.
The motivation for this change is FP16 support, and it is a bit subtle. We need a way to distinguish the scalar type used for buffer storage from the scalar type used to store a component of a `vec4` type (hence `BUF_T` vs `SCALAR_T`). This is required because, to support half-precision tensors, the buffer representation will use a 16-bit float type while textures will still extract to `vec4` (i.e. 4x 32-bit floats).
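To make the relationship concrete, the three helpers can be pictured roughly as follows; this is a sketch under assumed mappings (e.g. `half` → `float16_t` buffers but `vec4` texels), not the actual codegen implementation:
```python
# Rough sketch of the type mapping described above; the actual helper
# functions live in the shader codegen scripts and may differ in detail.
def buffer_scalar_type(dtype: str) -> str:
    # Same result as the old T[DTYPE] mapping; half tensors store
    # 16-bit floats in buffers.
    return {"float": "float", "half": "float16_t", "int": "int"}[dtype]

def texel_type(dtype: str) -> str:
    # Determined indirectly from the image format associated with the dtype;
    # both float and half images still extract to a 32-bit vec4.
    return {"float": "vec4", "half": "vec4", "int": "ivec4"}[dtype]

def texel_component_type(dtype: str) -> str:
    # Derived from the result of texel_type() rather than from dtype directly.
    return {"vec4": "float", "ivec4": "int"}[texel_type(dtype)]
```
Under these assumed mappings, a half-precision tensor gets a 16-bit `BUF_T` while `VEC4_T`/`SCALAR_T` remain `vec4`/`float`, which is exactly the distinction the new defines make possible.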
Differential Revision: [D56082461](https://our.internmc.facebook.com/intern/diff/D56082461/)
[ghstack-poisoned]
````diff
+xnnpack_backend) # Provides the XNNPACK CPU acceleration backend
+```
+
+Keep the rest of the code the same. For more details refer to
+[Exporting to ExecuTorch](https://pytorch.org/executorch/main/llm/getting-started.html#step-1-exporting-to-executorch)
+and
+[Invoking the Runtime](https://pytorch.org/executorch/main/llm/getting-started.html#step-2-invoking-the-runtime)
+for more details
 
+At this point, the working directory should contain the following files:
+
+- CMakeLists.txt
+- main.cpp
+- basic_tokenizer.h
+- basic_sampler.h
+- managed_tensor.h
+- export_nanogpt.py
+- model.py
+- vocab.json
+
+If all of these are present, you can now export Xnnpack delegated pte model:
+```bash
+python export_nanogpt.py
 ```
 
-For more information, see the ExecuTorch guides for the [XNNPACK Backend](https://pytorch.org/executorch/stable/tutorial-xnnpack-delegate-lowering.html)
-and [CoreML Backend](https://pytorch.org/executorch/stable/build-run-coreml.html).
+It will generate `nanogpt.pte`, under the same working directory.
````
`examples/demo-apps/android/LlamaDemo/app/src/androidTest/java/com/example/executorchllamademo/PerfTest.java` (+1, -1)
```diff
@@ -28,7 +28,7 @@ public class PerfTest implements LlamaCallback {
```
`examples/models/llama2/README.md` (+1, -1)
```diff
@@ -20,7 +20,7 @@ Please note that the models are subject to the [acceptable use policy](https://g
 Since 7B Llama2 model needs at least 4-bit quantization to fit even within some of the highend phones, results presented here correspond to 4-bit groupwise post-training quantized model.
 
 ## Quantization:
-We employed 4-bit groupwise per token dynamic quantization of all the linear layers of the model. Dynamic quantization refers to quantizating activations dynamically, such that quantization parameters for activations are calculated, from min/max range, at runtime. Here we quantized activations with 8bits (signed integer). Furthermore, weights are statically quantized. In our case weights were per-channel groupwise quantized with 4bit signed integer. For more information refer to this [page](https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html).
+We employed 4-bit groupwise per token dynamic quantization of all the linear layers of the model. Dynamic quantization refers to quantizating activations dynamically, such that quantization parameters for activations are calculated, from min/max range, at runtime. Here we quantized activations with 8bits (signed integer). Furthermore, weights are statically quantized. In our case weights were per-channel groupwise quantized with 4bit signed integer. For more information refer to this [page](https://github.com/pytorch-labs/ao/).
 
 We evaluated WikiText perplexity using [LM Eval](https://github.com/EleutherAI/lm-evaluation-harness). Below are the results for two different groupsizes.
```