
Commit 04e31cc

Update on "[ET-VK] Improve packing format for int4 linear operator + misc improvements"
## Context

Improve performance of the quantized int4 linear shader by packing the scales and zeros tensor, as well as the weight tensor in a more optimal way. See the comments in the `pack_int4_linear_weight_transposed_interleave` shader for more details about how the new packing works.

## Changes

* Split int8 quantized linear and int4 quantized linear into separate C++ files for better code organization
* Introduce packing shader for int4 weights
* Update int4 linear shader to account for packed weights

## Impact

This change massively improves the performance of the weight int4 quantized linear operator. With this change, running LLaMa 3.2 1B can now achieve 10 tok/s, from 0.9 tok/s on an Adreno 740. This is a 10x improvement!

With this change:

```
/home/ssjia/scratch/bin/app_bin: 1 file pushed, 0 skipped. 332.3 MB/s (74692800 bytes in 0.214s)
I 00:00:00.003353 executorch:cpuinfo_utils.cpp:62] Reading file /sys/devices/soc0/image_version
I 00:00:00.003533 executorch:cpuinfo_utils.cpp:78] Failed to open midr file /sys/devices/soc0/image_version
I 00:00:00.003563 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.003685 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu1/regs/identification/midr_el1
I 00:00:00.003747 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu2/regs/identification/midr_el1
I 00:00:00.003799 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu3/regs/identification/midr_el1
I 00:00:00.003852 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu4/regs/identification/midr_el1
I 00:00:00.003902 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu5/regs/identification/midr_el1
I 00:00:00.003976 executorch:main.cpp:69] Resetting threadpool with num threads = 6
I 00:00:00.004289 executorch:runner.cpp:68] Creating LLaMa runner: model_path=/data/local/tmp/llama3-1b/vk/llama3.pte, tokenizer_path=/data/local/tmp/tokenizer.model
I 00:00:04.841690 executorch:runner.cpp:101] Reading metadata from model
I 00:00:04.841808 executorch:runner.cpp:126] Metadata: get_vocab_size = 128256
I 00:00:04.841830 executorch:runner.cpp:126] Metadata: get_bos_id = 128000
I 00:00:04.841851 executorch:runner.cpp:126] Metadata: use_sdpa_with_kv_cache = 1
I 00:00:04.841874 executorch:runner.cpp:126] Metadata: use_kv_cache = 1
I 00:00:04.841893 executorch:runner.cpp:126] Metadata: get_max_context_len = 128
I 00:00:04.841909 executorch:runner.cpp:126] Metadata: get_max_seq_len = 128
I 00:00:04.841927 executorch:runner.cpp:126] Metadata: enable_dynamic_shape = 0
I 00:00:04.841945 executorch:runner.cpp:133] eos_id = 128009
I 00:00:04.841951 executorch:runner.cpp:133] eos_id = 128001
I 00:00:04.841963 executorch:runner.cpp:188] RSS after loading model: 2229.828125 MiB (0 if unsupported)
<|begin_of_text|><|start_header_id|>system<|end_header_id|>Tell me a short story.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
I 00:00:06.239633 executorch:runner.cpp:258] RSS after prompt prefill: 2229.828125 MiB (0 if unsupported)
Here's a short story for you: **The Library of Lost Memories** In a small, dusty town nestled between two great rivers, there was a library that held the secrets of the past. It was a place where memories were stored, not retrieved, and the librarians were the guardians of the past. The library was called the Library of Lost Memories, and it was said that anyone who entered its doors would be given a glimpse into the memories of those who had come before. The librarians were wise and kind, and they would only allow those who were
I 00:00:17.699086 executorch:runner.cpp:272] RSS after finishing text generation: 2229.828125 MiB (0 if unsupported)
I 00:00:17.699155 executorch:stats.h:108] Prompt Tokens: 14 Generated Tokens: 113
I 00:00:17.699161 executorch:stats.h:114] Model Load Time: 4.837000 (seconds)
I 00:00:17.699165 executorch:stats.h:124] Total inference time: 12.857000 (seconds) Rate: 8.788987 (tokens/second)
I 00:00:17.699168 executorch:stats.h:132] Prompt evaluation: 1.398000 (seconds) Rate: 10.014306 (tokens/second)
I 00:00:17.699171 executorch:stats.h:143] Generated 113 tokens: 11.459000 (seconds) Rate: 9.861244 (tokens/second)
I 00:00:17.699174 executorch:stats.h:151] Time to first generated token: 1.398000 (seconds)
I 00:00:17.699177 executorch:stats.h:158] Sampling time over 127 tokens: 549246500.843000 (seconds)
```

Before this change:

```
/home/ssjia/scratch/bin/app_bin: 1 file pushed, 0 skipped. 302.0 MB/s (74637464 bytes in 0.236s)
I 00:00:00.003050 executorch:cpuinfo_utils.cpp:62] Reading file /sys/devices/soc0/image_version
I 00:00:00.003200 executorch:cpuinfo_utils.cpp:78] Failed to open midr file /sys/devices/soc0/image_version
I 00:00:00.003226 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.003337 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu1/regs/identification/midr_el1
I 00:00:00.003396 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu2/regs/identification/midr_el1
I 00:00:00.003449 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu3/regs/identification/midr_el1
I 00:00:00.003502 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu4/regs/identification/midr_el1
I 00:00:00.003553 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu5/regs/identification/midr_el1
I 00:00:00.003629 executorch:main.cpp:69] Resetting threadpool with num threads = 6
I 00:00:00.004075 executorch:runner.cpp:68] Creating LLaMa runner: model_path=/data/local/tmp/llama3-1b/vk/llama3.pte, tokenizer_path=/data/local/tmp/tokenizer.model
I 00:00:05.417531 executorch:runner.cpp:101] Reading metadata from model
I 00:00:05.417647 executorch:runner.cpp:126] Metadata: get_vocab_size = 128256
I 00:00:05.417669 executorch:runner.cpp:126] Metadata: get_bos_id = 128000
I 00:00:05.417698 executorch:runner.cpp:126] Metadata: use_sdpa_with_kv_cache = 1
I 00:00:05.417716 executorch:runner.cpp:126] Metadata: use_kv_cache = 1
I 00:00:05.417735 executorch:runner.cpp:126] Metadata: get_max_context_len = 128
I 00:00:05.417751 executorch:runner.cpp:126] Metadata: get_max_seq_len = 128
I 00:00:05.417768 executorch:runner.cpp:126] Metadata: enable_dynamic_shape = 0
I 00:00:05.417787 executorch:runner.cpp:133] eos_id = 128009
I 00:00:05.417793 executorch:runner.cpp:133] eos_id = 128001
I 00:00:05.417808 executorch:runner.cpp:188] RSS after loading model: 2230.812500 MiB (0 if unsupported)
<|begin_of_text|><|start_header_id|>system<|end_header_id|>Tell me a short story.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
I 00:00:19.689616 executorch:runner.cpp:258] RSS after prompt prefill: 2230.812500 MiB (0 if unsupported)
Here's a short story for you: **The Library of Lost Memories** In a small, dusty town nestled between two great rivers, there was a library that held the secrets of the past. It was a place where memories were stored, not retrieved, and the librarians were the guardians of the past. The library was called the Library of Lost Memories, and it was said that anyone who entered its doors would be given a glimpse into the memories of those who had come before. The librarians were wise and kind, and they would only allow those who were
I 00:02:15.269693 executorch:runner.cpp:272] RSS after finishing text generation: 2230.812500 MiB (0 if unsupported)
I 00:02:15.269810 executorch:stats.h:108] Prompt Tokens: 14 Generated Tokens: 113
I 00:02:15.269825 executorch:stats.h:114] Model Load Time: 5.414000 (seconds)
I 00:02:15.269832 executorch:stats.h:124] Total inference time: 129.852000 (seconds) Rate: 0.870221 (tokens/second)
I 00:02:15.269837 executorch:stats.h:132] Prompt evaluation: 14.271000 (seconds) Rate: 0.981010 (tokens/second)
I 00:02:15.269841 executorch:stats.h:143] Generated 113 tokens: 115.581000 (seconds) Rate: 0.977669 (tokens/second)
I 00:02:15.269844 executorch:stats.h:151] Time to first generated token: 14.271000 (seconds)
I 00:02:15.269847 executorch:stats.h:158] Sampling time over 127 tokens: 549711269.115000 (seconds)
PyTorchObserver {"prompt_tokens":14,"generated_tokens":113,"model_load_start_ms":1743712527974,"model_load_end_ms":1743712533388,"inference_start_ms":1743712533388,"inference_end_ms":1743712663240,"prompt_eval_end_ms":1743712547659,"first_token_ms":1743712547659,"aggregate_sampling_time_ms":549711269115,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
```

Differential Revision: [D72412950](https://our.internmc.facebook.com/intern/diff/D72412950/)

[ghstack-poisoned]
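
The "10x" figure in the Impact section can be cross-checked against the `stats.h` lines of the two logs. The snippet below is only a sanity-check sketch: the token counts and timings are copied from the logs, while the variable and function names are made up for illustration and are not part of ExecuTorch.

```python
# Figures copied from the stats.h lines in the two logs of the commit message.
PROMPT_TOKENS = 14
GENERATED_TOKENS = 113

# Seconds reported with the change ("after") and without it ("before").
after = {"prefill_s": 1.398, "decode_s": 11.459}
before = {"prefill_s": 14.271, "decode_s": 115.581}


def rate(tokens: int, seconds: float) -> float:
    """Tokens per second for one phase of the run."""
    return tokens / seconds


decode_after = rate(GENERATED_TOKENS, after["decode_s"])
decode_before = rate(GENERATED_TOKENS, before["decode_s"])
prefill_after = rate(PROMPT_TOKENS, after["prefill_s"])
prefill_before = rate(PROMPT_TOKENS, before["prefill_s"])

# Speedup is the ratio of the new rate to the old rate for the same phase.
decode_speedup = decode_after / decode_before
prefill_speedup = prefill_after / prefill_before

print(f"decode:  {decode_before:.3f} -> {decode_after:.3f} tok/s ({decode_speedup:.2f}x)")
print(f"prefill: {prefill_before:.3f} -> {prefill_after:.3f} tok/s ({prefill_speedup:.2f}x)")
```

Both phases come out at roughly a tenfold speedup, which agrees with the rates the runner prints itself.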
2 parents 15bac08 + c5eafad commit 04e31cc

39 files changed: +2195 -740 lines changed

.ci/docker/ci_commit_pins/pytorch.txt

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-7ae0ce6360b6e4f944906502d20da24c04debee5
+59d5cf083b4f860dea76fe8936076177f9367f10

backends/arm/test/models/test_conformer.py

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@ class TestConformer(unittest.TestCase):
     # .to_executorch step, i.e. after Arm partitioner.
     ops_after_partitioner = {
         "executorch_exir_dialects_edge__ops_aten_max_default": 1,
-        "torch.ops.aten._assert_scalar.default": 10,
+        "torch.ops.aten._assert_scalar.default": 7,
         "torch.ops.aten._local_scalar_dense.default": 1,
     }

backends/arm/test/models/test_llama.py

Lines changed: 2 additions & 1 deletion
@@ -11,6 +11,7 @@
 import sys
 import unittest

+import pytest
 import torch

 from executorch.backends.arm.test import common, conftest
@@ -102,7 +103,7 @@ def test_llama_tosa_MI(self):
         llama_model, llama_inputs, llama_meta = self.prepare_model()

         if llama_model is None and llama_inputs is None and llama_meta is None:
-            return
+            pytest.skip("Missing model and/or input files")

         with torch.no_grad():
             (
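
The switch from a bare `return` to `pytest.skip(...)` changes how a run without model files is reported: an early return is counted as a pass, whereas a skip is recorded as skipped. The sketch below illustrates that difference with the standard library's `unittest` runner and `skipTest` as a stand-in for `pytest.skip`; the class and helper names are hypothetical and not taken from the test suite.

```python
import unittest


class EarlyReturn(unittest.TestCase):
    def test_missing_inputs(self):
        model = None  # stands in for a model file that is not available
        if model is None:
            return  # the runner records this test as a plain success


class ExplicitSkip(unittest.TestCase):
    def test_missing_inputs(self):
        model = None  # same situation as above
        if model is None:
            # stdlib analogue of pytest.skip(...): the test is reported as skipped
            self.skipTest("Missing model and/or input files")


def run(case: type) -> unittest.TestResult:
    """Run every test in one TestCase class and return the collected result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    result = unittest.TestResult()
    suite.run(result)
    return result


early = run(EarlyReturn)
skipped = run(ExplicitSkip)
print("early return:", early.testsRun, "run,", len(early.skipped), "skipped")
print("explicit skip:", skipped.testsRun, "run,", len(skipped.skipped), "skipped")
```

With the explicit skip, a CI summary shows that the test body never executed instead of suggesting that it passed.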

backends/xnnpack/operators/op_slice_copy.py

Lines changed: 3 additions & 1 deletion
@@ -69,7 +69,9 @@ def define_node(
             output_shape = [output_shape[i] for i in PERM_NCHW_TO_NHWC]
             dim_of_slice = PERM_NHWC_TO_NCHW[dim_of_slice]

-        slice_begin_index = cast(int, node.args[2])
+        slice_begin_index = 0
+        if len(node.args) > 2 and node.args[2]:
+            slice_begin_index = cast(int, node.args[2])
         if slice_begin_index < 0:
             slice_begin_index = input_shape[dim_of_slice] + slice_begin_index
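
The guarded lookup in this hunk amounts to a small normalization step: a missing or `None` start falls back to 0, and a negative start is wrapped relative to the size of the sliced dimension. A minimal standalone sketch of that logic follows; `resolve_slice_begin` and its parameters are hypothetical names, and a plain tuple stands in for `node.args`.

```python
from typing import Sequence


def resolve_slice_begin(args: Sequence[object], dim_size: int) -> int:
    """Return a non-negative begin index for a slice over a dim of dim_size.

    args mimics node.args: (input, dim, start, end, ...), where start may be
    absent or None.
    """
    begin = 0
    # Only read the start argument when it exists and is neither None nor 0.
    if len(args) > 2 and args[2]:
        begin = int(args[2])
    # Negative indices count from the end of the dimension.
    if begin < 0:
        begin = dim_size + begin
    return begin


print(resolve_slice_begin(("x", 0), 5))           # start omitted
print(resolve_slice_begin(("x", 0, None, 2), 5))  # start is None
print(resolve_slice_begin(("x", 0, -2), 5))       # negative start
```

The second call mirrors the new test case, which slices with `torch.ops.aten.slice.Tensor(x, 0, None, 2)`.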

backends/xnnpack/test/ops/test_slice_copy.py

Lines changed: 12 additions & 0 deletions
@@ -69,6 +69,18 @@ def forward(self, x):
         # Note that two of the slices are optimized away as they are identity.
         self._test_slice_copy(ConvSlice(), inputs, 4, 2)

+    def test_fp32_slice_copy_default_start(self):
+        """
+        XNNPACK supports default start in slice op.
+        """
+
+        class Slice(torch.nn.Module):
+            def forward(self, x):
+                return torch.ops.aten.slice.Tensor(x, 0, None, 2)
+
+        inputs = (torch.randn(5, 5),)
+        self._test_slice_copy(Slice(), inputs, 1, 1)
+
     def test_fp32_slice_copy_stride_non_1(self):
         """
         XNNPACK does not support strided slicing.

devtools/etdump/etdump_filter.cpp

Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
+/*
+ * Copyright (c) Meta Platforms, Inc. and affiliates.
+ * All rights reserved.
+ *
+ * This source code is licensed under the BSD-style license found in the
+ * LICENSE file in the root directory of this source tree.
+ */
+
+#include <executorch/devtools/etdump/etdump_filter.h>
+
+#include <executorch/runtime/core/error.h>
+
+using ::executorch::runtime::DelegateDebugIntId;
+using ::executorch::runtime::Error;
+using ::executorch::runtime::kUnsetDelegateDebugIntId;
+
+namespace executorch {
+namespace etdump {
+
+ETDumpFilter::ETDumpFilter() = default;
+
+Result<bool> ETDumpFilter::add_regex(string_view pattern) {
+  auto regex = std::make_unique<re2::RE2>(pattern.data());
+  if (!regex->ok()) {
+    return Error::InvalidArgument; // Error during regex compilation
+  }
+  regex_patterns_.emplace_back(std::move(regex));
+  return true;
+}
+
+Result<bool> ETDumpFilter::set_debug_handle_range(size_t start, size_t end) {
+  if (start >= end) {
+    return Error::InvalidArgument; // Start is greater than end
+  }
+  if (start < 0 || end < 0) {
+    return Error::InvalidArgument; // Start or end is negative
+  }
+  range_start_ = start;
+  range_end_ = end;
+  return true;
+}
+
+Result<bool> ETDumpFilter::filter_name_(const char* name) {
+  if (name == nullptr) {
+    return Error::InvalidArgument;
+  }
+  if (regex_patterns_.empty()) {
+    return true;
+  }
+  for (const auto& regex : regex_patterns_) {
+    if (RE2::FullMatch(name, *regex)) {
+      return true;
+    }
+  }
+  return false;
+}
+
+Result<bool> ETDumpFilter::filter_delegate_debug_index_(
+    DelegateDebugIntId debug_handle) {
+  if (debug_handle == kUnsetDelegateDebugIntId) {
+    return Error::InvalidArgument; // Delegate debug index is unset
+  }
+
+  if (range_start_ == 0 && range_end_ == 0) {
+    return true;
+  }
+
+  if (debug_handle < range_start_ || debug_handle >= range_end_) {
+    return false;
+  }
+
+  return true;
+}
+
+Result<bool> ETDumpFilter::filter(
+    const char* name,
+    DelegateDebugIntId delegate_debug_index) {
+  if ((name == nullptr) == (delegate_debug_index == kUnsetDelegateDebugIntId)) {
+    return Error::InvalidArgument; // Name and delegate debug index should be
+                                   // both set or unset
+  }
+
+  if (name) {
+    return filter_name_(name);
+  } else {
+    return filter_delegate_debug_index_(delegate_debug_index);
+  }
+}
+
+size_t ETDumpFilter::get_n_regex() const {
+  return regex_patterns_.size();
+}
+
+} // namespace etdump
+} // namespace executorch
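
The decision logic of `ETDumpFilter::filter` can be summarized as: exactly one of `name` and the delegate debug index must be supplied; names pass if no pattern is registered or if any pattern matches the whole name; debug indices pass if no range is set or if they fall in the half-open range `[start, end)`. The following is a rough Python model of that behavior, not the C++ API. `FilterModel` and `UNSET` are hypothetical stand-ins, exceptions replace the `Error::InvalidArgument` results, and Python's `re` syntax differs from RE2 in places.

```python
import re
from typing import List, Optional

UNSET = -1  # hypothetical sentinel standing in for kUnsetDelegateDebugIntId


class FilterModel:
    """Illustrative model of the name / debug-index filtering rules."""

    def __init__(self) -> None:
        self.patterns: List[re.Pattern] = []
        self.range_start = 0
        self.range_end = 0

    def add_regex(self, pattern: str) -> None:
        # re.compile raises re.error for an invalid pattern.
        self.patterns.append(re.compile(pattern))

    def set_debug_handle_range(self, start: int, end: int) -> None:
        if start < 0 or start >= end:
            raise ValueError("range must satisfy 0 <= start < end")
        self.range_start, self.range_end = start, end

    def filter(self, name: Optional[str], debug_index: int = UNSET) -> bool:
        # Reject the call when both or neither of the two keys are provided.
        if (name is None) == (debug_index == UNSET):
            raise ValueError("pass exactly one of name or debug_index")
        if name is not None:
            if not self.patterns:
                return True  # no patterns registered: everything passes
            # fullmatch mirrors RE2::FullMatch: the whole name must match.
            return any(p.fullmatch(name) for p in self.patterns)
        if self.range_start == 0 and self.range_end == 0:
            return True  # no range set: everything passes
        return self.range_start <= debug_index < self.range_end


f = FilterModel()
f.add_regex("conv.*")
f.set_debug_handle_range(2, 5)
print(f.filter("conv1"), f.filter("my_conv1"))  # whole-name matching
print(f.filter(None, 2), f.filter(None, 5))     # half-open range
```

Because matching is on the full name, a pattern such as `conv.*` does not accept `my_conv1`; a caller who wants substring behavior would need to write `.*conv.*` explicitly.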

devtools/etdump/etdump_filter.h

Lines changed: 102 additions & 0 deletions
@@ -0,0 +1,102 @@
+/*
+ * Copyright (c) Meta Platforms, Inc. and affiliates.
+ * All rights reserved.
+ *
+ * This source code is licensed under the BSD-style license found in the
+ * LICENSE file in the root directory of this source tree.
+ */
+
+#pragma once
+
+#include <re2/re2.h>
+#include <memory>
+
+#include <executorch/runtime/core/event_tracer.h>
+#include <executorch/runtime/core/result.h>
+#include <executorch/runtime/platform/platform.h>
+
+namespace executorch::etdump {
+
+using ::executorch::aten::string_view;
+using ::executorch::runtime::Result;
+
+/**
+ * ETDumpFilter is a class that filters intermediate output based on output's
+ * name by full regex filtering, or delegate debug indices by range-based
+ * filtering.
+ *
+ * Note that this filter supports up to MAX_REGEX_PATTERNS regex patterns with a
+ * maximum length of MAX_PATTERN_LENGTH characters each.
+ */
+
+class ETDumpFilter : public ::executorch::runtime::EventTracerFilterBase {
+ public:
+  ETDumpFilter();
+  ~ETDumpFilter() override = default;
+  /**
+   * Adds a regex pattern to the filter.
+   *
+   * @param[in] pattern A c string representing the regex pattern to be added.
+   *
+   * @return A Result<bool> indicating the success or failure of adding the
+   * regex pattern.
+   * - True if the pattern is successfully added.
+   * - False if the pattern could not be added or if the maximum number
+   * of patterns is exceeded.
+   * - An error code if number of pattern has reached to cap, or any
+   * error occurs during regex compilation.
+   */
+  Result<bool> add_regex(string_view pattern);
+  /**
+   * Sets the range for the delegate debug index filtering as [start, end).
+   * Note that this function will flush the existing range.
+   *
+   * @param[in] start The start of the range for filtering.
+   * @param[in] end The end of the range for filtering.
+   *
+   * @return A Result<bool> indicating the success or failure of setting the
+   * range.
+   * - True if the range is successfully set.
+   * - An error code if an error occurs.
+   */
+  Result<bool> set_debug_handle_range(size_t start, size_t end);
+
+  /**
+   * Filters events based on the given name or delegate debug index.
+   *
+   * Note that everytime only one of either the name or delegate_debug_index
+   * should be passed in.
+   *
+   * @param[in] name A pointer to a string representing the `name` of the
+   * event. If `delegate_debug_index` is not set to kUnsetDebugHandle, `name`
+   * should be set to nullptr.
+   *
+   * @param[in] delegate_debug_index A DebugHandle representing the debug index
+   * of the delegate. If `name` is not nullptr, this should be set to
+   * kUnsetDebugHandle.
+   *
+   * @return A Result<bool> indicating whether the event matches the filter
+   * criteria.
+   * - True if the event matches the filter, or filter is unset.
+   * - False if the event does not match or is unknown.
+   * - An error code if an error occurs during filtering.
+   */
+  Result<bool> filter(
+      const char* name,
+      ::executorch::runtime::DelegateDebugIntId delegate_debug_index) override;
+
+  /**
+   * Returns the number of regex patterns in the filter.
+   */
+  size_t get_n_regex() const;
+
+ private:
+  std::vector<std::unique_ptr<re2::RE2>> regex_patterns_;
+  size_t range_start_ = 0;
+  size_t range_end_ = 0;
+  Result<bool> filter_name_(const char* name);
+  Result<bool> filter_delegate_debug_index_(
+      ::executorch::runtime::DelegateDebugIntId delegate_debug_index);
+};
+
+} // namespace executorch::etdump

devtools/etdump/etdump_flatcc.cpp

Lines changed: 2 additions & 0 deletions
@@ -15,6 +15,7 @@
 #include <executorch/devtools/etdump/etdump_schema_flatcc_builder.h>
 #include <executorch/devtools/etdump/etdump_schema_flatcc_reader.h>
 #include <executorch/devtools/etdump/utils.h>
+#include <executorch/runtime/core/error.h>
 #include <executorch/runtime/core/exec_aten/exec_aten.h>
 #include <executorch/runtime/core/exec_aten/util/scalar_type_util.h>
 #include <executorch/runtime/platform/assert.h>
@@ -28,6 +29,7 @@ using ::executorch::runtime::ChainID;
 using ::executorch::runtime::DebugHandle;
 using ::executorch::runtime::DelegateDebugIdType;
 using ::executorch::runtime::DelegateDebugIntId;
+using ::executorch::runtime::Error;
 using ::executorch::runtime::EValue;
 using ::executorch::runtime::EventTracerEntry;
 using ::executorch::runtime::kUnsetDelegateDebugIntId;

devtools/etdump/etdump_flatcc.h

Lines changed: 0 additions & 1 deletion
@@ -9,7 +9,6 @@
 #pragma once

 #include <cstdint>
-#include <memory>

 #include <executorch/devtools/etdump/data_sinks/buffer_data_sink.h>
 #include <executorch/devtools/etdump/data_sinks/data_sink_base.h>

devtools/etdump/targets.bzl

Lines changed: 21 additions & 0 deletions
@@ -101,6 +101,27 @@ def define_common_targets():
     for aten_mode in get_aten_mode_options():
         aten_suffix = "_aten" if aten_mode else ""

+        runtime.cxx_library(
+            name = "etdump_filter" + aten_suffix,
+            srcs = [
+                "etdump_filter.cpp",
+            ],
+            exported_headers = [
+                "etdump_filter.h",
+            ],
+            deps = [
+                "//executorch/runtime/platform:platform",
+            ],
+            exported_deps = [
+                "fbsource//third-party/re2:re2",
+                "//executorch/runtime/core:event_tracer" + aten_suffix,
+            ],
+            visibility = [
+                "//executorch/...",
+                "@EXECUTORCH_CLIENTS",
+            ],
+        )
+
         runtime.cxx_library(
             name = "etdump_flatcc" + aten_suffix,
             srcs = [
