Commit 277d695

Add the figure to readme and fix unused variable
1 parent db52e05 commit 277d695

3 files changed (+7 −5 lines)

examples/qualcomm/oss_scripts/llama/README.md

Lines changed: 6 additions & 0 deletions

@@ -12,6 +12,12 @@ KV Cache Mode: In KV Cache mode, the model takes in a single previous token and
 
 Hybrid Mode: Hybrid mode leverages the strengths of both AR-N model and KV cache modes to optimize token generation speed. Initially, it uses AR-N model to efficiently generate the prompt's key-value (KV) cache. Then, the mode switches to KV cache mode, which excels at generating subsequent tokens.
 - AR-N model: The auto-regression (AR) length determines the number of tokens to consume and the number of logits to produce. Use it to process the prompt and generate the key-value (kv) cache, which serves as a prompt processor in hybrid mode.
+- Prompt processing with AR-N model:
+<figure>
+<img src="./assets/PromptProcessingWithARN.png" alt="Prompt Processing With AR-N Model">
+<figcaption>Prompt processing is done using a for-loop. An N-token block is taken, and the KV cache is updated for that block. This process is repeated until all tokens are consumed, with the last block potentially requiring padding. For flexibility, the AR-N model can handle any input length less than the maximum sequence length. For TTFT, the input length (or number of blocks) will vary depending on the actual input length, rather than always being the same.
+</figcaption>
+</figure>
 
 
 ## Instructions
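
The figcaption added above describes the AR-N prompt-processing loop: consume the prompt in fixed-size blocks, update the KV cache per block, and pad the final block if it comes up short. Below is a minimal sketch of that loop, with a hypothetical run_ar_n_block callback standing in for the model's forward pass; ar_len, pad_token, and the callback are illustrative assumptions, not this repo's API.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for one AR-N forward pass: consumes a block of
// exactly `ar_len` tokens and updates the KV cache for that block.
using RunBlockFn = std::function<void(const std::vector<int64_t>& block)>;

// Consume the prompt in fixed-size blocks of `ar_len` tokens (ar_len > 0),
// padding the final block when the prompt length is not a multiple of it.
void process_prompt(
    const std::vector<int64_t>& prompt_tokens,
    size_t ar_len,
    int64_t pad_token,
    const RunBlockFn& run_ar_n_block) {
  for (size_t start = 0; start < prompt_tokens.size(); start += ar_len) {
    const size_t end = std::min(start + ar_len, prompt_tokens.size());
    std::vector<int64_t> block(
        prompt_tokens.begin() + start, prompt_tokens.begin() + end);
    block.resize(ar_len, pad_token); // pad the last (possibly short) block
    run_ar_n_block(block); // one forward pass fills this block's KV cache
  }
}
```

The number of loop iterations, and hence TTFT, scales with the actual prompt length rather than the maximum sequence length, which is the flexibility the caption calls out.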

examples/qualcomm/oss_scripts/llama/runner/io_manager.cpp

Lines changed: 1 addition & 5 deletions

@@ -571,7 +571,6 @@ void ShiftPointerIoMgr::update_prefill_io(
     std::vector<std::vector<Tensor>>& output_tensors) {
   (void)cur_token;
   (void)output_tensors;
-  IO* ptr = static_cast<IO*>(data_ptr_.get());
 
   if (!is_bert_) {
     // update v_cache
@@ -1041,7 +1040,6 @@ void SmartMaskIoMgr::update_kv_io(
     int64_t pos,
     std::vector<std::vector<Tensor>>& output_tensors) {
   IO* ptr = static_cast<IO*>(data_ptr_.get());
-  size_t cache_len = std::max(kv_cache_len_, prefill_cache_len_);
   // update input_tok
   *ptr->kv_input_toks =
       use_int64_token_ ? cur_token : static_cast<int32_t>(cur_token);
@@ -1065,7 +1063,7 @@
   for (int i = 0; i < k_cache_in.size(); ++i) {
     uint8_t* ptr_in = k_cache_in[i]->mutable_data<uint8_t>() + pos;
     const uint8_t* ptr_out = k_cache_out[i]->data<uint8_t>();
-    for (size_t j = 0, offset = 0; j < head_dim_; ++j, offset += cache_len) {
+    for (size_t j = 0, offset = 0; j < head_dim_; ++j, offset += kv_cache_len_) {
       ptr_in[offset] = ptr_out[j];
     }
   }
@@ -1086,7 +1084,6 @@ void SmartMaskIoMgr::prepare_prefill_io(
   IO* ptr = static_cast<IO*>(data_ptr_.get());
   std::unordered_map<std::string, size_t> io_bytes_map = get_io_bytes();
 
-  int32_t cache_len = methods_meta[0]->input_tensor_meta(0)->sizes()[1];
   // [I]: pre_input_tokens
   Result<TensorInfo> prefill_input_toks = methods_meta[0]->input_tensor_meta(0);
   prefill_input_toks_ = std::make_unique<TensorImpl>(
@@ -1303,7 +1300,6 @@ void SmartMaskIoMgr::update_prefill_io(
     int64_t pos,
     std::vector<std::vector<Tensor>>& output_tensors) {
   (void)output_tensors;
-  IO* ptr = static_cast<IO*>(data_ptr_.get());
 
   if (!is_bert_) {
     // update v_cache
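
The cache_len change in update_kv_io is easier to read against the buffer layout it implies: each head's K cache behaves as a row-major [head_dim, rows] buffer, and the new token's key is written as one column at the current position, so the write stride must equal the buffer's actual row length (kv_cache_len_ after the fix) rather than std::max(kv_cache_len_, prefill_cache_len_). A simplified sketch of that column write follows; write_k_column and its parameters are illustrative, not this repo's API.

```cpp
#include <cstddef>
#include <cstdint>

// Simplified view of the SmartMask K-cache update: `cache` is a row-major
// [head_dim x cache_rows] buffer, and the new token's key vector is written
// as a single column at index `pos`.
void write_k_column(
    uint8_t* cache, // base of the [head_dim x cache_rows] buffer
    size_t cache_rows, // allocated row length (kv_cache_len_ in the patch)
    size_t head_dim,
    size_t pos, // column to write: the current decode position
    const uint8_t* key) { // new key vector, head_dim entries
  uint8_t* column = cache + pos;
  for (size_t j = 0, offset = 0; j < head_dim; ++j, offset += cache_rows) {
    column[offset] = key[j]; // one write per row; stride = true row length
  }
}
```

If the stride passed here differed from the buffer's real row length, every row after the first would land at the wrong offset whenever kv_cache_len_ and prefill_cache_len_ differ, which is what the one-line fix in update_kv_io addresses.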
