You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: examples/qualcomm/oss_scripts/llama/README.md
+6Lines changed: 6 additions & 0 deletions
Original file line number
Diff line number
Diff line change
@@ -12,6 +12,12 @@ KV Cache Mode: In KV Cache mode, the model takes in a single previous token and
12
12
13
13
Hybrid Mode: Hybrid mode leverages the strengths of both AR-N model and KV cache modes to optimize token generation speed. Initially, it uses AR-N model to efficiently generate the prompt's key-value (KV) cache. Then, the mode switches to KV cache mode, which excels at generating subsequent tokens.
14
14
- AR-N model: The auto-regression (AR) length determines the number of tokens to consume and the number of logits to produce. Use it to process the prompt and generate the key-value (kv) cache, which serves as a prompt processor in hybrid mode.
15
+
- Prompt processing with AR-N model:
16
+
<figure>
17
+
<imgsrc="./assets/PromptProcessingWithARN.png"alt="Prompt Processing With AR-N Model">
18
+
<figcaption>Prompt processing is done using a for-loop. An N-token block is taken, and the KV cache is updated for that block. This process is repeated until all tokens are consumed, with the last block potentially requiring padding. For flexibility, the AR-N model can handle any input length less than the maximum sequence length. For TTFT, the input length (or number of blocks) will vary depending on the actual input length, rather than always being the same.
0 commit comments