Summary: By default, the llama runner keeps generating until it reaches max_seq_len tokens, a limit embedded in the model metadata. This change adds a way to cap the number of tokens generated (a sketch of the intended capping behavior follows below).
Reviewed By: larryliu0820
Differential Revision: D53873431
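
For illustration only, here is a minimal sketch of how such a cap is typically enforced in a generation loop. It is not the actual runner code: the helper names `effective_seq_len` and `generate_sketch` are hypothetical, and it only assumes that `max_seq_len` comes from the model metadata while `seq_len` comes from the new flag.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of the runner): bound the total number of
// tokens (prompt + generated) by both the user-provided seq_len and the
// model's max_seq_len from its metadata.
int32_t effective_seq_len(int32_t seq_len, int32_t max_seq_len) {
  return std::min(seq_len, max_seq_len);
}

// Sketch of a generation loop that stops once the cap is reached.
void generate_sketch(
    std::vector<uint64_t>& tokens, // starts out holding the prompt tokens
    int32_t seq_len,
    int32_t max_seq_len) {
  const int32_t limit = effective_seq_len(seq_len, max_seq_len);
  while (static_cast<int32_t>(tokens.size()) < limit) {
    uint64_t next = 0; // placeholder: sample the next token from the model here
    tokens.push_back(next);
    // A real runner would also stop early on an end-of-sequence token.
  }
}
```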
examples/models/llama2/main.cpp (8 additions, 1 deletion)
@@ -24,6 +24,11 @@ DEFINE_double(
     0.8f,
     "Temperature; Default is 0.8f. 0 = greedy argmax sampling (deterministic). Lower temperature = more deterministic");
 
+DEFINE_int32(
+    seq_len,
+    128,
+    "Total number of tokens to generate (prompt + output). Defaults to max_seq_len. If the number of input tokens + seq_len > max_seq_len, the output will be truncated to max_seq_len tokens.");