fix: fix cuda graph max batch size for spec decoding cases. #5076
base: main
Thanks for fixing! This should unblock perf testing on trunk with trtllm-bench.
Signed-off-by: Fanrong Li <[email protected]>
Description
This PR fixes the calculation of `max_cuda_graph_bs`. Before this fix, if we set a small `max_num_tokens`, then `max_cuda_graph_bs` would be set equal to `max_num_tokens`. But when running models with speculative decoding, the real input length, `(1 + max_draft_len) * max_cuda_graph_bs`, would exceed `max_num_tokens`.
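The fix described above amounts to capping the CUDA-graph batch size by the token budget. Here is a minimal, hypothetical sketch of that clamp; the names `max_num_tokens`, `max_draft_len`, and `max_cuda_graph_bs` follow the PR description, but the function itself is illustrative and not the PR's actual code:

```python
def compute_max_cuda_graph_bs(max_num_tokens: int,
                              max_batch_size: int,
                              max_draft_len: int = 0) -> int:
    # With spec decoding, each request contributes (1 + max_draft_len) tokens
    # per step, so the CUDA-graph batch size must be scaled down so that
    # (1 + max_draft_len) * max_cuda_graph_bs never exceeds max_num_tokens.
    tokens_per_request = 1 + max_draft_len
    return min(max_batch_size, max_num_tokens // tokens_per_request)

# Without spec decoding (max_draft_len=0) the cap is max_num_tokens itself;
# with max_num_tokens=64 and max_draft_len=3 it drops to 64 // 4 = 16.
```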