
sampling : avoid expensive softmax during greedy sampling #9605


Merged
4 commits merged into master on Sep 24, 2024

Conversation

ggerganov (Member)

fix #9530

When the temperature is non-positive, we can simply greedily sample the token with the highest logit. But in some cases the probabilities of the secondary tokens are also required (e.g. llama-server to display candidate probs, llama-speculative to perform stochastic speculative sampling). In such cases, we first filter the top sparams.n_probs tokens via a top-k sampler and then apply softmax only to them, which avoids sorting the full vocabulary.
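For illustration, here is a minimal C++ sketch of the two paths described above. This is not the actual llama.cpp sampler code; the names `token_prob`, `sample_greedy`, and `top_k_softmax` are made up for this example:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct token_prob {
    int   id;    // token id
    float logit; // raw logit from the model
    float prob;  // normalized probability (filled in by top_k_softmax)
};

// Greedy path: with temperature <= 0 only the argmax matters,
// so a single linear scan suffices and no softmax is computed.
static int sample_greedy(const std::vector<float> & logits) {
    return (int) std::distance(logits.begin(),
            std::max_element(logits.begin(), logits.end()));
}

// When the top n_probs candidates are still needed: partially sort only the
// k best logits to the front, then apply softmax over just those k entries,
// instead of sorting and normalizing the full vocabulary.
static std::vector<token_prob> top_k_softmax(const std::vector<float> & logits, size_t k) {
    std::vector<token_prob> cand(logits.size());
    for (size_t i = 0; i < logits.size(); ++i) {
        cand[i] = { (int) i, logits[i], 0.0f };
    }
    k = std::min(k, cand.size());
    // partial sort: only the k largest logits end up ordered at the front
    std::partial_sort(cand.begin(), cand.begin() + k, cand.end(),
            [](const token_prob & a, const token_prob & b) { return a.logit > b.logit; });
    cand.resize(k);
    if (cand.empty()) {
        return cand;
    }
    const float max_logit = cand[0].logit; // subtract max for numerical stability
    float sum = 0.0f;
    for (auto & c : cand) {
        c.prob = std::exp(c.logit - max_logit);
        sum += c.prob;
    }
    for (auto & c : cand) {
        c.prob /= sum;
    }
    return cand;
}
```

The greedy pick stays a single O(n) scan over the vocabulary, while the candidate-probs path drops from a full O(n log n) sort plus O(n) softmax to an O(n log k) partial sort plus an O(k) softmax.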

Also add perf timings to test-sampling to keep track of the performance of the samplers.
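As a rough idea of what such a timing harness can look like (a hypothetical sketch, not the code actually added to `tests/test-sampling.cpp`):

```cpp
#include <chrono>
#include <cstdio>

// Average wall-clock microseconds per call of an arbitrary callable.
template <typename F>
static double avg_time_us(F && fn, int n_iter) {
    const auto t_start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < n_iter; ++i) {
        fn();
    }
    const auto t_end = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::micro>(t_end - t_start).count() / n_iter;
}

int main() {
    volatile float sink = 0.0f; // prevent the loop from being optimized away
    const double us = avg_time_us([&] {
        // stand-in workload; in the real test this would be a sampler invocation
        for (int i = 0; i < 1024; ++i) {
            sink = sink + 1.0f;
        }
    }, 1000);
    printf("workload: %.3f us/iter\n", us);
    return 0;
}
```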

@github-actions bot added the testing (Everything test related) and examples labels on Sep 23, 2024
@ggerganov merged commit b0f2736 into master on Sep 24, 2024
1 check passed
@ggerganov deleted the gg/sampling-faster-greedy branch on September 24, 2024 at 06:03
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
sampling : avoid expensive softmax during greedy sampling (#9605)

* sampling : avoid expensive softmax during greedy sampling

ggml-ci

* speculative : fix default RNG seed + set sparams.n_probs

* Update tests/test-sampling.cpp

Co-authored-by: slaren <[email protected]>

* sampling : add clarifying comment [no ci]

---------

Co-authored-by: slaren <[email protected]>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
Labels
examples, testing (Everything test related)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug: Lower performance in pre-built binary llama-server, Since llama-b3681-bin-win-cuda-cu12.2.0-x64
2 participants