Support calling mlock() on loaded model data on Linux and macOS #453

Merged
merged 3 commits on Mar 24, 2023

Conversation

comex (Contributor) commented Mar 24, 2023

This is enabled by a new --mlock command line option.

Using mlock() disables swapping and memory compression for the model data. Doing so can be useful on systems where the model takes up a large fraction of system RAM. In my experience, macOS is quite eager to start compressing llama.cpp's memory, which then makes it halt for a few seconds while it decompresses, even with a model that uses "only" 25GB out of 32GB.

Of course, this comes at the cost of forcing the system to swap or compress other processes' memory instead, so it needs to be used with care and shouldn't be enabled by default.

In theory it should be possible to support this on Windows as well using VirtualLock(), but I'm not much of a Windows user.
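For reference, a minimal sketch of the general pattern being described (this is not the actual llama.cpp loading code; the mmap-based loading and file handling here are illustrative only):

```c
// Illustrative only: map a model file read-only, then try to pin it in RAM
// with mlock() so the pages can be neither swapped out nor compressed.
// On failure (e.g. ENOMEM or EPERM due to RLIMIT_MEMLOCK), fall back to an
// unlocked mapping instead of aborting.
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <model-file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
    size_t size = (size_t)st.st_size;

    void *addr = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); return 1; }

    // The interesting part: lock the whole mapping in physical memory.
    if (mlock(addr, size) != 0) {
        perror("warning: mlock failed, continuing without locking");
    }

    // ... use the model data at `addr` ...

    munlock(addr, size);   // optional: munmap/exit releases the lock anyway
    munmap(addr, size);
    close(fd);
    return 0;
}
```

On Windows, VirtualLock() plays roughly the same role, but the limits and failure modes differ.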

comex and others added 2 commits March 23, 2023 20:08
ggerganov merged commit 563cdc3 into ggml-org:master on Mar 24, 2023
jon-chuang (Contributor) commented Apr 26, 2023

Just curious: if I load two models that are mlocked, such that their total memory exceeds my system memory, what would the behaviour be? Would this be an OOM?

Also, what is the cleanup behaviour? If llama.cpp exits, will there be an munlock()? What if my program exits prematurely, e.g. via Ctrl-C?
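
For what it's worth, per the mlock(2) man page (this is general POSIX behaviour, not specific to how llama.cpp handles it): a lock request that exceeds RLIMIT_MEMLOCK or what the kernel is willing to pin normally fails with ENOMEM rather than silently succeeding, and all of a process's memory locks are released automatically when the process terminates, including on an unclean exit such as Ctrl-C, so an explicit munlock() is not strictly required. A small sketch of the error-handling pattern (the buffer size and setup are illustrative only):

```c
// Illustrative only: attempt to lock an anonymous mapping and report
// failure instead of crashing. Exceeding the lock limit typically yields
// -1 with errno == ENOMEM (or EPERM for an unprivileged process); the
// process is not killed just for asking.
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    size_t len = 16u * 1024 * 1024;   // 16 MiB demo buffer
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    if (mlock(buf, len) != 0) {
        fprintf(stderr, "mlock failed: %s\n", strerror(errno));
    } else {
        printf("locked %zu bytes\n", len);
        munlock(buf, len);            // optional: exit releases all locks anyway
    }
    munmap(buf, len);
    return 0;
}
```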
